Doing meaningful work with Machine Learning — Classify Disaster Messages

When a disaster strikes, millions of messages are sent and tweeted to report on it.

However, disasters are handled by different organizations.

Food provision might be offered by one organization, while putting out fires would be handled by another.

Hence, the utility of this application is to categorize these messages into various types, so that organizations can understand what kind of aid a specific disaster calls for.

Project Structure

There are three parts to the project:

ETL Pipeline

Extract, transform and load the data.

This is concerned with processing the data.

Namely, I loaded, merged and cleaned the messages and categories datasets.

I stored the result in an SQLite database so that the model can use it for training in the next step.

ML Pipeline

The machine learning pipeline is concerned with training the model and testing it.

The pipeline includes a text processing step, since it deals with text sources, as mentioned in the beginning.

I also used GridSearchCV to tune the model further, and saved the tuned model as a pickle file (shown at the end of the ML pipeline section below).

Flask Web App

run.py, process_data and train_classifier are basically the ETL pipeline and the ML pipeline wired into the terminal workspace to make the app work.

ETL Pipeline

In the first part of the project, my goal is to extract the data I need and make the necessary transformations, so that I can use it later when building the model.

Once I had a look at the two datasets I need, categories and messages, I merged them using the common id.

# merge datasets
df = messages.merge(categories, on=['id'])
df.head()

I then split the categories into separate category columns, and gave each column its own category name.

# create a dataframe of the 36 individual category columns
categories = df['categories'].str.split(';', expand=True)
row = categories.head(1)
category_colnames = row.applymap(lambda x: x[:-2]).iloc[0, :].tolist()
# rename the columns of `categories`
categories.columns = category_colnames
categories.head()

Each raw value looks like 'related-1' or 'request-0', so dropping the last two characters of a value leaves just the category name. Because models use numbers as inputs, I then converted the category values to just 0 or 1.

for column in categories:
    # set each value to be the last character of the string
    categories[column] = categories[column].astype(str).str[-1]
    # convert column from string to numeric
    categories[column] = categories[column].astype(int)

categories.head()

After transforming the category columns, I made the matching changes to the dataframe.

I replaced the original categories column with the new category columns.

# drop the original categories column from `df`
df.drop('categories', axis=1, inplace=True)

# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head()

After checking for duplicates in my data, I got rid of them.

# check number of duplicates
df[df.duplicated()].shape
# (170, 40)

# drop duplicates
df.drop_duplicates(inplace=True)

# check number of duplicates again
df[df.duplicated()].count()

I finally saved the clean dataset into an SQLite database.

# Save the clean dataset into an sqlite database.
from sqlalchemy import create_engine

engine = create_engine('sqlite:///disaster.db')
df.to_sql('messages_disaster', engine, index=False)

ML Pipeline

In the second part of the project, I created the machine learning pipeline that classifies the disaster messages into the different categories.

It is called a 'pipeline' because this modeling tool is composed of several steps that process the inputs to generate the outputs.

In this case, I used tokenization to process the text data.

# load data from database
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('messages_disaster', con=engine)
X = df['message']
Y = df.drop(['message', 'genre', 'id', 'original'], axis=1)

# Tokenization function to process text data.
def tokenize(text):
    # split the raw message into word tokens
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        # lemmatize, lowercase and strip whitespace from each token
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    return clean_tokens

The machine learning pipeline will take the message column as input and output a classification over the 36 categories in the dataset.

This is a problem of Natural Language Processing, i.e. processing text to extract meaning from the message.
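To make the tokenizer concrete, here is roughly what it returns for a sample message of my own (the exact tokens depend on the NLTK version and its downloaded data):

tokenize("Please, we need tents and water in Jacmel")
# -> ['please', ',', 'we', 'need', 'tent', 'and', 'water', 'in', 'jacmel']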

Isn't that amazing?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),  # use the tokenize function defined above
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

Just like with all other ML models, we must have training and testing sets.

The reason is that we do not want a model that does super well on the training set but cannot classify the categories properly when it sees new data.

Hence we must use only a subset of our data to train it and see how it performs on the testing set.

# Split data into train and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=45)

# Train the model.
pipeline.fit(X_train, y_train)

When it comes to testing my model, I want to have some objective measures of performance.

Namely, I will look at the F1 score, precision (the share of predicted positives that are correct) and recall (the share of actual positives that the model finds).
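To see what these metrics look like before applying them to the project, here is a tiny self-contained example on made-up labels (not the project data):

from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1]  # made-up ground truth for one category
y_pred = [1, 0, 0, 1, 1, 1]  # made-up predictions
print(classification_report(y_true, y_pred))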

# Test the model and print the classification report for each of the 36 categories.
from sklearn.metrics import classification_report

def performance(model, X_test, y_test):
    y_pred = model.predict(X_test)
    for i, col in enumerate(y_test):
        print(col)
        print(classification_report(y_test[col], y_pred[:, i]))

performance(pipeline, X_test, y_test)

When building models, it's always a good idea to seek improvement.

Try to adjust the parameters of the model to get better results.

This is what I am attempting here.

It’s the same process, but with a different pipeline.

# Improve the pipeline.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import AdaBoostClassifier

pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('best', TruncatedSVD()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

# Train the adjusted pipeline.
pipeline2.fit(X_train, y_train)

# Check the performance of the adjusted model.
performance(pipeline2, X_test, y_test)

I went one step further and tried a different set of parameters over a range of values.

With the help of GridSearchCV, the best combination of parameters is chosen by cross-validation.

from sklearn.model_selection import GridSearchCV

parameters2 = {
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__learning_rate': [1, 2]
}

cv2 = GridSearchCV(pipeline2, param_grid=parameters2)
cv2.fit(X_train, y_train)
performance(cv2, X_test, y_test)
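With the grid search done, I saved the tuned model as a pickle file, as mentioned in the project structure. Here is a minimal sketch of that step; the model.pkl file name is my own choice, not from the original write-up:

import pickle

# Inspect the parameter combination the grid search selected.
print(cv2.best_params_)

# Persist the best estimator so the web app can load it later.
with open('model.pkl', 'wb') as f:
    pickle.dump(cv2.best_estimator_, f)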

Build the API

Finally, I concluded this project by building an API that takes in a disaster message and classifies it into the most likely disaster categories.

This way, we can help disaster organizations better understand what type of catastrophe happened and which sort of aid is necessary.
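For illustration, here is a minimal sketch of what such an app could look like. This is not the project's actual run.py; the /classify route, the model.pkl path and the port are my assumptions:

import pickle
import pandas as pd
from flask import Flask, jsonify, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the category names from the cleaned data and the trained model.
# Note: unpickling the pipeline requires the tokenize function to be available.
engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('messages_disaster', con=engine)
category_names = df.drop(['message', 'genre', 'id', 'original'], axis=1).columns

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/classify')
def classify():
    # e.g. /classify?query=we+need+water+and+tents
    query = request.args.get('query', '')
    labels = model.predict([query])[0]
    return jsonify(dict(zip(category_names, labels.tolist())))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3001)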

Closing Words

If you made it this far, thank you very much for reading! I hope this gives you an idea of how useful machine learning can be.

And how wide the range of its applications is.

We can literally save people's lives by knowing how to process text data and implement models.

Wish you all the very best and stay blessed :)

