NLP Tutorial: MultiLabel Classification Problem using Linear Models

Georgios Drakos, Jun 16

This article presents in detail how to predict tags for posts from StackOverflow using a linear model after carefully preprocessing our text features.

Table of Contents

- Introduction
- Dataset
- Import Libraries and Load the data
- Text Preprocessing
- EDA
- Transforming text to a vector
- MultiLabel Classifier
- Evaluation
- HyperParameter Tuning
- Feature Importance
- Conclusion
- References

Introduction

One of the most common tasks of NLP is to automatically predict the topic of a question.

In this article, we'll start by preprocessing the questions and tags from Stack Overflow, and then build a simple model to predict the tag of a Stack Overflow question.

Let’s get started.

Dataset

For this project, we'll use the Stack Overflow Tag Prediction dataset, which can be found on Kaggle.

Import Libraries and Load the data

In this task you will need the following libraries:

- Numpy — a package for scientific computing.
- Pandas — a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- scikit-learn — a tool for data mining and data analysis.
- NLTK — a platform to work with natural language.

import pandas as pd
import numpy as np
import nltk, re
nltk.download('stopwords')  # load english stopwords
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
warnings.warn("deprecated", DeprecationWarning)
warnings.simplefilter("ignore")

The list of stop words is downloaded from nltk.

Secondly, we will load the data and split it into train and test datasets.

dataset = pd.read_csv('/Users/Georgios.Drakos/Downloads/train.csv')
print(dataset.shape)

# 70-30% random split of dataset
X_train, X_test, y_train, y_test = train_test_split(dataset['title'].values, dataset['tags'].values,
                                                    test_size=0.3, random_state=42)
dataset.head()

As you can see, the "title" column contains the titles of the posts and the "tags" column contains the tags. Notice that the number of tags per post is not fixed; a post can have as many tags as necessary.

Text Preprocessing

One of the best-known difficulties when working with natural language data is that it's unstructured.

For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will see that there are many "weird" tokens.

To prevent these problems, it's usually useful to clean up the data first.
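As a quick illustration (using one of the test titles that appears further below), splitting a raw title on whitespace alone keeps punctuation and casing glued to the tokens:

raw_title = "How to free c++ memory vector<int> * arr?"
print(raw_title.split())
# ['How', 'to', 'free', 'c++', 'memory', 'vector<int>', '*', 'arr?']
# Tokens such as 'arr?' and '*' would each become separate vocabulary entries.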

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = list(stopwords.words('english'))

def text_prepare(text, join_symbol):
    """
    text: a string
    return: modified initial string
    """
    # lowercase text
    text = text.lower()
    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(REPLACE_BY_SPACE_RE, " ", text)
    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(BAD_SYMBOLS_RE, "", text)
    # collapse repeated whitespace
    text = re.sub(r'\s+', " ", text)
    # delete stopwords from text
    text = f'{join_symbol}'.join([i for i in text.split() if i not in STOPWORDS])
    return text

tests = ["SQL Server – any equivalent of Excel's CHOOSE function?",
         "How to free c++ memory vector<int> * arr?"]
for test in tests:
    print(text_prepare(test, ' '))

Now we can preprocess the titles using the function text_prepare, making sure that both the titles and tags don't contain bad symbols:

X_train = [text_prepare(x, ' ') for x in X_train]
X_test = [text_prepare(x, ' ') for x in X_test]
y_train = [text_prepare(x, ',') for x in y_train]
y_test = [text_prepare(x, ',') for x in y_test]

EDA

Let's find the 3 most popular tags and the 3 most popular words in the train dataset.

from collections import Counter
from itertools import chain

# Dictionary of all tags from train corpus with their counts.
tags_counts = Counter(chain.from_iterable([i.split(",") for i in y_train]))
# Dictionary of all words from train corpus with their counts.
words_counts = Counter(chain.from_iterable([i.split(" ") for i in X_train]))

top_3_most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
top_3_most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

print(f"Top three most popular tags are: {','.join(tag for tag, _ in top_3_most_common_tags)}")
print(f"Top three most popular words are: {','.join(word for word, _ in top_3_most_common_words)}")

Transforming text to a vector

Machine Learning algorithms work with numeric data and we cannot use the provided text data "as is".

There are many ways to transform text data into numeric vectors.

In this article, we will try to use two of them.

Bag of words

One of the well-known approaches is a bag-of-words representation. To create this transformation, follow the steps below:

1. Find the N most popular words in the train corpus and enumerate them. Now we have a dictionary of the most popular words.
2. For each title in the corpora create a zero vector with dimension equal to N.
3. For each text in the corpora iterate over the words which are in the dictionary and increase by 1 the corresponding coordinate.

Let’s try to do it for a toy example.

Imagine that we have N = 4 and the list of the most popular words is:

['hi', 'you', 'me', 'are']

Then we need to enumerate them, for example, like this:

{'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text which we want to transform into a vector:

'hi how are you'

For this text, we create a corresponding zero vector:

[0, 0, 0, 0]

We then iterate over all words, and if a word is in the dictionary, we increase the value of the corresponding position in the vector:

'hi':  [1, 0, 0, 0]
'how': [1, 0, 0, 0]  # the word 'how' is not in our dictionary
'are': [1, 0, 0, 1]
'you': [1, 1, 0, 1]

The resulting vector will be:

[1, 1, 0, 1]

We will now implement the described encoding in the function my_bag_of_words with the size of the dictionary equal to 5000. To find the most common words we use the train data.
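Before moving to the full implementation, here is a minimal sketch (for illustration only) that reproduces the toy example above:

import numpy as np

toy_vocab = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}   # N = 4
text = 'hi how are you'

vector = np.zeros(len(toy_vocab))
for word in text.split():
    if word in toy_vocab:                 # 'how' is skipped: it is not in the dictionary
        vector[toy_vocab[word]] += 1
print(vector)                             # [1. 1. 0. 1.]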

# We considered only the top 5,000 words, this parameter can be fine-tuned
DICT_SIZE = 5000
WORDS_TO_INDEX = {j[0]: i for i, j in enumerate(sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE])}
INDEX_TO_WORDS = {i: j[0] for i, j in enumerate(sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE])}
ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    dict_size: size of the dictionary
    return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    keys = [words_to_index[i] for i in text.split(" ") if i in words_to_index.keys()]
    result_vector[keys] = 1
    return result_vector

Now apply the implemented function to all samples.

We transform the data to sparse representation, to store the useful information efficiently.

There are many types of such representations; however, sklearn algorithms can only work with the CSR matrix format, so we will use this one.
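As a rough illustration of the saving (a sketch with made-up numbers, not taken from the original notebook), compare a dense batch of bag-of-words vectors with its CSR version:

import numpy as np
from scipy import sparse as sp_sparse

rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000))                         # 1,000 hypothetical titles, 5,000-word dictionary
for row in dense:
    row[rng.choice(5000, size=10, replace=False)] = 1  # ~10 distinct dictionary words per title

sparse_csr = sp_sparse.csr_matrix(dense)
print(dense.nbytes)                                    # 40,000,000 bytes for the dense float64 array
print(sparse_csr.data.nbytes + sparse_csr.indices.nbytes + sparse_csr.indptr.nbytes)  # roughly 100-150 KB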

from scipy import sparse as sp_sparse

X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

TF-IDF

The second approach extends the bag-of-words framework by taking into account the total frequencies of words in the corpora.

It helps to penalize overly frequent words and provides a better feature space.

We use TfidfVectorizer from scikit-learn and our train corpus to train a vectorizer.

Don't forget to take a look at the arguments that you can pass to it.

I filter out words that are too rare (occurring in fewer than 5 titles) and words that are too frequent (occurring in more than 90% of the titles).

We also use bigrams along with unigrams in our vocabulary.

Details about the TF-IDF technique can be found in my article here.
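As a quick refresher (a generic formulation; scikit-learn's TfidfVectorizer uses a smoothed idf term and L2-normalizes each row by default), the weight of a term t in a title d is

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is how often t occurs in d, N is the number of titles, and df(t) is the number of titles containing t. The more titles a word appears in, the smaller its weight, which is exactly the penalization of overly frequent words mentioned above.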

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_features(X_train, X_test):
    """
    X_train, X_test — samples
    return TF-IDF representation of each sample and vocabulary
    """
    # Create TF-IDF vectorizer with a proper parameters choice
    # Fit the vectorizer on the train set
    # Transform the train and test sets and return the result
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5, token_pattern=r'(\S+)')
    tfidf_vectorizer.fit(X_train)
    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    return X_train, X_test, tfidf_vectorizer.vocabulary_

X_train_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i: word for word, i in tfidf_vocab.items()}

items()}Once you have done text preprocessing, always have a look at the results.

Be very careful at this step, because the performance of future models will drastically depend on it.

In this case, check whether you have c++ or c# in your vocabulary, as they are obviously important tokens in our tag prediction task:

print("c#" in set(tfidf_reversed_vocab.values()))
print("c++" in set(tfidf_reversed_vocab.values()))

MultiLabel Classifier

As we have noticed before, in this task each example can have multiple tags.

To deal with this kind of prediction, we need to transform the labels into a binary form, and the prediction will then be a mask of 0s and 1s.

For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
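To see what the binarizer produces, here is a tiny toy example (made-up labels, purely for illustration):

from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [{'python', 'pandas'}, {'c++'}, {'python'}]
toy_mlb = MultiLabelBinarizer()
print(toy_mlb.fit_transform(toy_labels))
# [[0 1 1]
#  [1 0 0]
#  [0 0 1]]
print(toy_mlb.classes_)  # ['c++' 'pandas' 'python']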

Let's have a look at the target variable. First, we will need to transform each element of the labels into a set of tags before passing it to the MultiLabelBinarizer.

# transform each comma-separated tag string into a set of tags
y_train = [set(i.split(',')) for i in y_train]
y_test = [set(i.split(',')) for i in y_test]

Let's fit and transform the target variable:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
# reuse the binarizer fitted on the train tags so the columns line up
y_test = mlb.transform(y_test)

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class.

In this approach, k classifiers (one per tag) are trained. As the base classifier, we use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks.

It might take some time because the number of classifiers to train is large.
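Conceptually, One-vs-Rest simply fits one independent binary classifier per tag column of the binarized target; a minimal hand-rolled sketch (for illustration only — the article itself relies on sklearn's OneVsRestClassifier below) could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, Y):
    """Fit one binary LogisticRegression per label column of the indicator matrix Y."""
    models = []
    for k in range(Y.shape[1]):                  # one classifier per tag
        clf = LogisticRegression(solver='liblinear')
        clf.fit(X, Y[:, k])                      # column k is the 0/1 target for tag k
        models.append(clf)
    return models

def one_vs_rest_predict(models, X):
    """Stack the per-tag predictions back into a 0/1 mask."""
    return np.column_stack([clf.predict(X) for clf in models])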

# For multilabel classification
from sklearn.multiclass import OneVsRestClassifier
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier

def train_classifier(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    """
    X_train, y_train — training data
    return: trained classifier
    """
    if model == 'lr':
        model = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    elif model == 'svm':
        model = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    elif model == 'nbayes':
        model = MultinomialNB(alpha=1.0)
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    elif model == 'lda':
        model = LinearDiscriminantAnalysis(solver='svd')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    return model

# Train the classifiers for different data transformations: bag-of-words and tf-idf.
# Linear NLP model using the bag-of-words approach
%time classifier_mybag = train_classifier(X_train_mybag, y_train, C=1.0, model='lr')
# Linear NLP model using the TF-IDF approach
%time classifier_tfidf = train_classifier(X_train_tfidf, y_train, C=1.0, model='lr')

Create predictions for the data.

y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)
y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)

Now take a look at how the classifier that uses TF-IDF works for a few examples:

y_test_pred_inversed = mlb.inverse_transform(y_test_predicted_labels_tfidf)
y_test_inversed = mlb.inverse_transform(y_test)
for i in range(3):
    print('Title: {}.\tTrue labels: {}.\tPredicted labels: {}.'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_test_pred_inversed[i])
    ))

Now, we would need to compare the results of different predictions, e.g. to see whether the TF-IDF transformation helps, or to try different regularization techniques in logistic regression.

For all these experiments, we need to set up an evaluation procedure.

Evaluation

To evaluate the results we will use several classification metrics:

- Accuracy
- F1-score
- Area under the ROC curve
- Area under the precision-recall curve

We will create a function which calculates and prints out:

- accuracy
- F1-score macro/micro/weighted
- average precision macro/micro/weighted

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from functools import partial

def print_evaluation_scores(y_val, predicted):
    f1_score_macro = partial(f1_score, average="macro")
    f1_score_micro = partial(f1_score, average="micro")
    f1_score_weighted = partial(f1_score, average="weighted")
    average_precision_score_macro = partial(average_precision_score, average="macro")
    average_precision_score_micro = partial(average_precision_score, average="micro")
    average_precision_score_weighted = partial(average_precision_score, average="weighted")
    scores = [accuracy_score, f1_score_macro, f1_score_micro, f1_score_weighted,
              average_precision_score_macro, average_precision_score_micro,
              average_precision_score_weighted]
    for score in scores:
        print(score, score(y_val, predicted))

print('Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)

HyperParameter Tuning

Now, we will experiment a bit with training our classifiers, using the weighted F1-score as the evaluation metric.

Moreover, we choose to use the TF-IDF approach and try L1- and L2-regularization in Logistic Regression with different coefficients (e.g. C equal to 0.1, 1, 10, 100).

import matplotlib.pyplot as plt

hypers = np.arange(0.1, 1.1, 0.1)
res = []
for h in hypers:
    temp_model = train_classifier(X_train_tfidf, y_train, C=h, model='lr')
    temp_pred = f1_score(y_test, temp_model.predict(X_test_tfidf), average='weighted')
    res.append(temp_pred)

plt.figure(figsize=(7, 5))
plt.plot(hypers, res, color='blue', marker='o')
plt.grid(True)
plt.xlabel('Parameter $C$')
plt.ylabel('Weighted F1 score')
plt.show()

We fit the "best" model and create predictions for the test set when we are happy with the quality:

# Final model
C = 1.0
classifier = train_classifier(X_train_tfidf, y_train, C=C, model='lr')

# Results
test_predictions = classifier.predict(X_test_tfidf)
test_pred_inversed = mlb.inverse_transform(test_predictions)
test_pred_inversed

Feature Importance

Finally, it is usually a good idea to look at the features (words or n-grams) that are used with the largest weights in your logistic regression model in order to get an intuition about the model:

def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    """
    classifier: trained classifier
    tag: particular tag
    tags_classes: a list of class names from MultiLabelBinarizer
    index_to_words: index_to_words transformation
    all_words: all words in the dictionary
    return nothing, just print the top positive and top negative words for the current tag
    """
    print('Tag: {}'.format(tag))
    tag_n = np.where(tags_classes == tag)[0][0]
    model = classifier.estimators_[tag_n]
    top_positive_words = [index_to_words[x] for x in model.coef_.argsort().tolist()[0][-8:]]
    top_negative_words = [index_to_words[x] for x in model.coef_.argsort().tolist()[0][:8]]
    print('Top positive words: {}'.format(', '.join(top_positive_words)))
    print('Top negative words: {}.'.format(', '.join(top_negative_words)))

print_words_for_tag(classifier, 'c', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'c++', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'linux', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'python', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'r', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'java', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)

Conclusion

This brings us to the end of this article.

Hope you got a basic understanding of how to solve a MultiLabel Classification Problem using Linear Models by following this post.

Feel free to use the Python code snippets of this article.

The full code can be found on my GitHub page: https://github.com/geodra/Articles/blob/master/NLP%20Tutorial%20MultiLabel%20Classification%20Problem%20using%20Linear%20Models.ipynb

Thanks for reading and I am looking forward to hearing your questions :) Stay tuned and Happy Machine Learning.

References

- https://github.com/
- https://www.coursera.org/
- http://textvis.lnu.se/

Originally published at https://gdcoder.com on June 16, 2019.
