Solving an NLP task using a Sequence2Sequence model: from Zero to Hero

In short, NER is the task of extracting Named Entities from a sequence of words (a sentence).

Here I'm going to do the following:

1. Build a very simple model that treats this task as a classification of each word in every sentence, and use it as a benchmark.
2. Build a Sequence to Sequence model using Keras.
3. Talk about the right way to measure and compare our results.
4. Use pre-trained GloVe embeddings in the Seq2Seq model.

Feel free to jump to any section.

Bag of Words and Multi-class Classification

As I mentioned before, our output should be a sequence of classes, but first I want to explore a somewhat naive approach: a simple multi-class classification model.

Sometimes we try to model these problems as simple classification tasks while in reality a sequence model could be much better.

As I said, I treat this approach as a benchmark and keep things as simple as possible, so for each word (instance) my features will simply be the vector of the word itself plus a Bag of Words vector of all the other words in the same sentence. My target variable will be one of 17 classes.

    import scipy.sparse

    def sentence_to_instances(words, tags, bow, count_vectorizer):
        X = []
        y = []
        for w, t in zip(words, tags):
            # Vector for the current word on its own
            v = count_vectorizer.transform([w])[0]
            # Append the Bag of Words vector of the sentence
            v = scipy.sparse.hstack([v, bow])
            X.append(v)
            y.append(t)
        return scipy.sparse.vstack(X), y

So given a sentence like:

"The World Health Organization says 227 people have died from bird flu"

we'll get 12 instances, one for each word:

    the           O
    world         B-org
    health        I-org
    organization  I-org
    says          O
    227           O
    people        O
    have          O
    died          O
    from          O
    bird          O
    flu           O

Now our task is, given a single word in a sentence, to predict its class.

We have 47958 sentences in our dataset, and we break them into "train" and "test" sets:

    train_size = int(len(sentences_words) * 0.8)

    train_sentences_words = sentences_words[:train_size]
    train_sentences_tags = sentences_tags[:train_size]
    test_sentences_words = sentences_words[train_size:]
    test_sentences_tags = sentences_tags[train_size:]

    # ============== Output ==============================
    Train: 38366
    Test: 9592

We'll use the method above to transform all the sentences into many instances of words. In the train dataset, we have 839,214 word instances.

    train_X, train_y = sentences_to_instances(train_sentences_words,
                                              train_sentences_tags,
                                              count_vectorizer)

    print('Train X shape:', train_X.shape)
    print('Train Y shape:', train_y.shape)

    # ============== Output ==============================
    Train X shape: (839214, 50892)
    Train Y shape: (839214,)

In our X we have 50892 dimensions: a one-hot vector for the current word and a Bag of Words vector for all other words in the same sentence.

We'll use a Gradient Boosting Classifier as our predictor:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report

    clf = GradientBoostingClassifier().fit(train_X, train_y)
    predicted = clf.predict(test_X)
    print(classification_report(test_y, predicted))

We get:

                 precision    recall  f1-score   support

          B-art       0.57      0.05      0.09        82
          B-eve       0.68      0.28      0.40        46
          B-geo       0.91      0.40      0.56      7553
          B-gpe       0.96      0.84      0.90      3242
          B-nat       0.52      0.27      0.36        48
          B-org       0.93      0.31      0.46      4082
          B-per       0.80      0.52      0.63      3321
          B-tim       0.91      0.66      0.76      4107
          I-art       0.09      0.02      0.04        43
          I-eve       0.33      0.02      0.04        44
          I-geo       0.82      0.55      0.66      1408
          I-gpe       0.86      0.62      0.72        40
          I-nat       0.20      0.08      0.12        12
          I-org       0.88      0.24      0.38      3470
          I-per       0.93      0.25      0.40      3332
          I-tim       0.67      0.15      0.25      1308
              O       0.91      1.00      0.95    177215

    avg / total       0.91      0.91      0.89    209353

Is it good?

Each of the 75 tokens now has a vector of size 300. Once we have this, we can use the Bidirectional LSTM layer, which for each token looks both ways in the sentence and returns a state that will help us classify the word later on.

Later in this post, we'll use pre-trained embeddings to improve our model.

Let's train our model:

    model.fit(train_sequences_padded, train_tags_padded,
              batch_size=32,
              epochs=10,
              validation_data=(test_sequences_padded, test_tags_padded))

    # ============== Output ==============================
    Train on 38366 samples, validate on 9592 samples
    Epoch 1/10
    38366/38366 [==============================] - 274s 7ms/step - loss: 0.1307 - sparse_categorical_accuracy: 0.9701 - val_loss: 0.0465 - val_sparse_categorical_accuracy: 0.9869
    Epoch 2/10
    38366/38366 [==============================] - 276s 7ms/step - loss: 0.0365 - sparse_categorical_accuracy: 0.9892 - val_loss: 0.0438 - val_sparse_categorical_accuracy: 0.9879
    Epoch 3/10
    38366/38366 [==============================] - 264s 7ms/step - loss: 0.0280 - sparse_categorical_accuracy: 0.9914 - val_loss: 0.0470 - val_sparse_categorical_accuracy: 0.9880
    Epoch 4/10
    38366/38366 [==============================] - 261s 7ms/step - loss: 0.0229 - sparse_categorical_accuracy: 0.9928 - val_loss: 0.0480 - val_sparse_categorical_accuracy: 0.9878
    Epoch 5/10
    38366/38366 [==============================] - 263s 7ms/step - loss: 0.0189 - sparse_categorical_accuracy: 0.9939 - val_loss: 0.0531 - val_sparse_categorical_accuracy: 0.9878
    Epoch 6/10
    38366/38366 [==============================] - 294s 8ms/step - loss: 0.0156 - sparse_categorical_accuracy: 0.9949 - val_loss: 0.0625 - val_sparse_categorical_accuracy: 0.9874
    Epoch 7/10
    38366/38366 [==============================] - 318s 8ms/step - loss: 0.0129 - sparse_categorical_accuracy: 0.9958 - val_loss: 0.0668 - val_sparse_categorical_accuracy: 0.9872
    Epoch 8/10
    38366/38366 [==============================] - 275s 7ms/step - loss: 0.0107 - sparse_categorical_accuracy: 0.9965 - val_loss: 0.0685 - val_sparse_categorical_accuracy: 0.9869
    Epoch 9/10
    38366/38366 [==============================] - 270s 7ms/step - loss: 0.0089 - sparse_categorical_accuracy: 0.9971 - val_loss: 0.0757 - val_sparse_categorical_accuracy: 0.9870
    Epoch 10/10
    38366/38366 [==============================] - 266s 7ms/step - loss: 0.0076 - sparse_categorical_accuracy: 0.9975 - val_loss: 0.0801 - val_sparse_categorical_accuracy: 0.9867

We get 98.6% accuracy on our test set.

In this case, there are no "B-" or "I-" tags; we compare the actual type of entity and not word classes.

Using our predicted values, which are a matrix of probabilities, we want to construct a sequence of tags for each sentence with its original length (and not 75 as we did) so we can compare them to the true values. We will do this both for our LSTM model and our Bag of Words model:

    lstm_predicted = model.predict(test_sequences_padded)

    lstm_predicted_tags = []
    bow_predicted_tags = []
    for s, s_pred in zip(test_sentences_words, lstm_predicted):
        # Most probable tag index for each of the 75 positions
        tags = np.argmax(s_pred, axis=1)
        # Map indices back to tags and keep only the last len(s) positions
        tags = list(map(index_tag_wo_padding.get, tags))[-len(s):]
        lstm_predicted_tags.append(tags)

        bow_vector, _ = sentences_to_instances([s], [['x'] * len(s)], count_vectorizer)
        bow_predicted = clf.predict(bow_vector)[0]
        bow_predicted_tags.append(bow_predicted)

Now we are ready to evaluate both our models using the seqeval library:

    from seqeval.metrics import classification_report, f1_score

    print('LSTM')
    print('=' * 15)
    print(classification_report(test_sentences_tags, lstm_predicted_tags))
    print()
    print('BOW')
    print('=' * 15)
    print(classification_report(test_sentences_tags, bow_predicted_tags))

We get:

    LSTM
    ===============
                 precision    recall  f1-score   support

            art       0.11      0.10      0.10        82
            gpe       0.94      0.96      0.95      3242
            eve       0.21      0.33      0.26        46
            per       0.66      0.58      0.62      3321
            tim       0.84      0.83      0.84      4107
            nat       0.00      0.00      0.00        48
            org       0.58      0.55      0.57      4082
            geo       0.83      0.83      0.83      7553

    avg / total       0.77      0.75      0.76     22481

    BOW
    ===============
                 precision    recall  f1-score   support

            art       0.00      0.00      0.00        82
            gpe       0.01      0.00      0.00      3242
            eve       0.00      0.00      0.00        46
            per       0.00      0.00      0.00      3321
            tim       0.00      0.00      0.00      4107
            nat       0.00      0.00      0.00        48
            org       0.01      0.00      0.00      4082
            geo       0.03      0.00      0.00      7553

    avg / total       0.01      0.00      0.00     22481

There's a big difference. You can see that the BOW model was able to predict almost nothing right, while the LSTM model did a much better job.

Of course, we could work more on the BOW model and achieve much better results, but the big picture is clear: the Sequence to Sequence model is much more appropriate in this case.

Pre-trained Word Embeddings

As we saw before, most of our model parameters were for the Embedding layer.

While it's not directly related to our task, using these embeddings may help our model to represent words better for its goal.

There are other ways to build word embeddings, from simple co-occurrence matrices to much more complex language models.

Let's evaluate it the right way and compare to our previous models:

    lstm_predicted_tags = []
    for s, s_pred in zip(test_sentences_words, lstm_predicted):
        tags = np.argmax(s_pred, axis=1)
        tags = list(map(index_tag_wo_padding.get, tags))[-len(s):]
        lstm_predicted_tags.append(tags)

    print('LSTM + Pretrained Embbeddings')
    print('=' * 15)
    print(classification_report(test_sentences_tags, lstm_predicted_tags))

    # ============== Output ==============================
    LSTM + Pretrained Embbeddings
    ===============
                 precision    recall  f1-score   support

            art       0.45      0.06      0.11        82
            gpe       0.97      0.95      0.96      3242
            eve       0.56      0.33      0.41        46
            per       0.72      0.71      0.72      3321
            tim       0.87      0.84      0.85      4107
            nat       0.00      0.00      0.00        48
            org       0.62      0.56      0.59      4082
            geo       0.83      0.88      0.86      7553

    avg / total       0.80      0.80      0.80     22481

Much better, our F1 score increased from 0.76 to 0.80!

Conclusion

Sequence to Sequence models are very powerful for many tasks such as Named Entity Recognition (NER), Part of Speech (POS) tagging, parsing and more.
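As a footnote to the evaluation discussion above, here is a minimal sketch of why per-token accuracy and entity-level F1 can disagree so sharply. It is not the seqeval implementation: `extract_entities`, `entity_f1` and `token_accuracy` are hypothetical helpers written only for illustration, and they assume plain BIO tags like the ones used in this post.

```python
def extract_entities(tags):
    """Collapse a BIO tag sequence into a set of (type, start, end) spans."""
    entities = set()
    start, current = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                entities.add((current, start, i - 1))
            if tag == "O":
                start, current = None, None
            else:
                # A "B-" tag (or a stray "I-" tag) opens a new span.
                start, current = i, tag[2:]
    return entities


def entity_f1(true_seqs, pred_seqs):
    """F1 over exact span matches: the type and both boundaries must agree."""
    tp = n_true = n_pred = 0
    for t, p in zip(true_seqs, pred_seqs):
        true_ents, pred_ents = extract_entities(t), extract_entities(p)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(true_seqs, pred_seqs):
    """Share of individual tokens whose tag is predicted correctly."""
    pairs = [(t, p) for ts, ps in zip(true_seqs, pred_seqs) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


# "The World Health Organization says 227 people have died from bird flu"
true_tags = ["O", "B-org", "I-org", "I-org"] + ["O"] * 8
all_o = ["O"] * 12                                    # predicts no entity at all
partial = ["O", "B-org", "I-org", "O"] + ["O"] * 8    # misses the last entity token

print(token_accuracy([true_tags], [all_o]), entity_f1([true_tags], [all_o]))
print(token_accuracy([true_tags], [partial]), entity_f1([true_tags], [partial]))
```

In this toy case a model that predicts "O" everywhere still gets 9 of 12 tokens right, and a model that cuts the entity one token short gets 11 of 12, yet both score zero at the entity level because neither recovers the full span.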

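One more note on the `[-len(s):]` slice in the prediction loops: it only recovers the right tags if the sequences were padded at the front to the fixed length of 75 (Keras `pad_sequences` pads at the front by default). The padding step itself is not shown in this post, so the sketch below is an assumption about how that bookkeeping works, with `pad_pre` and `unpad` as hypothetical helper names and 0 as an assumed padding index.

```python
import numpy as np

MAX_LEN = 75  # fixed sequence length used in this post


def pad_pre(seq, max_len=MAX_LEN, pad_value=0):
    """Left-pad a list of integer ids; longer inputs keep their last max_len items."""
    seq = list(seq)[-max_len:]
    return np.array([pad_value] * (max_len - len(seq)) + seq)


def unpad(padded, original_len):
    """Drop the front padding by keeping the last original_len positions.

    Assumes 1 <= original_len <= len(padded).
    """
    return padded[-original_len:]


word_ids = [12, 7, 431]          # a three-word sentence as made-up word indices
padded = pad_pre(word_ids)
print(padded.shape)              # fixed length expected by the model
print(unpad(padded, len(word_ids)))
```

If the padding were applied at the end instead, the same slice would return padding positions rather than the sentence, so the slice and the padding side have to match.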