Named Entity Recognition (NER) with Keras and TensorFlow

We are going to use a dataset from Kaggle. First we load it, drop the part-of-speech column, and forward-fill the sentence numbers:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.drop(['POS'], axis=1)
data = data.fillna(method="ffill")
data.tail(12)
```

```python
words = set(list(data['Word'].values))
words.add('PADword')
n_words = len(words)
n_words    # 35179

tags = list(set(data["Tag"].values))
n_tags = len(tags)
n_tags     # 17
```

We have 47959 sentences in our dataset, 35179 different words and 17 different named entities (tags). Let's have a look at the distribution of the sentence lengths in the dataset. First we group the words and their tags by sentence:

```python
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
```

This class is in charge of converting every sentence with its named entities (tags) into a list of tuples [(word, named entity), …]:

```python
getter = SentenceGetter(data)
sent = getter.get_next()
print(sent)
```

```
[('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have', 'O'), ('marched', 'O'), ('through', 'O'), ('London', 'B-geo'), ('to', 'O'), ('protest', 'O'), ('the', 'O'), ('war', 'O'), ('in', 'O'), ('Iraq', 'B-geo'), ('and', 'O'), ('demand', 'O'), ('the', 'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops', 'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
```

```python
sentences = getter.sentences
print(len(sentences))    # 47959

largest_sen = max(len(sen) for sen in sentences)
print('biggest sentence has {} words'.format(largest_sen))    # biggest sentence has 104 words
```

So the longest sentence has 104 words in it, and the distribution of sentence lengths shows that almost all of the sentences have fewer than 60 words in them.
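The plotting code for that distribution isn't shown above; a minimal sketch to reproduce it, reusing the matplotlib setup we already imported (the bin count is my own choice, not the article's original code), could look like this:

```python
# Assumed reconstruction of the sentence-length histogram (not from the original text).
# Nearly all sentences fall below 60 words, which motivates the padding length of 50 chosen below.
plt.hist([len(s) for s in sentences], bins=50)
plt.xlabel("Sentence length (words)")
plt.ylabel("Number of sentences")
plt.show()
```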
One of the biggest benefits of this approach is that we don't need any feature engineering; all we need is the sentences and their labeled words, and the rest of the work is carried out by the ELMo embeddings. In order to feed our sentences into an LSTM network, they all need to be the same size. Looking at the distribution graph, we can set the length of all sentences to 50 and add a generic word for the empty spaces; this process is called padding. (Another reason 50 is a good number is that my laptop cannot handle longer sentences.)

```python
max_len = 50
X = [[w[0] for w in s] for s in sentences]

new_X = []
for seq in X:
    new_seq = []
    for i in range(max_len):
        try:
            new_seq.append(seq[i])
        except:
            new_seq.append("PADword")
    new_X.append(new_seq)

new_X[15]
```

```
['Israeli', 'officials', 'say', 'Prime', 'Minister', 'Ariel', 'Sharon', 'will', 'undergo', 'a', 'medical', 'procedure', 'Thursday', 'to', 'close', 'a', 'tiny', 'hole', 'in', 'his', 'heart', 'discovered', 'during', 'treatment', 'for', 'a', 'minor', 'stroke', 'suffered', 'last', 'month', '.', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword', 'PADword']
```

The same applies to the named entities, but we need to map our labels to numbers this time:

```python
from keras.preprocessing.sequence import pad_sequences

tags2index = {t: i for i, t in enumerate(tags)}
y = [[tags2index[w[1]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tags2index["O"])
y[15]
```

```
array([4, 7, 7, 0, 1, 1, 1, 7, 7, 7, 7, 7, 9, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7])
```

Next we split our data into training and test sets, and then we import TensorFlow Hub (a library for the publication, discovery, and consumption of reusable parts of machine learning models) to load the ELMo embedding module, and Keras to start building our network:

```python
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K

X_tr, X_te, y_tr, y_te = train_test_split(new_X, y, test_size=0.1, random_state=2018)

sess = tf.Session()
K.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())
```

Running the above block of code for the first time will take a while, because ELMo is almost 400 MB. Next we use a function to convert our sentences to ELMo embeddings:

```python
batch_size = 32

def ElmoEmbedding(x):
    return elmo_model(inputs={"tokens": tf.squeeze(tf.cast(x, tf.string)),
                              "sequence_len": tf.constant(batch_size*[max_len])},
                      signature="tokens",
                      as_dict=True)["elmo"]
```

Now let's build our neural network:

```python
from keras.models import Model, Input
from keras.layers.merge import add
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Lambda

input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(input_text)
x = Bidirectional(LSTM(units=512, return_sequences=True,
                       recurrent_dropout=0.2, dropout=0.2))(embedding)
x_rnn = Bidirectional(LSTM(units=512, return_sequences=True,
                           recurrent_dropout=0.2, dropout=0.2))(x)
x = add([x, x_rnn])  # residual connection to the first biLSTM
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

model = Model(input_text, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```

Since we have 32 as the batch size, the network has to be fed in chunks whose sizes are all multiples of 32:

```python
X_tr, X_val = X_tr[:1213*batch_size], X_tr[-135*batch_size:]
y_tr, y_val = y_tr[:1213*batch_size], y_tr[-135*batch_size:]
y_tr = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)
y_val = y_val.reshape(y_val.shape[0], y_val.shape[1], 1)

history = model.fit(np.array(X_tr), y_tr,
                    validation_data=(np.array(X_val), y_val),
                    batch_size=batch_size, epochs=3, verbose=1)
```

```
Train on 38816 samples, validate on 4320 samples
Epoch 1/3
38816/38816 [==============================] - 834s 21ms/step - loss: 0.0625 - acc: 0.9818 - val_loss: 0.0449 - val_acc: 0.9861
Epoch 2/3
38816/38816 [==============================] - 833s 21ms/step - loss: 0.0405 - acc: 0.9869 - val_loss: 0.0417 - val_acc: 0.9868
Epoch 3/3
38816/38816 [==============================] - 831s 21ms/step - loss: 0.0336 - acc: 0.9886 - val_loss: 0.0406 - val_acc: 0.9873
```

The initial goal was to play around with parameter tuning to achieve higher accuracy, but my laptop could not handle more than 3 epochs, batch sizes bigger than 32, or a larger test size. I am running Keras on a GeForce GTX 1060 and it took almost 45 minutes to train those 3 epochs; if you have a better GPU, give it a shot by changing some of those parameters.

0.9873 validation accuracy is a great score; however, we are not really interested in evaluating our model with the accuracy metric. Let's see how we can get precision, recall, and F1 scores instead:

```python
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

X_te = X_te[:149*batch_size]
test_pred = model.predict(np.array(X_te), verbose=1)
```

```
4768/4768 [==============================] - 64s 13ms/step
```
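The hard-coded slices above (1213 × 32 and 135 × 32 for training and validation, 149 × 32 for the test set) are simply each split trimmed down to the largest multiple of the batch size, since the ELMo `Lambda` layer is built around full batches of 32. A small helper, shown as a sketch of my own rather than part of the original code, makes that intent explicit:

```python
# Sketch of a helper (not from the original post): keep only as many rows as
# fit into complete batches, because ElmoEmbedding assumes exactly
# batch_size sequences per batch.
def trim_to_full_batches(rows, batch_size=32):
    n_full = (len(rows) // batch_size) * batch_size
    return rows[:n_full]

# For example, the 4796-sentence test split becomes 149 * 32 = 4768 rows,
# matching the slice X_te[:149*batch_size] used above.
```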
Now we convert the predictions and the ground truth back into tag names and compute per-entity scores with seqeval:

```python
idx2tag = {i: w for w, i in tags2index.items()}

def pred2label(pred):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            p_i = np.argmax(p)
            out_i.append(idx2tag[p_i].replace("PADword", "O"))
        out.append(out_i)
    return out

def test2label(pred):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            out_i.append(idx2tag[p].replace("PADword", "O"))
        out.append(out_i)
    return out

pred_labels = pred2label(test_pred)
test_labels = test2label(y_te[:149*32])
print(classification_report(test_labels, pred_labels))
```

```
             precision    recall  f1-score   support

        org       0.69      0.66      0.68      2061
        tim       0.88      0.84      0.86      2148
        gpe       0.95      0.93      0.94      1591
        per       0.75      0.80      0.77      1677
        geo       0.85      0.89      0.87      3720
        art       0.23      0.14      0.18        49
        eve       0.33      0.33      0.33        33
        nat       0.47      0.36      0.41        22

avg / total       0.82      0.82      0.82     11301
```

A 0.82 F1 score is an outstanding result: it beats all of the other three deep learning methods mentioned at the beginning of this section, and it can easily be adapted by industry.

Finally, let's see what our predictions look like:

```python
i = 390
p = model.predict(np.array(X_te[i:i+batch_size]))[0]
p = np.argmax(p, axis=-1)
print("{:15} {:5}: ({})".format("Word", "Pred", "True"))
print("="*30)
for w, true, pred in zip(X_te[i], y_te[i], p):
    if w != "__PAD__":
        print("{:15}:{:5} ({})".format(w, tags[pred], tags[true]))
```

```
Word            Pred : (True)
==============================
Citing         :O     (O)
a              :O     (O)
draft          :O     (O)
report         :O     (O)
from           :O     (O)
the            :O     (O)
U.S.           :B-org (B-org)
Government     :I-org (I-org)
Accountability :I-org (O)
office         :O     (O)
,              :O     (O)
The            :B-org (B-org)
New            :I-org (I-org)
York           :I-org (I-org)
Times          :I-org (I-org)
said           :O     (O)
Saturday       :B-tim (B-tim)
the            :O     (O)
losses         :O     (O)
amount         :O     (O)
to             :O     (O)
between        :O     (O)
1,00,000       :O     (O)
and            :O     (O)
3,00,000       :O     (O)
barrels        :O     (O)
a              :O     (O)
day            :O     (O)
of             :O     (O)
Iraq           :B-geo (B-geo)
's             :O     (O)
declared       :O     (O)
oil            :O     (O)
production     :O     (O)
over           :O     (O)
the            :O     (O)
past           :B-tim (B-tim)
four           :I-tim (I-tim)
years          :O     (O)
.              :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
PADword        :O     (O)
```
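If you want to tag a sentence of your own, remember that the ELMo layer expects full batches of 32 padded sequences. The following is a minimal sketch of mine, not part of the original post (the example sentence is made up), reusing `model`, `tags`, `max_len` and `batch_size` from above:

```python
# Hedged sketch: tag a custom sentence with the trained model.
# The padded sentence is repeated to fill one complete batch of 32,
# and we read the predictions back from the first row.
sentence = "George Washington went to Washington .".split()
padded = sentence[:max_len] + ["PADword"] * (max_len - len(sentence))
batch = np.array([padded] * batch_size)
pred = np.argmax(model.predict(batch)[0], axis=-1)
for word, tag_idx in zip(sentence, pred):
    print("{:15} {}".format(word, tags[tag_idx]))
```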

As always, the code and Jupyter notebook are available on my GitHub. Questions and comments are highly appreciated.

References:
https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/
http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
https://allennlp.org/elmo
https://jalammar.github.io/illustrated-bert/