Text Classification with State of the Art NLP Library — Flair

If not, run pip install pandas first.

    import pandas as pd

    # Load the raw data, shuffle the rows and drop duplicates
    data = pd.read_csv("./spam.csv", encoding='latin-1').sample(frac=1).drop_duplicates()

    # Keep the label and text columns and add the FastText label prefix
    data = data[['v1', 'v2']].rename(columns={"v1": "label", "v2": "text"})
    data['label'] = '__label__' + data['label'].astype(str)

    # 80/10/10 split into train, test and dev sets
    data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep=' ', index=False, header=False)
    data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep=' ', index=False, header=False)
    data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep=' ', index=False, header=False)

This will remove some duplicates from our dataset, shuffle it (randomise rows) and split the data into train, dev and test sets using the 80/10/10 split.

If this runs successfully, you will end up with train.csv, test.csv and dev.csv formatted in the FastText format, ready to be used with Flair.

3.2 Training a Custom Text Classification Model

To train the model, run this snippet in the same directory as the generated dataset.

    from flair.data_fetcher import NLPTaskDataFetcher
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer
    from pathlib import Path

    # Load the three splits; each file goes to its matching argument
    corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')

    # One GloVe word embedding plus two Flair contextual string embeddings
    word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

    # Combine the word-level embeddings into a document embedding with an LSTM
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

    classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)
    trainer = ModelTrainer(classifier, corpus)
    trainer.train('./', max_epochs=20)

When running this code for the first time, Flair will download all required embedding models, which can take up to a few minutes. The whole training process will then take another 5 minutes.

This snippet first loads the required libraries and datasets into a corpus object.

Next, we create a list of the embeddings (two Flair contextual string embeddings and a GloVe word embedding). This list is then used as an input for our document embedding object. Stacked and document embeddings are among the most interesting concepts of Flair. They provide a means to combine different embeddings. You can use traditional word embeddings (like GloVe, word2vec, ELMo) together with Flair contextual string embeddings. In the example above we use an LSTM-based method of combining word and contextual string embeddings to generate document embeddings. You can read more about it here.

Finally, the snippet trains the model, which produces the final-model.pt and best-model.pt files that represent our stored trained model.

3.3 Using the Trained Model for Predictions

We can now use the exported model to generate predictions by running the following snippet from the same directory:

    from flair.models import TextClassifier
    from flair.data import Sentence

    classifier = TextClassifier.load_from_file('./best-model.pt')
    sentence = Sentence('Hi. Yes mum, I will…')
    classifier.predict(sentence)
    print(sentence.labels)

The snippet prints out ‘[ham (1.0)]’, meaning that the model is 100% sure our example message is not spam.

How does it Perform Compared to Other Frameworks?

Unlike Facebook’s FastText or even Google’s AutoML Natural Language platform, doing text classification with Flair is still a relatively low-level task. We have full control over how text embedding and training are done, with options to set parameters such as learning rate, batch size, anneal factor, loss function, optimiser selection… To achieve optimal performance, these hyperparameters need to be tuned. Flair provides a wrapper around the well-known hyperparameter tuning library Hyperopt (described here), which we can use to tune our hyperparameters.

In this article, we used the default hyperparameters for the sake of simplicity. With mostly default parameters our Flair model achieved an f1-score of 0.973 after 20 epochs.

For comparison, we trained a text classification model with FastText and on the AutoML Natural Language platform. We first ran FastText with the default parameters and achieved an f1-score of 0.883, meaning that our model outperformed FastText by a large margin.
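To make the f1-scores quoted above more concrete, here is a minimal sketch of how an F1 value is derived from precision and recall for a single class. The function name f1_score, the choice of "spam" as the positive class and the sample labels are all hypothetical and are not taken from the article's dataset. The scores reported by Flair and FastText may use a different averaging scheme, so treat this only as an illustration of the metric.

```python
def f1_score(gold, predicted, positive="spam"):
    """Binary F1 for the `positive` class from two equal-length label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # True positives: predicted positive and actually positive
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: predicted positive but actually another class
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: actually positive but predicted as another class
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was correctly found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for illustration only
gold = ["spam", "ham", "spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]
score = f1_score(gold, predicted)
print(round(score, 3))
```

With these sample labels there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the F1 value is 2/3 as well.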
