Supercharging word vectors

Using the fastText method for creating word vectors, we will be able to create a model which can handle out-of-vocabulary words and is robust to spelling mistakes and typos.

fastText word vectors

This article assumes prior knowledge of word vectors, but it is worth touching on fastText and how it differs from the more widely known word2vec approach to creating vector representations of words. fastText was developed by Facebook, with a stable release open-sourced in 2017.

The most noticeable difference between fastText and word2vec is that fastText splits words into character n-grams. For example, 'Lincolnshire' (a county in the UK) would be split into:

Lin, inc, nco, col, oln, lns, nsh, shi, hir, ire

where n=3. This approach is a significant improvement over word2vec for two reasons:

1. The ability to infer out-of-vocabulary words. For example, the above model would understand that 'Lancashire' (also a county in the UK) is related to 'Lincolnshire' due to the overlap of 'shire' (or 'shi', 'hir' and 'ire') between the two words. A short code sketch of this behaviour appears at the end of Step 1 below.
2. Robustness to spelling mistakes and typos. The same character-level modelling means that fastText can handle spelling variations, which is particularly useful when analysing social media content.

A detailed review of how fastText works can be viewed here.

Show me the code!

The rest of this article walks through a simple example which will train a fastText model on a series of documents, apply TF-IDF to the resulting vectors and use this to perform further analysis.

The documents in question are Modern Slavery statements submitted by companies to explain the steps they are taking to eradicate Modern Slavery, both internally and within their supply chains. The article below shows how this data was cleaned prior to analysis:

Clean your data with unsupervised machine learning
Cleaning data does not have to be painful! This post is a quick example of how to use unsupervised machine learning to…
towardsdatascience.com

If you would like to follow along, a Colab notebook containing all of the code can be found here.

Step 1. Tokenize text and create phrases

We will use spaCy to split each of the documents into a list of words (tokenization). We will also clean the data by removing stop words and punctuation and converting everything to lowercase, using functions from the Gensim library:

# The dataframe is called 'combined'; it has a column called 'text'
# containing the text data for each company.
# This builds a list of documents, each a list of word tokens.
import spacy
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_non_alphanum

nlp = spacy.load('en_core_web_sm')  # assumes the small English model is installed

text = []
for i in combined.text.values:
    doc = nlp(remove_stopwords(strip_punctuation(strip_non_alphanum(str(i).lower()))))
    tokens = [token.text for token in doc]
    text.append(tokens)

We will then stick together common terms. For example, as each of the documents writes about modern slavery, it is useful to combine these two words into the single phrase 'modern_slavery'.
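One way to do this is with Gensim's Phrases model. A minimal sketch, assuming the tokenized documents from above are in 'text' (the min_count and threshold values are illustrative and worth tuning):

from gensim.models.phrases import Phrases, Phraser

# Learn which adjacent word pairs co-occur often enough to be merged
# into a single token joined by an underscore.
bigram = Phraser(Phrases(text, min_count=5, threshold=10))

# Re-tokenize each document, so ['modern', 'slavery'] becomes ['modern_slavery'].
text = [bigram[doc] for doc in text]

After this pass, frequent pairs such as 'supply chain' or 'modern slavery' are treated as single tokens, which gives the downstream fastText model cleaner units to learn from.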
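To see the out-of-vocabulary behaviour described earlier in code, here is a hedged sketch of the training step that follows, using Gensim's FastText implementation. The hyperparameters are illustrative, the parameter names follow Gensim 4.x, and the later steps of this article may use different settings:

from gensim.models import FastText

# Train fastText on the tokenized documents; min_n and max_n control
# the lengths of the character n-grams used to build subword vectors.
model = FastText(sentences=text, vector_size=100, window=5, min_count=3, min_n=3, max_n=6)

# Even if 'lancashire' never appears in the statements, fastText can
# still assemble a vector for it from its character n-grams, so the
# similarity lookup below works where word2vec would raise a KeyError.
print(model.wv.similarity('lincolnshire', 'lancashire'))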
