ELMo: Contextual language embedding

# Assumes 'text' holds the raw document as a single string, and spaCy is used
# for sentence tokenization (the specific model name here is illustrative).
import spacy
nlp = spacy.load('en_core_web_sm')

text = text.replace('\n', ' ').replace('\t', ' ').replace('\xa0', ' ')  # get rid of problem chars
text = ' '.join(text.split())  # a quick way of removing excess whitespace

doc = nlp(text)
sentences = []
for i in doc.sents:
    if len(i) > 1:
        sentences.append(i.text.strip())  # tokenize into sentences
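As a quick sanity check (an optional addition of my own, not part of the original walkthrough), you can eyeball the tokenization before moving on:

print(len(sentences))  # how many sentences were extracted
print(sentences[:3])   # the first few, to confirm the cleaning worked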

2. Get the ELMo model using TensorFlow Hub

If you have not yet come across TensorFlow Hub, it is a massive time-saver, serving up a large number of pre-trained models for use in TensorFlow. Luckily for us, one of these models is ELMo. We can load a fully trained model in just two lines of code. How satisfying…

import tensorflow as tf
import tensorflow_hub as hub

url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)

To then use this model in anger, we just need a few more lines of code to point it at our text document and create sentence vectors:

# This tells the model to run through the 'sentences' list and return
# the default output (1024-dimension sentence vectors).
embeddings = embed(
    sentences,
    signature="default",
    as_dict=True)["default"]

# Start a session and run ELMo to return the embeddings in variable x
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    x = sess.run(embeddings)
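At this point (an optional check of my own), x should be a NumPy array holding one 1,024-dimension vector per sentence:

print(x.shape)  # expected: (number_of_sentences, 1024)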

3. Use visualisation to sense-check outputs

It is amazing how often visualisation is overlooked as a way of gaining greater understanding of data.

Pictures speak a thousand words and we are going to create a chart of a thousand words to prove this point (actually it is 8,511 words).

Here we will use PCA and t-SNE to reduce the 1,024 dimensions which are output from ELMo down to 2 so that we can review the outputs from the model.

I have included further reading on how this is achieved at the end of the article if you want to find out more.

from sklearn.decomposition import PCA
pca = PCA(n_components=50)  # reduce down to 50 dimensions
y = pca.fit_transform(x)

from sklearn.manifold import TSNE
y = TSNE(n_components=2).fit_transform(y)  # further reduce to 2 dimensions using t-SNE
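If you are curious how much of the original signal survives the PCA step (an optional aside of my own, using scikit-learn's explained_variance_ratio_), you can sum the variance retained by the 50 components:

print(pca.explained_variance_ratio_.sum())  # fraction of total variance kept by the 50 components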

Using the amazing Plotly library, we can create a beautiful, interactive plot in no time at all. The below code shows how to render the results of our dimensionality reduction and join this back up to the sentence text. Colour has also been added based on sentence length. As we are using Colab, the last line of code downloads the HTML file.

This can be found below: Sentence encode, an interactive sentence embedding (hosted on drive.google.com).

The code to create this is below:

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

data = [go.Scatter(
    x=[i[0] for i in y],
    y=[i[1] for i in y],
    mode='markers',
    text=[i for i in sentences],
    marker=dict(
        size=16,
        color=[len(i) for i in sentences],  # set colour equal to sentence length
        opacity=0.8,
        colorscale='Viridis',
        showscale=False
    )
)]

layout = dict(yaxis=dict(zeroline=False),
              xaxis=dict(zeroline=False))

fig = go.Figure(data=data, layout=layout)
file = plot(fig, filename='Sentence encode.html')

from google.colab import files
files.download('Sentence encode.html')

Exploring this visualisation, we can see ELMo has done sterling work in grouping sentences by their semantic similarity.

In fact, it is quite incredible how effective the model is:

[Image: Download the HTML for yourself (link above) to see ELMo in action]

4. Create a semantic search engine

Now that we are confident that our language model is working well, let's put it to work in a semantic search engine.

The idea is that this will allow us to search through the text not by keywords but by semantic closeness to our search query.

This is actually really simple to implement (a minimal sketch follows the list below):

- First, we take a search query and run ELMo over it;
- We then use cosine similarity to compare this against the vectors in our text document;
- We can then return the 'n' closest matches to the search query from the document.
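Here is a minimal sketch of that matching step. It assumes x holds the 1,024-dimension sentence vectors computed earlier and that the search query has been run through ELMo in exactly the same way; the helper name top_n_matches and the use of scikit-learn's cosine_similarity are my own illustration rather than code from the original walkthrough:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def top_n_matches(search_vector, sentence_vectors, sentences, n=5):
    # Cosine similarity between the query vector and every sentence vector
    scores = cosine_similarity([search_vector], sentence_vectors)[0]
    best = np.argsort(scores)[::-1][:n]  # indices of the n most similar sentences
    return [(sentences[i], float(scores[i])) for i in best]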

Google Colab has some great features to create form inputs which are perfect for this use case.

For example, creating an input is as simple as adding #@param after a variable.

The below shows this for a string input:

search_string = "example text" #@param {type:"string"}

In addition to using Colab form inputs, I have used IPython.display.HTML to beautify the output text, along with some basic string matching to highlight common words between the search query and the results.

Let's put it to the test. Let us see what ASOS are doing with regards to a code of ethics in their Modern Slavery return:

[Image: A fully interactive semantic search engine in just a few minutes!]

This is magical! The matches go beyond keywords: the search engine clearly knows that 'ethics' and 'ethical' are closely related. We find hits for both a code of integrity and also ethical standards and policies, both relevant to our search query but not directly linked based on keywords.

I hope you enjoyed the post.

Please do leave comments if you have any questions or suggestions.

Further reading:

Below are my other posts in what is now becoming a mini-series on NLP and exploration of companies' Modern Slavery returns:

- Clean your data with unsupervised machine learning (towardsdatascience.com): "Cleaning data does not have to be painful. This post is a quick example of how to use unsupervised machine learning to…"
- Supercharging word vectors (towardsdatascience.com): "A simple technique to boost fastText and other word vectors in your NLP projects"

To find out more on the dimensionality reduction process used, I recommend the below post:

- Visualising high-dimensional datasets using PCA and t-SNE in Python (medium.com): "The first step around any data related challenge is to start by exploring the data itself. This could be by looking at…"

Finally, for more information on state-of-the-art language models, the below is a good read:

- http://jalammar.github.io/illustrated-bert/
