My First Usage of Natural Language Processing (NLP) in Industry

The answer was using Natural Language Processing (NLP).

How NLP increased the data available

The business problem has now been defined: there are too many service engineers’ reports (of varying content length and quality) to read and classify.

This is a problem that NLP is actually perfectly designed for.

By collecting a large set of reports that the quality engineers have already processed and expertly labelled, we now have a failure label assigned to each of those reports.

This data can be used to train a machine learning model that can be fed the text from a report and give a prediction of the failure that is being described by the service engineer.

This opens up a huge new capability: more products can be closely monitored, because pre-labelled reports can be given to the quality engineers to spot check, verifying that the machine learning algorithm is still doing its job well.

It also means that the huge backlog of reports, which there are simply not enough resources to process by hand, can now be processed automatically without having to greatly expand the quality and reliability departments.

This is exciting stuff!

The Basic Process of the Model

It is not usually a great idea to just pour raw text into a machine learning model and expect great results.

Usually some cleaning and processing is required first.

These are generally summarised as:

- Spell Checking
- (Optional) Industry Abbreviation Conversion
- Removing Stop Words
- Removing Punctuation
- Stemming or Lemmatisation
- Tokenisation and Conversion to Numbers

To help, I will provide a piece of example text:

“WI on transformer resulting in fure on the line. Obstrction removed, line repaired and fuse reset.”

Note: These steps are not always appropriate for all text-based problems and can in fact make some harder (e.g. if you are trying to predict a missing word), but our problem is to take several sentences of text and have the algorithm sum them up into a failure label. For this kind of problem these steps are acceptable: they improve the accuracy of the resulting model and minimise overfitting to outliers or to specific engineers (e.g. one who always makes the same typos).

Spell Checking

If you’ve ever read your doctor’s prescription notes, you’ll know they are rushed, scrawled, and often riddled with spelling errors.

The same is also true of service engineers, whose end-of-year performance is judged on repairing their daily docket of failed devices, not on writing great literature.

Reports are written as the last thing before the engineer leaves the customer site, and they are done quickly so the engineer can get back on the road.

This leaves a good probability that spelling errors will occur.

Because our machine learning algorithm does not know the English language, if it sees the words “fire” and “fure” it will treat them as separate words (though a human being would know one is a typo).

With too many of these misspellings you lose signal in the text that the algorithm can use, and the model will also start to overfit to edge cases of unique spelling mistakes that rarely occur (which we don’t want).

So running the text through a spell checker is a must: we want text that a human would read the same way to be ingested by our algorithm the same way.

It may also be useful at this point to convert any abbreviations the industry uses (e.g. WI means Wildlife Impact).

People will sometimes use the abbreviation and other times the full wording, so consolidating the text into one form (all abbreviations or all full wording) is useful here to maximise the signal for the algorithm to train on.
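As a rough sketch of how this pass might look in Python, the example below uses the open-source pyspellchecker package together with a toy one-entry abbreviation glossary; both the glossary and the clean_spelling helper are illustrative assumptions, not the exact implementation used in the project:

    # Sketch of a combined spelling-correction and abbreviation pass.
    # The ABBREVIATIONS map is a made-up example; a real glossary would be
    # built with domain experts.
    from spellchecker import SpellChecker  # pip install pyspellchecker

    ABBREVIATIONS = {"wi": "wildlife impact"}
    spell = SpellChecker()

    def clean_spelling(text: str) -> str:
        out = []
        for word in text.split():
            lower = word.lower()
            if lower in ABBREVIATIONS:
                out.append(ABBREVIATIONS[lower])   # consolidate to the full wording
            else:
                out.append(spell.correction(lower) or lower)  # keep word if no suggestion
        return " ".join(out)

    # Caveat: a general-purpose checker may map "fure" to a more common English
    # word than "fire"; in practice the dictionary needs tuning with domain terms.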

The report now reads:

“Wildlife Impact on transformer resulting in fire on the line. Obstruction removed, line repaired and fuse reset.”

Remove Stop Words

The English language is full of words that, while helpful for making readable sentences, provide little information.

In this application these are words like “a”, “the”, etc., which will come out of our bag-of-words step (see later) as highly used words but contain little meaning when trying to get a failure classification.

Standard lists of stop words do exist, but they sometimes need adjusting, as a word that is a stop word in one application may carry meaning in another.
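For illustration, NLTK ships one such standard English list; a minimal sketch of the filtering step (the domain adjustment shown is hypothetical):

    # Stop-word removal using NLTK's standard English list.
    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))
    STOP_WORDS.discard("not")  # hypothetical adjustment: keep "not" if negation matters

    def remove_stop_words(text: str) -> str:
        return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

    print(remove_stop_words("Wildlife Impact on transformer resulting in fire on the line"))
    # -> Wildlife Impact transformer resulting fire line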

The end result is that the sentence now reads:

“Wildlife Impact transformer resulting fire line. Obstruction removed, line repaired fuse reset.”

Note: While it is getting harder for us as humans to read, we are trying to distill the report down to the key words that convey information for the algorithm to learn from.

Remove Punctuation

Punctuation helps us as human beings to read, but it will only confuse the algorithm, which will treat words with different capitalisation (“Worlds” and “worlds” are different to it) or attached punctuation as distinct tokens.

The words are all rendered in lower case and the punctuation removed.

“wildlife impact transformer resulting fire line obstruction removed line repaired fuse reset”

Note: Removing punctuation and stop words both reduce the size of the inputs the machine learning algorithm must accept, cutting the amount of data it needs to process to train, predict, etc., which is a real benefit when you start off with a huge amount of text data to build the model on.
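This step needs nothing beyond the Python standard library; a minimal sketch:

    # Lowercase the text and strip all ASCII punctuation in one pass.
    import string

    def normalise(text: str) -> str:
        return text.lower().translate(str.maketrans("", "", string.punctuation))

    print(normalise("Wildlife Impact transformer resulting fire line. "
                    "Obstruction removed, line repaired fuse reset."))
    # -> wildlife impact transformer resulting fire line obstruction removed line repaired fuse reset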

Stemming or Lemmatisation

In the English language, words can be written in different forms but still possess the same meaning.

For example, “result”, “results” and “resulting”.

We can interpret all of these as meaning the same thing, but the algorithm will treat them as different words and will have to learn that the three are equivalent.

We can make life easier for it by choosing to lemmatise or stem words.

Lemmatisation attempts to return the word to its dictionary form or “lemma” (the exact form dictated by whether it is a verb, noun, etc.), but one of the more common and simpler methods used is stemming.

Stemming aims to chop off the ends of words in a rule-based, systematic way so that different forms of the same word are all reduced to a common base form (not necessarily a form you’d recognise, as you’ll see). The most common algorithms are “Porter” and “Snowball”.

We apply a Snowball stemmer to our sentence and get the result:

“wildlif impact transformer result fire line obstruct remov line repair fuse reset”

You will note that we preserved the word “transformer”, as it holds specific meaning within utilities (it’s a piece of equipment) and we want the algorithm to recognise it as a different word from, say, “transform” or “transforming”.

This is where access to domain-specific experience is very useful: such decisions help preserve precious information in the text that you might otherwise accidentally destroy.
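A sketch of this step with NLTK’s Snowball stemmer; the protected-word set is one simple way to implement the domain exception just described (an assumption for illustration, not necessarily how the original project did it):

    # Snowball stemming with a protected set for domain terms.
    from nltk.stem.snowball import SnowballStemmer

    PROTECTED = {"transformer"}          # domain equipment terms left unstemmed
    stemmer = SnowballStemmer("english")

    def stem_text(text: str) -> str:
        return " ".join(w if w in PROTECTED else stemmer.stem(w) for w in text.split())

    print(stem_text("wildlife impact transformer resulting fire line "
                    "obstruction removed line repaired fuse reset"))
    # -> wildlif impact transformer result fire line obstruct remov line repair fuse reset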

Tokenisation and Conversion to Numbers (Bag of Words)

Now that we have a cleaned and standardised format for the text in the report, we need to convert it into a form that the machine learning algorithm can interpret.

We do this by breaking the sentence up into “tokens” (or word segmenting) so that these can be converted into a format the algorithm can ingest.

To do this we break the words up by spaces and then count up the number of times a particular word appears in the report.

This gives us a bag-of-words model, which ignores grammar and word order but retains word counts.
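For a single report, the counting step is simple enough to do with the standard library:

    # Count how often each token appears in one cleaned report.
    from collections import Counter

    report = "wildlif impact transformer result fire line obstruct remov line repair fuse reset"
    counts = Counter(report.split())
    print(counts["line"])  # 2 - "line" is the only token that appears twice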

This results in a table where each row holds the counts for the text of one report and each column represents one of the unique words seen.

This is done for each of the reports in the training data set, with new columns added as new unique words are found.

It is this resulting table that is fed into the model for training; to make a prediction, new data just needs to be formatted in the same way and the row for that report fed into the model.

For our report example, the bag-of-words table would look like this (one row for the report, one column per unique stemmed word):

    wildlif  impact  transformer  result  fire  line  obstruct  remov  repair  fuse  reset
       1       1          1          1     1     2       1        1      1      1      1

Note: The bag-of-words model is one of the simpler ways of representing text in a form that an algorithm can ingest, but other types exist, such as “term frequency–inverse document frequency” (TF-IDF), which weights the raw counts by how important the words are to a particular document type compared to all the documents.

This would mean that if the word “line” is used heavily across all reports it will be down-weighted greatly, whereas “wildlif” will not be down-weighted, because it only occurs in the wildlife-induced fault reports and not in the remaining reports.
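In practice a library does this bookkeeping across reports. A sketch with scikit-learn, where the second report is invented purely to show how the columns grow and how TF-IDF re-weights common words:

    # Bag-of-words table and a TF-IDF variant via scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    reports = [
        "wildlif impact transformer result fire line obstruct remov line repair fuse reset",
        "storm damag pole line replac",  # hypothetical second cleaned report
    ]

    bow = CountVectorizer()
    table = bow.fit_transform(reports)  # one row per report, one column per unique word
    print(bow.get_feature_names_out())
    print(table.toarray())              # "line" counts: 2 in report 1, 1 in report 2

    tfidf = TfidfVectorizer().fit_transform(reports)  # counts down-weighted for common words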

Training the Model

We are now at the point where we can train our model using the labelled service engineer reports that we’ve processed into a bag-of-words table.

The classical algorithms often used are Support Vector Machines (SVM) or Naive Bayes.
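A hedged sketch of how that training step might look with scikit-learn; the four toy reports and their labels below are invented placeholders for the real labelled data:

    # Bag-of-words features feeding a Naive Bayes classifier, with a hold-back test set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    cleaned_reports = [                      # placeholders for the processed reports
        "wildlif impact transformer result fire line",
        "storm wind line down pole damag",
        "wildlif nest transformer trip fuse",
        "storm tree fell line break",
    ]
    failure_labels = ["wildlife", "storm", "wildlife", "storm"]

    X_train, X_test, y_train, y_test = train_test_split(
        cleaned_reports, failure_labels, test_size=0.25, random_state=42)

    # sklearn.svm.LinearSVC would be the SVM alternative mentioned above.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    print("hold-back accuracy:", model.score(X_test, y_test))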

The trained model can then be tested on the remaining labelled reports (a hold-back test set) to verify it is working correctly. Its performance was deemed very good, so it was run against the large repository of unlabelled service engineer reports.

The Results?

Success! It was found that even this very simple handling of text produced marked results, with the model performing very well on reports that contained enough key words.

It was even possible to go back and break one category apart into three by re-labelling some of the training set and then re-training the model.

This saved a huge amount of time within the business and also allowed a large amount of data to be funnelled into making more accurate business decisions.

The quality engineers could then be re-focused on other more important tasks and were just performing spot checks on the labelling to verify the model was still performing its task adequately.

Suffice to say, the business case was met and people were impressed with what was possible.

The next step was to adapt the model so that it pre-labelled all reports to help the quality engineers who were still processing reports by hand (manual processing was kept in the more business- and safety-critical areas).

Summary

Using only the most classical of approaches to NLP, the business was able to unlock a huge repository of data and lessen the burden on existing resources, freeing them for other tasks (ones that NLP can’t solve).
