When Data is Scarce… Ways to Extract Valuable Insights

Descriptive statistics, Exploratory Data Analysis, and Natural Language Processing (NLP) techniques to understand your data.

Bety Rodriguez-Milla · Apr 18

Recently I came across the Region of Waterloo’s Open Data project and its Freedom of Information Requests data set.

My colleague Scott Jones has already analyzed it using Machine Learning (ML) techniques in a series of posts.

ML did poorly because the data is scant.

While Scott did what one should do in this type of situation, which is to find more data, I was curious about what else this data, albeit scarce, could tell me.

After all, data always has value.

Before I take you on this 8-minute read journey, I should let you know that the Jupyter notebook on GitHub has all the code and many more insights into this data, not all of which can be covered here.

If you don’t feel like reading the notebook, the full set of graphic results can be found in this file.

Here I present a few highlights of the analysis.

Getting to know the data

We use the pandas library for this, and here is what one of the files found in Open Data looks like:

[Sample of the 1999 Freedom of Information Request file]

We have 18 files, one for each year from 1999 to 2016, 576 requests in total, and amazingly all with the same six columns.

We will work only with the three main columns, Source, Summary_of_Request, and Decision.
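For reference, here is a minimal sketch of how the 18 yearly files might be loaded and combined with pandas; the file names and path used below are assumptions, not the actual Open Data names:

    import glob
    import pandas as pd

    # Assumed layout: one CSV per year, e.g. data/foi_requests_1999.csv ... data/foi_requests_2016.csv
    files = sorted(glob.glob('data/foi_requests_*.csv'))

    # Read each yearly file and stack them into a single dataframe
    df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

    # Keep only the three columns used in the analysis
    df = df[['Source', 'Summary_of_Request', 'Decision']]
    print(df.shape)  # expected: (576, 3)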

Source. This is the entity making the request, a.k.a. the requester. By looking at the information over the years, I was able to merge the classes into ‘Business’, ‘Individual’, ‘Individual by Agent’, ‘Media’, ‘Business by Agent’, and ‘Individual for dependant’.
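As an illustration, such a merge can be done with a simple mapping in pandas; the raw label variants below are hypothetical, only the six merged classes come from the data:

    # Hypothetical raw spellings mapped onto the merged classes
    source_map = {
        'Business (by agent)': 'Business by Agent',
        'Agent for Business': 'Business by Agent',
        'Individual (agent)': 'Individual by Agent',
        'Individual - dependant': 'Individual for dependant',
    }

    df['Source'] = df['Source'].replace(source_map)
    print(df['Source'].value_counts())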

Summary_of_Request. Contains the request, which is already edited by a clerk.

Decision. Merged classes are: ‘All information disclosed’, ‘Information disclosed in part’, ‘No records exist’, ‘Request withdrawn’, ‘Partly non-existent’, ‘No information disclosed’, ‘Transferred’, ‘Abandoned’, ‘Correction refused’, ‘Correction granted’, and ‘No additional records exist’.

How are these columns related?

Descriptive Statistics and Exploratory Data Analysis

In this section, we will focus on the columns Source and Decision.

We will analyze the requests with some NLP tools later on.

Here is how the data is distributed:

About 60% of the requests are either ‘All information disclosed’ or ‘Information disclosed in part’.

There are at least seven types of decisions with fewer than 25 instances, including one of the most important ones, ‘No information disclosed’.

Therefore, not only do we have a limited amount of data, we also have imbalanced classes.

This does not look great for ML.

With another view of the data, Source vs. Decision, we see that most of the requests are made by ‘Business’, ‘Individual’, and ‘Individual by Agent’.

Normalizing those numbers for each source, i.e., so that each row adds to 1, we see that the three main sources fare well: ‘All information disclosed’ is above 30% for each, and ‘Information disclosed in part’ adds another 18% to 34%, putting them above 50%.

Also, ‘Individual by Agent’ has a higher success rate than ‘Individual’.

‘Media’, having few requests, does not do well: only 10% of its requests got ‘All information disclosed’.
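A sketch of how the Source vs. Decision view and its row-normalized version can be computed with pandas crosstabs, assuming the combined dataframe df from above:

    # Raw counts of decisions per source
    source_vs_decision = pd.crosstab(df['Source'], df['Decision'])

    # Row-normalized: each row adds to 1, i.e. the decision mix per source
    source_vs_decision_norm = pd.crosstab(df['Source'], df['Decision'], normalize='index')

    print(source_vs_decision_norm.round(2))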

Natural Language Processing (NLP)

Now we proceed to analyze the actual ‘Summary_of_Request’ text.

For this, we turn to Natural Language Processing libraries, such as NLTK and spaCy, and the help of scikit-learn.

Broadly generalizing, there are a few steps one needs to do before analyzing any text (see Susan Li’s post):

* Tokenize the text: break the text into single entities/words, i.e., tokens.
* Remove any unwanted characters, such as returns and punctuation (‘-’, ‘…’, ‘”’).
* Remove URLs or replace them with a word, e.g., “URL”.
* Remove screen names or replace the ‘@’ with a word, e.g., “screen_name”.
* Remove capitalization of words.
* Remove words with n or fewer characters. In this case, n = 3.
* Remove stop words, i.e., words with little meaning in a language. These words probably won’t help classify our text. Examples are words such as ‘a’, ‘the’, and ‘and’. There is no single universal list of stop words.
* Lemmatize, which is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.

So, after writing single-purpose functions, we can transform the text with:

    def prepare_text_tlc(the_text):
        text = clean_text(the_text)
        text = parse_text(text)
        tokens = tokenize(text)
        tokens = replace_urls(tokens)
        tokens = replace_screen_names(tokens)
        tokens = lemmatize_tokens(tokens)
        tokens = remove_short_strings(tokens, 3)
        tokens = remove_stop_words(tokens)
        tokens = remove_symbols(tokens)
        return tokens

And since we will be working with this text constantly, we just add this pre-processed text to the dataframe as a new column, ‘Edited_Summary’.
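For instance, assuming the dataframe is called df (as in the loading sketch above), the new column can be added with a one-liner like:

    # Tokenize and clean every request, storing the token lists as a new column
    df['Edited_Summary'] = df['Summary_of_Request'].apply(prepare_text_tlc)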

N-grams and WordCloud

How else can we analyze and visualize our text? As a first step, we can find which words and phrases are used the most, i.e., we can get unigrams (single tokens) and, in general, n-grams (groups of n tokens) and their frequencies in the text.

    from collections import Counter  # tallies n-gram frequencies

    def display_top_grams(gram, gram_length, num_grams):
        gram_counter = Counter(gram)
        if gram_length == 1:
            name = 'unigrams'
        elif gram_length == 2:
            name = 'bigrams'
        elif gram_length == 3:
            name = 'trigrams'
        else:
            name = str(gram_length) + '-grams'
        print("No. of unique {0}: {1}".format(name, len(gram_counter)))
        for grams in gram_counter.most_common(num_grams):
            print(grams)
        return None

So for our unigrams, and using WordCloud:

So why is the word ‘remove’ so prominent? As it turns out, for privacy reasons, all names, dates, and locations written on the original request have been removed and replaced in the Open Data files with phrases such as ‘{location removed}’ or ‘{date removed}’.

There are more than 30 variations of this.

Using regular expressions (regex) to clean the text, we arrive at a better word cloud.

This time, we will allow bigrams, too.
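Here is a rough sketch of that step; the exact regular expression and WordCloud settings in the notebook may differ, and full_text is assumed to be all the request summaries joined into one string:

    import re
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    def remove_redaction_phrases(text):
        # Drop placeholders such as '{location removed}' or '{date removed}'
        return re.sub(r'\{[^}]*removed[^}]*\}', ' ', text, flags=re.IGNORECASE)

    cleaned_text = remove_redaction_phrases(full_text)

    # Word cloud over unigrams and bigrams (collocations) of the cleaned text
    wordcloud = WordCloud(width=800, height=400, background_color='white',
                          collocations=True).generate(cleaned_text)

    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()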

Looking at the word cloud above and the trigrams, we see that there are common phrases, such as ‘ontario works’, ‘environmental site’, ‘grand river transit’, ‘rabies control’, ‘public health inspection’, and ‘food bear illness’ (as in ‘food borne illness’; remember we lemmatized our tokens).

So, how common are these phrases in our text? And would requesting information with such phrases dictate the chance of having the request approved? As it turns out, 46% of our data are those types of requests, none of these phrases got a single ‘No information disclosed’ decision, and there are clear trends. For example, ‘rabies control’ got about 95% of cases with all or part of the information disclosed, while 5% of the cases were transferred.
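A sketch of how one can check the decision mix for requests containing a given phrase; the column names are the ones used above, but the helper itself is an assumption, not the notebook's code:

    def decision_mix(df, phrase):
        # Select requests whose tokenized summary contains the phrase
        mask = df['Edited_Summary'].apply(lambda tokens: phrase in ' '.join(tokens))
        subset = df[mask]
        print("'{0}': {1} requests".format(phrase, len(subset)))
        # Share of each decision among those requests
        return subset['Decision'].value_counts(normalize=True).round(2)

    print(decision_mix(df, 'rabies control'))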

Summary_of_Request and Edited_Summary Statistics

We already know we have a limited amount of data, but how limited is it? Well, there are only 7 requests with more than 100 words in the full text, and only 1 in the tokenized text.

Full text averages 21 words per request, although the median is 15, while the tokenized text averages 9 words with a median of 7.
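These figures come from simple per-request word counts, roughly like this (assuming df and the Edited_Summary token lists from above):

    # Words per request in the raw and in the tokenized text
    full_counts = df['Summary_of_Request'].str.split().str.len()
    token_counts = df['Edited_Summary'].apply(len)

    print('Full text - mean: {0:.0f}, median: {1:.0f}'.format(full_counts.mean(), full_counts.median()))
    print('Tokenized - mean: {0:.0f}, median: {1:.0f}'.format(token_counts.mean(), token_counts.median()))
    print('Requests over 100 words (full / tokenized): {0} / {1}'.format(
        (full_counts > 100).sum(), (token_counts > 100).sum()))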

Part-of-Speech (POS) Tagging

Here we use spaCy to identify how our text is composed of nouns, verbs, adjectives, and so on. We also use the function spacy.explain() to find out what those tags mean.

    full_text_nlp = nlp(full_text)  # spaCy nlp()

    tags = []
    for token in full_text_nlp:
        tags.append(token.tag_)

    tags_df = pd.DataFrame(data=tags, columns=['Tags'])
    print("Number of unique tag values: {0}".format(tags_df['Tags'].nunique()))
    print("Total number of words: {0}".format(len(tags_df['Tags'])))

    # Make a dataframe out of unique values
    tags_value_counts = tags_df['Tags'].value_counts(dropna=True, sort=True)
    tags_value_counts_df = tags_value_counts.rename_axis('Unique_Values').reset_index(name='Counts')

    # And normalizing the count values
    tags_value_counts_df['Normalized_Count'] = tags_value_counts_df['Counts'] / len(tags_df['Tags'])

    uv_decoded = []
    for val in tags_value_counts_df['Unique_Values']:
        uv_decoded.append(spacy.explain(val))
    tags_value_counts_df['Decoded'] = uv_decoded

    tags_value_counts_df.head(10)

And merging categories, such as ‘noun, singular or mass’ and ‘noun, plural’, to make a generalized version, here is how our requests are composed:

Topic Modeling using scikit-learn, Bokeh, and t-SNE

In the notebook, we use different techniques for topic modeling, including scikit-learn’s functions for Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), comparing both CountVectorizer() and TfidfVectorizer(), gensim with LDA, t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction, and Bokeh and pyLDAvis for visualization.

We won’t include the full code here, and we do encourage you to take a look at the notebook.
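For orientation only, here is a minimal sketch of one such pipeline (CountVectorizer feeding scikit-learn's LDA); the actual notebook uses more elaborate settings:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Edited_Summary holds token lists, so join them back into strings
    docs = df['Edited_Summary'].apply(' '.join)

    vectorizer = CountVectorizer(max_df=0.95, min_df=2)
    doc_term_matrix = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=8, random_state=0)
    lda.fit(doc_term_matrix)

    # Print the top words of each topic
    terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
    for idx, topic in enumerate(lda.components_):
        top_terms = [terms[i] for i in topic.argsort()[-8:][::-1]]
        print('Topic {0}: {1}'.format(idx, ', '.join(top_terms)))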

All the tools did a decent job given the limitations of our data.

Here is a highlight: pretty much all the most-frequent phrases are represented in the topics.

And as expected, some topics are clear, such as ‘ontario works’ or ‘environmental site’, while other clusters are not so defined.

Machine Learning

We already know ML won’t work well, but, given that this is a learning exercise, we go ahead anyway.

In the notebook, we compare eight different ML models for three different cases.

We can’t use the full data as is, since there are classes with very few instances.

For example, only one request got ‘Correction granted’, so when we are training our model, that case would exclusively be either in the training set or in the test set.

And having only one case won’t exactly provide a good foundation.

We have a few options:

1. We can drop requests whose decision has fewer than, say, 15 instances; call it ‘Over-15’.

2. We can bin our full set of decisions into three basic categories:
   * All information disclosed (plus ‘Correction granted’)
   * Information disclosed in part (plus ‘Partly non-existent’)
   * No information disclosed (plus ‘Transferred’, ‘No records exist’, ‘Correction refused’, ‘No additional records exist’, ‘Withdrawn’, and ‘Abandoned’)
   This, in turn, makes our set balanced.

3. We can drop requests whose decision has fewer than 15 instances and also drop decisions where no actual decision was made, i.e., cases that were withdrawn or abandoned; call it ‘Independent’.

And here are the results. Overall, both Logistic Regression and the Multinomial Naive Bayes classifier, combined with tf-idf, gave the better results, while binning our classes seems to be the most logical approach.
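As a rough illustration of that comparison (not the notebook's exact setup), binning the decisions and scoring a tf-idf pipeline might look like this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Bin decisions into the three basic categories described above
    bins = {
        'All information disclosed': 'all_disclosed',
        'Correction granted': 'all_disclosed',
        'Information disclosed in part': 'part_disclosed',
        'Partly non-existent': 'part_disclosed',
        # everything else falls into the 'not_disclosed' bucket
    }
    y = df['Decision'].map(lambda d: bins.get(d, 'not_disclosed'))
    X = df['Edited_Summary'].apply(' '.join)

    for clf in (LogisticRegression(max_iter=1000), MultinomialNB()):
        model = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(model, X, y, cv=5)
        print('{0}: {1:.2f} accuracy'.format(clf.__class__.__name__, scores.mean()))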

The code and full set of results can be found here.
