Text Classification of Freedom of Information Requests: Part I

The Python package fuzzywuzzy, developed by SeatGeek, is perfect for this type of fuzzy string matching.

The package offers four ways to probe the likeness of two strings.

Essentially, ‘fuzz.ratio’ compares entire strings, ‘fuzz.partial_ratio’ searches for the best partial (substring) match, ‘fuzz.token_sort_ratio’ compares strings regardless of word order, and ‘fuzz.token_set_ratio’ extends token sort to handle strings of different lengths.

Each returns a similarity score between 0 and 100 (the higher, the more alike), after which a threshold can be set at an acceptable cutoff.
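To make the differences concrete, here is a minimal sketch comparing the four scorers on a single pair of decision strings; the pair is chosen purely for illustration.

```python
from fuzzywuzzy import fuzz

# Two decision phrases that share words but differ in order and length
a = "No information disclosed"
b = "Information disclosed in part"

print(fuzz.ratio(a, b))             # compares the entire strings
print(fuzz.partial_ratio(a, b))     # scores the best-matching substring
print(fuzz.token_sort_ratio(a, b))  # sorts tokens alphabetically before comparing
print(fuzz.token_set_ratio(a, b))   # compares token sets, ignoring order and extras
```

The substring-based and token-based scorers will generally rate a pair like this higher than the plain ratio, which is exactly the behaviour we probe below.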

Let’s test the performance of each.

```python
from fuzzywuzzy import fuzz

all_df_clean = all_df.copy()

for row in all_df['Decision'].unique():
    for row2 in all_df['Decision'].unique():
        # could be ratio, partial_ratio, token_set_ratio, token_sort_ratio
        matching_result = fuzz.ratio(row, row2)
        if matching_result > 80:
            # combine the two categories if a match is found
            all_df_clean.loc[all_df_clean['Decision'] == row2, 'Decision'] = row

print(all_df_clean['Decision'].value_counts())
print(len(all_df_clean['Decision'].unique()))
```

```
All disclosed                                      240
Partly exempted                                    197
Withdrawn                                          116
No records exist                                    75
Information disclosed in part                       50
Partly non-existent                                 32
No information disclosed                            30
Nothing disclosed                                   28
Forwarded out                                       15
Abandoned                                           13
No responsive records exist                         11
Non-existent                                         3
Correction refused                                   3
Correction made                                      2
Transferred to Region of Waterloo Public Health      2
Statement of disagreement filed                      1
No additional records exist                          1
Transferred                                          1
Correction granted                                   1
Request withdrawn                                    1
Name: Decision, dtype: int64
20
```

Not bad.

We have merged four classes (e.g. ‘No records exist’ and ‘No record exists’), but phrases like ‘No information disclosed’ and ‘Nothing disclosed’ clearly mean the same thing and remain separate. The 80% cutoff was decided by trial and error.
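The cutoff can also be probed more systematically. A minimal sketch, using a hypothetical helper (not in the original code) to count how many distinct classes survive each threshold:

```python
# Hypothetical helper: count distinct classes remaining after merging at a given cutoff
def classes_after_merge(df, scorer, threshold):
    merged = df.copy()
    for row in df['Decision'].unique():
        for row2 in df['Decision'].unique():
            if scorer(row, row2) > threshold:
                merged.loc[merged['Decision'] == row2, 'Decision'] = row
    return merged['Decision'].nunique()

for cutoff in (70, 75, 80, 85, 90):
    print(cutoff, classes_after_merge(all_df, fuzz.ratio, cutoff))
```

Too low a cutoff collapses genuinely different decisions; too high leaves near-duplicates unmerged.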

Let’s try a partial ratio:

```
All disclosed                      240
Partly exempted                    197
Request withdrawn                  117
No responsive records exist         87
No information disclosed            80
Non-existent                        35
Nothing disclosed                   28
Forwarded out                       15
Abandoned                           13
Correction refused                   3
Transferred                          3
Correction made                      2
Statement of disagreement filed      1
Correction granted                   1
Name: Decision, dtype: int64
14
```

Another six classes merged, but the algorithm doesn’t really know what to do with the substring ‘disclosed’.

It has combined ‘Information disclosed in part’ with ‘No information disclosed’, which do not mean the same thing.

Next, the token sort ratio:

```
All disclosed                                      240
Partly exempted                                    197
Withdrawn                                          116
No information disclosed                            80
No records exist                                    75
Partly non-existent                                 32
Nothing disclosed                                   28
Forwarded out                                       15
Abandoned                                           13
No responsive records exist                         11
Correction refused                                   3
Non-existent                                         3
Correction made                                      2
Transferred to Region of Waterloo Public Health      2
Transferred                                          1
No additional records exist                          1
Statement of disagreement filed                      1
Correction granted                                   1
Request withdrawn                                    1
Name: Decision, dtype: int64
19
```

Similar to the basic ratio case.

And finally, the token set ratio:

```
No information disclosed           348
Partly exempted                    197
Request withdrawn                  117
No responsive records exist         87
Non-existent                        35
Forwarded out                       15
Abandoned                           13
Correction refused                   3
Transferred                          3
Correction made                      2
Statement of disagreement filed      1
Correction granted                   1
Name: Decision, dtype: int64
12
```

This is the most aggressive matching case yet, but again we are tripped up by ‘disclosed’: we no longer have the class ‘All disclosed’ at all, which is unacceptable.

The partial ratio case seems to be a compromise — perhaps we can further merge a couple of categories manually.

We will see what happens when the Toronto data is added in Part II.

```python
# Merge the remaining near-duplicate categories by hand
all_df_pr.loc[all_df_pr['Decision'] == "Abandoned", 'Decision'] = "Request withdrawn"
all_df_pr.loc[all_df_pr['Decision'] == "Non-existent", 'Decision'] = "No responsive records exist"
all_df_pr.loc[all_df_pr['Decision'] == "No information disclosed ", 'Decision'] = "Nothing disclosed"
all_df_pr.loc[all_df_pr['Decision'] == "Transferred", 'Decision'] = "Forwarded out"
all_df_pr.loc[all_df_pr['Decision'] == "Correction granted", 'Decision'] = "Correction made"

print(all_df_pr['Decision'].value_counts())
```

```
All disclosed                      240
Partly exempted                    197
Request withdrawn                  130
No responsive records exist        122
Nothing disclosed                  108
Forwarded out                       18
Correction made                      3
Correction refused                   3
Statement of disagreement filed      1
Name: Decision, dtype: int64
```

Everything looks distinct now.

Let’s cut off the classes where there is a big drop-off in numbers:

```python
# Keep only decision classes with more than 20 examples
all_df_over20 = all_df_pr.groupby('Decision').filter(lambda x: len(x) > 20)
```

The remainder of this article will borrow heavily from Susan Li’s excellent article on multi-class text classification.

The expectation is that, given the small size of our dataset and the fact that the text itself is fairly opaque, our results will not be very good.

A quick overview of our data: the classes are not as unbalanced as we might have thought. The ‘category_id’ column is an integer representation of the possible decision classes.
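The construction of ‘category_id’ (and the ‘category_to_id’ lookup used further down) isn’t shown here; a minimal sketch, assuming pandas’ factorize, would be:

```python
# Assumed construction of the integer labels; not shown in the original article
all_df_over20['category_id'] = all_df_over20['Decision'].factorize()[0]
category_id_df = all_df_over20[['Decision', 'category_id']].drop_duplicates()
category_to_id = dict(category_id_df.values)
```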

Now, calculate a tf-idf vector for each request, removing all stop words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        encoding='latin-1', ngram_range=(1, 2),
                        stop_words='english')

features = tfidf.fit_transform(all_df_over20.Request_summary).toarray()
labels = all_df_over20.category_id
features.shape
```

Print the most correlated unigrams and bigrams for each class.

```python
from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Decision, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # note: in scikit-learn >= 1.0 this is tfidf.get_feature_names_out()
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Decision))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
```

```
# 'All disclosed':
  . Most correlated unigrams:
. assistance
. personal
  . Most correlated bigrams:
. information removed
. removed competition
# 'No responsive records exist':
  . Most correlated unigrams:
. site
. phase
  . Most correlated bigrams:
. assessment address
. site assessment
# 'Nothing disclosed':
  . Most correlated unigrams:
. bus
. 2014
  . Most correlated bigrams:
. complete ontario
. works file
# 'Partly exempted':
  . Most correlated unigrams:
. competition
. 94
  . Most correlated bigrams:
. information competition
. rabies control
# 'Request withdrawn':
  . Most correlated unigrams:
. provincial
. evaluation
  . Most correlated bigrams:
. treatment plant
. provincial offences
```

This is what was meant when I said the requests themselves were fairly opaque in terms of being able to classify by eye.

At least there appears to be good separation between the classes.

Try four different models:

```python
import warnings
warnings.filterwarnings(action='once')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]  # class_weight='balanced' was also tried for the applicable models

CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

cv_df.groupby('model_name').accuracy.mean()
```

```
model_name
LinearSVC                 0.362882
LogisticRegression        0.398066
MultinomialNB             0.398152
RandomForestClassifier    0.349038
Name: accuracy, dtype: float64
```

As expected, the performance is exceptionally poor.

Let’s look at the classification report.
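The code behind the report isn’t shown in the article; a minimal sketch, assuming a LinearSVC and a one-third hold-out split (both assumptions), might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Assumptions: neither the model choice nor the split is stated in the article
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=0)

model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Map the integer labels back to decision names for a readable report
id_to_category = {v: k for k, v in category_to_id.items()}
target_names = [id_to_category[i] for i in sorted(id_to_category)]
print(classification_report(y_test, y_pred, target_names=target_names))
```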

```
                             precision    recall  f1-score   support

              All disclosed       0.36      0.69      0.47        67
No responsive records exist       0.50      0.49      0.49        41
            Partly exempted       0.58      0.48      0.53        73
          Request withdrawn       0.38      0.11      0.17        47
          Nothing disclosed       0.58      0.39      0.47        36

                avg / total       0.48      0.45      0.44       264
```

Well, the model is at least trying.

In the follow-up article we’ll attempt to improve this both by significantly adding to the available data and by trying a series of binary classifiers on each class.
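As a rough preview of that second idea, a single binary classifier for one class might look like the sketch below; the class and model chosen here are assumptions for illustration only.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# One-vs-rest for a single class: 'Partly exempted' against everything else
binary_labels = (all_df_over20['Decision'] == 'Partly exempted').astype(int)
f1_scores = cross_val_score(LinearSVC(), features, binary_labels, scoring='f1', cv=5)
print(f1_scores.mean())
```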

Code available on GitHub: https://github.com/scjones5/foi-kw

