Creating a Custom Classifier for Text Cleaning

Machine Learning for Sentence Classification

Rodrigo Nader, Feb 4

Recently I've been studying NLP more than other data science fields, and one challenge that I face more often than not is the cleaning part of the process.

Building NLP models requires many pre-processing steps, and if the data is not properly treated, it can result in poor models, which is exactly what we want to avoid.

In this article, we're going to focus on PDF documents.

The goal here is to open a PDF file, convert it to plain text, understand the need for data cleaning and build a machine learning model for that purpose.

In this post we will:

- Open a PDF file and convert it into a text string
- Split that text into sentences and build a data set
- Manually label that data with user interaction
- Make a classifier to remove unwanted sentences

Some libraries we're going to use:

- pdfminer → read PDF files
- textblob → text processing
- pandas → data analysis

PDF Reader

As always, I'll try to explain the code used along the text, so feel free to skip the snippets if you'd like.

Let's start by importing some modules:

from collections import Counter
from IPython.display import clear_output
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from textblob import TextBlob
import io
import math
import numpy as np
import pandas as pd
import string

We are going to use pdfminer to build our PDF reader:

def read_pdf(path):
    # set up the pdfminer objects that collect the extracted text into a string buffer
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    # collapse non-breaking spaces and repeated whitespace into single spaces
    text = " ".join(text.replace(u"\xa0", " ").strip().split())
    fp.close()
    device.close()
    retstr.close()
    return text

Although this function seems long, it's just reading a PDF file and returning its text as a string.

We'll apply it to a paper called "A Hands-on Guide to Google Data". By just looking at the first page, we quickly see that an article contains much more than simple sentences: elements like dates, line counts, page numbers, titles and subtitles, section separators, equations, and so on.

Let's check how those properties come out when the paper is converted to plain text (primer.pdf is the name of the file, stored locally on my computer):

read_pdf('primer.pdf')

It's clear that we lost all of the text structure. Line counts and page numbers are scattered as if they were part of sentences, while titles and references can't be clearly distinguished from the text body. There are probably many ways to preserve the text structure while reading a PDF, but let's keep it messy for the sake of explanation (since this is very often how raw text data looks).

Text Cleaning

A full cleaning pipeline has many steps, and to become familiar with them I suggest following some tutorials (this and this are great starting points). In general terms, the cleaning chain would include:

- Tokenization
- Normalization
- Entity extraction
- Spelling and grammar correction
- Removing punctuation
- Removing special characters
- Word stemming

Our goal here isn't to replace any of those stages, but instead to build a more general tool to remove whatever is unwanted for us; a small sketch of a few of the standard steps above follows. Take it as a complementary step to help in the middle of the pipeline.
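Assuming TextBlob from the imports above, a few of those steps look roughly like this (purely illustrative, not the classifier we're about to build):

from textblob import TextBlob

def basic_clean(text):
    blob = TextBlob(text.lower())       # normalization: lowercase everything
    words = blob.words                  # tokenization (punctuation is dropped here)
    stems = [w.stem() for w in words]   # word stemming via TextBlob's Porter stemmer
    return ' '.join(stems)

basic_clean("Building NLP models requires many pre-processing steps!")
# output is roughly 'build nlp model requir mani pre-process step'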

Let's suppose we want to get rid of any sentence that doesn't look human-written. The idea is to classify those sentences as "unwanted" or "weird" and consider the remaining sentences "normal".

For example:

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 related.

Or:

51 52 53 54 55 # read data from correlate and make it a zoo time series dat <- read.csv("Data/econ-HSN1FNSA.csv") y <- zoo(dat[,2],as.

Those sentences are clearly messed up by the text conversion, and if we're building, say, a PDF summarizer, they shouldn't be included.

To remove them, we could manually analyze the text, figure out some patterns and apply regular expressions.
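For instance, a hand-written rule might flag sentences that are mostly digits or contain very few letters (the thresholds below are just illustrative guesses, not tuned values):

import re

def looks_weird(sentence, min_alpha_ratio=0.7):
    # flag sentences dominated by numeric tokens or with few alphabetic characters
    tokens = sentence.split()
    if not tokens:
        return True
    numeric_tokens = sum(bool(re.fullmatch(r'\d+', t)) for t in tokens)
    alpha_ratio = sum(c.isalpha() for c in sentence) / len(sentence)
    return numeric_tokens / len(tokens) > 0.5 or alpha_ratio < min_alpha_ratio

looks_weird('32 33 34 35 36 37 38 39 40 related.')          # True
looks_weird('In this article, we focus on PDF documents.')  # False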

But, in some cases, it might be better to build a model that finds those patterns for us.

This is what we're doing here.

We'll create a classifier to recognize weird sentences so that we can easily remove them from the text body.
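Just so the end goal is concrete before we dive in, the final cleanup amounts to a single filter like this sketch (predict_sentence is the helper we'll build at the end of the post):

blob = TextBlob(read_pdf('primer.pdf'))
clean_text = ' '.join(str(s) for s in blob.sentences
                      if predict_sentence(str(s)) == 'NORMAL')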

Building The Data Set

Let's build a function to open the PDF file, split the text into sentences and save them into a data frame with columns label and sentence:

def pdf_to_df(path):
    content = read_pdf(path)
    blob = TextBlob(content)
    sentences = blob.sentences
    df = pd.DataFrame({'sentence': sentences, 'label': np.nan})
    df['sentence'] = df.sentence.apply(''.join)
    return df

df = pdf_to_df('primer.pdf')
df.head()

Since we don't have the data labeled (as "weird" or "normal"), we're going to do it manually to fill our label column. This data set will be updatable, so that we can attach new documents to it and label their sentences.

Let's first save the unlabelled dataset into a .pickle file:

df.to_pickle('weird_sentences.pickle')

Now we'll create a user interaction function to manually classify the data points.

For each sentence in the dataset, we'll display a text box for the user to type '1' or nothing. If the user types '1', the sentence is classified as "weird". I'm using a Jupyter Notebook, so I call the clear_output() function from IPython.display to improve the interaction.

def manually_label(pickle_file):
    print('Is this sentence weird? Type 1 if yes!')
    df = pd.read_pickle(pickle_file)
    for index, row in df.iterrows():
        if pd.isnull(row.label):
            print(row.sentence)
            label = input()
            if label == '1':
                df.loc[index, 'label'] = 1
            if label == '':
                df.loc[index, 'label'] = 0
            clear_output()
            df.to_pickle('weird_sentences.pickle')
    print('No more labels to classify!')

manually_label('weird_sentences.pickle')

This is how the output looks for each sentence. Since the first sentence looks pretty normal, I won't type '1', but simply press enter and move on to the next one.

This process repeats until the dataset is fully labeled or until you interrupt it. Every user input is saved to the pickle file, so the dataset is updated after each sentence. This easy interaction made it relatively fast to label the data: it took me 20 minutes to label about 500 data points.

Two other functions keep things simple: one to append another PDF file to our dataset, and another to reset all the labels (setting the label column back to np.nan).

def append_pdf(pdf_path, df_pickle):
    new_data = pdf_to_df(pdf_path)
    df = pd.read_pickle(df_pickle)
    df = df.append(new_data)
    df = df.reset_index(drop=True)
    df.to_pickle(df_pickle)

def reset_labels(df_pickle):
    df = pd.read_pickle(df_pickle)
    df['label'] = np.nan
    df.to_pickle(df_pickle)

As we ended up with more "normal" than "weird" sentences, I built a function to undersample the dataset; otherwise, some machine learning algorithms wouldn't perform well:

def undersample(df, target_col, r=1):
    falses = df[target_col].value_counts()[0]
    trues = df[target_col].value_counts()[1]
    relation = float(trues) / float(falses)
    if trues >= r * falses:
        df_drop = df[df[target_col] == True]
        drop_size = int(math.fabs(int((relation - r) * falses)))
    else:
        df_drop = df[df[target_col] == False]
        drop_size = int(math.fabs(int((r - relation) * falses)))
    df_drop = df_drop.sample(drop_size)
    df = df.drop(labels=df_drop.index, axis=0)
    return df

df = pd.read_pickle('weird_sentences.pickle').dropna()
df = undersample(df, 'label')
df.label.value_counts()

645 labeled data points.

Not enough to make a decent model, but we'll use it as a playground example.

Text Transformation

Now we need to transform the sentences into something the algorithm can understand. One way of doing that is counting the occurrences of each character inside the sentence. That would be something like a bag-of-words technique, but at the character level.
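As a quick illustration of the idea (using the Counter we imported earlier), a single sentence becomes a mapping from characters to counts:

Counter('We focus on PDF documents.')
# roughly Counter({' ': 4, 'o': 3, 'c': 2, 'e': 2, ...}) -- one feature per character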

def bag_of_chars(df, text_col):
    df['char_list'] = df[text_col].apply(list)
    df['char_counts'] = df.char_list.apply(Counter)
    for index, row in df.iterrows():
        for c in row.char_counts:
            df.loc[index, c] = row.char_counts[c]
    df = df.fillna(0).drop(['sentence', 'char_list', 'char_counts'], 1)
    return df

data = bag_of_chars(df, 'sentence')
data.head()
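As a side note, scikit-learn can build the same character-level counts without the explicit loops above; this is just a sketch of that alternative, not the code used for the results below:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', lowercase=False)
char_counts = vectorizer.fit_transform(df['sentence'])  # sparse matrix, one column per character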

Machine Learning Model

Perfect! Now we're just left with a usual machine learning challenge.

Many features and one target in a classification problem.

Let's split the data into train and test sets:

data = data.sample(len(data)).reset_index(drop=True)
train_data = data.iloc[:400]
test_data = data.iloc[400:]

x_train = train_data.drop('label', 1)
y_train = train_data['label']
x_test = test_data.drop('label', 1)
y_test = test_data['label']

We're ready to choose an algorithm and check its performance.

Here I'm using a Logistic Regression just to see what we can achieve:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(x_train, y_train)
accuracy_score(y_test, lr.predict(x_test))

86% accuracy.

That's pretty good for a tiny dataset, a shallow model and a bag-of-chars approach.
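With only ~645 labeled points, a single train/test split is also quite noisy; cross-validation (sketched here with scikit-learn's cross_val_score) gives a steadier estimate of the same number:

from sklearn.model_selection import cross_val_score

x_all = data.drop('label', 1)
y_all = data['label']
scores = cross_val_score(LogisticRegression(), x_all, y_all, cv=5)
print(scores.mean(), scores.std())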

The only problem is that although we split into training and testing, we are evaluating the model with the same document that we trained on.

A more appropriate approach would be using a new document as the test set.
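A sketch of what that would look like ('another_paper.pdf' is a hypothetical file, and its sentences would still need to be hand-labeled before computing an accuracy):

unseen = pdf_to_df('another_paper.pdf')
unseen_features = bag_of_chars(unseen, 'sentence').drop('label', 1)
# align the columns with the training matrix; characters never seen in training get 0
unseen_features = unseen_features.reindex(columns=x_train.columns, fill_value=0)
predictions = lr.predict(unseen_features)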

Let's make a function that enables us to predict any custom sentence:

def predict_sentence(sentence):
    sample_test = pd.DataFrame({'label': np.nan, 'sentence': sentence}, [0])
    for col in x_train.columns:
        sample_test[str(col)] = 0
    sample_test = bag_of_chars(sample_test, 'sentence')
    sample_test = sample_test.drop('label', 1)
    pred = lr.predict(sample_test)[0]
    if pred == 1:
        return 'WEIRD'
    else:
        return 'NORMAL'

Normal sentence: "We just built a cool machine learning model"

normal_sentence = 'We just built a cool machine learning model'
predict_sentence(normal_sentence)

Weird sentence: "jdaij oadao //// fiajoaa32 32 5555"

weird_sentence = 'jdaij oadao //// fiajoaa32 32 5555'
predict_sentence(weird_sentence)

And our model scores! Unfortunately, when I tried more sentences, it performed poorly on some of them.

The bag-of-words (in this case, bag-of-chars) method probably isn't the best option, the algorithm itself could be highly improved, and we should label many more data points for the model to become reliable.

The point here is that you could use this same approach to perform a lot of different tasks, e.g. recognizing specific elements (links, dates, names, topics, titles, equations, references, and more).

Used the right way, text classification can be a powerful tool to help in the cleaning process, and it shouldn't be overlooked.

Good cleaning! Thank you for reading all the way to the end. This was an article focused on text classification to handle cleaning problems. Please follow my profile for more on data science, and feel free to leave me any comments or concerns. See you in the next post!
