Text Classification Every Engineer should be able to Build from Scratch — Naive Bayes

Or, what is the relationship between each word and the review’s sentiment? NB is simply a systematic way of making predictions according to those two questions.

Naive Bayes

The NB classifier is based on Bayes’ Theorem in probability and statistics.

The theorem describes the probability of event ‘A given B’ in terms of the probability of event ‘B given A’ and the probabilities of A and B individually.

P(A|B) = P(B|A) * P(A) / P(B)

Instead of A and B, in our classification we are interested in the probability that the review is positive when the review contains the words W1, W2, …, Wn, or P(POS|W1,W2,…,Wn).

P(POS|W1,W2,…,Wn) = P(W1,W2,…,Wn|POS) * P(POS) / P(W1,W2,…,Wn)

To break down the problem further, we need to make a ‘naive’ assumption about the input.

In ‘Naive’ Bayes, we assume the words (W1, W2, …, Wn) occur independently of each other.

With that assumption, we can simply break the probability of the words into individual components, e.g.

P(W1, W2, …, Wn) = P(W1) * P(W2) * … * P(Wn)

Thus, the probability that the input is positive can be calculated as

P(POS|W1,W2,…,Wn) = P(W1|POS) * P(W2|POS) * … * P(Wn|POS) * P(POS) / P(W1,W2,…,Wn)

At the same time, the probability for the review to be negative will be

P(NEG|W1,W2,…,Wn) = P(W1|NEG) * P(W2|NEG) * … * P(Wn|NEG) * P(NEG) / P(W1,W2,…,Wn)

In binary classification, to predict if the input is a positive or negative review, we simply have to compare the two probabilities.

Since the denominator P(W1,W2,…,Wn) is the same in both expressions, it cancels out when we take the ratio of the two:

Y = P(POS) * P(W1|POS) * P(W2|POS) * … * P(Wn|POS) / ( P(NEG) * P(W1|NEG) * P(W2|NEG) * … * P(Wn|NEG) )

if Y > 1: the review is more likely to be positive
if Y < 1: the review is more likely to be negative

Finally, we need to assign a value to each probability component: the probability that word Wi occurs in a positive or negative review, P(Wi|POS) or P(Wi|NEG), and the probability that a review is positive or negative (the base probability), P(POS) or P(NEG).

There are actually multiple ways of calculating those probabilities (see the different variants of NB), but for this example, we will use straightforward counting.

P(POS) = Number of positive samples / Number of all samples
P(NEG) = Number of negative samples / Number of all samples
P(Wi|POS) = Number of word Wi in positive samples / Number of word Wi in all samples
P(Wi|NEG) = Number of word Wi in negative samples / Number of word Wi in all samples

Implementation

Here is a straightforward implementation in Java.
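The block below is only a minimal sketch of such a classifier (the class name, tokenization, and method signatures are assumptions, not the article’s original code); it deliberately keeps the zero-count handling for unseen words that is discussed below.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesClassifier {

    // Word counts in positive and negative samples (two dictionaries).
    private final Map<String, Integer> positiveWordCounts = new HashMap<>();
    private final Map<String, Integer> negativeWordCounts = new HashMap<>();
    // Number of positive/negative samples, used for P(POS) and P(NEG).
    private final int positiveSampleCount;
    private final int negativeSampleCount;

    public NaiveBayesClassifier(List<String> positiveSamples, List<String> negativeSamples) {
        for (String sample : positiveSamples) {
            for (String word : sample.toLowerCase().split("\\s+")) {
                positiveWordCounts.merge(word, 1, Integer::sum);
            }
        }
        for (String sample : negativeSamples) {
            for (String word : sample.toLowerCase().split("\\s+")) {
                negativeWordCounts.merge(word, 1, Integer::sum);
            }
        }
        this.positiveSampleCount = positiveSamples.size();
        this.negativeSampleCount = negativeSamples.size();
    }

    /** Returns true when the review is more likely to be positive, i.e. Y > 1. */
    public boolean predict(String review) {
        int totalSamples = positiveSampleCount + negativeSampleCount;
        double positive = (double) positiveSampleCount / totalSamples; // P(POS)
        double negative = (double) negativeSampleCount / totalSamples; // P(NEG)

        for (String word : review.toLowerCase().split("\\s+")) {
            // Out-of-vocabulary words get a count of 0 here, which causes the
            // zero-product and divide-by-zero (NaN) issues discussed below.
            int posCount = positiveWordCounts.getOrDefault(word, 0);
            int negCount = negativeWordCounts.getOrDefault(word, 0);
            int total = posCount + negCount;
            positive *= (double) posCount / total; // P(Wi|POS)
            negative *= (double) negCount / total; // P(Wi|NEG)
        }
        return positive > negative; // equivalent to Y = positive / negative > 1
    }
}
```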

In the constructor, we count word occurrences in the positive and negative samples, and we also count the numbers of positive and negative samples themselves.

In the predict() function, we simply use those stored counts to calculate the probabilities according to the NB formula mentioned previously.

However, there are several problems if we were to use the code above in practice.

First, we are adding every word into two different dictionaries without any filtering.

From the way we use them, having only one dictionary is actually enough.

We should also do more pre-processing in the constructor to make prediction faster (see the optimized version below).

Also, to get better prediction accuracy, we should be more considerate about the dictionary’s size and which words should be added to it.

In fact, feature selection is a very important topic not only for NB, but also for other machine learning models in general.

Second, in the predict() function we assign counts of 0 to out-of-vocabulary words. Multiplying a 0 into the positive/negative probability makes the whole result become zero. When the word is in neither dictionary, we also run into divide-by-zero or NaN problems. A better way to handle this is to use a small ‘alpha’ value as the count for rare words (additive smoothing).
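As an illustrative sketch of that idea (the alpha value and names are assumptions, not the article’s code), the smoothed estimate could look like this:

```java
public class SmoothingExample {

    private static final double ALPHA = 0.1; // small pseudo-count for rare/unseen words

    // Smoothed estimate of P(Wi|POS): never exactly 0 and never divides by 0,
    // even when the word was not seen in training.
    static double smoothedProbability(int positiveCount, int totalCount) {
        return (positiveCount + ALPHA) / (totalCount + 2 * ALPHA);
    }

    public static void main(String[] args) {
        System.out.println(smoothedProbability(0, 0));  // ~0.5 instead of NaN
        System.out.println(smoothedProbability(0, 10)); // small but non-zero
        System.out.println(smoothedProbability(7, 10)); // close to 0.7
    }
}
```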

So far, we also haven’t considered other NLP techniques to clean or normalize the input, e.g. stopword removal, TF-IDF, or stemming.
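For instance, a tokenizer that drops common stopwords before counting might look like the following sketch (the stopword list here is only illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilterExample {

    // A tiny illustrative stopword list; a real one would be much larger.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an", "is", "it", "and", "or");

    // Lowercases, splits on whitespace, and drops stopwords before counting.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(token -> !STOPWORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The movie is a masterpiece and it is fun"));
        // prints [movie, masterpiece, fun]
    }
}
```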

More Optimized Version

Instead of storing the count for each Wi in two separate dictionaries, we can precompute the probability ratio P(Wi|POS)/P(Wi|NEG) and store the value in a single dictionary.

In the previous code, there are also a lot of floating-point (FP) multiplications in the predict() function. Multiplying many values between 0.0 and 1.0 can lead to FP underflow, and FP multiplication is also often slower than addition. Thus, instead of storing the probability P(x) directly, we can store the log-probability log(P(x)).

Z = log(Y)
  = log( P(POS) * P(W1|POS) * P(W2|POS) * … * P(Wn|POS) / ( P(NEG) * P(W1|NEG) * P(W2|NEG) * … * P(Wn|NEG) ) )
  = log(P(POS)/P(NEG)) + log(P(W1|POS)/P(W1|NEG)) + log(P(W2|POS)/P(W2|NEG)) + … + log(P(Wn|POS)/P(Wn|NEG))

if Z > 0: the review is more likely to be positive
if Z < 0: the review is more likely to be negative

With that, we can optimize the previous classifier as follows.
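Again, the block below is a minimal sketch rather than the article’s exact code (the class name, the smoothing constant, and the tokenization are assumptions): it precomputes one log ratio per word and sums them in predict().

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OptimizedNaiveBayesClassifier {

    private static final double ALPHA = 0.1; // smoothing for rare/unseen words

    // Single dictionary: word -> log( P(Wi|POS) / P(Wi|NEG) ), precomputed in the constructor.
    private final Map<String, Double> logRatios = new HashMap<>();
    private final double logPriorRatio; // log( P(POS) / P(NEG) )

    public OptimizedNaiveBayesClassifier(List<String> positiveSamples, List<String> negativeSamples) {
        Map<String, Integer> posCounts = new HashMap<>();
        Map<String, Integer> negCounts = new HashMap<>();
        for (String sample : positiveSamples) {
            for (String word : sample.toLowerCase().split("\\s+")) {
                posCounts.merge(word, 1, Integer::sum);
            }
        }
        for (String sample : negativeSamples) {
            for (String word : sample.toLowerCase().split("\\s+")) {
                negCounts.merge(word, 1, Integer::sum);
            }
        }
        // Precompute the (smoothed) log probability ratio for every word seen in training.
        Map<String, Integer> allWords = new HashMap<>(posCounts);
        negCounts.forEach((word, count) -> allWords.merge(word, count, Integer::sum));
        for (String word : allWords.keySet()) {
            double pos = posCounts.getOrDefault(word, 0) + ALPHA;
            double neg = negCounts.getOrDefault(word, 0) + ALPHA;
            logRatios.put(word, Math.log(pos / neg));
        }
        this.logPriorRatio = Math.log((double) positiveSamples.size() / negativeSamples.size());
    }

    /** Returns true when Z > 0, i.e. the review is more likely to be positive. */
    public boolean predict(String review) {
        double z = logPriorRatio;
        for (String word : review.toLowerCase().split("\\s+")) {
            z += logRatios.getOrDefault(word, 0.0); // unseen words contribute nothing
        }
        return z > 0;
    }
}
```

With this version, unseen words simply contribute nothing to the score instead of zeroing out the whole product.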

More about Naive Bayes (When and Why)

Naive Bayes (NB) classifiers are in the family of linear models, or linear classifiers.

Among those linear classifiers, Logistic Regression (LR) is probably a more well-known model.

NB and LR are very similar both in theory and practice.

For example, in our optimized version, the predict() function resembles LR’s W * x + b linear prediction (in LR, we multiply each feature by its ‘weight’, sum them up, then add some ‘bias’).
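To make the resemblance concrete, here is a small illustrative sketch (the names and numbers are made up) that treats each word’s log ratio as a weight over bag-of-words features and the log prior ratio as the bias:

```java
import java.util.Map;

public class LinearViewExample {

    // weights: word -> log(P(Wi|POS)/P(Wi|NEG)); bias: log(P(POS)/P(NEG)).
    static double linearScore(Map<String, Double> weights, double bias, String[] words) {
        double z = bias;                          // b
        for (String word : words) {
            z += weights.getOrDefault(word, 0.0); // W * x, where x_i = 1 when word i is present
        }
        return z;                                 // classify as positive when z > 0
    }

    public static void main(String[] args) {
        Map<String, Double> weights = Map.of("great", 1.2, "boring", -1.5);
        System.out.println(linearScore(weights, 0.1, "a great movie".split(" "))); // 1.3 > 0 -> positive
    }
}
```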

In fact, there have been formal studies to compare the two.

One of them is a paper by Andrew Ng and Michael I. Jordan.

In general, the paper concluded that NB requires less data to train, while LR could achieve higher accuracy if we have enough data to train.

I’ve found that this rule of thumb is often true in practice.

For me, a more important reason to choose NB is that it’s just easier and more flexible to build NB models.

As we have seen in the code, NB classifiers are trained by counting the words.

This can be done by iterating through the data only once.

LR classifiers (and other models), on the other hand, are usually trained with Gradient Descent-based techniques that require computing gradients against a loss function.

Because Gradient Descent requires expensive floating-point calculations, it often needs to be optimized by hardware and frameworks (e.g. NumPy or TensorFlow). It also often iterates through the data several times, or epochs.

Gradient Descent is also often random or stochastic in nature, while NB’s counting result is deterministic.

This means NB’s training will get you exactly the same model as long as you train on the same data.

There is no random seed or state to control.

This property makes NB fit well with some reactive programming frameworks.

Conclusion

As the title suggests, I believe that Naive Bayes is a classification model any engineer should be able to build from scratch on any platform with any programming language.

It’s very simple and easy to implement.

It should be the first thing to try when you need to build a text classifier.

I hope you would agree with me after reading this article.
