Python Data Science Getting Started Tutorial: NLTK

Combining classifier algorithms is a commonly used technique, implemented by creating a voting system: each algorithm gets one vote, and the classification with the most votes wins.

To do this, we want our new classifier to act like a typical NLTK classifier, with all of the usual methods. Quite simply, using object-oriented programming, we can have it inherit from the NLTK classifier class. To do so, we will import it:

from nltk.classify import ClassifierI
from statistics import mode

We also import mode, since this will be how we choose the most popular vote.
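As a quick, throwaway illustration of how mode tallies votes (this snippet is not part of the script we are building):

from statistics import mode

votes = ["pos", "neg", "pos"]   # one vote per classifier
print(mode(votes))              # pos -- the most common value wins

One caveat worth knowing: on Python versions before 3.8, statistics.mode raises StatisticsError when there is a tie, which cannot happen here as long as we use an odd number of classifiers with two classes.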

Now let’s build our classifier class:

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

We call our class VoteClassifier, and we inherit from NLTK's ClassifierI. Next, we assign the list of classifiers passed to our class to self._classifiers.

Next, we will create our own classification method. We name it classify so that we can call .classify on it later, just like a traditional NLTK classifier.

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

Quite simply, all we are doing here is iterating through our list of classifier objects. Then, for each one, we ask it to classify the given features. Each classification counts as a vote. Once we have iterated through them all, we return mode(votes), which simply returns the most popular vote.

This is all we really need, but I think one more parameter, confidence, would be useful. Since we have a voting algorithm, we can also tally how many votes were for and against the winning classification, and call that ratio "confidence." For example, a 3/5 vote is weaker than a 5/5 vote. Therefore, we can literally return the voting ratio as a measure of confidence.
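As a quick, hypothetical sanity check of that ratio, here is a seven-way vote (the size of the ensemble we build below), with four classifiers saying "pos" and three saying "neg":

from statistics import mode

votes = ["pos", "pos", "pos", "pos", "neg", "neg", "neg"]
conf = votes.count(mode(votes)) / len(votes)
print(conf)   # 0.5714... -- the source of the 57.14% confidence figures in the output below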

This is our confidence method:

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

Now let’s put everything together:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

training_set = featuresets[:1900]
testing_set = featuresets[1900:]

#classifier = nltk.NaiveBayesClassifier.train(training_set)

# Load the Naive Bayes classifier we trained and pickled previously
classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

voted_classifier = VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)
print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

print("Classification:", voted_classifier.classify(testing_set[0][0]), "Confidence %:", voted_classifier.confidence(testing_set[0][0])*100)
print("Classification:", voted_classifier.classify(testing_set[1][0]), "Confidence %:", voted_classifier.confidence(testing_set[1][0])*100)
print("Classification:", voted_classifier.classify(testing_set[2][0]), "Confidence %:", voted_classifier.confidence(testing_set[2][0])*100)
print("Classification:", voted_classifier.classify(testing_set[3][0]), "Confidence %:", voted_classifier.confidence(testing_set[3][0])*100)
print("Classification:", voted_classifier.classify(testing_set[4][0]), "Confidence %:", voted_classifier.confidence(testing_set[4][0])*100)
print("Classification:", voted_classifier.classify(testing_set[5][0]), "Confidence %:", voted_classifier.confidence(testing_set[5][0])*100)

So, at the end, we run the voted classifier against a few example documents from the test set, printing the classification and its confidence.

All of our output:

Original Naive Bayes Algo accuracy percent: 66.0
Most Informative Features
            thematic = True              pos : neg    =      9.1 : 1.0
            secondly = True              pos : neg    =      8.5 : 1.0
            narrates = True              pos : neg    =      7.8 : 1.0
             layered = True              pos : neg    =      7.1 : 1.0
             rounded = True              pos : neg    =      7.1 : 1.0
             supreme = True              pos : neg    =      7.1 : 1.0
              crappy = True              neg : pos    =      6.9 : 1.0
           uplifting = True              pos : neg    =      6.2 : 1.0
                 ugh = True              neg : pos    =      5.3 : 1.0
             gaining = True              pos : neg    =      5.1 : 1.0
               mamet = True              pos : neg    =      5.1 : 1.0
               wanda = True              neg : pos    =      4.9 : 1.0
               onset = True              neg : pos    =      4.9 : 1.0
           fantastic = True              pos : neg    =      4.5 : 1.0
               milos = True              pos : neg    =      4.4 : 1.0
MNB_classifier accuracy percent: 67.0
BernoulliNB_classifier accuracy percent: 67.0
LogisticRegression_classifier accuracy percent: 68.0
SGDClassifier_classifier accuracy percent: 57.99999999999999
LinearSVC_classifier accuracy percent: 67.0
NuSVC_classifier accuracy percent: 65.0
voted_classifier accuracy percent: 65.0
Classification: neg Confidence %: 100.0
Classification: pos Confidence %: 57.14285714285714
Classification: neg Confidence %: 57.14285714285714
Classification: neg Confidence %: 57.14285714285714
Classification: pos Confidence %: 57.14285714285714
Classification: pos Confidence %: 85.71428571428571

XVII. Investigate bias using NLTK

In this tutorial, we will discuss some issues.

The main problem is that we have a fairly biased algorithm. You can test this yourself by commenting out the shuffling of the documents, then training on the first 1900 documents and leaving the last 100 (all positive) reviews for testing. Test against those, and you will find your accuracy is very poor.

Conversely, you can test against the first 100 documents, all of which are negative, and train on the remaining 1900. Here you will find the accuracy is very high.
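Here is a minimal sketch of that experiment; it reuses documents, find_features, and nltk from the script above, and everything else is just re-slicing:

# With the shuffle disabled, documents stays in corpus order:
# all 1000 negative reviews first, then all 1000 positive ones.
# random.shuffle(documents)   # <-- deliberately commented out

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# Case 1: train on the first 1900, test on the last 100 (all positive) -> very poor accuracy
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

# Case 2: train on the last 1900, test on the first 100 (all negative) -> very high accuracy
# training_set = featuresets[100:]
# testing_set = featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("accuracy percent:", nltk.classify.accuracy(classifier, testing_set) * 100)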

This is a bad sign. It could mean a lot of things, and there are many options for fixing it. That said, the project we have in mind suggests we move on and try a different data set anyway, so that is what we will do.

