I Built a Fake News Detector Using Natural Language Processing and Classification Models

In an attempt to answer these questions, I built my own fake news detector using open source data from Reddit.

Here’s how I did it and what I learned along the way.

With the help of the Pushshift.io API Wrapper, I scraped approximately 30,000 posts from the Subreddits r/TheOnion and r/nottheonion.

I chose these Subreddits to see how well I could distinguish between fake news and absurd news.
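Under the hood, scraping that volume of posts means paging through the API in batches. Here is a minimal sketch of that loop, assuming the public Pushshift submission endpoint and parameters; `build_params`, `scrape`, and `fetch_page` are hypothetical helpers (the HTTP call is injected rather than hard-coded, since the wrapper library handles it in practice):

```python
BASE_URL = "https://api.pushshift.io/reddit/search/submission"

def build_params(subreddit, before=None, size=100):
    """Query parameters for one page of submissions."""
    params = {"subreddit": subreddit, "size": size,
              "sort": "desc", "sort_type": "created_utc"}
    if before is not None:
        # Only ask for posts older than the last one seen, so repeated
        # calls walk backwards through the subreddit's history
        params["before"] = before
    return params

def scrape(subreddit, n_posts, fetch_page):
    """Collect up to n_posts submissions; fetch_page performs the HTTP call."""
    posts, before = [], None
    while len(posts) < n_posts:
        page = fetch_page(build_params(subreddit, before))
        if not page:
            break
        posts.extend(page)
        before = page[-1]["created_utc"]
    return posts[:n_posts]
```

Each page's oldest timestamp becomes the `before` cursor for the next request, which is how a wrapper can keep walking until it has tens of thousands of posts.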

Posts on r/TheOnion feature satirical news from www.theonion.com or other similar parody sites.

Posts on r/nottheonion feature absurd current events reported on by credible news outlets.

To keep my data clean and concise, I chose the title of a post as my predictor variable (X) and set my target variable (y) to 1 for r/TheOnion and 0 for r/nottheonion.
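A minimal sketch of that setup, assuming the scraped posts sit in a DataFrame with `title` and `subreddit` columns (the toy rows below are illustrative):

```python
import pandas as pd

# Toy frame standing in for the scraped posts
df = pd.DataFrame({
    "title": ["Nation Demands More Slides", "Man Arrested For Thing"],
    "subreddit": ["TheOnion", "nottheonion"],
})

X = df["title"]                                  # predictor: post titles
y = (df["subreddit"] == "TheOnion").astype(int)  # 1 = r/TheOnion, 0 = r/nottheonion
```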

To clean my data, I created a data cleaning function that dropped duplicate rows in a DataFrame, removed punctuation and numbers from all text, removed excessive spacing, and converted all text to lowercase.

```python
# Data cleaning function
def clean_data(dataframe):
    # Drop duplicate rows
    dataframe.drop_duplicates(subset='title', inplace=True)
    # Remove punctuation
    dataframe['title'] = dataframe['title'].str.replace(r'[^\w\s]', ' ', regex=True)
    # Remove numbers
    dataframe['title'] = dataframe['title'].str.replace(r'[^A-Za-z ]', ' ', regex=True)
    # Collapse any runs of spaces into a single space
    dataframe['title'] = dataframe['title'].str.replace(r' +', ' ', regex=True)
    # Transform all text to lowercase
    dataframe['title'] = dataframe['title'].str.lower()
    print("New shape:", dataframe.shape)
    return dataframe.head()
```

Now that my Subreddit datasets were nice and clean, I was ready to conduct an exploratory data analysis (EDA).

Even though I decided to assign post titles to my predictor variable (X), during my data scrape I also acquired other features of a post to uncover any hidden stories in the data.

In total, I scraped the following features from each post:

- title: title of the Subreddit post
- subreddit: which Subreddit the post belongs to
- num_comments: the number of comments on a post
- author: the post author's username
- subreddit_subscribers: number of subscribers for that Subreddit
- score: the score received on Reddit
- domain: the domain referenced in the post
- created_utc: date and time the post was created

Something peculiar stood out to me when I observed the number of posts shared by an author.

In r/nottheonion, with 15 million subscribers, only three authors shared over 100 posts, while r/TheOnion, with 95k subscribers, had 14 authors who shared over 100 posts, the most prolific of them sharing 4,113.
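A check like that boils down to counting posts per author. Here is a sketch with toy authors and a threshold of 2 standing in for the 100 used above:

```python
import pandas as pd

# Toy author column standing in for the scraped data
authors = pd.Series(["a", "a", "a", "b", "c", "c"], name="author")

# Posts per author, sorted most prolific first
counts = authors.value_counts()

# Keep only authors above the threshold (100 in the real dataset)
prolific = counts[counts > 2]
```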

After making this observation, I confirmed that I made a good decision to use r/TheOnion as a case study to understand WhatsApp’s fake news problem.

In WhatsApp’s “Tips to help prevent the spread of rumors and fake news,” three of the seven tips focus on preventing the spread of fake news.

One of the biggest problems with fake news is not necessarily that it gets written, but rather that it gets spread.

The activity of r/TheOnion’s authors mimics the core qualities of the fake news phenomenon.

The biggest mistake in data science is treating every business challenge as a predictive problem. Remember, 70% of low-hanging problems can be solved by just doing an EDA.

— Sundar Ramamurthy

Another interesting thing I discovered during my EDA of the data was the most referenced domains in each Subreddit.

Of course, the majority of the domains referenced in r/TheOnion were from theonion.com and other parody news sites.

However, the most referenced domains in r/nottheonion gave me a kick: the top five were foxnews.com, theguardian.com, google.com, bbc.com, and newsweek.com.

Top 5 Most Referenced Domains in r/TheOnion & r/nottheonion.

I performed more EDA on my datasets and analyzed the most frequently used words by applying CountVectorizer(ngram_range=(1,1)) to the data.

I also analyzed the most frequently used bigrams by applying CountVectorizer(ngram_range=(2,2)) to the data.

Between the two Subreddits, I made note of common frequent words and added them to a custom stop_words list that I would later use when modeling the data.

I could have begun the modeling process early on, but I decided to conduct EDA first in order to get to know my data well.

After the data shared their stories with me, I began to create and refine my predictive models.

I set my predictor (titles) and target (subreddit) variables, conducted a train/test split, and found the best parameters for my models through Pipeline and GridSearchCV.

I used a combination of vectorizers and classification models to find the best parameters that would give me the highest accuracy score.

I optimized for accuracy because both kinds of error matter here: fake news should be classified as fake, and authentic news should not be flagged as fake.

I implemented four models using combinations of CountVectorizer and TfidfVectorizer paired with LogisticRegression and MultinomialNB.

My best model for achieving the highest test accuracy score implemented CountVectorizer and MultinomialNB.

Here’s my code for how I found the best parameters for this model.

```python
# Assign vectorizer and model to pipeline
pipe = Pipeline([('cvec', CountVectorizer()),
                 ('nb', MultinomialNB())])

# Tune GridSearchCV
pipe_params = {'cvec__ngram_range': [(1, 1), (1, 3)],
               'nb__alpha': [.36, .6]}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(X_train, y_train)

print("Best score:", gs.best_score_)
print("Train score:", gs.score(X_train, y_train))
print("Test score:", gs.score(X_test, y_test))
gs.best_params_
```

My best model for interpreting coefficients implemented CountVectorizer and LogisticRegression.

Here’s my code for how I found the best parameters for this model.

```python
# Assign vectorizer and model to pipeline
pipe = Pipeline([('cvec', CountVectorizer()),
                 ('lr', LogisticRegression(solver='liblinear'))])

# Tune GridSearchCV ('custom' is the stop_words list built during EDA)
pipe_params = {'cvec__stop_words': [None, 'english', custom],
               'cvec__ngram_range': [(1, 1), (2, 2), (1, 3)],
               'lr__C': [0.01, 1]}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(X_train, y_train)

print("Best score:", gs.best_score_)
print("Train score:", gs.score(X_train, y_train))
print("Test score:", gs.score(X_test, y_test))
gs.best_params_
```

To evaluate my CountVectorizer and MultinomialNB model, I implemented a confusion matrix.
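One way to compute that evaluation is scikit-learn's confusion_matrix on the held-out predictions; the labels below are toy values, using the same 1 = r/TheOnion, 0 = r/nottheonion encoding:

```python
from sklearn.metrics import confusion_matrix

y_test = [1, 1, 0, 0, 1, 0]   # true labels
preds  = [1, 0, 0, 0, 1, 1]   # model predictions

# ravel() unpacks the 2x2 matrix in the order: true negatives,
# false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

# fp: real news flagged as satire; fn: satire that slipped through
accuracy = (tp + tn) / len(y_test)
```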

While a 90% accuracy test score is high, that still signifies that 10% of posts are being misclassified as either fake news or real news.

If this were WhatsApp’s scores for their fake news detector, 10% of all fake news accounts would be misclassified on a monthly basis.

Good thing I created a fake news detector on a smaller dataset first.

Finally, even though my CountVectorizer and LogisticRegression model didn’t perform as well as the model above, I still decided to interpret its coefficients to get a better picture of how each word affects the predictions being made.

In the graph below of my logistic regression coefficients, the word that contributes the most to a title being classified as r/TheOnion is ‘kavanaugh’, followed by ‘incredible’ and ‘ftw’.

The word that contributes the most to a title being classified as r/nottheonion is ‘florida’, followed by ‘cops’ and ‘arrested’.

After exponentiating my coefficients, I discovered that each additional occurrence of “kavanaugh” in a title multiplies the odds of that title being classified as r/TheOnion by 7.7.

Likewise, each additional occurrence of “florida” multiplies the odds of a title being classified as r/nottheonion by 14.9.
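The arithmetic behind that interpretation is just exponentiation of the fitted log-odds coefficients. The coefficient values below are illustrative, chosen to reproduce the reported multipliers:

```python
import numpy as np

# Positive coefficients lean toward r/TheOnion (class 1),
# negative ones toward r/nottheonion (class 0)
coefs = {"kavanaugh": 2.04, "florida": -2.70}

# exp(coef) is the multiplicative change in the odds of class 1
# for one more occurrence of the word
odds_onion = np.exp(coefs["kavanaugh"])        # ~7.7x toward r/TheOnion
odds_nottheonion = np.exp(-coefs["florida"])   # ~14.9x toward r/nottheonion
```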

Looking back at my process, I would have tested more NLP vectorizers and classification models on my data.

Moving forward, I’m curious to learn more about how to parse through images, videos, and other forms of media through machine learning, since news articles aren’t always written in text format.

I also have a better understanding of how WhatsApp might have created a model to detect fake news accounts.

As far as WhatsApp’s accuracy score for deleting accounts, that’s a question that’s still on my mind.

To view all of my code for this process, check out my GitHub repo.

Jasmine Vasandani is a data scientist, strategist, and researcher.

She is passionate about building inclusive communities in data.

Learn more about her: www. … co/
