Predicting Reddit Comment Upvotes with Machine Learning

Adam Reevesman, Dec 31, 2018

In this article, we will use Python and the scikit-learn package to predict the number of upvotes of a comment on Reddit.

We fit a variety of regression models and compare their performance using the following metrics:

- R² to measure the goodness of fit
- mean absolute error (MAE) and root mean squared error (RMSE) on a test set to measure accuracy
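To make the three metrics concrete, here is a minimal sketch that computes them with scikit-learn on a handful of made-up values. The numbers are purely illustrative and are not taken from the Reddit data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted upvote counts, for illustration only
y_true = np.array([1, 2, 3, 10, 50])
y_pred = np.array([2, 2, 5, 8, 30])

r2 = r2_score(y_true, y_pred)                        # goodness of fit
mae = mean_absolute_error(y_true, y_pred)            # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large misses more

print(f"R-Sq: {r2:.4f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

Because RMSE squares each error before averaging, the single large miss (20 upvotes) dominates it, while MAE weights every miss equally. RMSE is therefore never smaller than MAE.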

This article is based on the work from this GitHub repository.

The code can be found in this notebook.

Background

Reddit is a popular social media site.

On this site, users post threads in various subreddits, like the one below.

[Figure: A thread in the “AskReddit” subreddit]

Users can comment on threads or other comments.

They can also give upvotes or downvotes to other threads and comments.

Our goal is to predict the number of upvotes that comments will receive.

Data

The data, a pickle file containing 1,205,039 rows (comments) posted in May 2015, is hosted on Google Drive and can be downloaded using this link.

The target variable and relevant features that will be used for modeling are listed below.

They can be divided into several categories.

Target variable

- score: number of upvotes on the comment

Comment level features

- gilded: the number of gilded tags (premium likes) on the comment
- distinguished: the type of user on the page, either ‘moderator’, ‘admin’, or ‘user’
- controversiality: a Boolean indicating whether (1) or not (0) the comment is controversial (popular comments that are getting close to the same number of upvotes as downvotes)
- over_18: whether or not the thread has been marked as NSFW
- time_lapse: the time in seconds between the comment and the first comment on the thread
- hour_of_comment: the hour of day the comment was posted
- weekday: the day of week the comment was posted
- is_flair: whether or not there is flair text for the comment (https://www.reddit.com/r/help/comments/3tbuml/whats_a_flair/)
- is_flair_css: whether or not there is a CSS class for the comment flair
- depth: depth of the comment in the thread (number of parent comments that the comment has)
- no_of_linked_sr: number of subreddits mentioned in the comment
- no_of_linked_urls: number of URLs linked in the comment
- subjectivity: number of instances of “I”
- is_edited: whether or not the comment has been edited
- is_quoted: whether or not the comment quotes another
- no_quoted: number of quotes in the comment
- senti_neg: negative sentiment score
- senti_neu: neutral sentiment score
- senti_pos: positive sentiment score
- senti_comp: compound sentiment score
- word_count: number of words in the comment

Parent level features

- time_since_parent: the time in seconds between the comment and the parent comment
- parent_score: score of the parent comment (NaN if the comment doesn’t have a parent)
- parent_cos_angle: cosine similarity between the embeddings of the comment and of its parent comment (https://nlp.stanford.edu/projects/glove/)

Comment tree root features

- is_root: whether or not the comment is a root
- time_since_comment_tree_root: the time in seconds between the comment and the comment tree root
- comment_tree_root_score: score of the comment tree root

Thread level features

- link_score: upvotes on the thread the comment is on
- upvote_ratio: the percentage of upvotes out of all votes on the thread the comment is on
- link_ups: number of upvotes on the thread
- time_since_link: time in seconds since the thread was created
- no_past_comments: number of comments on the thread before the comment was posted
- score_till_now: score of the thread at the time this comment was posted
- title_cos_angle: cosine similarity between the embeddings of the comment and of its thread’s title
- is_selftext: whether or not the thread had selftext

Setup

Let’s load all of the libraries we’ll need.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import LassoCV
    from sklearn.linear_model import RidgeCV
    from sklearn.linear_model import ElasticNetCV
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    import warnings
    warnings.filterwarnings('ignore')

We also define some functions for interacting with the models.

    def model_diagnostics(model, pr=True):
        """
        Returns and prints the R-squared, RMSE and the MAE for a trained model
        """
        # Relies on the global X_test and y_test created by the train/test split
        y_predicted = model.predict(X_test)
        r2 = r2_score(y_test, y_predicted)
        mse = mean_squared_error(y_test, y_predicted)
        mae = mean_absolute_error(y_test, y_predicted)
        if pr:
            print(f"R-Sq: {r2:.4}")
            print(f"RMSE: {np.sqrt(mse)}")
            print(f"MAE: {mae}")
        return [r2, np.sqrt(mse), mae]

    def plot_residuals(y_test, y_predicted):
        """
        Plots the distribution for actual and predicted values of the target variable.
        Also plots the distribution for the residuals
        """
        # Note: sns.distplot is deprecated since seaborn 0.11;
        # sns.histplot is the suggested replacement for histograms
        fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True)
        sns.distplot(y_test, ax=ax0, kde=False)
        ax0.set(xlabel='Test scores')
        sns.distplot(y_predicted, ax=ax1, kde=False)
        ax1.set(xlabel="Predicted scores")
        plt.show()
        fig, ax2 = plt.subplots()
        sns.distplot((y_test - y_predicted), ax=ax2, kde=False)
        ax2.set(xlabel="Residuals")
        plt.show()

    def y_test_vs_y_predicted(y_test, y_predicted):
        """
        Produces a scatter plot for the actual and predicted values of the target variable
        """
        fig, ax = plt.subplots()
        ax.scatter(y_test, y_predicted)
        ax.set_xlabel("Test Scores")
        ax.set_ylim([-75, 1400])
        ax.set_ylabel("Predicted Scores")
        plt.show()

    def get_feature_importance(model):
        """
        For fitted tree based models, get_feature_importance can be used to get
        the feature importance as a tidy output
        """
        # Feature names must follow the column order used to build the design
        # matrix (numeric, then boolean, then categorical). This assumes that
        # pd.get_dummies yields the same columns as the LabelBinarizer encoding.
        X_non_text = pd.get_dummies(df[cat_cols])
        features = numeric_cols + bool_cols + list(X_non_text.columns)
        feature_importance = dict(zip(features, model.feature_importances_))
        for name, importance in sorted(feature_importance.items(), key=lambda x: x[1], reverse=True):
            print(f"{name:<30}: {importance:>6.2%}")
        print(f"\nTotal importance: {sum(feature_importance.values()):.2%}")
        return feature_importance

Read in data

    df = pd.read_pickle('reddit_comments.pkl')

Handle missing values

The data has some missing values, which are handled either by imputation or by dropping observations.
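As a rough illustration of the two strategies, the sketch below applies them to a tiny made-up frame. The column names mirror ones used in this article, but the values are invented: rows missing one column are dropped, and gaps in another column are filled with that column's mode.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from the article
toy = pd.DataFrame({
    "parent_score": [1.0, np.nan, 1.0, 7.0, np.nan, 1.0],
    "title_cosine": [0.2, 0.5, np.nan, 0.9, 0.1, 0.4],
})

# Strategy 1: drop observations where a column is missing
toy = toy[~toy["title_cosine"].isna()].copy()

# Strategy 2: impute missing values with the most frequent value (mode)
parent_score_mode = toy["parent_score"].mode()[0]
toy["parent_score"] = toy["parent_score"].fillna(parent_score_mode)

print(toy)
```

Dropping shrinks the data set, while imputation keeps every row at the cost of inserting values that were never observed.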

Missing values occurred in the following columns for the following reasons:

- parent_score: some comments did not have a parent (imputed)
- comment_tree_root_score and time_since_comment_tree_root: some comments were the root of a comment tree (imputed)
- parent_cosine, parent_euc, title_cosine, title_euc: some comments lacked words that had GloVe word embeddings (dropped). In addition, some comments did not have a parent (parent_cosine, parent_euc imputed)

    # drop rows where title_cosine is NaN
    df = df[~df.title_cosine.isna()]

    # impute with the mode of each column
    parent_score_impute = df.parent_score.mode()[0]
    comment_tree_root_score_impute = df.comment_tree_root_score.mode()[0]
    time_since_comment_tree_root_impute = df.time_since_comment_tree_root.mode()[0]

    # comments without a parent get a similarity and distance of 0
    parent_cosine_impute = 0
    parent_euc_impute = 0

    df.loc[df.parent_score.isna(), 'parent_score'] = parent_score_impute
    df.loc[df.comment_tree_root_score.isna(), 'comment_tree_root_score'] = comment_tree_root_score_impute
    df.loc[df.time_since_comment_tree_root.isna(), 'time_since_comment_tree_root'] = time_since_comment_tree_root_impute
    df.loc[df.parent_cosine.isna(), 'parent_cosine'] = parent_cosine_impute
    df.loc[df.parent_euc.isna(), 'parent_euc'] = parent_euc_impute

Select variables

In the next step, we define which variables to use when training the model.

We make a list for boolean variables, for variables with multiple categories and for numeric variables.

    bool_cols = ['over_18', 'is_edited', 'is_quoted', 'is_selftext']

    cat_cols = ['subreddit', 'distinguished', 'is_flair', 'is_flair_css',
                'hour_of_comment', 'weekday']

    # 'no_quoted' was listed twice in the original list; the duplicate is dropped here
    numeric_cols = ['gilded', 'controversiality', 'upvote_ratio', 'time_since_link',
                    'depth', 'no_of_linked_sr', 'no_of_linked_urls', 'parent_score',
                    'comment_tree_root_score', 'time_since_comment_tree_root',
                    'subjectivity', 'senti_neg', 'senti_pos', 'senti_neu',
                    'senti_comp', 'no_quoted', 'time_since_parent', 'word_counts',
                    'no_of_past_comments', 'parent_cosine', 'parent_euc',
                    'title_cosine', 'title_euc', 'link_score']

Using our list of variables, we can prepare the data for modeling.

The step below uses scikit-learn’s LabelBinarizer to make dummy variables out of the categorical columns then combines all variables.

    lb = LabelBinarizer()
    cat = [lb.fit_transform(df[col]) for col in cat_cols]
    bol = [df[col].astype('int') for col in bool_cols]
    t = df.loc[:, numeric_cols].values
    final = [t] + bol + cat
    y = df.score.values
    x = np.column_stack(tuple(final))

We split the data into a training and test set using an 80–20 split.

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

Modeling

In this section, we use scikit-learn to fit models on the Reddit data.

We start with a baseline model, then try to improve results with Lasso, Ridge, and Elastic Net Regression.

In addition, we try K-Nearest Neighbors, Decision Tree, Random Forest and Gradient Boosted Regression.

First, let’s define a dictionary that will store the results of the model diagnostics.

    model_performance_dict = dict()

Linear Regression Models

Baseline Model

We fit a simple model to establish a baseline.

This model always predicts the mean number of upvotes.

    baseline = DummyRegressor(strategy='mean')
    baseline.fit(X_train, y_train)
    model_performance_dict["Baseline"] = model_diagnostics(baseline)

Linear Regression

    linear = LinearRegression()
    linear.fit(X_train, y_train)
    model_performance_dict["Linear Regression"] = model_diagnostics(linear)

Lasso Regression

    lasso = LassoCV(cv=30).fit(X_train, y_train)
    model_performance_dict["Lasso Regression"] = model_diagnostics(lasso)

Ridge Regression

    ridge = RidgeCV(cv=10).fit(X_train, y_train)
    model_performance_dict["Ridge Regression"] = model_diagnostics(ridge)

Elastic Net Regression

    elastic_net = ElasticNetCV(cv=30).fit(X_train, y_train)
    model_performance_dict["Elastic Net Regression"] = model_diagnostics(elastic_net)

Nonlinear Regression Models

K-Nearest Neighbor Regression

    knr = KNeighborsRegressor()
    knr.fit(X_train, y_train)
    model_performance_dict["KNN Regression"] = model_diagnostics(knr)

Decision Tree Regression

    dt = DecisionTreeRegressor(min_samples_split=45, min_samples_leaf=45, random_state=10)
    dt.fit(X_train, y_train)
    model_performance_dict["Decision Tree"] = model_diagnostics(dt)

Random Forest Regression

    rf = RandomForestRegressor(n_jobs=-1, n_estimators=70, min_samples_leaf=10, random_state=10)
    rf.fit(X_train, y_train)
    model_performance_dict["Random Forest"] = model_diagnostics(rf)

Gradient Boosting Regression

    gbr = GradientBoostingRegressor(n_estimators=70, max_depth=5)
    gbr.fit(X_train, y_train)
    model_performance_dict["Gradient Boosting Regression"] = model_diagnostics(gbr)

Model comparison

We compare the models based on three metrics: R², MAE, and RMSE.

To do so, we define the function below.

    def model_comparison(model_performance_dict, sort_by='RMSE', metric='RMSE'):
        Rsq_list = []
        RMSE_list = []
        MAE_list = []
        for key in model_performance_dict.keys():
            Rsq_list.append(model_performance_dict[key][0])
            RMSE_list.append(model_performance_dict[key][1])
            MAE_list.append(model_performance_dict[key][2])

        props = pd.DataFrame([])
        props["R-squared"] = Rsq_list
        props["RMSE"] = RMSE_list
        props["MAE"] = MAE_list
        props.index = model_performance_dict.keys()
        props = props.sort_values(by=sort_by)

        fig, ax = plt.subplots(figsize=(12, 6))
        ax.bar(props.index, props[metric], color="blue")
        plt.title(metric)
        plt.xlabel('Model')
        plt.xticks(rotation=45)
        plt.ylabel(metric)

Let’s use this function to compare the models based on each metric.

    model_comparison(model_performance_dict, sort_by='R-squared', metric='R-squared')
    model_comparison(model_performance_dict, sort_by='R-squared', metric='MAE')
    model_comparison(model_performance_dict, sort_by='R-squared', metric='RMSE')

Interpreting results

The random forest model is a reasonable choice when taking performance and training time into account.

The mean absolute error is approximately 9.7, which means that on average, the model estimate is off by about 9.7 upvotes.

Let’s look at some plots for more information about model performance.

    y_predicted = rf.predict(X_test)
    plot_residuals(y_test, y_predicted)

Comparing the histograms of test scores and predicted scores, we notice that the model tends to overestimate the target variable when it is small.

In addition, the model never predicts that the target variable will be much larger than 2,000.

It appears that results are skewed by the few cases where the target variable is large.

The majority of comments have only a small number of upvotes, but the model expects these to receive more than they do.

However, when the comment has an extreme number of upvotes, the model will underestimate it.
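This pattern is typical of heavy-tailed targets. As a rough sketch on simulated data (a log-normal sample standing in for comment scores, which is an assumption and not the Reddit data), the constant that minimizes squared error is the mean, and the mean sits above most observations and far below the largest ones. A similar pull, in weaker form, can plausibly affect any model trained to minimize squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heavy-tailed "scores": most values small, a few very large
scores = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

mean_score = scores.mean()          # the constant that minimizes squared error
median_score = np.median(scores)    # the typical observation

# Share of observations that a constant mean prediction would overestimate
share_overestimated = (scores < mean_score).mean()

print(f"mean: {mean_score:.2f}, median: {median_score:.2f}")
print(f"share overestimated by the mean: {share_overestimated:.1%}")
print(f"largest score: {scores.max():.0f}")
```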

This distribution of residuals suggests that a logical next step would be to explore the results of a stacked model.

Stacking is an ensembling technique (like random forests, gradient boosting, etc.) that can often improve performance.

We would first fit a classifier to predict the number of upvotes (with classes like few, some, many) and the result would be used as an additional predictor in the regression model.

This method has the potential to reduce errors and improve the goodness of fit because, in addition to our original information, the regression model would also have a hint about the number of upvotes to help it make a prediction.
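The idea can be sketched as follows. This is a minimal illustration on synthetic data, not the project's implementation: the cut points for the few/some/many classes, the choice of random forests for both stages, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 5 features and a skewed, non-negative target
X = rng.normal(size=(2000, 5))
y = np.exp(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=2000))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Step 1: bin the training target into classes 0 (few), 1 (some), 2 (many)
cutoffs = np.quantile(y_tr, [0.5, 0.9])   # hypothetical cut points
bins_tr = np.digitize(y_tr, cutoffs)

# Step 2: fit the classifier. Out-of-fold predictions are used on the
# training set so the regressor never sees a hint fitted on the same row.
clf = RandomForestClassifier(n_estimators=50, random_state=10)
hint_tr = cross_val_predict(clf, X_tr, bins_tr, cv=5)
clf.fit(X_tr, bins_tr)
hint_te = clf.predict(X_te)

# Step 3: append the predicted class as an extra feature for the regressor
reg = RandomForestRegressor(n_estimators=50, random_state=10)
reg.fit(np.column_stack([X_tr, hint_tr]), y_tr)
y_hat = reg.predict(np.column_stack([X_te, hint_te]))
```

Whether the extra feature actually helps on the Reddit data would need to be checked against the same R², MAE, and RMSE metrics used for the other models.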

Tree-based models also allow us to quantify the importance of the features they used.

    rf_importances = get_feature_importance(rf)

The least important features are the indicator variables for different subreddits.

Since this data only includes comments from five of the most popular, and rather generic, subreddits (food, world news, movies, science, and gaming), we would not expect these features to be very important.

Additionally, there are many features with little or no importance.

These features could be removed.

This could help avoid overfitting and decrease the time it takes to train models.
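One way to try this is to refit the model on only the features whose importance exceeds some cutoff. The sketch below uses synthetic data and a hypothetical 1% threshold. Both are assumptions for illustration, and the right cutoff for the Reddit data would need to be chosen by comparing test-set metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: only the first two of six features drive the target
X = rng.normal(size=(1000, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

rf_full = RandomForestRegressor(n_estimators=50, random_state=10).fit(X, y)

# Keep features whose importance exceeds a hypothetical 1% threshold
threshold = 0.01
keep = rf_full.feature_importances_ > threshold
X_reduced = X[:, keep]

# Refit on the reduced feature set
rf_small = RandomForestRegressor(n_estimators=50, random_state=10).fit(X_reduced, y)
print(f"kept {keep.sum()} of {X.shape[1]} features")
```

scikit-learn also ships `SelectFromModel` in `sklearn.feature_selection`, which wraps the same threshold-on-importance idea in a transformer.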

The five most important features are ones that describe the thread that the comment is on or the comment’s parent.

We might expect this due to the fact that popular and trending content gets shown to more users, so comments that are close to content that has a lot of upvotes are more likely to get a lot of upvotes as well.

It is also important to note that many of the features that had high importance were ones that had missing values.

For this reason, a deeper analysis of the way in which missing values were handled could lead to improved model performance (for example, when we dropped comment tree roots, parent score was by far the most important feature, at ~25%).

Imputation using the mean or median, or prediction using a simple linear regression, would be worth testing as well.
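To show how much the choice of fill value can differ, here is a small sketch using scikit-learn's `SimpleImputer` on invented parent scores (the values are made up, with one large score included to mimic the skew seen in the data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented parent scores with gaps; one large value skews the mean
parent_score = np.array([[1.0], [1.0], [np.nan], [2.0], [4.0], [np.nan], [92.0]])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(parent_score)
    # statistics_ holds the fill value learned for each column
    fill_values[strategy] = float(imputer.statistics_[0])

print(fill_values)
```

On skewed columns the mean can land far from any typical value, so the median or mode may be the more conservative choice. Which one gives the best model is an empirical question.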

Conclusion

In this article, we have outlined a machine learning workflow that uses the scikit-learn Python library to predict Reddit comment upvotes.

We compared the performance of linear and nonlinear regression models and found that a random forest regressor was the best choice among the models tested when balancing accuracy and training time.

After a quick examination of this model’s residuals, we saw lots of room for improvement.

Possible next steps for this project include:

- Fitting models using fewer features and comparing their performance to the originals
- Analyzing missing values and their effect on model performance
- Stacking models for improved performance

This article is based on a project that was originally completed by Adam Reevesman, Gokul Krishna Guruswamy, Hai Le, Maximillian Alfaro, and Prakhar Agrawal during the Introduction to Machine Learning course at the University of San Francisco’s Master of Science in Data Science program.

Relevant work can be found in this GitHub repository and the code from this article can be found in this notebook.

I would be pleased to receive feedback on any of the above.

Feel free to reach out to me with any comments or questions.
