Machine Learning with Reddit, and the Impact of Sorting Algorithms on Data Collection and Models

This variance didn’t occur when randomizing and splitting the same data into train and test sets; it occurred when we scraped the same subreddits with different sorting algorithms.

The same Multinomial NB model performed better with data scraped through the hot posts algorithm.

One inference that can be drawn from this discrepancy is that these subreddits’ most recently popular posts are more distinct from one another than their historically popular posts.
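The comparison described above can be sketched roughly as follows. This is a minimal pure-Python multinomial Naive Bayes with Laplace smoothing, standing in for whatever library implementation was actually used. The function names (`train_nb`, `predict_nb`, `accuracy`) and the tiny tokenized titles are hypothetical placeholders, not the scraped data.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)          # documents per class
    word_counts = defaultdict(Counter)      # token counts per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for one document."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        logp = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in model["vocab"]:       # ignore tokens never seen in training
                logp += math.log((counts[tok] + model["alpha"]) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted class matches the label."""
    hits = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Made-up placeholder titles, already tokenized; not real scraped posts.
train_docs = [["ai", "robot"], ["robot", "future"],
              ["election", "war"], ["war", "treaty"]]
train_labels = ["Futurology", "Futurology", "Worldnews", "Worldnews"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, ["robot", "ai"])
```

To mirror the experiment, one would train one such model per scrape (one on hot posts, one on top posts) and compare `accuracy` on held-out posts from each.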

Let’s explore some of the implications of this!

Side-by-side comparison of the most common words in Hot and Top posts circa April 18th, 2019, aggregated across both /r/Futurology and /r/Worldnews. As you can see, the difference is night and day.
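A most-common-words comparison like this one can be approximated with the standard library alone. The sketch below is an assumption about the approach, not the code behind the figure: `top_words`, `shared_top_words`, the stopword list, and the example titles are all hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "in", "and", "is", "for"})


def top_words(titles, n=10):
    """Return the n most frequent non-stopword tokens across post titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z']+", title.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


def shared_top_words(titles_a, titles_b, n=10):
    """Words appearing in the top-n lists of both collections."""
    words_a = {word for word, _ in top_words(titles_a, n)}
    words_b = {word for word, _ in top_words(titles_b, n)}
    return words_a & words_b


# Made-up placeholder titles standing in for the two scrapes.
hot_titles = ["New battery breakthrough announced", "Battery startup raises funding"]
top_titles = ["Historic treaty signed", "Treaty talks resume"]

hot_common = top_words(hot_titles, n=1)
overlap = shared_top_words(hot_titles, top_titles, n=5)
```

A small `overlap` between the hot and top vocabularies is one simple way to quantify the "night and day" difference.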

There are currently six sorting algorithms one can organize a subreddit by: Best, Hot, New, Top, Controversial, and Rising.

I found another Medium article, written in 2015, laying out the hot algorithm from when Reddit’s code was open source. As of 2016 this is no longer the case: these once publicly transparent sorting algorithms are now proprietary components of Reddit’s business model. Relatedly, the code is obscured to prevent bad-faith posters from artificially pushing bad or irrelevant content to the top of subreddits, or from farming karma, for whatever pathological reasons someone wants fake internet points.

The author rewrote the code from Pyrex (a language for writing C extensions for Python) into plain Python for readability, reproduced here:

    # Rewritten code from /r2/r2/lib/db/_sorts.pyx

    from datetime import datetime, timedelta
    from math import log

    epoch = datetime(1970, 1, 1)

    def epoch_seconds(date):
        # Seconds elapsed between the Unix epoch and the given datetime
        td = date - epoch
        return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

    def score(ups, downs):
        # Net votes
        return ups - downs

    def hot(ups, downs, date):
        s = score(ups, downs)
        order = log(max(abs(s), 1), 10)
        sign = 1 if s > 0 else -1 if s < 0 else 0
        seconds = epoch_seconds(date) - 1134028003
        return round(sign * order + seconds / 45000, 7)

What one can interpret from this is that the hot algorithm was defined as a function that scores each post so that posts with more upvotes than downvotes rank higher, prioritizing posts with high upvote counts that were posted relatively recently compared to other posts within the subreddit.
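One consequence of this formula is worth spelling out: net votes enter through a base-10 logarithm while posting time enters linearly, so ten times the net votes is worth exactly as much as being posted 45000 seconds (12.5 hours) later. The sketch below checks that trade-off using a condensed version of the 2015-era function quoted in the article. The example dates and vote counts are arbitrary, and none of this describes Reddit's current proprietary ranking.

```python
from datetime import datetime, timedelta
from math import log

EPOCH = datetime(1970, 1, 1)


def hot(ups, downs, date):
    """Condensed form of the 2015-era hot score quoted in the article."""
    s = ups - downs
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = (date - EPOCH).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000, 7)


older = datetime(2019, 4, 18, 0, 0, 0)        # arbitrary example timestamp
newer = older + timedelta(seconds=45000)       # 12.5 hours later

# Ten times the net votes at the same posting time...
vote_gain = hot(100, 0, older) - hot(10, 0, older)
# ...versus the same votes on a post made 45000 seconds later.
time_gain = hot(10, 0, newer) - hot(10, 0, older)

print(vote_gain, time_gain)
```

Both gains come out to about one point, which is why a modestly upvoted recent post can outrank a heavily upvoted older one under this scoring.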

What we do generally know these days is that Controversial posts have a higher proportion of downvotes, Top posts are historically popular, Best posts have higher ratios of upvotes after a certain period of time, New posts are simply ordered by time posted, and Rising posts are new posts getting lots of attention and votes.
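To make the difference between sort orders concrete, here is a small sketch that ranks the same three posts three ways. These are simplified stand-ins: New is ordered by timestamp, Top by net votes, and Hot by the 2015-era formula, not by Reddit's current proprietary implementations. The post records are invented for illustration.

```python
from datetime import datetime
from math import log


def hot_score(ups, downs, created):
    """2015-era hot score; an approximation, not Reddit's current ranking."""
    s = ups - downs
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = (created - datetime(1970, 1, 1)).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000, 7)


# Invented example posts, not scraped data.
posts = [
    {"id": "old_hit", "ups": 50000, "downs": 2000, "created": datetime(2017, 1, 1)},
    {"id": "recent", "ups": 900, "downs": 50, "created": datetime(2019, 4, 17)},
    {"id": "brand_new", "ups": 5, "downs": 0, "created": datetime(2019, 4, 18)},
]

# New: most recently posted first.
by_new = sorted(posts, key=lambda p: p["created"], reverse=True)
# Top: highest net votes first, regardless of age.
by_top = sorted(posts, key=lambda p: p["ups"] - p["downs"], reverse=True)
# Hot: net votes (log-scaled) combined with recency.
by_hot = sorted(posts, key=lambda p: hot_score(p["ups"], p["downs"], p["created"]),
                reverse=True)

new_ids = [p["id"] for p in by_new]
top_ids = [p["id"] for p in by_top]
hot_ids = [p["id"] for p in by_hot]
```

The same pool of posts yields three different rankings, so a scraper that takes the first few hundred results under each sort ends up with a different dataset each time.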

In this case, the difference between the posts returned by my hot and top scrapes, run at the same time, demonstrates that the nature of posts within a subreddit changes over time, meaning our machine learning models lose their usefulness if they do not receive new training data.

Of course, this applies universally to machine learning applications with online data and, even more broadly, to human learning: the internet is an ever-evolving digital organism whose full essence a static digital archive will never capture.

As such, we must always be prepared to adapt and learn and be receptive to new information to be able to understand what we are witnessing.

Side-by-side comparison of the most common words in Hot and Top posts on /r/Futurology circa April 18th, 2019.

Side-by-side comparison of the most common words in Hot and Top posts on /r/Worldnews circa April 18th, 2019.
