Using NLP to Identify Redditors Who Control Multiple Accounts

Photo by Daniel Monteiro on Unsplash

Introduction

I built a model that can determine whether two Reddit accounts are controlled by the same user, based solely on their writing styles. [...]

How few words do we need before we can start to distinguish one author’s writing style from another’s, and how many authors can we compare a body of anonymously written text against before we start to see two users with indistinguishably similar writing styles?

Implementation

For this project, I wanted to see if I could answer these questions by analyzing users on Reddit. [...] Perhaps, if I’m lucky, I can identify two or more accounts that belong to the same user as a way to help subreddit moderators maintain healthier discussion in their communities.

Because this is an unsupervised learning problem, I needed some way to validate the accuracy of my model. [...]

MCCC was first tested on a small group of 7 randomly chosen Redditors. Each user’s comment history was split in two: a pseudo-user placed in Subset 1 and the original user kept in Subset 2. By iterating through each pseudo-user in Subset 1 and comparing it to each of the users in Subset 2, MCCC correctly identified 5 of the 7 users. [...]

However, function words show up in every text, and their frequency of use tends to stay fairly consistent across different documents for a given author. For my analysis, 150 commonly used function words were used to identify user writing styles with the Delta method. [...] The 7 users previously analyzed were now matched back to their correct user with 100% accuracy. [...] Identifying users out of a random group of 40 (after filtering out those with fewer than 200 comments, too small a history to identify writing tendencies) returned 95% accuracy.

[Figure: Cosines of pseudo-users to their original matching accounts vs non-matches]

In addition to lexical analysis, we can also distinguish unique writing styles by syntax. [...] change their style of writing in different contexts, they may be accidentally identified as another user who has very similar writing tendencies.

[Figure: Comparing the feature vectors of two users]

To further improve the model, I added to the feature vector the use of punctuation and of certain Markdown formatting methods that are common on Reddit (such as “[ ]( )”, used to display hyperlinks). [...] This boosted my model’s performance to 93.8% accuracy in correctly matching users out of a group of 3,000 users.

[Figure: Probability distribution of the final model]

[Figure: Alpha for non-matching accounts]

From the distribution, I was able to determine critical values at which I can reject the null hypothesis that a user is a non-match. [...]
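To make the function-word comparison described above more concrete, here is a minimal sketch in Python. It is not the code behind the reported results. It assumes a cosine-based variant of the Delta method (suggested by the figure caption on cosines), in which each account's function-word frequencies are z-scored across the group and then compared by cosine similarity. The 10-word list is an abbreviated stand-in for the 150 function words used in the analysis, and the names `relative_frequencies`, `z_scores`, `cosine`, and `best_match` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: abbreviated stand-in for the 150 function words used in the article.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "is", "but"]


def relative_frequencies(text):
    """Return each function word's share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def z_scores(profiles):
    """Standardize each function-word column across all profiles."""
    n = len(profiles)
    columns = list(zip(*profiles))
    means = [sum(col) / n for col in columns]
    # A column with zero spread gets a divisor of 1.0, so its z-scores are 0.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in profiles
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(pseudo_text, candidates):
    """Compare one pseudo-user against candidate accounts.

    candidates maps an account name to that account's combined comment text.
    Returns the most similar account name and its cosine score.
    """
    names = list(candidates)
    profiles = [relative_frequencies(pseudo_text)]
    profiles += [relative_frequencies(candidates[name]) for name in names]
    standardized = z_scores(profiles)
    scores = {
        name: cosine(standardized[0], vec)
        for name, vec in zip(names, standardized[1:])
    }
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

Calling `best_match(pseudo_text, {"user_a": text_a, "user_b": text_b})` returns the candidate whose standardized function-word profile points in the direction closest to the pseudo-user's. With only a handful of words or comments the scores are unstable, which is consistent with the article's choice to filter out accounts that have short comment histories.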
