Road to Revolution: Socialism vs Communism

Subreddit Classification via PushShift API and Natural Language Processing

Ashley White · Dec 29

I have always found the ideologies behind socialism and communism compelling, particularly in an era when socio-economic inequity continues to plague society and inhibit true progress for humankind. While no political or economic system is perfect, one must acknowledge the shortcomings of capitalism and its role in fueling these inequities. In Marxist theory, socialism is the stage following capitalism in a society's transition to communism. Both are founded on the idea of collective cooperation, but they differ in that communists believe cooperation should be run by a government made up of a single political party. Whether you agree with one, the other, both, or neither, there is a lot to be learned from these theories about creating a more just and equal society.

Because of the subtle differences between the two, socialism and communism serve as a great case study for exploring the power of natural language processing. To do this, I conducted an experiment using text from the subreddits r/Communism and r/Socialism. Using the following steps, I turned these posts into a corpus that can be used to train a random forest or another classification algorithm:

1. Query the PushShift API to retrieve submissions
2. Clean and pre-process the text
3. Analyze the vectorized / tokenized text
4. Gridsearch to optimize hyperparameters across two classification algorithms

Step 1: Query PushShift API

Instead of pulling submissions directly from Reddit (whose API limits you to roughly 1,000 results per query), I leveraged the PushShift API, which maintains a historical archive of most subreddits. Through this API, I was able to pull each submission's title, text, author, and date.

Step 2: Clean & Pre-Process Text

After reading in the combined files from the two subreddits, I did some cleaning and pre-processing: namely, removing special characters, converting to lower case, and lemmatizing the resulting words. Additionally, I removed all tags for removed / deleted posts and moderator auto-posts (which are specific to each subreddit and can bias the training set).

Step 3: Analyze Vectorized / Tokenized Text

To further refine my text analysis, I tokenized the lemmatized text fields and identified which words were most rare and most frequent in each set. I first wanted to remove (via stop words) the most frequent words in the combined data frame, since those would be low-predictive features in the classification algorithm.

Step 4: Gridsearch to Optimize Hyperparameters

Finally, I wanted to compare the accuracy of two different classification algorithms: logistic regression and random forest. Both are powerful tools, but I found logistic regression to be the more reliable predictor, especially given the 75K+ tokenized features. Although the random forest generated a comparable accuracy score on the training data, I compared the models' ROC / AUC scores to see which would be more reliable across an average of thresholds.
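The querying step can be sketched as follows. This is a minimal, hypothetical version using only the standard library: the PushShift endpoint and the `fields` filter shown here are the publicly documented ones, but the post does not show its exact query, and the helper names (`build_url`, `fetch_page`) are my own.

```python
import json
import urllib.parse
import urllib.request

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_url(subreddit, before=None, size=100):
    # Request only the fields the experiment needs: title, text, author, date.
    params = {
        "subreddit": subreddit,
        "size": size,
        "fields": "title,selftext,author,created_utc",
    }
    if before is not None:
        # Paging backwards through history by created_utc timestamp lets us
        # collect far more than Reddit's own ~1,000-result limit.
        params["before"] = before
    return PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)

def fetch_page(subreddit, before=None):
    """Fetch one page of submissions as a list of dicts."""
    with urllib.request.urlopen(build_url(subreddit, before)) as resp:
        return json.load(resp)["data"]
```

Calling `fetch_page` in a loop, passing the oldest `created_utc` seen so far as `before`, walks the full archive of each subreddit.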
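The cleaning step described above can be sketched like this. It is a simplified stand-in for the post's actual pipeline: the regexes and the `DROP_MARKERS` set are my assumptions, and lemmatization (e.g. via NLTK's `WordNetLemmatizer`) would be applied to the cleaned tokens afterwards.

```python
import re

# Markers Reddit substitutes into selftext for removed/deleted posts;
# rows matching these (or empty posts) are dropped so they cannot bias training.
DROP_MARKERS = {"[removed]", "[deleted]", ""}

def clean_text(raw):
    """Lowercase, strip special characters, and collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()

def keep_post(selftext):
    """Return False for removed/deleted/empty posts."""
    return selftext.strip().lower() not in DROP_MARKERS
```

Moderator auto-posts would be filtered similarly, by matching on the known bot author names for each subreddit.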
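The frequency analysis in Step 3 amounts to counting tokens across the combined corpus and flagging the most common terms as stop-word candidates. A minimal sketch (the function names are mine; the post likely used a vectorizer such as sklearn's `CountVectorizer` for the same purpose):

```python
from collections import Counter

def term_frequencies(docs):
    """Count token frequencies across a list of cleaned, whitespace-tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def frequency_stop_words(docs, top_n=10):
    """The most frequent terms across both subreddits: shared, low-predictive
    words that are candidates for the stop-word list."""
    return [word for word, _ in term_frequencies(docs).most_common(top_n)]
```

Words that dominate both r/Communism and r/Socialism carry little signal for telling the two apart, which is why they are removed before fitting the classifier.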
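The model comparison in Step 4 can be sketched with scikit-learn. The post does not list the exact hyperparameter grids it searched, so the grids below are illustrative; scoring on `roc_auc` mirrors the ROC / AUC comparison described above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def make_search(estimator, param_grid, cv=3):
    """Wrap vectorizer + classifier in one pipeline and grid-search it,
    scoring on ROC AUC so models are compared across all thresholds."""
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", estimator),
    ])
    return GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")

# Hypothetical grids -- the post does not state which hyperparameters were tuned.
logreg_search = make_search(LogisticRegression(max_iter=1000),
                            {"clf__C": [0.1, 1.0, 10.0]})
forest_search = make_search(RandomForestClassifier(),
                            {"clf__n_estimators": [100, 300]})
```

After fitting both searches on the labeled posts (e.g. r/Socialism = 1, r/Communism = 0), comparing their `best_score_` values gives the threshold-averaged reliability comparison the post describes.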
