Scraping Reddit data

How to scrape data from Reddit using the Python Reddit API Wrapper (PRAW)

Gilbert Tanner, Jan 5

Photo by Fabian Grohs on Unsplash

As its name suggests, PRAW is a Python wrapper for the Reddit API, which enables you to scrape data from subreddits, create a bot and much more.

In this article, we will learn how to use PRAW to scrape posts from different subreddits as well as how to get comments from a specific post.

Getting Started

PRAW can be installed using pip (pip install praw) or conda (conda install -c conda-forge praw).

Now PRAW can be imported by writing:

import praw

Before it can be used to scrape data, we need to authenticate ourselves.

For this, we need to create a Reddit instance and provide it with a client_id, client_secret and a user_agent.
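A minimal sketch of this step is shown below. The helper names build_user_agent and make_reddit and the placeholder credential strings are assumptions made for illustration; praw.Reddit and its client_id, client_secret and user_agent keyword arguments come from PRAW itself. The user agent string follows the pattern suggested in Reddit's API rules.

```python
def build_user_agent(platform, app_id, version, username):
    """Build a descriptive user agent string (hypothetical helper).

    Follows the pattern suggested in Reddit's API rules:
    <platform>:<app ID>:<version string> (by /u/<reddit username>)
    """
    return f"{platform}:{app_id}:{version} (by /u/{username})"


def make_reddit(client_id, client_secret, user_agent):
    """Create a read-only praw.Reddit instance (hypothetical helper)."""
    # Imported here so that build_user_agent works without PRAW installed.
    import praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


# Example usage (placeholder values, replace them with your own credentials):
# reddit = make_reddit(
#     client_id="my_client_id",
#     client_secret="my_client_secret",
#     user_agent=build_user_agent("python", "my_scraper", "v0.1", "my_username"),
# )
```

With only these three values the instance is read-only, which is enough for scraping public posts and comments.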

To get the authentication information, we need to create a Reddit app by navigating to the Reddit app preferences page (https://www.reddit.com/prefs/apps) and clicking create app or create another app.

Figure 1: Reddit Application

This will open a form where you need to fill in a name, description and redirect URI.

For the redirect URI you should choose http://localhost:8080, as described in the excellent PRAW documentation.

Figure 2: Create new Reddit Application

After pressing create app, a new application will appear.

Here you can find the authentication information needed to create the praw.Reddit instance.

Figure 3: Authentication information

Get subreddit data

Now that we have a praw.Reddit instance, we can access all available functions and use it, for example, to get the 10 “hottest” posts from the Machine Learning subreddit.
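As a hedged sketch, fetching the hot posts could look like the following. The function name hot_titles is an assumption made for illustration; reddit.subreddit(...) and the hot(limit=...) listing are PRAW calls.

```python
def hot_titles(reddit, subreddit_name, limit=10):
    """Return the titles of the current "hot" posts of a subreddit."""
    subreddit = reddit.subreddit(subreddit_name)
    # hot() yields submissions lazily; limit caps how many are fetched.
    return [post.title for post in subreddit.hot(limit=limit)]


# Example usage, assuming `reddit` is an authenticated praw.Reddit instance:
# for title in hot_titles(reddit, "MachineLearning", limit=10):
#     print(title)
```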

Output:
[D] What is the best ML paper you read in 2018 and why?
[D] Machine Learning – WAYR (What Are You Reading) – Week 53
[R] A Geometric Theory of Higher-Order Automatic Differentiation
UC Berkeley and Berkeley AI Research published all materials of CS 188: Introduction to Artificial Intelligence, Fall 2018
[Research] Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks.

We can also get the 10 “hottest” posts of all subreddits combined by specifying “all” as the subreddit name.

Output:
I've been lying to my wife about film plots for years.
I don’t care if this gets downvoted into oblivion!.I DID IT REDDIT!!
I’ve had enough of your shit, Karen
Stranger Things 3: Coming July 4th, 2019.

The returned listing can be iterated over, and features including the post title, id and url can be extracted and saved into a .csv file.
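One possible way to do this, sketched with the csv module from the standard library, is shown below. The function names, the column order and the file name hot_posts.csv are assumptions made for illustration; title, id and url are attributes of PRAW submissions.

```python
import csv


def posts_to_rows(posts):
    """Extract (title, id, url) from each submission-like object."""
    return [(post.title, post.id, post.url) for post in posts]


def write_posts_csv(rows, file_obj):
    """Write a header line plus one line per post to an open text file."""
    writer = csv.writer(file_obj)
    writer.writerow(["title", "id", "url"])
    writer.writerows(rows)


# Example usage, assuming `reddit` is an authenticated praw.Reddit instance:
# hot_posts = reddit.subreddit("MachineLearning").hot(limit=10)
# with open("hot_posts.csv", "w", newline="", encoding="utf-8") as f:
#     write_posts_csv(posts_to_rows(hot_posts), f)
```

Opening the file with newline="" lets the csv module control line endings, and titles that contain commas are quoted automatically.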

Figure 4: Hottest ML posts

General information about the subreddit can be obtained using the .description attribute of the subreddit object.

Output:
**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------

Get comments from a specific post

You can get the comments for a post/submission by creating/obtaining a Submission object and looping through the comments attribute.

To get a post/submission, we can either iterate through the submissions of a subreddit or specify a specific submission using reddit.submission and passing it the submission url or id.

To get the top-level comments, we only need to iterate over submission.comments.
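A minimal sketch of this naive iteration follows. The function name top_level_bodies and the submission id "abc123" are placeholders made up for illustration; reddit.submission(id=...) and the comments attribute are part of PRAW.

```python
def top_level_bodies(submission):
    """Return the body text of every top-level comment of a submission.

    Raises AttributeError if an item in the listing has no `body`
    attribute.
    """
    return [comment.body for comment in submission.comments]


# Example usage, assuming `reddit` is an authenticated praw.Reddit instance
# and "abc123" stands in for a real submission id:
# submission = reddit.submission(id="abc123")
# for body in top_level_bodies(submission):
#     print(body)
```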

This will work for some submissions, but for others that have more comments, accessing the body of each item will raise an AttributeError saying:

AttributeError: 'MoreComments' object has no attribute 'body'

These MoreComments objects represent the “load more comments” and “continue this thread” links encountered on the website, as described in more detail in the comment documentation.

To get rid of the MoreComments objects, we can check the datatype of each comment before printing the body.

But PRAW already provides a method called replace_more, which replaces or removes the MoreComments objects.

The method takes an argument called limit, which, when set to 0, will remove all MoreComments objects.
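Both options can be sketched as follows. The function names are assumptions made for illustration; MoreComments (from praw.models) and replace_more(limit=0) are PRAW's own.

```python
def bodies_with_type_check(submission):
    """Skip MoreComments placeholders by checking each item's type."""
    # Imported lazily so this module can be loaded without PRAW installed.
    from praw.models import MoreComments

    return [
        comment.body
        for comment in submission.comments
        if not isinstance(comment, MoreComments)
    ]


def bodies_with_replace_more(submission):
    """Remove all MoreComments placeholders, then read every body."""
    # limit=0 removes the placeholders without fetching further comments.
    submission.comments.replace_more(limit=0)
    return [comment.body for comment in submission.comments]
```

The second variant modifies the comment forest in place, so later loops over submission.comments no longer encounter placeholders.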

Both approaches successfully iterate over all the top-level comments and print their body.

The output can be seen below.

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool. To think how keen their senses must be to recognize and avoid each other and their territories. Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.

However, the comment section can be arbitrarily deep, and most of the time we also want to get the comments of the comments.

CommentForest provides the .list method, which can be used for getting all comments inside the comment section.
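A short sketch of this, assuming a submission obtained as before; the function name all_comment_bodies is made up for illustration, while replace_more and list() are CommentForest methods.

```python
def all_comment_bodies(submission):
    """Return the body of every comment in the tree, at any depth."""
    # Drop MoreComments placeholders first so every item has a body.
    submission.comments.replace_more(limit=0)
    # CommentForest.list() flattens the whole tree into a single list.
    return [comment.body for comment in submission.comments.list()]
```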

This will first output all the top-level comments, followed by the second-level comments and so on, until there are no comments left.

Recommended Reading

Web Scraping using Selenium and BeautifulSoup: How to use Selenium to navigate between pages and use it to scrape HTML loaded with JavaScript. (towardsdatascience.com)

Conclusion

PRAW is a Python wrapper for the Reddit API, which enables us to use the Reddit API with a clean Python interface.

The API can be used for web scraping, creating a bot and much more.

This article covered authentication, getting posts from a subreddit and getting comments.

To learn more about the API, I suggest taking a look at its excellent documentation.

If you liked this article, consider subscribing to my YouTube channel and following me on social media.

The code covered in this article is available as a GitHub repository.

If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.
