Scrape Reddit data using Python and Google BigQuery

```python
# Save the scraped comments to a CSV file
df.to_csv('cordcutter_comments.csv', index=False)
```

You can find the final version of the code in my GitHub repository:

akhilesh-reddy/Cable-cord-cutter-lift-and-sentiment-analysis-using-Reddit-data (github.com): Scraped data from Reddit and performed named entity recognition and topic modelling on the comments to understand public…

Our final output looks like this:

We have our data, but there is one challenge here.

Generally, it takes much more time to get months of historical data using the Reddit API.
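To see why, here is a minimal sketch, assuming a PRAW client configured with the OAuth2 credentials created earlier in this post (the credential strings below are placeholders): Reddit listings stop after roughly 1,000 items, so even paging back as far as the API allows only reaches the most recent comments.

```python
import praw

# Placeholder credentials -- substitute the OAuth2 values created earlier.
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='cordcutter-scraper by /u/your_username')

# Even with limit=None, Reddit listings are capped at roughly 1,000 items,
# so this reaches only the most recent comments, not months of history.
comments = list(reddit.subreddit('cordcutters').comments(limit=None))
print(len(comments))
```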

Thanks to Jason Michael Baumgartner of Pushshift.io (a.k.a. /u/Stuck_In_The_Matrix on Reddit), we have years of historical Reddit data cleaned and stored in BigQuery, which brings us to the second part of this post.

Reddit data in BigQuery:

For those who do not know what BigQuery is: Google BigQuery is an enterprise data warehouse that enables super-fast SQL queries using the processing power of Google’s infrastructure.

The best part is that querying this data is free. Google’s free tier includes the first 10 GB of storage and the first 1 TB of query processing each month, and our task requires far less than 1 TB.
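You can even verify that before running anything: BigQuery will report how much data a query would scan without executing it. Here is a minimal sketch using the google-cloud-bigquery client library, assuming you have a GCP project and credentials set up (the project ID is a placeholder, and the table is the one we query later in this post):

```python
from google.cloud import bigquery

# 'my-reddit-project' is a placeholder -- use your own GCP project ID.
client = bigquery.Client(project='my-reddit-project')

sql = """
    SELECT subreddit, body, created_utc
    FROM `fh-bigquery.reddit_comments.2018_08`
    WHERE subreddit = 'cordcutters'
"""

# A dry run estimates the bytes a query would scan without running it,
# so it costs nothing and lets you confirm you stay inside the free tier.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB.")
```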

Let’s look at how to query this information.

First, click on this Google BigQuery link to get started. Google will automatically log you in with the Google credentials stored in your browser.

If this is your first time on BigQuery, a dialog box will ask you to create a project.

Click on the “Create project” button.

Give the organization a name and click on “Create project” at the top.

Give the project a name; you can leave the location box as it is for now.

Then click on “Create”.

Your project is now created and a dashboard appears on the screen.

After this, click on the link. This will open the Reddit datasets under the project you have just created. On the left-hand side, you will see datasets updated for each month under the schema name fh-bigquery.
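You can also enumerate those monthly tables programmatically. Here is a quick sketch with the google-cloud-bigquery client library, reusing the placeholder client setup from the dry-run example above:

```python
from google.cloud import bigquery

client = bigquery.Client(project='my-reddit-project')  # placeholder project ID

# fh-bigquery.reddit_comments is publicly readable, so we can list
# its monthly comment tables directly.
for table in client.list_tables('fh-bigquery.reddit_comments'):
    print(table.table_id)  # e.g. 2018_08
```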

Let’s run a query to get one month of data from the table:

```sql
SELECT subreddit, body, created_utc
FROM `fh-bigquery.reddit_comments.2018_08`
WHERE subreddit = 'cordcutters'
```

This will get you all the comments from the ‘cordcutters’ subreddit for August 2018.

Make sure you leave the “Use Legacy SQL” checkbox in the options unchecked, since the snippet above is written in standard SQL. You can use legacy SQL instead if you prefer, changing the query accordingly (legacy SQL references the table as [fh-bigquery:reddit_comments.2018_08] rather than with backticks).

This is how the result looks, and you can download it as a CSV by clicking the “Download as CSV” button.
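If you would rather skip the web UI, the same query can be run from Python and written straight to CSV, mirroring what we did with the API data earlier. Here is a minimal sketch with the google-cloud-bigquery client library and pandas, using the same placeholder project ID as before:

```python
from google.cloud import bigquery

client = bigquery.Client(project='my-reddit-project')  # placeholder project ID

sql = """
    SELECT subreddit, body, created_utc
    FROM `fh-bigquery.reddit_comments.2018_08`
    WHERE subreddit = 'cordcutters'
"""

# Run the query, pull the results into a pandas DataFrame,
# and save them just like the API results earlier in the post.
df = client.query(sql).to_dataframe()
df.to_csv('cordcutter_comments_2018_08.csv', index=False)
```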

Here, I have focused just on getting the data we require.

If you want to play more with Reddit data on BigQuery, you can refer to this article by Max Woolf, which goes into much more detail.

Summary:

In this post, we have seen how to create OAuth2 credentials for connecting to Reddit, how to make requests to the Reddit API for the most recent data, and how to query historical data very quickly through Google BigQuery.

In addition to getting data through an API and BigQuery, you might find it interesting to look at web scraping using Selenium and Python. The following article on that topic is by a fellow classmate (Atindra Bandi) at UT Austin:

Web Scraping Using Selenium — Python (towardsdatascience.com): In this article, you’ll learn how to navigate through multiple pages of a website and scrape large amounts of data…

That’s all, folks! Stay tuned for a series of articles on recommendation systems, statistics for data science, and data visualization from me in the coming weeks.
