Scraping Twitter User Data Using Google and Tweepy

Anthony Baum · Mar 4

With ever-increasing value being placed on the effectiveness of social media in marketing, mining data from social platforms is a critical piece of the ad-tech puzzle.
Free developer API access to social data is becoming more and more restrictive, and so easily accessing the right data can be a challenge.
Twitter is an exception to the rule, and the Tweepy module for Python is an easy-to-use, well-documented package for accessing the platform externally.
For application development, the API is great but presents challenges for us data folk.
We’re typically interested in mass data collection rather than building something that uses the API one “object” at a time.
There’s no easy way to grab bulk user data matching specific criteria, but we can make use of what other users have created to get around the issue.
Twitter users are able to create lists containing accounts that relate to some given topic they want easy access to.
They also probably named that list with a topic-relevant keyword.
We can search for URLs matching “lists + keyword” within the Twitter domain, using the googlesearch module (link to GitHub at the end).
If the URL contains “/lists/” we know it’s a user-created list.
For this walkthrough, I’ve used the “influencers” keyword.
To keep the function from searching indefinitely, we add a couple of checks to break out of the search: in this case, if the search reaches five times more non-list URLs than list URLs, or if we’ve obtained the number of results we want (passed as results_to_obtain).
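Here’s a minimal sketch of that search. The function name, query string, and exact thresholds are illustrative assumptions; only the “/lists/” check and the two break conditions come from the approach described above.

```python
from googlesearch import search  # https://github.com/MarioVilas/googlesearch

def find_list_urls(keyword, results_to_obtain=20):
    """Google for user-created Twitter lists matching a keyword."""
    query = f"site:twitter.com {keyword} lists"
    list_urls = []
    non_list_count = 0
    for url in search(query, pause=2.0):
        if "/lists/" in url:
            list_urls.append(url)
        else:
            non_list_count += 1
        # Stop once we have enough lists, or once non-list results
        # outnumber list results 5:1 and we're likely wasting requests.
        if len(list_urls) >= results_to_obtain:
            break
        if non_list_count > 5 * max(len(list_urls), 1):
            break
    return list_urls

lists = find_list_urls("influencers", results_to_obtain=10)
```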
For each list we obtained, Tweepy’s list_members function will get all the users contained in that list.
For each user, a host of attributes are contained in a json accessible user object.
By creating a list for each attribute, appending chosen user data to those lists, and zipping the lists, a pandas dataframe containing user data can be created.
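A sketch of that collection step is below, assuming an authenticated Tweepy API object (the credentials are placeholders) and an illustrative selection of user attributes:

```python
import tweepy
import pandas as pd

# Placeholder credentials; substitute your own app's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def users_from_list(owner_screen_name, slug):
    """Build a dataframe of the members of one Twitter list.

    owner_screen_name and slug can be parsed from a list URL of the
    form twitter.com/<owner_screen_name>/lists/<slug>.
    """
    names, screen_names, followers, protected = [], [], [], []
    for user in tweepy.Cursor(api.list_members,
                              owner_screen_name=owner_screen_name,
                              slug=slug).items():
        names.append(user.name)
        screen_names.append(user.screen_name)
        followers.append(user.followers_count)
        protected.append(user.protected)
    return pd.DataFrame(
        list(zip(names, screen_names, followers, protected)),
        columns=["name", "screen_name", "followers", "protected"],
    )
```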
Just like that, we have a potentially sizeable dataset of users.
An important consideration at this point is that while the Tweepy code works whether or not the account it’s looking at is protected, trying to get any further data about a protected account will fail.
We need to remove any protected accounts from our dataframe.
Luckily, the Tweepy user object contains a boolean attribute for whether an account is protected.
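Assuming the boolean protected column built above, the filter is a one-liner:

```python
# Drop protected accounts, since their timelines are inaccessible to us
df = df[~df["protected"]].reset_index(drop=True)
```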
Now we know we can access all data about accounts in our dataset.
As we pulled users from lists of influencers, it makes sense to add some metrics on how engaged a given user’s audience is with their content.
Tweepy’s user_timeline(count=x) function returns the x most recent tweets for a given user.
Each tweet is returned in a json format including both retweet count and like count as attributes.
By pulling in the most recent 100 tweets, we can create dataframe columns with median retweets and likes for each user across their most recent activity.
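Here’s one way that step might look, reusing the authenticated api object from earlier; the function and column names are illustrative:

```python
import numpy as np

def median_engagement(screen_name, n_tweets=100):
    """Median retweet and favourite counts over a user's recent tweets."""
    tweets = api.user_timeline(screen_name=screen_name, count=n_tweets)
    rts = [tweet.retweet_count for tweet in tweets]
    favs = [tweet.favorite_count for tweet in tweets]
    return np.median(rts), np.median(favs)

df["median_rts"], df["median_favs"] = zip(
    *df["screen_name"].apply(median_engagement)
)
```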
As we can’t directly access engagement for those tweets, we need to create some kind of proxy metric.
A function of how engaged a user’s audience is with their activity makes sense, so we’ll define this metric as:

engagement = ((favs / follower_count) + (rts / follower_count)) / 2

While that’s the basic idea, more people typically favourite tweets than retweet them, so the weight of favourites would bias the metric.
To treat them equally, they need to be normalized.
Scikit-learn makes this easy, and we’ll use MinMaxScaler.
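A sketch of the normalization and the final metric, under the column names assumed above:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale both engagement columns to [0, 1] so favourites don't dominate
scaler = MinMaxScaler()
df[["favs_scaled", "rts_scaled"]] = scaler.fit_transform(
    df[["median_favs", "median_rts"]]
)

df["engagement"] = (
    df["favs_scaled"] / df["followers"]
    + df["rts_scaled"] / df["followers"]
) / 2

df = df.sort_values("engagement", ascending=False)
```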
That’s that.
Now we have an approximated idea of how much response a user’s typical activity will gain, and we can conduct some pretty robust analyses on top of this small dataset.
Here’s what the top 5 rows of the data we’ve put together look like when sorted by engagement.
Access the GitHub repo for the code yourself here: https://github.com/wxbaum/twitter_user_scraper

You can find documentation on the googlesearch module I’ve used here: https://github.com/MarioVilas/googlesearch