Analyzing Tweets with NLP in minutes with Spark, Optimus and Twint

Social media has been a gold mine for studying the way people communicate and behave. In this article I'll show you the easiest way to analyze tweets without the Twitter API, in a way that scales to Big Data.

Favio Vázquez, May 5

Introduction

If you are here, it's likely that you are interested in analyzing tweets (or something similar) and you have a lot of them, or can get them.

One of the most annoying parts of that is registering a Twitter application, setting up the authentication and all of that.

And then, if you are using Pandas, there's no way to scale that.

So what about a system that doesn't have to authenticate with the Twitter API, that can get an (almost) unlimited number of tweets, and that has the power to analyze them with NLP and more?

Well, you're in for a treat, because that's exactly what I'm going to show you right now.

Getting the project and repo

https://matrixds.com/

You can follow everything I'm going to show you very easily. Just forklift this MatrixDS project:

MatrixDS | The Data Project Workbench: MatrixDS is a place to build, share and manage data projects at any scale.

There's also a GitHub repo with everything: FavioVazquez/twitter_optimus_twint (Analyzing tweets with Twint, Optimus and Apache Spark), on github.com.

With MatrixDS you can actually run the notebooks, get the data and run the analysis for free, so if you want to learn more please do it.

Getting Twint and Optimus

Twint utilizes Twitter's search operators to let you scrape Tweets from specific users, scrape Tweets relating to certain topics, hashtags & trends, or sort out sensitive information from Tweets like e-mail and phone numbers.

With Optimus, a library I co-created, you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras.

So let's first install everything you need. When you are in the Matrix project, go to the Analyze Tweets notebook and run (you can also do this from the JupyterLab terminal):

    !pip install --user -r requirements.txt

After that, we need to install Twint. For that run:

    !pip install --upgrade --user -e git+https://github.[...]git@origin/master#egg=twint

This will download a src/ folder, so we need to do some config:

    !mv src/twint .
    !rm -r src

Then, to import Twint, we need to run:

    %load_ext autoreload
    %autoreload 2
    import sys
    sys.path.append("twint/")

and finally:

    import twint

Optimus was installed in the first step, so let's just start it (this will start a Spark cluster for you):

    from optimus import Optimus
    op = Optimus()

Setup Twint for scraping tweets

https://www.[...]com/2015/11/3/9661180/twitter-vine-favorite-fav-likes-hearts

    # Set up TWINT config
    c = twint.Config()

If you are running this in a notebook you'll also need to run:

    # Solve compatibility issues with notebooks and RunTime errors.
    import nest_asyncio
    nest_asyncio.apply()

Search for data science tweets

I'll start our analysis by scraping tweets about data science; you can change this to whatever you want.

To do that we just need to run this:

    c.Search = "data science"
    # Custom output format
    c.Format = "Username: {username} | Tweet: {tweet}"
    c.Limit = 1
    c.Pandas = True
    twint.run.Search(c)

Let me explain this code to you. In the last section, when we ran:

    c = twint.Config()

we started a new Twint configuration. After that we need to set the different options we want to use for scraping tweets.

Here's the full list of configuration options:

Variable (Type) - Description

Username (string) - Twitter user's username
User_id (string) - Twitter user's user_id
Search (string) - Search terms
Geo (string) - Geo coordinates (lat,lon,km/mi.)
Location (bool) - Set to True to attempt to grab a Twitter user's location (slow).
Near (string) - Near a certain City (Example: london)
Lang (string) - Compatible language codes: https://github.com/twintproject/twint/wiki/Langauge-codes
Output (string) - Name of the output file.
Elasticsearch (string) - Elasticsearch instance
Timedelta (int) - Time interval for every request (days)
Year (string) - Filter Tweets before the specified year.
Since (string) - Filter Tweets sent since date (Example: 2017-12-27).
Until (string) - Filter Tweets sent until date (Example: 2017-12-27).
Email (bool) - Set to True to show Tweets that _might_ contain emails.
Phone (bool) - Set to True to show Tweets that _might_ contain phone numbers.
Verified (bool) - Set to True to only show Tweets by _verified_ users
Store_csv (bool) - Set to True to write as a csv file.
Store_json (bool) - Set to True to write as a json file.
Custom (dict) - Custom csv/json formatting (see below).
Show_hashtags (bool) - Set to True to show hashtags in the terminal output.
Limit (int) - Number of Tweets to pull (Increments of 20).
Count (bool) - Count the total number of Tweets fetched.
Stats (bool) - Set to True to show Tweet stats in the terminal output.
Database (string) - Store Tweets in a sqlite3 database. Set this to the DB. (Example: twitter.db)
To (string) - Display Tweets tweeted _to_ the specified user.
All (string) - Display all Tweets associated with the mentioned user.
Debug (bool) - Store information in debug logs.
Format (string) - Custom terminal output formatting.
Essid (string) - Elasticsearch session ID.
User_full (bool) - Set to True to display full user information. By default, only usernames are shown.
Profile_full (bool) - Set to True to use a slow, but effective method to enumerate a user's Timeline.
Store_object (bool) - Store tweets/user infos/usernames in JSON objects.
Store_pandas (bool) - Save Tweets in a DataFrame (Pandas) file.
Pandas_type (string) - Specify HDF5 or Pickle (HDF5 as default).
Pandas (bool) - Enable Pandas integration.
Index_tweets (string) - Custom Elasticsearch Index name for Tweets (default: twinttweets).
Index_follow (string) - Custom Elasticsearch Index name for Follows (default: twintgraph).
Index_users (string) - Custom Elasticsearch Index name for Users (default: twintuser).
Index_type (string) - Custom Elasticsearch Document type (default: items).
Retries_count (int) - Number of retries of requests (default: 10).
Resume (int) - Resume from a specific tweet id (**currently broken, January 11, 2019**).
Images (bool) - Display only Tweets with images.
Videos (bool) - Display only Tweets with videos.
Media (bool) - Display Tweets with only images or videos.
Replies (bool) - Display replies to a subject.
Pandas_clean (bool) - Automatically clean Pandas dataframe at every scrape.
Lowercase (bool) - Automatically convert uppercases in lowercases.
Pandas_au (bool) - Automatically update the Pandas dataframe at every scrape.
Proxy_host (string) - Proxy hostname or IP.
Proxy_port (int) - Proxy port.
Proxy_type (string) - Proxy type.
Tor_control_port (int) - Tor control port.
Tor_control_password (string) - Tor control password (not hashed).
Retweets (bool) - Display replies to a subject.
Hide_output (bool) - Hide output.
Get_replies (bool) - All replies to the tweet.
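To make the list a bit more concrete, here is a minimal sketch of setting several of these options in one place. The helper name `configure_search` is made up for illustration, and `SimpleNamespace` is only a stand-in so the snippet runs without Twint installed; in the notebook you would presumably pass `twint.Config()` instead.

```python
from types import SimpleNamespace


def configure_search(config, search, since=None, until=None, lang=None, limit=1):
    """Set a handful of the options listed above on a Twint-style config object.

    `configure_search` is a hypothetical helper, not part of Twint.
    """
    config.Search = search
    config.Limit = limit
    config.Pandas = True        # keep results available for Pandas
    config.Hide_output = True   # do not print every tweet
    # Optional filters are only set when the caller provides them.
    if since is not None:
        config.Since = since
    if until is not None:
        config.Until = until
    if lang is not None:
        config.Lang = lang
    return config


# Stand-in for twint.Config(), an assumption made so the sketch runs anywhere.
c = configure_search(SimpleNamespace(), "data science", since="2019-01-01", lang="en")
print(vars(c))
```

Keeping the option names in one function makes it easier to see which filters a given scrape used.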

So in this code:

    c.Search = "data science"
    # Custom output format
    c.Format = "Username: {username} | Tweet: {tweet}"
    c.Limit = 1
    c.Pandas = True

we are setting the search term, then formatting the response (just to check), getting only 20 tweets with Limit = 1 (they come in increments of 20), and finally making the result compatible with Pandas.

Then when we run:

    twint.run.Search(c)

we are launching the search.

The result is:

    Username: tmj_phl_pharm | Tweet: If you're looking for work in Spring House, PA, check out this Biotech/Clinical/R&D/Science job via the link in our bio: KellyOCG Exclusive: Data Access Analyst in Spring House, PA- Direct Hire at Kelly Services #KellyJobs #KellyServices
    Username: DataSci_Plow | Tweet: Bring your Jupyter Notebook to life with interactive widgets https://www.[...]io/post/bring-your-jupyter-notebook-to-life-with-interactive-widgets?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: ottofwagner | Tweet: Top 7 R Packages for Data Science and AI https://noeliagorod.com/2019/03/07/top-7-r-packages-for-data-science-and-ai/ … #DataScience #rstats #MachineLearning
    Username: semigoose1 | Tweet: ëäSujy #crypto #bitcoin #java #competition #influencer #datascience #fintech #science #EU https://vk.com/id15800296 https://semigreeth.[...]com/2019/05/03/easujy-crypto-bitcoin-java-competition-influencer-datascience-fintech-science-eu- https-vk-com-id15800296/ …
    Username: Datascience__ | Tweet: Introduction to Data Analytics for Business http://zpy.io/c736cf9f #datascience #ad
    Username: Datascience__ | Tweet: How Entrepreneurs in Emerging Markets can master the Blockchain Technology http://zpy.io/f5fad501 #datascience #ad
    Username: viktor_spas | Tweet: [Перевод] Почему Data Science командам нужны универсалы, а не специалисты https://habr.[...]it&utm_medium=twitter&utm_campaign=450420 … pic.[...]com/i98frTwPSE
    Username: gp_pulipaka | Tweet: Orchestra is a #RPA for Orchestrating Project Teams. #BigData #Analytics #DataScience #AI #MachineLearning #Robotics #IoT #IIoT #PyTorch #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann http://bit.ly/2Hn6qYd pic.[...]com/kXizChP59U
    Username: amruthasuri | Tweet: "Here's a typical example of a day in the life of a RagingFX trader. Yesterday I received these two signals at 10am EST. Here's what I did. My other activities have kept me so busy that .[...]ly/2Jm9WT1 #Learning #DataScience #bigdata #Fintech pic.[...]com/Jbes6ro1lY
    Username: PapersTrending | Tweet: [1/10] Real numbers, data science and chaos: How to fit any dataset with a single parameter – 192 stars – pdf: https://arxiv.[...]pdf … – github: https://github.com/Ranlot/single-parameter-fit …
    Username: webAnalyste | Tweet: Building Data Science Capabilities Means Playing the Long Game http://dlvr.it/R41k3t pic.[...]com/Et5CskR2h4
    Username: DataSci_Plow | Tweet: Building Data Science Capabilities Means Playing the Long Game https://www.[...]io/post/building-data-science-capabilities-means-playing-the-long-game?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: webAnalyste | Tweet: Towards Well Being, with Data Science (part 2) http://dlvr.it/R41k1K pic.[...]com/4VbljUcsLh
    Username: DataSci_Plow | Tweet: Understanding when Simple and Multiple Linear Regression give Different Results https://www.[...]io/post/understanding-when-simple-and-multiple-linear-regression-give-different-results?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: DataSci_Plow | Tweet: Artificial Curiosity https://www.[...]io/post/artificial-curiosity?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: gp_pulipaka | Tweet: Synchronizing the Digital #SCM using AI for Supply Chain Planning. #BigData #Analytics #DataScience #AI #RPA #MachineLearning #IoT #IIoT #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann http://bit.ly/2KX8vrt pic.[...]com/tftxwilkQf
    Username: DataSci_Plow | Tweet: Extreme Rare Event Classification using Autoencoders in Keras https://www.[...]io/post/extreme-rare-event-classification-using-autoencoders-in-keras?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: DataSci_Plow | Tweet: Five Methods to Debug your Neural Network https://www.[...]io/post/five-methods-to-debug-your-neural-network?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
    Username: iamjony94 | Tweet: 26 Mobile and Desktop Tools for Marketers http://bit.ly/2LkL3cN #socialmedia #digitalmarketing #contentmarketing #growthhacking #startup #SEO #ecommerce #marketing #influencermarketing #blogging #infographic #deeplearning #ai #machinelearning #bigdata #datascience #fintech pic.[...]com/mxHiY4eNXR
    Username: TDWI | Tweet: #ATL #DataPros: Our #analyst, @prussom is headed your way to speak @ the #FDSRoadTour on Wed, 5/8! Register to attend for free, learn about Modern #DataManagement in the Age of #Cloud & #DataScience: Trends, Challenges & Opportunities. [...]ly/2WlYOJb #Atlanta #freeevent

Doesn't look that good, but we got what we wanted.

TWEETS!

Saving results into Pandas

Sadly there's no direct connection between Twint and Spark, but we can do it with Pandas and then pass the result to Optimus.

I created two simple functions, which you can see in the actual project, that help you with Pandas and the weird Twint API for this part.

So when we run this:

    available_columns()

You'll see:

    Index(['conversation_id', 'created_at', 'date', 'day', 'hashtags', 'hour',
           'id', 'link', 'location', 'name', 'near', 'nlikes', 'nreplies',
           'nretweets', 'place', 'profile_image_url', 'quote_url', 'retweet',
           'search', 'timezone', 'tweet', 'user_id', 'user_id_str', 'username'],
          dtype='object')

These are the columns we have from the query we just did. There are a lot of different things to do with this data, but for this article I'll only use some of them.

So to transform the result from Twint to Pandas we run:

    df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

and you'll see this Pandas DF. Much better, isn't it?

Sentiment Analysis (the simple way)

We will run a sentiment analysis on some tweets, using Optimus and TextBlob, a library for NLP.

The first thing we need to do is clean these tweets, and for that Optimus is the best choice. To save the data as an Optimus (Spark) DF we need to run:

    df = op.create.data_frame(pdf=df_pd)

We'll just remove accents and special characters with Optimus (in a real work scenario you need to do much more than this, like removing links, images, and stopwords):

    clean_tweets = df.cols.remove_accents("tweet").cols.remove_special_chars("tweet")

Then we need to collect these tweets from Spark to get them in a Python list:

    tweets = clean_tweets.select("tweet").rdd.flatMap(lambda x: x).collect()

Then, to analyze the sentiment of these tweets, we will use TextBlob's sentiment function:

    from textblob import TextBlob
    from IPython.display import Markdown, display

    # Pretty printing the result
    def printmd(string, color=None):
        colorstr = "<span style='color:{}'>{}</span>".format(color, string)
        display(Markdown(colorstr))

    for tweet in tweets:
        print(tweet)
        analysis = TextBlob(tweet)
        print(analysis.sentiment)
        if analysis.sentiment[0] > 0:
            printmd('Positive', color="green")
        elif analysis.sentiment[0] < 0:
            printmd('Negative', color="red")
        else:
            printmd("No result", color="grey")
        print("")

That will give us:

    IAM Platform Curated Retweet Via httpstwittercomarmaninspace ArtificialIntelligence AI What About The User Experience httpswwwforbescomsitestomtaulli20190427artificialintelligenceaiwhatabouttheuserexperience AI DataScience MachineLearning BigData DeepLearning Robots IoT ML DL IAMPlatform TopInfluence ArtificialIntelligence
    Sentiment(polarity=0.0, subjectivity=0.0)
    Neutral

    Seattle Data Science Career Advice Landing a Job in The Emerald City Tips from Metis Seattle Career Advisor Marybeth Redmond – httpsbitly2IYjzaj pictwittercom98hMYZVxsu
    Sentiment(polarity=0.0, subjectivity=0.0)
    Neutral

    This webinarworkshop is designed for business leaders data science managers and decision makers who want to build effective AI and data science capabilities for their organization Register here httpsbitly2GDQeQT pictwittercomxENQ0Dtv1X
    Sentiment(polarity=0.6, subjectivity=0.8)
    Positive

    Contoh yang menarik dari sport science kali ini dari sisi statistik dan pemetaan lapangan Dengan makin gencarnya scientific method masuk di sport maka pengolahan data seperti ini akan semakin menjadi hal biasa httpslnkdinfQHqgjh
    Sentiment(polarity=0.0, subjectivity=0.0)
    Neutral

    Complete handson machine learning tutorial with data science Tensorflow artificial intelligence and neural networks Machine Learning Data Science and Deep Learning with Python httpsmedia4yousocialcareerdevelopmenthtmlmachinelearning python machine learning online data science udemy elearning pictwittercomqgGVzRUFAM
    Sentiment(polarity=-0.16666666666666666, subjectivity=0.6)
    Negative

    We share criminal data bases have science and medical collaoarations Freedom of movement means we can live and work in EU countries with no hassle at all much easier if youre from a poorer background We have Erasmus loads more good things
    Sentiment(polarity=0.18939393939393936, subjectivity=0.39166666666666666)
    Positive

    Value of Manufacturers Shipments for Durable Goods BigData DataScience housing rstats ggplot pictwittercomXy0UIQtNHy
    Sentiment(polarity=0.0, subjectivity=0.0)
    Neutral

    Top DataScience and MachineLearning Methods Used in 2018 2019 AI MoRebaie TMounaged AINow6 JulezNorton httpswwwkdnuggetscom201904topdatasciencemachinelearningmethods20182019html
    Sentiment(polarity=0.5, subjectivity=0.5)
    Positive

    Come check out the Santa Monica Data Science Artificial Intelligence meetup to learn about In PersonComplete Handson Machine Learning Tutorial with Data Science httpbitly2IRh0GU
    Sentiment(polarity=-0.6, subjectivity=1.0)
    Negative

    Great talks about the future of multimodality clinical translation and data science Very inspiring 1stPETMRIsymposium unitue PETMRI molecularimaging AI pictwittercomO542P9PKXF
    Sentiment(polarity=0.4833333333333334, subjectivity=0.625)
    Positive

    Did engineering now into data science last 5 years and doing MSC in data science this year
    Sentiment(polarity=0.0, subjectivity=0.06666666666666667)
    Neutral

    Program Officer – Data Science httpbitly2PV3ROF
    Sentiment(polarity=0.0, subjectivity=0.[...]

And so on.
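Notice how links end up fused into tokens like "httpsbitly2IYjzaj" in the output above: stripping special characters without removing URLs first leaves the rest of the link behind. As a rough illustration of the extra cleaning mentioned earlier (removing links and image references), here is a plain-Python sketch. It is not the Optimus API; the function name `clean_tweet`, the regular expressions and the sample text are all assumptions made for this example.

```python
import re
import unicodedata

# Matches http(s) links and Twitter image links such as pic.twitter.com/abc123.
URL_RE = re.compile(r"(https?://\S+|pic\.twitter\.com/\S+)")
# Anything that is not an ASCII letter, a digit or whitespace.
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def clean_tweet(text):
    """Drop links, strip accents, remove special characters, squeeze spaces."""
    text = URL_RE.sub(" ", text)
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = NON_ALNUM_RE.sub("", text)
    return " ".join(text.split())


# Made-up sample in the style of the scraped tweets.
sample = ("Découvrez: Top 7 R Packages for Data Science "
          "https://example.com/post … #DataScience pic.twitter.com/abc123")
cleaned = clean_tweet(sample)
print(cleaned)
```

The same function could later be wrapped as a Spark UDF if the cleaning has to run in a distributed way.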

Well, that was extremely easy, but it won't scale, because in the end we are collecting the data from Spark, so the driver's RAM is the limit. Let's do it a little better.

Add sentiment directly to a Spark DataFrame

Transforming this code into Spark code is simple. The same pattern can help you transform other code as well.
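One way to read that pattern: pull the per-row logic into a pure Python function that can be checked without Spark, and only then wrap it as a UDF. The sketch below assumes the 0.0 / 1.0 / 2.0 encoding used in this article; the name `polarity_to_label` is made up for illustration, and the Spark wrapping is left as comments because it needs pyspark installed.

```python
def polarity_to_label(polarity):
    """Map a polarity score to 0.0 (neutral), 1.0 (positive) or 2.0 (negative)."""
    if polarity == 0.0:
        return 0.0
    if polarity > 0.0:
        return 1.0
    return 2.0


# The pure function can be checked on the driver with plain Python values...
labels = [polarity_to_label(p) for p in (0.0, 0.6, -0.1666)]
print(labels)

# ...and only then wrapped for Spark (requires pyspark, so left as a comment):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# label_udf = udf(polarity_to_label, DoubleType())
```

Testing the plain function first keeps mistakes out of the cluster, where they are slower to spot.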

So let's start by importing the User Defined Function module from Spark:

    from pyspark.sql.functions import udf

Then we will transform the code from above into a function:

    def apply_blob(sentence):
        temp = TextBlob(sentence).sentiment[0]
        if temp == 0.0:
            return 0.0  # Neutral
        elif temp >= 0.0:
            return 1.0  # Positive
        else:
            return 2.0  # Negative

After that we will register the function as a Spark UDF:

    sentiment = udf(apply_blob)

Then, to apply the function to the whole dataframe, we need to write:

    clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet'])).show()

And we will see the DataFrame with the new sentiment column.

Sentiment analysis, the good programmer way (Making the code modular)

This is not actually quality code.

Let's transform this into functions that we can use over and over. The first part is setting up everything:

    %load_ext autoreload
    %autoreload 2

    # Import twint
    import sys
    sys.path.append("twint/")

    # Set up TWINT config
    import twint
    c = twint.Config()

    # Other imports
    import seaborn as sns
    import os
    from optimus import Optimus
    op = Optimus()

    # Solve compatibility issues with notebooks and RunTime errors.
    import nest_asyncio
    nest_asyncio.apply()

    # Disable annoying printing
    class HiddenPrints:
        def __enter__(self):
            self._original_stdout = sys.stdout
            sys.stdout = open(os.devnull, 'w')

        def __exit__(self, exc_type, exc_val, exc_tb):
            sys.stdout.close()
            sys.stdout = self._original_stdout

The last part is a class that suppresses Twint's automatic printing, so we just see the dataframe.
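As a side note, the standard library offers a similar tool: `contextlib.redirect_stdout`. The sketch below is an alternative to HiddenPrints, not what the project uses. `noisy_search` is a made-up stand-in for a call that prints a lot, and `run_quietly` is a hypothetical helper name.

```python
import contextlib
import io


def noisy_search():
    """Hypothetical stand-in for a call that prints every tweet it finds."""
    print("Username: someone | Tweet: hello")
    return 20


def run_quietly(func, *args, **kwargs):
    """Run func while capturing anything it prints, then return both."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


result, captured = run_quietly(noisy_search)
print(result)
```

Capturing into a buffer instead of discarding the output means it can still be inspected if something goes wrong.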

All of the things from above can be summarized in these functions:

    from textblob import TextBlob
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Function to get sentiment
    def apply_blob(sentence):
        temp = TextBlob(sentence).sentiment[0]
        if temp == 0.0:
            return 0.0  # Neutral
        elif temp >= 0.0:
            return 1.0  # Positive
        else:
            return 2.0  # Negative

    # UDF to write sentiment on DF
    sentiment = udf(apply_blob, DoubleType())

    # Transform result to pandas
    def twint_to_pandas(columns):
        return twint.[...]Tweets_df[columns]

    def tweets_sentiment(search, limit=1):
        c.Search = search
        # Custom output format
        c.Format = "Username: {username} | Tweet: {tweet}"
        c.Limit = limit
        c.Pandas = True
        with HiddenPrints():
            print(twint.run.Search(c))

        # Transform tweets to pandas DF
        df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

        # Transform Pandas DF to Optimus/Spark DF
        df = op.create.data_frame(pdf=df_pd)

        # Clean tweets
        clean_tweets = df.cols.remove_accents("tweet").cols.remove_special_chars("tweet")

        # Add sentiment to final DF
        return clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet']))

So to get the tweets and add sentiment we use:

    df_result = tweets_sentiment("data science", limit=1)
    df_result.show()

And that's it :)

Let's see the distribution of the sentiments:

    df_res_pandas = df_result.[...]27)})

Doing more with Twint

To see how to do this check: https://amueller.[...]html

We can do more stuff. Here I'll show you how to create a simple function to get tweets, and also how to build a word cloud from them.

So to get the tweets from a simple search:

    def get_tweets(search, limit=100):
        c = twint.Config()
        c.Search = search
        c.Limit = limit
        c.Pandas = True
        c.Pandas_clean = True

        with HiddenPrints():
            print(twint.run.Search(c))
        return twint.[...]Tweets_df[["username","tweet"]]

With this we can get thousands of tweets very easily:

    tweets = get_tweets("data science", limit=10000)
    tweets.count() # 10003

To generate a word cloud this is all we need to do:

    from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
    import matplotlib.pyplot as plt
    %matplotlib inline

    text = tweets.tweet.values

    # adding tweet-specific stopwords
    stopwords = set(STOPWORDS)
    stopwords.add("https")
    stopwords.add("xa0")
    stopwords.add("xa0'")
    stopwords.add("bitly")
    stopwords.add("bit")
    stopwords.add("ly")
    stopwords.add("twitter")
    stopwords.add("pic")

    wordcloud = WordCloud(
        background_color = 'black',
        width = 1000,
        height = 500,
        stopwords = stopwords).generate(str(text))

I added some stopwords that are common in tweets and don't matter to the analysis.
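To see what the word cloud is actually visualizing, here is a small standard-library sketch of the same idea: count words after dropping stopwords. It is not how the wordcloud library works internally; `top_words`, the tiny stopword sets and the sample tweets are all made up for illustration. One design note: the sketch joins the tweets with `" ".join(...)` rather than `str(...)`, since by default NumPy abbreviates the printed form of arrays with more than 1000 elements, so `str(text)` on a large array may not contain every tweet.

```python
import re
from collections import Counter

# Tweet-specific stopwords taken from the list above.
EXTRA_STOPWORDS = {"https", "bitly", "bit", "ly", "twitter", "pic"}
# A tiny illustrative stand-in for the wordcloud STOPWORDS set.
BASIC_STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "with"}


def top_words(tweets, n=3):
    """Return the n most frequent words across tweets, ignoring stopwords."""
    text = " ".join(tweets).lower()
    words = re.findall(r"[a-z]+", text)
    stop = BASIC_STOPWORDS | EXTRA_STOPWORDS
    return Counter(w for w in words if w not in stop).most_common(n)


# Made-up sample tweets.
sample_tweets = [
    "Data science with Python",
    "Python for data analysis pic",
    "Learning Python and data science https bit ly",
]
top = top_words(sample_tweets)
print(top)
```

The words with the highest counts here are the ones a word cloud would draw largest.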

To show it we use:

    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.rcParams['figure.figsize'] = [10, 10]

And you'll get the word cloud. Pretty, but not that much. If we want good code we need modules, so let's transform that into a function:

    def generate_word_cloud(tweets):
        # Getting the text out of the tweets
        text = tweets.tweet.values

        # adding tweet-specific stopwords
        stopwords = set(STOPWORDS)
        stopwords.add("https")
        stopwords.add("xa0")
        stopwords.add("xa0'")
        stopwords.add("bitly")
        stopwords.add("bit")
        stopwords.add("ly")
        stopwords.add("twitter")
        stopwords.add("pic")

        wordcloud = WordCloud(
            background_color = 'black',
            width = 1000,
            height = 500,
            stopwords = stopwords).generate(str(text))

        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.rcParams['figure.figsize'] = [10, 10]

And then we just run:

    tweets = get_tweets("artificial intelligence", limit=1000)
    generate_word_cloud(tweets)

Try it yourself

There are many more things that you can do with the library.

Some other functions:

twint.run.Search() - Fetch Tweets using the search filters (Normal);
twint.run.Followers() - Fetch a Twitter user's followers;
twint.run.Following() - Fetch who follows a Twitter user;
twint.run.Favorites() - Fetch Tweets a Twitter user has liked;
twint.run.Profile() - Fetch Tweets from a user's profile (Includes retweets);
twint.run.Lookup() - Fetch information from a user's profile (bio, location, etc.).

Actually, you can use it from the terminal.

For that just run:

    pip3 install --upgrade -e git+https://github.[...]git@origin/master#egg=twint

Then just go to the twint folder:

    cd src/twint

And finally you can run, for example:

    twint -u TDataScience --since 2019-01-01 -o TDS.csv --csv

Here I'm getting all of this year's tweets (845 so far) from the TDS Team.

Here is the CSV file if you want it: FavioVazquez/twitter_optimus_twint (Analyzing tweets with Twint, Optimus and Apache Spark), on github.com.

Bonus (scaling the results)

Let's get 10k tweets and get their sentiment, because why not.

For that:

    df_result = tweets_sentiment("data science", limit=100000)
    df_result.show()

This actually took almost 10 minutes, so take your precautions. It may be faster to get the tweets from the CLI and then just apply the function. Let's see how many tweets we have:

    df_result.count()

And we have 10,031 tweets with sentiments! You can use them for training other models too.

Thanks for reading this, hopefully it can help you with your current job and understanding of data science.

If you want to know more about me, follow me on Twitter: Favio Vázquez (@FavioVaz).