Build your own Whatsapp Chat Analyzer

Samir Sheriff, Apr 29

Recently, after working on a number of projects from Udacity’s courses, I was on the lookout for new, familiar data to analyze, and what better place to start than one’s own phone?

Whatsapp claims that nearly 55 billion messages are sent each day.

The average user spends 195 minutes per week on Whatsapp, and is a member of plenty of groups.

With this treasure house of data right under our very noses, it is but imperative that we embark on a mission to gain insights on the messages our phones are forced to bear witness to.

This article aims to serve as a step-by-step guide to building your own Whatsapp conversation analyzer, and is divided into the following 3 main topics:

1. Data Collection
2. Data Preparation
3. Data Exploration

Prerequisites

Before you can get started, ensure that the following packages are installed in your Python environment:

- Pandas
- Seaborn
- Matplotlib
- Jupyter (optional, but useful since you can see intermediate outputs easily while following the steps in this tutorial)

Or, if you are lazy like me and don’t feel like installing any of these packages, just head over to Google Colaboratory (https://colab.research.google.com), which is a free Jupyter notebook environment that comes with everything pre-installed, and get started in a jiffy!

Data Collection

First off, we require a Whatsapp conversation to analyze.

Open a Whatsapp conversation you wish to analyze (preferably a group chat since they tend to be larger) and use the “Export Chat” functionality to send the entire conversation in text format to your email ID.

Important Note: When prompted by Whatsapp, ensure that you do not export any media; otherwise, the export might take ages.

Download the exported chat from your email inbox.

It should resemble the following:

18/06/17, 22:45 – Messages to this group are now secured with end-to-end encryption. Tap for more info.
25/09/16, 21:50 – Nick Fury created group "Avengers"
18/06/17, 22:45 – Nick Fury added you
18/06/17, 22:45 – Nick Fury added Hulk
18/06/17, 22:45 – Nick Fury added Thor
18/06/17, 22:45 – Nick Fury added Tony Stark
18/06/17, 22:29 – Tony Stark: Here are the details for tomorrow's picnic:
The park is located at 123 Main Street.
Bring your own snacks, we will also be grilling.
It is going to be very warm so dress appropriately.
We should be getting there at noon.
See you then and don't forget the sunscreen.
18/06/17, 22:46 – Hulk: HULK NO CARE
18/06/17, 22:46 – Hulk: HULK NO FRIEND HERE
18/06/17, 22:46 – Hulk: HULK HATE LOKI
18/06/17, 22:46 – Hulk: GFCHGK
18/06/17, 22:47 – Thor: Stop pressing every button in there
18/06/17, 22:47 – Loki: Why do you have 2 numbers, Banner?
18/06/17, 22:48 – Hulk: HULK FIRST SMASH YOU THEN TELL YOU

Data Preparation

[Image: background photo created by freepik, www.freepik.com]

Just like raw vegetables have to be cooked and garnished with a variety of spices to make them palatable to humans, so also this plain text file will have to be parsed and tokenized in a meaningful manner in order to be served (stored) in a Pandas dataframe.

Let us consider just a single line from the text (which we will call “raw text”) and see how we can extract relevant columns from it:

18/06/17, 22:47 – Loki: Why do you have 2 numbers, Banner?

Hereafter, whenever I wish to draw your attention to different tokens in a string s, I will present to you 2 lines.

The first line will indicate the token names enclosed within {curly braces} and their relative positions within s.

The second line will be the string s modified to indicate the actual values corresponding to each token, enclosed within curly braces.

For instance, if the string s is “Word 1, random word Word 2 :)”, then I will provide the following definition:

{Token 1}, random word {Token 2} {Token 3}
{Word 1}, random word {Word 2} {:)}

From this, you will be able to infer that the value of Token 1 is “Word 1”, that of Token 2 is “Word 2” and that of Token 3 is “:)”.

In our sample line of text, our main objective is to automatically break down the raw message into 4 tokens, and we will see how to go about this task in the next section:

{Date}, {Time} – {Author}: {Message}
{18/06/17}, {22:47} – {Loki}: {Why do you have 2 numbers, Banner?}

Step 1: Detecting {Date} and {Time} tokens

First, in order to detect if a line of text is a new message or belongs to a multi-line message, we will have to check if that line begins with a Date and Time, for which we will need a little bit of regular expression (regex) matching (don’t worry, it isn’t that mind-boggling once you break it down, especially with some nifty tools that I’ll show you in a minute).

Let us define a new method called startsWithDateTime, which checks whether a line begins with a date and time.

[Diagram: how regex matching detects the date and time in our message]

[Diagram: overview of all the messages detected in the sample text file]

I won’t dive into the details of how regular expressions actually work, but if you are interested, you can find more explanations about how this matching is done by visiting https://regex101.com/ and https://medium.com/tech-tajawal/regular-expressions-the-last-guide-6800283ac034.
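A minimal sketch of what the startsWithDateTime check from this step might look like. The pattern is an assumption based on the sample export shown earlier (day/month/year, 24-hour time, then a dash); exports from phones with other date or time settings would need a different pattern. The dash class also accepts the typographic dash that appears in the sample text of this article.

```python
import re

# Assumed layout: "DD/MM/YY, HH:MM - ..." (two- or four-digit year, 24-hour time)
DATE_TIME_PATTERN = re.compile(
    r"^(0[1-9]|[12][0-9]|3[01])/"        # day: 01-31
    r"(0[1-9]|1[0-2])/"                  # month: 01-12
    r"(\d{4}|\d{2}), "                   # year: YYYY or YY, then comma and space
    r"([01][0-9]|2[0-3]):([0-5][0-9])"   # time: HH:MM
    r" [-–] "                            # dash between the timestamp and the rest
)


def startsWithDateTime(s):
    """Return True if the line begins with a date and time, i.e. starts a new message."""
    return DATE_TIME_PATTERN.match(s) is not None


print(startsWithDateTime("18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?"))  # True
print(startsWithDateTime("Bring your own snacks, we will also be grilling."))            # False
```

Lines for which this returns False are treated as continuations of the previous message, which is what makes multi-line messages detectable later on.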

Coming back to our sample line: before we ran the startsWithDateTime method, no tokens were detected in our raw sample message:

{Raw Message}
{18/06/17, 22:47 – Loki: Why do you have 2 numbers, Banner?}

After we run the startsWithDateTime method, 2 tokens are detected in our processed sample message:

{Date}, {Time} – Message
{18/06/17}, {22:47} – Loki: Why do you have 2 numbers, Banner?

Step 2: Detecting the {Author} token

Now that we have identified lines that contain new messages with Date and Time components, let us move to the next part of the message (everything after the hyphen):

Loki: Why do you have 2 numbers, Banner?

Once again, we will require some more regular expression matching.

Our objective is to detect the author of this message.

While there could be a variety of patterns depending on how you have saved your friends’ names in your phone contacts app, the most commonly used patterns I have identified are as follows (feel free to add or remove any rules as you deem fit):

[Table: author name patterns]

Keeping these rules in mind, let us now define a method called startsWithAuthor, which finds strings that match at least one of the aforementioned rules.

[Diagram: how regex matching detects the author in our message]

[Diagram: overview of all the authors detected in the sample text file]

You can find more explanations about how this matching is done by visiting https://regex101.com/

Before we ran the startsWithAuthor method, 2 tokens had been detected in our processed sample message:

{Date}, {Time} – Message
{18/06/17}, {22:47} – {Loki: Why do you have 2 numbers, Banner?}

After we run the startsWithAuthor method, 4 tokens are detected in our processed sample message.

{Date}, {Time} – {Author}: {Message}
{18/06/17}, {22:47} – {Loki}: {Why do you have 2 numbers, Banner?}

Note: You might be wondering how the Message token appeared out of thin air. Well, once we have detected the Date, Time and Author tokens, what we are left with is the remaining portion of the string, which is the de facto Message token.

Step 3: Extracting and Combining tokens

Now that we have been able to identify the Date, Time, Author and Message tokens in a single message, it is time to split each line based on the separator tokens like commas (,), hyphens (-), colons (:) and spaces ( ), so that the required tokens can be extracted and saved in a dataframe.

This time, let me invert things by highlighting the separator tokens instead of the Date, Time, Author and Message tokens:

Date{Comma }Time{ Hyphen }Author{Colon }Message
18/06/17{, }22:47{ – }Loki{: }Why do you have 2 numbers, Banner?

Let us define a new method called getDataPoint for the task of splitting a string based on the separator tokens to extract the tokens of interest. Sample output values are shown in comments (#) beside each line.

Note: Figuring out when the value of Author token can be None is left as an exercise to the reader.
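The author check from Step 2 and the splitting from Step 3 could be sketched together as follows. The two author patterns are illustrative assumptions (a name of one to three words, or a phone number starting with +), not the exact rule set referred to above, so do adjust them to match how names appear in your own export.

```python
import re

# Hypothetical author rules (assumptions): tweak these for your own contacts
AUTHOR_PATTERNS = [
    r"\w+(?: \w+){0,2}",          # one to three words, e.g. "Loki" or "Tony Stark"
    r"\+\d{1,3}[\d ()\-]{6,}",    # an unsaved number, e.g. "+91 98765 43210"
]
AUTHOR_REGEX = re.compile(r"^(?:" + "|".join(AUTHOR_PATTERNS) + r"): ")


def startsWithAuthor(s):
    """Return True if the text begins with an author name followed by ': '."""
    return AUTHOR_REGEX.match(s) is not None


def getDataPoint(line):
    # line = "18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?"
    dateTime, rest = re.split(r" [-–] ", line, maxsplit=1)  # "18/06/17, 22:47", "Loki: Why do ..."
    date, time = dateTime.split(", ")                       # "18/06/17", "22:47"
    if startsWithAuthor(rest):
        author, message = rest.split(": ", 1)               # "Loki", "Why do you have 2 numbers, Banner?"
    else:
        author, message = None, rest                        # no author rule matched
    return date, time, author, message


print(getDataPoint("18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?"))
# ('18/06/17', '22:47', 'Loki', 'Why do you have 2 numbers, Banner?')
```

Splitting only on the first dash and the first colon keeps any later dashes or colons inside the message text intact. Names containing emoji or punctuation would not match these simple rules and would need an extra pattern.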

Step 4: Parsing the entire file and handling Multi-Line Messages

We have come to the last stage of data parsing, for which we will have to read the entire Whatsapp text file, identify and extract tokens from each line, and capture all data in tabular format within a list. We then initialize a pandas dataframe from this list.

You will find all your data tabulated (looks neat, doesn’t it?).

[Table: the parsed chat as a data frame]

Data Exploration

[Image obtained from https://www.flocabulary.com/lesson/age-of-exploration/]

Finally, we have reached one of the most exciting parts of our journey: Data Exploration.

It is time for us to unearth the interesting stories that all this data is trying to tell us.
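Before we start exploring, here is a minimal end-to-end sketch of how the Step 4 parsing loop and the data frame initialization might look. The helpers are compact stand-ins for the ones from Steps 1 to 3, the name parseChat is hypothetical, and the column names Date, Time, Author and Message are assumed from the tokens defined earlier.

```python
import re
import pandas as pd

# Compact stand-ins for the Step 1 to 3 helpers (assumed "DD/MM/YY, HH:MM - " prefix)
DATE_TIME = re.compile(r"^\d{2}/\d{2}/(?:\d{4}|\d{2}), \d{2}:\d{2} [-–] ")
AUTHOR = re.compile(r"^(\w+(?: \w+){0,2}|\+[\d ()\-]{7,}): ")


def startsWithDateTime(line):
    return DATE_TIME.match(line) is not None


def getDataPoint(line):
    dateTime, rest = re.split(r" [-–] ", line, maxsplit=1)
    date, time = dateTime.split(", ")
    match = AUTHOR.match(rest)
    if match:
        return date, time, match.group(1), rest[match.end():]
    return date, time, None, rest


def parseChat(lines):
    """Turn raw chat lines into [date, time, author, message] rows."""
    parsedData = []
    messageBuffer = []                  # lines of the message currently being read
    date = time = author = None
    for line in lines:
        line = line.strip()
        if startsWithDateTime(line):
            if messageBuffer:           # a new message starts, so flush the previous one
                parsedData.append([date, time, author, " ".join(messageBuffer)])
            messageBuffer.clear()
            date, time, author, message = getDataPoint(line)
            messageBuffer.append(message)
        elif messageBuffer:             # continuation of a multi-line message
            messageBuffer.append(line)
    if messageBuffer:                   # flush the last message in the file
        parsedData.append([date, time, author, " ".join(messageBuffer)])
    return parsedData


sample = [
    "18/06/17, 22:45 - Nick Fury added Thor",
    "18/06/17, 22:29 - Tony Stark: Here are the details for tomorrow's picnic:",
    "The park is located at 123 Main Street.",
    "18/06/17, 22:46 - Hulk: HULK NO CARE",
]
df = pd.DataFrame(parseChat(sample), columns=["Date", "Time", "Author", "Message"])
print(df)
```

For a real export, the list of sample lines would be replaced by the opened text file, since parseChat accepts any iterable of lines.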

Describe the Data Frame

Firstly, let us take a look at what pandas has to say about our data frame (df), using its describe command. This command shows the number of entries (count), unique entries, most frequently occurring entries (top) and frequency of the most frequently occurring entries (freq) for each column in the data frame.
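As a small illustration of the describe command, here is a sketch on a tiny made-up data frame that stands in for the parsed chat; the rows are invented for the example.

```python
import pandas as pd

# Tiny made-up stand-in for the parsed chat data frame (df)
df = pd.DataFrame({
    "Date": ["18/06/17", "18/06/17", "18/06/17", "19/06/17"],
    "Time": ["22:46", "22:46", "22:47", "09:10"],
    "Author": ["Hulk", "Hulk", "Thor", None],
    "Message": ["HULK NO CARE", "HULK HATE LOKI",
                "Stop pressing every button in there", "Nick Fury added Thor"],
})

summary = df.describe()
print(summary)  # one row each for count, unique, top and freq
```

Since every column here holds text, describe reports count, unique, top and freq rather than numeric statistics such as the mean.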

[Figure: sample output of the describe command]

Highly Talkative

Who are the most garrulous members? Let’s take a look at the number of messages sent by the top 10 Authors in the group.
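One way to count messages per author is value_counts on the Author column. The sketch below uses a made-up Author column, and the variable names are only suggestions.

```python
import pandas as pd

# Made-up Author column standing in for the parsed chat
df = pd.DataFrame({"Author": ["Hulk", "Hulk", "Hulk", "Thor", "Loki", "Thor", None]})

# Messages per author, most active first (messages without an author are ignored)
author_value_counts = df["Author"].value_counts()
top_10_author_value_counts = author_value_counts.head(10)
print(top_10_author_value_counts)
```

With Matplotlib installed, top_10_author_value_counts.plot.barh() draws the same counts as a horizontal bar chart.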

Mysterious Messages with No Authors!

Remember how, a few sections earlier, I had given you an exercise to figure out when the author of certain messages can be None? Well, if you haven’t figured it out yet, you needn’t worry because the results of this section might give you a clue.

Let us find all those messages which have no authors. Do you see any pattern in the messages here?

Media Overload

While glancing through the original text file or the entire data frame, you might have noticed messages containing the string: “<Media omitted>”.

These messages represent any pictures, videos or audio messages.

Let us find all media messages and analyze the number of media messages sent by the top 10 authors who send media messages in the group. Do you spot any differences between the authors who send the most messages overall, and the authors who send the most media messages?

We don’t need no media nor ghosts

Since we are restricting ourselves to the analysis of text-only messages sent by our friends in the group, let us create a new data frame (messages_df) by dropping all those messages that are either media messages or do not have an author, using the data frames obtained in the previous 2 sections. This step could be categorized as data cleaning.

Feel free to skip this step if you want to gain insights on the entire non-text data as well.
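The last three sections (messages with no authors, media messages, and the cleaned messages_df) could be sketched as follows on a small made-up data frame. The names null_authors_df and media_messages_df are my own choices for the intermediate data frames.

```python
import pandas as pd

# Made-up rows standing in for the parsed chat
df = pd.DataFrame({
    "Author": ["Hulk", None, "Thor", "Hulk", "Loki"],
    "Message": ["HULK NO CARE", "Nick Fury added Thor", "<Media omitted>",
                "<Media omitted>", "Why do you have 2 numbers, Banner?"],
})

null_authors_df = df[df["Author"].isnull()]                   # messages with no author
media_messages_df = df[df["Message"] == "<Media omitted>"]    # pictures, videos, audio
media_per_author = media_messages_df["Author"].value_counts().head(10)

# Drop both groups to keep text-only messages written by group members
rows_to_drop = null_authors_df.index.union(media_messages_df.index)
messages_df = df.drop(rows_to_drop)
print(messages_df)
```

Taking the union of the two indexes before dropping avoids an error in case a row ever lands in both groups.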

If you don’t collect any metrics, you’re flying blind.

It might be interesting to count the number of letters and words used by each author in each message.

So, let us add 2 new columns to the data frame called “Letter_Count” and “Word_Count”. This step could be categorized as data augmentation.
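A sketch of this augmentation on two made-up rows. It assumes that a “letter” means any character in the message (spaces and punctuation included) and that words are separated by whitespace; stricter definitions are possible.

```python
import pandas as pd

# Two made-up text messages standing in for messages_df
messages_df = pd.DataFrame({
    "Author": ["Hulk", "Loki"],
    "Message": ["HULK NO CARE", "Why do you have 2 numbers, Banner?"],
})

# Letter_Count: every character in the message, spaces and punctuation included
messages_df["Letter_Count"] = messages_df["Message"].apply(len)
# Word_Count: number of whitespace-separated chunks
messages_df["Word_Count"] = messages_df["Message"].apply(lambda s: len(s.split()))
print(messages_df[["Author", "Letter_Count", "Word_Count"]])
```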

Now, let us describe the cleaned and augmented data frame.

One important point to note here is the distinction made between columns containing continuous values vs. those containing discrete values.

Try running the describe command on the entire data frame without specifying any columns. What do you observe?

“The temple of art is built in words” and ‘l’, ‘e’, ‘t’, ‘t’, ‘e’, ‘r’, ‘s’

Let us take a step back and look at the overall picture.

How many words and letters have been sent since the beginning of time (which, in this case, happens to be since the moment the group was conceived)? Running this code revealed that a whopping 1,029,606 letters and 183,485 words were used in the Avengers’ group.
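Totals like these come from summing the two count columns. A minimal sketch with made-up numbers (not the figures quoted above):

```python
import pandas as pd

# Made-up counts standing in for the augmented messages_df
messages_df = pd.DataFrame({
    "Author": ["Hulk", "Hulk", "Thor"],
    "Letter_Count": [12, 14, 35],
    "Word_Count": [3, 3, 6],
})

total_letters = messages_df["Letter_Count"].sum()
total_words = messages_df["Word_Count"].sum()
print(total_letters, total_words)  # 61 12
```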

“Words are, of course, the most powerful drug used by mankind.”

How many words have been sent in total by each author, since the beginning of time? What is the most common number of words in a message? Looks like most messages contain only 1 word.
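Both questions can be answered with a groupby and a value_counts. Here is a sketch on made-up counts; the variable names are only suggestions, and the same two lines work for the Letter_Count column as well.

```python
import pandas as pd

# Made-up word counts standing in for the augmented messages_df
messages_df = pd.DataFrame({
    "Author": ["Hulk", "Hulk", "Thor", "Loki", "Thor"],
    "Word_Count": [3, 1, 6, 1, 1],
})

# Total words per author, largest first
total_word_count_grouped_by_author = (
    messages_df.groupby("Author")["Word_Count"].sum().sort_values(ascending=False)
)
# How often each message length (in words) occurs, most common first
word_count_value_counts = messages_df["Word_Count"].value_counts()

print(total_word_count_grouped_by_author.head(10))
print(word_count_value_counts.head(10))
```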

I wonder what that word is!

How wonderful it is to be able to write someone a letter!

Does it make sense to count the total number of letters sent by each author since the beginning of time as well? Well, since we already have a “Letter_Count” column, I don’t see any harm in doing so. So, here goes. What is the most common number of letters in a message? Looks like most messages contain only 1 or 2 letters. Hmmm! Very interesting! Are these letters from the English language or some other symbols?

Remember, remember, the 5th of November, The Gunpowder Treason and plot

Do you know the date on which the most number of messages were sent in the history of your group? Well, fear no more, for you will find out in just a second: 22/09/17 was the most active date.

Was this the date when Thanos struck, rendering everyone panic-stricken and bursting with questions? What date did you get? Do you remember anything significant happening on this date?

Time is an illusion

Do you lie awake at night wondering at what time of the day your group is most active? The truth will be revealed: looks like the group is mostly active at around 8:15 PM.

Be sure to message at this time to get a quicker response.
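The most active dates and times can both be found with value_counts. A sketch on made-up rows, assuming Date and Time are stored as the strings extracted during parsing:

```python
import pandas as pd

# Made-up dates and times standing in for messages_df
messages_df = pd.DataFrame({
    "Date": ["22/09/17", "22/09/17", "23/09/17", "22/09/17"],
    "Time": ["20:15", "20:15", "09:00", "18:30"],
})

top_dates = messages_df["Date"].value_counts().head(10)   # busiest dates first
top_times = messages_df["Time"].value_counts().head(10)   # busiest minutes first
print(top_dates)
print(top_times)
```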

The need of the Hour

What is the most suitable hour of the day at which to message to increase your chances of getting a response from someone? In order to answer this question, we will have to augment the data frame to include a new column for the hour (extracted from the “Time” column). Then, you just have to run code similar to that used for obtaining the top dates and times. Looks like messaging between 6 PM and 7 PM has the highest chances of eliciting responses from group members.
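A sketch of the hour augmentation on made-up rows, assuming the Time column holds 24-hour “HH:MM” strings as in the sample export:

```python
import pandas as pd

# Made-up times standing in for messages_df
messages_df = pd.DataFrame({"Time": ["18:05", "18:40", "20:15", "09:00"]})

# Keep only the part before the colon, e.g. "18:05" becomes "18"
messages_df["Hour"] = messages_df["Time"].apply(lambda t: t.split(":")[0])
top_hours = messages_df["Hour"].value_counts().head(10)
print(top_hours)
```

An hour of “18” covers everything from 6 PM up to (but not including) 7 PM.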

Conclusion

Congratulations! You are more knowledgeable about your Whatsapp conversations now! Those were quite a lot of insights, weren’t they?
