The romantic side of data science: Analyzing a relationship through a year worth of text messages

Let’s look below (excuse me for the lack of subplots here, but plotly sure makes it impossible to generate them when using heatmaps!). More details below!

The Y-axis of these heatmaps indicates the time delay between getting a message and replying.

The X-axis shows the time of day at which the message was received.

The different colors in the heatmap indicate how many messages were sent throughout the year at a specific time of day (x-axis) and within a specific delay (y-axis).

For example, we can see that of the messages OJ received at 10am, 156 were answered within 0–2 minutes.
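For the curious, a table like this can be put together in a few lines of pandas. This is a minimal sketch, assuming a DataFrame msgs with 'sender' and 'timestamp' columns; the column names and delay buckets here are my own, not necessarily those of the original code:

import pandas as pd

msgs = msgs.sort_values('timestamp')
# A reply is a message whose sender differs from the previous message's sender
is_reply = msgs['sender'] != msgs['sender'].shift()
delay_min = msgs['timestamp'].diff().dt.total_seconds() / 60
received_hour = msgs['timestamp'].shift().dt.hour  # when the answered message arrived

replies = msgs[is_reply].assign(
    hour=received_hour,
    delay=pd.cut(delay_min, bins=[0, 2, 5, 15, 60, 24 * 60],
                 labels=['0-2', '2-5', '5-15', '15-60', '60+']),
)

# Rows: delay bucket (y-axis); columns: hour of day (x-axis); values: counts
heat = pd.crosstab(replies['delay'], replies['hour'])

The heat table can then be fed straight into a plotly heatmap (there’s a snippet of that further down).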

Kapish?

First off, as we saw earlier, most of the communication happens in the daytime, around mid-afternoon.

As expected, neither of us is the fastest to respond to late-night messages, and we’re seemingly not too different when it comes to our delay in answering daytime messages: we’re pretty fast in most cases! This will be interesting to revisit in a few months; hopefully I will not be as connected as before, and this could be a useful measure.

What am I even talking about?

Content is just as interesting as the statistics we’ve explored so far.

BUT it’s also a more personal matter, so I will keep this section minimal :)

Negative Nancy or Positive Patricia?

If you have been reading carefully, you might have noticed that traditionally in this relationship, I am considered the more negative of the two of us.

But does this actually reflect in what I say versus what OJ says?

Sentiment analysis of our messages was performed using nltk’s built-in VADER, which enables sentiment analysis of natural sentences (mostly social media content).

I have locally edited the existing VADER lexicon to match our vocabulary more accurately and to include French terms we use very often, so it reflects sentiment better.
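In a nutshell, the per-message labeling looks something like the sketch below. The ±0.05 compound-score cutoffs are the commonly used VADER convention, and 'chouette' is just a hypothetical example of a French addition, not a term from our actual lexicon:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# requires a one-time nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sia.lexicon.update({'chouette': 2.0})  # hypothetical French term, scored as positive

def label(text):
    compound = sia.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'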

More details at the end.

Seems like it’s a definite no! In fact, when comparing percentage-wise, it actually seems I am a bit less negative than OJ! This is a shocker, but the differences are minor.

An alternative viewpoint that might reflect the situation better is not my negativity in messaging, but rather my lack of positivity.

OJ is clearly more positive, with a whole 28.9% of messages said in a positive tone, while I stand a whole 5.7% lower.

But this was expected.

Looking at what happens throughout the day (excuse me for inserting this as a photo; it was impossible to insert as a decent subplot!), it seems we don’t demonstrate anything significantly odd, except perhaps a slightly higher presence of negative content from my end in the early morning hours (sigh).
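Under the hood, the table behind a plot like this is a simple groupby. A rough sketch, assuming each message row already carries a 'sentiment' label from the step above:

by_hour = (msgs.assign(hour=msgs['timestamp'].dt.hour)
               .groupby(['sender', 'hour'])['sentiment']
               .value_counts(normalize=True)  # share of pos/neg/neutral per hour
               .rename('share')
               .reset_index())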

Comparing negative and positive messages at different times of day

Nicknames anyone?

Finally, looking at a word cloud visualizing all of the content (rather than just the content of first messages), the first conclusion is: we can’t really tell much.

Both word clouds, for all of the conversations!

Most of the stop words have been removed using nltk’s built-in stop words; however, due to different encoding, some snuck back in (e.g. ‘that’, ‘it’) and I have not yet removed these.

In a future version, when I have some extra time, I might get around to that :)

There aren’t any words that stand out as unusual; we can pick up on a few habits in there, like the Brit’s (OJ’s) frequent use of the words ‘mister’ or ‘sir’, or both of us overusing ‘lol’, ‘haha’ and ‘hehe’.

Millennials after all.

We can’t even detect any nickname in here, on either side, a strong indicator of our phasing in and out of nicknames.

No, a more useful practice for understanding content would be analyzing it within specific contexts: for example, how does our conversation change during trips? What are we saying in a negative context? What are we complaining about? And so on.

Bottom line?

So what have we learned from this practice?

OJ is a chatty human.

GT needs to initiate more conversations.

GT is not as negative as you might perceive! But maybe not as positive as OJ.

We are respectful when it comes to waiting time, and don’t leave the other side hanging too long waiting for a response.

Shabbat is a great time to put the phone down.

Data can be an interesting and original anniversary tradition!

Happy first anniversary to us, OJ :)

Photo Credit: Matthew Henry [Burst]

Some technical info for the data enthusiasts

I wrote the code for this project in Python, as I mentioned earlier.

It’s pandas-heavy, and the plotting was done almost entirely with plotly.

As always, some of my best mates in the process were Stack Overflow and the documentation pages of the different libraries I used.

VADER is a great nltk tool for analyzing the sentiment of text data in Python if you haven’t got a labeled training set available.

It uses term scoring over a wide lexicon of words (which you can easily add to; see below), and after reviewing a sample of data from my export, it actually is pretty damn accurate!

Word clouds were generated with Python’s WordCloud library, and were… decent.

I haven’t worked with these very often, but I am on the lookout for a tool with better resolution and customization, so feel free to leave a message if you have suggestions for that!

Here are some links that I found useful and you might too!

Plotly: an interactive plotting library for Python (and other languages) with detailed documentation pages.
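As a taste, a heatmap like the reply-delay ones above takes only a few lines; the values below are made up, just to show the shape of the call:

import plotly.graph_objects as go

fig = go.Figure(go.Heatmap(
    z=[[5, 20, 12], [3, 9, 7], [1, 4, 2]],  # message counts (dummy data)
    x=['8am', '10am', '12pm'],              # time of day the message arrived
    y=['0-2 min', '2-5 min', '5-15 min'],   # delay before replying
))
fig.show()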

VADER: this page gives a great walk-through of what’s behind the scenes and how to use the tool very practically.

If you want to edit the lexicon and add your own terms with the update method:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()
new_words = {'greaaaat': 2.0, 'shiiit': -2.0}
analyser.lexicon.update(new_words)

WordCloud: can be used easily after importing it with from wordcloud import WordCloud.

Once that’s done, you can build the WC object and customize the colors, remove stop words and more.
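For reference, a minimal sketch of that, assuming all_text holds the concatenated messages; the apostrophe normalization is one way to handle the encoding issue mentioned earlier, not necessarily what ended up in the project:

from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords

text = all_text.replace('\u2019', "'")  # normalize curly apostrophes so stop words match
stops = set(STOPWORDS) | set(stopwords.words('english'))  # needs nltk.download('stopwords')

wc = WordCloud(stopwords=stops, background_color='white',
               width=800, height=400).generate(text)
wc.to_file('wordcloud.png')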

I learned about this with a DataCamp tutorial you can find here.

