Adventures in WhatsApp DB — extracting messages from backups (with code examples)

Adjust your clock and try again.

I have no idea what do date issues have to do with not starting WhatsApp but fixing the date and time didn’t help and reading around about this most solutions seem to suggest uninstall-reinstall which is ok in general but not something I wanted to risk when I needed these messages for evidence in court.

Step 1 — Backup and extract files from your iPhoneThis step is more Mac OS specific although it should work similarly in other OSs.

The idea is simple:Create a virtual machine and install iTunes on it.

Make sure the VM is not connected to the Internet.

(If you have no idea how to create a VM you can try this article).

Connect your iPhone to the VM.

Follow common steps to backup iPhone locally on the VM.

To those interested on Mac you can find the backups at:~/Library/Application Support/MobileSync/BackupThe backup format is described hereInstall an iPhone backup extraction tool (such as iExplorer by macroplant) in the VM.

It doesn’t matter which tool as it won’t have internet access and therefore your data is safe.

Finally use the backup extraction tool to extract ChatStorage.

sqlite that is the main WhatsApp conversation database.

To get to it you need to browse to “App Group” → “group.

net.

whatsapp.

WhatsApp.

shared”As a bonus, you can also extract the media by extracting the folder: “App” → “net.

whatsapp.

WhatsApp” → “Library” → “Media”Note “ChatStorage.

sqlite” at the endStep 2 — Connecting to the DB and reading the messages tableWhatsApp seems to store all conversations in a Sqlite DB which makes our life fairly easy.

As a first step let’s try to use the sqlite3 built-in python library + Pandas to see if we can connect to the DB and get the list of tables:As you can see there are 18 tables.

Digging through the content I’ve found that the interesting one (i.

e.

the one with your messages) seems to be ZWAMESSAGE.

Let’s open it:Success!There is a total of 34 columns that we can see:ZWMESSAGE table columnsLet’s see what we’ve got.

The below is my analysis based on the data I’ve found in the tables in my own DB and my inferences about it:1.

Z_PK — seems like a serial number2.

Z_ENT to ZFILTEREDRECIPIENTCOUNT — seem less important3.

ZFLAGS — seems to indicate message state4.

ZGROUPEVENTTYPE — seems to be related to group chats5.

ZISFROMME — message is from me… it is 1 for messages sent by this user and 0 for messages received6.

ZMESSAGEERRORSTATUS to ZSPOTLIGHTSTATUS — seems like general statuses7.

ZSTARRED — did we star the message8.

ZCHATSESSION — unique identifier denoting a chat session9.

ZGROUPMEMBER — haven’t gotten to look at this one yet10.

ZLASTSESSION — last chat session?.didn’t dig into it11.

ZMEDIAITEM — seems related to media item indexing, might be an identifier to one of the other tables12.

ZMESSAGEINFO and ZPARENTMESSAGE — seem simple enough to figure out from the names13.

ZMESSAGEDATE — message creation date probably (see date format discussion below)14.

ZSENTDATE — message sent date probably (see date format discussion below)15.

ZFROMJID — from who did we get it (if it is an incoming message)16.

ZMEDIASECTIONID — seems related to media storage for media messages, doesn’t show in messages without media17.

ZPHASH – hmmm.

not sure18.

ZPUSHNAME — seems like the contact name on your phone19.

ZSTANZAID — some conversation / media id indicator.

Format seems different in media messages and text messages20.

ZTEXT — message text21.

ZTOJID — to whom did we send it (if it is an outgoing message)Step 3 — Extracting specific conversationsOur base goal is to extract specific conversations or communications.

So we can search for a contact by name or phone number or search for a specific conversation by messages if we remember part of the text and eventually extract the complete conversation based on the conversation ID.

But, before we do any of that we should figure out how to get the dates and times to a human readable format.

So intuitively I assumed that ZMESSAGEDATE is a unix-time timestamp.

So I converted it to a date and this is what I got:Huh?!.1985 ?Reading around it seems that Apple, in their infinite unique wisdom have decided to use dates starting from 1.

1.

2001 on iPhone so let’s see what happens if we add an offset from time 0 of unix-time to 1.

1.

2001:That seems to have solved the issue.

Along the way I also added a date index to the data-frame to make it more comfortable to use.

Note that if you are importing from Android or another OS you probably don’t need to use this shift in dates.

Now that we have dates lets extract conversations.

First let’s find a conversation we care about, such as by looking for a specific phone number:a fragment of the phone number is enoughyou should notice that the lambda I created is used to search both the from and the to columns.

An interesting side note is that there is a disappointing Python “Wat” moment here because False | None != None | False which just does not make sense.

Anyways, moving on if we look at the ZCHATSESSION column we’ll notice that in my case the conversation id will be “104.

0” for this number (if there are also group chats you may see more than one chat session id).

So out next step will be to extract all the messages the chat session.

To do this let’s add proper accessors for that and for the other search methods I mentioned above:useful accessorsNow, to sum everything up, let’s get all messages from a chat session and display them (enabling multi-line format to see all text):Now we see a chat session between my wife and a Kindergarten teacher :PPart 4 — Closing words and a note on extracting mediaMedia files are stored in the folder mentioned in step 1 — as best as I can tell the subfolders are based on phone number of the contact that shared a file (and a suffix denoting group chats? — this is a guess).

The subfolders under each phone number seem to follow a similar logic to the iTunes backup format and I suspect the messages table has some reference to links in the media table but I didn’t spend time on actually figuring the details out.

One last note some folders will also contain thumb files which are actually just jpg files (rename them and open normally).

The jupyter notebook code can be found on github hereI have removed personal information from it but the code and samples are exactly the same.

The notebook might also contain some extra details that I didn’t bother including here.

Finally, if you have the patience and do figure out more details or have any improvements to suggest to the code send a pull request on github and I’ll be happy to include it with relevant credits.

Please leave a clap if you enjoyed this.

You can find me @gal_diskin on twitter or @Disk1n on github.

.

. More details

Leave a Reply