3 Quick Ways To Compare Data in Python

Hands down, Beyond Compare!In this article, we will explore something in between the two solutions we just looked at.

Maybe because you want to do a very quick diff, or maybe because the datasets are simply too large for Excel to handle or for you to setup a diff.

Let us explore 3 quick alternatives.

Photo by Taylor Vick on Unsplash1.

Check the integrity of the dataIn this instance, by checking the integrity of the data I mean that we will be checking the data set in its entirety and we will either pass the test or fail it.

We will not be able to determine why the check has failed but this is a super quick way of doing it.

Let us explore 2 alternative algorithms.

Check with an MD5 ChecksumAn MD5 Checksum is essentially an algorithm that will return a hexadecimal number for the contents of a file.

We can use the following code to leverage it:import hashlib, sysfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takedef md5(fname): md5hash = hashlib.

md5() with open(fname) as handle: #opening the file one line at a time for memory considerations for line in handle: md5hash.

update(line.

encode('utf-8')) return(md5hash.

hexdigest())print('Comparing Files:',files[0],'and',files[1])if md5(files[0]) == md5(files[1]): print('Matched')else: print('Not Matched')Returning:Check with SHA1 algorithmSHA1 algorithm is yet another hexadecimal algorithm that will convert our file contents into a string.

It is the same algorithm that bitcoin uses for its blockchain mining process.

Using the below:import hashlib, sysfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takedef sha1(fname): sha1hash = hashlib.

sha1() with open(fname) as handle: #opening the file one line at a time for memory considerations for line in handle: sha1hash.

update(line.

encode('utf-8')) return(sha1hash.

hexdigest())print('Comparing Files:',files[0],'and',files[1])if sha1(files[0]) == sha1(files[1]): print('Matched')else: print('Not Matched')Giving:2.

Check the contents of the data with SQLPhoto by Joshua Sortino on UnsplashLuckily for us, using a couple Python libraries, we can import our files into an SQL database and use the Except Operator to highlight any differences.

The only thing to note, is that Except expects the data to be ordered; otherwise it will highlight everything as a difference.

We can quickly use this method this way:import sys, sqlite3, pandas as pdfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takeconn = sqlite3.

connect(':memory:') #we are spinning an SQL db in memorycur = conn.

cursor()chunksize = 10000i=0for file in files: i = i+1 for chunk in pd.

read_csv(file, chunksize=chunksize): #load the file in chunks in case its too big chunk.

columns = chunk.

columns.

str.

replace(' ', '_') #replacing spaces with underscores for column names chunk.

to_sql(name='file' + str(i), con=conn, if_exists='append')print('Comparing', files[0], 'to', files[1]) #Compare if all data from File[0] are present in File[1]cur.

execute( '''SELECT * FROM File1 EXCEPT SELECT * FROM File2''')i=0for row in cur: print(row) i=i+1if i==0: print('No Differences')print('Comparing', files[1], 'to', files[0]) #Compare if all data from File[1] are present in File[0]cur.

execute( '''SELECT * FROM File2 EXCEPT SELECT * FROM File1''')i=0for row in cur: print(row) i=i+1if i==0: print('No Differences')cur.

close()Returning:3.

Check the data using PandasPhoto by Franki Chamaki on UnsplashUsing the famous Pandas package, we can quickly check our files once we have them in a dataframe.

Let us examine a few different properties of the pandas library that will allow us to achieve this.

Using the .

equals() parameterIn the same set up as our previous examples; code:import sys, sqlite3, pandas as pdfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takedf1 = pd.

read_csv(files[0])df2 = pd.

read_csv(files[1])df3 = df1.

equals(df2)print('Matches:', df3)Outcome:Using the .

any() parameterWe can quickly do that in the following way:import sys, sqlite3, pandas as pdfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takedf1 = pd.

read_csv(files[0])df2 = pd.

read_csv(files[1])df3 = (df1 != df2).

any(axis=None)print('Differences in file:', df3)df3 = (df1 != df2).

any(1)ne_stacked = (df1 != df2).

stack()changed = ne_stacked[ne_stacked]changed.

index.

names = ['id', 'col']print('Differences In:')print(changed)which returns:Using the .

Eq() parameterExecute the following script:import sys, sqlite3, pandas as pd, numpy as npfiles = [sys.

argv[1], sys.

argv[2]] #these are the arguments we takedf1 = pd.

read_csv(files[0])df2 = pd.

read_csv(files[1])df3 = df1.

eq(df2)print(df3.

all())#print(df3.

all(axis=1))df4 = df3.

all(axis=1)df4 = pd.

DataFrame(df4, columns=['Columns'])print(df4[df4['Columns']==False])Resulting in:Do you use any other ways to compare your data?.Let me know!Before you go, I often blog about finance and technology.

If you want to stay up to date with my articles, don’t forget to follow me!.. More details

Leave a Reply