Fuzzy matching at scale

From 3.7 hours to 0.2 seconds: how to perform intelligent string matching in a way that can scale to even the biggest data sets.

Josh Taylor · Jul 1

Same but different.

Fuzzy matching of data is an essential first step for a huge range of data science workflows.

Data in the real world is messy.

Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.

This article focuses on ‘fuzzy’ matching and how it can help to automate significant challenges in a large number of data science workflows, through:

- Deduplication: aligning similar categories or entities in a data set (for example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity).
- Record linkage: joining data sets on a particular entity (for example, joining records of ‘D J Trump’ to the URL of his Wikipedia page).

By using a novel approach borrowed from the field of Natural Language Processing, we can perform both of these tasks on large data sets.

The problem with Fuzzy Matching on large data

There are many algorithms which can provide fuzzy matching (see here how to implement in Python), but they quickly fall down when used on even modest data sets of greater than a few thousand records. The reason for this is that they compare each record to all the other records in the data set. In computer science, this is known as quadratic time and can quickly form a barrier when dealing with larger data sets:

Number of records on the left against the number of operations required for an algorithm that works in quadratic time.

A relatively small data set of 10k records would require 100m operations.
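As a quick back-of-the-envelope illustration of that growth (a minimal sketch; the record counts below are just examples), an all-pairs comparison scales with the square of the number of records:

```python
# Operation counts for a quadratic-time, all-pairs comparison.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {n * n:>22,} operations")
```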

What makes this worse is that most string matching functions are also dependent on the length of the two strings being compared, and can therefore slow down even further when comparing long text.

How a well-known NLP algorithm can help solve the issue

The solution to this problem comes from a well-known NLP algorithm.

Term Frequency, Inverse Document Frequency (or tf-idf) has been used in language problems since 1972.

It is a simple algorithm which splits text into ‘chunks’ (or ngrams), counts the occurrence of each chunk for a given sample and then applies a weighting to this based on how rare the chunk is across all the samples of a data set.

This means that useful words are filtered from the ‘noise’ of more common words which occur within text.
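A toy word-level illustration of that weighting (a minimal sketch, not code from the original post; on older Scikit-Learn versions `get_feature_names_out` is `get_feature_names`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the Ministry of Justice",
    "the Ministry of Defence",
    "the Environment Agency",
]

vec = TfidfVectorizer()
vec.fit(docs)

# A term that appears in every document ('the') gets the lowest idf weight,
# while rarer, more informative terms ('justice', 'environment', ...) score higher.
for term, idf in sorted(zip(vec.get_feature_names_out(), vec.idf_), key=lambda x: x[1]):
    print(f"{term}: {idf:.2f}")
```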

Whilst these chunks are normally applied to whole words, there is no reason why the same technique cannot be applied to sets of characters within words.

For example, we could split each word into 3-character ngrams; for the word ‘Department’, this would output:

' De', 'Dep', 'epa', 'par', 'art', 'rtm', 'tme', 'men', 'ent', 'nt '

We can then compare these chunks across a matrix of items which represents our data set to look for close matches.

This method of finding close matches should be both very efficient and produce good-quality matches, thanks to its ability to place greater importance on groups of characters which are less common across the data.

Let's put it to the test!

Real world example

The example we will use to test this algorithm is a set of UK public organisations who have published contracts on Contracts Finder. The names of these organisations are very messy and it appears as if they have been typed into the system via a free-text field.

A sample of the data set — there are 3,651 different buying organisations in total.

Smart Deduping

We will first explore how to dedupe close matches.

The process is made painless using Python’s Scikit-Learn library:

1. Create a function to split our strings into character ngrams.
2. Create a tf-idf matrix from these ngrams using Scikit-Learn.
3. Use cosine similarity to show close matches across the population.

The ngram function

The below function is used both as a cleaning function for the text data and as a way of splitting text into ngrams.
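The function itself is embedded as a code snippet in the original post and did not survive into this text, so the version below is a reconstruction in the same spirit. The exact cleaning rules are an assumption; what matters is that the function is named `ngrams`, because it is passed to Scikit-Learn as a custom analyzer later on.

```python
import re

def ngrams(string, n=3):
    """Clean a raw organisation name and split it into character ngrams."""
    # Strip punctuation so 'D.J. Trump' and 'D J Trump' yield similar chunks
    # (these particular cleaning rules are assumptions, not the article's exact code).
    string = re.sub(r'[-,./()&]', ' ', string)
    # Collapse any repeated whitespace created by the substitution above.
    string = re.sub(r'\s+', ' ', string).strip()
    # Pad with spaces so word boundaries produce chunks such as ' De' and 'nt '.
    string = f' {string} '
    # Slide a window of n characters across the padded string.
    return [string[i:i + n] for i in range(len(string) - n + 1)]
```

Because `ngrams` returns a list of chunks rather than a cleaned string, it can be dropped straight into `TfidfVectorizer(analyzer=ngrams)` in the next step.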

Comments have been added in the code to show the purpose of each line: the ngram function both cleans the raw string and also splits it into character ngrams.

Applying the function and creating a tf-idf matrix

The great thing about the tf-idf implementation in Scikit is that it allows for a custom function to be added to it.

We can therefore add in the function we have created above and build the matrix in just a few lines of code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

org_names = names['buyer'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(org_names)
```

Finding close matches through cosine similarity

You could use the cosine similarity function from Scikit here; however, it is not the most efficient way of finding close matches, as it returns a closeness score for every item in the dataset for each sample.
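For contrast, the naive route would look something like the sketch below (assuming the `tf_idf_matrix` built above); it is shown only to illustrate the problem, since it materialises a full pairwise similarity matrix:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Builds the complete N x N similarity matrix in one go. For 3,651 names this is
# still manageable, but memory and runtime grow quadratically with the data set.
pairwise_similarity = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
```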

Instead, we are going to use a faster implementation of this which can be found here:

```python
import numpy as np
from scipy.sparse import csr_matrix

!pip install sparse_dot_topn
import sparse_dot_topn.sparse_dot_topn as ct

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32

    nnz_max = M * ntop

    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N,
        np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data, indices, indptr), shape=(M, N))
```

Putting all of this together, we get the following result:

Witchcraft! The algorithm is eerily good at identifying duplicate records.
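The code that unpacks the sparse result into the readable table shown in the post lives in the linked notebook. A minimal sketch of one way to do it (the `get_matches_df` helper and its column names are mine, not the article's):

```python
import pandas as pd

def get_matches_df(sparse_matrix, name_vector):
    """Unpack a sparse similarity matrix into a DataFrame of candidate pairs."""
    coo = sparse_matrix.tocoo()  # COO format exposes row, col and data arrays directly
    return pd.DataFrame({
        'left_side': name_vector[coo.row],
        'right_side': name_vector[coo.col],
        'similarity': coo.data,
    })

# Example usage, assuming tf_idf_matrix and org_names from above:
# matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.85)
# matches_df = get_matches_df(matches, org_names)
# matches_df = matches_df[matches_df['left_side'] != matches_df['right_side']]  # drop self-matches
```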

Very impressive, but how fast is it? Let us compare against a more traditional method of computing close matches using the ‘fuzzywuzzy’ library:

```python
!pip install fuzzywuzzy
import time
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

t1 = time.time()
print(process.extractOne('Ministry of Justice', org_names))  # org_names is our list of organisation names
t = time.time() - t1
print("SELFTIMED:", t)
print("Estimated hours to complete for full dataset:", (t * len(org_names)) / 60 / 60)
```

Outputs:

SELFTIMED: 3.7s
Estimated hrs to complete for full dataset: 3.78hrs

Yikes, the baseline time for this using traditional methods is around 3.7 hrs.

How long did it take our algorithm to work its magic?

```python
import time

t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.85)
t = time.time() - t1
print("SELFTIMED:", t)
```

Outputs:

SELFTIMED: 0.19s

Wow — we have reduced the estimated time from 3.7 hrs to around a fifth of a second (an approximate 66,000X speed-up!).

Record linkage and a different approach

If we want to use this technique to match against another data source then we can recycle the majority of our code.

In the below section we will see how this is achieved and also use the K Nearest Neighbour algorithm as an alternative closeness measure.

The dataset we would like to join on is a set of ‘clean’ organisation names created by the Office for National Statistics (ONS):

The clean data set we would like to join against.

As can be seen in the code below, the only difference in this approach is that we transform the messy data set using the tf-idf matrix which has been learned on the clean data set.

The ‘getNearestN’ function then uses Scikit’s implementation of K Nearest Neighbours to find the closest matches in the dataset:

```python
import re
import time
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

clean_org_names = pd.read_excel('Gov Orgs ONS.xlsx')
clean_org_names = clean_org_names.iloc[:, 0:6]

org_name_clean = clean_org_names['Institutions'].unique()

print('Vectorizing the data - this could take a few minutes for large datasets.')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(org_name_clean)
print('Vectorizing completed.')

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

org_column = 'buyer'  # column to match against in the messy data
unique_org = set(names[org_column].values)  # set used for increased performance

### matching query:
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

t1 = time.time()
print('getting nearest n.')
distances, indices = getNearestN(unique_org)
t = time.time() - t1
print("COMPLETED IN:", t)

unique_org = list(unique_org)  # need to convert back to a list

print('finding matches.')
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), clean_org_names.values[j][0][0], unique_org[i]]
    matches.append(temp)

print('Building data frame.')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
print('Done')
```

This produces the following results:

Not all items exist across the data sets.

Thankfully the closeness score can be used to apply a threshold to what counts as a positive match.

As can be seen from this sample, not all items appear across both data sets; however, the K Nearest Neighbour algorithm will still have a go at finding the closest match.

As such, we would need to apply a threshold to the closeness score to determine what counts as a match.
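A minimal sketch of such a filter, reusing the `matches` data frame built above (the 0.6 cut-off is purely illustrative and would need tuning against a labelled sample):

```python
# Distances below the cut-off are treated as genuine matches; anything above it
# is flagged as "no reliable match" rather than forced onto the nearest neighbour.
THRESHOLD = 0.6  # illustrative value only
confident_matches = matches[matches['Match confidence (lower is better)'] < THRESHOLD]
unmatched = matches[matches['Match confidence (lower is better)'] >= THRESHOLD]
```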

Matching the 3,651 entities to our clean data set (containing c. 3,000 entities) took less than a second using this method.

In summary, tf-idf can be a highly effective and highly performant way of cleaning, deduping and matching data when dealing with larger record counts.

Code, References and further reading

Please see the colab document below to view the full code for this post:

Fastmatch: Fast Fuzzy Matching (colab.research.google.com)

This article is inspired by the following post on Van Den Blog:

Super Fast String Matching in Python (bergvca.github.io)

Further information on tf-idf can be found in the below article:

How to process textual data using TF-IDF in Python (medium.com)
