Build a Scalable Search Engine for Fuzzy String Matching

Actually we match 9 n-grams exactly.

Lets just build a lookup dictionary then.

First we map all n-grams to all the indexes of the words:{“Ana”:[0, 1], …, “sst”: [1], “pap”:[1, 2], …}2.

Then build another lookup dictionary, now lets just match all indices to all other indices that appear together.

This is also possible by looping once through the first dictionary.

{0 : [0, 1], 1 : [0, 1, 2], 2 : [1, 2]}Now we have to compare 0 only with 0 and 1, but not 2.

And also we have to compare 2 only with 1 and 2.

Of course this computational saving is only minor.

But if the average size of the matches now go down to lets say 100 we save actually 99.

99% of computing time for calculating our distances (Levenshtein, Jaccard, Cosine).

There are some different properties you can use for matching:N-grams of different size lengths => n = {2,3,4,5,6,…}Words from string => “Christoph” from “Christoph Ostertag”Other properties from Name => Is student, has same customer ID, social security number, …Industry/Topic Bucket => Has some words in common to a certain bucket: {basketball, tennis} both in sportsPractical considerationsSearch Depth:We can only compare our string to other strings that our string is in direct relation with or we can increase our search depth to search for strings that have a relation with strings that our string has a relation to.

(String A -> Strings (Depth 1)-> Strings (Depth 2) )Overmatching — Reducing the Connectivity of the Graph:If we match our strings with too many other strings, removing the most common n-grams or other properties we match on can be useful.

Matching all AGs togehter might not be very useful.

Therefore we just define how often a match is to be allowed or which percentage of most occurring matches we want to drop.

ConclusionMaking fuzzy matching useful is possible, we just have to think about how to make it scalable and what we actually want to accomplish with it.

Do we just want to find typos or do we want to match companies or people that could be the same.

There is much we can do, and the discussion should be less on what is the best thing to do in general, but what business case we are trying to solve.

.. More details

Leave a Reply