Query Segmentation and Spelling Correction

In English, people generally type queries with the words separated by spaces, but sometimes this space is unintentionally omitted.

The process of finding these word boundaries is known as Query Segmentation.

For example, we can easily decipher “nutfreechocolates” as “nut free chocolates”, but a machine can’t unless we teach it to.

Here is a piece of evidence:

[Screenshot: results for an unsegmented search query]

Since we have not lost this capability, our brain does the segmentation intuitively and unconsciously.

But in the search domain, a query that is not segmented properly hurts both the precision and the recall of the results.

Let’s see how. E-commerce search engines, which are typically based on inverted indexes, should either have that query indexed directly or have used an analyzer (substrings, etc.) while indexing the product attributes that link to a particular list of products.

Even then, the problem persists, mainly affecting precision and recall.

If we segment the above query into three parts, we can do a boolean search [using AND] for the three query words, i.e. “nut” AND “free” AND “chocolates”, which returns the intersection of the three result lists, i.e. only the products contained in all three lists. Hence, precision improves.
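As a rough illustration with a made-up inverted index (the product IDs and posting lists below are hypothetical), the AND query reduces to an intersection of posting lists:

```python
# Hypothetical inverted index: token -> set of product IDs (posting list).
INDEX = {
    "nut":        {101, 102, 105, 109},
    "free":       {102, 103, 105, 111},
    "chocolates": {100, 102, 105, 113},
}

def boolean_and(tokens):
    # AND semantics: only products present in every token's posting list survive.
    postings = [INDEX.get(t, set()) for t in tokens]
    return set.intersection(*postings) if postings else set()

print(boolean_and(["nut", "free", "chocolates"]))   # {102, 105}
```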

The query might also have been misspelled, e.g. “nutfreechacolatas” or “skommedmilk”. Here, we can see that the spelling of chocolates is wrong.

But in order to correct this, the precondition is to identify the segments, i.e. the word boundaries, so that the spelling correction algorithm has a proper input in the first place.

[Screenshot: no results for an unsegmented query containing misspelled words]

Also, in machine-learning and NLP-based query understanding, we need to feed correctly segmented and correctly spelled queries to the model in order to identify the intent of the search query more accurately.

I implemented these algorithms mainly for this purpose.

Here, we see that the spelling correction algorithm depends on word segmentation.

So, let’s see how we can do segmentation first. Query segmentation can be achieved by dividing the string in every possible way. But wait: how many different compositions can exist with this naive approach? It’s exponential: for a string of length n, there are 2^(n−1) possible compositions, because there are (n−1) candidate boundaries.

The final list of valid compositions is decided by whether a composition consists of valid words, and it is ranked by the most likely combination of words in the query, i.e. the composition having the highest probability. I will explain later how this probability is calculated for each composition.

Now, with this naive approach, the query “nutfreechocolates” (length, n = 17) has 2¹⁶ = 65,536 compositions.

As the length of the query increases, this approach does not scale and consumes a lot of computing power.
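As a rough sketch of this naive enumeration (the function name and structure are my own, not from the referenced code):

```python
from itertools import combinations

def all_compositions(query: str):
    # Place any subset of the (n-1) possible boundaries: 2^(n-1) compositions.
    n = len(query)
    for k in range(n):                              # number of boundaries used
        for cuts in combinations(range(1, n), k):
            pieces, start = [], 0
            for cut in cuts:
                pieces.append(query[start:cut])
                start = cut
            pieces.append(query[start:])
            yield pieces

# "isit" (n = 4) yields 2^3 = 8 compositions, e.g. ['is', 'it'], ['i', 'sit'], ...
print(sum(1 for _ in all_compositions("isit")))     # 8
```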

In order to be more efficient, I analyzed two algorithms.

Dynamic Programming Approach: In the naive approach, we repeat the same calculation of finding the most likely segmentation of a prefix of the query again and again. This can be memoized to improve the complexity from exponential (O(2^n)) to polynomial (O(n²)).

Let’s understand through an example, for the query “isit” [total compositions = 2^3 = 8]. It can be segmented into 2, 3 or 4 parts:

4 parts [possible compositions = 3C3 = 1]:
"i" "s" "i" "t" = P(i|'')*P(s|i)*P(i|s)*P(t|i)

3 parts [possible compositions = 3C2 = 3]:
"i" "s" "it" = P(i|'')*P(s|i)*P(it|s)
"is" "i" "t" = P(is|'')*P(i|is)*P(t|i)
"i" "si" "t" = P(i|'')*P(si|i)*P(t|si)

2 parts [possible compositions = 3C1 = 3]:
"i" "sit" = P(i|'')*P(sit|i)
"is" "it" = P(is|'')*P(it|is)
"isi" "t" = P(isi|'')*P(t|isi)

where P(A|B) is the conditional probability of A occurring given that B has already occurred.

The bigram probability is introduced for more context.

Unigram probabilities can also be used to score a composition.

Here, all 8 compositions are computed with the brute-force method.

To avoid this with DP, every time before the program makes a recursive call to segment the remaining string, we check whether this exact substring has been segmented before. If so, we skip the recursive call and retrieve the optimal segmentation for that substring from a cache (where it was stored the first time that substring was segmented).

While calculating the probability score of each word, let’s assume that the word is valid and is present in the dictionary (if the word is not valid, then we need to apply Spelling Correction here).

The score of a word will be:

P(A) = freq(A) / sum of freq(all words)

and P(composition) = log(P(word1)) + log(P(word2)) + …

where the composition consists of word1, word2 and so on, because the naive Bayes probability of a composition is the product of the independent word probabilities, i.e. P(AB) = P(A) * P(B).

If a word is not present in the dictionary, then as per http://norvig.com/ngrams/ch14.pdf:

P(A) = 10 / (sum of freq(all words) * 10^len(A))

The code can be found here: https://github.com/grantjenks/python-wordsegment
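Here is a minimal sketch of the memoized recursion with the unigram scoring above. The frequency counts are hypothetical toy values; in practice they would come from a large corpus, such as the unigram file shipped with python-wordsegment:

```python
import math
from functools import lru_cache

# Hypothetical unigram counts; real counts would come from a corpus.
UNIGRAMS = {"nut": 300_000, "free": 2_500_000, "chocolates": 120_000,
            "is": 9_000_000, "it": 8_000_000}
TOTAL = sum(UNIGRAMS.values())

def word_score(word: str) -> float:
    # log10 P(word); unknown words are penalized by their length (Norvig, ch14).
    if word in UNIGRAMS:
        return math.log10(UNIGRAMS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text: str):
    # Best (log-probability, words) for `text`; each suffix is solved only once.
    if not text:
        return 0.0, []
    best_score, best_words = -math.inf, []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidate = word_score(head) + tail_score
        if candidate > best_score:
            best_score, best_words = candidate, [head] + tail_words
    return best_score, best_words

print(segment("nutfreechocolates")[1])   # ['nut', 'free', 'chocolates']
```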

Triangular Matrix Approach: Instead of memoizing all compositions of every prefix of the query, only the prefix composition with the highest probability is kept, overwriting any lower-probability one. This approach is explained in a much better way here: https://towardsdatascience.com/fast-word-segmentation-for-noisy-text-2c2c41f9e8da
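A minimal sketch of that idea, keeping only the best segmentation per prefix instead of caching every composition (this reuses the hypothetical word_score function from the previous sketch and is a simplification of the triangular-matrix algorithm described in the linked article):

```python
import math

def segment_best_prefix(text: str, word_score) -> list:
    # best[j] holds only the highest-scoring segmentation of text[:j];
    # anything with a lower probability is simply overwritten.
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for j in range(1, n + 1):
        for i in range(j):
            prev_score, prev_words = best[i]
            cand = prev_score + word_score(text[i:j])
            if cand > best[j][0]:
                best[j] = (cand, prev_words + [text[i:j]])
    return best[n][1]

# segment_best_prefix("nutfreechocolates", word_score) -> ['nut', 'free', 'chocolates']
```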

Spelling correction can also be done using many approaches, but they might be very costly.

Here, I will discuss two optimized approaches beyond the naive one.

Peter Norvig Approach: It consists of generating candidates up to n edit distances away by applying insert, delete, substitute and transpose operations over the given query in each iteration.

Then a language model calculates the score/probability of the candidates generated in the first step that are found in the dictionary, and an error model sends the words not found in the dictionary through one more iteration of insertion, deletion, substitution and transposition, after which the language model is applied again.

But this is very costly, as each edit iteration generates 54n+25 candidates for a word of length n.

The query “nutfreechacolatas” in our example has length 17. So, in the first iteration of calculating the edits, there will be 54*17 + 25 = 943 candidates. Again, in the second iteration over these candidates, the number of possible candidates grows to 54*943 + 25 = 50,947. So it is not going to scale for larger edit distances and long queries.
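A sketch of the candidate generation at edit distance 1, following Peter Norvig's well-known construction (a lowercase a-z alphabet is assumed):

```python
import string

def edits1(word: str) -> set:
    # All strings one edit away: deletes (n), transposes (n-1),
    # replaces (26n) and inserts (26(n+1)) -> 54n + 25 candidates in total.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

candidates = edits1("nutfreechacolatas")      # 17 characters: 54*17 + 25 = 943
print(len(candidates))                        # slightly less after deduplication
```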

Spelling Correction using Symmetric Deletes [SymSpell]: Here, the complexity is optimized by pre-processing the dictionary before processing queries online.

In a production scenario, this pre-processing takes time that depends on the size of the dictionary, so we can make the spelling correction class a singleton that loads the dictionary the first time its object is created.

The main idea is to generate candidates using only deletes, applied both to the dictionary entries and to the given query to be corrected, and meet in the middle.

The four operations are handled in the following ways:

a) Deletion and Insertion: deleting a character from a dictionary entry is equivalent to adding a character to the given query, and vice versa.

delete(dictionary entry, edit_distance) = input entry
dictionary entry = delete(input entry, edit_distance)

b) Substitution and Transposition: deleting one character from both the dictionary entry and the given query is equivalent to a substitution, and deleting two consecutive characters from both is equivalent to a transposition.

delete(dictionary entry, edit_distance) = delete(input entry, edit_distance)

In this approach, we still need to calculate the true edit distance between the candidates generated by deleting characters from the given query and the dictionary words also formed by deletion, because the edit distance may change when deletes happen simultaneously on both sides.

For example, for the dictionary word “bank”, pre-processing produces the delete “ban” [edit distance = 1]. If the given query is “xban”, deleting a character also yields “ban” [edit distance = 1], yet bank ≠ xban at that distance, because the true edit distance between them is 2.
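A minimal sketch of the delete-only candidate generation and the meet-in-the-middle lookup (the toy dictionary is made up, and the final edit-distance verification is only hinted at in a comment):

```python
def deletes_within(word: str, max_edit_distance: int = 2) -> set:
    # All strings reachable from `word` by deleting up to max_edit_distance chars;
    # deletion is the only operation SymSpell needs for candidate generation.
    results, frontier = {word}, {word}
    for _ in range(max_edit_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

# Pre-processing: index every delete of every dictionary entry (toy dictionary).
DICTIONARY = ["bank", "nut", "free", "chocolates", "skimmed", "milk"]
delete_index = {}
for entry in DICTIONARY:
    for d in deletes_within(entry):
        delete_index.setdefault(d, set()).add(entry)

def lookup(query: str, max_edit_distance: int = 2) -> set:
    # Meet in the middle: deletes of the query meet precomputed deletes of the
    # dictionary. Each hit must still be verified with a true edit distance,
    # because simultaneous deletes on both sides can understate the distance.
    candidates = set()
    for d in deletes_within(query, max_edit_distance):
        candidates |= delete_index.get(d, set())
    return candidates

print(lookup("banl"))   # {'bank'} -- "ban" is a shared delete of "banl" and "bank"
```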

There are several ways of calculating edit distances:

Hamming Distance: allows only substitution, so it applies only to strings of the same length.

Longest Common Subsequence (LCS): it allows only insertion and deletion.

Levenshtein Distance: allows insertion, deletion and substitution, but not transposition, i.e. the transition AC->CA is counted as edit distance 2 via two substitutions.

Damerau-Levenshtein Distance: allows all four operations we use to generate candidates. This algorithm has two variants:

a) Restricted Damerau-Levenshtein distance (optimal string alignment algorithm): an adjacent transposition counts as 1 edit, but substrings can’t be edited more than once: ed(“CA”, “ABC”) = 3, with 2 substitutions (CA->AC) and one insertion (AC->ABC).

b) True Damerau-Levenshtein distance: an adjacent transposition counts as 1 edit, and substrings can be edited more than once: ed(“CA”, “ABC”) = 2, with 1 transposition (CA->AC) and 1 insertion (AC->ABC).

The latter is the variant implemented here.
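For reference, here is a minimal sketch of the true Damerau-Levenshtein distance, a straightforward translation of the textbook algorithm rather than the optimized symspellpy implementation linked below:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    # True Damerau-Levenshtein: adjacent transposition counts as 1 edit and
    # substrings may be edited more than once.
    INF = len(a) + len(b)
    last_row = {}                               # last row where each char of `a` appeared
    d = [[INF] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0                            # last column in this row with a match
        for j in range(1, len(b) + 1):
            k, l = last_row.get(b[j - 1], 0), last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                             # substitution / match
                d[i + 1][j] + 1,                            # insertion
                d[i][j + 1] + 1,                            # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),    # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]

print(damerau_levenshtein("CA", "ABC"))   # 2 (1 transposition + 1 insertion)
```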

Code for calculating the true Damerau-Levenshtein distance can be found here: https://github.com/mammothb/symspellpy/blob/master/symspellpy/editdistance.py

There, instead of a 2-d DP matrix, two 1-d arrays whose size equals the length of the query are used.

One more level of optimization is done here in terms of space complexity.
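To illustrate this space optimization, here is a sketch of plain Levenshtein distance computed with only two 1-d rows (symspellpy applies the same rolling-array idea to the Damerau-Levenshtein variant):

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    # Only two rows of the DP matrix are kept instead of the full 2-d table.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            current[j] = min(
                previous[j] + 1,                    # deletion from `a`
                current[j - 1] + 1,                 # insertion into `a`
                previous[j - 1] + (ca != cb),       # substitution
            )
        previous = current
    return previous[-1]

print(levenshtein_two_rows("chacolatas", "chocolates"))   # 2
```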

A detailed overview can be found here: https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2

Overall, this algorithm is optimized and offers constant-time lookup.

All the possible deletes of the dictionary words are hashed and indexed with the list of their original words to make the search faster.

The dictionary has been created using three sources:
1. Brand names of the products
2. Display names of the products
3. The top n thousand previously searched keywords

For our use case, the dictionary consists of around 350k words, a prefix length of 7 is used for candidate generation, and lookup takes 5 ms on average for a query of average length 8 characters.

Thanks for reading.

Please do read the previous stories to know more, and keep watching this space for more updates.

References:
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
https://en.wikipedia.org/wiki/Levenshtein_distance
https://github.com/mammothb/symspellpy
https://towardsdatascience.com/symspellcompound-10ec8f467c9b
https://towardsdatascience.com/fast-word-segmentation-for-noisy-text-2c2c41f9e8da
https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078
https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
http://norvig.com/ngrams/ch14.pdf
