Improving Data Quality with Product Similarity Search

In addition, it can assist enhancing data information: If some product data lack certain attribute values, it is possible to suggest potential attributes that are found in its most similar products.ChallengesMulti-project product similarity search, i.e..searching for product matching in two different inventories, is challenging because the product information most likely presents very different data structures and values..That requires a preprocessing step that needs to be adjusted to common standards, ignoring for example an individual business’s specificities in data.ImplementationThe Similar Products API provides the opportunity for a store to scan their entire catalog for products that are similar to each other in terms of their data information..It leverages diverse product information: product name, description, attribute values, price and variant count..Therefore the system consists basically of smaller independent components, one for each information source, and each component calculates the similarity for the respective information source.Users can choose which of these data sources should be considered and even specify which features should have a higher influence on the similarity score, since the final output is a weighted average similarity score for each pair of products in examination..Furthermore, it provides the option to limit the search to specific products or product types.Data flow diagram for the comparison of two productsThe code is written in Python including methods from the most popular data science libraries: NumPy, scikit-learn, pandas, SciPy.Text similarity for names and descriptions:Product names and descriptions undoubtedly carry important information, but as with any NLP case, any text instance must be converted to a vector..Since both price and variant count data are numerical, the method is straightforward: we scale the values to a fixed standard range, typically [0,1], and use a numerical distance metric to calculate how dissimilar two values are.It should be noted that products typically contain more than one price values (e.g. a normal price, a seasonal price, some discount price, prices for different countries etc.) and these are set to a specific currency as well..Thus for the price comparison, we convert all values to the same currency and use the median of all the prices as a representative for the product variant.Mixed data similarity for attributes:Product attributes are special cases of data because they have mixed data types..But if the data is represented by a set of features that are numerical, nominal, boolean and even multi-valued (i.e. every instance described as a collection of values), then there is no common similarity metric to compare all those types in a meaningful way..For a better understanding consider the following table representing a sample of attributes of wine products.Small set of attribute values of wine productsWe distinguish the following attributes :numerical: contentsboolean: is_availablenominal: color, country, aciditymulti-valued: country_availability, foodsOur implementation of a distance metric which is capable to handle different types of variables is based on the Gower’s distance metric¹: we calculate the distances between two instances differently for each variable type and combine those in a final weighted distance score..The distance metrics for each data type are described in the following.For numerical attributes it is straightforward and similar to the price similarity case..Boolean attributes can be converted to numerical values (0 and 1) and be treated in the same way with a numerical distance metric, which will eventually give maximum similarity for identical values and zero otherwise.Categorical (also called nominal) data, like country and color, must be compared in a similar way: identical values should have the maximum similarity and zero if they are different..But this only after the attributes with entropy equal to 1 were removed, because that means that all items have unique values and thus the attribute is not adding any value in identifying similar items.Last, when two different inventories are compared and the overlap of matching attributes is small, as this accounts only partly for the similarity score, we make sure that this is reflected to the final score.Example response from the Similar Products APIThe feature is currently provided in beta testing phase.. More details

Leave a Reply