Machine learning for data cleaning and unification

Abizer Jafferjee · Apr 1

The biggest problem data scientists face today is dirty data.
When it comes to real world data, inaccurate and incomplete data are the norm rather than the exception.
The root of the problem is at the source, where data is recorded in ways that do not follow standard schemas or that break integrity constraints. The result is that dirty data gets delivered downstream to systems like data marts, where it is very difficult to clean and unify, making it unreliable for analytics.
Today data scientists often end up spending 60% of their time cleaning and unifying dirty data before they can apply any analytics or machine learning.
Data cleaning is essentially the task of removing errors and anomalies from data, or replacing observed values with true values, so that analytics can extract more value from it. There are traditional types of data cleaning, like imputing missing data and applying data transformations, and there are also more complex data unification problems, like deduplication and repairing integrity constraint violations.
All of these are inter-related, and it is important to understand what they are.
Data cleaning and unification problems

Schema mapping looks at multiple structured datasets and figures out whether they are talking about the same thing in the same way.
In the example below, do “building #” and “building code” both represent the building number?

Schema Mapping
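Real schema-mapping tools combine column names, data types, and instance values; as a deliberately naive sketch of the idea, the snippet below pairs up columns from two sources by name similarity alone (the column headers are hypothetical):

```python
import difflib

def match_columns(cols_a, cols_b, threshold=0.5):
    """Map each column in cols_a to its most similar name in cols_b (name similarity only)."""
    mapping = {}
    for a in cols_a:
        # Score every candidate column in the second source by string similarity.
        scored = [(difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio(), b) for b in cols_b]
        score, best = max(scored)
        if score >= threshold:
            mapping[a] = best
    return mapping

# Hypothetical headers from two sources describing the same buildings.
print(match_columns(["building #", "manager"], ["building code", "building manager"]))
# -> {'building #': 'building code', 'manager': 'building manager'}
```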
Record linkage is needed when multiple mentions of the same real-world entity appear across the data. Different formatting styles at each source lead to records that look different but in fact all refer to the same entity.
In the example below all four table records are referring to the same medical lab.
Record Linkage & Deduplication

Missing data refers to values that are missing from a dataset.
Missing value imputation is the process of replacing missing data with substituted values.
In practice, the problem is more complicated because missing data is often represented not by nulls but by garbage values, like in the example below.

Missing Data
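Before such values can be imputed, the garbage placeholders first have to be recognized as missing. A minimal sketch with pandas and scikit-learn, where the column names and placeholder strings are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "building":   ["Medical Lab", "Management", "N/A", "Medical Lab"],
    "floor_area": [1200.0, -1.0, 3400.0, np.nan],  # -1 used as a bogus sentinel
})

# Step 1: turn known garbage placeholders into real missing values.
df = df.replace({"N/A": np.nan, "unknown": np.nan, -1.0: np.nan})

# Step 2: impute (most frequent value for categoricals, median for numerics).
df["building"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["building"]]).ravel()
df["floor_area"] = SimpleImputer(strategy="median").fit_transform(df[["floor_area"]]).ravel()
print(df)
```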
Integrity constraints ensure that data follow functional dependencies and business rules that dictate what values can legally exist together.
Deducing constraints from data can be very difficult, especially since most data relations are non-obvious.
In the example below, Jane Smith is the building manager for both the Medical Lab and the Management building, which breaks the business rule. For the rule to hold, either Jane Smith is not the manager of one of the two buildings, or the medical lab is actually a management building, or the management building is actually a medical lab.
Integrity Constraint

Above we’ve seen a few of the many data quality challenges.
The problem is that most data scientists are employing rule-based tools and ETL scripts that handle each of the data quality issues in isolation.
In reality, most datasets, like the one in the figure below, suffer from most if not all of these data quality problems, and the problems interact in complex ways.
The problem is not just that these tools cannot handle the interaction between data quality issues. They also fail to scale on large datasets, because they are computationally heavy and require multiple passes before enough corrections have been made.
Most datasets contain several data quality issues (source)

Machine learning for data cleaning and unification

Considering the issues with current solutions, the scientific community is advocating machine learning solutions for data cleaning that consider all types of data quality issues holistically and scale to large datasets. Entity resolution is a good example of a data unification task where machine learning is useful.
The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization.
At the core of deduplication, we want to eliminate duplicate copies of repeated data.
With record linkage we aim to identify records that reference the same entity across different sources.
And canonicalization is where we convert data with more than one representation into a standard form.
These tasks can be framed as a “classification” and “clustering” exercise: with enough training data, we can develop models that classify pairs of records as matches or non-matches and then cluster the matched pairs into groups from which a golden record is chosen.
The figure below gives a visual representation of this process.

The deduplication process (source)
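As a rough sketch of the classification half, candidate record pairs can be turned into simple similarity features and fed to an off-the-shelf classifier trained on expert-labelled pairs. The records, fields, and labels below are invented, and a real training set would be far larger:

```python
import difflib
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Per-field string similarities for a pair of records."""
    return [
        difflib.SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        difflib.SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio(),
    ]

# Toy expert-labelled pairs: 1 = same real-world entity, 0 = different.
pairs = [
    ({"name": "St. Mary Medical Lab", "address": "12 Main St"},
     {"name": "St Mary's Med. Lab",   "address": "12 Main Street"}, 1),
    ({"name": "St. Mary Medical Lab", "address": "12 Main St"},
     {"name": "City Management Bldg", "address": "90 Oak Ave"},     0),
]
X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])
clf = LogisticRegression().fit(X, y)

# Score an unseen candidate pair with the learned matcher.
candidate = ({"name": "Saint Mary Medical Lab", "address": "12 Main St."},
             {"name": "St. Mary Medical Lab",   "address": "12 Main St"})
print(clf.predict_proba([pair_features(*candidate)])[0, 1])  # match probability
```

Pairs predicted as matches would then be clustered, for example by transitive closure, so that each cluster can be collapsed into a golden record.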
The Python dedupe library is an example of a scalable ML solution for performing deduplication and record linkage across disparate structured datasets.
To work effectively, dedupe relies on domain experts to label records as matches or non-matches.
Domain experts are important because they are good at recognizing which fields are most likely to uniquely identify a record and they can judge what a canonical version of a record should look like.
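Roughly, that workflow looks like the sketch below. The field definitions and records are made up, and the exact function names and field-definition syntax vary between dedupe releases, so treat this as an outline rather than copy-paste code:

```python
import dedupe

# Records keyed by an id, in the {record_id: {field: value, ...}} shape dedupe expects.
data = {
    1: {"name": "St. Mary Medical Lab", "address": "12 Main St"},
    2: {"name": "St Mary's Med. Lab",   "address": "12 Main Street"},
    3: {"name": "City Management Bldg", "address": "90 Oak Ave"},
}

# Which fields to compare, and how.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# A domain expert labels candidate pairs as duplicates / distinct in the console.
dedupe.console_label(deduper)
deduper.train()

# Cluster records into groups of likely duplicates, with confidence scores.
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)
```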
Record repair is another use of ML in data cleaning, and an important component of unification projects.
Repairing records is mainly about predicting the correct values of erroneous or missing attributes in a source data record.
The key is to build a model that combines signals such as integrity constraints, external knowledge and quantitative statistics for probabilistic inference of missing or erroneous values.
The HoloClean system is one such example of an ML solution where the user provides a dataset to be cleaned and some high-level domain-specific information such as reference datasets, available rules and constraints and examples of similar entries within the database.
The system then fixes errors ranging from conflicting and misspelled values to outliers and null entries.
In the example below, the user can specify denial constraints (first-order logic rules that capture domain expertise) such as City, State, Address -> Zip to identify that row 1 conflicts with the information in rows 2 and 3.

Repairing Integrity Constraint Violations (source)
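HoloClean compiles such signals into a probabilistic model and infers the most likely repairs. As a much simpler, hand-rolled illustration of the same denial constraint (not HoloClean itself), here is a pandas sketch over invented rows that flags conflicting Zip values and repairs them by majority vote:

```python
import pandas as pd

df = pd.DataFrame({
    "City":    ["Chicago", "Chicago", "Chicago"],
    "State":   ["IL", "IL", "IL"],
    "Address": ["3465 S Morgan St", "3465 S Morgan St", "3465 S Morgan St"],
    "Zip":     ["60608", "60609", "60609"],  # row 1 conflicts with rows 2 and 3
})
key = ["City", "State", "Address"]

# Denial constraint City, State, Address -> Zip: Zip must agree within each key group.
violations = df.groupby(key)["Zip"].transform("nunique") > 1
print(df[violations])

# A naive repair: trust the majority Zip within each conflicting group.
majority_zip = df.groupby(key)["Zip"].transform(lambda s: s.mode().iloc[0])
df.loc[violations, "Zip"] = majority_zip[violations]
print(df)
```

HoloClean's actual inference also weighs external reference data and statistical co-occurrence rather than a simple majority vote.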
Providing scalable ML solutions

Organizations using traditional approaches to data cleaning and unification have built up many legacy operations, including ETL scripts with business rules and domain-knowledge documentation. Ignoring all this work is the main reason many new solutions fail to improve on legacy operations.
A big ML challenge is how to ingest information from legacy operations to guide the learning process for ML solutions.
One option is to use the legacy operations to build training data, though such data may well be biased and noisy.
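One hedged illustration of the idea: treat the output of a legacy matching rule as weak, noisy labels that seed the training set for an ML matcher. The rule and records below are hypothetical, and such labels would still need de-biasing and spot-checks by domain experts:

```python
import re

def legacy_rule_match(a, b):
    """A typical hand-written ETL rule: exact match on normalized name plus zip."""
    normalize = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return normalize(a["name"]) == normalize(b["name"]) and a["zip"] == b["zip"]

records = [
    {"name": "St. Mary Medical Lab", "zip": "60608"},
    {"name": "St Mary Medical Lab",  "zip": "60608"},
    {"name": "City Management Bldg", "zip": "60614"},
]

# Weak labels for every candidate pair, to bootstrap the ML matcher's training data.
weak_labels = [
    (i, j, int(legacy_rule_match(records[i], records[j])))
    for i in range(len(records)) for j in range(i + 1, len(records))
]
print(weak_labels)  # [(0, 1, 1), (0, 2, 0), (1, 2, 0)]
```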
Since data cleaning and unification are usually performed at the source, the ML solutions need to be able to do inference and predictions on large-scale datasets.
Take the deduplication solution referred to earlier: the classification essentially relies on computing similarities between pairs of records, which is an n-squared problem, and therefore computationally heavy and slow.
Therefore, it becomes important to engineer solutions that use techniques like sampling and hashing to reduce this complexity.
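Blocking is the classic version of this trick: only records that share a cheap blocking key are ever compared, instead of all n*(n-1)/2 pairs. A minimal sketch, where the choice of zip code as the blocking key and the records themselves are just illustrative:

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {"name": "St. Mary Medical Lab", "zip": "60608"},
    2: {"name": "St Mary's Med. Lab",   "zip": "60608"},
    3: {"name": "City Management Bldg", "zip": "60614"},
    4: {"name": "City Mgmt Building",   "zip": "60614"},
}

# Group records by the blocking key; comparisons stay within each block.
blocks = defaultdict(list)
for record_id, record in records.items():
    blocks[record["zip"]].append(record_id)

candidate_pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
print(candidate_pairs)  # [(1, 2), (3, 4)] instead of all 6 possible pairs
```

Locality-sensitive hashing plays a similar role when no single obvious blocking key exists.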
It is important for people to be part of the process for any ML solution to work.
Domain experts and IT experts are essential for transferring legacy operations into useful features and labels that generate training data for ML solutions, for verifying the results of ML predictions on data quality, and for assessing the impact of ML-based cleaning and unification at the source on analytics downstream.
In conclusion, data cleaning and unification at the source are essential to create trustworthy analytics for organizations downstream.
It is important to recognize that data quality problems cannot be solved properly in isolation and machine learning solutions that offer holistic approaches to cleaning and unifying data may be the best solution.
At the same time, we must understand that in order to develop scalable ML pipelines that work at the organizational level, these solutions must build upon legacy operations and bring humans into the loop.
If you have more questions, please reach out to me on LinkedIn!