A Deep Dive Into Data QualityGary CBlockedUnblockFollowFollowingJan 3Introduction:Data quality is often seen as the unglamorous component to working with data.
Ironically, it’s usually the component that takes up the majority of our time.
Data quality might very well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboard and analysis generated from the data is useless.
The challenge with data quality is that there are no clear and simple formulas for determining if data is correct.
Also, Data that is correct today, may not be incorrect a month late.
Fortunately, there are some fundamental techniques and approaches that can be broadly applied when validating data quality.
Overall, the goal in measuring data quality is to find outliers that do not meet specific requirements and record sets that violate business assumptions.
Strategies:When working with moving data, data can be thought about in three separate layers: the ETL layer, the business layer, and the reporting layer.
The ETL layer contains the code for data ingestion and data movement between a source system and a target system (for example from the application database to the data warehouse).
The business layer sits between your raw ingested data and your final data models.
Finally, the reporting layer contains the dashboards from which business users can view and interact with.
In the following sections, we will cover the techniques that are applicable for each data layer.
The following examples are done in Postgres SQL.
I will be using two tables, a Clicks table and a Views table.
The clicks table represents data about users that click a link on a page.
The Views table represents data about users that visited a page.
Data Quality in the ETL layer:The first layer of a data pipeline is the ETL layer.
The data quality checks in this layer are usually similar regardless of the business needs and industry.
The goal here is to check to ensure that data is not lost or degraded while moving from the source to target system.
We check for things such as differences in row counts (showing data has been added or lost incorrectly), partially loaded datasets (usually with high null count), and duplicated records.
Example using SQL:Row CountsWith source_count as (Select count(*) as count from source) ,Target_count as (Select count(*) as count from target)Select CASE WHEN (select count from source_count) = (select count from target_count) THEN TRUE ELSE FALSE END as valid_joinIn the above example, we are looking for a scenario where the counts from the source, and target do not match.
In most instances, the number of rows should stay consistent between both table.
This is to ensure that transformations that occur between tables do not unexpectedly change row counts.
For instance, the number of users on your application should match between your application database and your data warehouse.
There are also instances where the number of rows are expected to change.
In this case, the goal should be to validate that counts within a reasonable expected range.
Join ValidationWITH join_count AS (SELECTcount(*) AS view_countFROM viewsLEFT JOIN clicks ON clicks.
view_id = views.
id)SELECTCASE WHEN (SELECT view_count FROM join_count) = count(1) THEN TRUEELSE FALSE END AS valid_joinFROMviews;In the example above, we are checking to ensure that the row count of the table after the left join is the same as the row count of the original table.
This is to ensure that there are no duplicate keys on either tables (this can usually be enforced by the database, but not when working with derived tables or views).
Again, this constraint is not always true as some joins are expected to cause the rows in the joined record to increase or decrease, in which case it is necessary to understand the expected range of values.
Image from TrifactaData Quality at the Business layer:The next layer of a data pipelines is the business logic layer.
Typically, data quality checks in this layer are applied after the raw/partially transformed data has been loaded to a staging area in your Datawarehouse.
You do not want to load the data into the final destination until these checks are passed and invalid rows are inspected.
The goal here is to ensure that fundamental business understandings are not violated and that the data makes business sense.
These business quality checks also act as a secondary checks or safeguard to ensure that the data has been brought in correctly during the ETL layer (as it can be extensive and impractical to validate every data quality measure in the ETL layer).
Business layer checks typically involve validating that numerical measures fall within a valid range as defined by business requirements.
SELECT count(*), yearFROMviews aJOIN date b on a.
date _id = b.
date_idGROUP BY 1,2,3SELECT count(*), year, monthFROMViews aJOIN date b on a.
date _id = b.
date_idGROUP BY 1 2,3One technique is to observe change in a given metric or measure over different time periods.
In this example, we are looking at views over different years as well as over different months.
We are trying to identify a sudden unexpected spike or drop in user views.
It is also necessary to distinguish between expected and unexpected changes.
For instance, this could be views for a e-commerce site, thus views increase during the holiday seasons.
— Validating the number of views should always be greater than the number of clicksSelectcase whencount(a.
num_view) >= count(b.
num_clicks)thenTrueElseFalseend as click_validationFROMviews aleft join clicks b on b.
click_id = a.
click_idAnother commonly seen validation is that change in values follow a certain trend or certain business rule.
In our case, the number of users who view a page should always be more or equal to then number of users who click on a link (since you can’t click a link before viewing the page first).
Since these checks are heavily reliant on business constraints, it is important to also ensure that the business rules and assumptions are properly documented either in the code and/or in a metadata repository.
Data Quality at the Reporting Layer:The reporting layer is the final layer of a data pipeline.
This is the layer that end users interact with your data.
However, just because this layer is usually the end of the pipeline, does not mean the reporting layer should not be utilized by engineers and analyst to ensure data quality.
One simple technique is to plot your data points on a graph and visually inspect for outliers.
It is pretty common to catch data quality mistakes visually that are not captured in your validation checks.
Sample DashboardAnother excellent use case for the reporting layer is to create data quality dashboards.
You can combine multiple data quality checks together and have these metrics/graphs sent out to you on a daily basis before send the data out to your executive/business users.
Conclusion:Fundamentally, data quality validations should be automated as much as possible.
Validations should be embedded into the data pipeline code, but in a way flexible enough to be effortlessly changed.
Depending on the criticality of the data and validation, you may want your pipeline to either fail completely or to flag the issue, move records into a separate reject area, and continue processing.
There are tools, such as Trifacta, that will help to simplify and automate many of these data quality checks.
Also, existing ETL tools, like Informatica, also have data quality checks built in.
However, it’s still important to understand how to implement data quality from scratch to ensure that the data quality checks you implement make sense.
When designing a data pipelines, data quality should be a driving factor that heavily influences the development effort.
Data engineers should have a level of familiarity of their data to be able to enhance and debug data quality issues.
It’s important to note that data quality should not the responsibility of a single team or individual.
In my opinion, it’s critical for engineers to be as familiar with the code as they are with the data they are working on.
This article gave a broad overview of data quality, techniques for how to monitor it, and strategies for actively working with it.
As we stated earlier, data quality is a critical component of a data system.
Investing the time and resources to work with data quality is important!.Data quality should be placed at the same level of importance as unit testing is to software development.
.