Why Bad Data Gets Into Your Perfectly Good Pipelines

6 causes of common data quality issues and how to fix them

Predicting how a data pipeline will break is like trying to predict the future; the near infinite number of outcomes makes it an exercise in futility.

Nonetheless, data engineers bravely peer into this vast vortex of possibility and set tests to monitor their pipelines’ reliability.

They might create data quality checks based on common failure points, past experiences, or how the tarot cards fell, but they cannot create a test for every outcome.

For example, one organization I worked with had key pipelines that fed directly into a few strategic customers. They monitored data quality meticulously, instrumenting more than 90 rules on each pipeline. Then something broke, and suddenly 500,000 rows went missing without triggering a single alert.
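To illustrate how that can happen, here is a minimal sketch (entirely hypothetical rules and data, not their actual checks): each rule below is valid on its own terms, yet every one of them passes even when nearly all the rows have gone missing, because none of them encodes an expectation about volume.

```python
# Hypothetical table: 500,000 rows expected today, only 10 arrived.
rows = [{"account_id": i, "amount": 100.0} for i in range(10)]

# Hypothetical rule suite -- each check validates the property it was
# written for, and nothing more.
checks = [
    lambda rs: all(r["account_id"] is not None for r in rs),   # no null keys
    lambda rs: all(r["amount"] >= 0 for r in rs),              # no negative amounts
    lambda rs: len({r["account_id"] for r in rs}) == len(rs),  # keys are unique
]

# Every check passes, even though ~all the rows are missing, because
# no rule asserts "how many rows should be here today".
assert all(check(rows) for check in checks)
```

The gap isn't that any individual rule is wrong; it's that the failure mode fell outside what anyone thought to test.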

So, How Does Bad Data Get Into Good Pipelines?

Because “a near infinite number of ways” is an unsatisfying answer for how bad data gets into good pipelines, I’ll outline a few real examples I’ve seen just in the last year before diving into solutions.

Authentication issues: One online networking company experienced data downtime when a Salesforce password expired, causing their salesforce_accounts_created table to stop updating. Another company discovered they were missing large swaths of data in key tables as a result of an authorization issue with Google AdWords.

Upstream dependencies: An online fitness company had a number of important reports tied to its Salesforce subscriber data, which was pulled into a variety of tables under certain conditions. These reports went down after a change to how the subscriber data was ingested stopped subsequent jobs from being triggered.

Source data delivery timing: One gaming company noticed drift in their new user acquisition data. The social media platform they were advertising on had changed its delivery schedule, sending data every 12 hours instead of every 24. The company’s ETLs were set to pick up data only once per day, so half of the campaign data being sent to them was never processed or passed downstream, skewing their new user metrics away from “paid” and toward “organic.”
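The scheduling mismatch can be sketched roughly like this (hypothetical timestamps and function names, not the company's actual ETL): a job that grabs only the most recent delivery silently skips drops that arrive mid-cycle, while a watermark-based load picks up everything since the last successful run.

```python
from datetime import datetime

# Hypothetical deliveries: the platform switched from one drop per day
# to one every 12 hours.
deliveries = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 12, 0),   # new mid-day drop
    datetime(2024, 1, 2, 0, 0),
    datetime(2024, 1, 2, 12, 0),
]

def load_latest_only(now):
    """Fragile: grabs only the most recent drop, as a daily ETL might."""
    return [max(d for d in deliveries if d <= now)]

def load_since_watermark(watermark, now):
    """Safer: picks up every drop since the last successful load."""
    return [d for d in deliveries if watermark < d <= now]

now = datetime(2024, 1, 2, 1, 0)
print(len(load_latest_only(now)))                                  # 1 -> mid-day drop skipped
print(len(load_since_watermark(datetime(2024, 1, 1, 0, 0), now)))  # 2 -> both drops loaded
```

The watermark variant is robust to the source changing its cadence, which is exactly the change that broke the pipeline above.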

Processing changes: A media company saw several customer subscription tables double in size as a result of an error in the system-of-record processing framework that treated “” as null.
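As a rough illustration of how empty-string-versus-null handling can inflate a table (hypothetical records and field names; the actual framework's logic isn't described in detail), deduplication treats "" and null as distinct values unless they are normalized first:

```python
# Two representations of the same subscriber record, differing only in
# whether a missing promo_code arrived as "" or as null.
records = [
    {"subscriber_id": "123", "promo_code": None},
    {"subscriber_id": "123", "promo_code": ""},
]

def normalize(r):
    # Coalesce empty strings to None so "" and null compare equal.
    return {k: (None if v == "" else v) for k, v in r.items()}

# Without normalization the two rows look distinct and both survive
# dedup, inflating the table; with it, they collapse to one.
raw_dedup = {tuple(sorted(r.items())) for r in records}
norm_dedup = {tuple(sorted(normalize(r).items())) for r in records}
print(len(raw_dedup), len(norm_dedup))  # 2 1
```

When a framework flips this behavior mid-stream, records that used to collapse into one suddenly survive as duplicates, and table sizes balloon.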

Data asset ownership drift: An ecommerce company transferred internal ownership of a machine learning module that labeled whether an item was out of stock and suggested alternatives. While the new owners were getting up to speed, the model ran on stale data and incorrectly flagged items as in or out of stock.

Human error: An online language learning company noticed one of their tables was not growing at the expected rate because an incorrect filter had been added to their ETL pipeline.

Had these issues not been caught immediately, consequences would have varied from uncomfortable meetings to customer attrition and direct revenue impact.

How to prevent bad data from entering good pipelines

Just because bad data and broken pipelines are inevitable doesn’t mean they can’t be mitigated proactively.

Move from data testing to data observability: Just as software engineering added observability tools like Datadog and New Relic to complement its testing processes, data engineering is now undergoing the same shift.

Testing accounts for “known unknowns,” while end-to-end data observability helps catch the unknown unknowns. These tools monitor data health in real time across five key pillars: freshness, volume, distribution, schema, and lineage.

All of the scenarios presented above were detected by these monitors and resolved before any damage was done. For example, a freshness monitor identified the authentication issues, a volume monitor identified the processing issue, and a schema monitor identified the human error.
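At its simplest, a freshness monitor boils down to comparing a table's most recent load against its usual cadence. Here is a hedged sketch (simplified and hypothetical; real observability tools typically learn the expected cadence from historical load patterns rather than taking it as a parameter):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, expected_interval, now=None,
             grace=timedelta(hours=1)):
    """Flag a table whose most recent load is older than its usual
    cadence plus a grace period -- the kind of check that would catch
    a table silently stopping its updates."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > expected_interval + grace

# A daily table last loaded two days ago is stale; one loaded 12 hours
# ago is not.
last_load = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_stale(last_load, timedelta(hours=24),
               now=datetime(2024, 1, 3, tzinfo=timezone.utc)))      # True
print(is_stale(last_load, timedelta(hours=24),
               now=datetime(2024, 1, 1, 12, tzinfo=timezone.utc)))  # False
```

An expired Salesforce password produces no error in the pipeline itself, only silence; a check like this turns that silence into an alert.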

Have an incident triage process in place and track it: When incidents occur, data teams need to understand who is responsible for fixing which data assets in what period of time as well as how to communicate to stakeholders.

Tracking is vital: you can’t improve what you don’t measure. Key metrics include the number of data incidents, time to identification, and time to resolution.
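Computing these metrics is straightforward once incidents are logged with timestamps. A minimal sketch (hypothetical incident records and field names; in practice the timestamps would come from your alerting and triage tooling):

```python
from datetime import datetime

# Hypothetical incident log: when each issue occurred, when the team
# identified it, and when it was resolved.
incidents = [
    {"occurred": datetime(2024, 1, 1, 9),
     "detected": datetime(2024, 1, 1, 10),
     "resolved": datetime(2024, 1, 1, 14)},
    {"occurred": datetime(2024, 1, 3, 8),
     "detected": datetime(2024, 1, 3, 8, 30),
     "resolved": datetime(2024, 1, 3, 9)},
]

def mean_hours(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600

tti = mean_hours([i["detected"] - i["occurred"] for i in incidents])  # time to identification
ttr = mean_hours([i["resolved"] - i["detected"] for i in incidents])  # time to resolution
print(len(incidents), round(tti, 2), round(ttr, 2))  # 2 0.75 2.25
```

Trending these numbers over time is what tells you whether your triage process is actually getting faster.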

Prioritize lineage and metadata: Data lineage empowers data teams to understand every dependency, including which reports and dashboards rely on which data sources, and what specific transformations and modeling take place at every stage.

When data lineage is incorporated into your data platform, potential impacts of changes can be forecasted and communicated to users at every stage of the data lifecycle to offset unexpected impacts. It also radically accelerates root cause analysis when issues occur.
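At its core, forecasting the impact of a change over lineage is a graph traversal: start at the asset being changed and walk every downstream edge. A minimal sketch with a hypothetical lineage graph and asset names:

```python
# Hypothetical lineage graph: edges point from a source asset to the
# assets built on top of it.
lineage = {
    "salesforce_accounts": ["stg_accounts"],
    "stg_accounts": ["dim_customers"],
    "dim_customers": ["revenue_dashboard", "churn_report"],
}

def downstream(asset, graph):
    """Walk the lineage graph to find every asset affected by a change
    to `asset`."""
    impacted, stack = set(), [asset]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(sorted(downstream("salesforce_accounts", lineage)))
# ['churn_report', 'dim_customers', 'revenue_dashboard', 'stg_accounts']
```

Run in reverse (parents instead of children), the same traversal powers root cause analysis: given a broken dashboard, it surfaces every upstream source that could be to blame.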

Plan for the inevitable

You cannot devise a test for every way data can break, but with a focus on end-to-end visibility you can catch issues sooner and reduce how many occur in your data pipelines.

About the Author

Will Robins is the head of customer success at Monte Carlo. Monte Carlo, the data reliability company, is the creator of the industry’s first end-to-end data observability platform. Monte Carlo works with data-driven companies such as Fox, Affirm, Vimeo, ThredUp, PagerDuty, and other leading enterprises to help them achieve trust in data.

Featured image: ©maximusdn