Data Quality Checks: How to Bake Them Into Your Pipeline

Bad data is one of the most insidious problems in data engineering. Unlike a pipeline failure — which is loud, immediate, and forces a response — bad data is often silent. The pipeline runs successfully. The dashboard loads. The numbers look plausible. And somewhere downstream, a business decision gets made on the basis of data that was wrong in a way nobody detected. By the time the error surfaces, it has already done its damage, and the forensic work of tracing it back to its origin is painful and time-consuming.

Data quality is not a reporting problem or an analytics problem. It is a pipeline problem, and the engineers building pipelines are the ones best positioned to address it. This post is about how to think about data quality systematically and how to build checks into your pipeline so that bad data is caught at the point of entry rather than at the point of consequence.

What Data Quality Actually Means

Data quality is an overloaded term that means different things depending on who is using it. For the purposes of pipeline engineering, it is useful to break it into five concrete dimensions.

Completeness asks whether the data you received contains everything it should. Are there nulls in columns that should never be null? Are there fewer rows than expected? Did the extraction capture the full dataset or did it silently truncate?

Validity asks whether the data conforms to expected formats and constraints. Are dates actually dates? Are numeric fields within plausible ranges? Are categorical fields constrained to the set of values the business recognizes?

Uniqueness asks whether records that should be distinct actually are. Duplicate rows in a fact table will inflate metrics. Duplicate keys in a dimension table will cause fan-out in joins and produce results that are wrong in ways that are surprisingly difficult to diagnose.

Consistency asks whether the same fact is represented the same way across sources. A customer ID that is numeric in one system and a UUID in another, a product name that is spelled differently across two data sources, a revenue figure that is calculated differently by the finance system and the CRM — these are consistency problems, and they undermine trust in data faster than almost anything else.

Timeliness asks whether the data arrived when it was supposed to. A pipeline that runs but loads data that is twelve hours stale may be technically successful while being practically useless. Freshness checks are the mechanism for catching this.

Most data quality failures are failures of one or more of these dimensions. Building quality checks into your pipeline means encoding expectations about each dimension and asserting them at the right points in the data flow.
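
To make this concrete, consider the timeliness dimension. A freshness expectation can be encoded as a query that returns a row only when the expectation is violated, so an empty result means the check passed. The sketch below assumes a hypothetical analytics.orders table with a loaded_at timestamp; the interval syntax varies by warehouse.

    -- Freshness check: flag the table if the most recent load is older than six hours.
    -- A returned row means the check failed; zero rows means the data is fresh enough.
    select
        max(loaded_at) as latest_load
    from analytics.orders
    having max(loaded_at) < current_timestamp - interval '6 hours'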

Where Quality Checks Belong

The most important architectural decision in data quality is where checks run. There are three natural positions in the pipeline: at ingestion, after raw loading, and after transformation.

Checks at ingestion — before data enters your warehouse — are the most defensive. If incoming data fails a critical check, you can reject the load entirely rather than letting bad data contaminate your warehouse. The downside is that this is the point where you know the least about the data. Checks at ingestion tend to be structural: is the schema what we expect, is the row count within a reasonable range, are required fields present.
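
If incoming batches are exposed to the warehouse through a landing table before they are accepted, a coarse structural check can be expressed as a single query. The sketch below is illustrative: the landing.orders_batch table, the order_id field, and the row-count bounds are all assumptions you would replace with your own.

    -- Ingestion-level sanity check: row count in a plausible range, required field populated.
    -- A returned row means the batch should be rejected before it is loaded.
    select
        count(*) as row_count,
        count(*) - count(order_id) as missing_order_ids
    from landing.orders_batch
    having count(*) not between 1000 and 10000000
        or count(*) - count(order_id) > 0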

Checks after raw loading operate on data that is already in your warehouse but has not yet been transformed. This is the right place for source-level quality assertions: uniqueness of primary keys in the raw table, absence of nulls in columns that the source system guarantees will be populated, validity of key field formats. These checks catch problems in the source data before they propagate into transformed models.
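
These assertions translate directly into queries that return the offending rows. The sketch below assumes a raw.customers table keyed on customer_id, with created_at guaranteed by the source; adjust the names to your own schema.

    -- Uniqueness: any customer_id that appears more than once is a violation.
    select customer_id, count(*) as occurrences
    from raw.customers
    group by customer_id
    having count(*) > 1;

    -- Completeness: columns the source system guarantees should never be null.
    select *
    from raw.customers
    where customer_id is null or created_at is null;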

Checks after transformation are where business logic validations live. Once data has been cleaned, joined, and modeled, you can assert things like referential integrity between dimension and fact tables, consistency of calculated metrics, and range constraints that only make sense in the context of the business. A revenue figure that is negative, an order that predates the customer’s account creation, a session duration of forty-eight hours — these are the kinds of anomalies that transformation-level checks are designed to catch.
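
Written as queries that return violating rows, those three examples might look like the following sketch; the mart table and column names are placeholders for illustration.

    -- Each business-rule check should return zero rows.
    select order_id, revenue
    from marts.fct_orders
    where revenue < 0;

    -- An order placed before the customer's account existed.
    select o.order_id
    from marts.fct_orders o
    join marts.dim_customers c on o.customer_id = c.customer_id
    where o.ordered_at < c.account_created_at;

    -- A session longer than forty-eight hours.
    select session_id, duration_minutes
    from marts.fct_sessions
    where duration_minutes > 48 * 60;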

The right answer is checks at all three positions, with the depth and specificity increasing as data moves through the pipeline. Early checks are coarse and fast. Later checks are fine-grained and encode business knowledge.

dbt Tests: The Foundation

For teams using dbt, the testing framework is the most accessible starting point for systematic data quality. dbt ships with four built-in generic tests that cover a large portion of common quality requirements: not_null, unique, accepted_values, and relationships.

The not_null test asserts that a column contains no null values. The unique test asserts that every value in a column is distinct. The accepted_values test asserts that every value in a column belongs to a defined set. The relationships test asserts referential integrity between two tables: that every value in a foreign key column exists in the referenced primary key column.

These four tests, applied consistently across your staging and mart models, will catch a substantial proportion of common data quality issues. The discipline of writing them is as important as the tests themselves. When you sit down to write a not_null test on a column, you are forced to articulate an expectation about the data. That act of articulation surfaces assumptions that were previously implicit and unverified.
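
Under the hood, each built-in test compiles to a query that selects violating rows, and the test passes when that query returns nothing. Roughly, a unique test on order_id and a not_null test on customer_id compile to the shape below; the exact SQL dbt generates varies slightly between versions.

    -- Approximate shape of dbt's unique test: non-null key values that appear more than once.
    select order_id, count(*) as n_records
    from analytics.stg_orders
    where order_id is not null
    group by order_id
    having count(*) > 1;

    -- Approximate shape of dbt's not_null test: rows where the column is missing.
    select customer_id
    from analytics.stg_orders
    where customer_id is null;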

Beyond the built-in tests, dbt’s generic test framework allows you to define custom tests that encode business rules. A custom test might assert that the sum of a revenue column in a fact table is within ten percent of the same figure reported by the finance system. Another might assert that the count of active users today is not more than fifty percent lower than yesterday — a threshold that would indicate a data pipeline problem rather than genuine user churn. These business-rule tests are the most valuable and the most underinvested part of most data quality suites.
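
The day-over-day check, for example, can be written as a dbt singular test: a plain SQL file that returns a row when the rule is violated and passes when it returns nothing. The daily_active_users model and its columns below are assumptions for illustration.

    -- Fail if today's active-user count is more than fifty percent below yesterday's.
    select
        today.activity_date,
        today.active_users as today_count,
        yesterday.active_users as yesterday_count
    from analytics.daily_active_users as today
    join analytics.daily_active_users as yesterday
        on yesterday.activity_date = today.activity_date - 1
    where today.activity_date = current_date
      and today.active_users < 0.5 * yesterday.active_users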

Great Expectations and Soda: Going Further

For teams that need more than dbt tests can provide — particularly around statistical quality checks, cross-source validation, and quality monitoring over time — tools like Great Expectations and Soda extend the quality framework significantly.

Great Expectations allows you to define rich expectation suites against datasets and run them as part of your pipeline. An expectation suite might assert not just that a column is not null, but that the proportion of nulls is below five percent — a more realistic constraint for messy source data. It might assert that the distribution of values in a column has not shifted significantly from the previous run, which catches the kind of gradual data drift that point-in-time checks miss.
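
For comparison, the proportion-based constraint is easy to state as an expectation and clumsy to state as a simple not_null test. Expressed directly in SQL for illustration (the raw.contacts table and email column are assumptions), the underlying logic is:

    -- Tolerate some nulls in a messy source column, but fail above a five percent threshold.
    select
        count(*) as total_rows,
        count(*) - count(email) as null_emails
    from raw.contacts
    having count(*) - count(email) > 0.05 * count(*)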

Soda takes a similar approach with a focus on SQL-based checks defined in a lightweight YAML syntax, making it accessible to teams comfortable with SQL who do not want to write Python-based expectation suites. Soda Cloud, its managed offering, adds quality monitoring dashboards and alerting on top of the check framework.

The value of these tools lies not just in their check capabilities but in their ability to produce a historical record of data quality over time. A check that fails today is an incident. A check that has been failing intermittently for three months is a pattern, and patterns reveal systemic problems in source systems, pipeline logic, or data contracts that point-in-time checks alone will not surface.

Quarantine Patterns

Not every quality failure should halt the pipeline. For pipelines that process high volumes of data continuously, blocking on every quality issue is often impractical and may cause more harm than the bad data itself. The quarantine pattern is a middle ground: rows that fail quality checks are routed to a separate quarantine table rather than rejected or passed through silently.

The quarantine table captures the failing rows alongside metadata about which check failed and why. The pipeline continues processing the rows that passed. The quarantine table is then reviewed — either manually or by automated remediation logic — and failing rows are either fixed and reprocessed or discarded with a documented reason.
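
A minimal sketch of the routing step, assuming illustrative staging, quarantine, and target tables: rows that fail validation land in quarantine with the reason recorded, and only the rows that pass move on.

    -- Route failing rows into quarantine, capturing why and when they failed.
    -- In practice you would often copy the full raw record as well, for reprocessing.
    insert into quality.orders_quarantine (order_id, failure_reason, quarantined_at)
    select
        order_id,
        case
            when customer_id is null then 'missing customer_id'
            when order_total is null then 'missing order_total'
            when order_total < 0 then 'negative order_total'
        end as failure_reason,
        current_timestamp
    from staging.orders
    where customer_id is null or order_total is null or order_total < 0;

    -- Load only the rows that passed every check.
    insert into analytics.orders
    select *
    from staging.orders
    where customer_id is not null and order_total is not null and order_total >= 0;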

This pattern is particularly useful for pipelines ingesting data from sources with known quality issues that cannot be immediately fixed at the source. Rather than accepting that bad data will flow through or stopping the pipeline entirely, quarantine contains the problem while keeping the pipeline operational. Over time, the quarantine table becomes a valuable diagnostic tool: the distribution of failure reasons tells you exactly where your upstream quality problems are concentrated and helps you prioritize remediation efforts.

Alerting and Ownership

Data quality checks are only as valuable as the response they trigger. A check that fails and produces a log entry that nobody reads is not a data quality check — it is a false sense of security. Building quality into your pipeline requires building an alert and response culture alongside the technical implementation.

Every quality check should have a defined owner — a person or team responsible for investigating failures and deciding on the appropriate response. Alerts should be routed to that owner through a channel they will actually see, with enough context to diagnose the problem without having to dig through logs: which check failed, on which table, with what error message, and how many rows were affected.
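
One way to make sure that context always travels with the alert is to write every check outcome to a small results table that the alerting layer reads from. The schema below is an illustrative assumption rather than a prescription.

    -- A minimal record of check outcomes, carrying the context an on-call owner needs.
    create table if not exists quality.check_results (
        check_name     varchar,
        target_table   varchar,
        failed_rows    bigint,
        error_message  varchar,
        owner          varchar,     -- the person or team paged when this check fails
        checked_at     timestamp
    );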

The response protocol matters as much as the alert itself. When a quality check fails, the team needs clear answers to three questions: should the pipeline be paused pending investigation, should downstream consumers be notified that data may be unreliable, and what is the remediation path? Having these answers documented in advance — rather than improvised during an incident — is the difference between a managed quality failure and a chaotic one.

Quality as Culture

The technical implementation of data quality checks is the easier part of the problem. The harder part is building the organizational culture in which data quality is treated as a shared responsibility rather than something the data team worries about in isolation.

Source system owners need to understand that their changes have downstream quality implications. Business stakeholders need to be partners in defining what correct data looks like, because many quality rules encode business knowledge that engineers do not have. Leadership needs to invest in the time required to build and maintain a quality suite, even when it does not ship a visible feature.

Data quality checks baked into the pipeline are the technical mechanism. But the culture that makes those checks effective — the shared sense that bad data is everyone’s problem and nobody’s acceptable outcome — is what determines whether they actually protect the business.
