Building Reliable ELT Pipelines from Scratch

There is a meaningful difference between a pipeline that works and a pipeline that is reliable. A pipeline that works moves data from point A to point B under normal conditions. A reliable pipeline does that — and also handles the abnormal conditions gracefully: the source API that goes down at 2am, the schema that changes without warning, the transformation that produces subtly wrong results because an upstream team changed how they define a key business metric. The gap between working and reliable is where most of the real engineering lives, and it is where most pipelines eventually fail their teams.

This post is about building ELT pipelines with reliability as a first-class concern from the beginning — not as an afterthought once things start breaking in production.

Understanding ELT Before Building It

The shift from ETL to ELT is not just a reordering of letters. It represents a different philosophy about where transformation should happen and who should own it.

In the old ETL model, data was transformed before it landed in the destination. This meant your pipeline code had to understand both the source system and the target data model simultaneously. Changes to either required changes to the pipeline. The transformation logic was often buried in proprietary tools that were hard to version, test, or collaborate on.

ELT flips this. You extract data from the source, load it raw into your data warehouse or lakehouse, and then transform it inside the warehouse using SQL. The pipeline’s job is simple: get the data there intact. The transformation layer — typically dbt or a similar tool — handles everything else. This separation of concerns is the foundation of the modern data stack, and it is what makes ELT pipelines easier to build reliably than their ETL predecessors.

Extraction: Getting Data Out Without Breaking Things

Extraction is where most pipelines first encounter the real world, and the real world is messy. Source systems — databases, APIs, SaaS platforms, event streams — were not designed with your pipeline in mind. They change, they rate-limit, they go down, and they sometimes return data that is subtly different from what their documentation promises.

The first decision in extraction is whether to use a managed connector or build your own. Tools like Fivetran, Airbyte, and Stitch handle a large catalogue of sources with pre-built connectors that manage authentication, pagination, rate limiting, and incremental syncing. For common sources — Salesforce, Stripe, Postgres, Google Analytics — a managed connector is almost always the right call. The engineering time you save is significant, and the connectors are maintained by teams whose entire job is keeping them working as source APIs evolve.

For custom or internal sources — internal microservices, proprietary databases, bespoke APIs — you will need to build your own extraction logic. When you do, a few principles matter enormously.

Prefer incremental extraction over full loads wherever the source allows it. A full load pulls every row from the source on every run. An incremental load pulls only what has changed since the last successful run, typically by filtering on a timestamp or a monotonically increasing ID. At small data volumes the difference is negligible. At large volumes, full loads become slow, expensive, and potentially disruptive to source systems that were not designed to handle bulk reads at arbitrary times.
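A minimal sketch of what this can look like against a Postgres-style source, assuming an updated_at column on the source table and a small persisted state store; the table, columns, and state key are illustrative:

```python
def extract_incremental(conn, state):
    """Pull only the rows that changed since the last successful run."""
    # conn: an open DB-API connection (e.g. psycopg2); state: a small
    # persisted dict holding per-table high-water marks.
    last_synced = state.get("users_last_synced", "1970-01-01T00:00:00+00:00")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, email, updated_at FROM users"
            " WHERE updated_at > %s ORDER BY updated_at",
            (last_synced,),
        )
        rows = cur.fetchall()
    if rows:
        # Advance the watermark in memory; persist it only after the
        # downstream load has succeeded, so failed runs re-extract.
        state["users_last_synced"] = rows[-1][2].isoformat()
    return rows
```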

Build your extraction layer to be idempotent. An idempotent operation produces the same result whether it runs once or ten times. If your pipeline fails halfway through and needs to retry, idempotency ensures you do not end up with duplicate data or partial loads in the destination. This usually means using upsert patterns — insert if the record does not exist, update if it does — rather than blind appends.
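Here is one way to express that in a load step, sketched against Postgres's ON CONFLICT syntax (most warehouses offer an equivalent MERGE statement); the target table and key are assumptions for illustration:

```python
def load_idempotent(conn, rows):
    """Upsert extracted rows so a retried run never creates duplicates."""
    with conn.cursor() as cur:
        cur.executemany(
            # Insert new records and overwrite existing ones keyed on id:
            # loading the same batch twice yields the same table state.
            "INSERT INTO raw.postgres_users (id, email, updated_at)"
            " VALUES (%s, %s, %s)"
            " ON CONFLICT (id) DO UPDATE SET"
            " email = EXCLUDED.email, updated_at = EXCLUDED.updated_at",
            rows,
        )
    conn.commit()
```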

Handle pagination and rate limits explicitly. APIs that return large datasets in paginated responses require your extractor to walk through every page before the extraction is complete. Rate limits require backoff logic — if the API tells you to slow down, your code needs to wait and retry rather than hammering the endpoint until it gets blocked. These are not edge cases; they are normal operating conditions for any API-based pipeline.
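A sketch of both concerns together, using the requests library against a hypothetical cursor-paginated endpoint; the response shape and header handling are assumptions, not any specific API's contract:

```python
import time
import requests

def extract_all_pages(base_url, token, max_retries=5):
    """Walk every page of a cursor-paginated API, backing off when rate-limited."""
    records, cursor, attempt = [], None, 0
    while True:
        resp = requests.get(
            base_url,
            params={"cursor": cursor} if cursor else {},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        if resp.status_code == 429:  # rate-limited: wait, then retry this page
            attempt += 1
            if attempt > max_retries:
                raise RuntimeError("rate-limit retries exhausted")
            # Honor Retry-After if the API provides it; otherwise back off
            # exponentially instead of hammering the endpoint.
            time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        resp.raise_for_status()
        attempt = 0  # reset the backoff counter after a successful page
        payload = resp.json()
        records.extend(payload["data"])
        cursor = payload.get("next_cursor")
        if cursor is None:  # no more pages: extraction is complete
            return records
```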

Loading: Raw and Untouched

The loading layer in ELT should do as little as possible. Its job is to move data from extraction into the raw layer of your warehouse faithfully, without modification. Resist the temptation to clean or filter data at this stage. Load everything. Load it raw.

This might feel wasteful, but it is enormously valuable in practice. When a downstream transformation produces unexpected results, you want to be able to go back to the raw data and understand what arrived from the source. If you filtered or transformed at load time, that debugging path is gone. Raw data is your audit trail, and storage is cheap enough that retaining it is almost always worth the cost.

Structure your raw layer with clear naming conventions. A common pattern is to namespace raw tables by source — raw.salesforce_accounts, raw.stripe_payments, raw.postgres_users — and to add metadata columns to every raw table: a loaded_at timestamp recording when the row was ingested, and a source_file or batch_id column that ties each row back to a specific extraction run. These columns cost almost nothing to add and save enormous amounts of debugging time later.
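A minimal sketch of stamping those columns at load time, assuming extracted rows arrive as dictionaries; the helper name and the per-run UUID batch ID are illustrative choices:

```python
import uuid
from datetime import datetime, timezone

def stamp_metadata(rows, batch_id=None):
    """Attach lineage columns so every raw row ties back to one extraction run."""
    batch_id = batch_id or str(uuid.uuid4())
    loaded_at = datetime.now(timezone.utc).isoformat()
    # Every row in a batch shares the same loaded_at and batch_id, which is
    # what lets you trace or replay a single bad extraction run later.
    return [{**row, "loaded_at": loaded_at, "batch_id": batch_id} for row in rows]
```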

Transformation: Where dbt Earns Its Place

Once raw data is in the warehouse, the transformation layer takes over. This is where your raw, messy, source-shaped data becomes the clean, business-shaped models that analysts and downstream systems rely on.

A layered transformation architecture is the most maintainable approach. The first layer — typically called staging — applies minimal cleaning to raw data: renaming columns to consistent conventions, casting data types, deduplicating records, and filtering out rows that are structurally invalid. Staging models are one-to-one with source tables. They do not join across sources or apply business logic. Their job is simply to make raw data workable.
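As an illustration, a staging model for the raw.stripe_payments table from earlier might look roughly like this in dbt; the column names and the configured source are assumptions:

```sql
-- models/staging/stg_stripe__payments.sql
-- One-to-one with raw.stripe_payments: rename, cast, dedupe.
-- No joins, no business logic.
with ranked as (
    select
        id as payment_id,
        cast(amount as numeric(18, 2)) as amount,
        lower(status) as payment_status,
        created as created_at,
        row_number() over (
            partition by id
            order by loaded_at desc
        ) as row_num
    from {{ source('stripe', 'payments') }}
)

select
    payment_id,
    amount,
    payment_status,
    created_at
from ranked
where row_num = 1  -- keep only the most recently loaded copy of each record
```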

The second layer — intermediate or marts, depending on your conventions — is where business logic lives. This is where you join across sources, apply business rules, calculate derived metrics, and build the models that answer actual business questions. Because this logic is written in SQL and version-controlled in dbt, it is auditable, testable, and collaborative in a way that transformation logic buried in a pipeline tool never is.

Testing is where many teams underinvest and then regret it. dbt’s built-in tests — not_null, unique, accepted_values, and relationships for referential integrity — are the minimum bar. They should run on every model in CI before changes are merged to production. Beyond the built-in tests, custom data tests that encode business rules — a revenue figure that should never be negative, a user count that should never decrease month over month — catch the subtle correctness issues that structural tests miss entirely.
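A custom data test in dbt is a SQL file that selects the rows violating the rule; dbt treats any returned row as a failure. A sketch of the never-negative-revenue rule above, against a hypothetical fct_revenue model:

```sql
-- tests/assert_revenue_never_negative.sql
-- dbt marks this test as failing if the query returns any rows.
select
    order_id,
    revenue
from {{ ref('fct_revenue') }}
where revenue < 0
```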

Error Handling and Observability

A reliable pipeline is one you can understand when it breaks. This requires intentional investment in observability from the start.

Every pipeline run should produce structured logs that record what was extracted, how many rows were loaded, how long each step took, and whether any errors were encountered. These logs should be queryable — storing them in your warehouse alongside your data means you can build dashboards on pipeline health the same way you build dashboards on business metrics.
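One shape this can take: write a structured record for every run into an operations table in the warehouse itself. The table name and columns here are illustrative, and conn is assumed to be an open DB-API connection:

```python
from datetime import datetime, timezone

def log_run(conn, source, rows_loaded, duration_s, error=None):
    """Record one pipeline run in the warehouse so health is queryable in SQL."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO ops.pipeline_runs"
            " (source, rows_loaded, duration_s, status, error, run_at)"
            " VALUES (%s, %s, %s, %s, %s, %s)",
            (
                source,
                rows_loaded,
                round(duration_s, 2),
                "failed" if error else "succeeded",
                str(error) if error else None,
                datetime.now(timezone.utc),
            ),
        )
    conn.commit()
```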

Alerting should be specific enough to be actionable. An alert that says “pipeline failed” is less useful than one that says “the Stripe extraction failed with a 429 rate limit error after processing 40,000 of 120,000 records.” The more context your alerts carry, the faster the on-call engineer can diagnose and resolve the issue.

Build retry logic into every extraction step, with exponential backoff and a maximum retry limit. Most transient failures — network timeouts, temporary API unavailability, brief database overload — resolve themselves if you wait and try again. Pipelines that fail immediately on the first error require constant manual intervention. Pipelines that retry intelligently recover on their own the majority of the time.
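A minimal sketch of that pattern as a Python decorator, with jitter added to the backoff; the exception types, attempt cap, and base delay are illustrative starting points:

```python
import random
import time
from functools import wraps

def with_retries(max_attempts=5, base_delay=1.0):
    """Retry transient failures with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == max_attempts:
                        raise  # out of retries: surface the error to alerting
                    # Double the delay each attempt; jitter avoids retry storms
                    # when many tasks fail at once.
                    time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
        return wrapper
    return decorator

@with_retries(max_attempts=4)
def extract_orders():
    ...  # call the source API here; transient errors are retried automatically
```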

The Discipline of Reliability

Building a reliable ELT pipeline is less about any single technical choice and more about a consistent set of disciplines applied at every layer: idempotent extraction, faithful raw loading, tested transformations, and observable operations. None of these are complicated in isolation. The challenge is applying all of them consistently, especially under the pressure to ship quickly.

The teams that build the most reliable pipelines are the ones who treat reliability as a design requirement — something that gets considered before the first line of code is written — rather than as a quality to be retrofitted after the first production incident.
