Incremental Loads vs. Full Loads: When to Use Which

If you have spent any time building data pipelines, you have encountered this decision more times than you can count. You are wiring up a new data source, and you need to decide how data moves from that source into your warehouse. Do you pull everything every time — a full load — or do you pull only what has changed since the last run — an incremental load? It seems like it should have an obvious answer, and sometimes it does. But the cases where it is genuinely not obvious are common enough, and the consequences of choosing wrong are significant enough, that the decision deserves careful thinking rather than a default assumption.

This post lays out the mechanics of both approaches, the conditions under which each one is appropriate, and the failure modes that catch teams off guard when they pick the wrong one.

Full Loads: Simple, Honest, Expensive

A full load does exactly what the name suggests. Every time the pipeline runs, it pulls the complete dataset from the source and replaces whatever was previously in the destination. Every row, every time, regardless of what changed.

The appeal of full loads is their simplicity. There is no state to manage, no cursor to track, no logic to handle edge cases around deletions or updates. The source and destination are always in sync after a successful run, by definition. If something goes wrong and data gets corrupted in the destination, the fix is straightforward: truncate the table and run again. Debugging is easy because there are no hidden assumptions about what was loaded previously.
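
To make the pattern concrete, here is a minimal sketch of a full load in Python. The sqlite3 connections, the countries table, and its columns are stand-ins for whatever your source, warehouse, and driver actually look like, and a real warehouse would typically use TRUNCATE rather than DELETE; the shape of the run is the point, not the specifics.

```python
import sqlite3

# A minimal full-load sketch. The sqlite3 connections, the countries table,
# and its columns are illustrative stand-ins; a warehouse driver would
# typically use TRUNCATE rather than DELETE, but the run has the same shape.
def full_load(source: sqlite3.Connection, dest: sqlite3.Connection) -> None:
    # Pull the complete dataset from the source on every run.
    rows = source.execute("SELECT code, name FROM countries").fetchall()

    # Replace the destination wholesale, then reload everything.
    # If a run fails or the table is corrupted, simply run it again.
    dest.execute("DELETE FROM countries")
    dest.executemany("INSERT INTO countries (code, name) VALUES (?, ?)", rows)
    dest.commit()
```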

Full loads are the right default when the dataset is small enough that pulling everything is fast and cheap, when the source system does not expose reliable change tracking, or when correctness is more important than efficiency and the two are in tension. Reference tables — lists of countries, product categories, currency codes — are natural candidates. They change infrequently, they are small, and the cost of always pulling everything is negligible. Configuration tables, dimension tables with modest row counts, and lookup datasets all belong in this category.

The problems with full loads emerge at scale. When a table has hundreds of millions of rows, pulling everything on every pipeline run becomes slow, expensive, and potentially disruptive. You are putting load on the source system for every run. You are transferring large volumes of data, most of which has not changed. You are consuming warehouse compute to replace data that was already correct. And you are extending the time window during which the destination table is being replaced, which can create consistency issues for downstream queries running during that window.

Full loads also destroy history. If a record in the source changes — a customer updates their email address, a product price is revised — a full load overwrites the previous state. If your analysis ever needs to know what the value was before the change, that information is gone. For many use cases this does not matter. For others it matters enormously.

Incremental Loads: Efficient, Complex, Fragile When Done Wrong

An incremental load pulls only the data that has changed since the last successful run. Instead of replacing the entire destination table, it appends new records and updates existing ones. The pipeline maintains a cursor — typically a timestamp or a maximum ID value from the last run — and uses it to filter the source query on the next execution.

The efficiency gains of incremental loading are substantial. A table with a billion rows might add a million new records per day. An incremental pipeline processes one million rows. A full load processes one billion. At that thousand-to-one ratio, the difference in cost, speed, and source system impact is not marginal — it is the difference between a pipeline that is operationally viable and one that is not.

Incremental loads are the right approach when datasets are large and growing, when the source exposes a reliable mechanism for identifying changes, and when pipeline latency matters — because incremental pipelines can run frequently, even continuously, without the overhead of processing the full dataset each time.

The complexity of incremental loading lives in three places: tracking state correctly, handling updates and deletions, and recovering from failures without producing incorrect data.

Tracking state means maintaining a reliable cursor that tells the pipeline where it left off. The most common cursor is a timestamp column — updated_at, modified_at, or similar — that records when each row was last changed. On each run, the pipeline queries for rows where the timestamp is greater than the last recorded cursor value, processes those rows, and then updates the cursor to the maximum timestamp seen in this run.
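
A rough sketch of that cursor loop is below, again in Python with sqlite3 standing in for the real source. The orders table, its updated_at column, and the pipeline_state table that stores the cursor are all hypothetical names; substitute your own schema and driver.

```python
import sqlite3

# A sketch of timestamp-cursor tracking. Table and column names are
# hypothetical; the logic (read cursor, filter, advance cursor) is the point.
def incremental_extract(source: sqlite3.Connection, state: sqlite3.Connection) -> list:
    # Read where the last successful run left off.
    row = state.execute(
        "SELECT cursor_value FROM pipeline_state WHERE pipeline = 'orders'"
    ).fetchone()
    last_cursor = row[0] if row else "1970-01-01T00:00:00"

    # Pull only the rows changed since the stored cursor.
    changed = source.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_cursor,),
    ).fetchall()

    if changed:
        # Advance the cursor to the maximum timestamp seen in this batch; in a
        # real pipeline, do this only after the batch has loaded successfully.
        new_cursor = max(r[2] for r in changed)
        state.execute(
            "UPDATE pipeline_state SET cursor_value = ? WHERE pipeline = 'orders'",
            (new_cursor,),
        )
        state.commit()
    return changed
```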

This works well when the timestamp column is reliable. It fails silently when it is not. Common failure modes include timestamps that are set by the application rather than the database — and therefore subject to clock skew across servers — and source systems where updates do not always advance the timestamp correctly. If your cursor is based on an unreliable timestamp, your incremental logic will quietly miss records, and you may not notice until a downstream analysis produces results that do not match expectations.

Handling updates is straightforward conceptually but requires the right destination pattern. If a source record is updated, the incremental load needs to update the corresponding record in the destination rather than creating a duplicate. This is typically handled with an upsert — merge the incoming record with the destination table on a primary key, updating if the key exists and inserting if it does not. All major cloud warehouses support merge operations, and dbt’s incremental models handle this pattern cleanly.
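
As a sketch of the upsert step, the snippet below uses SQLite's INSERT ... ON CONFLICT syntax to stand in for the MERGE statement a cloud warehouse or a dbt incremental model would issue. The orders table and its primary key on id are assumptions carried over from the cursor example above.

```python
import sqlite3

# A sketch of the upsert step: merge incoming rows into the destination on the
# primary key, updating when the key exists and inserting when it does not.
# SQLite's ON CONFLICT clause stands in for a warehouse MERGE statement.
def upsert_changes(dest: sqlite3.Connection, changed_rows: list) -> None:
    dest.executemany(
        """
        INSERT INTO orders (id, status, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT (id) DO UPDATE SET
            status = excluded.status,
            updated_at = excluded.updated_at
        """,
        changed_rows,
    )
    dest.commit()
```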

Deletions are where incremental loads most commonly fail teams. If a record is deleted from the source, an incremental pipeline based on a timestamp cursor will never see that deletion. The record simply stops appearing in the change feed. The destination table retains the deleted record indefinitely, silently diverging from the source. For some use cases — append-only event tables, for instance — this is fine. For others — customer records, product catalogs, any table where deletions are meaningful — it is a serious correctness issue.

The solutions to the deletion problem are imperfect. Some source systems expose a soft delete pattern, marking records as deleted with a flag rather than removing them, which makes deletions visible to incremental logic. Others expose a change data capture stream that explicitly records delete events. Where neither is available, a periodic full load reconciliation — running a full load once a week or once a month alongside the daily incremental loads — is a pragmatic fallback that catches accumulated drift.
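
A reconciliation pass for deletions can be as simple as comparing key sets, as in the sketch below. The orders table and its id key are again hypothetical, and on a large table you would push the comparison into the warehouse rather than pulling both key sets into memory.

```python
import sqlite3

# A sketch of a periodic reconciliation run that catches deletions a timestamp
# cursor can never see. Table and key names are assumptions.
def reconcile_deletes(source: sqlite3.Connection, dest: sqlite3.Connection) -> int:
    source_ids = {r[0] for r in source.execute("SELECT id FROM orders")}
    dest_ids = {r[0] for r in dest.execute("SELECT id FROM orders")}

    # Anything present in the destination but missing from the source was
    # deleted upstream and never surfaced through the incremental cursor.
    stale = dest_ids - source_ids
    dest.executemany("DELETE FROM orders WHERE id = ?", [(i,) for i in stale])
    dest.commit()
    return len(stale)  # how much drift had accumulated since the last check
```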

Change Data Capture: The Best of Both

Change data capture, or CDC, deserves mention as a more sophisticated approach to the incremental loading problem. Rather than querying the source database directly, CDC reads the database’s transaction log — the internal record of every insert, update, and delete that the database engine maintains for its own recovery purposes. Tools like Debezium, Airbyte’s CDC connectors, and Fivetran’s log-based replication use this mechanism to capture every change, including deletions, at the row level.

CDC is the most accurate and least intrusive form of incremental extraction. It does not put query load on the source database, it captures deletions natively, and it can achieve very low latency because it reads changes as they are committed rather than waiting for a scheduled query. The trade-off is operational complexity — log-based replication requires database-level configuration, appropriate permissions, and careful handling of schema changes — and not every source system supports it.
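
To illustrate what consuming such a stream looks like, the sketch below applies a single row-level change event. The event shape loosely follows Debezium's op/before/after convention but is simplified, and the upsert and delete callbacks are placeholders for whatever your destination actually requires.

```python
from typing import Callable

# A sketch of applying one row-level change event from a CDC stream. The event
# dict is a simplified, Debezium-style shape; upsert_row and delete_row are
# placeholder callbacks, not part of any real library.
def apply_change_event(
    event: dict,
    upsert_row: Callable[[dict], None],
    delete_row: Callable[[dict], None],
) -> None:
    op = event["op"]  # "c" = insert, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        # Inserts, updates, and snapshot rows all merge on the primary key.
        upsert_row(event["after"])
    elif op == "d":
        # Deletions arrive explicitly, unlike in a timestamp-cursor pipeline.
        delete_row(event["before"])
```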

Making the Decision

The decision framework is simpler than it might appear. Start with full loads by default for any table where the complete dataset is small enough to move comfortably within your pipeline’s scheduling window and budget. Migrate to incremental loading when table size or change volume makes full loads impractical, when pipeline frequency requirements exceed what full loads can support, or when source system load is a concern.

When you implement incremental loading, invest time in validating your cursor logic against the actual behavior of the source system. Do not assume that an updated_at column is maintained reliably — check it. Build periodic reconciliation runs into your pipeline design from the beginning, so that incremental drift can be detected and corrected before it compounds into a serious data quality problem.
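
One lightweight form of that reconciliation is a scheduled drift check. The sketch below only compares row counts for a hypothetical orders table and fails loudly when they diverge; real checks might also compare checksums or the newest timestamps on each side.

```python
import sqlite3

# A sketch of a drift check that runs on a schedule alongside the incremental
# pipeline. Table name and tolerance are illustrative.
def check_drift(
    source: sqlite3.Connection, dest: sqlite3.Connection, tolerance: int = 0
) -> None:
    src_count = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    dst_count = dest.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    # Fail loudly so the divergence is investigated before it compounds.
    if abs(src_count - dst_count) > tolerance:
        raise RuntimeError(
            f"orders drift: source={src_count} destination={dst_count}"
        )
```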

And document the approach clearly for every table in your pipeline. The engineer who inherits your work six months from now — or the version of you who has forgotten the details — needs to know not just how the pipeline works but why it was designed the way it was. The decision between full and incremental is one of those choices that is obvious in context and deeply puzzling without it.
