There is a version of data engineering that looks like this: a skilled analyst who learned Python, discovered that they could write scripts to move data around, and gradually took on more responsibility for the pipelines that fed their team’s dashboards. They are good at SQL. They understand the business data deeply. They can debug a broken pipeline and usually figure out what went wrong. But their code has no tests. Their pipelines have no documentation. Their deployment process is manually running a script on a laptop and hoping nothing breaks. Their version control strategy is a folder called “final_v3_actually_final.”
This is not a criticism of the individual. It is a description of how data engineering as a discipline evolved — out of analytics rather than out of software engineering — and it explains why so many data teams are now sitting on fragile, undocumented, untestable infrastructure that only one or two people fully understand. The discipline grew faster than its engineering foundations, and the gap between how data pipelines are built and how software systems are built remains wider than it should be.
Closing that gap is one of the most important things a data engineering team can do. Not because software engineering practices are inherently superior, but because data pipelines are software, and the practices that make software reliable, maintainable, and collaborative apply to pipelines with equal force.
The False Distinction
Part of the resistance to software engineering practices in data work comes from a perception that the two disciplines are fundamentally different. Software engineers build products. Data engineers build infrastructure. Software engineers work with well-defined requirements. Data engineers work with messy, evolving business questions. Software engineering practices — unit tests, code review, CI/CD, modular design — were developed for one context and may not translate cleanly to another.
This perception contains a grain of truth and a large amount of motivated reasoning. It is true that data engineering has unique characteristics. Data pipelines operate on external systems that cannot be easily mocked. The correctness of a transformation is often only verifiable against real data. Business logic changes frequently and sometimes without warning. These are genuine differences that affect how software engineering practices should be applied.
But they are not differences that make software engineering practices inapplicable. They are differences that require thoughtful adaptation. The answer to “unit testing is hard because we depend on external databases” is not “we will not test” — it is “we will find appropriate testing strategies for our context.” The answer to “our requirements change frequently” is not “we will not document” — it is “we will document in a way that is easy to update.”
The false distinction between data engineering and software engineering is comfortable because it excuses teams from the discipline that software engineering demands. Abandoning that comfort is the first step toward building infrastructure that holds up over time.
Version Control Is Not Optional
The most basic software engineering practice — version controlling your code — is still not universal in data engineering, and this is remarkable given how long version control has been available and how well-understood its benefits are.
A pipeline that exists only as a script on someone’s laptop, or in a shared folder on a network drive, or as a query saved in a BI tool’s interface, is a pipeline that cannot be audited, cannot be reviewed, cannot be rolled back when something goes wrong, and cannot be collaborated on without the risk of one person’s changes silently overwriting another’s. These are not theoretical risks. They are things that happen regularly in data teams that have not adopted version control as a non-negotiable baseline.
The answer is straightforward: everything goes in git. Pipeline code, dbt models, SQL transformations, configuration files, infrastructure definitions — all of it. Not because git is magic, but because version control is the foundation on which every other software engineering practice is built. Code review requires a shared repository. CI/CD requires a repository to trigger from. Rollback requires a history of changes. Documentation is more trustworthy when it lives alongside the code it describes and changes are tracked together.
The cultural shift required is minimal. The tools are free and well-documented. The resistance to it, where it exists, is almost always inertia rather than a principled objection.
Testing Data Pipelines
Testing is the software engineering practice that data teams most frequently skip and most frequently regret skipping. The reasons are understandable — writing tests takes time, testing pipelines against real data is awkward, and the feedback loop between writing a pipeline and seeing whether it works is shorter in data work than in application development. None of these reasons justify the absence of testing. They just explain it.
Testing in data engineering operates at multiple levels. Unit tests validate individual transformation logic in isolation — a function that parses a date string, a SQL expression that calculates a margin percentage, a deduplication key definition. These tests are fast, cheap to write, and catch the kind of logic errors that are easy to introduce and hard to spot in production data.
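To make that concrete, here is a minimal sketch of what unit-testable transformation logic can look like, assuming Python with pytest; the function names and business rules are invented for illustration, not taken from any particular codebase:

```python
# Minimal sketch of unit tests for transformation logic, assuming pytest.
# parse_order_date and margin_pct are hypothetical examples, not a real API.
from datetime import date


def parse_order_date(raw: str) -> date:
    """Parse a source-system date string such as '2024-03-07'."""
    return date.fromisoformat(raw.strip())


def margin_pct(revenue: float, cost: float) -> float:
    """Margin as a percentage of revenue; zero revenue returns 0.0 instead of raising."""
    if revenue == 0:
        return 0.0
    return round((revenue - cost) / revenue * 100, 2)


def test_parse_order_date_tolerates_whitespace():
    assert parse_order_date(" 2024-03-07 ") == date(2024, 3, 7)


def test_margin_pct_typical_case():
    assert margin_pct(200.0, 150.0) == 25.0


def test_margin_pct_handles_zero_revenue():
    assert margin_pct(0.0, 50.0) == 0.0
```

Tests like these run in milliseconds, need no database connection, and catch exactly the kind of silent logic error that is hard to spot by eyeballing production output.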
Integration tests validate that components work together correctly — that a staging model correctly surfaces the data that a downstream mart model expects, that a pipeline that extracts from a source and loads to a destination preserves the row count and key fields accurately. These tests are slower and more expensive but catch a different class of errors that unit tests cannot.
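One possible shape for such a test, using an in-memory SQLite database as a stand-in for the real source and destination; the table, column, and function names are illustrative:

```python
# Sketch of an integration test; SQLite stands in for the real source and
# destination systems, and the schema is invented for illustration.
import sqlite3


def load_orders(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    """Copy orders from source to destination, deduplicating on order_id."""
    rows = src.execute(
        "SELECT order_id, customer_id, amount FROM raw_orders GROUP BY order_id"
    ).fetchall()
    dst.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, customer_id TEXT, amount REAL)"
    )
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


def test_load_orders_preserves_keys_and_row_count():
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT, amount REAL)")
    src.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [("o1", "c1", 10.0), ("o2", "c1", 25.5), ("o2", "c1", 25.5)],  # note the duplicate
    )

    load_orders(src, dst)

    loaded = dst.execute("SELECT order_id FROM orders ORDER BY order_id").fetchall()
    assert [row[0] for row in loaded] == ["o1", "o2"]  # keys survive, duplicates collapse
```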
Data quality tests — the dbt tests and Great Expectations checks discussed in an earlier post — are a form of acceptance testing: assertions about the properties of the output data that the pipeline should always satisfy. Together, these three layers of testing create a safety net that allows engineers to make changes with confidence rather than anxiety.
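In dbt, that acceptance-testing layer is declared in a properties file that sits next to the models. A rough sketch, with the model and column names invented for illustration:

```yaml
# Sketch of dbt data quality tests in a properties file (e.g. a schema.yml
# under models/); the model and column names are illustrative.
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned', 'cancelled']
```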
The discipline of writing tests also changes how pipelines are designed. Code that is hard to test is almost always code with poor separation of concerns — transformation logic mixed with I/O, business rules buried inside orchestration code, functions that do too many things. Writing tests forces the modularity that makes code maintainable over time. The test suite is not just a quality mechanism. It is a design tool.
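A small sketch of the shape this tends to produce, with hypothetical names: the transformation is a pure function, and I/O is injected at the edges so a test can substitute fixtures for the warehouse.

```python
# Sketch of the separation that testing tends to force: pure transformation
# logic, I/O injected at the edges. All names are illustrative.
from typing import Callable, Iterable


def normalize_customers(rows: Iterable[dict]) -> list[dict]:
    """Pure transformation: no connections, no files, trivially unit-testable."""
    seen, out = set(), []
    for row in rows:
        if row["customer_id"] in seen:
            continue  # deduplicate on the business key
        seen.add(row["customer_id"])
        out.append({**row, "email": row["email"].strip().lower()})
    return out


def run_customers_pipeline(
    extract: Callable[[], Iterable[dict]],
    load: Callable[[list[dict]], None],
) -> None:
    """Thin orchestration wrapper: I/O is passed in, so tests can hand it fakes."""
    load(normalize_customers(extract()))
```

In a test, `extract` can simply return a list of fixture rows and `load` can append to a list; in production, the same two callables talk to the source system and the warehouse.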
Code Review and Collaborative Ownership
In most software engineering teams, no code ships to production without being reviewed by at least one other engineer. The review serves multiple purposes: it catches bugs that the author missed, it shares knowledge across the team, it enforces consistency of style and approach, and it creates a culture of shared ownership where the codebase belongs to the team rather than to individuals.
Data engineering teams frequently skip this practice, and the consequences are predictable. Knowledge becomes siloed in the engineers who built specific pipelines. Inconsistent patterns accumulate across the codebase — one pipeline handles errors this way, another handles them a different way, and new engineers have no clear standard to follow. Changes that break things get merged because there was no second pair of eyes to ask the obvious questions.
Code review in data engineering looks the same as it does in software engineering: pull requests, inline comments, explicit approval before merging. The content of a data engineering review includes the same things a software review covers — logic correctness, edge case handling, test coverage — plus data-specific concerns: is the transformation semantically correct given the business definition, are the dbt model dependencies appropriate, does this pipeline handle the failure modes we know exist in this source.
The time investment in code review is real. The time saved by catching problems before they reach production, and by distributing knowledge across the team rather than concentrating it in individuals, is larger. Teams that genuinely adopt code review for data pipelines consistently find that the initial friction pays for itself within a few months.
Documentation as Engineering Practice
Documentation in data engineering is chronically undervalued, partly because it is not glamorous and partly because it decays — documentation written at the time a pipeline is built becomes stale as the pipeline evolves, and stale documentation is arguably worse than no documentation because it actively misleads.
The software engineering answer to documentation decay is to make documentation part of the development workflow rather than a separate activity. dbt’s model documentation — descriptions at the model level and column level, written in YAML alongside the model definition — gets updated when the model is updated because the two sit side by side in the same project and move through the same pull request. A README that lives in the same repository as the code it describes is more likely to be maintained than one that lives in a separate wiki that nobody remembers to update.
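For reference, a sketch of what that dbt YAML can look like; the model, columns, and descriptions are invented for illustration:

```yaml
# Sketch of dbt model documentation in a properties file kept next to the
# model's SQL; names and descriptions are illustrative.
version: 2

models:
  - name: fct_orders
    description: >
      One row per completed order. Excludes test orders and orders cancelled
      before payment; gross margin follows the finance team's definition.
    columns:
      - name: order_id
        description: "Primary key of the order; unique and not null."
      - name: gross_margin_pct
        description: "Gross margin as a percentage of net revenue."
```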
The standard to aim for is this: a new engineer joining the team should be able to understand what any pipeline does, why it exists, what data it produces, and how to run it without asking another team member. If that bar cannot be met, the documentation is insufficient. The test is not whether documentation exists — it is whether it is complete enough to be genuinely useful to someone encountering the system for the first time.
The Professionalization of the Discipline
Data engineering is a young discipline, and young disciplines go through a period of professionalization where informal practices are replaced by rigorous ones. Software engineering went through this in the 1990s and 2000s — the adoption of agile methods, the spread of version control, and the rise of test-driven development were all part of a maturation process that made software teams meaningfully more productive and their outputs meaningfully more reliable.
Data engineering is in the middle of that process now. The engineers who will define what professional data engineering looks like are the ones building teams today, making decisions about how work gets done, and choosing whether to apply the lessons that software engineering learned the hard way or to repeat them.
The choice to think like a software engineer is not about abandoning what makes data engineering distinctive. It is about bringing the same rigor to pipelines and infrastructure that software engineers bring to applications — because the business depends on them equally, and they deserve to be built with equal care.