There is a pattern that repeats itself across data teams with enough regularity that it deserves to be named. A company hires a few ambitious data engineers, gives them a greenfield infrastructure project, and tells them to build something modern. They read the right blogs, attend the right conferences, and follow the right people on LinkedIn. Six months later, they have assembled a genuinely impressive stack: a streaming ingestion layer, a lakehouse on object storage, an orchestration platform with a sophisticated DAG structure, a feature store, a reverse ETL tool, a semantic layer, and a data catalog. The infrastructure is elegant. The architecture diagram looks like something you would see in a conference keynote.
And the business has five analysts who mostly need weekly sales reports.
This is over-engineering, and it is one of the most common and least discussed problems in data engineering today.
The Seduction of Sophistication
Before diagnosing the problem, it is worth being honest about why it happens. Over-engineering data stacks is not a sign of incompetence. In many cases it is a sign of competence misdirected. The engineers who build these stacks are often genuinely talented, genuinely well-intentioned, and genuinely excited about their craft. The problem is not ability — it is incentive alignment.
Data engineers are evaluated, hired, and promoted on the basis of technical sophistication. Job descriptions ask for Kafka, Spark, Airflow, dbt, and Kubernetes. Conference talks celebrate architectural complexity. Blog posts are written about building things, not about thoughtfully choosing not to build things. The professional incentives in data engineering point overwhelmingly toward more tooling, more complexity, and more architectural ambition, regardless of whether the business actually needs any of it.
At the same time, the modern data stack ecosystem has made it genuinely easy to add new tools. A new connector takes an afternoon. A new orchestration layer takes a sprint. Each individual addition feels justified in isolation. The cumulative weight of all those additions is what eventually becomes a liability.
The Real Cost of Complexity
When data teams talk about the cost of their stack, they usually mean the vendor bills. Snowflake compute credits, Fivetran monthly active row fees, dbt Cloud seats — these are real costs and they are worth managing. But they are not the most significant cost of an over-engineered stack. The most significant cost is the one that does not show up on any invoice.
Every tool in your stack adds surface area that needs to be understood, maintained, monitored, and upgraded. Every tool has failure modes that need to be learned, usually in production. Every tool has a learning curve that must be climbed by every new engineer who joins the team. Every tool has a vendor relationship, a contract, a renewal conversation, and a risk of deprecation or pricing changes.
The cognitive overhead of a complex stack is enormous and largely invisible. Engineers spend time managing infrastructure instead of solving business problems. Onboarding new team members takes longer because the surface area they need to understand is larger. Debugging becomes harder because failures can originate anywhere in a long chain of interconnected systems. Making changes becomes slower because the blast radius of any modification is harder to reason about.
This overhead compounds. A team of four engineers managing a ten-tool stack has very little capacity left for the analytical work that actually serves the business. Adding a fifth tool does not add twenty percent more capability — it adds twenty percent more maintenance burden while consuming capacity that could have gone toward solving real problems.
The Specific Mistakes
Over-engineering manifests in a few recurring patterns that are worth naming specifically.
Streaming when batch is fine. Real-time data infrastructure is significantly more complex and expensive to operate than batch infrastructure. It requires different tooling, different expertise, and different operational practices. It is absolutely the right choice when business decisions genuinely need to react to data within seconds or minutes. It is absolutely the wrong choice when the stakeholders who will consume the output check their dashboards once a day. A remarkable number of streaming pipelines have been built to feed dashboards that nobody looks at in real time. The engineers who built them were solving an interesting technical problem. They were not solving a business problem.
Orchestration complexity that exceeds pipeline complexity. Airflow is a powerful tool and, for large organizations with hundreds of interdependent pipelines, an appropriate one. It is also frequently adopted by teams with twelve pipelines who spend more time managing Airflow than they would have spent managing the pipelines directly. The orchestration layer should be proportionate to what it is orchestrating. For small to medium pipeline portfolios, simpler tools — dbt Cloud’s built-in scheduler, Prefect, or even cron — are often more than adequate.
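At small scale, "orchestration" can be little more than a dependency-ordered loop triggered by a single cron entry. A minimal sketch of the idea, using only the Python standard library (the pipeline names and functions here are hypothetical placeholders, not a real project):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical pipeline registry: name -> (callable, upstream dependencies).
# For a dozen pipelines, this dict plus one cron entry can stand in for a
# full orchestration platform.
def load_orders():    print("loaded orders")
def load_customers(): print("loaded customers")
def build_marts():    print("built marts")

PIPELINES = {
    "load_orders":    (load_orders,    set()),
    "load_customers": (load_customers, set()),
    "build_marts":    (build_marts,    {"load_orders", "load_customers"}),
}

def run_all():
    """Run every pipeline once, respecting dependency order."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in PIPELINES.items()})
    completed = []
    for name in sorter.static_order():  # upstreams always come first
        fn, _ = PIPELINES[name]
        fn()
        completed.append(name)
    return completed

if __name__ == "__main__":
    # Scheduled by a single cron line, e.g.: 0 6 * * * python run_pipelines.py
    run_all()
```

This buys none of Airflow's retries, backfills, or UI, which is exactly the point: those features should be adopted when the pipeline portfolio demands them, not before.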
Data catalogs nobody uses. The data catalog is the perpetually aspirational tool of the data stack. Every team knows they should have one. Most teams that implement one find, six months later, that engineers update it reluctantly and stakeholders consult it rarely. A catalog is only valuable if it is maintained, and maintenance requires cultural buy-in that most organizations have not established before they install the tool. A well-maintained README in a dbt project and clear naming conventions in the warehouse provide eighty percent of the value of a catalog at ten percent of the operational cost.
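Naming conventions have the further advantage of being mechanically enforceable. A minimal sketch of a convention check that could run in CI, assuming a hypothetical layer-prefix convention (the prefixes and table names are illustrative, not a standard):

```python
import re

# Hypothetical convention: every warehouse model is snake_case and carries
# a layer prefix (stg_ staging, int_ intermediate, fct_ fact, dim_ dimension).
NAME_PATTERN = re.compile(r"^(stg|int|fct|dim)_[a-z][a-z0-9_]*$")

def violations(table_names):
    """Return the table names that break the naming convention."""
    return [name for name in table_names if not NAME_PATTERN.match(name)]

# Example: run against model names pulled from the warehouse's catalog views.
tables = ["stg_orders", "fct_revenue", "CustomerData", "dim_customers"]
bad = violations(tables)  # flags "CustomerData"
```

A check like this, failing the build when a model breaks the convention, is the kind of lightweight enforcement that keeps a README-based approach trustworthy without any catalog tool at all.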
The feature store for a team with one model. Machine learning infrastructure is perhaps the area most prone to premature sophistication. Feature stores, model registries, experiment tracking platforms, and model serving layers are genuinely important tools for mature ML organizations with dozens of models in production. They are significant overhead for a team that has one model in production and is not sure whether it is actually being used.
What Proportionate Engineering Looks Like
The antidote to over-engineering is not under-engineering — it is proportionate engineering. The question is not “What is the best possible tool for this job?” but “What is the simplest tool that reliably solves this problem at our current scale, and what would need to be true for us to need something more sophisticated?”
A startup with a small data team and a handful of data sources probably needs a managed connector, a cloud data warehouse, and dbt. That is it. Those three tools, used well, can support sophisticated analytics for a business with millions of dollars in revenue and dozens of stakeholders. The temptation to add an orchestration layer, a data catalog, a reverse ETL tool, and a semantic layer before the fundamentals are solid is a trap.
Proportionate engineering also means making deliberate decisions about what not to build. Every new tool added to a stack should have to clear a bar: What specific problem does it solve? What is the cost of solving it this way versus the alternatives? What operational burden are we accepting by adding it? Most tools that get added to over-engineered stacks were never asked these questions. They were added because they seemed useful, because a conference talk made them sound exciting, or because a competing company was using them.
The Maturity Trap
There is a particular version of over-engineering that is worth calling out specifically: building for a scale you have not reached and may never reach. Data engineers read about how Netflix or Airbnb or Uber built their infrastructure and absorb the architectural patterns those companies developed to solve problems at enormous scale. Then they apply those patterns to organizations that are orders of magnitude smaller.
Netflix’s data infrastructure exists to serve billions of events per day across hundreds of millions of users. The patterns they developed — the microservice architecture, the sophisticated stream processing, the custom tooling — were responses to genuine problems at genuine scale. Applying those patterns to an organization processing ten million events per month is not aspirational engineering. It is solving problems you do not have with solutions that create problems you did not need.
The best data infrastructure is not the most sophisticated. It is the most appropriate — the one that solves today’s real problems simply enough that the team has capacity left over to anticipate and address tomorrow’s real problems before they become crises.
A Different Kind of Ambition
What data engineering needs more of is the ambition to say no. The ambition to look at a shiny new tool and ask whether it actually serves the business before adopting it. The ambition to simplify a stack that has grown beyond its team’s capacity to maintain it. The ambition to define success not by the impressiveness of the architecture but by the reliability of the outputs and the satisfaction of the people who depend on them.
The most impressive data teams are not the ones with the most tools. They are the ones who know exactly why every tool in their stack is there and could defend that choice against a skeptical audience. That discipline — that insistence on earning every layer of complexity — is rarer and more valuable than any individual technical skill.