The Hype Around Real-Time Data — Is It Worth It?

Real-time data is the most consistently overhyped concept in the data engineering industry. It has held that status for the better part of a decade and shows no signs of relinquishing it anytime soon. Vendor marketing, conference talks, and job descriptions have collectively created an impression that real-time data processing is the natural destination of every mature data platform — that batch is a temporary inconvenience on the way to streaming, and that any organization not processing data in real time is leaving competitive advantage on the table.

This impression is wrong. Not because real-time data is without value — it is enormously valuable in the right contexts — but because the right contexts are far narrower than the industry narrative suggests, and the costs of getting there are far higher than the marketing materials acknowledge.

What Real-Time Actually Means

The first problem with the real-time conversation is that nobody agrees on what real-time means. The term gets applied to everything from sub-millisecond latency in high-frequency trading systems to dashboards that refresh every fifteen minutes. These are not the same thing. They have different technical requirements, different cost profiles, and different use cases.

It is more useful to think in terms of a latency spectrum. At one end is true real-time — millisecond to second latency — which requires purpose-built streaming infrastructure, significant engineering investment, and is appropriate for a small set of genuinely latency-sensitive use cases. In the middle is near real-time — latency measured in minutes — which is achievable with micro-batch processing and is appropriate for a broader set of use cases where freshness matters but milliseconds do not. At the other end is batch — latency measured in hours or days — which is the right choice for the majority of analytical workloads where the freshness requirement is modest and the priority is correctness and cost efficiency.

When a business leader says they want real-time data, they usually mean they want fresher data than they currently have. That is a very different requirement from true real-time, and conflating the two leads to expensive infrastructure decisions made in pursuit of a latency requirement that was never actually specified.

The Cases Where Real-Time Is Genuinely Worth It

To be fair to the technology, there are use cases where real-time data processing is not just nice to have but essential to the value proposition. These cases exist, they are important, and they are worth understanding clearly — precisely so you can recognize when your situation is and is not one of them.

Fraud detection is the canonical example. A payment fraud model that evaluates transactions in real time — within the window of the payment authorization — can prevent fraudulent transactions rather than merely detecting them after the fact. The business value of prevention versus detection is enormous, and no amount of batch processing sophistication can replicate it. The latency requirement is real, it is measurable, and it is directly tied to business outcomes.

Personalization at scale is another legitimate use case. Recommendation systems that update user models in real time based on in-session behavior — what a user is looking at right now, not what they looked at yesterday — can meaningfully improve engagement and conversion rates for platforms with sufficient scale. Netflix updating its recommendations based on what you watched in the last ten minutes is a real-time data problem with real business impact.

Operational monitoring and alerting sits firmly in the real-time category. Infrastructure monitoring, application performance monitoring, and anomaly detection systems all require low-latency data to be useful. An alert about a service degradation that fires three hours after the degradation began is not an alert — it is a post-mortem data point.

These use cases share a common characteristic: the value of the data is explicitly time-dependent in a way that is directly tied to business outcomes. The question is not “would fresher data be better” — fresher data is almost always better — but “is the business value of fresher data sufficient to justify the cost of producing it.”

The Cases Where Real-Time Is Not Worth It

The more common situation — and the one the industry narrative systematically underrepresents — is that real-time data processing gets adopted for use cases that do not require it, at costs the business value delivered cannot justify.

Executive dashboards are perhaps the most frequent culprit. A significant amount of streaming infrastructure has been built to power dashboards that senior leaders look at once a day, in morning meetings, after their first cup of coffee. The data on those dashboards could be twelve hours old and it would make no practical difference to any decision made on the basis of it. The streaming pipeline feeding those dashboards is impressive engineering in service of a non-requirement.

Marketing analytics is another area where the real-time impulse frequently outruns the genuine need. Campaign performance metrics updated in real time sound powerful. In practice, the decisions those metrics inform — budget allocation, creative optimization, audience targeting — are made on daily or weekly cycles by people who could not act on minute-by-minute data even if they had it. The latency of the decision-making process is the actual constraint, not the latency of the data.

Financial reporting is a case where real-time data can actively create problems. Month-end revenue figures, for instance, are subject to adjustments, reconciliations, and accounting treatments that happen after the transactions occur. Real-time revenue dashboards frequently show numbers that do not match what finance will ultimately report, creating confusion and undermining trust in data rather than building it. Batch processing with appropriate cutoff logic often produces more trustworthy financial data than real-time pipelines that surface unreconciled figures.
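To make the cutoff idea concrete, here is a minimal sketch of batch cutoff logic in Python with pandas. The column names (booked_at, reconciled, amount) are hypothetical stand-ins for whatever a real finance schema uses, not a prescribed model.

```python
from datetime import date

import pandas as pd


def month_end_revenue(transactions: pd.DataFrame, cutoff: date) -> float:
    """Sum revenue for a reporting period using an explicit cutoff.

    Only rows booked on or before the cutoff and already reconciled are
    counted, so the figure tracks what finance will ultimately report
    rather than whatever has streamed in so far. Column names here are
    illustrative, not a prescribed schema.
    """
    eligible = transactions[
        (transactions["booked_at"] <= pd.Timestamp(cutoff))
        & (transactions["reconciled"])
    ]
    return float(eligible["amount"].sum())
```

A real-time dashboard shows every transaction the moment it lands; a batch job like this one can wait for the reconciliation flag before counting anything, which is exactly why its number is the one finance will recognize.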

The True Cost of Real-Time

The cost of real-time data infrastructure is systematically underestimated, and this underestimation is one of the main drivers of the hype cycle. Vendors selling streaming platforms have an obvious interest in making real-time sound accessible. The engineer proposing the streaming architecture is excited about the technical challenge. Neither party has a strong incentive to give the business a complete picture of what it is committing to.

The infrastructure cost is the most visible component. Streaming platforms — Kafka clusters, Flink jobs, managed Kinesis or Pub/Sub deployments — are more expensive to run than batch equivalents because they are always on. A batch pipeline consumes compute for the duration of its run and then goes idle. A streaming pipeline consumes compute continuously, twenty-four hours a day, whether data is flowing or not. At small data volumes this is manageable. At scale it becomes a dominant line item.
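A back-of-the-envelope comparison illustrates the always-on point. The rate and durations below are invented purely for illustration; only the shape of the gap matters.

```python
# Hypothetical numbers, chosen only to illustrate the shape of the cost gap.
hourly_rate = 0.50            # cost per compute-hour for one node (made up)
streaming_hours = 24 * 30     # an always-on streaming cluster, one month
batch_hours = 2 * 30          # a two-hour batch run, once per day, one month

print(f"streaming: ${streaming_hours * hourly_rate:,.2f} per node per month")  # 360.00
print(f"batch:     ${batch_hours * hourly_rate:,.2f} per node per month")      # 30.00
```

The ratio, not the absolute figures, is the point: an always-on pipeline pays for every idle hour, and that multiplier follows the system as data volume and node count grow.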

The engineering cost is less visible but often larger. Streaming systems are genuinely more difficult to build, test, and operate than batch systems. Exactly-once processing semantics — ensuring that every event is processed exactly once, with no duplicates and no gaps — are notoriously hard to achieve and reason about. Stateful stream processing introduces failure modes that do not exist in stateless batch processing. Reprocessing historical data in a streaming system requires replaying event logs, which is more operationally complex than rerunning a SQL query. The engineers who can do this well are less common and more expensive than batch engineers.
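As a rough illustration of why exactly-once is hard, the usual approach in practice is at-least-once delivery combined with an idempotent consumer that deduplicates on a stable event ID. The sketch below is a minimal Python version of that idea; the in-memory set and the apply_side_effects helper are hypothetical stand-ins for a durable store and real business logic.

```python
processed_ids: set[str] = set()   # stand-in for a durable deduplication store


def apply_side_effects(event: dict) -> None:
    """Placeholder for the real work: update an aggregate, write a row, etc."""
    ...


def handle_event(event: dict) -> None:
    """Process an event so that redelivery does not double-count it.

    Streaming systems typically guarantee at-least-once delivery; the
    exactly-once effect comes from skipping events already seen.
    """
    event_id = event["id"]
    if event_id in processed_ids:
        return                       # duplicate delivery: skip the side effects
    apply_side_effects(event)
    processed_ids.add(event_id)      # recorded only after the effect succeeds
```

Even this toy version shows where the difficulty lives: if the process crashes between the side effect and the dedup record, the event is redelivered and the effect repeats, which is why production systems need the two to commit atomically — precisely the part that is hard to build and reason about.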

The operational cost compounds over time. Streaming pipelines require continuous monitoring. A batch job that fails does so loudly and can be rerun. A streaming pipeline that develops a subtle bug may process data incorrectly for hours before anyone notices. Consumer lag — the gap between where events are being produced and where they are being consumed — requires constant monitoring and can indicate problems that are not obvious from the output alone. The operational surface area of a streaming system is simply larger than that of a batch equivalent.
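Consumer lag itself is straightforward to measure. As one illustration, a minimal sketch using the kafka-python client (an assumed choice of client, not one the text prescribes) compares each partition's committed offset against its end offset.

```python
from kafka import KafkaConsumer, TopicPartition


def consumer_lag(bootstrap_servers: str, group_id: str, topic: str) -> dict[int, int]:
    """Return per-partition lag for one consumer group on one topic."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers, group_id=group_id)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)   # where producers currently are
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp) or 0      # where this group has read to
        lag[tp.partition] = end_offsets[tp] - committed
    consumer.close()
    return lag
```

Lag that grows from one check to the next means the consumers are falling behind production, which is exactly the kind of signal that needs continuous watching rather than a daily glance.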

The Near-Real-Time Middle Ground

For teams that genuinely need fresher data than traditional batch provides but do not have use cases that justify true streaming infrastructure, the near-real-time middle ground deserves serious consideration.

Micro-batch processing — running batch pipelines on a fifteen-minute or hourly cadence rather than daily — provides dramatically fresher data at a fraction of the complexity of streaming. Tools like dbt, BigQuery scheduled queries, and Snowflake tasks support high-frequency batch scheduling without requiring streaming infrastructure. For dashboards that need to be fresh throughout the business day, hourly batch is often entirely sufficient.
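As one way to picture the pattern, here is a sketch of an hourly micro-batch expressed as an Airflow DAG that runs a dbt build. Airflow (2.4 or later) and the project path are assumptions made for the example; the tools named above achieve the same cadence with their own scheduling features.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical hourly micro-batch: rebuild the models every hour instead of
# maintaining a streaming pipeline. The dbt project path is a placeholder.
with DAG(
    dag_id="hourly_micro_batch",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",   # fresh enough for intraday dashboards
    catchup=False,
) as dag:
    run_models = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt/analytics",
    )
```

Everything here is ordinary batch tooling on a faster clock: no event log, no stateful stream processing, and a failure is handled by rerunning the hour.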

Change data capture tools like Fivetran and Airbyte also offer high-frequency sync options — some approaching five-minute latency — that provide near-real-time freshness for operational data without requiring custom streaming pipelines. The data is not technically streaming, but for most business use cases the distinction is irrelevant.

The honest question for any team considering real-time infrastructure is whether the specific use cases they are trying to serve actually require latency below what high-frequency batch can provide. In most cases the answer is no, and micro-batch is almost always the right choice; where the answer is genuinely yes, the streaming investment is justified.

Asking the Right Question

The real-time conversation in data engineering is usually framed around capability: can we do real-time? The more useful question is different: does our business make decisions that require real-time data, and are those decisions important enough to justify the cost of producing it?

Answering that question honestly requires talking to the people who will consume the data. It requires understanding not just what data they want but how they actually use it — how often they look at dashboards, how quickly they can act on new information, and what decisions they make on what time horizons. Those conversations frequently reveal that the appetite for real-time data is more modest than the executive mandate suggested, and that the genuine need is for data that is fresher and more reliable than what currently exists — not necessarily data that arrives in milliseconds.

Real-time data is a powerful capability. It is also one of the most expensive and complex capabilities in the modern data stack. The teams that deploy it wisely — in genuine service of business outcomes that require it — get enormous value from it. The teams that deploy it because it seems like the right direction are building impressive infrastructure on top of a requirement that was never clearly established.

The hype is real. The use cases are narrower than the hype suggests. Know which one you are actually in.
