No new features were shipping. No customers were onboarding. EventBridge fired maybe once a week—sometimes less—triggering a Lambda that published messages into three SQS queues. Two services would wake up, do their work, and go quiet. A few minutes later, a third service would pick up its delayed message and finish the sequence.

CloudWatch graphs were flat in the reassuring way that usually means nothing is wrong.

Operationally and from a business point of view, nothing meaningful was happening.

And yet, the AWS bill quietly crossed $850.

The number itself wasn’t alarming. What mattered was what it represented. This wasn’t a spike, or a one-off incident, or the cost of a mistake. This was the steady state.

Left alone, the system would drift into a five-figure annual cost while processing almost no data—simply by staying alive.

The architecture of waiting

When we finally traced it, the architecture was unremarkable.

Three ECS services, each inside a VPC, each polling its own SQS queue. EventBridge kicked off the sequence. Lambda published the messages. Two queues received work immediately. The third relied on a delivery delay (an SQS message timer) so earlier stages could finish—coordination through time rather than state.
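Concretely, the fan-out looked something like this. This is a minimal sketch, assuming boto3-style `send_message` arguments; the queue names and payload shape are invented for illustration. SQS's `DelaySeconds` (a message timer, capped at 900 seconds) is what kept the third message invisible until the earlier stages had time to finish:

```python
import json

def build_fanout(trigger_id, delay_seconds=300):
    """Return (queue, kwargs) pairs for boto3's sqs.send_message().

    Two stages get their message immediately; the third is sequenced
    by time alone: DelaySeconds keeps it invisible (max 900 s) so the
    earlier stages can finish first. Queue names are hypothetical.
    """
    payload = json.dumps({"trigger": trigger_id})
    return [
        ("stage-a", {"MessageBody": payload}),
        ("stage-b", {"MessageBody": payload}),
        ("stage-c", {"MessageBody": payload, "DelaySeconds": delay_seconds}),
    ]

# Inside the Lambda handler, each pair would become roughly:
#   sqs.send_message(QueueUrl=queue_urls[queue], **kwargs)
```

Note the 900-second cap: this pattern only works at all when upstream stage runtimes are tightly bounded, which is part of why it felt safe at low volume.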

In effect, the system coordinated by waiting instead of knowing.

We had chosen this design deliberately.

It was simple. Familiar. The volume was low enough that delay-based sequencing felt reasonable—no orchestrator, no dependency graph, no new operational surface area. The services stayed up because starting them on demand would have taken longer than the work itself. Polling kept latency predictable. The queues stayed thin.

Every individual decision made sense.

What we missed was the cost of staying alert.

Three services, each checking an empty queue every few seconds. Every minute. Every hour. For weeks at a time. All of it flowing through a NAT Gateway.

The messages we cared about crossed the network maybe once a week. The polling requests crossed it millions of times.
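The arithmetic is easy to sketch. With assumed figures—one poll per second per service, roughly 1.5 KB per empty request/response round trip, and illustrative us-east-1-style rates for NAT Gateway and SQS (none of these numbers come from the actual bill)—the polling alone adds up:

```python
# Back-of-envelope: what constant polling costs through a NAT Gateway.
# Every rate and size below is an assumption for illustration.

SERVICES = 3
POLLS_PER_SECOND = 1           # "every few seconds," rounded up, per service
SECONDS_PER_MONTH = 30 * 24 * 3600

polls_per_month = SERVICES * POLLS_PER_SECOND * SECONDS_PER_MONTH
# ~7.8 million requests a month, even when every response is empty.

BYTES_PER_POLL = 1_500         # assumed request + empty-response overhead
NAT_PER_GB = 0.045             # NAT data-processing rate, illustrative
NAT_PER_HOUR = 0.045           # NAT hourly charge, illustrative
SQS_PER_MILLION = 0.40         # SQS request pricing beyond the free tier

gb = polls_per_month * BYTES_PER_POLL / 1e9
monthly = (
    gb * NAT_PER_GB
    + 24 * 30 * NAT_PER_HOUR
    + polls_per_month / 1e6 * SQS_PER_MILLION
)
print(f"{polls_per_month:,} polls, {gb:.1f} GB, roughly ${monthly:.2f}/month")
```

Under these assumptions the dominant term isn't even the data: it's the NAT Gateway's hourly charge for simply existing, plus millions of per-request fees for asking a question whose answer was almost always no.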

We weren’t paying for work. We were paying to ask whether there was any work.

The cost of responsiveness

The cost wasn’t in compute, or queue operations, or even the data transfer of the messages themselves. It accumulated in the act of waiting—of remaining responsive in a system where responsiveness was rarely needed.

This kind of failure doesn’t come from misconfiguration in the usual sense. Nothing was “wrong” enough to break. The services behaved exactly as designed. Messages were picked up within seconds when they arrived. The system was responsive, resilient, and operationally clean.

What it wasn’t was economically aligned with how rarely the system was actually used.

Low-frequency systems expose this gap especially well. When something happens once a week, steady-state cost dominates event cost. The work becomes incidental compared to the infrastructure required to remain ready for it.
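One way to see the gap is to put rough numbers on both sides. These figures are illustrative—small always-on Fargate tasks, approximate us-east-1 pricing, a generously rounded marginal cost per run—not the numbers from the actual bill. The shape of the ratio is the point:

```python
# Steady-state cost vs. event cost, with assumed numbers.
# Three always-on Fargate tasks (0.25 vCPU / 0.5 GB) plus a NAT Gateway,
# against the marginal cost of one weekly pipeline run.

HOURS_PER_MONTH = 24 * 30

# Approximate Fargate rates (per vCPU-hour, per GB-hour); verify current pricing.
fargate_task_hourly = 0.25 * 0.04048 + 0.5 * 0.004445
steady_state = (3 * fargate_task_hourly + 0.045) * HOURS_PER_MONTH

# Marginal cost of the actual work: minutes of compute and a handful of
# messages, rounded well upward.
per_event = 0.05
event_cost = per_event * 4.3   # ~4.3 weeks in a month

print(f"steady state: ${steady_state:.0f}/mo, events: ${event_cost:.2f}/mo")
print(f"ratio: {steady_state / event_cost:.0f}:1")
```

Even with deliberately small task sizes, readiness outweighs work by two orders of magnitude; scale the tasks up toward what production actually ran and the gap only widens.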

The system optimizes for availability. The business optimizes for absence.

This shape is especially common in production geospatial data engineering. We design for rare but heavy events—seasonal satellite imagery drops, quarterly land cover updates, one-off boundary backfills. Everything else is waiting. Waiting feels cheap conceptually. In production, it often isn’t.

Why quiet systems hide expensive problems

What makes this failure mode hard to detect is that it hides behind correctness. Queue depths stay near zero. CPU stays low. There’s no error rate to chase. Even cost dashboards smooth it out, turning a constant leak into background noise.

Nothing crosses a threshold sharply enough to demand intervention.

It also resists ownership. No single team “caused” the bill. Keeping services running was reasonable at the expected scale. The polling interval matched the latency requirement. VPC isolation fit the security posture. Each choice was defensible in isolation.

The cost only appeared when those reasonable choices interacted over time—inside a system nobody revisited because it never broke.

This is the same shape I’ve seen in silent geometry errors and observability drift. The system passes validation. The numbers look plausible. The failure isn’t a crash, but a misalignment—between intent and effect, between what the system is doing and what anyone believes it is doing.

The danger of calmness

The danger isn’t complexity. It’s calmness.

Noisy systems attract scrutiny. Quiet systems earn trust. Cost, like geometry, can decay while everything still “works.”

We built a system optimized for responsiveness. It stayed ready, stayed healthy, stayed correct.

It simply had no way to notice that being idle was expensive.

And nothing about idleness shows up in service metrics.


If you’ve designed event-driven systems, you know this tension. You optimize for the work that matters, but you pay for the time between. The question isn’t whether to stay ready; it’s whether the way you’ve chosen to stay ready matches how your system is actually used.