Building Resilient Systems With Circuit Breakers and Retry Patterns

architecture · engineering

The first time I watched a cascading failure take down a production system, it was almost beautiful in its destructiveness. One service slowed down. Its callers started timing out. Their callers started queuing. Within minutes, every service in the ecosystem was unhealthy — all because one downstream dependency was having a bad day.

That experience taught me the most important lesson in distributed systems: design for failure, not for success.

The Cascade Problem

In a monolith, a slow database query slows down one request. In a distributed system, a slow dependency can consume all the threads, connections, and memory in every upstream service. The failure amplifies as it propagates.

The root cause is usually the same: services that wait indefinitely for responses from dependencies that aren't going to respond. Without protection, well-intentioned retries make things worse — turning a partial failure into a total failure.

Circuit Breakers

The circuit breaker pattern, borrowed from electrical engineering, is the single most impactful resilience pattern I've implemented.

The concept is simple: wrap calls to external dependencies in a circuit breaker that monitors failure rates. When failures exceed a threshold, the circuit "opens" — subsequent calls fail immediately without attempting the call. After a cool-down period, the circuit enters a "half-open" state, allowing a limited number of test calls. If they succeed, the circuit closes. If they fail, it opens again.

In practice, this means:

Fast failure over slow failure. When a dependency is down, callers get an immediate error instead of waiting for a timeout. This preserves their resources for requests they can actually handle.

Automatic recovery. When the dependency recovers, the circuit breaker detects it and resumes normal traffic — no manual intervention needed.

Visibility. Circuit breaker state changes are excellent alerting signals. A circuit opening tells you exactly which dependency is struggling.
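The state machine above fits in a few dozen lines. Here's a minimal sketch (all names hypothetical, not tied to any particular library; an injectable clock makes it testable):

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 half_open_max_calls=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = 0.0
        self.half_open_calls = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                # Cool-down elapsed: allow a limited number of probe calls.
                self.state = "half-open"
                self.half_open_calls = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        if self.state == "half-open" and self.half_open_calls >= self.half_open_max_calls:
            raise RuntimeError("circuit half-open: probe limit reached")
        try:
            if self.state == "half-open":
                self.half_open_calls += 1
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = "closed"

    def _on_failure(self):
        self.failure_count += 1
        # A failed probe in half-open re-opens the circuit immediately.
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failure_count = 0
```

Production implementations (resilience4j, Polly, and friends) track failure *rates* over sliding windows rather than a raw consecutive-failure count, but the state transitions are the same.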

Retry Strategies

Retries are essential — network blips, temporary overloads, and transient errors are facts of life. But naive retries can amplify problems rather than solve them.

Exponential backoff. Don't retry immediately. Wait 100ms, then 200ms, then 400ms. Give the failing service time to recover instead of hammering it.

Jitter. Add randomness to retry intervals. Without jitter, all clients retry at the same moment, creating a "thundering herd" that overwhelms the recovering service.

Retry budgets. Limit the total number of retries across all clients. If 30% of requests are being retried, something is seriously wrong and more retries will only make it worse.

Idempotency. Only retry operations that are safe to retry. If the operation isn't idempotent, retrying might create duplicate transactions, double charges, or corrupted data.
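The backoff-and-jitter bullets above can be sketched as a small helper (names and defaults are illustrative; "full jitter" here means sleeping a random amount up to the exponential cap):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry fn on transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # budget exhausted; surface the real error
            # 100ms, 200ms, 400ms... capped, then randomized to avoid a herd.
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep(random.uniform(0, cap))
```

Note the `retryable` allowlist: only errors known to be transient are retried, which is the code-level expression of the idempotency rule. An unexpected `ValueError` propagates immediately.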

Timeouts

Every external call should have a timeout. This is non-negotiable. Without timeouts, a single hung connection can consume resources indefinitely.

My approach to timeouts:

Connection timeout — how long to wait for a TCP connection. Keep this short (1-3 seconds). If the server isn't accepting connections, waiting longer won't help.

Read timeout — how long to wait for a response. Set based on the expected response time of the operation, with reasonable headroom.

Overall timeout — the total time budget for the operation, including retries. This prevents retry loops from running indefinitely.
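The overall timeout is the one most often missed, because it lives above the HTTP client. A sketch of the idea (the `Deadline` class and `call_with_deadline` helper are hypothetical names, not a real API): each individual call gets the smaller of its own read timeout and whatever remains of the total budget.

```python
import time

class Deadline:
    """Tracks the overall time budget for an operation, including retries."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + total_seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def expired(self):
        return self.remaining() == 0.0

def call_with_deadline(fn, deadline, per_call_timeout=2.0):
    """Invoke fn with a timeout clamped to the remaining overall budget."""
    if deadline.expired():
        raise TimeoutError("overall budget exhausted")
    return fn(timeout=min(per_call_timeout, deadline.remaining()))
```

A retry loop wrapped this way can never run past the total budget: the per-call timeout shrinks as the deadline approaches, and once it hits zero the loop stops attempting calls at all.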

Bulkheads

Named after ship compartments that prevent a hull breach from sinking the entire vessel, bulkheads isolate failures within a system.

In practice, this means separate connection pools, thread pools, and resource limits for each external dependency. If your payment processor is slow, it shouldn't consume the connections you need for your eligibility checks.

I typically implement this with separate HTTP clients per dependency, each with its own connection pool, timeout configuration, and circuit breaker.
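A thread-based sketch of the idea, using a semaphore as the compartment wall (the `Bulkhead` class and dependency names are illustrative; real implementations usually isolate connection pools too, not just concurrency):

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so it can't starve the others."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast rather than queue: a full bulkhead means the dependency
        # is already saturated, and waiting would tie up this thread too.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError(f"bulkhead {self.name} full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()

# One compartment per dependency, sized to that dependency's capacity.
bulkheads = {
    "payments": Bulkhead("payments", max_concurrent=10),
    "eligibility": Bulkhead("eligibility", max_concurrent=20),
}
```

The key design choice is the non-blocking acquire: a slow payment processor fills its own compartment and starts rejecting, while eligibility traffic flows untouched.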

Putting It Together

The resilience stack I deploy for critical services:

  1. Timeouts — set per-dependency, non-negotiable
  2. Retries — exponential backoff with jitter, capped at 3 attempts
  3. Circuit breakers — per-dependency, with alerting on state changes
  4. Bulkheads — isolated connection pools per dependency
  5. Fallbacks — graceful degradation when a dependency is unavailable
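Layer 5 is the one with no library to reach for, because the fallback is domain-specific. A sketch of the shape (all function names and the cached-response payload are hypothetical): wrap the protected call so that when the whole stack below it gives up, the caller still gets a degraded-but-useful answer.

```python
def with_fallback(primary, fallback, retryable=Exception):
    """Graceful degradation: if the protected call fails, serve the fallback."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except retryable:
            return fallback(*args, **kwargs)
    return guarded

# e.g. serve cached eligibility data, marked stale, when the live check fails
def live_check(member_id):
    raise ConnectionError("eligibility service unreachable")

def cached_check(member_id):
    return {"member_id": member_id, "eligible": True, "stale": True}

check = with_fallback(live_check, cached_check, retryable=ConnectionError)
```

The `stale` flag matters: downstream consumers should know they're looking at degraded data, so they can decide whether it's good enough for their use case.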

This isn't over-engineering. In the systems I build — healthcare eligibility checks, financial transaction processing — a cascading failure doesn't just cause a bad user experience. It causes real harm. These patterns are the difference between "service degraded" and "service down."

Build for failure. Your 3 AM self will thank you.