
Circuit Breaker

Protecting downstream calls with a state machine

Key Takeaways

  • A circuit breaker has three states: closed (passing traffic), open (rejecting), and half-open (probing)
  • After N consecutive failures, open the circuit to fail fast instead of waiting on a dead backend
  • The half-open state probes with a single request — close on success, re-open on failure
  • In half-open state, only one probe request should reach the backend; reject all others immediately

What is the Circuit Breaker Pattern?

A circuit breaker wraps calls to a downstream service and monitors for failures. When failures reach a threshold, the circuit "opens" and subsequent calls fail immediately without contacting the downstream — giving it time to recover. After a cooldown period, the circuit allows a single probe request through to test if the service has recovered.

The name comes from electrical circuit breakers, which trip to prevent damage when current exceeds safe levels. The software version prevents cascading failures when a dependency is unhealthy.

Why It Matters

Without a circuit breaker, when a downstream service goes down, every request to your service:

  • Waits for a timeout — tying up connections and threads
  • Accumulates latency — your P99 spikes as requests queue up behind dead calls
  • Cascades upstream — your callers start timing out too, spreading the failure

At scale, a single unhealthy dependency can take down an entire service mesh. Netflix famously built Hystrix after experiencing exactly this: one slow backend caused cascading failures across dozens of microservices.

A circuit breaker fails fast — returning an error in milliseconds instead of waiting seconds for a timeout. This preserves your service's capacity for requests that can actually succeed.

How It Works

The Three States

Closed (normal operation):

  • All requests are forwarded to the downstream
  • Failures are counted (consecutive failures, not total)
  • When consecutive failures reach the threshold (e.g., 3), transition to Open
  • A successful response resets the failure counter
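The closed-state bookkeeping above can be sketched in a few lines of JavaScript. This is a minimal illustration, not a prescribed implementation — the names (`failureThreshold`, `consecutiveFailures`, `recordSuccess`, `recordFailure`) are assumptions for the example:

```javascript
// Closed-state bookkeeping: count *consecutive* failures, reset on any success.
// All names here are illustrative.
const failureThreshold = 3;
let state = "CLOSED";
let consecutiveFailures = 0;

function recordSuccess() {
  consecutiveFailures = 0; // a single success resets the streak
}

function recordFailure() {
  consecutiveFailures += 1;
  if (consecutiveFailures >= failureThreshold) {
    state = "OPEN"; // trip the breaker after N failures in a row
  }
}
```

Because the counter resets on success, intermittent failures never trip the breaker; only an unbroken streak does.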

Open (failing fast):

  • All requests are rejected immediately with a 503 (Service Unavailable)
  • No calls are made to the downstream
  • A cooldown timer starts (e.g., 5 seconds)
  • After the cooldown expires, transition to Half-Open
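A sketch of the open-state check, assuming illustrative names (`COOLDOWN_MS`, `openedAt`) and a 5-second cooldown:

```javascript
// Open-state fast fail: reject immediately until the cooldown expires,
// then move to half-open so a probe can go through. Names are illustrative.
const COOLDOWN_MS = 5000;
let state = "OPEN";
let openedAt = Date.now();

function checkOpen(now = Date.now()) {
  if (state !== "OPEN") return "pass";
  if (now - openedAt >= COOLDOWN_MS) {
    state = "HALF_OPEN"; // cooldown over: allow one probe through
    return "probe";
  }
  return "reject";       // fail fast with 503, no call to the backend
}
```

Passing `now` as a parameter keeps the transition easy to test without real waiting.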

Half-Open (probing):

  • Allow one probe request through to the downstream
  • If it succeeds: transition to Closed (service recovered)
  • If it fails: transition back to Open (restart the cooldown)
  • While the probe is in-flight, reject all other requests (same as Open)
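The probe-outcome handling can be sketched as follows (variable names are assumptions for the example):

```javascript
// Half-open probe outcome: close on success; re-open on failure and
// restart the cooldown from *now*, not from the original trip time.
let state = "HALF_OPEN";
let openedAt = 0;

function onProbeResult(ok, now = Date.now()) {
  if (ok) {
    state = "CLOSED";  // service recovered
  } else {
    state = "OPEN";    // still unhealthy
    openedAt = now;    // restart the cooldown timer
  }
}
```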

The Concurrency Problem in Half-Open

When the circuit transitions to half-open, multiple requests may arrive simultaneously. Only one should be the probe — the rest must be rejected. Since Node.js is single-threaded, a synchronous flag check before the first await is sufficient:

```javascript
if (halfOpenProbeInFlight) return reject();
halfOpenProbeInFlight = true;
// now await the backend call
```

Common Mistakes

  • Not resetting failures on success: The failure count should track *consecutive* failures. A single success in the closed state resets the counter to zero.
  • Not restarting the cooldown on re-open: When a half-open probe fails and the circuit re-opens, the cooldown timer should restart from the current time, not continue from the original open time.
  • Allowing multiple probes in half-open: Without a guard flag, concurrent requests in half-open state all become probes, defeating the purpose of the single-request test.
  • Using total failures instead of consecutive: A service that fails 3 out of 100 requests is probably fine. The circuit should only open when failures are *consecutive*, indicating a systemic issue.
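Putting the three states together, a minimal breaker might look like the sketch below. Everything here is an assumption for illustration — the class name, option names, thresholds, and the thrown "circuit open" error (which a server would map to a 503) are not a prescribed API:

```javascript
// Minimal circuit breaker sketch combining the three states.
// Threshold, cooldown, and error messages are illustrative choices.
class CircuitBreaker {
  constructor(callFn, { threshold = 3, cooldownMs = 5000 } = {}) {
    this.callFn = callFn;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.state = "CLOSED";
    this.failures = 0;          // consecutive failures only
    this.openedAt = 0;
    this.probeInFlight = false; // single-probe guard for half-open
  }

  async call(...args) {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open");  // fail fast (maps to 503)
      }
      this.state = "HALF_OPEN";           // cooldown expired: try a probe
    }
    if (this.state === "HALF_OPEN") {
      if (this.probeInFlight) throw new Error("circuit open"); // one probe only
      this.probeInFlight = true;
      try {
        const result = await this.callFn(...args);
        this.state = "CLOSED";            // probe succeeded: recovered
        this.failures = 0;
        return result;
      } catch (err) {
        this.state = "OPEN";
        this.openedAt = Date.now();       // restart the cooldown
        throw err;
      } finally {
        this.probeInFlight = false;
      }
    }
    // CLOSED: forward the call and count consecutive failures
    try {
      const result = await this.callFn(...args);
      this.failures = 0;                  // success resets the streak
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Note that the `probeInFlight` check runs synchronously before any `await`, so on Node's single-threaded event loop at most one probe can reach the backend.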

Practice This Concept

Apply what you've learned by solving this challenge.
