Health Checks

Letting infrastructure know when your service is ready

Key Takeaways

  • Liveness probes (/healthz) check if the process is alive — readiness probes (/readyz) check if it can serve traffic
  • Readiness checks must verify actual dependency connectivity, not just return 200
  • Run dependency checks in parallel with timeouts so one slow dependency does not block the entire check
  • Return structured JSON with per-dependency status so operators can quickly identify which dependency failed

What are Health Checks?

Health checks are HTTP endpoints that report whether your service is functioning correctly. Infrastructure tools (load balancers, orchestrators, monitoring systems) poll these endpoints to make automated decisions about routing traffic, restarting containers, and triggering alerts.

There are two distinct types:

  • Liveness (/healthz): "Is the process alive and not deadlocked?" — if this fails, the process should be restarted
  • Readiness (/readyz): "Can this instance handle requests right now?" — if this fails, stop sending traffic but don't restart

Why It Matters

A health check that always returns 200 is worse than no health check at all — it tells the infrastructure everything is fine when it's not. Your database could be down, your Redis connection could be broken, and the health check keeps saying "ok." Traffic keeps flowing to a broken instance, and users see errors.

Good health checks are the foundation of self-healing infrastructure. They let Kubernetes automatically restart stuck pods, load balancers route around failed instances, and monitoring systems alert before users notice.

How It Works

Liveness Probe

Simple — just confirm the HTTP server is responding:

GET /healthz → 200 { "status": "ok" }

If the event loop is blocked or the process is deadlocked, this request will time out. Once enough consecutive probes have failed, the orchestrator will restart the container.
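
A minimal sketch of such an endpoint, assuming a Node.js service that uses the built-in http module. The names handleHealthz and createServer and the port 8080 are illustrative choices, not part of any framework. The handler does no dependency work on purpose, so it answers whenever the event loop can still run a callback.

```javascript
const http = require("node:http");

// Liveness handler: no database or cache calls, only proof that the
// process can still accept a request and write a response.
function handleHealthz(req, res) {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ status: "ok" }));
}

// Hypothetical wiring: route GET /healthz to the handler, 404 otherwise.
function createServer() {
  return http.createServer((req, res) => {
    if (req.method === "GET" && req.url === "/healthz") {
      return handleHealthz(req, res);
    }
    res.writeHead(404);
    res.end();
  });
}

// Start it with, for example (the port is an assumption):
// createServer().listen(8080);
```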

Readiness Probe

More sophisticated — verify each dependency is actually reachable:

  • Check PostgreSQL: run SELECT 1
  • Check Redis: send PING, expect PONG
  • Check any other dependencies

Return 200 only if ALL dependencies are healthy. Return 503 with details if any fail.
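
The 200-or-503 rule can be sketched as a small pure function. The field names used here (statusCode, status, checks, error) are assumptions chosen for illustration, not a standard response schema.

```javascript
// Build the readiness response from per-dependency results.
// `checks` maps a dependency name to { status: "ok" } or
// { status: "fail", error: "<reason>" }.
function buildReadinessResponse(checks) {
  const allHealthy = Object.values(checks).every((c) => c.status === "ok");
  return {
    statusCode: allHealthy ? 200 : 503,
    body: {
      status: allHealthy ? "ok" : "unavailable",
      checks, // per-dependency detail so operators can see what failed
    },
  };
}

// Example: PostgreSQL is reachable, Redis is not.
const response = buildReadinessResponse({
  postgres: { status: "ok" },
  redis: { status: "fail", error: "connection timed out" },
});
console.log(response.statusCode); // 503
```

Keeping this step free of I/O makes the status logic easy to unit test separately from the code that actually talks to the dependencies.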

Key Design Principles

  • Run checks in parallel: Use Promise.allSettled() so a slow database check doesn't block the Redis check
  • Use timeouts: A health check that takes 30 seconds to fail is useless. Set 2-second connection timeouts.
  • Return structured data: Include a checks object showing the status of each dependency, so operators can immediately see what's broken
  • Use short-lived connections: Don't reuse the main connection pool for health checks — you want to verify you *can* connect, not just that an existing connection works
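
The first two principles can be sketched together. This is one possible shape, not a definitive implementation: withTimeout and runChecks are hypothetical helpers, and each entry in checks stands in for a real client call such as SELECT 1 or PING. Note that the wrapper only stops waiting for a slow check; it does not cancel the underlying operation, so a connection timeout on the client itself is still needed.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} timed out after ${ms} ms`)),
      ms
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every check concurrently. `checks` maps a dependency name to an
// async function that resolves when healthy and rejects otherwise.
async function runChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs, name)
    )
  );
  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] =
      outcome.status === "fulfilled"
        ? { status: "ok" }
        : { status: "fail", error: String(outcome.reason?.message ?? outcome.reason) };
  });
  return results;
}
```

Because Promise.allSettled() never rejects, a failing or slow dependency shows up as one entry in the result rather than aborting the whole check, and the total time is bounded by the timeout rather than by the sum of the individual checks.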

Common Mistakes

  • Always returning 200: The health check should actually test dependencies, not just respond
  • Sequential checks: Checking Postgres, then Redis, then another service one after another makes the total check time the sum of all check times; run in parallel, it is roughly the time of the slowest check
  • No timeouts: A health check blocked on a hanging database connection can cause cascading failures

SystemTrials — Backend Engineering Practice That Tests Like Production