Retry-Safe Workflows

Building multi-step operations that survive failures

Key Takeaways

  • Track each step's completion state in the database — check before executing so retries skip completed steps
  • Use database-level locking (advisory locks or SELECT FOR UPDATE) to prevent concurrent retries from running the same step simultaneously
  • Store workflow state in the database, not in memory — this ensures retry safety across server restarts
  • Make each step independently idempotent: re-executing a step should produce the same result without repeating its side effects
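
As a hedged illustration of the last takeaway, the sketch below makes a charge step idempotent by deriving an idempotency key from the workflow ID. `FakePaymentGateway`, its `charge` method, and `charge_payment_step` are hypothetical names invented for this example. Real payment providers often accept a similar key, but the exact mechanism depends on the provider.

```python
class FakePaymentGateway:
    """Hypothetical in-memory stand-in for a payment provider."""

    def __init__(self):
        self._charges = {}  # idempotency key -> charge record

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the original charge instead of charging again.
        if idempotency_key in self._charges:
            return self._charges[idempotency_key]
        record = {
            "charge_id": f"ch_{len(self._charges) + 1}",
            "amount_cents": amount_cents,
        }
        self._charges[idempotency_key] = record
        return record

    def total_charged(self) -> int:
        return sum(c["amount_cents"] for c in self._charges.values())


def charge_payment_step(gateway, workflow_id: str, amount_cents: int) -> dict:
    # Derive the key from the workflow and step name so every retry reuses it.
    return gateway.charge(f"{workflow_id}:charge_payment", amount_cents)


gateway = FakePaymentGateway()
first = charge_payment_step(gateway, "wf-1", 2500)
retry = charge_payment_step(gateway, "wf-1", 2500)  # simulated re-execution
```

Because the key is a function of the workflow and the step, a retry that re-runs the step gets back the original charge rather than creating a second one.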

What are Retry-Safe Workflows?

A retry-safe workflow is a multi-step operation that can be safely re-executed after a failure without causing duplicate side effects. If your server crashes halfway through a three-step workflow (create order → charge payment → send confirmation), retrying the workflow should pick up where it left off — not re-charge the customer.

This is fundamentally different from single-request idempotency. Here, you're coordinating multiple operations that each have side effects, and you need the entire sequence to be resumable.

Why It Matters

Multi-step operations fail in production for many reasons:

  • Server crashes between steps
  • Database connections drop mid-transaction
  • External API calls time out
  • Deployments restart the process during execution

Without retry safety, operators are left manually inspecting database state to figure out what completed and what didn't. With retry safety, the system simply re-runs the workflow and it self-heals.

How It Works

Step Tracking

Store the state of each step in a database table:

  • Create a workflow record with a unique ID
  • Before executing each step, check if it's already marked as completed
  • If completed: skip it and move to the next step
  • If not: execute it, mark it as completed, then continue
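
The steps above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module with an in-memory database. The table names (`workflows`, `workflow_steps`) and the helpers `create_workflow`, `is_completed`, and `run_step` are assumptions made for the example, not a prescribed schema.

```python
import sqlite3
import uuid


def init_schema(conn):
    # Hypothetical schema: one row per workflow, one row per completed step.
    conn.execute("CREATE TABLE IF NOT EXISTS workflows (id TEXT PRIMARY KEY)")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS workflow_steps (
               workflow_id TEXT NOT NULL REFERENCES workflows(id),
               step_name   TEXT NOT NULL,
               PRIMARY KEY (workflow_id, step_name)
           )"""
    )


def create_workflow(conn):
    """Create a workflow record with a unique ID and return that ID."""
    workflow_id = str(uuid.uuid4())
    conn.execute("INSERT INTO workflows (id) VALUES (?)", (workflow_id,))
    conn.commit()
    return workflow_id


def is_completed(conn, workflow_id, step_name):
    row = conn.execute(
        "SELECT 1 FROM workflow_steps WHERE workflow_id = ? AND step_name = ?",
        (workflow_id, step_name),
    ).fetchone()
    return row is not None


def run_step(conn, workflow_id, step_name, action):
    """Run `action` unless the step is already recorded. Return True if it ran."""
    if is_completed(conn, workflow_id, step_name):
        return False  # completed on an earlier attempt: skip
    action()
    conn.execute(
        "INSERT INTO workflow_steps (workflow_id, step_name) VALUES (?, ?)",
        (workflow_id, step_name),
    )
    conn.commit()  # the completion mark must be durable before moving on
    return True


conn = sqlite3.connect(":memory:")
init_schema(conn)
workflow_id = create_workflow(conn)
calls = []
ran_first = run_step(conn, workflow_id, "create_order", lambda: calls.append("create_order"))
ran_again = run_step(conn, workflow_id, "create_order", lambda: calls.append("create_order"))
```

The second call finds the completion row and returns without invoking the action, which is the behaviour a retry relies on.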

Concurrency Protection

If two retry attempts run simultaneously, they might both try to execute the same step. Prevent this with database-level locking:

  • Advisory locks: Lock on the workflow ID before checking/executing steps. Only one process can hold the lock at a time.
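  • SELECT FOR UPDATE: Lock the workflow row within a transaction. Other transactions that try to lock or update that row wait until the lock holder commits or rolls back. In databases such as PostgreSQL, plain reads are not blocked, so every retry must take the same lock before checking step state.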
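
One way the advisory-lock option might look in Python is sketched below. It assumes PostgreSQL, whose `pg_try_advisory_lock` and `pg_advisory_unlock` functions take a 64-bit integer key, and a DB-API connection that uses `%s` placeholders and runs in autocommit mode (for example, one created with psycopg2). The helper names `lock_key` and `workflow_lock` are hypothetical.

```python
import hashlib
from contextlib import contextmanager


def lock_key(workflow_id: str) -> int:
    """Map a workflow ID to a signed 64-bit integer usable as an advisory lock key."""
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@contextmanager
def workflow_lock(conn, workflow_id: str):
    """Try to hold a session-level advisory lock for the duration of the with-block.

    Yields True if the lock was acquired and False if another session holds it.
    Assumption: `conn` is an autocommit DB-API connection to PostgreSQL.
    """
    key = lock_key(workflow_id)
    cur = conn.cursor()
    cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            # Session-level advisory locks are not released by commit or
            # rollback, so unlock explicitly.
            cur.execute("SELECT pg_advisory_unlock(%s)", (key,))


# Usage, given a real connection `conn`:
#
#     with workflow_lock(conn, "wf-1") as acquired:
#         if acquired:
#             ...  # check and execute steps
#         else:
#             ...  # another retry is already running this workflow
```

The lock is scoped to a single workflow ID, so unrelated workflows are not blocked. If the process dies, PostgreSQL releases session-level advisory locks when the connection closes, so a later retry can acquire the lock again.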

The Pattern

1. Acquire lock on workflow ID
2. Load workflow state from database
3. For each step:
   a. If step already completed → skip
   b. Otherwise, execute step
   c. Mark step as completed in database
4. Release lock, even if a step failed
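
The numbered pattern above can be sketched end to end, including a simulated failure and a retry. This is an illustration under stated assumptions: it uses an in-memory `sqlite3` database, a hypothetical `workflow_steps` table, and a `threading.Lock` as an in-process placeholder for a real database lock such as a PostgreSQL advisory lock. The step names follow the order example from earlier in the article.

```python
import sqlite3
import threading


def completed_steps(conn, workflow_id):
    rows = conn.execute(
        "SELECT step_name FROM workflow_steps WHERE workflow_id = ?",
        (workflow_id,),
    ).fetchall()
    return {name for (name,) in rows}


def run_workflow(conn, lock, workflow_id, steps):
    """Run (name, action) pairs in order, skipping steps already recorded."""
    with lock:  # 1. acquire; 4. release on exit, even if a step raises
        done = completed_steps(conn, workflow_id)  # 2. load workflow state
        for name, action in steps:  # 3. for each step
            if name in done:  # 3a. already completed: skip
                continue
            action()  # 3b. execute
            conn.execute(
                "INSERT INTO workflow_steps (workflow_id, step_name) VALUES (?, ?)",
                (workflow_id, name),
            )
            conn.commit()  # 3c. mark completed durably


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_steps (workflow_id TEXT, step_name TEXT, "
    "PRIMARY KEY (workflow_id, step_name))"
)
lock = threading.Lock()  # placeholder only: does not protect across processes
log = []
attempts = {"charge_payment": 0}


def charge_payment():
    attempts["charge_payment"] += 1
    if attempts["charge_payment"] == 1:
        raise RuntimeError("simulated failure before the charge happens")
    log.append("charge_payment")


steps = [
    ("create_order", lambda: log.append("create_order")),
    ("charge_payment", charge_payment),
    ("send_confirmation", lambda: log.append("send_confirmation")),
]

try:
    run_workflow(conn, lock, "wf-1", steps)  # first attempt fails at step 2
except RuntimeError:
    pass
run_workflow(conn, lock, "wf-1", steps)  # retry skips step 1 and resumes at step 2
```

Note that a crash between executing a step (3b) and marking it completed (3c) would cause the step to run again on retry, which is why each step still needs to be idempotent on its own.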

Common Mistakes

  • In-memory state tracking: Storing workflow progress in a variable instead of the database means all state is lost on crash — the entire workflow re-runs.
  • No locking: Two concurrent retries can execute the same step twice if you don't lock the workflow.
  • Locking too broadly: Locking at the table level instead of the workflow level blocks unrelated workflows from progressing.
