Retry-Safe Workflows
Building multi-step operations that survive failures
Key Takeaways
- ✓ Track each step's completion state in the database and check it before executing, so retries skip completed steps
- ✓ Use database-level locking (advisory locks or SELECT ... FOR UPDATE) to prevent concurrent retries from running the same step simultaneously
- ✓ Store workflow state in the database, not in memory, so progress survives server restarts
- ✓ Make each step independently idempotent: if a step is re-executed, it should produce the same result
What are Retry-Safe Workflows?
A retry-safe workflow is a multi-step operation that can be safely re-executed after a failure without causing duplicate side effects. If your server crashes halfway through a three-step workflow (create order → charge payment → send confirmation), retrying the workflow should pick up where it left off — not re-charge the customer.
This is fundamentally different from single-request idempotency. Here, you're coordinating multiple operations that each have side effects, and you need the entire sequence to be resumable.
Why It Matters
Multi-step operations fail in production for many reasons:
- Server crashes between steps
- Database connections drop mid-transaction
- External API calls time out
- Deployments restart the process during execution
Without retry safety, operators are left manually inspecting database state to figure out what completed and what didn't. With retry safety, the system simply re-runs the workflow and it self-heals.
How It Works
Step Tracking
Store the state of each step in a database table:
- Create a workflow record with a unique ID
- Before executing each step, check if it's already marked as completed
- If completed: skip it and move to the next step
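- If not: execute it, mark it as completed, then continue. A crash between executing and marking means the step runs again on retry, which is why each step must itself be idempotent.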
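The check-then-execute loop above can be sketched as follows. This is a minimal illustration, not a production implementation: it uses Python's built-in sqlite3 module so it runs anywhere, and the table names (`workflows`, `workflow_steps`) and the function names `init_schema` and `run_workflow` are assumptions made for this example.

```python
import sqlite3
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action with side effects)


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per workflow, one row per completed step.
    conn.execute("CREATE TABLE IF NOT EXISTS workflows (id TEXT PRIMARY KEY)")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS workflow_steps (
            workflow_id TEXT NOT NULL REFERENCES workflows(id),
            step_name   TEXT NOT NULL,
            PRIMARY KEY (workflow_id, step_name)
        )
        """
    )
    conn.commit()


def run_workflow(conn: sqlite3.Connection, workflow_id: str, steps: List[Step]) -> List[str]:
    """Run steps in order, skipping any already recorded as completed.

    Returns the names of the steps executed during this attempt.
    """
    # Create the workflow record if this is the first attempt.
    conn.execute("INSERT OR IGNORE INTO workflows (id) VALUES (?)", (workflow_id,))
    conn.commit()

    executed = []
    for name, action in steps:
        done = conn.execute(
            "SELECT 1 FROM workflow_steps WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, name),
        ).fetchone()
        if done:
            continue  # completed on an earlier attempt
        action()  # if this raises, nothing is recorded and a retry re-runs the step
        conn.execute(
            "INSERT INTO workflow_steps (workflow_id, step_name) VALUES (?, ?)",
            (workflow_id, name),
        )
        conn.commit()  # persist progress before moving on
        executed.append(name)
    return executed
```

If the second of three steps raises on the first attempt, calling `run_workflow` again with the same workflow ID skips the first step and resumes at the second. Progress is committed after every step so that a crash loses at most the step in flight.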
Concurrency Protection
If two retry attempts run simultaneously, they might both try to execute the same step. Prevent this with database-level locking:
- Advisory locks: Lock on the workflow ID before checking/executing steps. Only one process can hold the lock at a time.
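- SELECT ... FOR UPDATE: Lock the workflow row within a transaction. Other transactions that try to lock or update the same row wait until the first one commits or rolls back. In PostgreSQL and MySQL/InnoDB, plain reads are not blocked, so every retry must take the lock itself before it checks step state.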
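Both locking options can be sketched as small helpers. This sketch assumes PostgreSQL and a DB-API cursor that uses `%s` placeholders (as psycopg does). The helper names, the `workflows` table, and its `id` and `status` columns are hypothetical. Because `pg_advisory_xact_lock` takes a 64-bit integer key, the example hashes the workflow ID down to one, which is a design choice of this sketch rather than a requirement.

```python
import hashlib
import struct


def advisory_lock_key(workflow_id: str) -> int:
    """Map a workflow ID to a signed 64-bit integer usable as an advisory lock key."""
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    # ">q" unpacks the first 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]


def lock_workflow_advisory(cursor, workflow_id: str) -> None:
    """Block until the current transaction holds the advisory lock for this workflow.

    pg_advisory_xact_lock is released automatically when the transaction
    commits or rolls back, so no explicit unlock call is needed.
    """
    cursor.execute(
        "SELECT pg_advisory_xact_lock(%s)",
        (advisory_lock_key(workflow_id),),
    )


def lock_workflow_row(cursor, workflow_id: str):
    """Lock the workflow's row until the transaction ends and return it.

    A second transaction running the same statement for the same ID waits here.
    """
    cursor.execute(
        "SELECT id, status FROM workflows WHERE id = %s FOR UPDATE",
        (workflow_id,),
    )
    return cursor.fetchone()
```

Either helper would be called at the start of the transaction that checks and executes steps. With hashed keys, two different workflow IDs could in rare cases map to the same key. That only makes those two workflows wait on each other; it does not break correctness.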
The Pattern
1. Acquire lock on workflow ID
2. Load workflow state from database
3. For each step:
a. If step already completed → skip
b. Execute step
c. Mark step as completed in database
4. Release lock
Common Mistakes
- In-memory state tracking: Storing workflow progress in a variable instead of the database means all state is lost on crash — the entire workflow re-runs.
- No locking: Two concurrent retries can execute the same step twice if you don't lock the workflow.
- Locking too broadly: Locking at the table level instead of the workflow level blocks unrelated workflows from progressing.
Further Reading
Temporal's explanation of durable workflows — the industrial-strength version of retry-safe execution.
How AWS Step Functions handles retries, timeouts, and partial failures in multi-step workflows.
The definitive resource on consistency and fault tolerance in distributed systems, including workflow coordination patterns.