
Anatomy of a Reliable Workflow: Retry Logic, Dead-Letter Queues, and Alerting

Liam Donovan

January 3, 2025


Every workflow that calls an external API will fail eventually. The API will be under maintenance, a rate limit will be exceeded, a network packet will get dropped, a dependent service will restart. These are not edge cases — they're the baseline operating conditions of any system that runs continuously and calls external services. The question is not whether your workflow will encounter these conditions, but what happens when it does.

A fault-tolerant workflow is not one that never fails. It's one where failures are recoverable, observable, and don't produce incorrect state. Getting there requires three specific mechanisms working together: retry logic that handles transient failures, dead-letter queues for permanently failed steps, and alerting that surfaces actionable signals without generating noise. This post walks through each one in depth.

Retry Logic: The Details That Matter

Retrying a failed step seems simple. Attempt the operation. If it fails, wait and try again. But the implementation details — retry count, wait interval, error classification, idempotency — determine whether your retry logic actually makes your workflow more reliable or introduces new failure modes.

Retry count and backoff interval

A fixed-interval retry policy (retry every 30 seconds, up to 3 times) is a starting point but not a complete solution. When an API is under sustained load, fixed-interval retries from multiple concurrent workflow runs add to the load at a predictable cadence, potentially making the API's recovery harder. Exponential backoff spreads retry pressure over time: retry after 15 seconds, then 60 seconds, then 5 minutes, then 30 minutes. Each successively longer interval gives the downstream system more time to recover.

A practical configuration for most ops workflow steps: 4 retry attempts with exponential backoff starting at 15 seconds, capped at a maximum interval of 30 minutes. After the 4th failed attempt, the step is marked permanently failed and routed to the dead-letter queue. Adjustments: lower the cap for time-sensitive workflows (a lead notification that retries for 2 hours has missed the response SLA), raise it for non-urgent batch steps where eventual completion is acceptable.
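As a rough sketch of that configuration, the function below computes the wait before each retry. The function name, its parameters, and the constant growth factor are assumptions made for illustration; the example intervals quoted earlier (15 seconds, 60 seconds, 5 minutes, 30 minutes) grow faster than a fixed multiplier, so treat the factor as a knob to tune rather than a prescribed value.

```python
def backoff_delays(base_seconds=15.0, factor=2.0, cap_seconds=1800.0, max_retries=4):
    """Return the wait, in seconds, before each retry attempt.

    Hypothetical helper: the defaults mirror the configuration described
    in the text (4 retries, 15 second start, 30 minute cap). The growth
    factor is an assumption.
    """
    return [
        min(cap_seconds, base_seconds * factor ** attempt)
        for attempt in range(max_retries)
    ]


if __name__ == "__main__":
    print(backoff_delays())             # doubling schedule
    print(backoff_delays(factor=4.0))   # steeper schedule
```

Once the list is exhausted, the step would be marked permanently failed and handed to the dead-letter queue. Lowering `cap_seconds` is the adjustment for time-sensitive workflows; raising it suits non-urgent batch steps.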

Error classification before retry

Not all errors are retriable. Before queuing a retry, the workflow runner should classify the error:

Retriable: HTTP 429 (rate limited), 503 (service unavailable), 504 (gateway timeout), network-level timeout errors. These indicate a transient condition that will likely resolve.
Non-retriable: HTTP 400 (bad request), 401 (unauthorized), 403 (forbidden), 404 (resource not found), 422 (unprocessable entity). These indicate a structural problem — malformed payload, expired authentication token, missing resource — that retrying will not fix. Retrying a 401 wastes retry attempts and delays the alert that would prompt someone to refresh the token.

Some error classification is ambiguous. A 500 (internal server error) could indicate a transient bug in the target API that will resolve on retry, or it could indicate a systematic failure that will persist. The pragmatic approach is to treat a 500 as retriable for the first two or three attempts and as non-retriable after that, on the assumption that if those attempts didn't succeed, the API is probably in a sustained failure state rather than a momentary one.
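The classification rules above can be sketched as a single decision function. The names here are hypothetical, and two choices are assumptions rather than anything the rules dictate: the 500 cutoff is set to three attempts, and any status code not listed is treated as non-retriable.

```python
RETRIABLE = {429, 503, 504}
NON_RETRIABLE = {400, 401, 403, 404, 422}
MAX_500_ATTEMPTS = 3  # assumption: upper end of the "two or three attempts" guidance


def should_retry(status_code, attempts_made):
    """Decide whether a failed step should be retried.

    status_code: HTTP status of the failed call, or None for a
        network-level timeout where no response was received.
    attempts_made: how many attempts have already been made (1-based).
    """
    if status_code is None:
        return True  # network-level timeout: transient
    if status_code in RETRIABLE:
        return True
    if status_code in NON_RETRIABLE:
        return False  # structural problem; retrying only delays the alert
    if status_code == 500:
        return attempts_made < MAX_500_ATTEMPTS
    return False  # assumption: unknown codes go straight to the DLQ
```

This function only answers "is this error worth retrying"; the overall retry limit from the backoff policy still applies on top of it.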

Idempotency on write operations

Retry logic on write operations introduces a risk: if the first attempt succeeded on the server but the caller saw an error (the write completed but the 200 response was lost in transit), a naive retry will create a duplicate record. This is the classic distributed systems problem of at-least-once delivery versus exactly-once delivery.

The solution is idempotency keys: a unique identifier (typically a hash of the workflow run ID and the step ID) included in the API request header. A well-implemented API uses the idempotency key to recognize duplicate requests and return the result of the original request without re-executing the write. When the receiving API supports idempotency keys, always use them on retriable write steps. When it doesn't, use a read-before-write check: look up whether the record already exists with the expected values before attempting to create it.
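Here is a minimal sketch of both techniques. The `Idempotency-Key` header name is a common convention but an assumption here, so check what the receiving API actually expects. The in-memory dict stands in for the target system, and all function names are hypothetical.

```python
import hashlib


def idempotency_key(run_id: str, step_id: str) -> str:
    """Derive a stable key from the workflow run ID and step ID.

    The same run and step always produce the same key, so every retry
    of a step sends an identical key.
    """
    return hashlib.sha256(f"{run_id}:{step_id}".encode("utf-8")).hexdigest()


def build_headers(run_id: str, step_id: str) -> dict:
    # Assumption: the receiving API reads the key from this header.
    return {"Idempotency-Key": idempotency_key(run_id, step_id)}


def create_if_absent(store: dict, record_id: str, values: dict) -> bool:
    """Read-before-write fallback for APIs without idempotency keys.

    `store` is an in-memory stand-in for the target system. Returns True
    if a write was performed, False if the record already existed with
    the expected values.
    """
    if store.get(record_id) == values:
        return False  # an earlier attempt already landed; skip the write
    store[record_id] = values
    return True
```

The read-before-write check narrows the duplicate window but does not close it: two concurrent attempts can both read "absent" before either writes. That is why a server-side idempotency key is the preferred option when available.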

Dead-Letter Queues: What Happens After Retries Exhaust

A dead-letter queue (DLQ) is the destination for workflow steps that have exhausted all retries and cannot complete. The name comes from messaging queue systems, where messages that cannot be delivered are moved to a "dead letter" holding area rather than being discarded.

In an ops workflow context, the DLQ serves two purposes: it preserves the failed payload (so the step can be replayed manually once the root cause is fixed), and it provides an audit trail of all permanently failed steps across all workflow runs.

What a DLQ entry should contain

A useful DLQ entry includes:

The workflow run ID and step name
The timestamp of the first failure and the timestamp of the final retry attempt
The input payload that was being processed when the step failed
The error response from the final retry (status code, error message, stack trace if available)
The number of retry attempts made

Without the input payload, a DLQ is largely useless for recovery — you know a step failed, but you can't replay it without reconstructing what data was being processed. Storing the payload in the DLQ entry allows an ops practitioner to inspect the failed record, fix the root cause (expired API key, malformed field value, missing record in the target system), and manually replay the step from the DLQ without needing to re-trigger the original workflow.
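One possible shape for such an entry, sketched as a dataclass. The field names and the JSON serialization are illustrative assumptions; the fields themselves map one-to-one onto the list above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


def _encode(value):
    """JSON fallback encoder: datetimes become ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


@dataclass(frozen=True)
class DLQEntry:
    run_id: str                 # workflow run ID
    step_name: str              # step that failed
    first_failure_at: datetime  # timestamp of the first failure
    final_attempt_at: datetime  # timestamp of the final retry attempt
    payload: dict               # input being processed; required for replay
    status_code: Optional[int]  # from the final retry; None for a network timeout
    error_message: str
    retry_count: int
    stack_trace: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the entry for storage in the queue."""
        return json.dumps(asdict(self), default=_encode, sort_keys=True)
```

If payloads can contain sensitive fields, the DLQ store would need the same access controls as the source system, since it holds a full copy of the data.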

Manual replay vs. automatic retry from DLQ

Some teams configure automatic periodic retries from the DLQ — every 4 hours, attempt to reprocess all DLQ entries. This works for cases where the root cause was truly transient and has since resolved (a CRM maintenance window that ended six hours ago). It's less suitable for cases where the root cause is a data problem that will fail again on replay — it just re-populates the DLQ with the same entries and generates repeated alert noise.

The safer default is manual replay with optional bulk resubmission. An ops practitioner reviews the DLQ entries, confirms the root cause is resolved, and initiates a replay. For volume recovery after an extended outage, bulk replay — resubmit all DLQ entries created in a specified time window — is more practical than one-by-one processing.
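Bulk replay over a time window might look like the sketch below. The entry shape (a dict with `id`, `created_at`, and `payload`) and the `replay_step` callable are assumptions for illustration; in practice `replay_step` would re-run the failed step against the target system.

```python
def bulk_replay(entries, window_start, window_end, replay_step):
    """Replay DLQ entries created in [window_start, window_end).

    entries: iterable of dicts with "id", "created_at", and "payload".
    replay_step: callable that re-runs the step and raises on failure.
    Returns (replayed_ids, still_failing_ids). Entries outside the
    window are left untouched.
    """
    replayed, still_failing = [], []
    for entry in entries:
        if not (window_start <= entry["created_at"] < window_end):
            continue  # outside the requested window
        try:
            replay_step(entry["payload"])
            replayed.append(entry["id"])
        except Exception:
            # Keep the entry in the DLQ; the root cause is not fixed for it.
            still_failing.append(entry["id"])
    return replayed, still_failing
```

Returning the IDs that failed again makes it easy to spot the data-problem cases, which will keep failing no matter how often they are resubmitted.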

Alerting: Signal vs. Noise

A well-configured retry and DLQ system means most transient failures resolve automatically. The alerting layer's job is to surface the failures that don't — the ones that land in the DLQ or the ones where the failure rate indicates a systemic problem rather than a transient one.

Alert on failure rate, not individual failures

Alerting on every single step failure is operationally unsustainable for any workflow that runs at meaningful volume. A workflow processing 500 records per hour with a 1% transient failure rate on the CRM write step will generate 5 alerts per hour, each of which resolves on first retry. That's noise that trains the team to ignore alerts — which means the real incidents get ignored too.

The better approach is failure rate thresholds over a rolling window: alert if the step failure rate exceeds 5% over the last 60 minutes. This surfaces the difference between normal transient failures (which stay below 5% and resolve via retry) and actual incidents (where the rate spikes because the downstream API is down or authentication has expired).
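A rolling-window threshold can be sketched with a deque of recent outcomes. The class name and the `min_samples` guard are assumptions: the guard keeps a single failure on a nearly idle workflow from reading as a 100% failure rate.

```python
from collections import deque


class FailureRateMonitor:
    """Track step outcomes and flag when the failure rate exceeds a threshold."""

    def __init__(self, window_seconds=3600, threshold=0.05, min_samples=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples  # assumption: not part of the rule in the text
        self._events = deque()  # (timestamp_seconds, failed) in arrival order

    def record(self, timestamp, failed):
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop outcomes that have aged out of the rolling window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now):
        rate = self.failure_rate(now)
        return len(self._events) >= self.min_samples and rate > self.threshold
```

With the earlier example of 500 records per hour at a 1% failure rate, the monitor stays quiet; a burst of failures that pushes the hourly rate past 5% trips it.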

Alert on DLQ entries, not on retries

A DLQ entry represents a confirmed, persistent failure that requires human attention. Alerting on DLQ additions is appropriate, but aggregate the alert rather than sending one notification per entry. "5 new entries in the CRM_Write dead-letter queue in the last 15 minutes, all HTTP 401, token may be expired" is actionable. Five separate Slack messages each saying "step failed, moved to DLQ" are noise: a single underlying event that generates five interruptions.
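Aggregation can be as simple as grouping the new entries by queue and status code before sending anything. The sketch below assumes each new entry is available as a `(queue_name, status_code)` pair collected over the alert window; the function name and message wording are illustrative.

```python
from collections import Counter, defaultdict


def summarize_dlq(entries, window_minutes=15):
    """Build one alert message per queue from new DLQ entries.

    entries: iterable of (queue_name, status_code) pairs added to the
        DLQ during the last `window_minutes`.
    """
    by_queue = defaultdict(Counter)
    for queue, status in entries:
        by_queue[queue][status] += 1

    messages = []
    for queue in sorted(by_queue):
        counts = by_queue[queue]
        total = sum(counts.values())
        breakdown = ", ".join(
            f"{n}x HTTP {code}" for code, n in counts.most_common()
        )
        noun = "entry" if total == 1 else "entries"
        messages.append(
            f"{total} new {noun} in the {queue} dead-letter queue "
            f"in the last {window_minutes} minutes ({breakdown})"
        )
    return messages
```

A hint such as "token may be expired" could be appended by mapping the dominant status code to a likely cause, which is left out here to keep the sketch short.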

Distinguish alert recipients by impact

Not every workflow failure should alert the same person. A failure in the lead notification workflow during business hours warrants a real-time Slack ping to the RevOps lead. A failure in an overnight data reconciliation job warrants a morning Slack message, not a 3am page. Configuring alert routing by workflow criticality and time window is worth the setup investment — it determines whether your on-call rotation is sustainable or exhausting.
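Routing by criticality and time window can be expressed as a small lookup. Everything specific in this sketch is an assumption: the business-hours boundaries, the two criticality levels, the channel names, and the choice to page on-call for a high-criticality failure outside business hours.

```python
from datetime import time

BUSINESS_START = time(9, 0)   # assumption: local business hours
BUSINESS_END = time(18, 0)


def route_alert(criticality: str, local_time: time) -> str:
    """Pick an alert channel from workflow criticality and time of day.

    criticality: "high" (e.g. lead notification) or "low"
        (e.g. overnight data reconciliation).
    """
    in_business_hours = BUSINESS_START <= local_time < BUSINESS_END
    if criticality == "high":
        # Real-time ping during the day; paging after hours is an assumption.
        return "slack_ping" if in_business_hours else "page_on_call"
    # Low-criticality failures never interrupt anyone overnight.
    return "slack_channel" if in_business_hours else "morning_digest"
```

For example, `route_alert("low", time(3, 0))` returns `"morning_digest"`, matching the overnight reconciliation case above.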

The Simplest Reliable Workflow

Reliability doesn't require building all of this at once. The minimum viable configuration for a workflow that runs business-critical steps: enable retry with backoff on any step that calls an external API, configure a dead-letter queue destination for permanently failed steps, and set up a single alert that fires when new DLQ entries accumulate above a threshold. That's three settings. It takes minutes to configure. It changes the operational posture from "we find out about failures when someone notices the data is wrong" to "we find out about failures within 30 minutes of the first unrecoverable error."

That gap, between discovering a failure in the data at the end of the week and discovering it within the hour, is where retry logic, DLQs, and alerting pay for themselves. The mechanisms are not complicated. The value is in having them on by default, for every workflow that touches operational data.
