Durable Outbound Webhooks with HMAC and Exponential Backoff

In a Reddit scraping side project I'm building, a user kicks off a long-running scrape and doesn't want to babysit it. They've registered a webhook URL — n8n, Zapier, webhook.site, their own backend — and they expect a POST the moment the job finishes. The orchestrator code that completes the job doesn't get to know who's listening. It just calls fire_event("scraping.completed", payload) and trusts that the notification will be cryptographically signed, delivered with retries on transient failure, audited row-by-row in the database, and dead-lettered honestly if the receiver is permanently broken.

This post is about the layer that makes that trust justified: the webhooks Django app. How it's structured, how retries work, why HMAC matters and why constant-time comparison matters more, and the one architectural decision the article actually exists to defend — retry state lives in the database, not in Celery.

The press-release courier

The cleanest framing: webhooks is a press-release distribution service with a registered-mail courier.

The user walks into the press office (the API) and says: "When my scraping jobs finish, send the press release to these URLs." The press office records the subscription, assigns a wax seal (the HMAC secret), and hands the user a copy. Later, when a job completes, the dispatcher (fire_event) takes that announcement, finds every subscription that's interested in this event type, and dispatches a courier (one Celery task) per subscription. The courier walks to the door, knocks (POST), waits for a signed receipt (200-299 response). If nobody answers, the courier comes back in a minute. Then five. Then fifteen. Then an hour. Then six hours. After that, the courier writes "undeliverable" in the logbook and moves on.

The wax seals are HMAC-SHA256 signatures, so the receiver can verify the announcement was sealed by this press office and not forged en route. The logbook is the WebhookDelivery table — every attempt, every response code, every retry, queryable from the web UI. In one line: durable, observable, idempotent outbound notification delivery, with an audit trail and resilience baked into the data layer.

The system, in one picture

Three producer tasks — scraper.tasks.run_scraping_job_task, content_analysis.tasks.run_analysis_job_task, exports.tasks.run_export_job_task — all funnel through a single fire_event(event_type, payload) function. Inside fire_event, the dispatcher queries for every active Webhook row whose subscribed events include the firing event, writes one WebhookDelivery row per match (status pending, attempt count 0), and enqueues deliver_webhook_task.delay(delivery_id) for each. From here, every delivery is independent — one Celery task, one HTTP POST, one retry chain.
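
A sketch of what that dispatcher can look like. The Webhook fields (is_active, an events JSON list) are assumptions about the schema; fire_event, WebhookDelivery, and deliver_webhook_task are the names the post itself uses:

# Assumes the webhooks app exposes the Webhook / WebhookDelivery models
# and the courier task deliver_webhook_task.
def fire_event(event_type: str, payload: dict) -> None:
    # Every active subscription whose events list includes this event type.
    # The `events__contains` lookup assumes a JSONField/ArrayField.
    for webhook in Webhook.objects.filter(is_active=True, events__contains=[event_type]):
        # One audit row per match: status pending, attempt count 0.
        delivery = WebhookDelivery.objects.create(
            webhook=webhook,
            event_type=event_type,
            payload=payload,
            status="pending",
            attempt_count=0,
        )
        # One independent courier task per delivery.
        deliver_webhook_task.delay(str(delivery.id))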

If you've read the data_provider post, the scraper.tasks.run_scraping_job_task that fires scraping.completed is the same one that actually runs the scrape through the pluggable backend factory. Webhooks is the next layer — what happens after the scrape finishes.

End to End

One event → two deliveries → two receivers: one succeeds, one fails and gets a retry scheduled.

  1. Producer — scraper.tasks.run_scraping_job_task calls fire_event("scraping.completed", payload).
  2. Dispatcher — fire_event() finds the matching subscriptions, writes one WebhookDelivery row each, and enqueues deliver_webhook_task.delay(delivery_id) × 2.
  3. Celery broker — Redis / RabbitMQ. Just a queue, no retry state; workers pick the tasks up.
  4. Courier — deliver_webhook_task signs the payload (HMAC-SHA256), POSTs it with the signature header, and captures the response.
  5. Receivers — external endpoints: n8n, Zapier, custom backends.

Fan-out is per subscription. One receiver's failure never blocks the other.

A few things to call out on the diagram. Fan-out is per-subscription, not per-event. If two users have both subscribed to scraping.completed, one event creates two WebhookDelivery rows and two independent courier tasks. A failure for one user's webhook never affects the other user's delivery. The producer is shielded from receiver failure by a try/except wrapper at the dispatcher callsite — the scrape job has already finished by the time fire_event runs, and we never want webhook trouble to cascade back into job state.
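
A minimal sketch of that shielding, assuming a hypothetical wrapper name and payload fields; only fire_event is the post's own name:

import logging

logger = logging.getLogger(__name__)

def notify_completion(job):
    # Hypothetical callsite wrapper: the scrape has already finished, so a
    # webhook failure is logged and swallowed, never raised into job state.
    try:
        fire_event("scraping.completed", {
            "job_id": str(job.id),        # illustrative payload fields
            "post_count": job.post_count,
        })
    except Exception:
        logger.exception("webhook dispatch failed; job state unaffected")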

The broker is just a queue. Celery owns the in-flight tasks, but it owns nothing about retry scheduling — the next attempt is always scheduled via apply_async(countdown=…) from inside the courier task, after the failure is recorded in the database. More on this below, in the section on why retry state lives in the database.

The retry policy

Five attempts. Five gaps: 1 minute, 5 minutes, 15 minutes, 1 hour, 6 hours. If all five fail, the delivery is marked dead. Total wall-clock time from first attempt to DEAD is roughly 7 hours 21 minutes.

The gap sizes aren't arbitrary. They're picked to absorb three different failure modes:

  • Transient blips (network hiccup, momentary receiver overload) — most of these recover within a minute or two. Attempts 1 and 2 catch them.
  • Short outages (receiver restart, deploy in progress) — usually resolve in 5–20 minutes. Attempts 3 and 4 catch them.
  • Longer outages (receiver service down for the morning) — the final attempt, hours into the schedule, gives most operational incidents a chance to resolve before we give up.

The geometric growth is deliberate: linear retries (try every 5 minutes for an hour) would either hammer a transient outage or give up too fast on a longer one. Exponential gaps spread the retries across time scales — letting the same policy serve all three failure modes without ops intervention.
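
In code, the whole policy can be one constant. A sketch using the BACKOFFS name that appears later in the post; the values follow directly from the gaps above:

# Seconds to wait after the Nth failed attempt (indexed by attempt_count - 1).
BACKOFFS = [60, 300, 900, 3600, 21600]  # 1m, 5m, 15m, 1h, 6h

# sum(BACKOFFS) == 26460 seconds == 7 hours 21 minutes, which is where the
# "~7h21m to DEAD" figure comes from when every attempt fails.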

Retry Schedule

Five attempts, exponential gaps, ~7h21m from first attempt to DEAD.

  attempt      gap    elapsed
  attempt 1    —      T+0 (first try)
  attempt 2    +1m    T+1m
  attempt 3    +5m    T+6m
  attempt 4    +15m   T+21m
  attempt 5    +1h    T+1h 21m
  DEAD         +6h    T+7h 21m (terminal — manual retry only)

The long, empty gap between attempt 5 and DEAD is six hours. That’s the window where a receiver outage can still self-recover.

After attempt 5 fails, the delivery row's status is set to dead and the courier walks away. The row stays in the database forever (or until manually deleted) — dead is a terminal state, not a cleanup signal. A user clicking "Retry now" in the UI can reopen any failed or dead delivery (we flip status back to pending and enqueue a fresh task), which is the recovery path for genuinely-permanent receiver problems that get fixed later.

One detail worth noting: retries are scheduled with deliver_webhook_task.apply_async(args=[str(delivery.id)], countdown=seconds). The countdown argument tells Celery to defer execution by that many seconds. The retry isn't a Celery-internal feature — it's a fresh task enqueue, scheduled from the courier task's own failure path. We'll defend that choice in the retry-state-ownership section below.

The delivery state machine

Every WebhookDelivery row carries a status field that's the single source of truth for "what happened to this notification." Five states: pending, in_flight, success, failed, dead. The transitions:

  • pending → in_flight: a courier task picks the row up.
  • in_flight → success: the receiver returned a 2xx response.
  • in_flight → failed: the receiver returned a non-2xx (or no response within the timeout), and attempt_count < MAX_ATTEMPTS. The courier schedules a retry; the row's next_retry_at timestamp is set so the UI can show "Next retry at 14:03".
  • in_flight → dead: same as failed, but attempt_count >= MAX_ATTEMPTS. No retry is scheduled; the row is terminal.
  • failed → in_flight (deferred): the scheduled retry fires.
  • failed → pending: the user clicks "Retry now" in the UI. We flip the status, clear next_retry_at, and delay() a fresh task.
  • dead → pending: the same manual-retry edge, just from a different terminal state.

Delivery States

One delivery's lifecycle — pending → in_flight → failed → retry → success.

  • pending — awaiting the first attempt
  • in_flight — courier sending now
  • success — 2xx received; terminal
  • failed — retrying; non-terminal
  • dead — MAX_ATTEMPTS exhausted; terminal
success and dead are terminal — the courier's first action is is_terminal(), so Celery double-deliveries no-op.

The state machine carries one critical guarantee: re-entering a terminal state — success or dead — is a no-op. The courier task's first action, before doing anything else, is:

if delivery.is_terminal():
    return {"task": "deliver", "delivery_id": str(delivery.id),
            "status": delivery.status, "skipped": "already_terminal"}

If Celery somehow delivers the same task twice (rare in steady state, but it happens — broker hiccups, worker restarts, manual replays from tooling), the second invocation no-ops on the row's terminal state. The database is the source of truth for "is this attempt still owed?" — the second courier checks, sees the answer is no, and walks away clean.
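
For concreteness, here is a sketch of the model behind all of this. The post confirms the status values, attempt_count, next_retry_at, and that MAX_ATTEMPTS lives on the model; the remaining field definitions are assumptions:

from django.db import models

class WebhookDelivery(models.Model):
    # Authoritative max-retry constant — on the model, not tasks.py,
    # so migrations can see it (see the gotchas section).
    MAX_ATTEMPTS = 5

    STATUS_CHOICES = [(s, s) for s in
                      ("pending", "in_flight", "success", "failed", "dead")]

    webhook = models.ForeignKey("webhooks.Webhook", on_delete=models.CASCADE)  # assumed FK
    event_type = models.CharField(max_length=100)
    payload = models.JSONField()
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="pending")
    attempt_count = models.PositiveIntegerField(default=0)
    next_retry_at = models.DateTimeField(null=True, blank=True)

    def is_terminal(self) -> bool:
        # success and dead never get another automatic attempt; failed is
        # non-terminal because a scheduled retry is still owed.
        return self.status in ("success", "dead")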

HMAC signing and verification

Every outbound POST carries three custom headers on top of the standard Content-Type:

  • X-OrientedOS-Signature: the HMAC-SHA256 hex digest of the raw request body, prefixed with sha256=. The prefix is important — it versions the algorithm so we can ship sha256_v2= or blake3= later without breaking existing receivers.
  • X-OrientedOS-Delivery-Id: the WebhookDelivery.id. Receivers can use this for their own dedupe — if they see a delivery ID they've processed before (because the courier retried after the receiver responded slowly), they can return 200 without re-processing.
  • X-OrientedOS-Event: the event type string (e.g., scraping.completed), so receivers can route on it without parsing the body.

Sender side, the signing is a handful of lines of Python:

import hashlib
import hmac

def sign_payload(secret: str, body: bytes) -> str:
    # HMAC-SHA256 over the raw body bytes, hex-encoded and algorithm-prefixed.
    digest = hmac.new(
        secret.encode("utf-8"),
        body,
        hashlib.sha256,
    ).hexdigest()
    return f"sha256={digest}"

HMAC Wire Format

How a signed webhook crosses the wire and gets verified on the other side.

Sender (the courier task):

1 · the payload

{
  "event": "scraping.completed",
  "job_id": "8e2a...",
  "post_count": 42
}

2 · hmac

digest = hmac.new(
    webhook.secret.encode(),
    body,
    hashlib.sha256,
).hexdigest()

3 · header

sha256=<64-char hex>

Wire (the POST request):

POST /your/webhook HTTP/1.1
Host: your-receiver.example.com
Content-Type: application/json
User-Agent: OrientedOS-Webhooks/1.0
X-OrientedOS-Event: scraping.completed
X-OrientedOS-Delivery-Id: 8e2a-...-9f01
X-OrientedOS-Signature: sha256=4f3d...

{ "event": "...", "job_id": "...", ... }

  • X-OrientedOS-Event · route on it without parsing the body
  • X-OrientedOS-Delivery-Id · for receiver-side dedupe on retries
  • X-OrientedOS-Signature · the HMAC, algorithm-prefixed

Receiver (your service):

1 · extract

raw_body = request.body  # bytes, untouched
header = request.headers.get("X-OrientedOS-Signature", "")

2 · re-compute

expected = "sha256=" + hmac.new(
    secret.encode(), raw_body,
    hashlib.sha256,
).hexdigest()

3 · compare (constant-time)

if hmac.compare_digest(expected, header):
    accept()
else:
    reject(401)

Don't use ==. Python's string equality short-circuits on the first differing byte. An attacker measuring response time can guess the signature one byte at a time. compare_digest reads every byte regardless — that's the whole point.

Receiver side, the verification is the part that catches people:

import hashlib
import hmac

def verify_signature(secret: str, body: bytes, header: str) -> bool:
    # Reject anything without the versioned algorithm prefix.
    if not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256,
    ).hexdigest()
    # Constant-time comparison; never use == here.
    return hmac.compare_digest(expected, header)

The hmac.compare_digest(...) call is non-negotiable. The naive version (expected == header) is a timing oracle: Python's == on strings short-circuits on the first differing byte, so an attacker who can measure response time can guess the signature one byte at a time. compare_digest does constant-time comparison — it always reads every byte regardless of where the mismatch is — and it's the difference between "signature verification" and "signature theatre." If you're writing the receiver side of anyone's webhook system and the reference implementation says ==, that's a bug.

One more receiver-side detail: hash the raw request body bytes, not a parsed-and-re-serialized version. JSON round-tripping can subtly reorder keys or change whitespace, which breaks the digest. Grab the raw body from your framework (request.body in Django, await request.body() in FastAPI, the req.rawBody pattern via body-parser's verify callback in Express) and pass it untouched into the HMAC.
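
To make that concrete, a minimal Django receiver that feeds the raw bytes into verify_signature and dedupes on the delivery id. MY_SECRET, process_event, and the in-memory seen-set are placeholders, not part of the project:

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

SEEN_DELIVERY_IDS = set()  # placeholder — use a table or cache in real code

@csrf_exempt
def webhook_receiver(request):
    raw_body = request.body  # raw bytes — never re-serialize before hashing
    header = request.headers.get("X-OrientedOS-Signature", "")
    if not verify_signature(MY_SECRET, raw_body, header):
        return HttpResponse(status=401)

    # Dedupe on the delivery id: a slow 200 can cross paths with a retry.
    delivery_id = request.headers.get("X-OrientedOS-Delivery-Id", "")
    if delivery_id in SEEN_DELIVERY_IDS:
        return HttpResponse(status=200)  # already processed — just ack again
    SEEN_DELIVERY_IDS.add(delivery_id)

    process_event(request.headers.get("X-OrientedOS-Event"), raw_body)  # placeholder handler
    return HttpResponse(status=200)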

Why the retry state lives in the database, not in Celery

This is the article's argument. Celery has a built-in retry mechanism: self.retry(countdown=seconds, max_retries=N). It's the natural first move for anyone wiring up Celery tasks. We deliberately don't use it. Here's why.

The courier task is configured with max_retries=0. We never call self.retry. When a delivery fails, we record the failure in the database (incrementing attempt_count, setting next_retry_at, updating status to failed), and then we schedule the next attempt via:

deliver_webhook_task.apply_async(
    args=[str(delivery.id)],
    countdown=BACKOFFS[delivery.attempt_count - 1],
)

A fresh enqueue, scheduled from inside the courier's failure path but outside Celery's retry machinery, backed by the row's own state. Three things become possible that aren't possible with self.retry:

  1. The retry queue is queryable. "Which webhooks are currently pending retry?" is WebhookDelivery.objects.filter(status='failed', next_retry_at__isnull=False).order_by('next_retry_at'). One ORM line. With self.retry, the next attempt lives in the Celery broker (Redis or RabbitMQ), and answering that question means broker introspection from the ops layer — not something the web UI can do.

  2. Idempotency is enforced at task entry. The is_terminal() guard at the top of the courier function uses the database as the source of truth. If Celery delivers the same task twice — broker hiccup, worker restart, manual replay — the second invocation sees status='success' and returns immediately. With self.retry, the task is re-queued without any database state to consult, so double-delivery means two concurrent attempts.

  3. Manual re-trigger uses the same code path as automatic retry. The UI's "Retry now" button does exactly what the automatic retry does: flip status back to pending, clear next_retry_at, and enqueue deliver_webhook_task.delay(delivery_id). No separate retry endpoint, no parallel scheduling logic. One task handles every path into delivery (sketched below).
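
A sketch of that shared path, where retry_now is a hypothetical name for whatever the UI action calls:

def retry_now(delivery):
    # Same path as the automatic retry: flip back to pending, clear the
    # schedule hint, enqueue the same courier task. Works from failed or dead.
    delivery.status = "pending"
    delivery.next_retry_at = None
    delivery.save(update_fields=["status", "next_retry_at"])
    deliver_webhook_task.delay(str(delivery.id))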

Retry State Ownership

self.retry (Celery owns it) vs apply_async + DB state (you own it). ✓ marks the winner per row.

| Concern | self.retry | apply_async + DB state |
| --- | --- | --- |
| Where retry state lives — the source of truth for "is this delivery still owed?" | Celery broker (Redis / RabbitMQ) | WebhookDelivery row in PostgreSQL ✓ |
| Queue observability — answering "what's pending retry right now?" | Broker introspection, ops-only | WebhookDelivery.objects.filter(status="failed") — one ORM line ✓ |
| Surfacing the retry queue in the UI | Hard — the web tier would need broker access | Trivial — same ORM, same queries ✓ |
| Idempotency on double-delivery — if Celery delivers the same task twice | Implicit — no dedupe; two concurrent attempts | Explicit — is_terminal() guard on the row ✓ |
| Manual re-trigger code path — "Retry now" in the UI | Separate logic, likely a second task | Same path — flip status, enqueue the same task ✓ |
| Retry-arithmetic code in the task — the cost of the design | None — Celery handles it ✓ | A few lines — pick BACKOFFS[attempt - 1] |

Five wins to one — the cost is a few lines of explicit arithmetic in the failure path. The payback is retry state you own.

The cost of this design is a small amount of explicit arithmetic in the failure path of the courier task — a few lines to pick the right backoff from the BACKOFFS array, set next_retry_at, and call apply_async(countdown=…). The benefit is that retry state is yours, not the broker's. For an outbound-integration system where users want to see what's queued and where ops needs to be able to diagnose "is this delivery actually going to retry, or is it stuck?" — that ownership pays back every day.
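
Assembled, the failure branch looks roughly like this. The sketch reuses the names the post establishes (BACKOFFS, MAX_ATTEMPTS, the status strings); attempt_post is a hypothetical helper standing in for the sign-and-POST step:

from datetime import timedelta

from celery import shared_task
from django.utils import timezone

@shared_task(bind=True, max_retries=0)  # Celery-level retries disabled on purpose
def deliver_webhook_task(self, delivery_id: str):
    delivery = WebhookDelivery.objects.get(id=delivery_id)
    if delivery.is_terminal():
        return {"skipped": "already_terminal"}  # idempotency guard

    delivery.status = "in_flight"
    delivery.save(update_fields=["status"])

    ok = attempt_post(delivery)  # hypothetical: sign, POST, record the response
    delivery.attempt_count += 1

    if ok:
        delivery.status = "success"
        delivery.next_retry_at = None
        delivery.save()
        return

    if delivery.attempt_count >= WebhookDelivery.MAX_ATTEMPTS:
        delivery.status = "dead"  # terminal — manual retry only
        delivery.next_retry_at = None
        delivery.save()
        return

    countdown = BACKOFFS[delivery.attempt_count - 1]
    delivery.status = "failed"
    delivery.next_retry_at = timezone.now() + timedelta(seconds=countdown)
    delivery.save()
    # Record the failure first, then schedule: a fresh enqueue, not self.retry.
    deliver_webhook_task.apply_async(args=[str(delivery.id)], countdown=countdown)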

Two more gotchas

The webhook.test event-type bypass. The UI has a "Send a test delivery" button on every webhook subscription. Clicking it fires a webhook.test payload through the same courier path — bypassing the normal event-type subscription gate. A webhook subscribed to scraping.completed doesn't have webhook.test in its events list, but the test path delivers anyway. This is a deliberate exception to "only deliver events the user subscribed to," justified by user intent: when someone clicks "test this webhook right now," they're not asking the system to check subscription rules. Worth flagging because it surfaces in the audit trail and looks like a subscription leak until you remember the bypass exists.
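
A sketch of that bypass, where send_test_delivery is a hypothetical name; the point is that it writes a WebhookDelivery directly instead of going through fire_event's subscription filter:

def send_test_delivery(webhook):
    # Deliberately skips the events-list gate: the user asked to test
    # *this* webhook right now, so no subscription check applies.
    delivery = WebhookDelivery.objects.create(
        webhook=webhook,
        event_type="webhook.test",
        payload={"message": "test delivery"},  # illustrative payload
        status="pending",
        attempt_count=0,
    )
    deliver_webhook_task.delay(str(delivery.id))
    return delivery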

MAX_ATTEMPTS lives on the model, not the task module. It's a class constant on WebhookDelivery, not a top-of-file constant in tasks.py. The reason is migration safety: when a future schema migration needs to know "what's the max retry count?" — for backfilling old rows, for adding a check constraint, whatever — the model is the authoritative answer. tasks.py is implementation that can be rewritten without touching the schema. Putting cross-layer constants on the model keeps them where the migration framework can see them. Tiny detail; saves an annoying ordering bug down the road.

Closing

This is the fourth in a series about the same Reddit-scraping side project. The SSE post was the real-time push layer. The data_provider post was the pluggable sourcing layer. The ai_provider post was the pluggable inference layer. Webhooks is the durable outbound notification layer.

The four share a root principle: when the what of an operation needs to swap out — different scraper backends, different LLMs, different downstream receivers — wrap the swappability behind a small, typed contract and let the callers pretend the underlying choice doesn't exist. The article-level corollary, the one this post exists to make: when state needs to be visible, queryable, and owned, store it in the database — not in the broker, not in the framework, not in whatever piece of infrastructure happened to come with a retry feature. The framework's retry mechanism is convenient. Visibility is more convenient.