Durable Outbound Webhooks with HMAC and Exponential Backoff
In a Reddit scraping side project I'm building, a user kicks off a long-running scrape and doesn't want to babysit it. They've registered a webhook URL — n8n, Zapier, webhook.site, their own backend — and they expect a POST the moment the job finishes. The orchestrator code that completes the job doesn't get to know who's listening. It just calls fire_event("scraping.completed", payload) and trusts that the notification will be cryptographically signed, delivered with retries on transient failure, audited row-by-row in the database, and dead-lettered honestly if the receiver is permanently broken.
This post is the layer that makes that trust justified. The webhooks Django app: how it's structured, how retries work, why HMAC matters and why constant-time comparison matters more, and the one architectural decision the article actually exists to defend — retry state lives in the database, not in Celery.
The press-release courier
The cleanest framing: webhooks is a press-release distribution service with a registered-mail courier.
The user walks into the press office (the API) and says: "When my scraping jobs finish, send the press release to these URLs." The press office records the subscription, assigns a wax seal (the HMAC secret), and hands the user a copy. Later, when a job completes, the dispatcher (fire_event) takes that announcement, finds every subscription that's interested in this event type, and dispatches a courier (one Celery task) per subscription. The courier walks to the door, knocks (POST), waits for a signed receipt (200-299 response). If nobody answers, the courier comes back in a minute. Then five. Then fifteen. Then an hour. Then six hours. After that, the courier writes "undeliverable" in the logbook and moves on.
The wax seals are HMAC-SHA256 signatures, so the receiver can verify the announcement was sealed by this press office and not forged en route. The logbook is the WebhookDelivery table — every attempt, every response code, every retry, queryable from the web UI. In one line: durable, observable, idempotent outbound notification delivery, with audit trail and resilience baked into the data layer.
The system, in one picture
Three producer tasks — scraper.tasks.run_scraping_job_task, content_analysis.tasks.run_analysis_job_task, exports.tasks.run_export_job_task — all funnel through a single fire_event(event_type, payload) function. Inside fire_event, the dispatcher queries for every active Webhook row whose subscribed events include the firing event, writes one WebhookDelivery row per match (status pending, attempt count 0), and enqueues deliver_webhook_task.delay(delivery_id) for each. From here, every delivery is independent — one Celery task, one HTTP POST, one retry chain.
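The fan-out is small enough to sketch without the framework. Below is a framework-free approximation of that dispatcher logic, under stated assumptions: in the real app the subscriptions are Django ORM rows and `enqueue` is `deliver_webhook_task.delay`; the class names and fields here are illustrative stand-ins.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Webhook:
    url: str
    events: list          # event types this subscription listens for
    active: bool = True

@dataclass
class Delivery:
    webhook: Webhook
    event_type: str
    payload: dict
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "pending"
    attempt_count: int = 0

def fire_event(event_type, payload, subscriptions, enqueue):
    """One Delivery row and one enqueued courier task per matching,
    active subscription: fan-out is per-subscription, not per-event."""
    deliveries = []
    for hook in subscriptions:
        if hook.active and event_type in hook.events:
            d = Delivery(webhook=hook, event_type=event_type, payload=payload)
            deliveries.append(d)
            enqueue(d.id)  # deliver_webhook_task.delay(delivery_id) in the real app
    return deliveries
```

Note the shape of the loop: the row is written first, the task is enqueued second, and each matching subscription gets its own independent delivery.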
If you've read the data_provider post, the scraper.tasks.run_scraping_job_task that fires scraping.completed is the same one that actually runs the scrape through the pluggable backend factory. Webhooks is the next layer — what happens after the scrape finishes.
End to End
One event → 2 deliveries → 2 receivers (one OK, one fails, one retry scheduled)
- Producer — scraper.tasks.run_scraping_job_task
- Dispatcher — fire_event(): finds matching subscriptions, writes WebhookDelivery rows
- Celery broker — Redis / RabbitMQ; just a queue, no retry state
- Courier — deliver_webhook_task: sign payload (HMAC-SHA256), POST, capture response
- Receivers — external endpoints: n8n, Zapier, custom backends
A few things to call out on the diagram. Fan-out is per-subscription, not per-event. If two users have both subscribed to scraping.completed, one event creates two WebhookDelivery rows and two independent courier tasks. A failure for one user's webhook never affects the other user's delivery. The producer is shielded from receiver failure by a try/except wrapper at the dispatcher callsite — the scrape job has already finished by the time fire_event runs, and we never want webhook trouble to cascade back into job state.
The broker is just a queue. Celery owns the in-flight tasks, but it owns nothing about retry scheduling — the next attempt is always scheduled via apply_async(countdown=…) from inside the courier task, after the failure is recorded in the database. More on this in §7.
The retry policy
Five attempts. Five gaps: 1 minute, 5 minutes, 15 minutes, 1 hour, 6 hours. If all five fail, the delivery is marked dead. Total wall-clock time from first attempt to DEAD is roughly 7 hours 21 minutes.
The gap sizes aren't arbitrary. They're picked to absorb three different failure modes:
- Transient blips (network hiccup, momentary receiver overload) — most of these recover within a minute or two. Attempts 1 and 2 catch them.
- Short outages (receiver restart, deploy in progress) — usually resolve in 5–20 minutes. Attempts 3 and 4 catch them.
- Longer outages (receiver service down for the morning) — attempt 5 at the 6-hour mark gives most operational incidents a chance to resolve before we give up.
The geometric growth is deliberate: linear retries (try every 5 minutes for an hour) would either hammer a transient outage or give up too fast on a longer one. Exponential gaps spread the retries across time scales — letting the same policy serve all three failure modes without ops intervention.
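The schedule itself is just an array lookup. A minimal sketch of the arithmetic, using the `BACKOFFS` name the post uses later (the exact constant layout in the real code may differ):

```python
from datetime import datetime, timedelta, timezone

# The five gaps from the policy above: 1m, 5m, 15m, 1h, 6h (in seconds).
BACKOFFS = [60, 5 * 60, 15 * 60, 60 * 60, 6 * 60 * 60]

def next_retry(attempt_count, now=None):
    """Gap before the next courier run after failed attempt N (1-based),
    mirroring countdown=BACKOFFS[attempt_count - 1]."""
    now = now or datetime.now(timezone.utc)
    gap = BACKOFFS[attempt_count - 1]
    return gap, now + timedelta(seconds=gap)

# All five gaps together: 441 minutes, the ~7h21m quoted above.
assert sum(BACKOFFS) == 441 * 60
```

The geometric spacing falls straight out of the array: each entry covers roughly one order of magnitude more outage than the one before it.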
Retry Schedule
Five attempts, exponential gaps, ~7h21m of elapsed time from first attempt to DEAD
The long, empty gap between attempt 5 and DEAD is six hours. That’s the window where a receiver outage can still self-recover.
After attempt 5 fails, the delivery row's status is set to dead and the courier walks away. The row stays in the database forever (or until manually deleted) — dead is a terminal state, not a cleanup signal. A user clicking "Retry now" in the UI can reopen any failed or dead delivery (we flip status back to pending and enqueue a fresh task), which is the recovery path for genuinely-permanent receiver problems that get fixed later.
One detail worth noting: retries are scheduled with deliver_webhook_task.apply_async(args=[str(delivery.id)], countdown=seconds). The countdown argument tells Celery to defer execution by that many seconds. The retry isn't a Celery-internal feature — it's a fresh task enqueue, scheduled from the courier task's own failure path. We'll spend §7 defending that choice.
The delivery state machine
Every WebhookDelivery row carries a status field that's the single source of truth for "what happened to this notification." Five states: pending, in_flight, success, failed, dead. The transitions:
- pending → in_flight: a courier task picks the row up.
- in_flight → success: the receiver returned a 2xx response.
- in_flight → failed: the receiver returned a non-2xx (or no response within timeout), and attempt_count < MAX_ATTEMPTS. The courier schedules a retry; the row's next_retry_at timestamp is set so the UI can show "Next retry at 14:03".
- in_flight → dead: same as failed, but attempt_count >= MAX_ATTEMPTS. No retry is scheduled; the row is terminal.
- failed → in_flight (deferred): the scheduled retry fires.
- failed → pending: the user clicks "Retry now" in the UI. We flip the status, clear next_retry_at, and delay() a fresh task.
- dead → pending: same manual-retry edge, just from a different terminal state.
Delivery States
One delivery’s lifecycle — pending → in_flight → failed → retry → success
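The transitions can be sketched as a tiny framework-free model. One assumption to flag loudly: per the transition list above, only success and dead are treated as terminal here, since failed rows must still admit the scheduled retry. The class is a stand-in for the WebhookDelivery Django model, not its real definition.

```python
STATUSES = ("pending", "in_flight", "success", "failed", "dead")
TERMINAL = {"success", "dead"}  # assumption: failed still admits retries

class Delivery:
    """Stand-in for the WebhookDelivery model's status machinery."""
    MAX_ATTEMPTS = 5

    def __init__(self):
        self.status = "pending"
        self.attempt_count = 0

    def is_terminal(self) -> bool:
        return self.status in TERMINAL

    def record_failure(self):
        """in_flight -> failed while attempts remain, else -> dead."""
        self.attempt_count += 1
        self.status = ("failed" if self.attempt_count < self.MAX_ATTEMPTS
                       else "dead")
```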
The state machine carries one critical guarantee: success and dead are terminal, and delivering against a terminal row is an idempotent no-op. The courier task's first action, before doing anything else, is:
```python
if delivery.is_terminal():
    return {
        "task": "deliver",
        "delivery_id": str(delivery.id),
        "status": delivery.status,
        "skipped": "already_terminal",
    }
```
If Celery somehow delivers the same task twice (rare in steady state, but it happens — broker hiccups, worker restarts, manual replays from tooling), the second invocation no-ops on the row's terminal state. The database is the source of truth for "is this attempt still owed?" — the second courier checks, sees the answer is no, and walks away clean.
HMAC signing and verification
Every outbound POST carries three custom headers on top of the standard Content-Type:
- X-OrientedOS-Signature: the HMAC-SHA256 hex digest of the raw request body, prefixed with sha256=. The prefix is important — it versions the algorithm so we can ship sha256_v2= or blake3= later without breaking existing receivers.
- X-OrientedOS-Delivery-Id: the WebhookDelivery.id. Receivers can use this for their own dedupe — if they see a delivery ID they've processed before (because the courier retried after the receiver responded slowly), they can return 200 without re-processing.
- X-OrientedOS-Event: the event type string (e.g., scraping.completed), so receivers can route on it without parsing the body.
Sender side, the signing is six lines of Python:
```python
import hashlib
import hmac

def sign_payload(secret: str, body: bytes) -> str:
    digest = hmac.new(
        secret.encode("utf-8"),
        body,
        hashlib.sha256,
    ).hexdigest()
    return f"sha256={digest}"
```
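Assembled into a full outbound request, the signing step looks like this. `build_request` is a hypothetical helper name; the real courier task presumably assembles the same pieces before handing the body to an HTTP client. The key property it demonstrates: serialize once, then sign the exact bytes that will be POSTed.

```python
import hashlib
import hmac
import json

def build_request(secret: str, delivery_id: str, event_type: str,
                  payload: dict):
    """Serialize once, then sign the exact bytes that will cross the wire."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    signature = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256,
    ).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "User-Agent": "OrientedOS-Webhooks/1.0",
        "X-OrientedOS-Event": event_type,
        "X-OrientedOS-Delivery-Id": delivery_id,
        "X-OrientedOS-Signature": signature,
    }
    return body, headers
```

The courier then POSTs `body` with `headers` and treats any 2xx as success; signing after serialization guarantees the receiver verifies the same bytes that were actually sent.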
HMAC Wire Format
How a signed webhook crosses the wire and gets verified on the other side
Sender side:

1 · the payload

```json
{
  "event": "scraping.completed",
  "job_id": "8e2a...",
  "post_count": 42
}
```

2 · hmac

```python
digest = hmac.new(
    webhook.secret.encode(),
    body,
    hashlib.sha256,
).hexdigest()
```

3 · header

```http
POST /your/webhook HTTP/1.1
Host: your-receiver.example.com
Content-Type: application/json
User-Agent: OrientedOS-Webhooks/1.0
X-OrientedOS-Event: scraping.completed
X-OrientedOS-Delivery-Id: 8e2a-...-9f01
X-OrientedOS-Signature: sha256=4f3d...

{ "event": "...", "job_id": "...", ... }
```

- X-OrientedOS-Event · route on it without parsing the body
- X-OrientedOS-Delivery-Id · for receiver-side dedupe on retries
- X-OrientedOS-Signature · the HMAC, algorithm-prefixed

Receiver side:

1 · extract

```python
raw_body = request.body  # bytes, untouched
header = request.headers.get("X-OrientedOS-Signature", "")
```

2 · re-compute

```python
expected = "sha256=" + hmac.new(
    secret.encode(),
    raw_body,
    hashlib.sha256,
).hexdigest()
```

3 · compare (constant-time)

```python
if hmac.compare_digest(expected, header):
    accept()
else:
    reject(401)
```

Don't use ==. Python's string equality short-circuits on the first differing byte. An attacker measuring response time can guess the signature one byte at a time. compare_digest reads every byte regardless — that's the whole point.
Receiver side, the verification is the part that catches people:
```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, header: str) -> bool:
    if not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, header)
```
The hmac.compare_digest(...) call is non-negotiable. The naive version (expected == header) is a timing oracle: Python's == on strings short-circuits on the first differing byte, so an attacker who can measure response time can guess the signature one byte at a time. compare_digest does constant-time comparison — it always reads every byte regardless of where the mismatch is — and it's the difference between "signature verification" and "signature theatre." If you're writing the receiver side of anyone's webhook system and the reference implementation says ==, that's a bug.
One more receiver-side detail: hash the raw request body bytes, not a parsed-and-re-serialized version. JSON round-tripping can subtly reorder keys or change whitespace, which breaks the digest. Grab the raw body from your framework (request.body in Django, await request.body() in FastAPI, req.rawBody after body-parser in Express) and pass it untouched into the HMAC.
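Put together, a framework-agnostic receiver handler might look like the sketch below. `handle_webhook` and the in-memory `seen_ids` set are illustrative (in production the dedupe record would live in your datastore); the ordering is the point: verify on the raw bytes first, dedupe second, parse last.

```python
import hashlib
import hmac
import json

def handle_webhook(raw_body: bytes, headers: dict, secret: str,
                   seen_ids: set) -> int:
    """Returns an HTTP status code: verify, dedupe, then parse."""
    provided = headers.get("X-OrientedOS-Signature", "")
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), raw_body, hashlib.sha256,
    ).hexdigest()
    if not hmac.compare_digest(expected, provided):
        return 401                   # reject before touching the body
    delivery_id = headers.get("X-OrientedOS-Delivery-Id", "")
    if delivery_id in seen_ids:
        return 200                   # retried delivery: ack, don't re-process
    seen_ids.add(delivery_id)
    event = json.loads(raw_body)     # parse only after verification
    # route on headers["X-OrientedOS-Event"] and do the actual work here
    return 200
```

Returning 200 on a duplicate delivery ID is what makes the sender's retries safe for the receiver: the courier gets its receipt, and no work runs twice.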
Why the retry state lives in the database, not in Celery
This is the article's argument. Celery has a built-in retry mechanism: self.retry(countdown=seconds, max_retries=N). It's the natural first move for anyone wiring up Celery tasks. We deliberately don't use it. Here's why.
The courier task is configured max_retries=0. We never call self.retry. When a delivery fails, we record the failure in the database (incrementing attempt_count, setting next_retry_at, updating status to failed), and then we schedule the next attempt via:
```python
deliver_webhook_task.apply_async(
    args=[str(delivery.id)],
    countdown=BACKOFFS[delivery.attempt_count - 1],
)
```
A fresh enqueue. From outside the task. Backed by the row's own state. Three things become possible that aren't possible with self.retry:
1. The retry queue is queryable. "Which webhooks are currently pending retry?" is WebhookDelivery.objects.filter(status='failed', next_retry_at__isnull=False).order_by('next_retry_at'). One ORM line. With self.retry, the next attempt lives in the Celery broker (Redis or RabbitMQ), and answering that question means broker introspection from the ops layer — not something the web UI can do.

2. Idempotency is enforced at task entry. The is_terminal() guard at the top of the courier function uses the database as the source of truth. If Celery delivers the same task twice — broker hiccup, worker restart, manual replay — the second invocation sees status='success' and returns immediately. With self.retry, the task is re-queued without any database state to consult, so double-delivery means two concurrent attempts.

3. Manual re-trigger uses the same code path as automatic retry. The UI's "Retry now" button does exactly what the automatic retry does: flip status back to pending, clear next_retry_at, and enqueue deliver_webhook_task.delay(delivery_id). No separate retry endpoint, no parallel scheduling logic. One task handles every path into delivery.
Retry State Ownership
self.retry (Celery owns it) vs apply_async + DB state (you own it)
| Concern | self.retry | apply_async + DB state |
|---|---|---|
| Where retry state lives (source of truth for "is this delivery still owed?") | Celery broker (Redis / RabbitMQ) | WebhookDelivery row in PostgreSQL ✓ |
| Queue observability ("what's pending retry right now?") | Broker introspection — ops-only | WebhookDelivery.objects.filter(status="failed") — one ORM line ✓ |
| Surfacing the retry queue in the UI | Hard (web tier would need broker access) | Trivial (same ORM, same queries) ✓ |
| Idempotency on double-delivery | Implicit — no dedupe; two concurrent attempts | Explicit — is_terminal() guard on the row ✓ |
| Manual re-trigger ("Retry now" in the UI) | Separate logic (likely a second task) | Same path — flip status, enqueue same task ✓ |
| Retry-arithmetic code in the task | None (Celery handles it) ✓ | A few lines (pick BACKOFFS[attempt - 1]) |
The cost of this design is a small amount of explicit arithmetic in the failure path of the courier task — a few lines to pick the right backoff from the BACKOFFS array, set next_retry_at, and call apply_async(countdown=…). The benefit is that retry state is yours, not the broker's. For an outbound-integration system where users want to see what's queued and where ops needs to be able to diagnose "is this delivery actually going to retry, or is it stuck?" — that ownership pays back every day.
Two more gotchas
The webhook.test event-type bypass. The UI has a "Send a test delivery" button on every webhook subscription. Clicking it fires a webhook.test payload through the same courier path — bypassing the normal event-type subscription gate. A webhook subscribed to scraping.completed doesn't have webhook.test in its events list, but the test path delivers anyway. This is a deliberate exception to "only deliver events the user subscribed to," justified by user intent: when someone clicks "test this webhook right now," they're not asking the system to check subscription rules. Worth flagging because it surfaces in the audit trail and looks like a subscription leak until you remember the bypass exists.
MAX_ATTEMPTS lives on the model, not the task module. It's a class constant on WebhookDelivery, not a top-of-file constant in tasks.py. The reason is migration safety: when a future schema migration needs to know "what's the max retry count?" — for backfilling old rows, for adding a check constraint, whatever — the model is the authoritative answer. tasks.py is implementation that can be rewritten without touching the schema. Putting cross-layer constants on the model keeps them where the migration framework can see them. Tiny detail; saves an annoying ordering bug down the road.
Closing
This is the fourth in a series about the same Reddit-scraping side project. The SSE post was the real-time push layer. The data_provider post was the pluggable sourcing layer. The ai_provider post was the pluggable inference layer. Webhooks is the durable outbound notification layer.
The four share a root principle: when the what of an operation needs to swap out — different scraper backends, different LLMs, different downstream receivers — wrap the swappability behind a small, typed contract and let the callers pretend the underlying choice doesn't exist. The article-level corollary, the one this post exists to make: when state needs to be visible, queryable, and owned, store it in the database — not in the broker, not in the framework, not in whatever piece of infrastructure happened to come with a retry feature. The framework's retry mechanism is convenient. Visibility is more convenient.