A Guide to Async Patterns: Django, Celery, asyncio, React
The codebase for a Reddit scraping side project I'm building has a dozen places where someone might say "this thing runs in the background." Or "this thing streams live." Or "the frontend keeps refreshing." Read three of them in a row and they all look like the same kind of thing. They're not. They're solving different problems with different machinery, and the machinery has different rules.
I've spent a non-trivial fraction of my debugging time on async problems where the bug wasn't in the code itself: it was in which kind of async I thought was in play. Fix that confusion at the top of the call stack and the actual bug is usually obvious. This post is the map I wish I'd had when I started: seven concrete async patterns I use across this codebase, plus four bridges that connect the backend async work to the frontend UI. Most "async" in this codebase is one of those seven; mixing them up is where the bugs live.
The kitchen analogy
The codebase has four time-domains, and they map cleanly onto the areas of a restaurant kitchen.
- The hot line — regular HTTP requests. A customer orders, a plate arrives in seconds. Synchronous, blocking, fast. Most of the API is this.
- The prep kitchen — Celery tasks. Work done out-of-band so the dining room stays responsive. The customer doesn't wait at the table for the salad dressing to be whisked.
- The teppanyaki bar — Server-Sent Events. The customer sits at the bar and watches their dish being made in real time, narrated as it goes. The kitchen and the customer share a synchronised view of progress.
- The phone tree — webhooks. The kitchen calls back anyone who left a number, and retries if no one picks up.
Different time-domains, different rules. A waiter (synchronous handler) who tries to do a 20-minute braise on the hot line blocks every other customer in the room. A prep cook (Celery task) who tries to narrate live to the teppanyaki bar (SSE) has no way to send chatter back over the broker. The kitchen analogy is silly, but it sticks — once a pattern is anchored to its "kitchen area" you mostly stop using the wrong machinery for the wrong job.
The seven patterns
I count seven concrete async patterns in this codebase. Some of them stack on top of each other (SSE is built from async views + async generators, sitting on top of Django's ASGI runtime). One of them cross-cuts almost everything (cancellation). The rest are mostly independent — different problems, different solutions.
Figure: The seven patterns. Two entry points, two composition layers, two specialised patterns, and one that cross-cuts everything.
A few things to call out on the map. Patterns 1 and 2 are entry points: Celery tasks and async/await views are the two ways "this thing might not return right away" enters the system. Patterns 3 and 4 compose: async generators are the engine that drives SSE. Patterns 6 and 7 are specialised: webhook retry chains and adaptive frontend polling are each their own thing, built on top of the basics. And Pattern 5 cross-cuts: cooperative cancellation appears wherever long-running work happens, which means it appears almost everywhere.
The rest of this post walks each of the seven in turn. None of them is doing anything exotic on its own. The interesting part is how they interact — which is what the gotchas section near the end is about.
Pattern 1 — Celery tasks
The oldest async pattern in any Django codebase. A task is a Python function decorated with @shared_task; a worker process (separate from the web tier) picks it up off a Redis broker and runs it; the web request that enqueued it returned in milliseconds. The codebase has seven task files across six apps (scraper, discovery, projects, content_analysis, webhooks, exports).
The decorator carries the usual knobs: bind=True (the task gets self), max_retries (Celery-managed retry, usually 0 in this codebase because we own retry state), soft_time_limit vs time_limit (graceful warning vs hard kill). What surprises people new to Celery is that @shared_task is a capability flag, not a requirement to use .delay(). You can call a decorated task as a plain Python function inline, and it just runs in the calling process. The bulk orchestrator does exactly this:
```python
from typing import Any

from celery import shared_task

# ScrapingJob and scrape_subreddit_posts_task are project code (imports omitted).


@shared_task(bind=True, max_retries=0, soft_time_limit=1800, time_limit=2100)
def run_scraping_job_task(self, scraping_job_id: str) -> dict[str, Any]:
    job = ScrapingJob.objects.select_related("project").get(id=scraping_job_id)
    job.mark_running()
    for subreddit in job.config.subreddits:
        try:
            # Inline call, NOT scrape_subreddit_posts_task.delay(...): it runs
            # here, in this worker, so its exceptions land in the except below.
            scrape_subreddit_posts_task(job.id, subreddit)
        except Exception as exc:
            job.errors.append({"subreddit": subreddit, "error": str(exc)})
    if job.errors:
        job.mark_partial()
    else:
        job.mark_completed()
    return {"job_id": str(job.id), "status": job.status}
```
If scrape_subreddit_posts_task.delay(...) were used instead, each subreddit would be enqueued as its own task and could land on any worker: different processes, no shared transaction, and the try/except above would only ever catch enqueueing errors, never a failed scrape. By calling inline, the orchestrator keeps everything on one worker, in one transaction-able context, with deterministic ordering. The orchestrator decides when to fan out and when not to; the @shared_task decorator just makes the option available. For a richer Celery example (a task that re-schedules itself via apply_async(countdown=…) to implement a retry chain), see Durable Outbound Webhooks with HMAC and Exponential Backoff.
Pattern 2 — Async/await + Django ORM
Python's asyncio event loop, the async ORM interface added in Django 4.1 (aget, aupdate_or_create, aget_or_create, etc.), and the sync_to_async bridges for everything in the Python ecosystem that doesn't yet have an async equivalent. Two locations in the codebase use this: the SSE streaming views in scraper/streaming_views.py, and the async scraper adapters inside the data_provider package.
The plain shape of an async view:
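```python
from django.http import JsonResponse, StreamingHttpResponse


async def stream_job(request, job_id):
    if (denied := await _authenticate_or_401(request)) is not None:
        return denied
    try:
        # select_related: a lazy job.project access below would be a sync query
        job = await ScrapingJob.objects.select_related("project").aget(id=job_id)
    except ScrapingJob.DoesNotExist:
        return JsonResponse({"detail": "Not found"}, status=404)
    if job.project.user_id != request.user.id:
        return JsonResponse({"detail": "Not found"}, status=404)
    return StreamingHttpResponse(_event_stream(job), content_type="text/event-stream")
```
Two non-obvious things are going on. First, this is a plain Django async view, not a DRF @api_view-decorated one. DRF 3.16 still doesn't support async dispatch: its view machinery is synchronous, so declaring the handler async def doesn't get you an async view. So in code that touches the streaming layer, I authenticate manually, handing back a 401 response rather than raising (Django turns a SuspiciousOperation into a 400, not a 401):
```python
from asgiref.sync import sync_to_async
from django.http import JsonResponse
from rest_framework.exceptions import AuthenticationFailed
from rest_framework_simplejwt.authentication import JWTAuthentication  # assumed JWT backend

async def _authenticate_or_401(request):
    try:
        auth_result = await sync_to_async(JWTAuthentication().authenticate)(request)
    except AuthenticationFailed:  # malformed or expired token
        auth_result = None
    if auth_result is None:  # no valid credentials: hand back a 401
        return JsonResponse({"detail": "Unauthenticated"}, status=401)
    request.user, _ = auth_result
    return None
```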
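JWTAuthentication().authenticate does a database lookup (it loads the user row). Called directly inside an async view, that sync query would be refused by Django's async-safety check; sync code Django doesn't guard would simply block the event loop. The sync_to_async wrapper runs the call in a worker thread (by default a single shared thread, since thread_sensitive=True), so the loop keeps spinning. Mildly annoying boilerplate that goes away the day DRF lands async support.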
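Second, the ORM calls are aget, not get. The a* prefix marks Django's async query methods. Calling plain get directly inside an async view doesn't quietly stall the loop: Django's async-safety check raises SynchronousOnlyOperation, and so does a lazy relation access like job.project when the relation wasn't fetched up front (hence the select_related). The quiet failure mode is blocking code Django can't see, such as a synchronous HTTP client or time.sleep, which holds up every other request on the loop with no warning at all.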
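The loop-blocking failure mode is easy to reproduce without Django at all. In this minimal sketch, blocking_lookup and heartbeat are hypothetical stand-ins (a synchronous call, and a second in-flight request sharing the loop), and asyncio.to_thread plays the role that sync_to_async plays in a Django view:

```python
import asyncio
import time


def blocking_lookup() -> str:
    # Hypothetical synchronous call (a sync HTTP client, a sync DB driver).
    time.sleep(0.2)
    return "user"


async def heartbeat(ticks: list[float], start: float) -> None:
    # Hypothetical second request sharing the loop: records when it gets to run.
    for _ in range(4):
        ticks.append(time.perf_counter() - start)
        await asyncio.sleep(0.05)


async def handler(offload: bool) -> list[float]:
    ticks: list[float] = []
    start = time.perf_counter()
    beat = asyncio.create_task(heartbeat(ticks, start))
    await asyncio.sleep(0)  # let the heartbeat take its first tick
    if offload:
        # Runs in a worker thread, like sync_to_async in a Django view.
        await asyncio.to_thread(blocking_lookup)
    else:
        # Runs on the loop's own thread: nothing else can run for 0.2 s.
        blocking_lookup()
    await beat
    return ticks


def largest_gap(ticks: list[float]) -> float:
    return max(b - a for a, b in zip(ticks, ticks[1:]))


print(f"blocked:   {largest_gap(asyncio.run(handler(offload=False))):.2f}s")
print(f"offloaded: {largest_gap(asyncio.run(handler(offload=True))):.2f}s")
```

With the call made inline, the heartbeat stalls for the full 0.2 s; offloaded, it keeps ticking roughly every 50 ms.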
Pattern 3 — Async generators
async def functions that contain yield. They look like ordinary generators except the consumer drives them with async for. The codebase has three implementations of async def scrape_subreddit_posts_stream(...) (native JSON, Playwright, nodriver CDP), and one consumer that drives them:
```python
from asgiref.sync import sync_to_async


async def _event_stream(job):
    # scraper, subreddit and limit are set up earlier (omitted here).
    async for dto in scraper.scrape_subreddit_posts_stream(subreddit, limit):
        # Cancel checkpoint: re-read just the flag column from the job row
        await sync_to_async(job.refresh_from_db)(fields=["cancel_requested"])
        if job.cancel_requested:
            yield b"event: cancelled\ndata: {}\n\n"
            return
        # Persist and emit
        await ScrapedPost.objects.aupdate_or_create(...)
        yield _fmt_event("post", dto.model_dump())
```
The bit that's worth dwelling on: backpressure is free. The consumer's loop body decides the pace. The producer sits suspended at its yield and doesn't resume until the consumer asks for the next item, so if the body does a DB write that takes 200ms per post, the producer waits those 200ms before producing the next one. There is no queue between them, no manual signalling, no asyncio.Queue with a maxsize. The generator protocol itself is the signal: each __anext__() (what async for awaits) resumes the producer up to its next yield, the body runs, and the producer waits.
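A minimal sketch of that lock-step behaviour, with no scraper or database involved: producer is a hypothetical stand-in for one of the scrape_subreddit_posts_stream implementations, and the asyncio.sleep in the loop body stands in for the DB write. The log shows the producer never runs ahead of the consumer:

```python
import asyncio


async def run_pipeline(n: int) -> list[str]:
    log: list[str] = []

    async def producer():
        # Hypothetical scraper stream: nothing is produced ahead of demand.
        for i in range(n):
            log.append(f"produce {i}")
            yield i

    async for item in producer():
        await asyncio.sleep(0.01)  # slow consumer body, e.g. a DB write
        log.append(f"persist {item}")
    return log


print(asyncio.run(run_pipeline(3)))
# ['produce 0', 'persist 0', 'produce 1', 'persist 1', 'produce 2', 'persist 2']
```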
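The second free thing is cleanup, with one caveat. When the consumer's async for exits early (because the client disconnected, because cancellation fired, because an exception bubbled up), the producer is left suspended at its yield. Closing it with aclose(), or wrapping it in contextlib.aclosing, throws GeneratorExit in at that yield and the producer's try/finally block runs. If nothing closes it explicitly, asyncio still calls aclose() once the generator is garbage-collected, but on its own schedule rather than at the moment the loop exits. In the Playwright scraper, that finally means await browser.close(). In the native JSON scraper, it's nothing. Either way, once the generator is closed, no resources leak.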
These generators are the engine behind the data_provider package — for the factory/registry layer that picks which scraper backend produces the events at runtime, see One Interface, Five Scraping Backends.
Pattern 4 — Server-Sent Events (SSE)
A transport layer, not an execution model. SSE rides on top of a plain HTTP response that the server keeps open and writes to incrementally. Django's StreamingHttpResponse wraps an iterable (in our case, an async generator) and flushes each chunk as it's yielded. The wire format is a tiny, strict text grammar:
```text
event: post
data: {"id": "t3_abc", "title": "...", "author": "..."}

:keepalive

event: complete
data: {"posts_scraped": 30}

```
Three rules: event: names the message type, each data: line carries one line of UTF-8 text (usually JSON; several data: lines in one message are joined with newlines), and a blank line ends the message. A line starting with : is a comment. Browsers ignore it, but proxies still see traffic, which is what the :keepalive lines are for.
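The stream generator above calls a _fmt_event helper whose body isn't shown. Here's a minimal sketch of what a helper following those rules could look like; fmt_event and KEEPALIVE are hypothetical names, not the codebase's own:

```python
import json
from typing import Any


def fmt_event(event: str, data: Any) -> bytes:
    # json.dumps with no indent emits a single line (newlines inside strings
    # are escaped as \n), so one data: line is always enough here.
    payload = json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n".encode("utf-8")


# A comment frame: ignored by the browser, but it keeps proxies seeing traffic.
KEEPALIVE = b":keepalive\n\n"

print(fmt_event("complete", {"posts_scraped": 30}))
# b'event: complete\ndata: {"posts_scraped": 30}\n\n'
```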
This is the layer the SSE article lives in. For the full breakdown of the wire format, the proxy_buffering off trap that makes SSE look broken in production until you find it, and why I use fetch + ReadableStream instead of the browser's built-in EventSource, see Server-Sent Events, in Production. For this article, the takeaway is that SSE is just Pattern 2 (async views) producing Pattern 3 (async generators) with a wire format that the browser knows how to chop into frames. The "real-time" feeling is the natural consequence of incremental flushes, not a separate piece of machinery.
Pattern 5 — Cooperative cancellation
The cross-cutting one. Cancellation in this codebase is a boolean column on the job row — cancel_requested = BooleanField(default=False) — and three observation points that read it. The bulk orchestrator (Celery, Pattern 1) checks the flag between subreddits. The SSE streaming view (Pattern 4 over Pattern 3) checks at the top of every generator iteration. The exports task checks between sheets.
Figure: Two-sided cancel. One click fires two halves in parallel: the server stops working, the client stops reading.
Cancellation is safe but delayed. If the producer is mid-network-call to Reddit when the cancel arrives, the cancel doesn't fire until the network call returns and control flow reaches the next checkpoint. You can't preempt arbitrary code — you can only check the flag at known safe points. Pick those points carefully (between subreddits, not in the middle of one) and the worst-case latency is bounded by the longest operation between checkpoints. In practice that's a few seconds, which is fine for a user pressing Cancel.
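The checkpoint idea reduces to a loop that reads the flag only between units of work. A minimal sketch: run_with_checkpoints is a hypothetical helper, cancel_requested stands in for re-reading the job row's cancel_requested column, and process is one uninterruptible unit (one subreddit, say):

```python
from typing import Callable, Iterable


def run_with_checkpoints(
    items: Iterable[str],
    process: Callable[[str], None],
    cancel_requested: Callable[[], bool],
) -> tuple[str, list[str]]:
    """Run each item to completion, checking the cancel flag only between items."""
    done: list[str] = []
    for item in items:
        if cancel_requested():  # safe point: nothing is half-processed here
            return "cancelled", done
        process(item)  # cannot be interrupted once started
        done.append(item)
    return "completed", done


flag = {"cancel": False}


def scrape(subreddit: str) -> None:
    if subreddit == "python":
        flag["cancel"] = True  # the user clicks Cancel while this one is running


print(run_with_checkpoints(["django", "python", "rust"], scrape, lambda: flag["cancel"]))
# ('cancelled', ['django', 'python'])
```

Note that "python" still finishes even though the cancel arrived mid-item: the latency is bounded by one unit of work, exactly as described above.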
The other half of cancellation lives on the client. The frontend hook owns an AbortController; when the user clicks Cancel, it does two things at once: it sends a POST /cancel to flip the server flag, and it calls controller.abort() to kill the local fetch. Both halves are necessary. The POST /cancel ensures the server-side work stops (and that any other consumer of the same job sees the cancelled state). The controller.abort() ensures the SSE consumer ignores any events that were already in flight when the cancel decision was made. Forget either half and you end up with weird state: either the server keeps working invisibly, or the client renders ghost events from a stream the user already "stopped."
Pattern 6 — Webhook retry chain
A Celery task that re-schedules itself. When a webhook delivery fails, the courier task computes the next backoff delay (1m, 5m, 15m, 1h, then 6h, a roughly exponential schedule) and calls deliver_webhook_task.apply_async(args=[delivery_id], countdown=seconds). When the last scheduled retry fails, the row is marked dead and the chain stops. State (attempt_count, next_retry_at, status) lives on the WebhookDelivery row in the database.
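A minimal sketch of one link in that chain, with the broker and database swapped out so the control flow is visible on its own. Delivery, deliver_once, retry_in_seconds, and the "delivered" status are hypothetical; schedule stands in for deliver_webhook_task.apply_async(args=[delivery_id], countdown=seconds). It assumes the first attempt is immediate and each of the five delays precedes one retry, which makes the schedule add up to 26,460 seconds, a little over seven hours. It also includes an is_terminal() entry guard in the spirit of the one described in the next paragraph:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The schedule from the text, in seconds: 1m, 5m, 15m, 1h, 6h.
BACKOFF_SECONDS = [60, 5 * 60, 15 * 60, 60 * 60, 6 * 60 * 60]


@dataclass
class Delivery:
    """Hypothetical in-memory stand-in for a WebhookDelivery row."""

    id: str
    status: str = "pending"
    attempt_count: int = 0
    retry_in_seconds: Optional[int] = None

    def is_terminal(self) -> bool:
        return self.status in {"delivered", "dead"}


def deliver_once(
    delivery: Delivery,
    send: Callable[[Delivery], bool],
    schedule: Callable[[str, int], None],
) -> None:
    """One link of the chain. `schedule` stands in for apply_async(countdown=...)."""
    if delivery.is_terminal():
        return  # duplicate or late message: nothing to do
    delivery.attempt_count += 1
    if send(delivery):
        delivery.status = "delivered"
        delivery.retry_in_seconds = None
        return
    retries_used = delivery.attempt_count - 1
    if retries_used >= len(BACKOFF_SECONDS):
        delivery.status = "dead"  # schedule exhausted: stop the chain
        delivery.retry_in_seconds = None
        return
    delivery.status = "failed"
    delivery.retry_in_seconds = BACKOFF_SECONDS[retries_used]
    schedule(delivery.id, delivery.retry_in_seconds)  # the task re-enqueues itself


# Drive the chain with a receiver that always fails.
d = Delivery("d1")
delays: list[int] = []
deliver_once(d, lambda _: False, lambda _id, s: delays.append(s))
while d.status == "failed":
    deliver_once(d, lambda _: False, lambda _id, s: delays.append(s))
print(d.status, d.attempt_count, delays, sum(delays))
# dead 6 [60, 300, 900, 3600, 21600] 26460
```

Because the state lives on the row rather than in the broker, a duplicate message arriving after the row is dead (or delivered) hits the guard and does nothing.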
The reason I don't use Celery's built-in self.retry(...) is the central argument of the webhooks article — owning retry state in the database (rather than the broker) buys you queue observability (the next attempt is visible to the web UI as WebhookDelivery.objects.filter(status='failed', next_retry_at__isnull=False)) and idempotency (the is_terminal() guard at task entry no-ops on accidental double-delivery). For the full architectural argument, the visualization of the 7-hour retry timeline, and the HMAC signing details, see Durable Outbound Webhooks with HMAC and Exponential Backoff.
For this article, the takeaway is that this is also Pattern 1 (a Celery task), but the task's failure path schedules itself rather than calling self.retry. Same broker, same workers, same decorator — different control flow for "what happens when this fails."
Pattern 7 — Adaptive frontend polling
The seventh pattern lives in the React frontend, not the Python backend. A useEffect hook schedules a setInterval whose cadence depends on the most recently fetched state. While pending, poll every 3 seconds. While running, poll every 5 seconds. When the state hits terminal (one of complete, failed, cancelled), the polling stops.
```tsx
useEffect(() => {
  // Clear whatever interval the previous state scheduled.
  if (intervalRef.current) clearInterval(intervalRef.current)
  if (!jobs) return

  const hasRunning = jobs.some((j) => j.status === 'running')
  const hasPending = jobs.some((j) => j.status === 'pending')
  if (!hasRunning && !hasPending) return // all terminal: no new interval

  const cadence = hasRunning ? 5000 : 3000
  intervalRef.current = setInterval(() => { refetch() }, cadence)
  return () => { if (intervalRef.current) clearInterval(intervalRef.current) }
}, [jobs, refetch]) // re-runs on every fetch, so cadence tracks the latest state
```
Figure: Polling cadence. State drives cadence; terminal means no polling at all.
The insight: polling is state-driven, not time-driven. A polling interval picked once and held forever wastes traffic when nothing is happening and feels slow exactly when the user wants the most responsiveness — at state transitions. By rescheduling on every fetch, the cadence is always relative to the latest known state. The pending→running transition is the moment a user cares most; the 3-second cadence catches it fast. Once running, the work is going to take a while regardless, so the 5-second cadence is fine. When terminal, polling stops completely — no traffic, no battery, no proxy load.
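The same state-driven rule works outside React, for instance in a script or test that waits on a single job. A minimal sketch: next_delay and poll_until_terminal are hypothetical, fetch_status is any coroutine returning the job's current status string, and the delays mirror the hook (3 s while pending, 5 s while running, stop on anything else):

```python
import asyncio
from typing import Awaitable, Callable, Optional


def next_delay(status: str) -> Optional[float]:
    # Same rule as the hook: 5 s while running, 3 s while pending, else stop.
    if status == "running":
        return 5.0
    if status == "pending":
        return 3.0
    return None


async def poll_until_terminal(
    fetch_status: Callable[[], Awaitable[str]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> list[str]:
    """Fetch, pick the next delay from what was just fetched, repeat until it's None."""
    seen: list[str] = []
    while True:
        status = await fetch_status()
        seen.append(status)
        delay = next_delay(status)
        if delay is None:
            return seen
        await sleep(delay)


async def main() -> tuple[list[str], list[float]]:
    statuses = iter(["pending", "pending", "running", "complete"])
    slept: list[float] = []

    async def fake_fetch() -> str:
        return next(statuses)

    async def fake_sleep(seconds: float) -> None:
        slept.append(seconds)  # record instead of actually waiting

    return await poll_until_terminal(fake_fetch, fake_sleep), slept


seen, slept = asyncio.run(main())
print(seen, slept)  # ['pending', 'pending', 'running', 'complete'] [3.0, 3.0, 5.0]
```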
This is intentionally not a global subscription manager. Each component owns its own interval. The trade is that two components polling the same endpoint will each fire their own request — which is fine at our scale (a handful of open tabs, maybe 4 endpoints between them) and would only become a problem at much higher concurrency. The simpler architecture means there's no global state to debug when something goes weird.
Execution models compared
Step back from the patterns and notice that they sit on top of three different execution models:
- Celery worker: a separate process, coordinated via a broker (Redis here). One worker can run many tasks, but each task gets its own execution context. Web tier hands off and forgets; the worker may live for hours.
- Async event loop: a single thread inside a single web-worker process, multitasking cooperatively. Many in-flight requests can share one process if they spend most of their time await-ing I/O. There's no parallelism here, only concurrency, but the cost per in-flight request is tiny.
- Frontend polling: the browser sending small, sequential requests on a schedule. Each request hits a normal sync endpoint that returns quickly with a snapshot of state. The "async-ness" is in the cadence, not the protocol.
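The concurrency-without-parallelism point is easy to check. In this sketch, fake_request is a hypothetical stand-in for a request that spends its time awaiting I/O; five of them overlap on a single thread and finish in roughly the time of one:

```python
import asyncio
import threading
import time


async def fake_request(i: int) -> tuple[int, int]:
    # Hypothetical request that spends its time awaiting I/O.
    await asyncio.sleep(0.2)
    return i, threading.get_ident()


async def main() -> tuple[list[tuple[int, int]], float]:
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i) for i in range(5)))
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(main())
threads = {tid for _, tid in results}
# Five overlapping requests, one thread, about 0.2 s total rather than 1.0 s.
print(len(results), len(threads), f"{elapsed:.2f}s")
```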
Figure: Execution models. What each model does at runtime, on the same 30 s wall-clock axis.

| Model | What happens | Latency to done | Load | Infra |
|---|---|---|---|---|
| Celery worker | web tier hands off, worker runs long | ~25 s | connections held: 0 on web tier, 1 on worker | web + worker + broker |
| Async event loop | one process, five overlapping in-flight requests | ~6 s each | 5 concurrent requests sharing 1 worker | web tier only |
| Frontend polling | repeated small requests every 3 s; the last one returns terminal | bounded by cadence (≤ 3 s) | ~20 requests/min | none beyond HTTP |
These models have different latency, throughput, isolation, and failure profiles, so pick by what you need. Celery for "long, can-afford-to-be-slow, must-survive-restart" work (typical Django use case: a 30-second scrape, a 5-minute export). The async event loop for "many in-flight things in the web tier" (typical use case: a streaming endpoint that holds the connection open for minutes while yielding events). Polling for "show progress on long-running work without standing up new infrastructure" (typical use case: a status row that updates from pending → running → complete).
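The mistake is using the wrong one. A polling loop that fetches every 100ms because someone wants "real-time" is doing the work of an SSE connection at many times the cost: ten full request/response cycles per second, each re-authenticating and re-querying whether or not anything changed. An async view that makes ten sequential blocking calls (a synchronous HTTP client, say) holds the loop for every one of those round-trips, defeating the point. A Celery task that finishes in 50ms is mostly overhead: serialization, the broker round-trip, and worker pickup can easily cost as much as the work itself.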
The four bridges
Each backend async pattern needs a way to surface to the user-facing code. The codebase has four distinct bridges, each with different latency / cancellation / infrastructure trade-offs.
- Bridge A: SSE → useScrapeStream hook. The backend emits SSE frames; the frontend consumes them via fetch + ReadableStream and a hand-written wire-format parser. Near-zero latency, two-sided cancel (server flag + client AbortController), no extra infrastructure beyond what HTTP already provides.
- Bridge B: Celery → adaptive polling. The backend returns 202 Accepted with a job_id; the frontend polls /api/.../jobs/{id}/ every 3 or 5 seconds until terminal. Latency is bounded by the cadence; cancellation is just a flag flip (the next poll sees the new state); no new infra (no Channels, no WebSockets).
- Bridge C: Sync HTTP → async work. /fast/ endpoints block until they have results (5–15 seconds); /deep/ endpoints dispatch a Celery task and return 202. Different user intents (quick keyword extraction vs full LLM-driven discovery) drive which endpoint shape gets used.
- Bridge D: Backend → external (webhooks). The backend is the client; the external receiver is the server. The frontend sees this only indirectly, by polling the WebhookDelivery history, which becomes a user-visible audit trail of attempts.
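Bridge A's hand-written parser lives in the TypeScript hook, but the core logic is small enough to sketch in Python. SSEParser is hypothetical, assumes \n line endings, and handles only the event:, data:, and comment lines used here. The important part is the buffering, because a network chunk can end anywhere, even in the middle of a field:

```python
from dataclasses import dataclass, field


@dataclass
class SSEParser:
    """Incremental parser: feed raw text chunks, get back completed (event, data) pairs."""

    buffer: str = ""
    event: str = "message"  # the SSE default event type
    data: list[str] = field(default_factory=list)

    def feed(self, chunk: str) -> list[tuple[str, str]]:
        self.buffer += chunk
        messages: list[tuple[str, str]] = []
        *lines, self.buffer = self.buffer.split("\n")  # last piece may be partial
        for line in lines:
            if line == "":  # blank line: dispatch the pending message, if any
                if self.data:
                    messages.append((self.event, "\n".join(self.data)))
                self.event, self.data = "message", []
            elif line.startswith(":"):
                continue  # comment, e.g. :keepalive
            else:
                name, _, value = line.partition(":")
                value = value[1:] if value.startswith(" ") else value
                if name == "event":
                    self.event = value
                elif name == "data":
                    self.data.append(value)
        return messages


parser = SSEParser()
print(parser.feed('event: post\ndata: {"id": "t3_a'))  # [] (frame not finished yet)
print(parser.feed('bc"}\n\n:keepalive\n\n'))            # [('post', '{"id": "t3_abc"}')]
```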
Figure: The four bridges. Backend async → frontend UI: same job, four different transports.

| Bridge | Latency to UI | Cancellation | Infra needed | Reach for it when… |
|---|---|---|---|---|
| A: SSE (backend pushes, frontend reads stream) | near-zero | two-sided: server flag + AbortController | plain HTTP | live feed of many events while the user watches |
| B: Polling (frontend fetches every N seconds) | 3–5 s | one-sided: server flag, next poll sees it | plain HTTP | one terminal transition, no infra appetite |
| C: Fast / Deep (sync vs async routes by user intent) | sync: 5–15 s; async: 202 + poll | sync: no cancel; async: same as B | plain HTTP | one feature with two latency expectations |
| D: Webhooks (backend is the client, external service receives) | seconds to hours (retries) | no cancel: delivery is the user intent, not a process | plain HTTP + retry chain | notify external integrations of completions |
Three things to notice:
- The bridges aren't interchangeable. SSE gives near-zero latency at the cost of a complex parser and a two-sided cancel; polling gives mediocre latency at the cost of nothing.
- Cancellation gets harder as the bridge gets more interactive. Bridge B (polling) cancels in one round-trip; Bridge A (SSE) cancels in two parallel paths that have to meet.
- Infrastructure adds up, so none of these bridges needs any new infrastructure. That's deliberate. The day I need true fan-out push (one server event to N clients at once), Channels and a real WebSocket layer become unavoidable. Today, none of the bridges need that, and the simpler stack pays back on every deployment.
Gotchas — when patterns interact
Each pattern in isolation is straightforward. The bugs happen at the seams.
Two writers on the same row. The SSE streaming path has three patterns running on it at once: an async view (Pattern 2), an async generator (Pattern 3), and a cancellation checkpoint (Pattern 5). All three are reading and writing to the same ScrapingJob row. If a Celery bulk task (Pattern 1) targets the same job at the same time, both writers race over mark_completed(). The fix is convention, not code: the streaming path targets one subreddit at a time; the bulk path targets many. They never overlap on the same job ID. Enforcing this in code (with a "streaming-mode" flag on the job) is on the list, but a clear convention plus a code-review rule has been enough so far.
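The codebase relies on convention here. If you'd rather have the database arbitrate, one alternative (not what this codebase does) is a conditional UPDATE that only succeeds from the expected state, so the losing writer finds out instead of overwriting. A minimal sketch with the standard-library sqlite3 module and a hypothetical scraping_job table; in Django the same idea is ScrapingJob.objects.filter(id=job_id, status="running").update(status="completed"), whose return value is the number of rows matched:

```python
import sqlite3


def transition(conn: sqlite3.Connection, job_id: str, expected: str, new: str) -> bool:
    """Move a job from `expected` to `new` only if nobody else got there first."""
    cur = conn.execute(
        "UPDATE scraping_job SET status = ? WHERE id = ? AND status = ?",
        (new, job_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows: another writer already moved the job on


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraping_job (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO scraping_job VALUES ('job-1', 'running')")

print(transition(conn, "job-1", "running", "completed"))  # True: first writer wins
print(transition(conn, "job-1", "running", "partial"))    # False: loser sees 0 rows
```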
.delay() vs inline calls. Calling task.delay() inside another task fans out to N workers. Sometimes that's what you want (parallelism!), sometimes it's catastrophic (every worker now holds a DB connection, the orchestrator can't reason about completion order). Inline calls keep the work on the calling worker — same process, same transaction-able context. Pick deliberately; the default of .delay() is the wrong default for orchestrator-style tasks.
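Async ORM inside sync helpers. The a* ORM methods are great. Calling one of them from a sync helper function (via async_to_sync(...)) once per row in a loop over 1000 rows is terrible: you pay the cost of an async↔sync bridge per row, the connection pool churns, and a Postgres lock that would be held briefly in batch becomes a contention hotspot. If you need async ORM inside sync code, bridge once at the top: wrap a single coroutine that does await asyncio.gather(*[Model.objects.aupdate(...) for ...]) in one async_to_sync call. Don't expect gather to run those queries in parallel, though: Django's a* methods hand each query to sync_to_async with thread_sensitive=True, so they still execute one after another on a single thread. In sync code, the sync ORM's bulk tools (bulk_update, or a single filter(...).update(...)) are usually the better batch.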
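Why gather buys no parallelism for calls funnelled through one thread is easy to simulate: route every "query" through a one-thread executor, which is roughly what thread_sensitive=True gives you. DB_THREAD, sync_update, and fake_aupdate are hypothetical stand-ins, not Django's implementation:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread, roughly what sync_to_async(thread_sensitive=True) gives you.
DB_THREAD = ThreadPoolExecutor(max_workers=1)


def sync_update(row_id: int) -> int:
    time.sleep(0.05)  # hypothetical stand-in for one UPDATE round-trip
    return row_id


async def fake_aupdate(row_id: int) -> int:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(DB_THREAD, sync_update, row_id)


async def main() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_aupdate(i) for i in range(10)))
    return time.perf_counter() - start


elapsed = asyncio.run(main())
print(f"10 gathered updates took {elapsed:.2f}s")  # about 10 x 0.05 s, not 0.05 s
```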
Polling stops when fetched state is terminal, not when desired state is terminal. If a user clicks "Retry" on a dead delivery, the row flips to pending and you enqueue a fresh task — but the polling loop won't restart unless the component refetches. The fix is to always refetch on user actions (cancel, retry, trigger), not just on schedule. The polling cadence handles the "what's happening now" question; explicit refetches handle the "I just did something" question.
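A minimal sketch of the difference, with no React involved. StatusPoller is hypothetical, fetch returns the server's current status string, and tick is what a scheduler would call on each interval:

```python
from typing import Callable, Optional


class StatusPoller:
    """Hypothetical poller: keeps polling only while the last *fetched* status is live."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self.fetch = fetch
        self.status = fetch()

    def next_delay(self) -> Optional[float]:
        if self.status == "running":
            return 5.0
        if self.status == "pending":
            return 3.0
        return None  # terminal: the schedule has stopped itself

    def tick(self) -> Optional[float]:
        # What the interval calls; a stopped poller never refetches on its own.
        if self.next_delay() is not None:
            self.status = self.fetch()
        return self.next_delay()

    def on_user_action(self, action: Callable[[], None]) -> Optional[float]:
        action()  # e.g. POST a retry
        self.status = self.fetch()  # the explicit refetch is what restarts polling
        return self.next_delay()


server = {"status": "dead"}
poller = StatusPoller(lambda: server["status"])
server["status"] = "pending"  # a retry happened, but nothing refetched
print(poller.tick())  # None: the last fetched state is still terminal
print(poller.on_user_action(lambda: None))  # 3.0: an explicit refetch restarts polling
```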
Closing: the debugging mental model
Seven async patterns, each with a specific problem-shape it solves. Four bridges between backend async and frontend code, each with a different latency / visibility / cancel profile. When debugging anything async, the first question is "which of the seven is in play?" — and the second is "what's the bridge to the user-facing surface?" Most async bugs in this codebase have turned out to be misidentified patterns: a developer reaching for .delay() when they meant inline, an async view that accidentally does sync DB work, a polling loop that doesn't restart after a user action.
This article is the architectural overview that ties the prior four pieces together: Server-Sent Events, in Production is the deep dive on Pattern 4. One Interface, Five Scraping Backends is the deep dive on the factory layer that produces the async generators of Pattern 3. An LLM Abstraction Layer with Pre-Built Agents is Pattern 2 applied to LLM provider calls (with a ReAct loop on top). Durable Outbound Webhooks with HMAC and Exponential Backoff is the deep dive on Pattern 6. The series intentionally goes from features (the prior four) to patterns (this one) — features are easier to motivate, patterns are what you reach for next time.