One Interface, Five Scraping Backends
In a Reddit scraping side project I'm building, the code that actually fetches a subreddit's posts isn't a single thing — it's five. A paid proxy API (BrightData) for when the target is in active anti-bot mode. A real browser driver (Selenium) for pages that only render after JavaScript. A direct JSON request for the cheap, simple case. An async-native browser (Playwright) for the live-feed path. And a low-overhead CDP driver (nodriver) for when Playwright's full surface area is overkill.
The orchestrator code that calls all of this shouldn't have to care which one runs. It should ask for posts and get posts, regardless of which mechanism actually went and got them. That requires an abstraction layer — a small Python package called data_provider that lives one floor below the rest of the application and handles "go get me Reddit data, I don't care how." This post is that layer, opened up.
The cleanest way into the design is by analogy.
Imagine you're traveling internationally with one laptop and a universal travel adapter kit. Your laptop is the calling code. Each plug shape in the kit — UK, EU, US, JP — is a different backend. The adapter kit's case is a Python package called data_provider. The selector dial on the case is an enum called ScraperBackend. And there's a clever feature: a smart mode that tries the UK plug first, and if it falls out of the socket, automatically tries the EU plug, then the US one. That's ScraperBackend.AUTO.
The laptop never has to know what country it's in. It just says "give me power" — the adapter handles the rest.
In code terms: my orchestrator says scraper = get_scraper(backend); scraper.scrape_subreddit_posts(...) and has no idea whether BrightData or Selenium fulfilled the call. The caller is decoupled from the how. This post is the story of that abstraction layer — what's inside data_provider, why it's shaped the way it is, and the two gotchas that make the design less obvious than it looks.
There's a second analogy worth keeping in your back pocket: the adapter kit also has a voltage normalizer. The UK plug delivers 240V; the US plug delivers 120V. Your laptop only accepts one voltage. The Pydantic DTOs (ScrapedPostDTO, ScrapedCommentDTO, etc.) are that normalizer — every adapter must return data in the same shape, so the caller can trust it regardless of which backend produced it. We'll come back to this when we hit the trust boundary section.
The architecture, in one picture
The package has five moving parts and they fit together cleanly: a factory function, two registries (one sync, one async), the concrete adapters, the abstract base classes those adapters implement, and the DTOs that adapters all promise to return. The factory is the entry point — callers never instantiate adapters directly; they call get_scraper(backend) and get back something that satisfies the right ABC. The registries are tiny dicts: enum value → factory lambda. The adapters do the actual work of talking to BrightData or driving Selenium. The ABCs (BaseRedditScraper, BaseRedditScraperAsync, BaseRedditScraperAsyncStreaming, BaseSerpProvider) define the contracts. The DTOs define the shape of the answer.
If you read the earlier post on streaming this data to the browser via SSE, data_provider is the layer below it — the engine that fetches what those streams emit. The streaming view yields events; this package decides who actually went and got them.
Architecture
One call, top to bottom — caller down to the database
Two things to call out on the diagram. First, there are two registries, not one — _SCRAPER_REGISTRY for sync adapters, _ASYNC_SCRAPER_REGISTRY for async. Each enum value lives in exactly one. The factory functions cross-check: if you ask get_scraper("playwright_async"), the sync factory doesn't try to be helpful and bridge — it raises ValueError with a redirect message saying "use get_async_scraper(...) instead." Same the other way. The two execution models stay strictly separated; no asyncio.run() shims silently bridging at the boundary. That's a deliberate design choice — implicit bridges are how event loops deadlock at 3 a.m.
Second, the dotted horizontal line marked "trust boundary." That's where data from outside the system crosses into our domain. Below that line, everything is a typed DTO instance. Above it, anything goes — Reddit returns whatever it feels like, BrightData wraps payloads in its own envelope shape, an HTML scraper can hand you malformed JSON if a page changed. The DTOs are how we refuse to let that mess propagate.
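To make the first point concrete, here is a minimal sketch of the two registries and the factory cross-check. The enum members, stub adapter classes, and exact messages are illustrative assumptions, not the package's real code; only the names mentioned in this post are taken from it.

```python
from enum import Enum


class ScraperBackend(str, Enum):
    AUTO = "auto"
    BRIGHTDATA = "brightdata"
    SELENIUM = "selenium"
    NATIVE = "native"
    PLAYWRIGHT_ASYNC = "playwright_async"
    # ... plus the native-async and nodriver entries


# Stand-ins for the real adapter classes, which live elsewhere in the package.
class BrightDataRedditScraper: ...
class SeleniumRedditScraper: ...
class NativeJsonRedditScraper: ...
class PlaywrightRedditScraperAsync: ...
class _AutoRedditScraper: ...


# Each enum value lives in exactly one registry: enum value -> zero-arg factory.
_SCRAPER_REGISTRY = {
    ScraperBackend.BRIGHTDATA: lambda: BrightDataRedditScraper(),
    ScraperBackend.SELENIUM: lambda: SeleniumRedditScraper(),
    ScraperBackend.NATIVE: lambda: NativeJsonRedditScraper(),
}
_ASYNC_SCRAPER_REGISTRY = {
    ScraperBackend.PLAYWRIGHT_ASYNC: lambda: PlaywrightRedditScraperAsync(),
}


def get_scraper(backend: ScraperBackend):
    if backend is ScraperBackend.AUTO:
        return _AutoRedditScraper()  # meta-backend: walks the fallback chain
    if backend in _ASYNC_SCRAPER_REGISTRY:
        # Deliberately no asyncio.run() bridge; redirect the caller instead.
        raise ValueError(f"{backend.value!r} is async; use get_async_scraper(...) instead")
    try:
        return _SCRAPER_REGISTRY[backend]()  # fresh adapter instance per call
    except KeyError:
        raise ValueError(f"unknown sync scraper backend: {backend!r}") from None
```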
The five backends
Why five? Each one exists because one of the others isn't enough.
BrightData is the paid proxy + unblock API. It's the backend you reach for when you need high-volume scraping and Reddit is in active anti-bot mode. It costs money per request; in exchange, you stop worrying about IP rotation, captchas, and rate limits. Sync only in this codebase — BrightData's response model is request/response, and there's no async win to chase.
Selenium is a real Chromium driver. It's the backend that handles pages where Reddit is serving JS-heavy content and the JSON endpoint isn't enough. It's slow, it's resource-hungry, and on a small server it's the most expensive thing in the stack. But when a page only renders after useEffect runs in the browser, Selenium is the lever to pull. Sync only — it predates the async refactor.
Native JSON is the cheap path: a direct requests call against Reddit's /r/{sub}/{sort}.json endpoint. No browser, no proxy, no overhead. It's the first thing AUTO tries, when it doesn't have a reason to think it'll fail. The native backend has both sync and async variants (and a streaming async variant that I use under the SSE layer).
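For a sense of how cheap the cheap path is, here is roughly what it boils down to. The function name is mine, and the real native adapter wraps this in input validation and DTO construction; only the endpoint shape comes from the post.

```python
import requests


def fetch_subreddit_listing(subreddit: str, sort: str = "hot", limit: int = 25) -> dict:
    """One GET against Reddit's public listing endpoint: no browser, no proxy."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    resp = requests.get(
        url,
        params={"limit": limit},
        headers={"User-Agent": "data-provider-demo/0.1"},  # identify yourself; default UAs get throttled
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```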
Playwright is the modern, async-native real browser. Where Selenium is the dependable workhorse, Playwright is the version you write when your stack is already async. It supports the streaming async pattern too — important for the live-feed path.
nodriver is a CDP-level driver — the successor to Undetected-Chromedriver — that talks directly to Chrome's DevTools protocol rather than going through a WebDriver intermediary. The selling point is lower overhead than Playwright for use cases where the full automation surface is overkill. Async + streaming, like Playwright.
A separate ABC, BaseSerpProvider, lives in the same package for a different reason: subreddit discovery. Given a query like "machine learning", the SerpProvider asks a search engine which subreddits exist and which ones rank. That's a different operation from scraping posts, so it gets its own contract. Today there's one SERP adapter; the registry pattern means adding a second is the same shape of change as adding a sixth scraper.
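For orientation, the contracts look roughly like this. The posts/comments/author method names appear elsewhere in this post; the SERP method name and the exact signatures are my assumptions, not the package's real definitions.

```python
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class BaseRedditScraper(ABC):
    """Sync contract: every sync adapter returns DTOs, never raw payloads."""

    @abstractmethod
    def scrape_subreddit_posts(self, subreddit: str, sort: str, limit: int) -> list["ScrapedPostDTO"]: ...

    @abstractmethod
    def scrape_post_comments(self, post_id: str) -> list["ScrapedCommentDTO"]: ...

    @abstractmethod
    def scrape_author_profile(self, username: str) -> "ScrapedAuthorDTO": ...


class BaseRedditScraperAsync(ABC):
    """Async contract: the same operations, awaitable."""

    @abstractmethod
    async def scrape_subreddit_posts(self, subreddit: str, sort: str, limit: int) -> list["ScrapedPostDTO"]: ...


class BaseRedditScraperAsyncStreaming(BaseRedditScraperAsync):
    """Adds incremental yielding; only some async backends implement it."""

    @abstractmethod
    def scrape_subreddit_posts_stream(self, subreddit: str, sort: str, limit: int) -> AsyncIterator["ScrapedPostDTO"]: ...


class BaseSerpProvider(ABC):
    """Subreddit discovery from a search query: a different operation, a different contract."""

    @abstractmethod
    def discover_subreddits(self, query: str) -> list["DiscoveredSubredditDTO"]: ...  # method name is a guess
```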
Backends
Capabilities side by side — five scrapers plus the separate SERP provider
| Backend | sync | async | streaming | SERP | Best for |
|---|---|---|---|---|---|
| BrightData · paid proxy / unblock API | ✓ | — | — | — | high-volume scraping, anti-bot heavy targets |
| Selenium · real Chromium driver | ✓ | — | — | — | JS-heavy pages on a sync stack |
| Native JSON · direct HTTP to /r/{sub}/{sort}.json | ✓ | ✓ | ✓ | — | the cheap, simple path — what AUTO tries first |
| Playwright · async-native real browser | — | ✓ | ✓ | — | JS-heavy pages on an async stack |
| nodriver · CDP-level driver, successor to Undetected-Chromedriver | — | ✓ | ✓ | — | lower overhead than Playwright when full automation is overkill |
| SerpProvider · separate ABC (BaseSerpProvider) | — | — | — | ✓ | subreddit discovery from a search query |
The matrix isn't symmetric — and that's the whole reason for the split between get_scraper and get_async_scraper. Each backend exists in exactly one of the registries; the factory enforces it. If you want to migrate a backend from sync to async (say, you reimplement the BrightData adapter with httpx instead of requests), it moves from one registry to the other and the migration is visible in the diff. No silent fallthrough, no quietly-degraded behavior.
The trust boundary: Pydantic DTOs
The single most load-bearing decision in this package is what happens at the moment a response arrives from outside. Every adapter, when it gets bytes back from Reddit or BrightData or wherever, parses them and constructs DTO instances — ScrapedPostDTO, ScrapedCommentDTO, ScrapedAuthorDTO, DiscoveredSubredditDTO. These are Pydantic models with two strictness knobs cranked all the way up:
- `StrictInt` (and `StrictStr`, etc.) instead of plain `int`. The plain `int` field is liberal — give it `"42"` (a string) and Pydantic will helpfully coerce. `StrictInt` refuses; it raises a validation error if the type isn't an actual int. We want refusal.
- `extra="forbid"` on the model config. Default behavior is to silently drop unknown fields. `forbid` raises if anyone hands you a payload with fields your schema doesn't know about. If Reddit adds a new field to a post, we want to find out at the validation layer, not let it leak through and quietly corrupt downstream code.
The combination is deliberate. Strict typing without extra="forbid" lets new fields slip through. extra="forbid" without strict typing lets bad data in disguise pass. Together they form a gate that's hard to fool.
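As a minimal sketch of what that looks like in Pydantic terms. The real DTOs carry more fields than these four; the ones shown here are just the ones used in the examples below.

```python
from pydantic import BaseModel, ConfigDict, StrictInt, StrictStr


class ScrapedPostDTO(BaseModel):
    # Unknown fields raise instead of being silently ignored.
    model_config = ConfigDict(extra="forbid")

    reddit_id: StrictStr
    title: StrictStr
    score: StrictInt   # 42 passes; "42" (a string) is refused, not coerced
    author: StrictStr


ScrapedPostDTO(reddit_id="t3_abc", title="Hello", score=42, author="alice")   # ok
ScrapedPostDTO(reddit_id="t3_xyz", title="Other", score="42", author="bob")   # raises ValidationError
```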
Trust Boundary
Pydantic DTO validation — what passes the gate, what bounces off
- {"reddit_id": "t3_abc","title": "Hello","score": 42,"author": "alice"}
shape matches schema
ScrapedPostDTO( reddit_id='t3_abc', title='Hello', score=42, author='alice', )
- {"reddit_id": "t3_xyz","title": "Other","score": "42","author": "bob"}
score: StrictInt expected int, got 'str'
DTOValidationError field: score type: int_type input: "42" ← string, not int
- {"reddit_id": "t3_jkl","title": "Surprise","score": 7,"author": "carol","internal_id": "x_99"}
extra='forbid' rejects unknown field
DTOValidationError field: internal_id type: extra_forbidden hint: add to schema or drop field
This is the bit worth printing on a sticker: loud failure beats silent corruption. A DTOValidationError at the boundary is annoying — it bubbles up, you go look at what changed in the response, you fix it. The alternative is corrupt rows in the database that you don't notice for weeks, until you build a report and the numbers don't add up. Pydantic's strictness is the difference between the noisy version of that bug and the silent one.
The corollary, which is the section's main gotcha: DTOs are not Django models. ScrapedPostDTO looks like one. It has the same field names. It's defined right next to a models.py. You'd assume ScrapedPostDTO(...) saves to the database. It does not. Pydantic DTOs are pure-Python value objects — no manager, no .save(), no foreign keys. Their job is to validate the shape of data that just crossed the trust boundary. If the shape is right, you get a DTO. If not, you get an exception. Persistence is a separate step. The orchestrator does:
```python
ScrapedPost.objects.update_or_create(
    project=job.project,
    reddit_id=dto.reddit_id,
    defaults=_post_dto_to_defaults(dto, job=job, subreddit=subreddit),
)
```
The DTO → Django model translation lives in helpers like _post_dto_to_defaults — explicit, auditable, decoupled from the scraper. This is the bit that makes the abstraction earn its keep: you can swap a backend (BrightData → Selenium) without touching the persistence layer, and you can swap the persistence layer (PostgreSQL → something else) without touching the scrapers. Each side only knows the DTO shape.
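The helper itself is nothing fancy. A sketch under stated assumptions: the Django field names on the left are illustrative guesses, not the actual model schema.

```python
def _post_dto_to_defaults(dto: ScrapedPostDTO, *, job, subreddit) -> dict:
    """Map a validated DTO onto the Django model's non-key fields."""
    return {
        "title": dto.title,
        "score": dto.score,
        "author": dto.author,
        "subreddit": subreddit,
        "scraping_job": job,   # hypothetical field name
    }
```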
The happy path: a single call, end to end
A concrete trip through the code when scrape_subreddit_posts_task runs with scraper_backend="brightdata":
Step 1 — Resolution. The task has a ScrapingJob row with scraper_backend="brightdata". It calls get_scraper(ScraperBackend.BRIGHTDATA).
Step 2 — Registry lookup. Inside get_scraper: first check, is this AUTO? No. Next check, is it accidentally registered in the async registry instead? No. Look it up in _SCRAPER_REGISTRY. Find a lambda that constructs BrightDataRedditScraper(zone_env_var=..., default_zone="web_unlocker1"). Invoke the lambda. Return a fresh adapter instance.
Step 3 — Method call. Caller does scraper.scrape_subreddit_posts("python", "hot", 25). Inside the adapter: validate inputs (via data_provider.validators — sanity checks like "subreddit name isn't empty"), make the HTTP call to BrightData, parse the response, build list[ScrapedPostDTO]. Pydantic enforces shape — if BrightData returns a post with score="42" (a string), construction fails, and the adapter raises DTOValidationError. Loud, immediate, with a stack trace pointing at the exact field that failed.
Step 4 — Return. The adapter returns list[ScrapedPostDTO]. The task flattens those into the update_or_create calls we just saw. From this point on, the system has normalized data; no scraper-specific shape leaks past the adapter boundary.
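Condensed into code, step 3 inside the adapter looks roughly like this. Helper names like `_fetch` and `validate_subreddit_name` are placeholders for the adapter's backend-specific plumbing; the DTO construction and the loud failure are the point.

```python
from pydantic import ValidationError


class BrightDataRedditScraper(BaseRedditScraper):
    def scrape_subreddit_posts(self, subreddit: str, sort: str, limit: int) -> list[ScrapedPostDTO]:
        validate_subreddit_name(subreddit)               # data_provider.validators: fail fast on bad input
        raw_items = self._fetch(subreddit, sort, limit)  # backend-specific HTTP call to BrightData
        posts = []
        for item in raw_items:
            try:
                posts.append(ScrapedPostDTO(**item))     # strict construction at the trust boundary
            except ValidationError as exc:
                raise DTOValidationError(str(exc)) from exc  # loud, immediate, names the failing field
        return posts
```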
That's the happy path. Notice what doesn't happen anywhere in this trace: there is no if backend == "brightdata" branch in the caller, no instance check on the returned DTOs, no per-backend post-processing logic. The orchestrator wrote one code path; the factory and the DTO contracts do the rest of the work.
AUTO mode: the cascade
The factory has one wrinkle that makes the abstraction interesting. If the caller said get_scraper(ScraperBackend.AUTO) instead of a concrete backend, the factory doesn't return a real adapter — it returns _AutoRedditScraper, a thin wrapper that doesn't talk to Reddit itself. Each method on that wrapper (scrape_subreddit_posts, scrape_post_comments, scrape_author_profile) calls into a private _dispatch() helper that walks the chain.
The dispatch logic, distilled (a code sketch follows the list):

- Read `settings.DATA_PROVIDER_AUTO_CHAIN` — a list of backend names in priority order, e.g. `["native", "brightdata", "selenium"]`.
- For each backend in that order: ask the circuit breaker `is_open(backend)`. If yes, skip — this backend has failed recently, back off. If no, instantiate the adapter and try the call.
- On `DataProviderError` or `NotImplementedError`: record a failure with the circuit breaker, fall through to the next backend.
- On any other exception (`KeyError`, `TypeError`, anything unexpected): re-raise it immediately. Programmer bugs should surface, not be swallowed by a fallback chain. AUTO is for graceful degradation under external failure, not a mask for our own bugs.
- If every backend in the chain bails: raise `AllProvidersFailedError` with the chain of underlying exceptions attached, so the caller can see exactly what went wrong at each step.
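A condensed sketch of that dispatch loop. The surrounding names (`settings`, the `_circuit` breaker instance, the exception classes, the registry) are assumed to exist as described in this post; the loop body is the behavior the list above spells out.

```python
class _AutoRedditScraper:
    def scrape_subreddit_posts(self, *args, **kwargs):
        return self._dispatch("scrape_subreddit_posts", *args, **kwargs)

    def _dispatch(self, method_name: str, *args, **kwargs):
        failures: list[Exception] = []
        for name in settings.DATA_PROVIDER_AUTO_CHAIN:    # e.g. ["native", "brightdata", "selenium"]
            backend = ScraperBackend(name)
            if _circuit.is_open(backend):                  # failed recently: skip, don't hammer it
                continue
            scraper = _SCRAPER_REGISTRY[backend]()
            try:
                return getattr(scraper, method_name)(*args, **kwargs)
            except (DataProviderError, NotImplementedError) as exc:
                _circuit.record_failure(backend)           # external failure: note it, fall through
                failures.append(exc)
            # Any other exception propagates: programmer bugs should surface, not be masked.
        raise AllProvidersFailedError(failures)
```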
AUTO Mode
ScraperBackend.AUTO walks the chain — three scenarios cycling
Chain order in the diagram: brightdata (chain[0]) → selenium (chain[1]) → native (chain[2]). The circuit breaker decides which backends are eligible to try at each step — failures over the last 60 s short-circuit the chain.
This is the part that turns a nice abstraction into one that's actually useful in production. Reddit goes flaky for ten minutes and a backend starts returning 503s; AUTO doesn't fail the user's request — it tries the next backend. BrightData has a billing problem and starts returning 402s; AUTO doesn't keep hammering it — the circuit breaker opens after a few failures and the chain skips that backend until it cools off. Selenium runs out of memory on the staging box; AUTO falls through to native JSON and the user gets degraded-but-functional results.
The key constraint, and the reason this design works: AUTO is only ever as good as the chain order. If you put a backend that's slow and expensive first, AUTO will hit it before falling back to the cheap fast one. The chain isn't smart; it's a list. You pick the order based on what you know about reliability and cost, and you update DATA_PROVIDER_AUTO_CHAIN when those facts change.
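In Django settings terms, that knob is just this; the value shown is the example order from the list above.

```python
# settings.py: the chain isn't smart, it's a list. Order encodes what you
# currently believe about cost and reliability, cheapest and most reliable first.
DATA_PROVIDER_AUTO_CHAIN = ["native", "brightdata", "selenium"]
```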
The circuit breaker
What stops AUTO from hammering a failing backend forever is a small in-memory state machine called _CircuitState. It's the load-bearing safety net under the cascade, and it's worth understanding because it's also the part that's easiest to misconfigure.
Each backend gets a list of recent failure timestamps. Every time _dispatch() records a failure, it appends the current time to that list. Before any dispatch attempt, the circuit breaker checks: how many of those timestamps are within the last 60 seconds? If three or more, the circuit is open — calls to that backend are short-circuited (fast-failed) and AUTO moves on to the next one immediately. If fewer than three, the circuit is closed — calls go through normally.
The cooldown is implicit: the failure list ages itself out. As real time passes, old timestamps drift outside the 60-second window. Once the count drops below the threshold, the circuit closes again automatically. There's no separate "cooldown timer" to manage; the sliding window IS the cooldown.
Circuit Breaker
Three failures in 60 s open the circuit; cooldown closes it again
CLOSED: calls pass through · OPEN: fast-fail, skip · HALF-OPEN: one probe. Process-local state — each Celery prefork worker tracks its own window; no shared cache, no Redis dependency.
The state machine is technically two-state in this implementation (CLOSED and OPEN), but the conceptual third state — HALF_OPEN — is what happens at the boundary moment when the count drops below the threshold. The next dispatch attempt is effectively a probe; if it succeeds, the circuit stays closed and the chain has recovered. If it fails, that failure goes onto the list, the count creeps back up, and the breaker re-opens on the next request that pushes it over the threshold.
Two implementation details matter and would bite you if you copied this design naively:
Thread safety. The failure-list dict is guarded by threading.Lock. Multiple worker threads can call is_open() and record_failure() concurrently; without the lock, you'd get races where two threads each see the count as 2 and both decide it's fine to try — except the failure they each then record pushes the real count to 4. The lock isn't optional.
Process-local state, on purpose. Each Celery prefork worker maintains its own _CircuitState dict. There's no Redis-backed shared state, no cross-worker coordination. Is that a feature or a bug? In this codebase it's a feature: a circuit "open" in one worker doesn't unfairly punish requests routed to other workers, and the system has zero new infrastructure to maintain. The trade is that an outage takes slightly longer to detect across the whole fleet (each worker has to discover it independently). For my scale, that's fine. At higher scale or with explicit SLO requirements, a Redis-backed shared circuit would earn its keep.
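A minimal sketch of that sliding-window breaker, matching the behavior described above (three failures in a 60-second window open the circuit, the list ages itself out, the dict is lock-guarded, and state is local to the process). The class and constant names are mine.

```python
import threading
import time
from collections import defaultdict

FAILURE_THRESHOLD = 3    # this many failures inside the window opens the circuit
WINDOW_SECONDS = 60.0    # sliding window; old failures age out on their own


class _CircuitState:
    """Process-local sliding-window breaker: one instance per worker process."""

    def __init__(self) -> None:
        self._failures: dict[str, list[float]] = defaultdict(list)
        self._lock = threading.Lock()

    def record_failure(self, backend: str) -> None:
        with self._lock:
            self._failures[backend].append(time.monotonic())

    def is_open(self, backend: str) -> bool:
        cutoff = time.monotonic() - WINDOW_SECONDS
        with self._lock:
            # Drop timestamps older than the window; the shrinking list is the cooldown.
            recent = [t for t in self._failures[backend] if t >= cutoff]
            self._failures[backend] = recent
            return len(recent) >= FAILURE_THRESHOLD
```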
Two more gotchas
The DTOs-are-not-Django-models trap I already covered in the trust boundary section. Two more worth knowing:
ScraperBackend.AUTO is not in either adapter registry. It's a real enum value — you can pass it to get_scraper() — but it's intercepted at the factory level before the registry lookup happens. If you ever write code that iterates the registries to "list all real backends," remember AUTO is a meta-backend, not a backend, and it won't show up in the iteration. The fix is to filter it out of the enum explicitly, or iterate the registry keys directly instead of the enum.
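In code, both fixes are one-liners; neither will ever yield the meta-backend.

```python
# Registry keys only ever contain concrete backends; AUTO is intercepted before lookup.
real_backends = list(_SCRAPER_REGISTRY) + list(_ASYNC_SCRAPER_REGISTRY)

# Or, starting from the enum, filter the meta-backend out explicitly.
real_backends = [b for b in ScraperBackend if b is not ScraperBackend.AUTO]
```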
The async streaming variant has its own ABC. BaseRedditScraperAsyncStreaming extends BaseRedditScraperAsync and adds one extra method: scrape_subreddit_posts_stream(). Not every async backend implements it — only the ones that can yield posts incrementally as they arrive. If you write code that assumes any async scraper can stream, you'll hit NotImplementedError at runtime. AUTO handles this gracefully (NotImplementedError is treated like DataProviderError and falls through), but direct callers shouldn't assume the streaming method exists on an arbitrary async adapter.
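For direct callers, the safe pattern is a capability check before reaching for the streaming method. A sketch: `handle` is a stand-in for whatever you do with each post, and the arguments are the example values used earlier in this post.

```python
async def fetch_posts(backend: ScraperBackend) -> None:
    scraper = get_async_scraper(backend)
    if isinstance(scraper, BaseRedditScraperAsyncStreaming):
        # Streaming-capable backend: consume posts as they arrive.
        async for post in scraper.scrape_subreddit_posts_stream("python", "hot", 25):
            handle(post)
    else:
        # Plain async backend: one awaited batch.
        for post in await scraper.scrape_subreddit_posts("python", "hot", 25):
            handle(post)
```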
Closing
The cleanest mental model: the data_provider app is a typed, factoried, fault-tolerant skin around "go get me Reddit data," and every line of code outside this app gets to pretend Reddit is one stable thing. The factory hides which backend ran. The DTOs hide which response shape arrived. The AUTO cascade hides which backend was healthy at the moment. The orchestrator gets to write for post in scraper.scrape_subreddit_posts(...) and nothing else.
That said: the abstraction earns its keep only when you actually have multiple backends. If you only ever call BrightData, this whole structure is dead weight — a factory function calling a single lambda, a registry with one entry, a circuit breaker with nothing to fall back to. The pattern is worth reaching for the moment a second backend shows up; before that, it's premature. The travel-adapter analogy works because the kit is useful when you actually travel. Sitting at home, it's just a box of plugs.