Deep Fix & Prevention Analysis

For each issue in the Jobs Issues Extraction: the structural fix, and the design/process/testing failures that allowed it to exist.

Every bug has a surface fix (the patch that shipped) and a deep fix (the structural change that makes the class of bug impossible). Every bug also has preventions — things that, had they been in place, would have caught it before it mattered. Preventions are tagged: Type (API/type design), Test (automated testing), Observe (monitoring/alerting), Design (architectural review), Process (team convention or checklist).

Recurring Deep Principles (by frequency across all 65 issues)

Type Make illegal states unrepresentable (19 issues)
Observe Distinguish silence from health (14 issues)
Design Classify errors into retry/fail/degrade (12 issues)
Test Contract tests at stage boundaries (11 issues)
Design Cheap validation before expensive work (10 issues)
Process Require end-to-end smoke test for new stages (9 issues)
Type Return Result[T, Error], never None-as-success (8 issues)
Observe Alert on output shape, not just error absence (7 issues)
Design Stages declare preconditions explicitly (6 issues)
Process Adversarial "what if external returns garbage" review (6 issues)
💀

Silent Failures & Data Integrity

18 issues
Bug Empty IB response marked as done
IB disconnected → empty data → item marked done → permanently lost from queue.
Raise RetryLaterError when no data returned.
Deep Fix
Stage handlers should return a typed result object (StageResult) that cannot be constructed without data. The runner should reject None/{} at the framework level, not rely on each handler to remember to check. A StageResult.success(data) factory that validates non-emptiness makes "empty success" impossible to express.
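A minimal sketch of such a result type (the names StageResult and EmptyResultError are illustrative, not from the codebase):

```python
from dataclasses import dataclass
from typing import Any


class EmptyResultError(ValueError):
    """Raised when a handler tries to declare success without any data."""


@dataclass(frozen=True)
class StageResult:
    data: dict[str, Any]

    @classmethod
    def success(cls, data: dict[str, Any]) -> "StageResult":
        # The only sanctioned way to build a success result: an empty payload
        # cannot be expressed as success, it raises instead.
        if not data:
            raise EmptyResultError("StageResult.success() called with empty data")
        return cls(data=data)
```

If the runner only accepts StageResult, a handler that falls through with None or {} fails loudly at the framework boundary instead of being recorded as done.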
Preventions
Type Return type that encodes non-emptiness. If process_stage() returns StageResult instead of dict | None, the runner can enforce the contract. The bug wasn't in the handler — it was in the API that let handlers return nothing and call it success.
Test Fault injection test: "what does the stage do when the dependency returns empty?" Mock IB returning {}. If the test passes (item stays pending), you've tested the contract. This test would have caught every "empty = success" bug in this category.
Observe Track items-completed-with-no-data as a metric. A dashboard counter for "done items with empty results" would have surfaced this before 100s of items were lost. Absence of output is a signal, not a non-event.
Design "What happens when the external system is down?" should be a mandatory design question for every stage that depends on an external service. The IB handler was written assuming IB is always up — it should have been written assuming IB is usually down.
Bug IB fallback silently used ephemeral connections
Stockloader not running → fell back to ephemeral IB connections → silently degraded, missed pacing.
Remove fallback. Fail fast with RetryLaterError.
Deep Fix
Fallbacks are lies unless they're logged and metered. The real fix is a policy: every fallback path must (a) emit a warning-level log, (b) increment a "degraded" counter, and (c) have its own acceptance criteria (does the fallback actually produce correct results?). An untested, unmonitored fallback is worse than a crash — it gives you wrong answers with high confidence.
Preventions
Design The "fallback review" principle: for every fallback path, ask "if this ran for a week without anyone noticing, would the data be correct?" If no, it's not a fallback — it's a silent corruption vector. Remove it.
Observe Alert when fallback path activates. A fallback that fires silently is indistinguishable from normal operation. The whole point of a fallback is that something went wrong — that should be visible.
Process Convention: "no silent degradation." If a stage can't do its job at full quality, it must either retry or fail — never silently produce lower-quality output. This prevents an entire class of "it worked but the data was wrong" bugs.
Bug Circuit breaker: infinite retry, dashboard showed "queued"
Import errors caught, logged, retried forever. Dashboard showed "queued" — nothing processed.
Track consecutive failures. After 3, auto-pause with error visible in dashboard.
Deep Fix
The runner's error handling conflated transient errors (network timeout — retry makes sense) with deterministic errors (import error — retry is insane). A proper error taxonomy with at least three classes — RetryLater (transient), PermanentFailure (this item is broken), SystemFailure (the stage itself is broken, stop everything) — would have made the circuit breaker automatic. An import error is always a SystemFailure: retrying won't fix missing code.
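A sketch of what that taxonomy and the runner-side dispatch could look like (exception and method names are assumptions, not the project's actual API):

```python
class RetryLaterError(Exception):
    """Transient: the item is fine, try it again later (network, rate limit)."""

class PermanentFailureError(Exception):
    """Deterministic for this item: record the error and move on."""

class SystemFailureError(Exception):
    """The stage itself is broken (missing code, bad config): stop the job."""


def run_one(item, handler, queue, job):
    try:
        return handler(item)
    except RetryLaterError:
        queue.requeue(item)                       # bounded retry handled by the queue
    except PermanentFailureError as exc:
        queue.mark_failed(item, reason=str(exc))  # only this item is affected
    except (SystemFailureError, ImportError) as exc:
        job.pause(reason=str(exc))                # retrying won't fix missing code
        raise
```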
Preventions
Type Three-class error taxonomy in the type system. If the runner only accepts RetryLater | PermanentFail | SystemHalt, it can dispatch correctly by construction. A bare except Exception that retries everything is the anti-pattern — it means you haven't thought about which errors are which.
Observe Alert on "N items attempted, 0 succeeded" within a window. Any system that processes items should have a throughput metric. Zero throughput for >5 minutes should page someone — it means the system is alive but brain-dead.
Test Chaos test: inject an ImportError into a handler and verify the runner halts the job. This directly tests the "what happens when the code is broken" path, which is different from "what happens when the data is bad."
Design Bounded retry with escalation. Unbounded retry is never correct. After N failures of the same kind, escalate: pause the job, alert, change strategy. The maximum retry count should be a design parameter, not an accident of implementation.
Bug Stuck in_progress items from crashed runner
Runner crash → items stuck in in_progress → never picked up again.
Reset in_progress to pending on startup.
Deep Fix
in_progress is a lease, not a state. It should have an expiry. Instead of a bare status flag, use claimed_at timestamp + claimed_by worker ID. Any item claimed >N minutes ago by a dead worker is automatically released. This makes crash recovery continuous rather than requiring a restart — and it works in multi-worker scenarios where one worker dies but others are still alive.
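A sketch of lease-based claiming against a hypothetical items table with claimed_at and claimed_by columns (RETURNING needs SQLite ≥ 3.35):

```python
import sqlite3
import time

LEASE_SECONDS = 15 * 60  # illustrative TTL; tune to ~10x the stage's p99 duration


def claim_next_item(conn: sqlite3.Connection, worker_id: str):
    """Claim a pending item, or re-claim one whose lease has expired."""
    now = time.time()
    row = conn.execute(
        """
        UPDATE items
           SET status = 'in_progress', claimed_at = :now, claimed_by = :worker
         WHERE id = (
               SELECT id FROM items
                WHERE status = 'pending'
                   OR (status = 'in_progress' AND claimed_at < :expired)
                LIMIT 1)
        RETURNING id
        """,
        {"now": now, "worker": worker_id, "expired": now - LEASE_SECONDS},
    ).fetchone()
    conn.commit()
    return row[0] if row else None
```

A crashed worker's items simply become claimable again once their lease expires — no startup reset required.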
Preventions
Design Lease-based claiming instead of flag-based. The moment you write status='in_progress', ask: "what happens if the writer dies before writing status='done'?" If the answer is "it's stuck forever," you've designed a leak. Leases with TTL are the standard solution to this problem (see: SQS visibility timeout, Kubernetes pod eviction, distributed lock TTLs).
Observe Monitor age of in_progress items. An item that's been in_progress for >10x the p99 stage duration is almost certainly stuck. Alert on it.
Test Kill-the-runner test. Start runner, let it claim items, kill -9 it, start a new runner, verify items get processed. This is a basic resilience test that every queue system needs.
Bug 637 items stuck at stage-level in_progress
Startup reset fixed item-level status but not stage-level status inside JSON column.
Reset both item-level and stage-level in_progress on startup.
Deep Fix
Don't store the same concept in two places. Having status as a column AND stages.fetch: "in_progress" inside a JSON column means two sources of truth for "is this item being worked on?" They can (and did) disagree. The stage-level status should be the only source; derive the item-level status from it. Single source of truth, computed views.
Preventions
Design Single source of truth for state. Every piece of state should live in exactly one place. If you find yourself writing "reset X" and then later discovering you also need to "reset Y which is X stored differently," your schema has redundancy. Normalize it.
Test Invariant check: item status is consistent with stage statuses. A periodic assertion that status='pending' ⟹ no stage has 'in_progress' would have caught the 637 stuck items immediately.
Process When fixing a "stuck state" bug, grep for every place that state is stored. The first fix (item-level reset) was incomplete because it only addressed one representation. A checklist item: "Is this state stored anywhere else?"
Bug VTT extract returned silently empty
Missing VTT → empty result → downstream garbage → marked done.
Raise exception when VTT missing.
Deep Fix
Stages should declare their preconditions as assertions, not assumptions. If extract requires a VTT file, it should check for its existence before doing any work and raise a clear PreconditionFailed("VTT file not found at {path}"). The runner can then distinguish "precondition not met" (wait for upstream) from "stage failed" (something broke). This is the guard clause pattern applied to pipeline stages.
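A sketch of precondition checking in the runner, with an illustrative requires mapping:

```python
from pathlib import Path


class PreconditionFailed(Exception):
    """Inputs aren't ready: wait for upstream rather than marking the stage failed."""


# Illustrative: each stage declares the item fields it needs before it can run.
STAGE_REQUIRES = {
    "extract": ["vtt_path"],
    "enrich": ["raw_html"],
}


def check_preconditions(stage: str, item: dict) -> None:
    for key in STAGE_REQUIRES.get(stage, []):
        value = item.get(key)
        if not value:
            raise PreconditionFailed(f"{stage}: missing required input '{key}'")
        if key.endswith("_path") and not Path(value).exists():
            raise PreconditionFailed(f"{stage}: file not found at {value}")
```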
Preventions
Type Precondition declarations in stage metadata. Each stage config should list requires: ["vtt_path"]. The runner checks preconditions before invoking the handler. The handler never runs in an invalid state.
Test "Missing input" test for every stage. What does extract do when there's no VTT? What does enrich do when there's no raw_html? These are the most basic failure modes and the most commonly untested.
Design The "absent input" question belongs in the new-stage checklist. For every input a stage reads, document what happens when it's missing. If the answer is "I don't know," the stage isn't ready to ship.
Bug Earnings transcript marked done with no transcript
Handler returned None → runner converted to {} → "success" with no data.
Raise RetryLaterError; fix handler wrapper.
Deep Fix
The runner's None → {} conversion is the root bug. The framework should never coerce a handler's return value into something the handler didn't intend. None means "I have nothing to say" — the framework should treat that as an error, not as "success with empty data." The fix is: if result is None: raise StageError("Handler returned None") in the runner itself.
Preventions
Type Ban None as a valid handler return. Type annotation: async def process_stage(...) -> StageResult (not -> dict | None). A linter or runtime check in the runner catches violations. This single change prevents every "None-as-success" bug.
Test Runner contract test: verify that a handler returning None is treated as an error. This tests the framework's behavior, not the handler's — it's a one-time test that protects all handlers.
Design Principle: frameworks should be strict, handlers should be expressive. The runner (framework) should reject ambiguous returns. Handlers should use explicit success/failure types. Don't be "helpful" by converting garbage to something passable.
Bug yt-dlp metadata silently returned {} on auth errors
No error propagation from yt-dlp subprocess on bot/auth failures.
Raise on bot/auth errors.
Deep Fix
Subprocess wrappers must parse exit codes AND output to determine success. A generic run_subprocess() wrapper that returns stdout or {} on any error is dangerous. The wrapper should return a SubprocessResult(stdout, stderr, exit_code) and let the caller decide what constitutes success. Never swallow subprocess failures at the wrapper level.
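A sketch of a wrapper that reports rather than interprets (structure assumed, not the project's existing helper):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubprocessResult:
    stdout: str
    stderr: str
    exit_code: int


async def run_subprocess(*args: str) -> SubprocessResult:
    """Run the command and hand back everything; deciding what counts as
    success is the caller's job, not the wrapper's."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return SubprocessResult(out.decode(), err.decode(), proc.returncode)
```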
Preventions
Type Subprocess wrappers return structured results, not parsed output. Let the caller inspect exit code + stderr + stdout and decide. The wrapper's job is to run the command, not to interpret success.
Process For every external tool integration, document: "what does failure look like?" yt-dlp has at least 4 failure modes (bot detection, rate limit, missing content, network error). Each should be handled differently. If you haven't enumerated them, your error handling is incomplete.
Bug INSERT OR REPLACE destroys columns
INSERT OR REPLACE does DELETE + INSERT. Columns omitted from the INSERT list lose their existing values.
Use ON CONFLICT DO UPDATE SET col=excluded.col.
Deep Fix
This is a language-level footgun in SQLite. The deep fix is a convention: never use INSERT OR REPLACE — always use INSERT ... ON CONFLICT DO UPDATE (UPSERT). This should be enforced by a lint rule or code review checklist. Better yet, wrap all DB writes through a db.upsert(table, data, conflict_keys) helper that always uses the safe pattern.
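A sketch of the helper as a free function (the real thing might hang off a db object; table and column names must come from code, never from user input):

```python
import sqlite3


def upsert(conn: sqlite3.Connection, table: str, data: dict, conflict_keys: list[str]) -> None:
    """Always generates ON CONFLICT DO UPDATE, and only touches the columns in
    `data` -- columns not being set keep their existing values."""
    cols = list(data)
    placeholders = ", ".join(f":{c}" for c in cols)
    updates = ", ".join(f"{c} = excluded.{c}" for c in cols if c not in conflict_keys)
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    conn.execute(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) {action}",
        data,
    )
```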
Preventions
Process Ban INSERT OR REPLACE project-wide. Add to CLAUDE.md conventions. Grep for it in CI. It's never what you want when the table has columns you're not setting.
Type DB write helper that encapsulates the safe pattern. db.upsert() that always generates ON CONFLICT DO UPDATE. Developers never write raw INSERT for upserts.
Test Round-trip test: write row A, upsert partial row B on same key, verify A's untouched columns survive. This directly tests the data preservation property.
Bug Moltbook extract: pre-migration data invisible
Items fetched before migration had raw.json on disk but no entry in results table.
Check results table first, fall back to disk.
Deep Fix
Migrations must migrate data, not just schema. The schema migration (add results table) was incomplete because it didn't backfill existing items. A migration that changes where data lives without moving existing data creates a split-brain: some items in the old location, some in the new. The migration script should have included a backfill_from_disk_to_db() step.
Preventions
Process Migration checklist: "Does this migration change where data is read from? If yes, have you migrated all existing data to the new location?"
Test Post-migration verification: count items accessible via old path vs new path. They should be equal.
Design Avoid dual-read paths. The fallback (check DB, then disk) is technical debt. It's a workaround for an incomplete migration. Complete the migration, remove the fallback. Dual-read paths accumulate forever if you don't clean them up.
Bug Duplicate HTML served for different VIC ideas
Expired cookies → same HTML for different URLs → discovered after 100+ identical extractions.
Detect duplicate HTML by comparing lengths.
Deep Fix
Content validation should be semantic, not structural. Length comparison catches exact duplicates but misses near-duplicates. The real fix is a content fingerprint: hash the text content (stripped of boilerplate) and flag when multiple items share the same hash. Better yet, the fetch stage should verify that the fetched content actually relates to the requested entity — a cheap LLM check ("Is this about {company_name}?") catches both duplicates and wrong-content errors.
Preventions
Observe Content dedup metric: track unique content hashes vs total fetches. If 100 fetches produce 3 unique hashes, something is wrong. This is a distribution anomaly detector, not a per-item check.
Design Treat authentication state as a stage precondition. If the stage requires valid cookies, verify cookie freshness before making requests. Don't discover stale cookies after processing 100 items.
Test Fetch two different URLs and assert the responses differ. Trivial test, catches the entire class of "auth is broken and everything returns the same page."
Bug SQLite "closed database" in dashboard
DB functions called after conn.close(). Gradio hot-reload worsened it.
Use with closing(get_db()) as conn: everywhere.
Deep Fix
Connection lifecycle should be managed by one owner, not scattered across callers. Use a context-manager-based connection pool or a "unit of work" pattern where every DB operation happens within a with db_session() as s: block. The connection is opened, used, and closed in one scope — it can't leak or be used after close because the variable is scoped to the with block.
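A sketch of the unit-of-work pattern (the DB path is illustrative):

```python
import sqlite3
from contextlib import contextmanager

DB_PATH = "jobs.db"  # illustrative


@contextmanager
def db_session():
    """Open, use, commit (or roll back), and close in one scope; the connection
    can't leak because it never exists outside the with-block."""
    conn = sqlite3.connect(DB_PATH)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Callers write with db_session() as conn: and never hold a connection they could use after close.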
Preventions
Type Make bare connection objects impossible to obtain. If the only way to get a DB connection is with db_session() as conn:, you can't forget to close it. The API prevents the bug.
Design Gradio hot-reload invalidates module-level state. Any module-level connection, cache, or singleton will break on reload. Design for it: use function-scoped resources or a lazy-init pattern with staleness detection.
Bug 11 of 12 handlers broken by function rename
get_handler_log → get_handler_logger. asyncio.gather(return_exceptions=True) silently ate all ImportErrors.
Fix imports. Log exceptions from gather results.
Deep Fix
Two deep failures here. (1) asyncio.gather(return_exceptions=True) is a silent-failure factory — it converts crashes into list elements that nobody checks. Use a wrapper that logs/raises if any result is an exception. (2) Renaming without grep is a time bomb. Every rename should be accompanied by a project-wide search for the old name. IDE refactoring tools do this; manual renames don't.
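A sketch of that wrapper (name and logging are illustrative):

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def gather_or_raise(*coros):
    """Like asyncio.gather(return_exceptions=True), except failures can't be
    silently ignored: every exception is logged and the first is re-raised."""
    results = await asyncio.gather(*coros, return_exceptions=True)
    errors = [r for r in results if isinstance(r, BaseException)]
    for err in errors:
        log.error("gathered task failed: %r", err)
    if errors:
        raise errors[0]
    return results
```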
Preventions
Type Wrap asyncio.gather in a helper that raises on any exception result. async def gather_or_raise(*coros) that iterates results and re-raises the first exception. Use this instead of bare gather(return_exceptions=True) everywhere.
Test Import smoke test: import every handler module at test time. A single test file that does import jobs.handlers.vic_ideas etc. for every handler catches 100% of import-time errors. Takes <1s to run.
Process Rename checklist: grep for old name project-wide before committing. rg 'get_handler_log[^g]' would have found all 11 stale call sites.
Observe Startup health check: runner tries to import + instantiate every registered handler before entering the main loop. If any handler fails to import, the runner refuses to start. Fail loudly at startup, not silently at runtime.
Bug Stage errors lost during item transitions
Later stage errors overwrote earlier stage errors.
Store per-stage errors in stages JSON.
Deep Fix
Error storage should be append-only, not overwrite. Use an error log (array of {stage, error, timestamp}) instead of a single error field. This is the same principle as audit logs vs current state: you want the full history of what went wrong, not just the last thing.
Preventions
Design Multi-stage pipelines need per-stage error storage by construction. This should be a framework feature, not a per-handler responsibility. The runner should store errors keyed by stage automatically.
Test Two-stage failure test: fail stage A, then fail stage B, verify both errors are retrievable.
Gotcha Handler not async when dependency became async
Dependency made async, caller still sync → "cannot unpack non-iterable coroutine object."
Made entire chain async.
Deep Fix
Async is viral — changing one function to async requires changing all callers. The deep prevention is: when converting a function to async, immediately search for all call sites and convert them too. Better yet, design the system as async-from-the-start if any component might need it. Retrofitting async is always painful.
Preventions
Type Type checker (mypy/pyright) catches calling an async function without await. If the project had type checking enabled, this would be a static error, not a runtime surprise.
Process When making a function async, grep for all callers and update them in the same commit. Async conversions must be atomic — half-converted call chains always break.
Gotcha yt-dlp non-zero exit with valid output
Non-zero exit code for warnings, but stdout has valid content. Treating non-zero as failure threw away good data.
Check stdout content. Only error on non-zero + empty output.
Deep Fix
External tools have their own definition of "success" that doesn't match yours. The subprocess wrapper should define success criteria per-tool, not assume exit code 0 = success. For yt-dlp: success = "stdout contains valid JSON." For curl: success = "HTTP 200." The tool's exit code is a hint, not a contract.
Preventions
Design Per-tool success criteria. When integrating an external tool, document: "What constitutes success? What constitutes retriable failure? What constitutes permanent failure?" These three questions should be answered before writing the wrapper.
Process Read the tool's documentation on exit codes before assuming. yt-dlp's docs explicitly state that non-zero can mean warnings. This is a "RTFM prevents bugs" case.
Convention No asyncio.gather exception logging
return_exceptions=True silently eats exceptions.
Iterate results and log exceptions.
Deep Fix
Ban bare asyncio.gather(return_exceptions=True). Replace with a project utility: gather_logged(*coros, logger=log) that automatically inspects results and logs/counts any exceptions. Make the safe version the easy version.
Preventions
Type Utility function that makes the pit of success the default. If gather_logged is easier to use than raw asyncio.gather, developers will naturally use the safe version.
Process Lint rule: flag any use of return_exceptions=True without a corresponding exception check in the same scope.
Gotcha Metadata stage vacuously succeeded without transcript
Metadata ran on discovery data only — no transcript input — and succeeded vacuously.
Merged into single stage.
Deep Fix
A stage that succeeds without its primary input is lying. Every stage should have a "minimum viable input" assertion. If metadata enrichment needs a transcript and there isn't one, it should skip (not succeed) or fail (not silently produce empty enrichment). The broader principle: distinguish "nothing to do" from "did something successfully."
Preventions
Design Stage output validation: "did this stage produce meaningful output?" A stage that returns {} or produces identical output regardless of input is vacuous. The runner should flag it.
Test Feed a stage empty/missing input and verify it doesn't claim success. The vacuous success pattern is always testable: give it nothing, check that it doesn't say "done."
🛡️

Rate Limiting, Auth & Bot Detection

12 issues
Bug VIC rate-limit page passed structural checks
Rate-limit page had enough HTML structure to pass basic checks.
Three-layer defense: text detection, HTML validation, LLM sanity check.
Deep Fix
Structural checks are necessary but never sufficient for content from adversarial or unreliable sources. The deep fix is a layered validation architecture: (1) structural check (is it HTML?), (2) content fingerprint (does it look like previous valid responses?), (3) semantic check (does it contain expected entities/concepts?). Each layer catches what the previous misses. The key insight: the more sophisticated the source's failure modes, the more semantic your validation must be.
Preventions
Design Assume every external response might be garbage. Design validation as if the external source is actively trying to fool you. Not because it is, but because rate-limit pages, error pages, CAPTCHAs, and CDN failovers all produce structurally-valid-but-semantically-wrong responses. This mindset catches bugs before they happen.
Test Golden file tests: save actual rate-limit/error pages and assert the validator rejects them. Build a corpus of "things that look valid but aren't." Every new false positive gets added to the corpus.
Observe Track content-length distribution. Rate-limit pages are usually shorter than real content. A sudden shift in response size distribution signals something is wrong.
Bug VIC "Please wait 24 hours" — second rate-limit variant
Different text format for rate limit, not caught by first detection.
Added explicit string detection.
Deep Fix
String matching for error detection is a game of whack-a-mole. Each new rate-limit message requires a new string. The semantic approach (LLM: "Is this a rate-limit/error page?") catches all variants with one check. For lower cost, use a classifier trained on the corpus of known error pages. The principle: match on semantics, not syntax, when the source can vary its syntax.
Preventions
Design Semantic validation over pattern matching when the input space is open-ended. String matching works for known patterns; LLM/classifier works for unknown patterns. Use both: string matching as a fast first pass, semantic check as a safety net.
Process When adding a string pattern for error detection, ask: "Are there other messages I haven't seen yet?" If yes, you need a more general approach. One variant is an accident; two variants is a pattern; three means you need a classifier.
Bug YouTube bot detection blocked downloads
Datacenter proxy alone wasn't enough.
Proxy + cookies + BotDetectedError.
Deep Fix
Bot detection evasion requires multiple signals, not just IP rotation. Modern anti-bot systems check IP reputation + browser fingerprint + cookie state + behavioral patterns. The deep fix is an escalation ladder: try cheapest approach first (direct), escalate to proxy, escalate to proxy + cookies, escalate to full browser. Each level has a cost — track which level each target requires and auto-select.
Preventions
Design Fetch strategy as an escalation ladder, not a fixed approach. [direct → proxy → proxy+cookies → browser]. Start cheap, escalate on failure, remember what works per-domain.
Observe Track bot-detection rate per domain over time. A sudden spike means the target changed their detection. Alerts let you respond before the entire pipeline stalls.
Bug YouTube JS challenge solver broken
Missing --remote-components ejs:github flag across 7 call sites.
Add flag to all call sites.
Deep Fix
7 call sites with the same flag means the flag should be in one place. Create a ytdlp_command(url, **kwargs) -> list[str] factory that always includes the base flags. Individual callers add only their specific flags. DRY for command construction, not just for code.
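A sketch of the builder (the base flag set shown is illustrative; the point is that it lives in exactly one place):

```python
# The flags every invocation needs live here and nowhere else.
YTDLP_BASE_FLAGS = [
    "--remote-components", "ejs:github",
    "--no-progress",
]


def ytdlp_command(url: str, *extra_flags: str, proxy: str | None = None) -> list[str]:
    """Single place that knows what every yt-dlp call requires."""
    cmd = ["yt-dlp", *YTDLP_BASE_FLAGS]
    if proxy:
        cmd += ["--proxy", proxy]
    return [*cmd, *extra_flags, url]
```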
Preventions
Type Command builder pattern. A single function that constructs yt-dlp commands with all required base flags. Callers can't forget flags they don't know about.
Process When adding a "required for all calls" flag, grep for all call sites immediately. If there are >2, refactor to a shared builder first, then add the flag once.
Bug PO token requirement for YouTube
New YouTube requirement, downloads failing with unclear errors.
Install plugin, player_client fallback chain.
Deep Fix
External API changes are inevitable — design for them. The system should have a "YouTube health" check that periodically tests a known-good download and alerts when it fails. This detects platform changes before they affect the pipeline, not after 100 items fail. The principle: canary requests detect environmental changes before they cause damage.
Preventions
Observe Canary downloads: periodically test a known-good URL and alert on failure. A 5-minute cron job that downloads a short public video catches YouTube changes within minutes, not days.
Design Subscribe to yt-dlp release notes / changelog. PO token support was documented in yt-dlp releases before it became required. Tracking upstream tool changes is part of maintaining the integration.
Bug SSL/timeout errors classified as permanent failure
Network errors → permanent failure → never retried.
Classify as retry_later.
Deep Fix
Error classification should be a lookup table, not scattered conditionals. Maintain a mapping: {SSLError: RETRY, TimeoutError: RETRY, HTTPError(404): PERMANENT, HTTPError(429): RETRY, HTTPError(500): RETRY, ValueError: PERMANENT}. New error types get an explicit classification. Unknown errors default to RETRY with a logged warning — you don't know if it's transient, so assume it might be.
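A sketch of a centralized classifier (assuming httpx as the HTTP client; the exact exception list is illustrative):

```python
import ssl
from enum import Enum, auto

import httpx  # assumption: httpx is the project's HTTP client


class ErrorClass(Enum):
    RETRY = auto()
    PERMANENT = auto()
    SYSTEM_HALT = auto()


def classify_error(exc: Exception) -> ErrorClass:
    """One source of truth for classification; unknown errors default to RETRY."""
    if isinstance(exc, (ImportError, AttributeError)):
        return ErrorClass.SYSTEM_HALT                 # the code is broken, not the item
    if isinstance(exc, (ssl.SSLError, TimeoutError, httpx.TimeoutException)):
        return ErrorClass.RETRY
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        return ErrorClass.RETRY if code == 429 or code >= 500 else ErrorClass.PERMANENT
    if isinstance(exc, ValueError):
        return ErrorClass.PERMANENT
    return ErrorClass.RETRY                           # unknown: assume transient, log it
```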
Preventions
Type Centralized error classifier. One function: classify_error(exc) -> RetryLater | PermanentFail | SystemHalt. Every handler uses it. New error types get added to one place.
Design Default to retry for unknown errors. The cost of retrying a permanent error is wasted compute. The cost of permanently failing a transient error is lost data. The asymmetry favors retry as the default.
Bug Rate limit errors (429) tripping circuit breaker
Quota errors treated as permanent, triggering job pause.
Treat 429 as retry_later.
Deep Fix
The circuit breaker should distinguish "something is broken" from "we're going too fast." Rate limits aren't failures — they're flow control. The circuit breaker should only trigger on non-rate-limit consecutive failures. Rate limits should trigger backoff (slow down), not shutdown (stop). Two different mechanisms for two different problems.
Preventions
Design Separate rate-limit handling from error handling. Rate limits → adaptive backoff (slow down, increase delay). Errors → circuit breaker (stop after N consecutive). Mixing them means either: rate limits shut you down, or real errors don't shut you down. Both are bad.
Observe Metric: rate-limit-vs-error ratio. If 90% of "failures" are rate limits, the circuit breaker is wrong. This metric makes the misclassification visible.
Bug VIC cookie exchange was sync in async handler
Sync httpx.get() blocking the event loop.
Made async.
Deep Fix
Lint for sync I/O inside async functions. Any httpx.get(), requests.get(), open(), or time.sleep() inside an async def is a bug. A ruff/flake8 rule or a runtime warning when the event loop is blocked >100ms catches these mechanically.
Preventions
Type Convention: all handlers are async, all I/O uses async clients. Import httpx.AsyncClient as the default; don't even import the sync client in handler files.
Observe Event loop block detection. Python's asyncio debug mode warns when the event loop is blocked >100ms. Enable it in development.
Convention Direct "naked" fetches got IP blocked
No proxy → real IP exposed → blocked.
Always proxy by default.
Deep Fix
Proxy should be the default, not an opt-in. Create a project-wide HTTP client factory: get_http_client(proxy=True) where proxy defaults to on. A whitelist of domains that don't need proxy (localhost, LLM APIs). Any new domain automatically gets proxied. The footgun of forgetting to proxy disappears.
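A sketch of such a factory (proxy URL and allowlist are placeholders; recent httpx versions take proxy=, older ones use proxies=):

```python
import httpx

PROXY_URL = "http://proxy.internal:8080"                         # placeholder
NO_PROXY_DOMAINS = {"localhost", "127.0.0.1", "api.openai.com"}  # placeholder allowlist


def get_http_client(*, proxy: bool = True, domain: str | None = None) -> httpx.AsyncClient:
    """Proxy by default; opting out requires naming an allowlisted domain."""
    if proxy:
        return httpx.AsyncClient(proxy=PROXY_URL)
    if domain not in NO_PROXY_DOMAINS:
        raise ValueError(f"proxy=False is not allowed for domain {domain!r}")
    return httpx.AsyncClient()
```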
Preventions
Type HTTP client factory with proxy-by-default. Make the safe path the easy path. Opting out of proxy requires explicit proxy=False with a domain in the whitelist.
Process Grep for bare httpx.get/requests.get and flag them. All HTTP calls should go through the project client.
Gotcha VIC cookie has two layers (remember + session)
Persistent cookie in Chrome's DB, session cookie needs HTTP exchange.
Fallback chain: file → browser → HTTP exchange → cache.
Deep Fix
Document the auth flow before implementing. VIC's two-layer cookie system is discoverable by reading network traffic, but it was discovered by trial and error. For any authenticated target, spend 30 minutes with browser DevTools documenting the full auth flow before writing code. The cost of understanding upfront is far less than the cost of debugging auth failures in production.
Preventions
Process Auth flow documentation as a prerequisite for new handler development. Before writing the handler: document cookies, tokens, session lifecycle, expiry, and refresh mechanism. This becomes the specification for the auth code.
Test Auth health check: verify cookies are valid before starting a batch. A pre-flight check that makes one authenticated request and verifies the response is valid catches stale cookies before they ruin a batch.
Gotcha yt-dlp auth errors vs warnings indistinguishable by exit code
Non-zero for both fatal and non-fatal. Must parse stderr.
Parse stderr for specific error strings.
Deep Fix
Same as "yt-dlp non-zero with valid output" — per-tool success criteria. The yt-dlp wrapper should have a classify_result(stdout, stderr, exit_code) -> Success | Warning | AuthError | ContentError function that encodes all the tool's quirks in one place. Callers get clean typed results.
Preventions
Type Tool-specific result classifier. Encode every known failure mode of the external tool in a classifier function. Unknown failure modes get logged as warnings with full stderr.
Gotcha Retry + reprocess didn't unpause paused jobs
Items reset to pending but job stayed paused.
Retry/reprocess buttons auto-unpause.
Deep Fix
State transitions should be atomic and complete. "Retry" is not just "reset item status" — it's a compound operation: reset items + unpause job + clear circuit breaker counter. If any step is missing, the system is in an inconsistent state. Model operations as state machines where every transition specifies the full target state, not just one field.
Preventions
Design Define operations as state transitions, not field updates. retry() should be a single function that atomically moves from paused+failed_items to running+pending_items. Not three separate SQL updates that can partially complete.
Test End-to-end retry test: pause job → retry → verify items process. Tests the full operation, not just the SQL.
🔧

Stage Design & Pipeline

10 issues
Bug Extract stage ran before fetch
Stage dependencies were not inferred from YAML ordering — extract ran with no data because fetch hadn't completed yet.
Added explicit stage_deps declarations in the YAML pipeline config.
Deep Fix
Stages should declare their inputs as typed preconditions, not rely on implicit ordering. The runner should resolve a DAG from input/output declarations and refuse to schedule a stage whose preconditions aren't satisfied. Ordering in a YAML file is a serialization artifact, not a contract.
Preventions
Design Precondition-based scheduling over positional ordering. Any system where execution order is derived from list position will eventually have an insertion that violates implicit dependencies. Stages should declare what they need, and the scheduler should enforce it.
Test Dry-run mode that validates the DAG before executing. A pipeline should have a zero-cost validation pass that checks all preconditions are satisfiable and all dependencies form a valid DAG.
Type Make illegal orderings unrepresentable in the pipeline schema. Input/output type annotations on stages let a validator reject impossible orderings statically.
Design Monolithic transcript_and_meta stage
A single stage handled metadata + captions + whisper. When Whisper failed, all work retried.
Split into 4 independent stages: metadata, captions, whisper, merge.
Deep Fix
Stage granularity should match failure domains. If two operations can fail independently, they belong in separate stages. The litmus test: "can I retry just the part that failed?" If no, the stage is too coarse.
Preventions
Design One failure domain per stage. A stage that calls two external services has two independent failure modes. Bundling them forces redundant re-execution on partial failure.
Process The retry-scope test during stage design. "If this fails halfway through, what work gets thrown away?" If the answer includes completed, successful work, the stage needs splitting.
Design Stage order: transcript before scoring
Downloaded and transcribed every video before scoring relevance. Hundreds of irrelevant videos fully processed.
Reordered: score metadata first, transcribe only passing videos.
Deep Fix
Cheap filters before expensive transforms is a universal pipeline design principle. Every pipeline should be ordered by ascending cost, with each stage acting as a progressively more expensive quality gate. Same principle as database query optimizers pushing selective predicates down.
Preventions
Design Cost-aware stage ordering. Each stage should have an estimated cost. The framework should warn when a high-cost stage precedes a low-cost filter that could eliminate inputs.
Process Funnel metrics in pipeline dashboards. Track input count at each stage boundary. If stage N passes 95% to stage N+1, which then drops 80%, the filter is in the wrong position.
Design Transcript before metadata validation
Videos transcribed before basic metadata checks. Invalid videos consumed Whisper credits before rejection.
Moved metadata validation before transcription.
Deep Fix
Validate before investing resources — a specialization of cheap-before-expensive. Validation is nearly free and should always precede transformation. This is the pipeline analog of failing fast.
Preventions
Design Validation is always stage zero. Any pipeline accepting external input should have an explicit validation gate before the first resource-consuming stage.
Observe Track waste ratio per stage. "Items processed that were later rejected by a downstream filter." A non-zero waste ratio means a cheap check could have been hoisted earlier.
Design Groq primary vs YouTube captions — three changes in one day
Transcript source flipped 3× in one day: YouTube → Groq → YouTube+Groq fallback.
YouTube primary (free), Groq fallback (paid, higher quality).
Deep Fix
Strategy decisions with cost/quality implications need a decision matrix evaluated once, not iterative trial-and-error in production. Benchmark cost, quality, latency, and failure rate on a representative sample before committing.
Preventions
Process Decision records for strategy choices. Write a one-page decision record: options, criteria, measurements, choice, and reversal conditions. Forces evaluation before commitment.
Test A/B comparison on a sample before rollout. Run both strategies on 20 items. Compare quality, cost, latency. Decide from data.
Design Audio files accumulated on disk (10GB+)
Every video left MP3/M4A. 10GB+ accumulated with no cleanup.
keep_audio=False default; delete after transcription.
Deep Fix
Stages should own the lifecycle of their intermediate artifacts. Every file a stage creates should have an explicit disposition: keep (final), pass (input to next stage, then delete), or temp (delete on stage completion). Default = deletion, with explicit opt-in for retention.
Preventions
Design Artifact lifecycle as a framework primitive. Stages declare outputs with retention classes (ephemeral, intermediate, final). The framework enforces cleanup.
Observe Disk usage monitoring with per-pipeline attribution. Alert when growth exceeds expected bounds.
Process Stage review checklist: "What files does this stage create, and who deletes them?"
Design Enrichment coupled to fetch stage
LLM enrichment inside fetch. Couldn't rerun without re-downloading.
Separate enrich stage reading from cached data.
Deep Fix
Stages should be independently re-runnable. The test: "can I rerun just this stage with cached inputs?" If not, the stage has a hidden dependency on the execution of a prior stage, not just its output.
Preventions
Design Stage inputs must be fully materialized. A stage reads from persistent storage, never from in-memory state of a prior stage.
Test Isolated stage re-run test. Run each stage in isolation with pre-populated inputs. If it requires running prior stages, the stage has an undeclared dependency.
Design Two jobs with 93% discovery overlap
Two pipelines discovered nearly identical video sets. Both ran full processing.
Folded into single pipeline.
Deep Fix
Deduplication belongs at the discovery layer. When multiple pipelines can discover the same items, there should be a shared discovery registry with canonical ownership.
Preventions
Design Canonical item registry with dedup at ingestion. Content-addressable IDs in a shared registry. Overlap impossible by construction.
Observe Pipeline overlap detection. Periodically compute Jaccard similarity of item sets across pipelines. Alert when overlap exceeds 20%.
Design IR page crawl: garbage URLs from parked domains
Crawl hit parked domains → hundreds of junk URLs → 15s timeouts each.
Removed crawl strategy. Targeted API-based discovery.
Deep Fix
Discovery output should pass through a quality gate before entering the pipeline. Never trust discovery output — validate it as untrusted input.
Preventions
Design Discovery output is untrusted input. Apply validation (domain allowlists, pattern matching, content-type checks) before committing resources.
Observe Discovery quality metrics: yield rate by source. A source with 2% yield rate is a signal to investigate or remove.
Design Manual subs not preferred over auto-generated
Auto captions used by default; manual subs (higher quality) ignored.
Prefer manual subs; rename with .manual.vtt.
Deep Fix
When multiple sources provide equivalent data at different quality, declare an explicit quality hierarchy with preference ordering in configuration, not buried in code.
Preventions
Design Explicit preference ordering for equivalent inputs. Ranked list in config. Stage iterates and takes the first available.
Observe Track source provenance through pipeline. Know which variant was used — makes quality decisions auditable.
📦

Import & Startup Issues

5 issues
Bug litellm concurrent import race
Multiple async workers imported litellm concurrently on first use. litellm's __init__ isn't thread-safe — AttributeError and corrupted state.
Pre-import litellm before asyncio.run().
Deep Fix
Module initialization is a shared mutable resource. Heavy modules with side effects must be eagerly imported in the main thread before any concurrent execution begins. Python's import lock is necessary but not sufficient for modules with complex __init__ logic.
Preventions
Design Eager import of stateful modules before concurrency begins. Treat these imports as application bootstrap, not lazy loading.
Process Startup manifest: explicit list of pre-imported modules. New heavy-module dependencies added to this list during code review.
Test Concurrent cold-start integration test. Spin up N workers simultaneously with empty module caches and verify no import errors.
Bug lib.llm concurrent import partially initialized
"cannot import call_llm from partially initialized module" — multiple workers hit cold start simultaneously.
Eagerly import in runner.py before async work.
Deep Fix
Same root cause — the runner's bootstrap phase must import all modules workers will need before spawning any workers. Individual fixes per module are whack-a-mole; the fix is a centralized pre-import phase.
Preventions
Design Centralized bootstrap_imports() in the runner. Single auditable location for all worker dependencies.
Test Static analysis for imports used in worker code paths. Every module imported inside a worker function must also appear in the pre-import manifest.
Bug Lazy import ↔ circular import ping-pong
Lazy import to fix circular → caused concurrent race. Revert → brought back circular. Two fixes fighting.
Top-level import + pre-import, then refactored the circular dependency out.
Deep Fix
Circular imports are a dependency graph problem, not an import-ordering problem. The real fix is restructuring the module graph to eliminate cycles — extract shared interfaces into a leaf module.
Preventions
Design Layered dependency architecture with enforced direction. Define module layers (core → lib → handlers → runners) where dependencies only flow downward.
Test CI check for import cycles. Circular imports should be a blocking CI failure.
Process Ban lazy imports as a circular-import fix. "Circular import" is not an accepted reason for moving an import into a function body.
Gotcha Pointless ImportError catch on required dependency
try: import litellm / except ImportError: pass on a required module. Silent failure.
Removed try/except. Top-level import, fail immediately.
Deep Fix
try/except ImportError on a required dependency is always wrong. It converts a clear, immediate failure into a delayed, mysterious one. Defensive programming turned pathological.
Preventions
Process Convention: never try/except on required imports. A bare pass in the except clause is a code smell that the import is actually required.
Design Fail fast at the boundary, not deep in the call stack. Error detection as close to the cause as possible.
Convention Cross-handler imports create hidden dependencies
Handler A imported from Handler B → B refactored → A broke silently.
Shared utilities to jobs/lib/. Handlers never import from each other.
Deep Fix
Handlers should be leaf nodes in the dependency graph. When handlers form a graph among themselves, you lose the ability to modify, test, or deploy them independently.
Preventions
Design Handlers as leaf nodes: architectural constraint. May import from jobs/lib/, lib/, stdlib — never from jobs/handlers/.
Test Import graph validation in CI. Parse import statements in handlers; fail if any handler imports from another handler.
📊

Dashboard & UI Bugs

7 issues
Bug Double-refresh: high CPU + broken clicks
Callable passed to component + timer.tick = two competing refresh cycles. CPU doubled, clicks broken.
Pass static value; only timer triggers refresh.
Deep Fix
Understand the reactive ownership model before wiring data sources. Any framework with reactive bindings (Gradio, React, Svelte) has this trap: implicit reactivity + explicit triggers = duplication. Establish a single canonical refresh path. Audit every component for "who owns the refresh" and ensure exactly one source of truth.
Preventions
Design Single-writer principle for UI state. Every mutable UI element should have exactly one code path that updates it.
Test Refresh-count assertion. Instrument the data-fetch function with a counter. Assert exact N calls over a window matching the timer interval, not 2N.
Bug Heartbeat not cleared on runner exit
Stale heartbeat → dashboard showed "running" after runner stopped.
Clear heartbeat in finally/atexit handler.
Deep Fix
Heartbeats are leases, not flags. A lease says "I am alive if this was written within the last T seconds." With a TTL-based model, no cleanup code needed. Any liveness system that requires active cleanup on death is architecturally broken — death is precisely when cleanup code doesn't run.
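A sketch of TTL-based liveness against a hypothetical heartbeats table:

```python
import time

HEARTBEAT_TTL = 30  # seconds; illustrative staleness threshold


def write_heartbeat(conn, runner_id: str) -> None:
    # The writer only ever writes a timestamp; there is no cleanup path to forget.
    conn.execute(
        "INSERT INTO heartbeats (runner_id, beat_at) VALUES (?, ?) "
        "ON CONFLICT (runner_id) DO UPDATE SET beat_at = excluded.beat_at",
        (runner_id, time.time()),
    )
    conn.commit()


def is_running(conn, runner_id: str) -> bool:
    # The reader decides liveness from freshness; a SIGKILLed runner just goes stale.
    row = conn.execute(
        "SELECT beat_at FROM heartbeats WHERE runner_id = ?", (runner_id,)
    ).fetchone()
    return bool(row) and (time.time() - row[0]) < HEARTBEAT_TTL
```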
Preventions
Design TTL-based liveness over flag-based. Encode timestamp + staleness threshold. Reader interprets freshness; writer doesn't need cleanup.
Test Simulate SIGKILL, wait TTL+1, assert dashboard shows "stopped." If your liveness model survives SIGKILL, it survives everything.
Bug Moltbook handler signature mismatch
Positional args vs keyword-only. TypeError at runtime.
All handlers: keyword-only args matching dispatch convention.
Deep Fix
Handler interfaces must be machine-enforced contracts. Define a Protocol or ABC for the handler signature. Any function called via indirection (dispatch table, registry, plugin) needs a typed contract because call site and definition are decoupled.
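A sketch of the contract (the keyword-only signature and handler name are illustrative, not the project's actual convention):

```python
from typing import Any, Protocol


class StageHandler(Protocol):
    """The one calling convention for anything invoked via the dispatch table."""

    async def __call__(self, *, item: dict[str, Any], stage: str) -> dict[str, Any]: ...


async def moltbook_fetch(*, item: dict[str, Any], stage: str) -> dict[str, Any]:
    return {"id": item.get("id"), "stage": stage}


# Typing the registry with the Protocol lets mypy/pyright flag any handler
# whose signature drifts from the convention at edit time, not at runtime.
HANDLERS: dict[str, StageHandler] = {"moltbook_fetch": moltbook_fetch}
```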
Preventions
Type Protocol class for handler signatures. Static analysis catches mismatches at edit time.
Test Parametric handler-invocation test. Iterate all registered handlers, invoke each with mock args using the dispatcher's calling convention.
Gotcha Job kind misclassification
Backfill jobs labeled as monitor. Wrong dashboard tab + wrong metrics.
Changed to kind: backfill.
Deep Fix
Classification should be derived from structural properties, not manually assigned labels. Finite catalog = backfill. Polls endpoint = monitor. Infer kind from observable properties; the label becomes a computed property.
Preventions
Design Derive kind from job properties. Finite item list + sequential = backfill; periodic poll = monitor. Compute at registration time.
Process Enum, not string, for job kind. Typos caught at import time, valid set discoverable.
Bug p50 showing "—" for fast stages
0.0 elapsed is falsy in Python. Fast stages treated as missing data.
Check is not None explicitly.
Deep Fix
Python truthiness conflates "missing" with "zero" for every numeric type. The rule is absolute: if the variable can legitimately be zero, use is not None. Distinguish "absent" from "present but trivial" at the type level.
Preventions
Process Lint rule: ban bare truthiness checks on numeric variables. Require explicit is not None or > 0.
Test Test with zero values for every numeric display. Boundary value that must render, not disappear.
Bug Status column expanding despite max-width
table-layout: auto ignores width constraints.
table-layout: fixed.
Deep Fix
For any data table where you need predictable column widths, table-layout: fixed is required, not optional. With auto, the browser's content-driven algorithm overrides your declarations.
Preventions
Design Default to table-layout: fixed for all data tables. auto only for prose tables where content should dictate layout.
Test Visual regression with max-length content. Seed test data with maximum-length strings for every column.
Bug Startup message printed 3x in debug mode
Multiple init paths printing banner.
Guard with _initialized flag.
Deep Fix
Initialization must be idempotent. Separate "configure" from "start." Configuration is idempotent (set values, register). Starting is one-shot (launch threads, print banner). If tangled, multiple call sites trigger duplicates.
Preventions
Design Separate configure() from start(). Multiple configure() safe; multiple start() raises error.
Test Test that double-init is harmless. Call init twice, assert no duplicate side effects.
🔍

Discovery & Content Quality

9 issues
Bug Commentary videos matched as earnings calls
YouTube search returned commentary/reaction videos alongside actual calls.
Strict title matching + skip-word filters.
Deep Fix
Search results are candidates, not answers. YouTube's algorithm promotes commentary over primary sources. Fix: two-stage pipeline: broad recall → precision filtering. Channel verification, title patterns, duration heuristics, earnings calendar correlation.
Preventions
Design Two-stage discovery: recall then precision. Separate "find candidates" from "validate candidates." Measure both rates independently.
Test Golden set of known-good and known-bad results. 20 real calls + 20 commentary URLs. Assert 100% classification accuracy.
Design Channel allowlist for authoritative sources. Ticker → official YouTube channel ID mapping.
Bug Short tickers matched as substrings
LOW, CAT, GE matched as common English words in titles.
Short tickers (≤3 chars) require $SYM prefix or company name.
Deep Fix
The shorter an identifier, the more context you need to confirm a match. Length-adaptive matching rules: 1-2 char need $prefix or exact company name within 50 chars. 3 char need word boundary + financial context. 4+ char: word boundary alone.
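A simplified sketch of length-adaptive matching (the context heuristics are illustrative, and the "$ prefix or company name within 50 characters" rule is loosened to a document-level check):

```python
import re


def match_ticker(ticker: str, text: str, company_name: str | None = None) -> bool:
    """The shorter the ticker, the more corroborating context it needs."""
    dollar = re.search(rf"\${re.escape(ticker)}\b", text)
    bounded = re.search(rf"\b{re.escape(ticker)}\b", text)
    if len(ticker) <= 2:
        # 1-2 chars: only a $-prefixed symbol or the company's name counts.
        return bool(dollar) or (
            company_name is not None and company_name.lower() in text.lower()
        )
    if len(ticker) == 3:
        # 3 chars: word boundary plus some financial context in the same text.
        context = re.search(r"\b(earnings|stock|shares|NYSE|NASDAQ)\b", text, re.I)
        return bool(dollar) or (bool(bounded) and bool(context))
    return bool(bounded)  # 4+ chars: a word boundary alone is enough
```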
Preventions
Design Length-adaptive matching tiers. Codified in a shared matching utility.
Test Adversarial test corpus for collision-prone tickers. 50 most collision-prone tickers with true/false positive sentences.
Bug $AA matched $AAL, T matched TXN
Substring containment. Transcripts attached to the wrong companies.
Regex word-boundary matching.
Deep Fix
Identifiers must always be matched with boundary awareness. Build a shared match_ticker() utility so this decision is made once and correctly.
Preventions
Type Shared match_identifier() utility. Enforces word boundaries by default. Callers never write their own regex.
Process Ban bare in for identifier matching. The in operator is for collection membership, not pattern matching in strings.
Bug January earnings = Q4 prior year
Jan 2026 earnings tagged as Q1 2026 instead of Q4 2025.
Fixed quarter derivation: Jan earnings = Q4 prev year.
Deep Fix
Event dates and reporting periods are fundamentally different temporal concepts. An earnings call has when-it-happens and what-it-reports-on. These can differ by 2-8 weeks and across year boundaries. Model them as separate fields.
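A sketch of deriving the reported period from the event date, assuming a calendar fiscal year and that a call covers the most recently completed quarter:

```python
from datetime import date


def reported_fiscal_period(event_date: date) -> tuple[int, int]:
    """(fiscal_year, quarter) an earnings call reports on -- not the quarter
    the call happens in. A January/February call reports Q4 of the prior year."""
    completed = (event_date.month - 1) // 3  # quarters fully completed this year
    if completed == 0:
        return event_date.year - 1, 4
    return event_date.year, completed


assert reported_fiscal_period(date(2026, 1, 15)) == (2025, 4)
assert reported_fiscal_period(date(2026, 4, 30)) == (2026, 1)
```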
Preventions
Design Separate event timestamp from reporting period in the data model. event_date and fiscal_period as independent fields.
Test Boundary-month test cases. Jan 1, Jan 31, Feb 28, Dec 31 — assert correct fiscal quarter for each.
Bug Sub-entity names polluted discovery
"Samsung Electronics (Foundry Division)" stored as separate entity.
NOT LIKE '% (%' filter.
Deep Fix
Entity normalization must happen at write time, not read time. The NOT LIKE filter is a read-time bandage applied everywhere. The structural fix: resolve-on-write — normalize entity names before insertion.
Preventions
Design Resolve-on-write for all entity names. Strip parenthetical qualifiers, apply alias mappings, resolve to canonical form before storing.
Test Uniqueness invariant: no entity contains parenthetical qualifiers after insertion.
Bug Entity dedup: 1639→987 companies
No normalization on insert. 28 Samsung entries. 9663→3727 relationships after dedup.
Entity resolution on insert (resolve_company + add_company).
Deep Fix
Deduplication is a data quality problem that compounds silently. Each duplicate looks harmless; collectively they fragment analytics and under-count aggregations. Fix: resolve-on-write pipeline (normalize, fuzzy-match existing, merge or create). Any append-only store without a dedup strategy accumulates garbage proportional to input diversity.
Preventions
Design Entity resolution pipeline on write path. Normalize → fuzzy-match → merge or create. Standard record linkage pattern.
Test Insert-twice-get-one test. Insert same entity with two surface forms, assert one row.
Observe Duplicate ratio metric. Periodic fuzzy clustering; alert if estimated duplicate rate exceeds 5%.
Bug Moltbook used nonexistent API endpoint
/posts/trending didn't exist. Zero items. Job appeared healthy.
Fixed endpoint. Added status code checking.
Deep Fix
Zero results from discovery should be a failure signal, not success. The "silent zero" antipattern: the most dangerous failures look like success. Distinguish "found nothing" from "couldn't look."
Preventions
Observe Alert on zero-yield discovery runs. Historical average > 0 but current run = 0 → fire alert.
Design HTTP client that raises on non-2xx by default. 404s become exceptions, not silent empty responses.
Design No LLM sanity check on scraped content
Structural checks passed for wrong content — login walls, error pages, unrelated articles.
Cheap LLM sanity check (~$0.0001/call).
Deep Fix
Structural validation and semantic validation answer different questions; you need both. A login wall has perfect HTML structure. Graduated validation pyramid: structural (free) → statistical (microseconds) → semantic/LLM (pennies). Only content passing all three layers is ingested.
Preventions
Design Graduated validation: structural → statistical → semantic. Fail fast at cheap layers; expensive layers only for content that passes cheap ones.
Test Adversarial content test suite. Login walls, 404 pages, unrelated articles. Assert all rejected.
Gotcha Base64 images inflate HTML for LLM calls
100KB+ base64 images consumed thousands of tokens. Zero analysis value.
Strip data: URIs. Cap at 50K chars.
Deep Fix
Every byte sent to an LLM has a cost. Content must pass through a token-budget-aware preprocessing pipeline: parse HTML → extract text → remove binary/encoded data → estimate tokens → truncate if over budget.
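A sketch of such a preprocessing step (regexes are a rough stand-in for a real HTML parser; the 50K cap mirrors the surface fix, with characters as a cheap proxy for tokens):

```python
import re

MAX_CHARS = 50_000  # chars as a cheap proxy for a token budget

SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")
TAG = re.compile(r"<[^>]+>")


def html_for_llm(html: str, max_chars: int = MAX_CHARS) -> str:
    """Strip what the model can't use (scripts, styles, base64 blobs, markup),
    then enforce the budget before the text reaches any LLM call."""
    text = SCRIPT_OR_STYLE.sub(" ", html)
    text = DATA_URI.sub(" ", text)
    text = TAG.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]
```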
Preventions
Design Token budget estimation before every LLM call. If over budget, trigger preprocessing. Never send raw scraped content without budget checking.
Design HTML-to-LLM preprocessing function. Standardized: strip scripts, styles, base64, SVGs, comments, whitespace. Sits between every scraper and every LLM call.
🏗️

Architecture Evolution

8 issues
Design File-per-item storage doesn't scale
One directory per item with metadata.json. Thousands of small files at 1000+ items.
Results table in SQLite.
Deep Fix
Storage strategy must be chosen based on expected item count, not developer convenience. File-per-item is fine for <50 items with human browsing. For machine-processed collections, a database is the only sane default.
Preventions
Design "Will this exceed 100 items?" → database. If uncertain, database. Files for artifacts humans open directly.
Process Design review: data volume. Every new data store should document expected cardinality.
Design Consolidate 1145 HTML files into SQLite
VIC ideas as 1,145 individual HTML files.
Single SQLite DB with full-text search.
Deep Fix
Same principle, rediscovered independently. Convention enforcement prevents rediscovery cost. Without the rule, every project re-discovers the file-per-item antipattern.
Preventions
Process Pattern library for solved problems. Record the decision rule, not just the fix. "Collections >100 → SQLite" prevents the entire class.
Design No cost tracking for LLM stages
Prompt bug could burn API budget silently.
Handlers return _cost. Runner logs. Guard auto-pauses.
Deep Fix
Every resource-consuming operation needs metering as a first-class concern. Cost isn't a reporting feature — it's a safety mechanism. A stage that consumes resources without reporting cost can bankrupt you silently.
Preventions
Type Cost as a required StageResult field. If a handler doesn't report cost, the runner flags it.
Observe Budget circuit breaker. Per-job and per-run cost ceilings. Auto-pause + alert when exceeded.
Design Sequential stage iteration → async workers
One item through all stages before next. Stage B idle while A processes.
Independent async worker per stage with semaphore concurrency.
Deep Fix
Pipeline parallelism is the natural model for independent stages. When stages don't share mutable state, they can and should run concurrently. Design for the execution model that matches the data flow.
Preventions
Design Identify parallelism during pipeline design. Draw the data flow graph. If stages connect by queues (not shared state), they can run in parallel.
Design Single-stage pipeline → multi-stage
All processing in one process() call. Failure at minute 8 = redo all 10 minutes.
Multi-stage with stages JSON. Each stage commits independently.
Deep Fix
Stage decomposition is about failure isolation, not code organization. The question to ask: "What is the most expensive work I would have to redo on failure?" Draw stage boundaries right after the most expensive operations, so their output is committed and never has to be redone.
Preventions
Design "What's the most expensive redo on failure?" identifies boundaries better than any architecture diagram.
Type Stage result persistence is non-negotiable. Framework must enforce: every stage persists output before next stage begins.
Design No output versioning
Code changes didn't trigger reprocessing. Stale results persisted.
Hash handler source. Detect stale items. "Reprocess Stale" button.
Deep Fix
Data lineage: every output must know what code and inputs produced it. Versioned computation = storing (input_hash, code_hash, output) so staleness is detectable, not guessable.
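A sketch of the staleness check, assuming stored results carry `code_hash` and `input_hash` fields (the record shape is illustrative):

```python
import hashlib
import inspect
import json

def code_hash(handler) -> str:
    """Changes whenever the handler's source changes."""
    return hashlib.sha256(inspect.getsource(handler).encode()).hexdigest()[:16]

def input_hash(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:16]

def is_stale(stored: dict, handler, payload: dict) -> bool:
    """Stale if either the code or the inputs that produced the output changed."""
    return (stored.get("code_hash") != code_hash(handler)
            or stored.get("input_hash") != input_hash(payload))
```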
Preventions
Design Versioned computation. Every derived artifact stores the hash of its code + inputs. Staleness detection becomes a comparison.
Observe Staleness dashboard. % of items matching current handler hash. Sudden drop after code change → shows reprocessing scope.
Perf IRS ZIP: downloaded 300MB–3.5GB for a few KB of XML
Full ZIP download to extract individual small files.
HTTP range requests — kilobytes instead of gigabytes.
Deep Fix
Understand a data format's access patterns before choosing a download strategy. ZIP keeps a central directory that can be read with a small range request. A 1000× saving was available from day one; the only cost was reading the format spec.
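A sketch of the first range request, assuming the `requests` library and a server that honors `Range` headers; it locates the ZIP end-of-central-directory record in the file's tail:

```python
import struct
import requests  # assumed available; server must support Range requests

def remote_zip_directory(url: str, tail_bytes: int = 65_536) -> tuple[int, int]:
    """Return (central_directory_offset, central_directory_size) from one small request.

    A second range request for that slice lists every member; a third fetches
    only the member you need. Kilobytes instead of gigabytes.
    """
    tail = requests.get(url, headers={"Range": f"bytes=-{tail_bytes}"}, timeout=30).content
    eocd = tail.rfind(b"PK\x05\x06")      # end-of-central-directory signature
    if eocd == -1:
        raise ValueError("EOCD not found in tail; increase tail_bytes")
    cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12: eocd + 20])
    return cd_offset, cd_size
```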
Preventions
Design Check if random access is possible before bulk download. "Do I need all of it?" → check if format and server support partial access.
Process Cost model for data acquisition. Estimate download size × frequency. If absurd, there's a better approach.
Design Round-based discover → process → sleep loop
Sequential rounds with full-round sleeps. New items waited for round completion.
Persistent async workers, continuous flow. 1–5s backoff.
Deep Fix
Items should flow through the pipeline as a stream, not accumulate into batches. Polling is for checking external state; processing should be triggered by item availability, not by a timer.
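A sketch of the decoupled discovery side (the `discover` callable and backoff values are illustrative; workers simply `await queue.get()` on the other end):

```python
import asyncio

async def discovery_loop(queue: asyncio.Queue, discover, busy_sleep=1, idle_sleep=5) -> None:
    """Discovery writes to the queue on its own cadence; processing never waits for a round."""
    while True:
        new_items = await discover()          # poll external state
        for item in new_items:
            await queue.put(item)             # items flow immediately
        # Short backoff instead of a full-round sleep.
        await asyncio.sleep(busy_sleep if new_items else idle_sleep)
```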
Preventions
Design Items should flow, not wait for round boundaries. Default to item-level flow (async queue → worker pool). Batch only when there's a genuine reason.
Design Separate discovery cadence from processing cadence. Decouple with a queue: discovery writes, workers read.
🤖

LLM Integration & Conventions

6 issues
Bug max_tokens truncated JSON responses
max_tokens too low → truncated mid-JSON → broke parsing.
Remove max_tokens entirely for structured output.
Deep Fix
Never cap output length when the output format requires completeness. JSON, XML, YAML are all-or-nothing: truncated = garbage. If cost is a concern, reduce input size or use a cheaper model — don't truncate structured output.
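A sketch of a call helper that owns these parameters, so callers cannot reintroduce a cap (`client.complete` is a placeholder for whatever SDK call the project actually uses):

```python
import json

def llm_structured(client, prompt: str, schema=None) -> dict:
    """Structured-output calls: no max_tokens, validate the response immediately."""
    raw = client.complete(prompt=prompt, temperature=0)   # note: max_tokens never passed
    try:
        data = json.loads(raw)     # catches truncation right at the boundary
    except json.JSONDecodeError as exc:
        raise ValueError(f"non-JSON response (truncated?): {raw[:200]!r}") from exc
    if schema is not None:
        schema.model_validate(data)   # e.g. a Pydantic model for structural checks
    return data
```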
Preventions
Type Structured output mode auto-omits max_tokens. LLM call helper enforces this. Impossible to accidentally truncate structured responses.
Test Validate structure immediately. json.loads() on raw response catches truncation. Pydantic/jsonschema catches structural problems.
Convention Native web search inconsistent across providers
native_web_search=True gave wildly different behavior per provider.
Standardized on tools=["web_search"] (Serper API).
Deep Fix
Abstract over provider differences at the tool layer. It's better to own the implementation of a capability than to depend on N providers implementing it identically. Provider-specific flags that "should work the same" are a reliability trap.
Preventions
Design Provider-agnostic tool interface. Important capabilities implemented once in your tool layer, not delegated to provider-specific features.
Convention temperature=0 for structured extraction
Default temperature added noise to factual extraction.
temperature=0 for all extraction.
Deep Fix
LLM parameters should be determined by task type, not left to defaults. Extraction wants determinism. Generation wants randomness. A task-type → LLM-config mapping makes correct parameters automatic.
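A minimal sketch of such a mapping (anything beyond the temperature values is illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    temperature: float
    # other knobs (top_p, model tier, ...) would live here too

PRESETS = {
    "extraction": LLMConfig(temperature=0.0),   # determinism for factual extraction
    "generation": LLMConfig(temperature=0.7),   # variety is the point
}

def config_for(task_type: str) -> LLMConfig:
    """Callers specify intent ('extraction'), not parameters."""
    return PRESETS[task_type]
```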
Preventions
Type Task-type → LLM config presets. extraction(temperature=0), generation(temperature=0.7). Callers specify intent, not parameters.
Convention Structured logging via get_handler_logger
f-string logs not machine-parseable.
Structured kwargs: log.info("scored", score=42, source="youtube").
Deep Fix
Logs are data, not messages. Structured logs are queryable from day one. The cost of structured logging is near zero at write time; the cost of unstructured logging compounds at read time.
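A minimal sketch of what a `get_handler_logger` built on the standard library might look like; the real helper may differ, or simply wrap structlog:

```python
import json
import logging

class StructuredLogger:
    """Kwargs in, one JSON object per line out: queryable from day one."""

    def __init__(self, name: str):
        self._log = logging.getLogger(name)

    def info(self, event: str, **fields) -> None:
        self._log.info(json.dumps({"event": event, **fields}))

def get_handler_logger(name: str) -> StructuredLogger:
    return StructuredLogger(name)

# log = get_handler_logger("scoring")
# log.info("scored", score=42, source="youtube")
```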
Preventions
Process Structured logging from day one. Retrofitting is painful. Starting structured is free — just a different calling convention.
Observe Log-based metrics. When logs are structured, metrics come for free: count by type, compute percentiles, track rates.
Convention Raw data first: cache artifacts for re-extraction
No raw caching → schema changes required re-fetching (rate-limited, deleted, paywalled).
Fetch stores raw. Extract parses from cache.
Deep Fix
Separate data acquisition from data interpretation. Acquisition is expensive, rate-limited, non-deterministic, sometimes irreversible. Interpretation is cheap, repeatable, improvable. By caching raw, better extraction logic is a free upgrade across your entire historical dataset.
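A sketch of the split, with acquisition and interpretation as separate functions over an immutable raw layer (paths and callables are illustrative):

```python
from pathlib import Path

RAW_DIR = Path("raw")   # immutable raw layer

def fetch(item_id: str, url: str, http_get) -> Path:
    """Acquisition: expensive, rate-limited, sometimes unrepeatable. Store bytes as-is."""
    raw_path = RAW_DIR / f"{item_id}.html"
    if not raw_path.exists():               # never re-fetch what is already cached
        raw_path.parent.mkdir(parents=True, exist_ok=True)
        raw_path.write_bytes(http_get(url))
    return raw_path

def extract(item_id: str, parse) -> dict:
    """Interpretation: cheap and repeatable. Re-run freely as the parser improves."""
    return parse((RAW_DIR / f"{item_id}.html").read_bytes())
```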
Preventions
Design Immutable raw layer. Cheap (storage is cheap) and invaluable (re-fetching is expensive, sometimes impossible).
Design Fetch and extract are always separate stages. Fundamentally different failure modes, costs, and improvement cadences.
Convention Documents in system prompt with XML tags for caching
Documents in user prompt re-sent every call. No prompt caching benefit.
Static documents in system prompt → 80–90% cost reduction via caching.
Deep Fix
Prompt architecture is cost architecture. Most providers cache the system prompt prefix. Static context there, dynamic content in user prompt. A 10× cost reduction from prompt restructuring dwarfs model-switching savings.
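A sketch of the restructuring: static documents go into the system prompt inside XML tags, and only the question varies per call. Provider-specific cache controls are omitted; check your provider's prompt-caching rules for what counts as a cacheable prefix.

```python
def build_prompts(static_documents: list[str], question: str) -> tuple[str, str]:
    """Return (system_prompt, user_prompt); the system prompt is identical across calls."""
    docs_block = "\n".join(
        f"<document index='{i}'>\n{doc}\n</document>"
        for i, doc in enumerate(static_documents)
    )
    system_prompt = (
        "Answer using only the documents below.\n"
        f"<documents>\n{docs_block}\n</documents>"
    )
    return system_prompt, question   # user prompt: small, changes every call
```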
Preventions
Design Prompt architecture for cost. Treat prompt construction as a caching problem. Static context in cacheable prefix; dynamic in user prompt.
Process Periodic cost review of top-5 LLM call patterns. Check if static content is in cacheable prefix. A few hours of restructuring can halve monthly costs.

Cross-Cutting Principles

These principles each prevent 5+ issues from the list above. They're the highest-leverage investments.

1. Make illegal states unrepresentable

Return Result[T, Error], not None. Design handler return types that encode non-emptiness. Use StageResult.success(data) that validates at construction time. When the type system makes it impossible to represent an invalid state, you don't need runtime checks — the code won't pass the type checker. Eliminates the entire class of "None/empty treated as success" bugs, the hardest to detect because they produce silent corruption.

Prevents: empty result as success, None propagation, silent data loss, unvalidated LLM responses, stages on incomplete input
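A minimal sketch of such a result type (field names are illustrative; the point is that `success()` refuses to construct an empty success):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class StageError:
    kind: str        # "transient" | "permanent" | "systemic"
    message: str

@dataclass(frozen=True)
class StageResult:
    data: dict[str, Any] | None
    error: StageError | None

    @classmethod
    def success(cls, data: dict[str, Any]) -> "StageResult":
        if not data:
            # "Empty success" is unrepresentable: constructing it is itself an error.
            raise ValueError("StageResult.success requires non-empty data")
        return cls(data=data, error=None)

    @classmethod
    def failure(cls, kind: str, message: str) -> "StageResult":
        return cls(data=None, error=StageError(kind, message))
```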

2. Distinguish silence from health

Zero throughput should alarm, not reassure. A dashboard showing "0 errors" when nothing is processing isn't healthy — it's brain-dead. Track items-in vs items-out at every stage boundary. Alert when throughput falls to zero. The most dangerous state isn't "failing loudly" — it's "doing nothing quietly," because humans interpret silence as health.

Prevents: dashboard queued but nothing moves, silent stalls, worker pools draining to zero, discovery stopping, stages silently skipping all items

3. Classify errors by kind, not by source

Every error is: transient (retry with backoff), permanent (fail this item), or systemic (halt the job). One classifier function makes this decision. When a new error appears, add one line to the classifier — not a new try/except in every stage.

Prevents: rate limit trips circuit breaker, SSL as permanent, transient fails item permanently, bot detection retried infinitely, cookie expiry as item failure
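A sketch of a single classifier function; the rules shown are illustrative, not the project's actual table:

```python
def classify(exc: Exception) -> str:
    """One decision point: transient (retry), permanent (fail item), systemic (halt job)."""
    text = str(exc).lower()
    if isinstance(exc, (TimeoutError, ConnectionError)) or "rate limit" in text:
        return "transient"    # retry with backoff; never trip the circuit breaker
    if "bot detection" in text or "login required" in text:
        return "systemic"     # halt the job; per-item retries make it worse
    return "permanent"        # fail this item, keep the job moving

# A new error kind means one new line here, not a new try/except in every stage.
```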

4. Cheap before expensive

Validate before downloading. Score before transcribing. Free API before paid. Build a cost model and order stages by ascending cost with early termination at each gate. The cheapest operation that can reject an item should run first. A pipeline downloading 3.5GB to check if a file exists reveals cost wasn't part of the design thinking.

Prevents: downloading GBs for KBs, transcribing filtered-out audio, LLM on invalid content, re-fetching unchanged data, expensive calls on duplicates

5. Stages declare preconditions, framework enforces them

Each stage declares required inputs. The runner checks before invoking. Handlers never run in an invalid state, so handler code can assume valid inputs — eliminating defensive checks and the bugs from getting those checks wrong.

Prevents: extract before fetch, analysis on empty transcript, metadata without required fields, out-of-order after partial failure, crashes on missing input

6. Every fallback is a lie until tested

Fallbacks handle rare conditions, which means they rarely run, which means they rarely get tested, which means they rarely work when needed. Worse: a fallback producing wrong results silently is more dangerous than no fallback. If it runs for a week unnoticed, is the data correct? If you can't answer confidently, remove it — fail fast is safer.

Prevents: silent degradation, fallback masking real error, backup source returning stale data, retry succeeding with corrupt data, plausible-but-wrong defaults

7. Canary the environment

External dependencies change without notice. Don't wait for the pipeline to fail on real work — periodically test with known-good inputs and expected outputs. A canary that runs hourly and checks "can I still fetch this URL and get this field?" catches platform changes before they corrupt a full run. Cost is trivial; early warning is invaluable.

Prevents: PO token expiry after 1000 failures, bot detection breaking pipeline for days, cookie expiry causing silent auth failures, API schema changes, rate limit policy changes

8. Separate data acquisition from data interpretation

Fetching is expensive, rate-limited, non-deterministic, sometimes irreversible. Interpreting is cheap, repeatable, improvable. By storing the raw artifact before any transformation, better extraction logic is a free upgrade across your entire history. Schema changes don't require re-crawling. Raw is immutable. Interpretation is versioned. The single most valuable pattern for data pipelines.

Prevents: schema change requires re-fetch, extraction bug requires re-crawl, prompt improvement can't reach history, debugging requires reproducing fetch, data loss from source changes