Deep Fix & Prevention Analysis

For each issue in the Jobs Issues Extraction: the structural fix, and the design/process/testing failures that allowed it to exist.

Every bug has a surface fix (the patch that shipped) and a deep fix (the structural change that makes the class of bug impossible). Every bug also has preventions — things that, had they been in place, would have caught it before it mattered. Preventions are tagged: Type (API/type design), Test (automated testing), Observe (monitoring/alerting), Design (architectural review), Process (team convention or checklist).

Recurring Deep Principles (by frequency across all 65 issues)

Type Make illegal states unrepresentable (19 issues)
Observe Distinguish silence from health (14 issues)
Design Classify errors into retry/fail/degrade (12 issues)
Test Contract tests at stage boundaries (11 issues)
Design Cheap validation before expensive work (10 issues)
Process Require end-to-end smoke test for new stages (9 issues)
Type Return Result[T, Error], never None-as-success (8 issues)
Observe Alert on output shape, not just error absence (7 issues)
Design Stages declare preconditions explicitly (6 issues)
Process Adversarial "what if external returns garbage" review (6 issues)
💀

Silent Failures & Data Integrity

18 issues
Bug Empty IB response marked as done
IB disconnected → empty data → item marked done → permanently lost from queue.
Raise RetryLaterError when no data returned.
Deep Fix
Stage handlers should return a typed result object (StageResult) that cannot be constructed without data. The runner should reject None/{} at the framework level, not rely on each handler to remember to check. A StageResult.success(data) factory that validates non-emptiness makes "empty success" impossible to express.
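A minimal sketch of such a result type (the names StageResult and EmptyResultError are illustrative, not from the codebase):

```python
from dataclasses import dataclass
from typing import Any


class EmptyResultError(ValueError):
    """Raised when a handler tries to declare success without any data."""


@dataclass(frozen=True)
class StageResult:
    data: dict[str, Any]

    @classmethod
    def success(cls, data: dict[str, Any]) -> "StageResult":
        # The only sanctioned way to build a success result: an empty payload
        # cannot be expressed as success, it raises instead.
        if not data:
            raise EmptyResultError("StageResult.success() called with empty data")
        return cls(data=data)
```

If the runner only accepts StageResult, a handler that falls through with None or {} fails loudly at the framework boundary instead of being recorded as done.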
Preventions
Type Return type that encodes non-emptiness. If process_stage() returns StageResult instead of dict | None, the runner can enforce the contract. The bug wasn't in the handler — it was in the API that let handlers return nothing and call it success.
Test Fault injection test: "what does the stage do when the dependency returns empty?" Mock IB returning {}. If the test passes (item stays pending), you've tested the contract. This test would have caught every "empty = success" bug in this category.
Observe Track items-completed-with-no-data as a metric. A dashboard counter for "done items with empty results" would have surfaced this before 100s of items were lost. Absence of output is a signal, not a non-event.
Design "What happens when the external system is down?" should be a mandatory design question for every stage that depends on an external service. The IB handler was written assuming IB is always up — it should have been written assuming IB is usually down.
Bug IB fallback silently used ephemeral connections
Stockloader not running → fell back to ephemeral IB connections → silently degraded, missed pacing.
Remove fallback. Fail fast with RetryLaterError.
Deep Fix
Fallbacks are lies unless they're logged and metered. The real fix is a policy: every fallback path must (a) emit a warning-level log, (b) increment a "degraded" counter, and (c) have its own acceptance criteria (does the fallback actually produce correct results?). An untested, unmonitored fallback is worse than a crash — it gives you wrong answers with high confidence.
Preventions
Design The "fallback review" principle: for every fallback path, ask "if this ran for a week without anyone noticing, would the data be correct?" If no, it's not a fallback — it's a silent corruption vector. Remove it.
Observe Alert when fallback path activates. A fallback that fires silently is indistinguishable from normal operation. The whole point of a fallback is that something went wrong — that should be visible.
Process Convention: "no silent degradation." If a stage can't do its job at full quality, it must either retry or fail — never silently produce lower-quality output. This prevents an entire class of "it worked but the data was wrong" bugs.
Bug Circuit breaker: infinite retry, dashboard showed "queued"
Import errors caught, logged, retried forever. Dashboard showed "queued" — nothing processed.
Track consecutive failures. After 3, auto-pause with error visible in dashboard.
Deep Fix
The runner's error handling conflated transient errors (network timeout — retry makes sense) with deterministic errors (import error — retry is insane). A proper error taxonomy with at least three classes — RetryLater (transient), PermanentFailure (this item is broken), SystemFailure (the stage itself is broken, stop everything) — would have made the circuit breaker automatic. An import error is always a SystemFailure: retrying won't fix missing code.
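A sketch of what that taxonomy and the runner-side dispatch could look like (exception and method names are assumptions, not the project's actual API):

```python
class RetryLaterError(Exception):
    """Transient: the item is fine, try it again later (network, rate limit)."""

class PermanentFailureError(Exception):
    """Deterministic for this item: record the error and move on."""

class SystemFailureError(Exception):
    """The stage itself is broken (missing code, bad config): stop the job."""


def run_one(item, handler, queue, job):
    try:
        return handler(item)
    except RetryLaterError:
        queue.requeue(item)                       # bounded retry handled by the queue
    except PermanentFailureError as exc:
        queue.mark_failed(item, reason=str(exc))  # only this item is affected
    except (SystemFailureError, ImportError) as exc:
        job.pause(reason=str(exc))                # retrying won't fix missing code
        raise
```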
Preventions
Type Three-class error taxonomy in the type system. If the runner only accepts RetryLater | PermanentFail | SystemHalt, it can dispatch correctly by construction. A bare except Exception that retries everything is the anti-pattern — it means you haven't thought about which errors are which.
Observe Alert on "N items attempted, 0 succeeded" within a window. Any system that processes items should have a throughput metric. Zero throughput for >5 minutes should page someone — it means the system is alive but brain-dead.
Test Chaos test: inject an ImportError into a handler and verify the runner halts the job. This directly tests the "what happens when the code is broken" path, which is different from "what happens when the data is bad."
Design Bounded retry with escalation. Unbounded retry is never correct. After N failures of the same kind, escalate: pause the job, alert, change strategy. The maximum retry count should be a design parameter, not an accident of implementation.
Bug Stuck in_progress items from crashed runner
Runner crash → items stuck in in_progress → never picked up again.
Reset in_progress to pending on startup.
Deep Fix
in_progress is a lease, not a state. It should have an expiry. Instead of a bare status flag, use claimed_at timestamp + claimed_by worker ID. Any item claimed >N minutes ago by a dead worker is automatically released. This makes crash recovery continuous rather than requiring a restart — and it works in multi-worker scenarios where one worker dies but others are still alive.
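A sketch of lease-based claiming against a hypothetical items table with claimed_at and claimed_by columns (RETURNING needs SQLite ≥ 3.35):

```python
import sqlite3
import time

LEASE_SECONDS = 15 * 60  # illustrative TTL; tune to ~10x the stage's p99 duration


def claim_next_item(conn: sqlite3.Connection, worker_id: str):
    """Claim a pending item, or re-claim one whose lease has expired."""
    now = time.time()
    row = conn.execute(
        """
        UPDATE items
           SET status = 'in_progress', claimed_at = :now, claimed_by = :worker
         WHERE id = (
               SELECT id FROM items
                WHERE status = 'pending'
                   OR (status = 'in_progress' AND claimed_at < :expired)
                LIMIT 1)
        RETURNING id
        """,
        {"now": now, "worker": worker_id, "expired": now - LEASE_SECONDS},
    ).fetchone()
    conn.commit()
    return row[0] if row else None
```

A crashed worker's items simply become claimable again once their lease expires — no startup reset required.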
Preventions
Design Lease-based claiming instead of flag-based. The moment you write status='in_progress', ask: "what happens if the writer dies before writing status='done'?" If the answer is "it's stuck forever," you've designed a leak. Leases with TTL are the standard solution to this problem (see: SQS visibility timeout, Kubernetes pod eviction, distributed lock TTLs).
Observe Monitor age of in_progress items. An item that's been in_progress for >10x the p99 stage duration is almost certainly stuck. Alert on it.
Test Kill-the-runner test. Start runner, let it claim items, kill -9 it, start a new runner, verify items get processed. This is a basic resilience test that every queue system needs.
Bug 637 items stuck at stage-level in_progress
Startup reset fixed item-level status but not stage-level status inside JSON column.
Reset both item-level and stage-level in_progress on startup.
Deep Fix
Don't store the same concept in two places. Having status as a column AND stages.fetch: "in_progress" inside a JSON column means two sources of truth for "is this item being worked on?" They can (and did) disagree. The stage-level status should be the only source; derive the item-level status from it. Single source of truth, computed views.
Preventions
Design Single source of truth for state. Every piece of state should live in exactly one place. If you find yourself writing "reset X" and then later discovering you also need to "reset Y which is X stored differently," your schema has redundancy. Normalize it.
Test Invariant check: item status is consistent with stage statuses. A periodic assertion that status='pending' ⟹ no stage has 'in_progress' would have caught the 637 stuck items immediately.
Process When fixing a "stuck state" bug, grep for every place that state is stored. The first fix (item-level reset) was incomplete because it only addressed one representation. A checklist item: "Is this state stored anywhere else?"
Bug VTT extract returned silently empty
Missing VTT → empty result → downstream garbage → marked done.
Raise exception when VTT missing.
Deep Fix
Stages should declare their preconditions as assertions, not assumptions. If extract requires a VTT file, it should check for its existence before doing any work and raise a clear PreconditionFailed("VTT file not found at {path}"). The runner can then distinguish "precondition not met" (wait for upstream) from "stage failed" (something broke). This is the guard clause pattern applied to pipeline stages.
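A sketch of precondition checking in the runner, with an illustrative requires mapping:

```python
from pathlib import Path


class PreconditionFailed(Exception):
    """Inputs aren't ready: wait for upstream rather than marking the stage failed."""


# Illustrative: each stage declares the item fields it needs before it can run.
STAGE_REQUIRES = {
    "extract": ["vtt_path"],
    "enrich": ["raw_html"],
}


def check_preconditions(stage: str, item: dict) -> None:
    for key in STAGE_REQUIRES.get(stage, []):
        value = item.get(key)
        if not value:
            raise PreconditionFailed(f"{stage}: missing required input '{key}'")
        if key.endswith("_path") and not Path(value).exists():
            raise PreconditionFailed(f"{stage}: file not found at {value}")
```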
Preventions
Type Precondition declarations in stage metadata. Each stage config should list requires: ["vtt_path"]. The runner checks preconditions before invoking the handler. The handler never runs in an invalid state.
Test "Missing input" test for every stage. What does extract do when there's no VTT? What does enrich do when there's no raw_html? These are the most basic failure modes and the most commonly untested.
Design The "absent input" question belongs in the new-stage checklist. For every input a stage reads, document what happens when it's missing. If the answer is "I don't know," the stage isn't ready to ship.
Bug Earnings transcript marked done with no transcript
Handler returned None → runner converted to {} → "success" with no data.
Raise RetryLaterError; fix handler wrapper.
Deep Fix
The runner's None → {} conversion is the root bug. The framework should never coerce a handler's return value into something the handler didn't intend. None means "I have nothing to say" — the framework should treat that as an error, not as "success with empty data." The fix is: if result is None: raise StageError("Handler returned None") in the runner itself.
Preventions
Type Ban None as a valid handler return. Type annotation: async def process_stage(...) -> StageResult (not -> dict | None). A linter or runtime check in the runner catches violations. This single change prevents every "None-as-success" bug.
Test Runner contract test: verify that a handler returning None is treated as an error. This tests the framework's behavior, not the handler's — it's a one-time test that protects all handlers.
Design Principle: frameworks should be strict, handlers should be expressive. The runner (framework) should reject ambiguous returns. Handlers should use explicit success/failure types. Don't be "helpful" by converting garbage to something passable.
Bug yt-dlp metadata silently returned {} on auth errors
No error propagation from yt-dlp subprocess on bot/auth failures.
Raise on bot/auth errors.
Deep Fix
Subprocess wrappers must parse exit codes AND output to determine success. A generic run_subprocess() wrapper that returns stdout or {} on any error is dangerous. The wrapper should return a SubprocessResult(stdout, stderr, exit_code) and let the caller decide what constitutes success. Never swallow subprocess failures at the wrapper level.
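A sketch of a wrapper that reports rather than interprets (structure assumed, not the project's existing helper):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubprocessResult:
    stdout: str
    stderr: str
    exit_code: int


async def run_subprocess(*args: str) -> SubprocessResult:
    """Run the command and hand back everything; deciding what counts as
    success is the caller's job, not the wrapper's."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return SubprocessResult(out.decode(), err.decode(), proc.returncode)
```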
Preventions
Type Subprocess wrappers return structured results, not parsed output. Let the caller inspect exit code + stderr + stdout and decide. The wrapper's job is to run the command, not to interpret success.
Process For every external tool integration, document: "what does failure look like?" yt-dlp has at least 4 failure modes (bot detection, rate limit, missing content, network error). Each should be handled differently. If you haven't enumerated them, your error handling is incomplete.
Bug INSERT OR REPLACE destroys columns
INSERT OR REPLACE does DELETE + INSERT. Columns omitted from the INSERT list lose their existing values.
Use ON CONFLICT DO UPDATE SET col=excluded.col.
Deep Fix
This is a language-level footgun in SQLite. The deep fix is a convention: never use INSERT OR REPLACE — always use INSERT ... ON CONFLICT DO UPDATE (UPSERT). This should be enforced by a lint rule or code review checklist. Better yet, wrap all DB writes through a db.upsert(table, data, conflict_keys) helper that always uses the safe pattern.
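A sketch of the helper as a free function (the real thing might hang off a db object; table and column names must come from code, never from user input):

```python
import sqlite3


def upsert(conn: sqlite3.Connection, table: str, data: dict, conflict_keys: list[str]) -> None:
    """Always generates ON CONFLICT DO UPDATE, and only touches the columns in
    `data` -- columns not being set keep their existing values."""
    cols = list(data)
    placeholders = ", ".join(f":{c}" for c in cols)
    updates = ", ".join(f"{c} = excluded.{c}" for c in cols if c not in conflict_keys)
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    conn.execute(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) {action}",
        data,
    )
```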
Preventions
Process Ban INSERT OR REPLACE project-wide. Add to CLAUDE.md conventions. Grep for it in CI. It's never what you want when the table has columns you're not setting.
Type DB write helper that encapsulates the safe pattern. db.upsert() that always generates ON CONFLICT DO UPDATE. Developers never write raw INSERT for upserts.
Test Round-trip test: write row A, upsert partial row B on same key, verify A's untouched columns survive. This directly tests the data preservation property.
Bug Moltbook extract: pre-migration data invisible
Items fetched before migration had raw.json on disk but no entry in results table.
Check results table first, fall back to disk.
Deep Fix
Migrations must migrate data, not just schema. The schema migration (add results table) was incomplete because it didn't backfill existing items. A migration that changes where data lives without moving existing data creates a split-brain: some items in the old location, some in the new. The migration script should have included a backfill_from_disk_to_db() step.
Preventions
Process Migration checklist: "Does this migration change where data is read from? If yes, have you migrated all existing data to the new location?"
Test Post-migration verification: count items accessible via old path vs new path. They should be equal.
Design Avoid dual-read paths. The fallback (check DB, then disk) is technical debt. It's a workaround for an incomplete migration. Complete the migration, remove the fallback. Dual-read paths accumulate forever if you don't clean them up.
Bug Duplicate HTML served for different VIC ideas
Expired cookies → same HTML for different URLs → discovered after 100+ identical extractions.
Detect duplicate HTML by comparing lengths.
Deep Fix
Content validation should be semantic, not structural. Length comparison catches exact duplicates but misses near-duplicates. The real fix is a content fingerprint: hash the text content (stripped of boilerplate) and flag when multiple items share the same hash. Better yet, the fetch stage should verify that the fetched content actually relates to the requested entity — a cheap LLM check ("Is this about {company_name}?") catches both duplicates and wrong-content errors.
Preventions
Observe Content dedup metric: track unique content hashes vs total fetches. If 100 fetches produce 3 unique hashes, something is wrong. This is a distribution anomaly detector, not a per-item check.
Design Treat authentication state as a stage precondition. If the stage requires valid cookies, verify cookie freshness before making requests. Don't discover stale cookies after processing 100 items.
Test Fetch two different URLs and assert the responses differ. Trivial test, catches the entire class of "auth is broken and everything returns the same page."
Bug SQLite "closed database" in dashboard
DB functions called after conn.close(). Gradio hot-reload worsened it.
Use with closing(get_db()) as conn: everywhere.
Deep Fix
Connection lifecycle should be managed by one owner, not scattered across callers. Use a context-manager-based connection pool or a "unit of work" pattern where every DB operation happens within a with db_session() as s: block. The connection is opened, used, and closed in one scope — it can't leak or be used after close because the variable is scoped to the with block.
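A sketch of the unit-of-work pattern (the DB path is illustrative):

```python
import sqlite3
from contextlib import contextmanager

DB_PATH = "jobs.db"  # illustrative


@contextmanager
def db_session():
    """Open, use, commit (or roll back), and close in one scope; the connection
    can't leak because it never exists outside the with-block."""
    conn = sqlite3.connect(DB_PATH)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Callers write with db_session() as conn: and never hold a connection they could use after close.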
Preventions
Type Make bare connection objects impossible to obtain. If the only way to get a DB connection is with db_session() as conn:, you can't forget to close it. The API prevents the bug.
Design Gradio hot-reload invalidates module-level state. Any module-level connection, cache, or singleton will break on reload. Design for it: use function-scoped resources or a lazy-init pattern with staleness detection.
Bug 11 of 12 handlers broken by function rename
get_handler_log → get_handler_logger. asyncio.gather(return_exceptions=True) silently ate all ImportErrors.
Fix imports. Log exceptions from gather results.
Deep Fix
Two deep failures here. (1) asyncio.gather(return_exceptions=True) is a silent-failure factory — it converts crashes into list elements that nobody checks. Use a wrapper that logs/raises if any result is an exception. (2) Renaming without grep is a time bomb. Every rename should be accompanied by a project-wide search for the old name. IDE refactoring tools do this; manual renames don't.
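A sketch of that wrapper (name and logging are illustrative):

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def gather_or_raise(*coros):
    """Like asyncio.gather(return_exceptions=True), except failures can't be
    silently ignored: every exception is logged and the first is re-raised."""
    results = await asyncio.gather(*coros, return_exceptions=True)
    errors = [r for r in results if isinstance(r, BaseException)]
    for err in errors:
        log.error("gathered task failed: %r", err)
    if errors:
        raise errors[0]
    return results
```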
Preventions
Type Wrap asyncio.gather in a helper that raises on any exception result. async def gather_or_raise(*coros) that iterates results and re-raises the first exception. Use this instead of bare gather(return_exceptions=True) everywhere.
Test Import smoke test: import every handler module at test time. A single test file that does import jobs.handlers.vic_ideas etc. for every handler catches 100% of import-time errors. Takes <1s to run.
Process Rename checklist: grep for old name project-wide before committing. rg 'get_handler_log[^g]' would have found all 11 stale call sites.
Observe Startup health check: runner tries to import + instantiate every registered handler before entering the main loop. If any handler fails to import, the runner refuses to start. Fail loudly at startup, not silently at runtime.
Bug Stage errors lost during item transitions
Later stage errors overwrote earlier stage errors.
Store per-stage errors in stages JSON.
Deep Fix
Error storage should be append-only, not overwrite. Use an error log (array of {stage, error, timestamp}) instead of a single error field. This is the same principle as audit logs vs current state: you want the full history of what went wrong, not just the last thing.
Preventions
Design Multi-stage pipelines need per-stage error storage by construction. This should be a framework feature, not a per-handler responsibility. The runner should store errors keyed by stage automatically.
Test Two-stage failure test: fail stage A, then fail stage B, verify both errors are retrievable.
Gotcha Handler not async when dependency became async
Dependency made async, caller still sync → "cannot unpack non-iterable coroutine object."
Made entire chain async.
Deep Fix
Async is viral — changing one function to async requires changing all callers. The deep prevention is: when converting a function to async, immediately search for all call sites and convert them too. Better yet, design the system as async-from-the-start if any component might need it. Retrofitting async is always painful.
Preventions
Type Type checker (mypy/pyright) catches calling an async function without await. If the project had type checking enabled, this would be a static error, not a runtime surprise.
Process When making a function async, grep for all callers and update them in the same commit. Async conversions must be atomic — half-converted call chains always break.
Gotcha yt-dlp non-zero exit with valid output
Non-zero exit code for warnings, but stdout has valid content. Treating non-zero as failure threw away good data.
Check stdout content. Only error on non-zero + empty output.
Deep Fix
External tools have their own definition of "success" that doesn't match yours. The subprocess wrapper should define success criteria per-tool, not assume exit code 0 = success. For yt-dlp: success = "stdout contains valid JSON." For curl: success = "HTTP 200." The tool's exit code is a hint, not a contract.
Preventions
Design Per-tool success criteria. When integrating an external tool, document: "What constitutes success? What constitutes retriable failure? What constitutes permanent failure?" These three questions should be answered before writing the wrapper.
Process Read the tool's documentation on exit codes before assuming. yt-dlp's docs explicitly state that non-zero can mean warnings. This is a "RTFM prevents bugs" case.
Convention No asyncio.gather exception logging
return_exceptions=True silently eats exceptions.
Iterate results and log exceptions.
Deep Fix
Ban bare asyncio.gather(return_exceptions=True). Replace with a project utility: gather_logged(*coros, logger=log) that automatically inspects results and logs/counts any exceptions. Make the safe version the easy version.
Preventions
Type Utility function that makes the pit of success the default. If gather_logged is easier to use than raw asyncio.gather, developers will naturally use the safe version.
Process Lint rule: flag any use of return_exceptions=True without a corresponding exception check in the same scope.
Gotcha Metadata stage vacuously succeeded without transcript
Metadata ran on discovery data only — no transcript input — and succeeded vacuously.
Merged into single stage.
Deep Fix
A stage that succeeds without its primary input is lying. Every stage should have a "minimum viable input" assertion. If metadata enrichment needs a transcript and there isn't one, it should skip (not succeed) or fail (not silently produce empty enrichment). The broader principle: distinguish "nothing to do" from "did something successfully."
Preventions
Design Stage output validation: "did this stage produce meaningful output?" A stage that returns {} or produces identical output regardless of input is vacuous. The runner should flag it.
Test Feed a stage empty/missing input and verify it doesn't claim success. The vacuous success pattern is always testable: give it nothing, check that it doesn't say "done."
🛡️

Rate Limiting, Auth & Bot Detection

12 issues
Bug VIC rate-limit page passed structural checks
Rate-limit page had enough HTML structure to pass basic checks.
Three-layer defense: text detection, HTML validation, LLM sanity check.
Deep Fix
Structural checks are necessary but never sufficient for content from adversarial or unreliable sources. The deep fix is a layered validation architecture: (1) structural check (is it HTML?), (2) content fingerprint (does it look like previous valid responses?), (3) semantic check (does it contain expected entities/concepts?). Each layer catches what the previous misses. The key insight: the more sophisticated the source's failure modes, the more semantic your validation must be.
Preventions
Design Assume every external response might be garbage. Design validation as if the external source is actively trying to fool you. Not because it is, but because rate-limit pages, error pages, CAPTCHAs, and CDN failovers all produce structurally-valid-but-semantically-wrong responses. This mindset catches bugs before they happen.
Test Golden file tests: save actual rate-limit/error pages and assert the validator rejects them. Build a corpus of "things that look valid but aren't." Every new false positive gets added to the corpus.
Observe Track content-length distribution. Rate-limit pages are usually shorter than real content. A sudden shift in response size distribution signals something is wrong.
Bug VIC "Please wait 24 hours" — second rate-limit variant
Different text format for rate limit, not caught by first detection.
Added explicit string detection.
Deep Fix
String matching for error detection is a game of whack-a-mole. Each new rate-limit message requires a new string. The semantic approach (LLM: "Is this a rate-limit/error page?") catches all variants with one check. For lower cost, use a classifier trained on the corpus of known error pages. The principle: match on semantics, not syntax, when the source can vary its syntax.
Preventions
Design Semantic validation over pattern matching when the input space is open-ended. String matching works for known patterns; LLM/classifier works for unknown patterns. Use both: string matching as a fast first pass, semantic check as a safety net.
Process When adding a string pattern for error detection, ask: "Are there other messages I haven't seen yet?" If yes, you need a more general approach. One variant is an accident; two variants is a pattern; three means you need a classifier.
Bug YouTube bot detection blocked downloads
Datacenter proxy alone wasn't enough.
Proxy + cookies + BotDetectedError.
Deep Fix
Bot detection evasion requires multiple signals, not just IP rotation. Modern anti-bot systems check IP reputation + browser fingerprint + cookie state + behavioral patterns. The deep fix is an escalation ladder: try cheapest approach first (direct), escalate to proxy, escalate to proxy + cookies, escalate to full browser. Each level has a cost — track which level each target requires and auto-select.
Preventions
Design Fetch strategy as an escalation ladder, not a fixed approach. [direct → proxy → proxy+cookies → browser]. Start cheap, escalate on failure, remember what works per-domain.
Observe Track bot-detection rate per domain over time. A sudden spike means the target changed their detection. Alerts let you respond before the entire pipeline stalls.
Bug YouTube JS challenge solver broken
Missing --remote-components ejs:github flag across 7 call sites.
Add flag to all call sites.
Deep Fix
7 call sites with the same flag means the flag should be in one place. Create a ytdlp_command(url, **kwargs) -> list[str] factory that always includes the base flags. Individual callers add only their specific flags. DRY for command construction, not just for code.
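A sketch of the builder (the base flag set shown is illustrative; the point is that it lives in exactly one place):

```python
# The flags every invocation needs live here and nowhere else.
YTDLP_BASE_FLAGS = [
    "--remote-components", "ejs:github",
    "--no-progress",
]


def ytdlp_command(url: str, *extra_flags: str, proxy: str | None = None) -> list[str]:
    """Single place that knows what every yt-dlp call requires."""
    cmd = ["yt-dlp", *YTDLP_BASE_FLAGS]
    if proxy:
        cmd += ["--proxy", proxy]
    return [*cmd, *extra_flags, url]
```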
Preventions
Type Command builder pattern. A single function that constructs yt-dlp commands with all required base flags. Callers can't forget flags they don't know about.
Process When adding a "required for all calls" flag, grep for all call sites immediately. If there are >2, refactor to a shared builder first, then add the flag once.
Bug PO token requirement for YouTube
New YouTube requirement, downloads failing with unclear errors.
Install plugin, player_client fallback chain.
Deep Fix
External API changes are inevitable — design for them. The system should have a "YouTube health" check that periodically tests a known-good download and alerts when it fails. This detects platform changes before they affect the pipeline, not after 100 items fail. The principle: canary requests detect environmental changes before they cause damage.
Preventions
Observe Canary downloads: periodically test a known-good URL and alert on failure. A 5-minute cron job that downloads a short public video catches YouTube changes within minutes, not days.
Design Subscribe to yt-dlp release notes / changelog. PO token support was documented in yt-dlp releases before it became required. Tracking upstream tool changes is part of maintaining the integration.
Bug SSL/timeout errors classified as permanent failure
Network errors → permanent failure → never retried.
Classify as retry_later.
Deep Fix
Error classification should be a lookup table, not scattered conditionals. Maintain a mapping: {SSLError: RETRY, TimeoutError: RETRY, HTTPError(404): PERMANENT, HTTPError(429): RETRY, HTTPError(500): RETRY, ValueError: PERMANENT}. New error types get an explicit classification. Unknown errors default to RETRY with a logged warning — you don't know if it's transient, so assume it might be.
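A sketch of a centralized classifier (assuming httpx as the HTTP client; the exact exception list is illustrative):

```python
import ssl
from enum import Enum, auto

import httpx  # assumption: httpx is the project's HTTP client


class ErrorClass(Enum):
    RETRY = auto()
    PERMANENT = auto()
    SYSTEM_HALT = auto()


def classify_error(exc: Exception) -> ErrorClass:
    """One source of truth for classification; unknown errors default to RETRY."""
    if isinstance(exc, (ImportError, AttributeError)):
        return ErrorClass.SYSTEM_HALT                 # the code is broken, not the item
    if isinstance(exc, (ssl.SSLError, TimeoutError, httpx.TimeoutException)):
        return ErrorClass.RETRY
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        return ErrorClass.RETRY if code == 429 or code >= 500 else ErrorClass.PERMANENT
    if isinstance(exc, ValueError):
        return ErrorClass.PERMANENT
    return ErrorClass.RETRY                           # unknown: assume transient, log it
```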
Preventions
Type Centralized error classifier. One function: classify_error(exc) -> RetryLater | PermanentFail | SystemHalt. Every handler uses it. New error types get added to one place.
Design Default to retry for unknown errors. The cost of retrying a permanent error is wasted compute. The cost of permanently failing a transient error is lost data. The asymmetry favors retry as the default.
Bug Rate limit errors (429) tripping circuit breaker
Quota errors treated as permanent, triggering job pause.
Treat 429 as retry_later.
Deep Fix
The circuit breaker should distinguish "something is broken" from "we're going too fast." Rate limits aren't failures — they're flow control. The circuit breaker should only trigger on non-rate-limit consecutive failures. Rate limits should trigger backoff (slow down), not shutdown (stop). Two different mechanisms for two different problems.
Preventions
Design Separate rate-limit handling from error handling. Rate limits → adaptive backoff (slow down, increase delay). Errors → circuit breaker (stop after N consecutive). Mixing them means either: rate limits shut you down, or real errors don't shut you down. Both are bad.
Observe Metric: rate-limit-vs-error ratio. If 90% of "failures" are rate limits, the circuit breaker is wrong. This metric makes the misclassification visible.
Bug VIC cookie exchange was sync in async handler
Sync httpx.get() blocking the event loop.
Made async.
Deep Fix
Lint for sync I/O inside async functions. Any httpx.get(), requests.get(), open(), or time.sleep() inside an async def is a bug. A ruff/flake8 rule or a runtime warning when the event loop is blocked >100ms catches these mechanically.
Preventions
Type Convention: all handlers are async, all I/O uses async clients. Import httpx.AsyncClient as the default; don't even import the sync client in handler files.
Observe Event loop block detection. Python's asyncio debug mode warns when the event loop is blocked >100ms. Enable it in development.
Convention Direct "naked" fetches got IP blocked
No proxy → real IP exposed → blocked.
Always proxy by default.
Deep Fix
Proxy should be the default, not an opt-in. Create a project-wide HTTP client factory: get_http_client(proxy=True) where proxy defaults to on. A whitelist of domains that don't need proxy (localhost, LLM APIs). Any new domain automatically gets proxied. The footgun of forgetting to proxy disappears.
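A sketch of such a factory (proxy URL and allowlist are placeholders; recent httpx versions take proxy=, older ones use proxies=):

```python
import httpx

PROXY_URL = "http://proxy.internal:8080"                         # placeholder
NO_PROXY_DOMAINS = {"localhost", "127.0.0.1", "api.openai.com"}  # placeholder allowlist


def get_http_client(*, proxy: bool = True, domain: str | None = None) -> httpx.AsyncClient:
    """Proxy by default; opting out requires naming an allowlisted domain."""
    if proxy:
        return httpx.AsyncClient(proxy=PROXY_URL)
    if domain not in NO_PROXY_DOMAINS:
        raise ValueError(f"proxy=False is not allowed for domain {domain!r}")
    return httpx.AsyncClient()
```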
Preventions
Type HTTP client factory with proxy-by-default. Make the safe path the easy path. Opting out of proxy requires explicit proxy=False with a domain in the whitelist.
Process Grep for bare httpx.get/requests.get and flag them. All HTTP calls should go through the project client.
Gotcha VIC cookie has two layers (remember + session)
Persistent cookie in Chrome's DB, session cookie needs HTTP exchange.
Fallback chain: file → browser → HTTP exchange → cache.
Deep Fix
Document the auth flow before implementing. VIC's two-layer cookie system is discoverable by reading network traffic, but it was discovered by trial and error. For any authenticated target, spend 30 minutes with browser DevTools documenting the full auth flow before writing code. The cost of understanding upfront is far less than the cost of debugging auth failures in production.
Preventions
Process Auth flow documentation as a prerequisite for new handler development. Before writing the handler: document cookies, tokens, session lifecycle, expiry, and refresh mechanism. This becomes the specification for the auth code.
Test Auth health check: verify cookies are valid before starting a batch. A pre-flight check that makes one authenticated request and verifies the response is valid catches stale cookies before they ruin a batch.
Gotcha yt-dlp auth errors vs warnings indistinguishable by exit code
Non-zero for both fatal and non-fatal. Must parse stderr.
Parse stderr for specific error strings.
Deep Fix
Same as "yt-dlp non-zero with valid output" — per-tool success criteria. The yt-dlp wrapper should have a classify_result(stdout, stderr, exit_code) -> Success | Warning | AuthError | ContentError function that encodes all the tool's quirks in one place. Callers get clean typed results.
Preventions
Type Tool-specific result classifier. Encode every known failure mode of the external tool in a classifier function. Unknown failure modes get logged as warnings with full stderr.
Gotcha Retry + reprocess didn't unpause paused jobs
Items reset to pending but job stayed paused.
Retry/reprocess buttons auto-unpause.
Deep Fix
State transitions should be atomic and complete. "Retry" is not just "reset item status" — it's a compound operation: reset items + unpause job + clear circuit breaker counter. If any step is missing, the system is in an inconsistent state. Model operations as state machines where every transition specifies the full target state, not just one field.
Preventions
Design Define operations as state transitions, not field updates. retry() should be a single function that atomically moves from paused+failed_items to running+pending_items. Not three separate SQL updates that can partially complete.
Test End-to-end retry test: pause job → retry → verify items process. Tests the full operation, not just the SQL.
🔧

Stage Design & Pipeline

10 issues
Bug Extract stage ran before fetch
Stage dependencies were not inferred from YAML ordering — extract ran with no data because fetch hadn't completed yet.
Added explicit stage_deps declarations in the YAML pipeline config.
Deep Fix
Stages should declare their inputs as typed preconditions, not rely on implicit ordering. The runner should resolve a DAG from input/output declarations and refuse to schedule a stage whose preconditions aren't satisfied. Ordering in a YAML file is a serialization artifact, not a contract.
Preventions
Design Precondition-based scheduling over positional ordering. Any system where execution order is derived from list position will eventually have an insertion that violates implicit dependencies. Stages should declare what they need, and the scheduler should enforce it.
Test Dry-run mode that validates the DAG before executing. A pipeline should have a zero-cost validation pass that checks all preconditions are satisfiable and all dependencies form a valid DAG.
Type Make illegal orderings unrepresentable in the pipeline schema. Input/output type annotations on stages let a validator reject impossible orderings statically.
Design Monolithic transcript_and_meta stage
A single stage handled metadata + captions + whisper. When Whisper failed, all work retried.
Split into 4 independent stages: metadata, captions, whisper, merge.
Deep Fix
Stage granularity should match failure domains. If two operations can fail independently, they belong in separate stages. The litmus test: "can I retry just the part that failed?" If no, the stage is too coarse.
Preventions
Design One failure domain per stage. A stage that calls two external services has two independent failure modes. Bundling them forces redundant re-execution on partial failure.
Process The retry-scope test during stage design. "If this fails halfway through, what work gets thrown away?" If the answer includes completed, successful work, the stage needs splitting.
Design Stage order: transcript before scoring
Downloaded and transcribed every video before scoring relevance. Hundreds of irrelevant videos fully processed.
Reordered: score metadata first, transcribe only passing videos.
Deep Fix
Cheap filters before expensive transforms is a universal pipeline design principle. Every pipeline should be ordered by ascending cost, with each stage acting as a progressively more expensive quality gate. Same principle as database query optimizers pushing selective predicates down.
Preventions
Design Cost-aware stage ordering. Each stage should have an estimated cost. The framework should warn when a high-cost stage precedes a low-cost filter that could eliminate inputs.
Process Funnel metrics in pipeline dashboards. Track input count at each stage boundary. If stage N passes 95% to stage N+1, which then drops 80%, the filter is in the wrong position.
Design Transcript before metadata validation
Videos transcribed before basic metadata checks. Invalid videos consumed Whisper credits before rejection.
Moved metadata validation before transcription.
Deep Fix
Validate before investing resources — a specialization of cheap-before-expensive. Validation is nearly free and should always precede transformation. This is the pipeline analog of failing fast.
Preventions
Design Validation is always stage zero. Any pipeline accepting external input should have an explicit validation gate before the first resource-consuming stage.
Observe Track waste ratio per stage. "Items processed that were later rejected by a downstream filter." A non-zero waste ratio means a cheap check could have been hoisted earlier.
Design Groq primary vs YouTube captions — three changes in one day
Transcript source flipped 3× in one day: YouTube → Groq → YouTube+Groq fallback.
YouTube primary (free), Groq fallback (paid, higher quality).
Deep Fix
Strategy decisions with cost/quality implications need a decision matrix evaluated once, not iterative trial-and-error in production. Benchmark cost, quality, latency, and failure rate on a representative sample before committing.
Preventions
Process Decision records for strategy choices. Write a one-page decision record: options, criteria, measurements, choice, and reversal conditions. Forces evaluation before commitment.
Test A/B comparison on a sample before rollout. Run both strategies on 20 items. Compare quality, cost, latency. Decide from data.
Design Audio files accumulated on disk (10GB+)
Every video left MP3/M4A. 10GB+ accumulated with no cleanup.
keep_audio=False default; delete after transcription.
Deep Fix
Stages should own the lifecycle of their intermediate artifacts. Every file a stage creates should have an explicit disposition: keep (final), pass (input to next stage, then delete), or temp (delete on stage completion). Default = deletion, with explicit opt-in for retention.
Preventions
Design Artifact lifecycle as a framework primitive. Stages declare outputs with retention classes (ephemeral, intermediate, final). The framework enforces cleanup.
Observe Disk usage monitoring with per-pipeline attribution. Alert when growth exceeds expected bounds.
Process Stage review checklist: "What files does this stage create, and who deletes them?"
Design Enrichment coupled to fetch stage
LLM enrichment inside fetch. Couldn't rerun without re-downloading.
Separate enrich stage reading from cached data.
Deep Fix
Stages should be independently re-runnable. The test: "can I rerun just this stage with cached inputs?" If not, the stage has a hidden dependency on the execution of a prior stage, not just its output.
Preventions
Design Stage inputs must be fully materialized. A stage reads from persistent storage, never from in-memory state of a prior stage.
Test Isolated stage re-run test. Run each stage in isolation with pre-populated inputs. If it requires running prior stages, the stage has an undeclared dependency.
Design Two jobs with 93% discovery overlap
Two pipelines discovered nearly identical video sets. Both ran full processing.
Folded into single pipeline.
Deep Fix
Deduplication belongs at the discovery layer. When multiple pipelines can discover the same items, there should be a shared discovery registry with canonical ownership.
Preventions
Design Canonical item registry with dedup at ingestion. Content-addressable IDs in a shared registry. Overlap impossible by construction.
Observe Pipeline overlap detection. Periodically compute Jaccard similarity of item sets across pipelines. Alert when overlap exceeds 20%.
Design IR page crawl: garbage URLs from parked domains
Crawl hit parked domains → hundreds of junk URLs → 15s timeouts each.
Removed crawl strategy. Targeted API-based discovery.
Deep Fix
Discovery output should pass through a quality gate before entering the pipeline. Never trust discovery output — validate it as untrusted input.
Preventions
Design Discovery output is untrusted input. Apply validation (domain allowlists, pattern matching, content-type checks) before committing resources.
Observe Discovery quality metrics: yield rate by source. A source with 2% yield rate is a signal to investigate or remove.
Design Manual subs not preferred over auto-generated
Auto captions used by default; manual subs (higher quality) ignored.
Prefer manual subs; rename with .manual.vtt.
Deep Fix
When multiple sources provide equivalent data at different quality, declare an explicit quality hierarchy with preference ordering in configuration, not buried in code.
Preventions
Design Explicit preference ordering for equivalent inputs. Ranked list in config. Stage iterates and takes the first available.
Observe Track source provenance through pipeline. Know which variant was used — makes quality decisions auditable.
📦

Import & Startup Issues

5 issues
Bug litellm concurrent import race
Multiple async workers imported litellm concurrently on first use. litellm's __init__ isn't thread-safe — AttributeError and corrupted state.
Pre-import litellm before asyncio.run().
Deep Fix
Module initialization is a shared mutable resource. Heavy modules with side effects must be eagerly imported in the main thread before any concurrent execution begins. Python's import lock is necessary but not sufficient for modules with complex __init__ logic.
Preventions
Design Eager import of stateful modules before concurrency begins. Treat these imports as application bootstrap, not lazy loading.
Process Startup manifest: explicit list of pre-imported modules. New heavy-module dependencies added to this list during code review.
Test Concurrent cold-start integration test. Spin up N workers simultaneously with empty module caches and verify no import errors.
Bug lib.llm concurrent import partially initialized
"cannot import call_llm from partially initialized module" — multiple workers hit cold start simultaneously.
Eagerly import in runner.py before async work.
Deep Fix
Same root cause — the runner's bootstrap phase must import all modules workers will need before spawning any workers. Individual fixes per module are whack-a-mole; the fix is a centralized pre-import phase.
Preventions
Design Centralized bootstrap_imports() in the runner. Single auditable location for all worker dependencies.
Test Static analysis for imports used in worker code paths. Every module imported inside a worker function must also appear in the pre-import manifest.
Bug Lazy import ↔ circular import ping-pong
Lazy import to fix circular → caused concurrent race. Revert → brought back circular. Two fixes fighting.
Top-level import + pre-import, then refactored the circular dependency out.
Deep Fix
Circular imports are a dependency graph problem, not an import-ordering problem. The real fix is restructuring the module graph to eliminate cycles — extract shared interfaces into a leaf module.
Preventions
Design Layered dependency architecture with enforced direction. Define module layers (core → lib → handlers → runners) where dependencies only flow downward.
Test CI check for import cycles. Circular imports should be a blocking CI failure.
Process Ban lazy imports as a circular-import fix. "Circular import" is not an accepted reason for moving an import into a function body.
Gotcha Pointless ImportError catch on required dependency
try: import litellm / except ImportError: pass on a required module. Silent failure.
Removed try/except. Top-level import, fail immediately.
Deep Fix
try/except ImportError on a required dependency is always wrong. It converts a clear, immediate failure into a delayed, mysterious one. Defensive programming turned pathological.
Preventions
Process Convention: never try/except on required imports. A bare pass in the except clause is a code smell that the import is actually required.
Design Fail fast at the boundary, not deep in the call stack. Error detection as close to the cause as possible.
Convention Cross-handler imports create hidden dependencies
Handler A imported from Handler B → B refactored → A broke silently.
Shared utilities to jobs/lib/. Handlers never import from each other.
Deep Fix
Handlers should be leaf nodes in the dependency graph. When handlers form a graph among themselves, you lose the ability to modify, test, or deploy them independently.
Preventions
Design Handlers as leaf nodes: architectural constraint. May import from jobs/lib/, lib/, stdlib — never from jobs/handlers/.
Test Import graph validation in CI. Parse import statements in handlers; fail if any handler imports from another handler.
📊

Dashboard & UI Bugs

7 issues
Bug Double-refresh: high CPU + broken clicks
Callable passed to component + timer.tick = two competing refresh cycles. CPU doubled, clicks broken.
Pass static value; only timer triggers refresh.
Deep Fix
Understand the reactive ownership model before wiring data sources. Any framework with reactive bindings (Gradio, React, Svelte) has this trap: implicit reactivity + explicit triggers = duplication. Establish a single canonical refresh path. Audit every component for "who owns the refresh" and ensure exactly one source of truth.
Preventions
Design Single-writer principle for UI state. Every mutable UI element should have exactly one code path that updates it.
Test Refresh-count assertion. Instrument the data-fetch function with a counter. Assert exact N calls over a window matching the timer interval, not 2N.
Bug Heartbeat not cleared on runner exit
Stale heartbeat → dashboard showed "running" after runner stopped.
Clear heartbeat in finally/atexit handler.
Deep Fix
Heartbeats are leases, not flags. A lease says "I am alive if this was written within the last T seconds." With a TTL-based model, no cleanup code needed. Any liveness system that requires active cleanup on death is architecturally broken — death is precisely when cleanup code doesn't run.
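A sketch of TTL-based liveness against a hypothetical heartbeats table:

```python
import time

HEARTBEAT_TTL = 30  # seconds; illustrative staleness threshold


def write_heartbeat(conn, runner_id: str) -> None:
    # The writer only ever writes a timestamp; there is no cleanup path to forget.
    conn.execute(
        "INSERT INTO heartbeats (runner_id, beat_at) VALUES (?, ?) "
        "ON CONFLICT (runner_id) DO UPDATE SET beat_at = excluded.beat_at",
        (runner_id, time.time()),
    )
    conn.commit()


def is_running(conn, runner_id: str) -> bool:
    # The reader decides liveness from freshness; a SIGKILLed runner just goes stale.
    row = conn.execute(
        "SELECT beat_at FROM heartbeats WHERE runner_id = ?", (runner_id,)
    ).fetchone()
    return bool(row) and (time.time() - row[0]) < HEARTBEAT_TTL
```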
Preventions
Design TTL-based liveness over flag-based. Encode timestamp + staleness threshold. Reader interprets freshness; writer doesn't need cleanup.
Test Simulate SIGKILL, wait TTL+1, assert dashboard shows "stopped." If your liveness model survives SIGKILL, it survives everything.
Bug Moltbook handler signature mismatch
Positional args vs keyword-only. TypeError at runtime.
All handlers: keyword-only args matching dispatch convention.
Deep Fix
Handler interfaces must be machine-enforced contracts. Define a Protocol or ABC for the handler signature. Any function called via indirection (dispatch table, registry, plugin) needs a typed contract because call site and definition are decoupled.
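A sketch of the contract (the keyword-only signature and handler name are illustrative, not the project's actual convention):

```python
from typing import Any, Protocol


class StageHandler(Protocol):
    """The one calling convention for anything invoked via the dispatch table."""

    async def __call__(self, *, item: dict[str, Any], stage: str) -> dict[str, Any]: ...


async def moltbook_fetch(*, item: dict[str, Any], stage: str) -> dict[str, Any]:
    return {"id": item.get("id"), "stage": stage}


# Typing the registry with the Protocol lets mypy/pyright flag any handler
# whose signature drifts from the convention at edit time, not at runtime.
HANDLERS: dict[str, StageHandler] = {"moltbook_fetch": moltbook_fetch}
```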
Preventions
Type Protocol class for handler signatures. Static analysis catches mismatches at edit time.
Test Parametric handler-invocation test. Iterate all registered handlers, invoke each with mock args using the dispatcher's calling convention.
Gotcha Job kind misclassification
Backfill jobs labeled as monitor. Wrong dashboard tab + wrong metrics.
Changed to kind: backfill.
Deep Fix
Classification should be derived from structural properties, not manually assigned labels. Finite catalog = backfill. Polls endpoint = monitor. Infer kind from observable properties; the label becomes a computed property.
Preventions
Design Derive kind from job properties. Finite item list + sequential = backfill; periodic poll = monitor. Compute at registration time.
Process Enum, not string, for job kind. Typos caught at import time, valid set discoverable.
Bug p50 showing "—" for fast stages
0.0 elapsed is falsy in Python. Fast stages treated as missing data.
Check is not None explicitly.
Deep Fix
Python truthiness conflates "missing" with "zero" for every numeric type. The rule is absolute: if the variable can legitimately be zero, use is not None. Distinguish "absent" from "present but trivial" at the type level.
Preventions
Process Lint rule: ban bare truthiness checks on numeric variables. Require explicit is not None or > 0.
Test Test with zero values for every numeric display. Boundary value that must render, not disappear.
Bug Status column expanding despite max-width
table-layout: auto ignores width constraints.
table-layout: fixed.
Deep Fix
For any data table where you need predictable column widths, table-layout: fixed is required, not optional. With auto, the browser's content-driven algorithm overrides your declarations.
Preventions
Design Default to table-layout: fixed for all data tables. auto only for prose tables where content should dictate layout.
Test Visual regression with max-length content. Seed test data with maximum-length strings for every column.
Bug Startup message printed 3x in debug mode
Multiple init paths printing banner.
Guard with _initialized flag.
Deep Fix
Initialization must be idempotent. Separate "configure" from "start." Configuration is idempotent (set values, register). Starting is one-shot (launch threads, print banner). If tangled, multiple call sites trigger duplicates.
Preventions
Design Separate configure() from start(). Multiple configure() safe; multiple start() raises error.
Test Test that double-init is harmless. Call init twice, assert no duplicate side effects.
🔍

Discovery & Content Quality

9 issues
Bug Commentary videos matched as earnings calls
YouTube search returned commentary/reaction videos alongside actual calls.
Strict title matching + skip-word filters.
Deep Fix
Search results are candidates, not answers. YouTube's algorithm promotes commentary over primary sources. Fix: two-stage pipeline: broad recall → precision filtering. Channel verification, title patterns, duration heuristics, earnings calendar correlation.
Preventions
Design Two-stage discovery: recall then precision. Separate "find candidates" from "validate candidates." Measure both rates independently.
Test Golden set of known-good and known-bad results. 20 real calls + 20 commentary URLs. Assert 100% classification accuracy.
Design Channel allowlist for authoritative sources. Ticker → official YouTube channel ID mapping.
Bug Short tickers matched as substrings
LOW, CAT, GE matched as common English words in titles.
Short tickers (≤3 chars) require $SYM prefix or company name.
Deep Fix
The shorter an identifier, the more context you need to confirm a match. Length-adaptive matching rules: 1-2 char need $prefix or exact company name within 50 chars. 3 char need word boundary + financial context. 4+ char: word boundary alone.
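A simplified sketch of length-adaptive matching (the context heuristics are illustrative, and the "$ prefix or company name within 50 characters" rule is loosened to a document-level check):

```python
import re


def match_ticker(ticker: str, text: str, company_name: str | None = None) -> bool:
    """The shorter the ticker, the more corroborating context it needs."""
    dollar = re.search(rf"\${re.escape(ticker)}\b", text)
    bounded = re.search(rf"\b{re.escape(ticker)}\b", text)
    if len(ticker) <= 2:
        # 1-2 chars: only a $-prefixed symbol or the company's name counts.
        return bool(dollar) or (
            company_name is not None and company_name.lower() in text.lower()
        )
    if len(ticker) == 3:
        # 3 chars: word boundary plus some financial context in the same text.
        context = re.search(r"\b(earnings|stock|shares|NYSE|NASDAQ)\b", text, re.I)
        return bool(dollar) or (bool(bounded) and bool(context))
    return bool(bounded)  # 4+ chars: a word boundary alone is enough
```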
Preventions
Design Length-adaptive matching tiers. Codified in a shared matching utility.
Test Adversarial test corpus for collision-prone tickers. 50 most collision-prone tickers with true/false positive sentences.
Bug $AA matched $AAL, T matched TXN
Substring containment. Transcripts attached to the wrong companies.
Regex word-boundary matching.
Deep Fix
Identifiers must always be matched with boundary awareness. Build a shared match_ticker() utility so this decision is made once and correctly.
Preventions
Type Shared match_identifier() utility. Enforces word boundaries by default. Callers never write their own regex.
Process Ban bare in for identifier matching. The in operator is for collection membership, not pattern matching in strings.
Bug January earnings = Q4 prior year
Jan 2026 earnings tagged as Q1 2026 instead of Q4 2025.
Fixed quarter derivation: Jan earnings = Q4 prev year.
Deep Fix
Event dates and reporting periods are fundamentally different temporal concepts. An earnings call has when-it-happens and what-it-reports-on. These can differ by 2-8 weeks and across year boundaries. Model them as separate fields.
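A sketch of deriving the reported period from the event date, assuming a calendar fiscal year and that a call covers the most recently completed quarter:

```python
from datetime import date


def reported_fiscal_period(event_date: date) -> tuple[int, int]:
    """(fiscal_year, quarter) an earnings call reports on -- not the quarter
    the call happens in. A January/February call reports Q4 of the prior year."""
    completed = (event_date.month - 1) // 3  # quarters fully completed this year
    if completed == 0:
        return event_date.year - 1, 4
    return event_date.year, completed


assert reported_fiscal_period(date(2026, 1, 15)) == (2025, 4)
assert reported_fiscal_period(date(2026, 4, 30)) == (2026, 1)
```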
Preventions
Design Separate event timestamp from reporting period in the data model. event_date and fiscal_period as independent fields.
Test Boundary-month test cases. Jan 1, Jan 31, Feb 28, Dec 31 — assert correct fiscal quarter for each.
Bug Sub-entity names polluted discovery
"Samsung Electronics (Foundry Division)" stored as separate entity.
NOT LIKE '% (%' filter.
Deep Fix
Entity normalization must happen at write time, not read time. The NOT LIKE filter is a read-time bandage applied everywhere. The structural fix: resolve-on-write — normalize entity names before insertion.
Preventions
Design Resolve-on-write for all entity names. Strip parenthetical qualifiers, apply alias mappings, resolve to canonical form before storing.
Test Uniqueness invariant: no entity contains parenthetical qualifiers after insertion.
Bug Entity dedup: 1639→987 companies
No normalization on insert. 28 Samsung entries. 9663→3727 relationships after dedup.
Entity resolution on insert (resolve_company + add_company).
Deep Fix
Deduplication is a data quality problem that compounds silently. Each duplicate looks harmless; collectively they fragment analytics and under-count aggregations. Fix: resolve-on-write pipeline (normalize, fuzzy-match existing, merge or create). Any append-only store without a dedup strategy accumulates garbage proportional to input diversity.
Preventions
Design Entity resolution pipeline on write path. Normalize → fuzzy-match → merge or create. Standard record linkage pattern.
Test Insert-twice-get-one test. Insert same entity with two surface forms, assert one row.
Observe Duplicate ratio metric. Periodic fuzzy clustering; alert if estimated duplicate rate exceeds 5%.
Bug Moltbook used nonexistent API endpoint
/posts/trending didn't exist. Zero items. Job appeared healthy.
Fixed endpoint. Added status code checking.
Deep Fix
Zero results from discovery should be a failure signal, not success. The "silent zero" antipattern: the most dangerous failures look like success. Distinguish "found nothing" from "couldn't look."
Preventions
Observe Alert on zero-yield discovery runs. Historical average > 0 but current run = 0 → fire alert.
Design HTTP client that raises on non-2xx by default. 404s become exceptions, not silent empty responses.
Design No LLM sanity check on scraped content
Structural checks passed for wrong content — login walls, error pages, unrelated articles.
Cheap LLM sanity check (~$0.0001/call).
Deep Fix
Structural validation and semantic validation answer different questions; you need both. A login wall has perfect HTML structure. Graduated validation pyramid: structural (free) → statistical (microseconds) → semantic/LLM (pennies). Only content passing all three layers is ingested.
Preventions
Design Graduated validation: structural → statistical → semantic. Fail fast at cheap layers; expensive layers only for content that passes cheap ones.
Test Adversarial content test suite. Login walls, 404 pages, unrelated articles. Assert all rejected.
Gotcha Base64 images inflate HTML for LLM calls
100KB+ base64 images consumed thousands of tokens. Zero analysis value.
Strip data: URIs. Cap at 50K chars.
Deep Fix
Every byte sent to an LLM has a cost. Content must pass through a token-budget-aware preprocessing pipeline: parse HTML → extract text → remove binary/encoded data → estimate tokens → truncate if over budget.
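A sketch of such a preprocessing step (regexes are a rough stand-in for a real HTML parser; the 50K cap mirrors the surface fix, with characters as a cheap proxy for tokens):

```python
import re

MAX_CHARS = 50_000  # chars as a cheap proxy for a token budget

SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")
TAG = re.compile(r"<[^>]+>")


def html_for_llm(html: str, max_chars: int = MAX_CHARS) -> str:
    """Strip what the model can't use (scripts, styles, base64 blobs, markup),
    then enforce the budget before the text reaches any LLM call."""
    text = SCRIPT_OR_STYLE.sub(" ", html)
    text = DATA_URI.sub(" ", text)
    text = TAG.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]
```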
Preventions
Design Token budget estimation before every LLM call. If over budget, trigger preprocessing. Never send raw scraped content without budget checking.
Design HTML-to-LLM preprocessing function. Standardized: strip scripts, styles, base64, SVGs, comments, whitespace. Sits between every scraper and every LLM call.
🏗️

Architecture Evolution

8 issues
Design File-per-item storage doesn't scale
One directory per item with metadata.json. Thousands of small files at 1000+ items.
Results table in SQLite.
Deep Fix
Storage strategy must be chosen based on expected item count, not developer convenience. File-per-item is fine for <50 items with human browsing. For machine-processed collections, a database is the only sane default.
Preventions
Design "Will this exceed 100 items?" → database. If uncertain, database. Files for artifacts humans open directly.
Process Design review: data volume. Every new data store should document expected cardinality.
Design Consolidate 1145 HTML files into SQLite
VIC ideas as 1,145 individual HTML files.
Single SQLite DB with full-text search.
Deep Fix
Same principle, rediscovered independently. Convention enforcement prevents rediscovery cost. Without the rule, every project re-discovers the file-per-item antipattern.
Preventions
Process Pattern library for solved problems. Record the decision rule, not just the fix. "Collections >100 → SQLite" prevents the entire class.
Design No cost tracking for LLM stages
Prompt bug could burn API budget silently.
Handlers return _cost. Runner logs. Guard auto-pauses.
Deep Fix
Every resource-consuming operation needs metering as a first-class concern. Cost isn't a reporting feature — it's a safety mechanism. A stage that consumes resources without reporting cost can bankrupt you silently.
Preventions
Type Cost as a required StageResult field. If a handler doesn't report cost, the runner flags it.
Observe Budget circuit breaker. Per-job and per-run cost ceilings. Auto-pause + alert when exceeded.
Design Sequential stage iteration → async workers
One item through all stages before next. Stage B idle while A processes.
Independent async worker per stage with semaphore concurrency.
Deep Fix
Pipeline parallelism is the natural model for independent stages. When stages don't share mutable state, they can and should run concurrently. Design for the execution model that matches the data flow.
Preventions
Design Identify parallelism during pipeline design. Draw the data flow graph. If stages connect by queues (not shared state), they can run in parallel.
Design Single-stage pipeline → multi-stage
All processing in one process() call. Failure at minute 8 = redo all 10 minutes.
Multi-stage with stages JSON. Each stage commits independently.
Deep Fix
Stage decomposition is about failure isolation, not code organization. The question to ask: "What is the most expensive work I would have to redo on failure?" Draw stage boundaries right after the most expensive operations, so their output is committed and never has to be redone.
Preventions
Design "What's the most expensive redo on failure?" identifies boundaries better than any architecture diagram.
Type Stage result persistence is non-negotiable. Framework must enforce: every stage persists output before next stage begins.
Design No output versioning
Code changes didn't trigger reprocessing. Stale results persisted.
Hash handler source. Detect stale items. "Reprocess Stale" button.
Deep Fix
Data lineage: every output must know what code and inputs produced it. Versioned computation = storing (input_hash, code_hash, output) so staleness is detectable, not guessable.
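A sketch of the staleness check, assuming stored results carry `code_hash` and `input_hash` fields (the record shape is illustrative):

```python
import hashlib
import inspect
import json

def code_hash(handler) -> str:
    """Changes whenever the handler's source changes."""
    return hashlib.sha256(inspect.getsource(handler).encode()).hexdigest()[:16]

def input_hash(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:16]

def is_stale(stored: dict, handler, payload: dict) -> bool:
    """Stale if either the code or the inputs that produced the output changed."""
    return (stored.get("code_hash") != code_hash(handler)
            or stored.get("input_hash") != input_hash(payload))
```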
Preventions
Design Versioned computation. Every derived artifact stores the hash of its code + inputs. Staleness detection becomes a comparison.
Observe Staleness dashboard. % of items matching current handler hash. Sudden drop after code change → shows reprocessing scope.
Perf IRS ZIP: downloaded 300MB–3.5GB for a few KB of XML
Full ZIP download to extract individual small files.
HTTP range requests — kilobytes instead of gigabytes.
Deep Fix
Understand a data format's access patterns before choosing a download strategy. ZIP keeps a central directory that can be read with a small range request. A 1000× saving was available from day one; the only cost was reading the format spec.
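A sketch of the first range request, assuming the `requests` library and a server that honors `Range` headers; it locates the ZIP end-of-central-directory record in the file's tail:

```python
import struct
import requests  # assumed available; server must support Range requests

def remote_zip_directory(url: str, tail_bytes: int = 65_536) -> tuple[int, int]:
    """Return (central_directory_offset, central_directory_size) from one small request.

    A second range request for that slice lists every member; a third fetches
    only the member you need. Kilobytes instead of gigabytes.
    """
    tail = requests.get(url, headers={"Range": f"bytes=-{tail_bytes}"}, timeout=30).content
    eocd = tail.rfind(b"PK\x05\x06")      # end-of-central-directory signature
    if eocd == -1:
        raise ValueError("EOCD not found in tail; increase tail_bytes")
    cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12: eocd + 20])
    return cd_offset, cd_size
```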
Preventions
Design Check if random access is possible before bulk download. "Do I need all of it?" → check if format and server support partial access.
Process Cost model for data acquisition. Estimate download size × frequency. If absurd, there's a better approach.
Design Round-based discover → process → sleep loop
Sequential rounds with full-round sleeps. New items waited for round completion.
Persistent async workers, continuous flow. 1–5s backoff.
Deep Fix
Items should flow through the pipeline as a stream, not accumulate into batches. Polling is for checking external state; processing should be triggered by item availability, not by a timer.
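A sketch of the decoupled discovery side (the `discover` callable and backoff values are illustrative; workers simply `await queue.get()` on the other end):

```python
import asyncio

async def discovery_loop(queue: asyncio.Queue, discover, busy_sleep=1, idle_sleep=5) -> None:
    """Discovery writes to the queue on its own cadence; processing never waits for a round."""
    while True:
        new_items = await discover()          # poll external state
        for item in new_items:
            await queue.put(item)             # items flow immediately
        # Short backoff instead of a full-round sleep.
        await asyncio.sleep(busy_sleep if new_items else idle_sleep)
```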
Preventions
Design Items should flow, not wait for round boundaries. Default to item-level flow (async queue → worker pool). Batch only when there's a genuine reason.
Design Separate discovery cadence from processing cadence. Decouple with a queue: discovery writes, workers read.
🤖

LLM Integration & Conventions

6 issues
Bug max_tokens truncated JSON responses
max_tokens too low → truncated mid-JSON → broke parsing.
Remove max_tokens entirely for structured output.
Deep Fix
Never cap output length when the output format requires completeness. JSON, XML, YAML are all-or-nothing: truncated = garbage. If cost is a concern, reduce input size or use a cheaper model — don't truncate structured output.
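A sketch of a call helper that owns these parameters, so callers cannot reintroduce a cap (`client.complete` is a placeholder for whatever SDK call the project actually uses):

```python
import json

def llm_structured(client, prompt: str, schema=None) -> dict:
    """Structured-output calls: no max_tokens, validate the response immediately."""
    raw = client.complete(prompt=prompt, temperature=0)   # note: max_tokens never passed
    try:
        data = json.loads(raw)     # catches truncation right at the boundary
    except json.JSONDecodeError as exc:
        raise ValueError(f"non-JSON response (truncated?): {raw[:200]!r}") from exc
    if schema is not None:
        schema.model_validate(data)   # e.g. a Pydantic model for structural checks
    return data
```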
Preventions
Type Structured output mode auto-omits max_tokens. LLM call helper enforces this. Impossible to accidentally truncate structured responses.
Test Validate structure immediately. json.loads() on raw response catches truncation. Pydantic/jsonschema catches structural problems.
Convention Native web search inconsistent across providers
native_web_search=True gave wildly different behavior per provider.
Standardized on tools=["web_search"] (Serper API).
Deep Fix
Abstract over provider differences at the tool layer. It's better to own the implementation of a capability than to depend on N providers implementing it identically. Provider-specific flags that "should work the same" are a reliability trap.
Preventions
Design Provider-agnostic tool interface. Important capabilities implemented once in your tool layer, not delegated to provider-specific features.
Convention temperature=0 for structured extraction
Default temperature added noise to factual extraction.
temperature=0 for all extraction.
Deep Fix
LLM parameters should be determined by task type, not left to defaults. Extraction wants determinism. Generation wants randomness. A task-type → LLM-config mapping makes correct parameters automatic.
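A minimal sketch of such a mapping (anything beyond the temperature values is illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    temperature: float
    # other knobs (top_p, model tier, ...) would live here too

PRESETS = {
    "extraction": LLMConfig(temperature=0.0),   # determinism for factual extraction
    "generation": LLMConfig(temperature=0.7),   # variety is the point
}

def config_for(task_type: str) -> LLMConfig:
    """Callers specify intent ('extraction'), not parameters."""
    return PRESETS[task_type]
```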
Preventions
Type Task-type → LLM config presets. extraction(temperature=0), generation(temperature=0.7). Callers specify intent, not parameters.
Convention Structured logging via get_handler_logger
f-string logs not machine-parseable.
Structured kwargs: log.info("scored", score=42, source="youtube").
Deep Fix
Logs are data, not messages. Structured logs are queryable from day one. The cost of structured logging is near zero at write time; the cost of unstructured logging compounds at read time.
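A minimal sketch of what a `get_handler_logger` built on the standard library might look like; the real helper may differ, or simply wrap structlog:

```python
import json
import logging

class StructuredLogger:
    """Kwargs in, one JSON object per line out: queryable from day one."""

    def __init__(self, name: str):
        self._log = logging.getLogger(name)

    def info(self, event: str, **fields) -> None:
        self._log.info(json.dumps({"event": event, **fields}))

def get_handler_logger(name: str) -> StructuredLogger:
    return StructuredLogger(name)

# log = get_handler_logger("scoring")
# log.info("scored", score=42, source="youtube")
```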
Preventions
Process Structured logging from day one. Retrofitting is painful. Starting structured is free — just a different calling convention.
Observe Log-based metrics. When logs are structured, metrics come for free: count by type, compute percentiles, track rates.
Convention Raw data first: cache artifacts for re-extraction
No raw caching → schema changes required re-fetching (rate-limited, deleted, paywalled).
Fetch stores raw. Extract parses from cache.
Deep Fix
Separate data acquisition from data interpretation. Acquisition is expensive, rate-limited, non-deterministic, sometimes irreversible. Interpretation is cheap, repeatable, improvable. By caching raw, better extraction logic is a free upgrade across your entire historical dataset.
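A sketch of the split, with acquisition and interpretation as separate functions over an immutable raw layer (paths and callables are illustrative):

```python
from pathlib import Path

RAW_DIR = Path("raw")   # immutable raw layer

def fetch(item_id: str, url: str, http_get) -> Path:
    """Acquisition: expensive, rate-limited, sometimes unrepeatable. Store bytes as-is."""
    raw_path = RAW_DIR / f"{item_id}.html"
    if not raw_path.exists():               # never re-fetch what is already cached
        raw_path.parent.mkdir(parents=True, exist_ok=True)
        raw_path.write_bytes(http_get(url))
    return raw_path

def extract(item_id: str, parse) -> dict:
    """Interpretation: cheap and repeatable. Re-run freely as the parser improves."""
    return parse((RAW_DIR / f"{item_id}.html").read_bytes())
```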
Preventions
Design Immutable raw layer. Cheap (storage is cheap) and invaluable (re-fetching is expensive, sometimes impossible).
Design Fetch and extract are always separate stages. Fundamentally different failure modes, costs, and improvement cadences.
Convention Documents in system prompt with XML tags for caching
Documents in user prompt re-sent every call. No prompt caching benefit.
Static documents in system prompt → 80–90% cost reduction via caching.
Deep Fix
Prompt architecture is cost architecture. Most providers cache the system prompt prefix. Static context there, dynamic content in user prompt. A 10× cost reduction from prompt restructuring dwarfs model-switching savings.
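A sketch of the restructuring: static documents go into the system prompt inside XML tags, and only the question varies per call. Provider-specific cache controls are omitted; check your provider's prompt-caching rules for what counts as a cacheable prefix.

```python
def build_prompts(static_documents: list[str], question: str) -> tuple[str, str]:
    """Return (system_prompt, user_prompt); the system prompt is identical across calls."""
    docs_block = "\n".join(
        f"<document index='{i}'>\n{doc}\n</document>"
        for i, doc in enumerate(static_documents)
    )
    system_prompt = (
        "Answer using only the documents below.\n"
        f"<documents>\n{docs_block}\n</documents>"
    )
    return system_prompt, question   # user prompt: small, changes every call
```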
Preventions
Design Prompt architecture for cost. Treat prompt construction as a caching problem. Static context in cacheable prefix; dynamic in user prompt.
Process Periodic cost review of top-5 LLM call patterns. Check if static content is in cacheable prefix. A few hours of restructuring can halve monthly costs.

Cross-Cutting Principles

These principles each prevent 5+ issues from the list above. They're the highest-leverage investments.

1. Make illegal states unrepresentable

Return Result[T, Error], not None. Design handler return types that encode non-emptiness. Use StageResult.success(data) that validates at construction time. When the type system makes it impossible to represent an invalid state, you don't need runtime checks — the code won't pass the type checker. Eliminates the entire class of "None/empty treated as success" bugs, the hardest to detect because they produce silent corruption.

Prevents: empty result as success, None propagation, silent data loss, unvalidated LLM responses, stages on incomplete input
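A minimal sketch of such a result type (field names are illustrative; the point is that `success()` refuses to construct an empty success):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class StageError:
    kind: str        # "transient" | "permanent" | "systemic"
    message: str

@dataclass(frozen=True)
class StageResult:
    data: dict[str, Any] | None
    error: StageError | None

    @classmethod
    def success(cls, data: dict[str, Any]) -> "StageResult":
        if not data:
            # "Empty success" is unrepresentable: constructing it is itself an error.
            raise ValueError("StageResult.success requires non-empty data")
        return cls(data=data, error=None)

    @classmethod
    def failure(cls, kind: str, message: str) -> "StageResult":
        return cls(data=None, error=StageError(kind, message))
```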

2. Distinguish silence from health

Zero throughput should alarm, not reassure. A dashboard showing "0 errors" when nothing is processing isn't healthy — it's brain-dead. Track items-in vs items-out at every stage boundary. Alert when throughput falls to zero. The most dangerous state isn't "failing loudly" — it's "doing nothing quietly," because humans interpret silence as health.

Prevents: dashboard queued but nothing moves, silent stalls, worker pools draining to zero, discovery stopping, stages silently skipping all items

3. Classify errors by kind, not by source

Every error is: transient (retry with backoff), permanent (fail this item), or systemic (halt the job). One classifier function makes this decision. When a new error appears, add one line to the classifier — not a new try/except in every stage.

Prevents: rate limit trips circuit breaker, SSL as permanent, transient fails item permanently, bot detection retried infinitely, cookie expiry as item failure
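A sketch of a single classifier function; the rules shown are illustrative, not the project's actual table:

```python
def classify(exc: Exception) -> str:
    """One decision point: transient (retry), permanent (fail item), systemic (halt job)."""
    text = str(exc).lower()
    if isinstance(exc, (TimeoutError, ConnectionError)) or "rate limit" in text:
        return "transient"    # retry with backoff; never trip the circuit breaker
    if "bot detection" in text or "login required" in text:
        return "systemic"     # halt the job; per-item retries make it worse
    return "permanent"        # fail this item, keep the job moving

# A new error kind means one new line here, not a new try/except in every stage.
```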

4. Cheap before expensive

Validate before downloading. Score before transcribing. Free API before paid. Build a cost model and order stages by ascending cost with early termination at each gate. The cheapest operation that can reject an item should run first. A pipeline downloading 3.5GB to check if a file exists reveals cost wasn't part of the design thinking.

Prevents: downloading GBs for KBs, transcribing filtered-out audio, LLM on invalid content, re-fetching unchanged data, expensive calls on duplicates

5. Stages declare preconditions, framework enforces them

Each stage declares required inputs. The runner checks before invoking. Handlers never run in an invalid state, so handler code can assume valid inputs — eliminating defensive checks and the bugs from getting those checks wrong.

Prevents: extract before fetch, analysis on empty transcript, metadata without required fields, out-of-order after partial failure, crashes on missing input

6. Every fallback is a lie until tested

Fallbacks handle rare conditions, which means they rarely run, which means they rarely get tested, which means they rarely work when needed. Worse: a fallback producing wrong results silently is more dangerous than no fallback. If it runs for a week unnoticed, is the data correct? If you can't answer confidently, remove it — fail fast is safer.

Prevents: silent degradation, fallback masking real error, backup source returning stale data, retry succeeding with corrupt data, plausible-but-wrong defaults

7. Canary the environment

External dependencies change without notice. Don't wait for the pipeline to fail on real work — periodically test with known-good inputs and expected outputs. A canary that runs hourly and checks "can I still fetch this URL and get this field?" catches platform changes before they corrupt a full run. Cost is trivial; early warning is invaluable.

Prevents: PO token expiry after 1000 failures, bot detection breaking pipeline for days, cookie expiry causing silent auth failures, API schema changes, rate limit policy changes

8. Separate data acquisition from data interpretation

Fetching is expensive, rate-limited, non-deterministic, sometimes irreversible. Interpreting is cheap, repeatable, improvable. By storing the raw artifact before any transformation, better extraction logic is a free upgrade across your entire history. Schema changes don't require re-crawling. Raw is immutable. Interpretation is versioned. The single most valuable pattern for data pipelines.

Prevents: schema change requires re-fetch, extraction bug requires re-crawl, prompt improvement can't reach history, debugging requires reproducing fetch, data loss from source changes