For each issue in the Jobs Issues Extraction: the structural fix, and the design/process/testing failures that allowed it to exist.
The immediate fix is to raise RetryLaterError when no data is returned; the structural fix is a result type (StageResult) that cannot be constructed without data. The runner should reject None/{} at the framework level, not rely on each handler to remember to check. A StageResult.success(data) factory that validates non-emptiness makes "empty success" impossible to express.
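A minimal sketch of such a factory, assuming a plain dataclass; StageResult and the success(data) factory come from the text, the field and exception names are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Mapping


class EmptyResultError(ValueError):
    """Raised when a handler tries to report success with no data."""


@dataclass(frozen=True)
class StageResult:
    data: Mapping[str, Any]

    @classmethod
    def success(cls, data: Mapping[str, Any]) -> "StageResult":
        # Refuse to construct an "empty success": None or {} is a bug, not a result.
        if not data:
            raise EmptyResultError("StageResult.success() requires non-empty data")
        return cls(data=dict(data))
```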
Once process_stage() returns StageResult instead of dict | None, the runner can enforce the contract. The bug wasn't in the handler — it was in the API that let handlers return nothing and call it success. The test to write: a handler that returns {}. If the test passes (the item stays pending), you've tested the contract. This test would have caught every "empty = success" bug in this category.

Beyond RetryLaterError, a fuller error taxonomy — RetryLater (transient), PermanentFailure (this item is broken), SystemFailure (the stage itself is broken, stop everything) — would have made the circuit breaker automatic. An import error is always a SystemFailure: retrying won't fix missing code.
Once the runner receives RetryLater | PermanentFail | SystemHalt, it can dispatch correctly by construction. A bare except Exception that retries everything is the anti-pattern — it means you haven't thought about which errors are which.

Next issue: items stuck in_progress are never picked up again. The immediate fix is to reset in_progress to pending on startup. Structurally, in_progress is a lease, not a state. It should have an expiry. Instead of a bare status flag, use a claimed_at timestamp plus a claimed_by worker ID. Any item claimed more than N minutes ago by a dead worker is automatically released. This makes crash recovery continuous rather than requiring a restart — and it works in multi-worker scenarios where one worker dies but others are still alive.
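A minimal sketch of lease-style claiming against SQLite (3.35+ for RETURNING); claimed_at and claimed_by come from the text, while the table name, the lease length, and the rest of the schema are illustrative:

```python
import sqlite3
import time

LEASE_SECONDS = 15 * 60  # "N minutes": a claim older than this is treated as expired


def claim_next_item(conn: sqlite3.Connection, worker_id: str) -> int | None:
    """Atomically claim one unclaimed (or expired-lease) item for this worker.

    A crashed worker's claim simply ages out; nothing needs a restart to release it.
    A real version would also exclude completed items.
    """
    now = time.time()
    cur = conn.execute(
        """
        UPDATE items
           SET claimed_at = ?, claimed_by = ?
         WHERE id = (
               SELECT id FROM items
                WHERE claimed_at IS NULL OR claimed_at < ?
                ORDER BY id
                LIMIT 1)
     RETURNING id
        """,
        (now, worker_id, now - LEASE_SECONDS),
    )
    row = cur.fetchone()
    conn.commit()
    return row[0] if row else None
```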
Whenever you write status='in_progress', ask: "what happens if the writer dies before writing status='done'?" If the answer is "it's stuck forever," you've designed a leak. Leases with a TTL are the standard solution to this problem (see: SQS visibility timeout, Kubernetes pod eviction, distributed lock TTLs).

Related issue: keeping status as a column AND stages.fetch: "in_progress" inside a JSON column means two sources of truth for "is this item being worked on?" They can (and did) disagree. The stage-level status should be the only source; derive the item-level status from it. Single source of truth, computed views.
A consistency check (status='pending' ⟹ no stage has 'in_progress') would have caught the 637 stuck items immediately.

Next issue: when a stage's input is missing, raise PreconditionFailed("VTT file not found at {path}"). The runner can then distinguish "precondition not met" (wait for upstream) from "stage failed" (something broke). This is the guard clause pattern applied to pipeline stages.
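A minimal sketch of that guard in the runner, assuming stage configs declare a requires list (described next) and items are plain dicts; PreconditionFailed comes from the text, everything else is illustrative:

```python
class PreconditionFailed(Exception):
    """Inputs for this stage are not ready yet: wait for upstream, don't mark failed."""


def check_preconditions(stage_config: dict, item: dict) -> None:
    # Called by the runner before it invokes the handler.
    for key in stage_config.get("requires", []):
        if not item.get(key):
            raise PreconditionFailed(
                f"required input {key!r} missing for item {item.get('id')}"
            )
```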
requires: ["vtt_path"]. The runner checks preconditions before invoking the handler. The handler never runs in an invalid state.None → runner converted to {} → "success" with no data.RetryLaterError; fix handler wrapper.None → {} conversion is the root bug. The framework should never coerce a handler's return value into something the handler didn't intend. None means "I have nothing to say" — the framework should treat that as an error, not as "success with empty data." The fix is: if result is None: raise StageError("Handler returned None") in the runner itself.
Stop treating None as a valid handler return. Type annotation: async def process_stage(...) -> StageResult (not -> dict | None). A linter or runtime check in the runner catches violations. This single change prevents every "None-as-success" bug.

In the same family: a run_subprocess() wrapper that returns stdout or {} on any error is dangerous. The wrapper should return a SubprocessResult(stdout, stderr, exit_code) and let the caller decide what constitutes success. Never swallow subprocess failures at the wrapper level.
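A minimal sketch of such a wrapper using asyncio subprocesses; SubprocessResult and its three fields come from the text, the rest is illustrative:

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class SubprocessResult:
    stdout: str
    stderr: str
    exit_code: int

    @property
    def ok(self) -> bool:
        return self.exit_code == 0


async def run_subprocess(*argv: str) -> SubprocessResult:
    # Return everything; never decide for the caller what counts as success.
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    code = proc.returncode if proc.returncode is not None else -1
    return SubprocessResult(
        stdout=stdout.decode(errors="replace"),
        stderr=stderr.decode(errors="replace"),
        exit_code=code,
    )
```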
Another issue: INSERT OR REPLACE does DELETE + INSERT, so columns not in the INSERT list get destroyed. The immediate fix is ON CONFLICT DO UPDATE SET col=excluded.col; the rule is to always write INSERT ... ON CONFLICT DO UPDATE (UPSERT). This should be enforced by a lint rule or a code-review checklist. Better yet, route all DB writes through a db.upsert(table, data, conflict_keys) helper that always uses the safe pattern.
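A minimal sketch of that helper for SQLite; the db.upsert(table, data, conflict_keys) signature comes from the text, written here as a free function that also takes the connection, with no identifier validation:

```python
import sqlite3
from typing import Any, Mapping, Sequence


def upsert(
    conn: sqlite3.Connection,
    table: str,
    data: Mapping[str, Any],
    conflict_keys: Sequence[str],
) -> None:
    cols = list(data)
    placeholders = ", ".join("?" for _ in cols)
    # Only the columns we are actually setting get updated; everything else survives.
    updates = ", ".join(f"{c} = excluded.{c}" for c in cols if c not in conflict_keys)
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    sql = (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) {action}"
    )
    conn.execute(sql, [data[c] for c in cols])
```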
Ban INSERT OR REPLACE project-wide. Add it to the CLAUDE.md conventions and grep for it in CI. It's never what you want when the table has columns you're not setting. With a db.upsert() that always generates ON CONFLICT DO UPDATE, developers never write raw INSERT for upserts.

A related gap: raw.json exists on disk but there is no entry in the results table. The fix is a backfill_from_disk_to_db() step.
Another issue: missed conn.close() calls leaked connections, and Gradio hot-reload worsened it. The immediate fix is with closing(get_db()) as conn: everywhere; the structural fix is a with db_session() as s: block. The connection is opened, used, and closed in one scope — it can't leak or be used after close because the variable is scoped to the with block.
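A minimal sketch of that context manager, assuming a sqlite3-backed get_db(); db_session and get_db are names from the text, while the database path and the commit/rollback behavior are assumptions:

```python
import sqlite3
from contextlib import contextmanager


def get_db() -> sqlite3.Connection:
    return sqlite3.connect("jobs.db")  # path is illustrative


@contextmanager
def db_session():
    conn = get_db()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()  # always closed, even on error or early return
```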
With with db_session() as conn:, you can't forget to close it. The API prevents the bug.

Next: a rename, get_handler_log → get_handler_logger, left stale call sites, and asyncio.gather(return_exceptions=True) silently ate all the resulting ImportErrors. Two lessons: (1) asyncio.gather(return_exceptions=True) is a silent-failure factory — it converts crashes into list elements that nobody checks; use a wrapper that logs or raises if any result is an exception. (2) Renaming without grep is a time bomb. Every rename should be accompanied by a project-wide search for the old name. IDE refactoring tools do this; manual renames don't.
Wrap asyncio.gather in a helper that raises on any exception result: an async def gather_or_raise(*coros) that iterates the results and re-raises the first exception. Use it instead of bare gather(return_exceptions=True) everywhere. A smoke test that does import jobs.handlers.vic_ideas etc. for every handler (sketched below) catches 100% of import-time errors and takes <1s to run, and rg 'get_handler_log[^g]' would have found all 11 stale call sites. Record an error history (a list of {stage, error, timestamp}) instead of a single error field. This is the same principle as audit logs vs current state: you want the full history of what went wrong, not just the last thing.
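A sketch of that import smoke test, assuming pytest and a jobs.handlers package whose submodules can be discovered with pkgutil; the module name jobs.handlers.vic_ideas comes from the text, the discovery approach is an assumption:

```python
import importlib
import pkgutil

import pytest

import jobs.handlers


def _handler_modules() -> list[str]:
    # Every module under jobs/handlers/; new handlers are picked up automatically.
    return [m.name for m in pkgutil.iter_modules(jobs.handlers.__path__, "jobs.handlers.")]


@pytest.mark.parametrize("module_name", _handler_modules())
def test_handler_imports(module_name):
    # Fails loudly on any import-time error (missing deps, stale renames, syntax errors).
    importlib.import_module(module_name)
```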
The broader issue: return_exceptions=True silently eats exceptions. Ban bare asyncio.gather(return_exceptions=True) and replace it with a project utility, gather_logged(*coros, logger=log), that automatically inspects results and logs/counts any exceptions. Make the safe version the easy version.
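A minimal sketch of that utility; the name and signature gather_logged(*coros, logger=log) come from the text, the exact logging and return behavior are assumptions:

```python
import asyncio
import logging


async def gather_logged(*coros, logger: logging.Logger) -> list:
    """Like asyncio.gather(return_exceptions=True), but exceptions never pass silently."""
    results = await asyncio.gather(*coros, return_exceptions=True)
    failures = [r for r in results if isinstance(r, BaseException)]
    for exc in failures:
        logger.error("gathered task failed: %r", exc)
    if failures:
        logger.warning("%d of %d gathered tasks failed", len(failures), len(results))
    return results
```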
Because gather_logged is easier to use than raw asyncio.gather, developers will naturally use the safe version. A lint rule can flag any return_exceptions=True without a corresponding exception check in the same scope. A handler that always returns {} or produces identical output regardless of input is vacuous. The runner should flag it.

For fetching, use an escalation ladder: direct → proxy → proxy+cookies → browser. Start cheap, escalate on failure, remember what works per-domain.

Another issue: the --remote-components ejs:github flag was repeated across 7 call sites. The fix is a ytdlp_command(url, **kwargs) -> list[str] factory that always includes the base flags. Individual callers add only their specific flags. DRY applies to command construction, not just to code.
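A sketch of that factory; the signature and the --remote-components ejs:github flag come from the text, while the kwargs-to-flag translation and any other flags are assumptions, not the project's actual base set:

```python
def ytdlp_command(url: str, **kwargs: str | None) -> list[str]:
    """Build a yt-dlp invocation with the base flags every call site needs."""
    cmd = [
        "yt-dlp",
        "--remote-components", "ejs:github",  # the flag that was copy-pasted 7 times
    ]
    # Illustrative convention: keyword args become flags; None means a bare flag.
    for flag, value in kwargs.items():
        cmd.append(f"--{flag.replace('_', '-')}")
        if value is not None:
            cmd.append(str(value))
    cmd.append(url)
    return cmd
```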
Error handling should be an explicit classification map: {SSLError: RETRY, TimeoutError: RETRY, HTTPError(404): PERMANENT, HTTPError(429): RETRY, HTTPError(500): RETRY, ValueError: PERMANENT}. New error types get an explicit classification. Unknown errors default to RETRY with a logged warning — you don't know if it's transient, so assume it might be.
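A minimal sketch of that map as a single helper, assuming ssl and httpx exception types; the classify_error name and the three outcomes come from the text, and the Outcome enum plus the exact type-to-outcome mapping are illustrative:

```python
import enum
import ssl

import httpx


class Outcome(enum.Enum):
    RETRY_LATER = "retry_later"      # transient: try this item again later
    PERMANENT_FAIL = "permanent"     # this item is broken; don't retry
    SYSTEM_HALT = "halt"             # the stage itself is broken; stop everything


PERMANENT_HTTP_CODES = {400, 401, 403, 404, 410}


def classify_error(exc: BaseException) -> Outcome:
    if isinstance(exc, (ImportError, SyntaxError)):
        return Outcome.SYSTEM_HALT  # retrying won't fix missing code
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        return Outcome.PERMANENT_FAIL if code in PERMANENT_HTTP_CODES else Outcome.RETRY_LATER
    if isinstance(exc, (ssl.SSLError, TimeoutError, httpx.TimeoutException, httpx.TransportError)):
        return Outcome.RETRY_LATER
    if isinstance(exc, (ValueError, TypeError)):
        return Outcome.PERMANENT_FAIL
    # Unknown errors default to retry; the caller logs a warning.
    return Outcome.RETRY_LATER
```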
One classify_error(exc) -> RetryLater | PermanentFail | SystemHalt lives in one place, every handler uses it, and new error types get added in exactly one spot.

Another issue: synchronous httpx.get() calls blocking the event loop. Any httpx.get(), requests.get(), open(), or time.sleep() inside an async def is a bug. A ruff/flake8 rule, or a runtime warning when the event loop is blocked >100ms, catches these mechanically.
Make httpx.AsyncClient the default; don't even import the sync client in handler files. asyncio debug mode warns when the event loop is blocked >100ms; enable it in development.

For outbound requests, provide a get_http_client(proxy=True) factory where proxy defaults to on, plus a whitelist of domains that don't need the proxy (localhost, LLM APIs). Any new domain automatically gets proxied. The footgun of forgetting to proxy disappears.
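A sketch of that factory, assuming a recent httpx where AsyncClient accepts a proxy= argument; get_http_client, the proxy-on default, and the whitelist idea come from the text, while the environment variable, whitelist entries, and timeout are illustrative:

```python
import os

import httpx

# Domains that are safe to hit directly; everything else goes through the proxy.
NO_PROXY_DOMAINS = {"localhost", "127.0.0.1", "api.openai.com"}  # illustrative entries


def get_http_client(*, domain: str | None = None, proxy: bool = True) -> httpx.AsyncClient:
    if domain is not None and domain in NO_PROXY_DOMAINS:
        proxy = False
    proxy_url = os.environ.get("OUTBOUND_PROXY_URL") if proxy else None
    return httpx.AsyncClient(proxy=proxy_url, timeout=30.0)
```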
The only legitimate proxy=False is for a domain on the whitelist. Grep for direct httpx.get/requests.get calls and flag them. All HTTP calls should go through the project client.

For subprocess tools, a classify_result(stdout, stderr, exit_code) -> Success | Warning | AuthError | ContentError function encodes all the tool's quirks in one place. Callers get clean typed results.
retry() should be a single function that atomically moves an item from paused+failed_items to running+pending_items. Not three separate SQL updates that can partially complete. Declare stage dependencies as stage_deps in the YAML pipeline config. Default to keep_audio=False and delete audio after transcription. Keep the .manual.vtt convention. And an __init__ that isn't thread-safe — AttributeError and corrupted state; keep asyncio.run() and heavy setup out of __init__ logic.
Do imports in runner.py before any async work: a bootstrap_imports() in the runner gives a single auditable location for all worker dependencies.

Another issue: try: import litellm / except ImportError: pass on a required module. Silent failure. try/except ImportError on a required dependency is always wrong. It converts a clear, immediate failure into a delayed, mysterious one. Defensive programming turned pathological.
A bare pass in the except clause is a sign that the import is actually required.

Shared code lives in jobs/lib/; handlers never import from each other. A handler may import from jobs/lib/, lib/, and the stdlib — never from jobs/handlers/. Define a Protocol or ABC for the handler signature: any function called via indirection (dispatch table, registry, plugin) needs a typed contract, because the call site and the definition are decoupled.
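A minimal sketch of that contract as a Protocol, reusing the StageResult sketch from earlier; the parameter shape (a single item dict) is an assumption:

```python
from __future__ import annotations  # keep the StageResult annotation lazy

from typing import Any, Protocol


class StageHandler(Protocol):
    """Typed contract for anything registered in the handler dispatch table."""

    async def __call__(self, item: dict[str, Any]) -> StageResult: ...
```

A type checker can then verify every entry in the registry against this one signature, even though no call site names the handler directly.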
Make backfill a declared job kind (kind: backfill) in the pipeline config.

Another issue: 0.0 elapsed is falsy in Python, so fast stages were treated as missing data. The immediate fix is checking is not None explicitly; the rule is that optional numeric fields are always compared with is not None. Distinguish "absent" from "present but trivial" at the type level.
Checks on numeric fields should be explicit: is not None, or > 0 when you really mean "positive."

Another issue: auto table layout ignores width constraints. The fix is table-layout: fixed, and it's required, not optional. With auto, the browser's content-driven algorithm overrides your declarations.
Use table-layout: fixed for all data tables; auto only for prose tables where content should dictate layout.

Guard re-initialization with an _initialized flag.

Another recurring decision: matching tickers by $SYM prefix or company name. Route it through a match_ticker() utility so this decision is made once and correctly.
More generally, a match_identifier() utility (sketched below) enforces word boundaries by default, and callers never write their own regex. Don't use in for identifier matching. The in operator is for collection membership, not pattern matching in strings. Store event_date and fiscal_period as independent fields.

Another issue: a NOT LIKE '% (%' filter is a read-time bandage applied everywhere. The structural fix: resolve-on-write — normalize entity names before insertion (resolve_company + add_company).
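A sketch of that matcher; match_identifier and the word-boundary default come from the text, while the special handling of $SYM tickers and the case-insensitivity are assumptions:

```python
import re


def match_identifier(identifier: str, text: str) -> bool:
    """True if identifier appears in text as a whole token, not as a substring."""
    # \b does not fire between whitespace and '$', so anchor $SYM tickers on
    # whitespace or start-of-string instead.
    if identifier.startswith("$"):
        pattern = r"(?:^|\s)" + re.escape(identifier) + r"\b"
    else:
        pattern = r"\b" + re.escape(identifier) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

Callers then write match_identifier("$NVDA", post_text) instead of "$NVDA" in post_text (identifiers here are illustrative).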
Another issue: /posts/trending didn't exist, the job got zero items, and it still appeared healthy; zero results should never look like success. Cap data: URIs at 50K chars. Handlers report _cost, the runner logs it, and a guard auto-pauses on runaway cost. A single monolithic process() call means a failure at minute 8 forces redoing all 10 minutes; break it into smaller, resumable steps.

On LLM calls: max_tokens set too low truncated output mid-JSON and broke parsing; drop max_tokens entirely for structured output. json.loads() on the raw response catches truncation; Pydantic/jsonschema catches structural problems. native_web_search=True gave wildly different behavior per provider; use tools=["web_search"] (Serper API) instead. Use temperature=0 for all extraction, and expose named presets, extraction(temperature=0) and generation(temperature=0.7), so callers specify intent, not parameters. Log structured events: log.info("scored", score=42, source="youtube").

These principles each prevent 5+ issues from the list above. They're the highest-leverage investments.