Jobs System — Comprehensive Issues & Lessons

Extracted from 152 git commits, 141 Claude Code sessions, and handler code analysis. Jan 22 – Feb 8, 2026.

65
Issues
18
Silent Failures
12
Rate Limit / Auth
10
Stage Design
25
Design / Other
Badges: Bug Broke something Gotcha Subtle trap Design Architecture issue Convention Rule established Perf Performance fix Refs: abc1234 git commit sess:abcd1234 Claude Code session

Categories

1. Silent Failures & Data Integrity (18 issues) 2. Rate Limiting, Auth & Bot Detection (12 issues) 3. Stage Design & Pipeline (10 issues) 4. Import & Startup (5 issues) 5. Dashboard & UI (7 issues) 6. Discovery & Content Quality (9 issues) 7. Architecture Evolution (8 issues) 8. LLM Integration & Conventions (6 issues)
💀

Silent Failures & Data Integrity

18 issues
Bug Empty IB response marked as done
When Interactive Brokers was disconnected, stage_ib returned empty data and marked the item as done — permanently losing it from the queue.
Raise RetryLaterError when no data returned. Item stays in queue for retry.
a3bcb49 sess:5a284f31 sess:dccd4a54
Bug IB fallback silently used ephemeral connections
stage_ib fell back to creating ephemeral IB connections when stockloader wasn't running. Silently degraded performance and missed pacing constraints.
Removed fallback. Fail fast with RetryLaterError("Stockloader service not running").
c9ae402 sess:1ce49014 sess:dfafcc94
Bug Circuit breaker: infinite retry, dashboard showed "queued"
Job-loop exceptions (e.g. import errors) were caught, logged, and retried forever. Dashboard showed "queued" — nothing processed.
Track consecutive failures per stage. After 3, auto-pause job with error message visible in dashboard.
6e907fc sess:13089c42 sess:286c20f2
Bug Stuck in_progress items from crashed runner
If the runner process crashed, items left in in_progress state never got picked up again.
Reset all in_progress items to pending on runner startup.
0d4844c sess:290637e3 sess:fce7a6a1
Bug 637 items stuck at stage-level in_progress
The startup reset fixed item-level status='in_progress' but not stage-level in_progress inside the stages JSON column. Items had stages={"fetch":"in_progress"} but status='pending'invisible to stage workers.
Reset both item-level and stage-level in_progress on startup. Fixed 637 stuck items across all jobs.
7a16e66 sess:eb873e16
Bug VTT extract returned silently empty
When VTT file was missing, extract stage returned empty result instead of failing. Downstream got empty transcripts and produced garbage — but item marked done.
Raise exception when VTT is missing.
c7bd0a7 sess:01abe80c sess:44028ed3
Bug Earnings transcript marked done with no transcript
Handler returned None which runner converted to {} — "success" with no data. Item moved to done with no replay URL, no content.
Raise RetryLaterError when no replay URL. Fix handler wrapper propagating None.
71eb7b3 sess:26196664 sess:d8a7f79f
Bug yt-dlp metadata silently returned {} on auth errors
No error propagation from yt-dlp subprocess on bot/auth failures. Returned empty dict, treated as success.
Raise on bot/auth errors; include last error in circuit breaker pause_reason.
7c94adb sess:44028ed3
Bug INSERT OR REPLACE destroys columns
INSERT OR REPLACE in SQLite does DELETE + INSERT. Columns not in the list (like raw_html) get destroyed. Fetch stored HTML, then extract's INSERT OR REPLACE wiped it.
Use ON CONFLICT DO UPDATE SET col=excluded.col or plain UPDATE.
sess:1d6bcec3
Bug Moltbook extract: pre-migration data invisible
Items fetched before results-table migration had raw.json on disk but no entry in the results table. Extract stage only checked results table.
Check results table first, fall back to disk for transitional items.
a41caf4
Bug Duplicate HTML served for different VIC ideas
VIC served same HTML for different idea URLs when cookies expired. Basic checks passed. Discovered only after extracting 100+ ideas with identical text.
Detect duplicate HTML by comparing lengths. Track quality_flags: ["possible_dupe_html"].
sess:1d6bcec3
Bug SQLite "closed database" in dashboard
get_queue() called DB functions after conn.close(). Gradio hot-reload made this worse.
Use with closing(get_db()) as conn: context managers everywhere.
a701696 sess:13089c42 sess:fce7a6a1
Bug 11 of 12 handlers broken by function rename
get_handler_log renamed to get_handler_logger but call sites not updated. asyncio.gather(return_exceptions=True) silently ate all the ImportErrors — nothing worked, no errors shown.
Fix all imports. Log crashed tasks from asyncio.gather results.
86719bd sess:72a2aa12 sess:cd65b4a5
Bug Stage errors lost during item transitions
Errors from earlier stages were overwritten when later stages also failed. Only the last stage's error survived.
Store stage errors in stages JSON ({stage}_error) so they survive transitions.
2b47ab4 sess:d8a7f79f sess:f2637de6
Gotcha Handler not async when dependency became async
fetch_ib_historical_ticks was made async but caller was still sync. Error: "cannot unpack non-iterable coroutine object".
Made entire handler chain async: stage_ib → process_stage → process().
19bc1e9
Gotcha yt-dlp non-zero exit with valid output
yt-dlp returns non-zero exit code for warnings (n-challenge, POT server) but stdout has valid content. Treating non-zero as failure threw away good data.
Check if stdout has content. Only error on non-zero + empty output.
1c19fe6 sess:f5474c12
Convention No asyncio.gather exception logging
asyncio.gather(return_exceptions=True) silently eats exceptions. Exceptions stored in results list but never logged.
Always iterate results and log isinstance(result, BaseException) entries.
sess:2d6b34ac sess:0d1f1138
Gotcha Metadata stage vacuously succeeded without transcript
Metadata stage ran on discovery data only, succeeding vacuously without any transcript input. Looked green on dashboard, produced zero useful enrichment.
Merged transcript + metadata into single stage. Remove separate metadata stage.
eed4732
🛡️

Rate Limiting, Auth & Bot Detection

12 issues
Bug VIC rate-limit page passed structural checks
VIC's "Please wait, your view count will reset" page had enough HTML structure to pass basic checks. Stored as valid thesis HTML.
Three-layer defense: (1) text detection for "access limits", "view count will reset"; (2) HTML structure validation; (3) LLM sanity check via flash-lite (~$0.0001/call).
8f5ef43
Bug VIC "Please wait 24 hours" — second rate-limit variant
A different rate-limit text format from VIC that wasn't caught by the first detection pass.
Added explicit string detection for "Please wait 24 hours".
ab07e4d
Bug YouTube bot detection blocked downloads
yt-dlp downloads failing with "Sign in to confirm you're not a bot". Datacenter proxy alone wasn't enough.
Combine BRIGHTDATA_PRIMARY_PROXY + YTDLP_COOKIES_BROWSER=chrome. Raise BotDetectedError.
3754542 edd62ca sess:01abe80c
Bug YouTube JS challenge solver broken
Missing --remote-components ejs:github flag caused JS challenge failures across all 7 yt-dlp call sites.
Add flag to all yt-dlp call sites.
63b3cca sess:26196664 sess:708b6c4c
Bug PO token requirement for YouTube
YouTube added PO token requirement for certain content. Downloads failing with no clear error.
Install bgutil-ytdlp-pot-provider plugin; player_client fallback chain: android → tv_downgraded → web.
8611053 sess:f2637de6 sess:d8a7f79f
Bug SSL/timeout errors classified as permanent failure
Network timeouts and SSL errors classified as failed (permanent). Items never retried.
Classify SSL and timeout errors as retry_later.
5be54c3 sess:44028ed3
Bug Rate limit errors (429/RESOURCE_EXHAUSTED) tripping circuit breaker
Gemini quota errors treated as permanent failures, triggering circuit breaker and pausing jobs.
Treat 429/RESOURCE_EXHAUSTED as retry_later, not permanent failure.
2b47ab4 sess:d8a7f79f sess:f2637de6
Bug VIC cookie exchange was sync in async handler
_load_cookies() used sync httpx.get() inside async handler, blocking the event loop.
Made _load_cookies() async with httpx.AsyncClient.
dfdfff9
Convention Direct "naked" fetches got IP blocked
HTTP requests without proxy exposed real IP. Got blocked on VIC, YouTube, and scraping targets.
Always proxy by default. BRIGHTDATA_PRIMARY_PROXY. Whitelist only localhost, Finnhub, LLM providers.
e49ae5d
Gotcha VIC cookie has two layers (remember + session)
VIC uses a persistent remember_web_* cookie (in Chrome's DB) and an in-memory vic_session cookie (NOT in Chrome's DB). Must exchange via HTTP request.
Cookie fallback chain: file override → browser remember_web_* → HTTP exchange → cache to file.
4f168c6 865f6a3
Gotcha yt-dlp auth errors vs warnings are indistinguishable by exit code
yt-dlp exits non-zero for both fatal auth errors (bot detection) and non-fatal warnings (n-challenge). Must parse stderr text.
"Sign in to confirm"/"bot" → BotDetectedError. Non-zero + stdout has content → debug only.
52f8b25 sess:f5474c12
Gotcha Retry + reprocess didn't unpause paused jobs
Circuit breaker paused job. "Retry Failed" reset items to pending — but job stayed paused. Items queued, nothing processed.
Retry and reprocess buttons auto-unpause the job.
ab16faa 281187a sess:44028ed3
🔧

Stage Design & Pipeline

10 issues
Bug Extract stage ran before fetch
Stage dependencies weren't inferred from YAML ordering. Extract ran before fetch completed.
Explicit stage_deps in jobs.yaml. Also added inference from list order as fallback.
05e3cfd sess:cd65b4a5
Design Monolithic transcript_and_meta stage
Single stage did metadata + captions + whisper — expensive, hard to retry, no partial progress.
Split into 4 discrete stages: meta (fast, free) → score (LLM) → captions (free) → whisper (conditional).
39df61b sess:f5474c12
Design Stage order: transcript before scoring
Downloaded every video before knowing if interesting. Wasted bandwidth and yt-dlp rate limit budget.
Reordered: meta → score → captions → whisper. Score first (cheap), download only interesting content.
04fbcc7 sess:cd65b4a5 sess:3cb39b8d
Design Transcript before metadata validation
Time spent on transcription before knowing if video was valid or relevant.
Metadata first (free, fast), transcript second (expensive).
3b93799
Design Groq primary vs YouTube captions — three changes in one day
YouTube captions only → Groq primary → YouTube primary + Groq fallback. Lacked upfront cost/quality analysis.
YouTube captions are free. Groq ($0.04/hr) only when captions unavailable. Track transcript_source.
ae6e6eb 5c51d13
Design Audio files accumulated on disk (10GB+)
Every transcribed video left its MP3/M4A. Hundreds of videos = 10+ GB of orphaned audio.
keep_audio=False by default. Auto-detect: _job_has_stage(job, "diarize") keeps audio.
b6d96c2
Design Enrichment coupled to fetch stage
VIC LLM sanity check was inside fetch stage. Couldn't rerun enrichment without re-downloading all 28K+ pages.
Separate enrich stage. Reads cached raw_html from DB. Independently rerunnable.
Design Two jobs with 93% discovery overlap
pltr_interviews_2025 duplicated pltr_content_processing discovery. Almost entirely redundant work.
Fold into single pipeline. Add audio + diarize stages to the unified job.
376cf21
Design IR page crawl: garbage URLs from parked domains
IR page crawl strategy hit 15s+ timeouts on parked domains. Produced garbage URLs that wasted downstream stages.
Remove IR page crawl. Strategy chain reduced to: ir_knowledge cache → YouTube → LLM search → unknown.
73f97dc
Design Manual subs not preferred over auto-generated
Auto-generated YouTube captions used by default. Manual subs (professional quality) available but not preferred.
Prefer manual subs; rename to .manual.vtt for distinction.
3fcc29f
📦

Import & Startup Issues

5 issues
Bug litellm concurrent import race
Multiple async stage workers imported litellm concurrently on first use. litellm's __init__ isn't thread-safe — AttributeError and corrupted module state.
Pre-import litellm in cli() before asyncio.run().
eb004d1 sess:dccd4a54
Bug lib.llm concurrent import partially initialized
cannot import call_llm from partially initialized module lib.llm — multiple workers hit cold start simultaneously.
Eagerly import lib.llm in runner.py before spawning workers.
cc69dce sess:44028ed3
Bug Lazy import ↔ circular import ping-pong
Lazy-import litellm to avoid circular → caused concurrent import race. Revert to top-level → brought back circular. Two fixes fighting each other.
Top-level import + pre-import before async loop.
fc24bec eb004d1 sess:dccd4a54
Gotcha Pointless ImportError catch on required dependency
try: import litellm except ImportError: pass — litellm is required. Handler silently does nothing if missing.
Drop try/except. Required deps fail immediately with clear ImportError.
ef3432b sess:dccd4a54 sess:dfafcc94
Convention Cross-handler imports create hidden dependencies
Handler A importing from handler B: if B breaks, A silently breaks too.
Convention: handlers never import from each other. Shared logic → jobs/lib/.
📊

Dashboard & UI Bugs

7 issues
Bug Double-refresh: high CPU + broken clicks
Passing callable to overview_table — Gradio auto-polled independently of timer.tick. Duplicate refreshes reset selection state.
Pass static value, not callable. Let only timer.tick trigger refreshes.
8784413 sess:5a284f31 sess:dccd4a54
Bug Heartbeat not cleared on runner exit
Dashboard showed "▶ running" when runner had already stopped.
Clear heartbeat file in runner's finally/exit handler.
6aa8678 sess:09b494d0
Bug Moltbook handler signature mismatch
Handler used positional args. Runner called with keyword-only args + job param. TypeError at runtime.
All handlers: async def process_stage(*, item_key, data, stage, job).
e9c038b
Gotcha Job kind misclassification
DML and Healthy Gamer marked as monitor but they're historical backfills. Wrong dashboard tab + wrong metrics.
Changed to kind: backfill. Rule: finite catalog = backfill.
42ec148
Bug p50 showing "—" for fast stages
0.0 elapsed seconds is falsy in Python. Stages completing in <1ms treated as missing data.
Check explicitly for None instead of truthiness.
c22cd9a
Bug Status column expanding despite max-width
auto table layout ignores width constraints on columns.
Add table-layout: fixed.
e5b6536
Bug Startup message printed 3x in debug mode
Multiple initialization paths each printing the startup banner.
Guard with flag.
e6523d4
🔍

Discovery & Content Quality

9 issues
Bug Commentary videos matched as earnings calls
YouTube search returned commentary, reaction, analysis videos — not actual earnings calls. Multi-ticker roundup videos also matched.
Strict title matching: require $SYM AND "earnings". Skip words: shock, breakdown, reaction, crash, moon, squeeze.
ac41e6d sess:5f2c6648
Bug Short tickers matched as substrings
LOW, CAT, GE matched as common English words. "How LOW can it go?" matched for Lowe's.
Short tickers (≤3 chars) require $SYM prefix or company name.
ac41e6d
Bug $AA matched $AAL, T matched TXN
Ticker substring matching contaminated 6 of 320 URLs. $AA matched $AAL, INTE matched INTC.
Regex word-boundary patterns for ticker matching.
2117342 sess:44028ed3 sess:cd65b4a5
Bug January earnings = Q4 prior year
January 2026 earnings tagged as Q1 2026 instead of Q4 2025.
Fixed quarter derivation: Jan earnings = Q4 prev year.
3cc6d25
Bug Sub-entity names polluted discovery
"Samsung Electronics (Foundry Division)" appeared as separate entity. Duplicate research.
Filter with NOT LIKE '% (%'.
b3f5e8e sess:26196664 sess:9765dc16
Bug Entity dedup: 1639 → 987 companies
Exact-name-only upsert with no normalization. 28 Samsung entries, duplicates everywhere in supplychain DB. 9663 → 3727 relationships after dedup.
Entity resolution on insert (resolve_company + add_company).
403500b sess:26196664 sess:b0019aeb
Bug Moltbook used nonexistent API endpoint
/posts/trending didn't exist. Zero items discovered. Job appeared healthy (no errors).
Use correct endpoint: /posts?sort=trending|new. Verify API endpoints before wiring.
e9c038b sess:5a284f31 sess:b0019aeb
Design No LLM sanity check on scraped content
HTML passing structural checks could still be wrong content — rate-limit pages, error pages, different company's page.
Cheap LLM sanity check (flash-lite, ~$0.0001/call). "Is this a real investment thesis for {company}?"
Gotcha Base64 images inflate HTML for LLM calls
Base64-encoded images (100KB+ each) in HTML wasted tokens on image data with zero text analysis value.
Strip data:image/[^;]+;base64,[A-Za-z0-9+/=]+ before LLM. Cap at 50K chars.
🏗️

Architecture Evolution

8 issues
Design File-per-item storage doesn't scale
One directory per item with metadata.json. At 1000+ items: thousands of small files, git tracking painful, querying requires globbing.
Results table in SQLite. Large raw content in per-job DB. Files only for large binaries (VTT, parquet, audio).
a44dfd0 sess:5a284f31 sess:dccd4a54
Design Consolidate 1145 HTML files into SQLite
VIC ideas stored as individual HTML files. 1145 files → 1 LFS-tracked DB.
Handler reads/writes raw_html column in single SQLite DB.
5b7130d sess:72a2aa12 sess:01abe80c
Design No cost tracking for LLM stages
A bug in the prompt could silently burn through API budget before anyone noticed.
Handlers return _cost. Runner logs to job_cost_log. Guard checks daily cost, auto-pauses.
6dc0a82 sess:13089c42 sess:1ce49014
Design Sequential stage iteration → async workers
Each stage ran one item at a time, completing one item through all stages before starting next.
Independent async worker per stage with asyncio.Semaphore for concurrency control.
08b76a2
Design Single-stage pipeline → multi-stage
All processing in one process() call. No partial progress, no retry per stage.
Multi-stage pipeline with stages JSON column. Each stage progresses independently.
d10124d
Design No output versioning
Code changes didn't trigger reprocessing of already-done items. Stale results silently persisted.
Hash handler source per stage. Detect stale items. "Reprocess Stale" button.
cd6f06b sess:dccd4a54 sess:dfafcc94
Perf IRS ZIP: downloaded 300MB–3.5GB for few KB of XML
Full ZIP downloads to extract individual XML files. Massive bandwidth waste.
HTTP range requests: three small requests to surgically extract XML. Zero disk usage.
bde166a
Design Round-based discover → process → sleep loop
Sequential rounds with full-round sleeps. Items waited for round completion before advancing.
Persistent async workers per stage. Items flow through pipeline continuously with 1–5s backoff.
6dc0a82 sess:13089c42
🤖

LLM Integration & Conventions

6 issues
Bug max_tokens truncated JSON responses
max_tokens of 4000/8000 too low for JSON. Truncated mid-response, broke parsing.
Remove max_tokens entirely. Use cheaper model if cost concern, not token ceiling.
c22cd9a
Convention Native web search doesn't scale across providers
native_web_search=True gives inconsistent behavior across providers. No control over search queries.
Use tools=["web_search"] (Serper-based) for consistent cross-provider web search.
c22cd9a
Convention temperature=0 for structured/JSON extraction
Default temperature introduced non-determinism in factual extraction tasks.
Always use temperature=0 for extraction. Default None for generation.
c22cd9a
Convention Structured logging via get_handler_logger
f-string log messages not machine-parseable. No structured querying of log data.
Handlers use log = get_handler_logger("name"). Pass domain data as kwargs.
0e4ceca sess:01abe80c sess:72a2aa12
Convention Raw data first: cache artifacts for re-extraction
Without raw caching, schema changes or bug fixes required re-fetching all content (rate-limited, deleted, paywalled).
Fetch stores raw artifacts (HTML, JSON, audio) → Extract parses from cache. Enables schema evolution.
0a24f15
Convention Documents in system prompt with XML tags for caching
Documents in user prompt get re-sent every call, no prompt caching benefit.
Place document in system prompt as <document>{text}</document>. User prompt contains only the extraction instruction (varies). Document gets prompt-cached.