# New Job Checklist

> **Audience**: LLMs and developers building new jobs in the rivus/jobs system.
> **Source**: Distilled from 32 real bugs and design issues across 80+ commits (Jan–Feb 2026).
> See `jobs/docs/issues-extraction.html` for the full annotated issue catalog.

---

## 1. Silent Failures — The #1 Killer

These bugs are the hardest to find because everything *looks* fine.

### MUST: Never mark empty results as done

If a stage gets empty/null data from an external service, **raise `RetryLaterError`**, don't return success. Empty results marked as `done` are permanently lost from the queue.

```python
# BAD — silently loses the item
if not data:
    return {}

# GOOD — item stays in queue for retry
if not data:
    raise RetryLaterError("No data returned from service")
```

### MUST: Never use INSERT OR REPLACE in SQLite

`INSERT OR REPLACE` does DELETE + INSERT. Columns not in the INSERT column list are **destroyed**. This silently wipes data from earlier stages (e.g., fetch stored `raw_html`, then extract's INSERT OR REPLACE deleted it).

```sql
-- BAD — destroys columns not in the list
INSERT OR REPLACE INTO items (id, score) VALUES (?, ?)

-- GOOD — preserves existing columns
INSERT INTO items (id, score) VALUES (?, ?)
ON CONFLICT(id) DO UPDATE SET score=excluded.score

-- ALSO GOOD — explicit update
UPDATE items SET score=? WHERE id=?
```

### MUST: Validate content, not just HTTP status

A 200 OK with an error page, rate-limit page, or wrong content is **worse** than a clear failure. Always validate what you got:

- **Text detection**: Search for error strings ("access limits", "Please wait", "Sign in")
- **Structure validation**: Required DOM elements present (`#description`, `.idea_name`, etc.)
- **LLM sanity check**: For semi-structured content, a flash-lite call (~$0.0001) catches semantic failures that structural checks miss

```python
# BAD — trusts that 200 means valid content
if response.status_code == 200:
    return {"raw_html": response.text}

# GOOD — validates content quality
html = response.text
if any(s in html for s in RATE_LIMIT_STRINGS):
    raise RetryLaterError("Rate-limit page detected")
if not has_required_structure(html):
    raise RetryLaterError("Missing required page elements")
```

### MUST: Check for duplicate content

External services may serve the same page for different URLs (expired cookies, rate limiting). Detect duplicates by comparing content length or hash across items.

```python
# Track content hashes — flag when >5 items share identical content
content_hash = hashlib.sha256(html.encode()).hexdigest()
seen_hashes[content_hash] = seen_hashes.get(content_hash, 0) + 1
quality_flags = []
if seen_hashes[content_hash] > 5:
    quality_flags.append("possible_dupe_html")
```

### MUST: Log exceptions from asyncio.gather

`asyncio.gather(return_exceptions=True)` silently eats exceptions. Always check results:

```python
results = await asyncio.gather(*tasks, return_exceptions=True)
for i, result in enumerate(results):
    if isinstance(result, BaseException):
        log.error("task failed", task_index=i, error=str(result))
```

### MUST: Handle async/sync mismatch

If a dependency becomes async, every caller in the chain must be async too. A sync call to an async function returns a coroutine object — it won't error, it just gives you garbage.

```python
# BAD — returns coroutine object, not data
data = fetch_data(url)  # fetch_data is now async

# GOOD
data = await fetch_data(url)
```

### SHOULD: Handle tools that return non-zero with valid output

Some CLI tools (yt-dlp, ffmpeg) return non-zero exit codes for warnings while stdout has perfectly valid content. Check stdout before assuming failure.

```python
# BAD — throws away valid data
if proc.returncode != 0:
    raise RuntimeError("command failed")

# GOOD — check if output is usable
if proc.returncode != 0:
    if proc.stdout.strip():
        log.debug("non-zero exit but output present", rc=proc.returncode)
    else:
        raise RuntimeError(f"command failed: {proc.stderr}")
```

---

## 2. Rate Limiting, Auth & Bot Detection

Every external fetch will eventually get rate-limited or blocked.

### MUST: Proxy all external HTTP requests by default

Direct "naked" fetches expose your real IP. Getting blocked requires manual debugging and wastes time. Use `BRIGHTDATA_PRIMARY_PROXY` by default. Only whitelist localhost, Finnhub, and LLM providers.

```python
# BAD — direct fetch, will get blocked eventually
response = await client.get(url)

# GOOD — always proxy external requests (httpx takes the proxy on the client, not per request)
proxy = os.environ.get("BRIGHTDATA_PRIMARY_PROXY")
if not proxy:
    log.warning("no proxy configured — direct fetch", url=url)
async with httpx.AsyncClient(proxy=proxy) as client:  # proxy=None falls back to direct
    response = await client.get(url)
```

### MUST: Classify transient errors as retry, not failure

SSL errors, timeouts, and connection resets are **transient**. Marking them as permanent `failed` loses items forever.

```python
# Error classification
TRANSIENT_ERRORS = (
    asyncio.TimeoutError,
    ssl.SSLError,
    httpx.ConnectError,
    httpx.ReadTimeout,
    ConnectionResetError,
)

try:
    result = await fetch(url)
except TRANSIENT_ERRORS:
    raise RetryLaterError("transient network error")
except Exception:
    raise  # permanent failure — item marked failed
```

### MUST: Use async HTTP in async handlers

Sync HTTP calls (`httpx.get()`, `requests.get()`) inside async handlers **block the event loop**. With concurrent stage workers, this freezes all other stages.

```python
# BAD — blocks event loop
response = httpx.get(url)

# GOOD — non-blocking
async with httpx.AsyncClient() as client:
    response = await client.get(url)
```

### SHOULD: Understand multi-layer auth (cookies, sessions, tokens)

Some services have layered auth. Example: VIC uses a persistent browser cookie (`remember_web_*`) that must be exchanged via HTTP for a session cookie. Using only the browser cookie doesn't work.

**Document the auth flow** for each external service in the handler's docstring.

### SHOULD: Detect bot detection vs auth errors

Different failure modes require different responses:
- **Bot detection** ("Sign in to confirm") → trigger circuit breaker, switch proxy
- **Auth expired** → refresh tokens/cookies
- **Rate limit** → backoff and retry

Parse error messages/pages to classify correctly.
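The triage above can be sketched as a small classifier over the error page text. The marker strings below are illustrative placeholders, not the real detection strings — tune them per service:

```python
# Hypothetical marker strings — the classification pattern is the point, not the exact text.
BOT_DETECTION_MARKERS = ("sign in to confirm", "unusual traffic")
AUTH_EXPIRED_MARKERS = ("session expired", "please log in")
RATE_LIMIT_MARKERS = ("too many requests", "access limits", "please wait")


def classify_failure(page_text: str) -> str:
    """Map an error page/message to a response strategy."""
    text = page_text.lower()
    if any(m in text for m in BOT_DETECTION_MARKERS):
        return "circuit_break"   # pause job, switch proxy
    if any(m in text for m in AUTH_EXPIRED_MARKERS):
        return "refresh_auth"    # refresh tokens/cookies, then retry
    if any(m in text for m in RATE_LIMIT_MARKERS):
        return "backoff_retry"   # RetryLaterError with backoff
    return "backoff_retry"       # default: treat unknowns as transient
```

The caller would map `circuit_break` to a job pause, `refresh_auth` to a cookie/token refresh, and `backoff_retry` to `RetryLaterError`.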

---

## 3. Stage Design

### MUST: Declare stage dependencies explicitly

Don't rely on YAML ordering alone. Use `stage_deps` in jobs.yaml:

```yaml
stage_deps:
  extract: [fetch]
  check_enrich: [extract]
  score: [check_enrich]
```

### MUST: Order stages by cost — cheap checks first

Run cheap checks (scoring, filtering) before expensive operations (downloading, transcribing). Don't download every video before knowing if it's interesting.

```
GOOD: meta → score → captions → whisper (score first, download only if interesting)
BAD:  download → transcribe → score (downloads everything, wastes bandwidth)
```

### MUST: Keep stages independently rerunnable

Every stage reads from cached artifacts, not live sources. This enables prompt iteration, bug fixes, and schema evolution without re-fetching.

```
fetch         → stores raw HTML/JSON/audio (cache layer)
extract       → reads from cache, writes structured data
check_enrich  → validates content + adds LLM-derived insights
```

**Rerun checklist per stage**:
- [ ] Reads from DB/disk, never from external source
- [ ] Idempotent writes (UPDATE, not INSERT)
- [ ] No side effects on earlier stages' data
- [ ] `--reprocess-stage {name}` works for both new and existing items

### MUST: Separate check_enrich from fetch

LLM analysis (content validation + enrichment) must be a separate stage from data fetching. Coupling them means you can't iterate on prompts without re-downloading everything.

### SHOULD: Clean up temporary files

Audio files, temp downloads, intermediate artifacts — delete them after processing unless a downstream stage needs them. Use `keep_audio=False` pattern with auto-detection:

```python
keep = _job_has_stage(job, "diarize")  # only keep if needed downstream
if not keep:
    os.unlink(audio_path)
```

### SHOULD: Use check_enrich as a diagnostic stage (self-healing pattern)

When a pipeline has both code parsing (extract) and LLM analysis (check_enrich), the LLM stage can **diagnose parser failures** — not just validate data quality. This creates a self-healing feedback loop:

1. **Extract** (code parser) handles the 95% case — fast, free, deterministic
2. **check_enrich** (LLM) validates extract output against raw content, reports discrepancies
3. **Discrepancies** reveal whether the issue is bad source data or a parser bug
4. **Fix parser** → rerun extract → rerun check_enrich → discrepancies drop

```python
# In check_enrich: compare LLM findings with extract output
discrepancies = []
if llm_symbol and not extract_symbol:
    discrepancies.append({"field": "symbol", "extract": "", "llm": llm_symbol, "severity": "high"})
return {
    "content_ok": True,
    "thesis_type": "value",
    "_discrepancies": discrepancies,  # structured parser health data
}
```

The LLM is the **observer, not the bandaid** — it surfaces where code fails without hiding bugs. See `jobs/docs/architecture.md` "Self-Healing Pipeline Pattern" for the full pattern.

### SHOULD: Analyze cost vs quality tradeoffs upfront

Before building: which data sources are free (YouTube captions) vs paid (Groq transcription)? Document the fallback chain and track `source` in results for quality comparison.

---

## 4. Import & Startup

### MUST: Don't try/except ImportError on required dependencies

If a module is required, let it fail immediately with a clear error. Silent `except ImportError: pass` means the handler does nothing and nobody knows why.

```python
# BAD — required dep silently missing
try:
    import litellm
except ImportError:
    pass

# GOOD — fail fast
import litellm  # required — fails at import time if missing
```

### MUST: Pre-import heavy libraries before the async loop

Libraries like litellm aren't thread-safe for concurrent first-import. Import them in `cli()` before `asyncio.run()`:

```python
def cli():
    import litellm  # pre-import: single-threaded, no race
    asyncio.run(main())
```

### MUST: No cross-handler imports

Handler A importing from handler B creates hidden dependency chains. If B's imports break, A breaks silently too. Shared logic goes in `jobs/lib/`.

```python
# BAD — handler importing from another handler
from jobs.handlers.vic_ideas import _strip_base64

# GOOD — shared utility in lib
from jobs.lib.html_utils import strip_base64
```

---

## 5. Data Integrity

### MUST: Use context managers for SQLite connections

Never manually open/close connections. Gradio hot-reload and concurrent handlers cause "closed database" errors.

```python
# BAD
conn = open_raw_db()
result = conn.execute(query)
conn.close()

# GOOD
with closing(open_raw_db()) as conn:
    result = conn.execute(query)
```

### MUST: Reset stuck in_progress items on startup

If the runner crashes, items left `in_progress` are permanently stuck — not pending, not failed, not done. Reset them on startup:

```python
# In runner startup
conn.execute(
    "UPDATE work_items SET status='pending' WHERE status='in_progress' AND job_id=?",
    (job_id,)
)
```

### MUST: Store raw data as compressed files for large jobs (1000+ items)

Compressed files on local disk, gitignored, backed up externally — not thousands of loose files or DB blobs:

```
jobs/data/{job_name}/html/{id}.html.xz   # compressed, gitignored
jobs/data/{job_name}/{job_name}.db       # structured data only
```
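A minimal read/write pair for this layout — helper names are hypothetical, but the `.xz` path scheme matches the tree above:

```python
import lzma
from pathlib import Path


def store_raw_html(job_name: str, item_id: str, html: str) -> Path:
    """Write raw HTML as a compressed .xz file under the job's data dir."""
    path = Path("jobs/data") / job_name / "html" / f"{item_id}.html.xz"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(lzma.compress(html.encode("utf-8")))
    return path


def load_raw_html(job_name: str, item_id: str) -> str:
    """Read a stored page back for reprocessing without re-fetching."""
    path = Path("jobs/data") / job_name / "html" / f"{item_id}.html.xz"
    return lzma.decompress(path.read_bytes()).decode("utf-8")
```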

### SHOULD: Strip base64 images before LLM calls

Embedded base64 images (100KB+) waste tokens. Strip them and cap input:

```python
html = re.sub(r'data:image/[^;]+;base64,[A-Za-z0-9+/=]+', '[img-removed]', html)
html = html[:50_000]  # cap for LLM context
```

---

## 6. Discovery & Content Quality

### MUST: Verify API endpoints exist before wiring up discovery

Don't assume an endpoint exists — test it. A nonexistent endpoint produces zero items with zero errors, so the job looks healthy but discovers nothing.

### MUST: Handle substring ticker matches

Short tickers (LOW, CAT, GE) match as common English words. Require `$SYM` prefix or company name for tickers ≤3 chars.

### MUST: Get date/quarter derivation right

January earnings are Q4 of the **previous** year, not Q1 of the current year. Watch for off-by-one quarter errors.
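Assuming a calendar fiscal year and a roughly three-month reporting lag, the derivation can be sketched as:

```python
def reported_quarter(report_year: int, report_month: int) -> tuple[int, int]:
    """Return (year, quarter) being reported at a given report date.

    Earnings lag the quarter they cover: a January report covers Q4 of the
    previous year, not Q1 of the current one. Assumes a calendar fiscal
    year and a ~3-month lag — adjust for companies with offset fiscal years.
    """
    covered_month = report_month - 3  # results cover the quarter ~3 months back
    if covered_month <= 0:
        return report_year - 1, (covered_month + 12 - 1) // 3 + 1
    return report_year, (covered_month - 1) // 3 + 1
```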

### SHOULD: Filter sub-entity names from discovery

"Samsung Electronics (Foundry Division)" is not a separate entity from "Samsung Electronics". Filter names with `NOT LIKE '% (%'`.

### SHOULD: Validate search result relevance

YouTube searches for "AAPL earnings Q4 2025" return commentary videos, not actual earnings calls. Use strict title matching + skip lists for noise words (shock, breakdown, reaction, crash, etc.).
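One possible shape for the filter — the noise list below is a sample, not the full skip list:

```python
# Illustrative skip list — commentary videos, not actual earnings calls
NOISE_WORDS = {"shock", "breakdown", "reaction", "crash"}


def looks_like_earnings_call(title: str, symbol: str, quarter: str) -> bool:
    """Strict title filter: require symbol + quarter, reject commentary noise."""
    t = title.lower()
    if any(w in t.split() for w in NOISE_WORDS):
        return False
    return symbol.lower() in t and quarter.lower() in t
```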

---

## 7. Dashboard & Runner

### MUST: Match handler signature to runner expectations

All handlers must use keyword-only args:

```python
async def process_stage(*, item_key: str, data: dict, stage: str, job) -> dict | None:
```

### MUST: Clear heartbeat files on runner exit

Stale heartbeat files make the dashboard show "running" when the runner has stopped. Clear in the finally/exit handler.
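A sketch, assuming a file-based heartbeat at an illustrative path:

```python
import atexit
from pathlib import Path

HEARTBEAT = Path("jobs/data/runner.heartbeat")  # illustrative path


def clear_heartbeat() -> None:
    HEARTBEAT.unlink(missing_ok=True)  # idempotent: no error if already gone


# Registered cleanup catches normal exits; also call from the finally
# block around the run loop so crashes inside it are covered too.
atexit.register(clear_heartbeat)
```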

### MUST: Use correct `kind` classification

If a job has a finite catalog, it's `backfill`. If it monitors for new items, it's `monitor`. Wrong classification = wrong dashboard tab + wrong metrics.

### MUST: Don't pass callables to Gradio components for data

Pass static values and let timers trigger refreshes. Passing callables causes double-refresh, breaks click handlers, and spikes CPU.

### SHOULD: Retry/reprocess should auto-unpause

When a circuit breaker pauses a job, retry and reprocess buttons should unpause automatically. Otherwise items get queued but nothing processes.

---

## 8. LLM Integration

### MUST: No `max_tokens` on structured output calls

Truncated responses break JSON parsing. Let the model finish naturally. Use a cheaper model if cost is a concern, not a token ceiling.

### MUST: Use `temperature=0` for extraction

Structured/JSON extraction needs determinism. No creativity needed for factual extraction.

### MUST: Put documents in system prompt for caching

Place the document being analyzed in the system prompt with XML tags. The user prompt contains only the extraction instruction. This way the document gets prompt-cached across repeated calls.

```python
system = f"You analyze documents.\n<document>\n{text}\n</document>"
user = "Extract the thesis summary, sector, and quality score as JSON."
```

### SHOULD: Return `_cost` from LLM stages

Track costs so the guard checker can auto-pause when daily limits are exceeded:

```python
return {
    "score": 42,
    "_cost": response.usage.total_cost,
}
```

---

## 9. Error Handling & Observability

### MUST: Use structured logging, not f-strings

```python
# BAD — not machine-parseable
log.info(f"[score] {item_key}: {score} | {source} | {title}")

# GOOD — structured kwargs (queryable in JSONL)
log.info("scored", score=42, source="youtube", title=title)
```

### MUST: Circuit breaker for repeated failures

Track consecutive failures per stage. After 3 (configurable), auto-pause the job with an error message visible in the dashboard.
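A minimal per-stage breaker might look like this (class and method names are illustrative):

```python
class CircuitBreaker:
    """Count consecutive failures per stage; trip after a threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures: dict[str, int] = {}

    def record_success(self, stage: str) -> None:
        self.failures[stage] = 0  # any success resets the streak

    def record_failure(self, stage: str) -> bool:
        """Return True when the job should be auto-paused."""
        self.failures[stage] = self.failures.get(stage, 0) + 1
        return self.failures[stage] >= self.threshold
```

When `record_failure` returns True, the runner would pause the job and surface the last error message in the dashboard.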

### SHOULD: Declare VERSION_DEPS for stages with imported dependencies

If a stage calls imported functions whose output matters (parsers, prompts), declare them in `VERSION_DEPS` so `stage_version_hash` includes them. Without this, changing an imported parser won't mark items as stale.

```python
from projects.vic.parse import parse_vic_page

VERSION_DEPS = {
    "extract": [parse_vic_page],                                    # function → hashed via inspect.getsource
    "check_enrich": [_CHECK_ENRICH_SYSTEM, _CHECK_ENRICH_PROMPT],   # strings → hashed directly
}
```

When in doubt, declare the dep — a false stale is cheap (just skip reprocessing), a missed stale is a silent bug.

### SHOULD: Log at the user's mental model

- **info**: Domain-relevant attributes (title, date, duration, score)
- **debug**: Implementation details (byte ranges, VTT timestamps, cue counts)
- **warning**: Quality flags, fallbacks used
- **error**: Failures that need attention

---

## Quick Reference Card

```
Building a new job? Run through this:

[ ] Handler signature: async def process_stage(*, item_key, data, stage, job)
[ ] Empty results → RetryLaterError (never mark empty as done)
[ ] SSL/timeout → RetryLaterError (never mark transient as failed)
[ ] No INSERT OR REPLACE (use ON CONFLICT DO UPDATE)
[ ] Context managers for all SQLite connections
[ ] Proxy all external HTTP (BRIGHTDATA_PRIMARY_PROXY)
[ ] Async HTTP only in async handlers (no httpx.get/requests.get)
[ ] No cross-handler imports (shared code → jobs/lib/)
[ ] Required deps: top-level import, no try/except
[ ] Heavy libs: pre-import before asyncio.run()
[ ] Stage deps explicit in jobs.yaml
[ ] Cheap stages before expensive stages
[ ] Each stage reads cached data, not live sources
[ ] check_enrich separate from fetch
[ ] VERSION_DEPS declared for stages with imported deps (parser, prompts)
[ ] Content validation (not just HTTP status)
[ ] Duplicate content detection
[ ] LLM calls: temp=0, no max_tokens, document in system prompt
[ ] Structured logging (kwargs, not f-strings)
[ ] Return _cost from LLM stages
[ ] kind: backfill vs monitor correctly set
[ ] API endpoints verified before wiring discovery
[ ] Base64 images stripped before LLM calls
[ ] Heartbeat cleared on runner exit
[ ] Stuck in_progress reset on startup
```
