# Jobs System — Overview & Feature Comparison

> Consolidated reference for the rivus jobs framework: architecture, capabilities, and how it compares to mainstream alternatives.

---

## What It Is

A **content monitoring and research pipeline** built on SQLite + asyncio. It discovers items from external sources (YouTube channels, financial APIs, web pages, social platforms), processes them through configurable multi-stage pipelines, and stores structured results locally. Think of it as a personal research automation system — part RSS reader, part ETL pipeline, part monitoring platform.

**Scale**: ~15 active jobs, 17+ discovery strategies, processing thousands of items across earnings calls, YouTube channels, company research, supply chain graphs, and LLM benchmark evaluations.

---

## Architecture at a Glance

```
╔════════════════════════════════════════════════════╗
║                     jobs.yaml                      ║
║  definitions: discovery, stages, handlers, pacing  ║
╚═════════════════════════╤══════════════════════════╝
                          │ hot-reload (mtime + SIGHUP)
                          ▼
╔════════════════════════════════════════════════════╗
║                     runner.py                      ║
║                                                    ║
║  per job:                                          ║
║    ├─ discovery_task   (timer, polls sources)      ║
║    ├─ stage_worker ×N  (persistent, 1-5s backoff)  ║
║    └─ guard_checker    (60s: cost, validate)       ║
║                                                    ║
║  shared:                                           ║
║    ├─ ResourceRegistry (cross-job semaphores)      ║
║    ├─ config watcher   (hot-reload jobs.yaml)      ║
║    └─ singleton lock   (heartbeat file)            ║
╚═══════════╤═══════════════╤═══════════╤════════════╝
            │               │           │
            ▼               ▼           ▼
    ╭────────────╮  ╭────────────╮  ╭────────────╮
    │ tracker    │  │ doctor     │  │ pacer      │
    │ (SQLite)   │  │ (LLM err   │  │ (token     │
    │            │  │  classify) │  │  bucket)   │
    ╰────────────╯  ╰────────────╯  ╰────────────╯
```

**Data flow**: Discovery Strategy → work_items (SQLite) → stage workers → results table + disk artifacts

---

## Core Features

### 1. YAML-Defined Multi-Stage Pipelines

Every job declares its processing pipeline as ordered stages with per-stage concurrency:

```yaml
stages:
  - name: fetch        # concurrency: 1 (default)
  - name: extract
    concurrency: 5     # 5 items in parallel
  - name: score
  - name: qa
```

Stages run as independent async workers. Items completing stage N are visible to stage N+1 within ≤5 seconds. No round-based polling — continuous flow.

**Stage dependency graphs** override linear ordering when needed:
```yaml
stage_deps:
  transcript: [ir]
  diarize: [transcript]
  chart: [transcript]      # chart + diarize run in parallel after transcript
```
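The readiness rule implied by `stage_deps` can be sketched in a few lines — an item may enter a stage once all of that stage's declared dependencies have completed. Names here (`STAGE_DEPS`, `is_ready`) are illustrative, not the runner's actual API:

```python
# Illustrative sketch of stage_deps gating: a stage is eligible for an item
# once every declared dependency has finished for that item.
STAGE_DEPS = {
    "transcript": ["ir"],
    "diarize": ["transcript"],
    "chart": ["transcript"],  # chart and diarize unlock together
}

def is_ready(stage: str, done_stages: set) -> bool:
    """True if every declared dependency of `stage` has completed."""
    return all(dep in done_stages for dep in STAGE_DEPS.get(stage, []))

# After `transcript` finishes, both dependents become eligible in parallel:
done = {"ir", "transcript"}
assert is_ready("diarize", done) and is_ready("chart", done)
```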

### 2. Pluggable Discovery Strategies

17+ strategies, registered via decorator, covering diverse source types:

| Strategy          | Source                | Use case                                     |
|-------------------|-----------------------|----------------------------------------------|
| `finnhub_calendar`| Finnhub API           | Earnings calendar with market-cap priority    |
| `youtube_channel` | yt-dlp channel scrape | All videos from a channel                    |
| `serper_youtube`  | Serper API            | YouTube search, dedup by video_id            |
| `serper_search`   | Serper API            | Web/news search with backfill                |
| `tracker_query`   | Another job's SQLite  | Chain jobs — done items from A feed B        |
| `multi_source`    | Orchestrates others   | Combines strategies with dedup               |
| `manual`          | CLI / API             | One-off URL additions                        |

Discovery runs on a configurable timer (default 60s) and can be woken instantly via wake files.
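Decorator-based registration might look roughly like the sketch below. The registry shape and the `manual` strategy's signature are assumptions for illustration; the real implementations live in `jobs/lib/discovery.py`:

```python
# Hypothetical sketch of decorator-based strategy registration.
STRATEGIES = {}

def strategy(name: str):
    """Register a discovery strategy under `name`."""
    def decorator(fn):
        STRATEGIES[name] = fn
        return fn
    return decorator

@strategy("manual")
def manual(config: dict) -> list:
    # One-off URLs supplied via CLI/API become work items directly.
    return [{"id": url, "url": url} for url in config.get("urls", [])]

# The runner looks strategies up by the name given in jobs.yaml:
items = STRATEGIES["manual"]({"urls": ["https://example.com/a"]})
```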

### 3. LLM-Powered Error Intelligence ("Doctor")

Every stage exception is classified by an LLM (grok-fast, ~1-2s) into actionable categories:

| Error class     | Action      | Example                        |
|-----------------|-------------|--------------------------------|
| `transient`     | retry_later | Timeout, 429, connection reset |
| `item_specific` | fail_item   | Bad data for this item only    |
| `temporal`      | pause_job   | Market closed, outside hours   |
| `systemic`      | pause_job   | Auth expired, service down     |
| `code_bug`      | pause_job   | KeyError, AttributeError       |

**Circuit breaker** escalates based on consecutive error counts (e.g., 3 systemic in a row → pause). Counts reset on success. All actions logged to `doctor_actions` table with Pushover notifications for medium/high risk.
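The per-class escalation logic can be sketched as follows — thresholds and names are assumptions, not the real `doctor.py` values:

```python
# Minimal circuit-breaker sketch: consecutive errors of one class escalate to
# an action; any success resets the counts. Thresholds are illustrative.
from collections import defaultdict

THRESHOLDS = {"systemic": 3, "code_bug": 1, "transient": 10}

class CircuitBreaker:
    def __init__(self):
        self.consecutive = defaultdict(int)

    def record_error(self, error_class: str) -> str:
        self.consecutive[error_class] += 1
        if self.consecutive[error_class] >= THRESHOLDS.get(error_class, 5):
            return "pause_job"
        return "retry_later"

    def record_success(self):
        self.consecutive.clear()  # counts reset on success

cb = CircuitBreaker()
cb.record_error("systemic")
cb.record_error("systemic")
action = cb.record_error("systemic")  # third systemic in a row escalates
```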

### 4. Output Version Tracking & Staleness Detection

Handlers declare `VERSION_DEPS` mapping stages to their dependencies (functions, prompts, configs). When any dependency changes, items processed by the old version show as "stale" in the dashboard with a one-click reprocess button.

```python
VERSION_DEPS = {
    "extract": [parse_vic_page],                    # imported parser
    "check_enrich": [_SYSTEM_PROMPT, str(LLM_CFG)], # prompt + model config
}
```

This enables **prompt iteration without re-fetching**: change a prompt → stale items appear → reprocess only the affected stage.
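A toy version of the source-hashing approach (mentioned in the tradeoffs table below as the current mechanism) shows how a prompt edit flips items to stale; the function name and hash details are illustrative:

```python
# Sketch of dependency-hash staleness detection: hash each declared dependency
# (function source or string value); a changed hash marks old results stale.
import hashlib
import inspect

def version_hash(deps: list) -> str:
    parts = []
    for dep in deps:
        if callable(dep):
            parts.append(inspect.getsource(dep))  # function body changed -> new hash
        else:
            parts.append(str(dep))  # prompts / config reprs hash as-is
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()[:12]

PROMPT_V1 = "Extract the thesis."
PROMPT_V2 = "Extract the thesis and catalysts."

# Items stored under the old hash no longer match -> shown as stale:
stale = version_hash([PROMPT_V1]) != version_hash([PROMPT_V2])
```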

### 5. Job Chaining via Pipelines

Jobs connect into multi-step pipelines using the `tracker_query` discovery strategy:

```
pltr_discovery (score content) ──tracker_query──▶ pltr_content_processing (fetch+extract)
     pipeline: pltr_deep_dive/1                        pipeline: pltr_deep_dive/2
                                                        min_score: 10
```

Pipeline context is shown in the dashboard. The downstream job only processes items that meet the quality threshold.
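The downstream half of such a chain might be configured roughly like this — field names are hypothetical, the real schema lives in `jobs.yaml`:

```yaml
# Hypothetical shape of a tracker_query discovery config (illustrative fields):
pltr_content_processing:
  pipeline: pltr_deep_dive/2
  discovery:
    strategy: tracker_query
    source_job: pltr_discovery   # read done items from this job's tracker
    min_score: 10                # only items meeting the quality threshold
```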

### 6. Safety Guards

Per-job configurable limits that auto-pause without data loss:

| Guard              | Behavior                                                |
|--------------------|---------------------------------------------------------|
| `max_pending: 200` | Skip discovery when queue is full (processing continues)|
| `daily_cost_limit` | Auto-pause when LLM spend exceeds threshold (resets at midnight PT) |
| `validate: true`   | Call handler's `validate()` after batches — custom health checks|
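The guard checker's decision logic reduces to a few comparisons; the sketch below is a simplification with assumed names, not the runner's actual code:

```python
# Sketch of a guard pass: decide what to do given job config and current state.
def check_guards(job: dict, pending: int, cost_today: float) -> list:
    actions = []
    limit = job.get("daily_cost_limit")
    if limit is not None and cost_today >= limit:
        actions.append("pause_job")       # auto-pause; spend resets at midnight PT
    if pending >= job.get("max_pending", 200):
        actions.append("skip_discovery")  # queue full; processing continues
    return actions

# Over budget -> pause; full queue -> only discovery stops:
assert check_guards({"daily_cost_limit": 5.0}, pending=10, cost_today=6.2) == ["pause_job"]
```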

### 7. Shared Resource Management

Cross-job semaphores prevent overloading shared external services:

```yaml
resources:
  youtube:           { concurrency: 4  }   # 4 yt-dlp calls total, across ALL jobs
  groq_transcribe:   { concurrency: 3  }
  deepgram_transcribe: { concurrency: 20 }
  local_diarize:     { concurrency: 2  }
```
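Under the hood this maps naturally onto one `asyncio.Semaphore` per named resource, shared by every job that uses it. Class and method names below are illustrative, not the real `ResourceRegistry` API:

```python
# Sketch of cross-job resource limiting with asyncio semaphores.
import asyncio

class ResourceRegistry:
    def __init__(self, limits: dict):
        # e.g. {"youtube": 4} caps concurrent yt-dlp calls at 4 across ALL jobs
        self._sems = {name: asyncio.Semaphore(n) for name, n in limits.items()}

    def limit(self, name: str) -> asyncio.Semaphore:
        return self._sems[name]

async def fetch_video(registry: ResourceRegistry, url: str) -> str:
    async with registry.limit("youtube"):  # waits while 4 calls are in flight
        await asyncio.sleep(0)             # placeholder for the real yt-dlp call
        return url

registry = ResourceRegistry({"youtube": 4, "local_diarize": 2})
result = asyncio.run(fetch_video(registry, "https://example.com/v"))
```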

### 8. Priority System

Lower number = processed first (Unix nice semantics). Priority flows from discovery through stages:

- Discovery sets initial priority (e.g., `-market_cap` so largest companies process first)
- Stage handlers can return `_priority` to update mid-pipeline (e.g., volatility score replaces cap-based priority)
- Retried items get `-1` (highest priority) to clear failures quickly
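The ordering itself is just an ascending sort on the priority value — the tracker serves pending items lowest first. Values below are made up for illustration:

```python
# Sketch of priority ordering: lowest value first, so negatives jump the queue.
pending = [
    {"id": "video-3", "priority": 50.0},
    {"id": "video-1", "priority": -1.0},  # retried item: cleared first
    {"id": "video-2", "priority": 10.0},
]
order = [it["id"] for it in sorted(pending, key=lambda it: it["priority"])]
```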

### 9. Self-Healing Pipeline Pattern

The `check_enrich` stage (LLM) acts as a diagnostic layer over `extract` (code parser):

```
extract (code, fast, free) → check_enrich (LLM, validates + finds discrepancies)
    ↑                              │
    └──── fix parser ◄─────────────┘  _discrepancies reveal parser bugs
```

The LLM is the **observer, not the band-aid** — it surfaces where code fails without hiding bugs.
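One plausible shape for the `_discrepancies` payload: the LLM re-extracts from the source, the stage diffs that against the parser's output, and disagreements are reported field by field. The field names are hypothetical:

```python
# Hypothetical check_enrich result: diff parser output against LLM output and
# surface disagreements as structured discrepancies.
parser_output = {"revenue": "520M", "quarter": "Q3"}
llm_output = {"revenue": "523M", "quarter": "Q3"}

discrepancies = [
    {"field": k, "parser": parser_output[k], "llm": llm_output[k]}
    for k in parser_output
    if parser_output[k] != llm_output[k]
]
result = {**llm_output, "_discrepancies": discrepancies}
# A non-empty list points the developer at a likely parser bug.
```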

### 10. Audit/QA System

Configurable quality checks run as pipeline stages or batch audits:

- `min_chars_per_minute` — transcript density
- `no_repeated_blocks` — VTT cue repetition detection
- `transcript_coverage` — timestamp vs duration check
- `required_fields` — metadata completeness
- `duration_range` — plausibility validation
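As a concrete example, the first check amounts to a density ratio. The real implementations live in `jobs/lib/audit.py`; the signature and the 300 chars/minute threshold here are assumptions:

```python
# Sketch of the min_chars_per_minute audit check: flag transcripts that are
# implausibly sparse for their duration. Threshold is illustrative.
def min_chars_per_minute(transcript: str, duration_min: float, threshold: float = 300) -> bool:
    """Return True if the transcript is dense enough to be plausible."""
    if duration_min <= 0:
        return False
    return len(transcript) / duration_min >= threshold

dense = "word " * 2000  # ~10,000 chars over a 20-minute video = 500 chars/min
assert min_chars_per_minute(dense, duration_min=20)
assert not min_chars_per_minute("too short", duration_min=20)
```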

### 11. Hot Reload & Wake Mechanism

- **Config hot-reload**: `jobs.yaml` mtime watched + SIGHUP handler — no restart needed for config changes
- **Wake files**: Touch `jobs/data/.wake/JOB_ID` to interrupt sleep and trigger immediate processing (checked every 2s)
- **CLI integration**: `jobctl wake a16z` / `jobctl wake --all`
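The wake-file mechanism is simple enough to sketch directly — the worker's sleep loop polls for a marker file and consumes it on wake. Paths and names are illustrative:

```python
# Sketch of the wake-file check a worker runs every ~2s during sleep.
import os
import tempfile

def should_wake(wake_dir: str, job_id: str) -> bool:
    """Check for (and consume) a wake marker for this job."""
    path = os.path.join(wake_dir, job_id)
    if os.path.exists(path):
        os.remove(path)  # consume the marker so one touch = one wake
        return True
    return False

wake_dir = tempfile.mkdtemp()
assert not should_wake(wake_dir, "a16z")
open(os.path.join(wake_dir, "a16z"), "w").close()  # what `jobctl wake a16z` does
assert should_wake(wake_dir, "a16z")
```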

### 12. Comprehensive CLI (`jobctl`)

```bash
jobctl status                    # Overview of all jobs
jobctl status a16z               # Detail for one job
jobctl wake a16z                 # Force immediate pickup
jobctl pause a16z "market closed" # Pause with reason
jobctl pause a16z --retry-in 3h  # Auto-unpause after duration
jobctl unpause a16z              # Resume
jobctl clear-errors a16z         # Reset circuit breaker
jobctl reset a16z fetch          # Reset stuck items in a stage
jobctl reprocess a16z extract    # Reprocess items (prompt changes)
jobctl add-url "https://..."     # Add URL to newsflow (auto-routes)
```

Plus diagnostic tools: `diagnose.py` (health checks), `failures.py` (error classification), `stats.py` (daily/lifetime analytics), `job_inspect.py` (queue inspection).

### 13. Gradio Dashboard

Live monitoring UI at `jobs.localhost`:
- Job overview with status, stage progress, p50 timing
- Item-level detail with per-stage status and timing
- Pause/resume, retry failed, reprocess stale — one-click operations
- Pipeline visualization (multi-job workflows)
- Runner heartbeat indicator

---

## Design Philosophies Compared

The feature differences between rivus/jobs and mainstream frameworks stem from fundamentally different assumptions about what a job system is for. Understanding these philosophical splits explains why the feature matrices look the way they do.

### 1. "Find work" vs. "Execute dispatched tasks"

**Mainstream assumption**: Work originates in application code. A user clicks a button, an API receives a request, a cron fires — and the application *dispatches* a task to the queue. The framework's job is to execute it reliably.

**rivus/jobs assumption**: Work *exists in the world* and must be *discovered*. YouTube uploads videos, companies file earnings, news breaks — the system's job is to notice, then process. Discovery is the first stage of the pipeline, not an external trigger.

This is why rivus/jobs has 17+ discovery strategies as first-class primitives while Celery/Temporal/BullMQ have zero. They don't need them — their callers already know what work exists. rivus/jobs operates in environments where nobody tells you there's work to do; you have to go looking.

**Dagster's sensors** and **Prefect's event triggers** are the closest mainstream equivalents, but they're add-ons to an execution engine. In rivus/jobs, discovery *is* the engine's starting point.

### 2. "Understand errors" vs. "Count retries"

**Mainstream assumption**: Errors are opaque. Retry N times with exponential backoff, then give up. Maybe dead-letter it. The developer writes custom error handling per task type.

**rivus/jobs assumption**: Errors have *meaning* that determines the correct response. A timeout (transient) needs retry. An expired auth token (systemic) needs the whole job paused. A market-closed error (temporal) needs a scheduled resume. A KeyError (code bug) needs a human. Counting retries collapses all these into one dimension.

The Doctor module uses an LLM to classify each error semantically, then applies per-class circuit breakers. This costs ~$0.001 per error and 1-2 seconds, but it means a single auth failure pauses the job instead of burning through 50 retries before someone notices.

Mainstream frameworks don't do this because they were built before cheap, fast LLM inference existed. Error classification was too expensive to automate, so everyone hardcoded retry counts.

### 3. "Iterate on outputs" vs. "Run once correctly"

**Mainstream assumption**: A pipeline produces output once. If the output is wrong, you fix the code and rerun the whole pipeline. The framework tracks whether a task *ran*, not whether its output is *current*.

**rivus/jobs assumption**: Outputs are living artifacts. Prompts change, parsers improve, models get upgraded. The question isn't "did this run?" but "is this result still valid given the current code?" This is especially true for LLM-heavy pipelines where prompt iteration is the primary development loop.

`VERSION_DEPS` tracks the exact functions, prompts, and configs that produced each result. Change a prompt string → items automatically show as stale → one-click reprocess only the affected stage. Dagster's asset freshness policies are the closest mainstream concept, but they track *time* staleness, not *code* staleness.

This philosophy drives the raw-data-first stage design: fetch caches artifacts to disk, extract reads from cache. You can iterate on extraction 20 times without re-fetching — because re-fetching months later may be impossible (paywalls, deleted content, rate limits).

### 4. "Budget-aware" vs. "Resource-unlimited"

**Mainstream assumption**: Compute is the bottleneck. You scale workers, add machines, optimize throughput. Cost is an ops concern tracked in cloud billing dashboards, not a runtime constraint.

**rivus/jobs assumption**: API calls cost real money per item. An LLM extraction stage at $0.01/item across 10,000 items is $100. A discovery strategy that accidentally fetches 50,000 items can run up hundreds in API costs overnight. Cost is a *runtime safety constraint*, not a billing afterthought.

This drives cost-based auto-pause guards (`daily_cost_limit`), `_cost` return values from handlers, and the cheap-stages-first ordering principle (score before download, filter before transcribe). Mainstream frameworks have no native cost tracking because they were built for CPU/memory-bound workloads, not pay-per-call API workloads.

### 5. "Single operator, zero ops" vs. "Team-scale, managed infra"

**Mainstream assumption**: Multiple developers, CI/CD pipelines, staging environments, role-based access. The framework needs deployment primitives (Docker, Kubernetes, cloud runners), collaboration features (RBAC, audit logs, parameterized runs), and operational tooling (distributed tracing, multi-tenant isolation).

**rivus/jobs assumption**: One person on a laptop. The entire system state is one SQLite file. Start in <1 second, no services to manage, no broker to crash, no cluster to monitor. `git clone` + `python -m jobs.runner` and you're running.

This is a deliberate tradeoff: rivus/jobs will never distribute across machines, but it also never requires you to debug a RabbitMQ cluster at 2 AM. For a solo research operation processing thousands of items, the single-process asyncio model handles the I/O concurrency without the operational complexity.

### 6. "Domain-aware pipelines" vs. "Generic task execution"

**Mainstream assumption**: The framework is domain-agnostic. It moves data between stages. What the data *means* is the handler's problem.

**rivus/jobs assumption**: The framework knows things about its domain. It knows that content quality matters (audit/QA checks as a primitive). It knows that LLM stages can diagnose parser stages (self-healing pattern). It knows that discovery-time priority should reflect business value (`-market_cap` so the most important companies process first). It knows that external services have shared rate limits that span jobs (ResourceRegistry).

This bleeds domain knowledge into the framework, which makes it less generic but more capable for its specific use case. A generic framework can't auto-pause when LLM costs exceed $5/day because it doesn't know what costs are.

### Summary: Where the Philosophies Diverge

| Dimension              | Mainstream frameworks              | rivus/jobs                                |
|------------------------|------------------------------------|-------------------------------------------|
| Work origin            | Dispatched by application code     | Discovered from external sources          |
| Error handling         | Count-based retry + dead-letter    | Semantically classified by LLM            |
| Output model           | Run once, rerun if wrong           | Tracked for staleness, iteratively refined|
| Cost model             | Ops concern (billing dashboards)   | Runtime safety constraint (auto-pause)    |
| Operational model      | Team + managed infra               | Solo operator + zero infra (SQLite)       |
| Domain awareness       | Generic (handlers know the domain) | Framework encodes domain patterns         |

These aren't value judgments — they're fitness-for-purpose tradeoffs. Celery is better for executing millions of web requests across a cluster. Temporal is better for business workflows that must never lose state. rivus/jobs is better for a solo researcher who needs to discover, process, iterate, and budget-cap LLM-heavy content pipelines on a single machine.

---

## Feature Comparison with Other Frameworks

### Quick Feature Matrix

| Feature                        | rivus/jobs | Celery | Temporal | Prefect | Dagster | Airflow | BullMQ | Dramatiq |
|--------------------------------|:----------:|:------:|:--------:|:-------:|:-------:|:-------:|:------:|:--------:|
| **Multi-stage pipelines**      | ✓          | ~      | ✓        | ✓       | ✓       | ✓       | ~      | ~        |
| **YAML-defined jobs**          | ✓          | ✗      | ✗        | ~       | ✗       | ✗       | ✗      | ✗        |
| **Auto-discovery of work**     | ✓          | ✗      | ✗        | ~       | ✓       | ~       | ✗      | ✗        |
| **LLM error classification**   | ✓          | ✗      | ✗        | ✗       | ✗       | ✗       | ✗      | ✗        |
| **Output version tracking**    | ✓          | ✗      | ✗        | ✗       | ✓       | ✗       | ✗      | ✗        |
| **Cost-based auto-pause**      | ✓          | ✗      | ✗        | ✗       | ✗       | ✗       | ✗      | ✗        |
| **Shared resource semaphores** | ✓          | ~      | ✗        | ✗       | ✗       | ~       | ~      | ✗        |
| **Priority queue**             | ✓          | ✓      | ✓        | ✗       | ✗       | ✓       | ✓      | ✓        |
| **Hot config reload**          | ✓          | ~      | ✗        | ✗       | ✗       | ✗       | ✗      | ✗        |
| **Pipeline chaining**          | ✓          | ~      | ✓        | ✓       | ✓       | ✓       | ~      | ~        |
| **Zero-infra (SQLite)**        | ✓          | ✗      | ✗        | ✗       | ✗       | ✗       | ✗      | ✗        |
| **Distributed execution**      | ✗          | ✓      | ✓        | ✓       | ✓       | ✓       | ✓      | ✓        |
| **Web UI**                     | ✓          | ✓      | ✓        | ✓       | ✓       | ✓       | ~      | ✗        |
| **QA/audit checks**            | ✓          | ✗      | ✗        | ✗       | ✓       | ✗       | ✗      | ✗        |
| **Circuit breaker**            | ✓          | ✗      | ✓        | ✗       | ✗       | ✗       | ✗      | ✗        |

`✓` = native/first-class, `~` = possible with plugins/workarounds, `✗` = not supported

### Detailed Comparison

#### vs. Celery / Dramatiq (Task Queues)

Celery and Dramatiq are **task queues** — they execute individual tasks dispatched by application code. Celery uses Canvas primitives (chain, group, chord) for composition; Dramatiq offers simpler linear pipelines. Both require a message broker (Redis/RabbitMQ).

**rivus/jobs advantage**: Discovery-driven (work found, not dispatched), YAML-defined multi-stage pipelines with per-stage concurrency, LLM error intelligence, output version tracking, cost guards, zero-infra (no broker). Better for research pipelines where items emerge from external sources.

**Celery advantage**: Battle-tested distributed execution, horizontal scaling, largest Python ecosystem (Django, FastAPI, Beat scheduler), Flower monitoring UI, every result backend imaginable (Redis, SQL, S3, GCS, MongoDB).

**Dramatiq advantage**: Simplest reliable Python queue — always acks after success (no silent loss), clean middleware API, lightweight. But no arbitrary DAGs (linear pipelines only).

#### vs. Temporal / Hatchet (Workflow Orchestration)

Temporal is the gold standard for durable workflow execution — workflows survive crashes and resume exactly from the last checkpoint. Hatchet is a newer Postgres-native alternative with simpler ops.

**rivus/jobs advantage**: YAML-defined jobs (vs code-defined workflows), auto-discovery of work, LLM error classification, output version tracking, cost-based guards, zero operational overhead (vs running Temporal server + workers + Cassandra/Postgres).

**Temporal advantage**: True durable execution (infinite loops that survive crashes), exactly-once guarantees, polyglot SDKs (Go, Java, Python, TS, .NET), signals/queries for runtime interaction, workflow versioning at the platform level.

**Hatchet advantage**: Postgres-native (no Redis/Cassandra), DAG steps with parent dependencies, dynamic per-key rate limits (per-user/per-tenant), fair concurrency scheduling (group-round-robin prevents noisy neighbors), sub-25ms dispatch latency.

#### vs. Prefect / Dagster (Data Pipeline Orchestration)

Prefect and Dagster are closest in spirit — they manage data pipelines with stages, monitoring, and quality checks.

**rivus/jobs advantage**:
- **Discovery as a first-class concept** — Dagster has sensors and Prefect has event triggers, but rivus has 17+ pluggable strategies (API polling, channel scrapes, cross-job chaining via `tracker_query`, multi-source composition with dedup)
- **LLM error intelligence** — no equivalent; both use simple retry counts
- **Output version tracking with dependency graphs** — Dagster has asset freshness policies, but rivus tracks function-level deps via `VERSION_DEPS` (specific prompts, parsers, model configs that affect output)
- **Cost-based guards** — auto-pause when LLM spend exceeds threshold; unique to research/LLM workloads
- **Zero-infra** — SQLite vs PostgreSQL + managed cloud
- **Self-healing pipeline pattern** — LLM-as-observer for parser diagnostics

**Dagster advantage**: Asset-centric model (freshness policies, auto-materialization), deep dbt/Spark/Snowflake integration via IO managers, asset catalog with lineage graph, asset checks for data quality, partitioned assets for time-series.

**Prefect advantage**: Minimal boilerplate (one decorator), dynamic task graphs inferred from data dependencies, global concurrency limits with decay rate, event-driven automations, Prefect Cloud.

#### vs. Airflow (DAG Scheduling)

Airflow is a DAG-based scheduler for orchestrating batch ETL. 2000+ operators in provider packages. Airflow 3.0 added asset-aware scheduling and DAG versioning.

**rivus/jobs advantage**: Event-driven discovery (not just cron), continuous stage processing (not scheduled batches), LLM error intelligence, version tracking, lightweight single-process design.

**Airflow advantage**: Largest operator ecosystem (AWS, GCP, Azure, Spark, dbt, hundreds more), distributed executors (Celery, Kubernetes), asset-aware scheduling (3.0), calendar scheduling with catchup, managed offerings (Astronomer, MWAA, Cloud Composer), mature RBAC.

#### vs. BullMQ (Node.js Queue)

Redis Streams-backed queue with exactly-once semantics and excellent rate limiting.

**rivus/jobs advantage**: Multi-stage pipelines with per-stage concurrency, discovery automation, LLM error classification, output versioning, Python ecosystem, zero-infra.

**BullMQ advantage**: Redis-native performance (sub-ms enqueue), per-key rate limiting (uniquely granular — per-user, per-tenant), parent-child job flows, repeatable jobs (cron or interval), sandboxed processors.

### Storage Backend Comparison

| Framework  | Queue/Broker                   | State Store                          |
|------------|--------------------------------|--------------------------------------|
| rivus/jobs | SQLite (WAL mode)              | SQLite (same file)                   |
| Celery     | Redis, RabbitMQ, SQS           | Redis, SQL, S3, GCS, MongoDB         |
| Temporal   | Internal (Cassandra/Postgres)  | Same (event history)                 |
| Prefect    | Prefect server (SQLite/PG)     | Block storage (S3/GCS/Azure)         |
| Dagster    | Postgres event log             | IO managers (S3, Snowflake, etc.)    |
| Airflow    | Postgres/MySQL metadata DB     | XCom (DB) + external                 |
| BullMQ     | Redis Streams                  | Redis (only option)                  |
| Dramatiq   | Redis, RabbitMQ                | Redis, Memcached                     |
| Hatchet    | PGMQ / RabbitMQ               | Postgres                             |

### When to Use What

| Scenario                                              | Best choice        | Runner-up  |
|-------------------------------------------------------|--------------------|------------|
| Research/content pipeline, LLM-heavy, single machine  | **rivus/jobs**     | Prefect    |
| Simple Python fire-and-forget tasks                   | Celery             | Dramatiq   |
| Complex business workflows, must survive crashes      | Temporal           | Hatchet    |
| Data pipeline + asset lineage + dbt                   | Dagster            | Airflow    |
| Scheduled batch ETL, massive operator ecosystem       | Airflow            | Prefect    |
| Python workflow, minimal boilerplate                   | Prefect            | Hatchet    |
| Node.js, Redis shop, multi-tenant rate limiting       | BullMQ             | —          |
| Python queue, correctness above all                   | Dramatiq           | —          |
| Modern Temporal alternative, Postgres-only infra      | Hatchet            | Temporal   |

---

## What's Unique to rivus/jobs

These features are **not found** (or not combined) in any mainstream framework:

### 1. LLM Error Intelligence
Every exception classified by an LLM into actionable categories (transient/item_specific/temporal/systemic/code_bug) with per-class circuit breakers and automated escalation. No other framework uses semantic understanding of errors to decide retry vs pause vs fail.

### 2. Discovery as a First-Class Primitive
Work items aren't dispatched by application code — they're automatically discovered from external sources via 17+ pluggable strategies. The framework finds its own work. Strategies compose (`multi_source`) and chain (`tracker_query` feeds one job's output to another).

### 3. Output Version Tracking with Dependency Graphs
`VERSION_DEPS` tracks the exact functions, prompts, and configs that produce each stage's output. When any dependency changes, affected items are marked stale with one-click reprocessing. This is essential for LLM-heavy pipelines where prompt iteration is the primary development loop.

### 4. Cost-Based Auto-Pause
Daily LLM spend limits per job. Handlers return `_cost`, the guard checker sums it, and auto-pauses when exceeded. Designed for research workloads where API costs can spiral unexpectedly.

### 5. Self-Healing Pipeline Pattern
LLM stages diagnose code parser failures by comparing their output against the parser's output, generating structured `_discrepancies` that reveal parser bugs. The LLM observes; the developer fixes.

### 6. YAML-Defined Everything + Hot Reload
Jobs, stages, discovery, pacing, guards, resources — all in one `jobs.yaml` with live reload. Change a config, runner picks it up. No code changes, no restarts.

### 7. Zero-Infrastructure Operation
SQLite (WAL mode) + single asyncio process. No Redis, no PostgreSQL, no Kubernetes, no message broker. Starts in <1 second, runs on a laptop. The entire state is in one file (`jobs.db`).

---

## Architecture Decisions & Tradeoffs

| Decision                    | Rationale                                    | Tradeoff                                |
|-----------------------------|----------------------------------------------|-----------------------------------------|
| SQLite, not PostgreSQL      | Zero ops, single file, instant startup       | No distributed execution                |
| Single process, not workers | Simplicity, no IPC, shared state             | Limited to one machine's resources      |
| asyncio, not threads        | Efficient I/O concurrency for HTTP/LLM calls | CPU-bound work blocks the loop          |
| YAML config, not code       | Non-developers can add jobs, hot reload      | Less flexible than code-defined DAGs    |
| LLM error classification    | Semantic understanding beats regex matching  | ~$0.001 per error, 1-2s latency         |
| Version hashing via source  | Zero config, auto-detects code changes       | Fragile (whitespace, lambdas) — moving to explicit versioning |
| Wake files, not pub/sub     | Zero infra, filesystem is reliable           | 2s check interval (not instant)         |
| Priority as float           | Flexible (negative for urgent, large for low)| Less intuitive than P0/P1/P2 buckets   |

---

## Active Jobs

| Job                          | Emoji | Pipeline         | Stages                                                    | Purpose                          |
|------------------------------|-------|------------------|-----------------------------------------------------------|----------------------------------|
| `earnings_backfill_largecap` | 📞    | —                | price → ir → ib → transcript → diarize → chart           | Large-cap earnings call analysis |
| `dumb_money_live`            | 📺    | —                | meta → score → captions → audio → whisper → qa           | YouTube finance channel          |
| `a16z`                       | 🅰️    | —                | meta → score → captions → audio → whisper → qa           | a16z podcast channel             |
| `healthy_gamer_all`          | 🧠    | —                | meta → score → captions → audio → whisper → qa           | Health/psychology content        |
| `dwarkesh_podcast`           | 🎙️    | —                | meta → score → captions → audio → whisper → qa           | Tech interview podcast           |
| `lex_fridman`                | 🤖    | —                | meta → score → captions → audio → whisper → qa           | Long-form tech interviews        |
| `pltr_discovery`             | 🔎    | pltr_deep_dive/1 | score                                                     | Palantir content discovery       |
| `pltr_content_processing`    | 📝    | pltr_deep_dive/2 | fetch → extract → audio → diarize → score                | Palantir content processing      |
| `newsflow_monitor`           | 📰    | newsflow/1       | fetch → extract → score                                  | Company news monitoring          |
| `supplychain_anchors`        | ⚡    | supplychain/1    | profile → suppliers → customers → competitors → expand    | Semiconductor supply chain graph |
| `supplychain_expand`         | 🔗    | supplychain/2    | profile → suppliers → customers → competitors → expand    | Expand supply chain graph        |
| `person_intel`               | 👤    | —                | discover → enrich → research → score → assess            | VC/investor dossiers             |
| `benchmark_eval`             | 📊    | —                | evaluate                                                  | LLM benchmark evaluations        |

---

## Key Files

| What               | Path                        |
|--------------------|-----------------------------|
| Job definitions    | `jobs/jobs.yaml`            |
| Runner             | `jobs/runner.py`            |
| SQLite tracker     | `jobs/lib/tracker.py`       |
| Discovery strategies| `jobs/lib/discovery.py`    |
| Error intelligence | `jobs/lib/doctor.py`        |
| Rate limiter       | `jobs/lib/pacer.py`         |
| Handler resolver   | `jobs/lib/executor.py`      |
| QA checks          | `jobs/lib/audit.py`         |
| CLI                | `jobs/ctl.py`               |
| Dashboard          | `jobs/app_ng.py`            |
| Health diagnostics | `jobs/diagnose.py`          |
| Failure analysis   | `jobs/failures.py`          |
| Pipeline stats     | `jobs/stats.py`             |
| New job checklist  | `jobs/docs/new-job-checklist.md` |
| Ops reference      | `jobs/CLAUDE.md` (concise) + `jobs/docs/architecture.md` (detailed) |

---

## Related Documents

| Document                 | Path                                                   | Content                                |
|--------------------------|--------------------------------------------------------|----------------------------------------|
| V3 cleanup plan          | `docs/plans/2026-02-26-unified-work-system-v3.md`     | Schema cleanup + autodo module design  |
| Action plan (5 tiers)    | `docs/plans/2026-02-26-jobs-action-plan.md`            | Prioritized improvement roadmap        |
| Review findings          | `docs/plans/2026-02-26-jobs-review-findings.md`        | Deep module-by-module analysis         |
| 4-model critique         | `docs/plans/2026-02-26-jobs-vario-critique.md`         | Multi-model consensus on improvements  |
| Newsflow buildout plan   | `docs/plans/2026-02-21-newsflow-buildout.md`           | Topic-driven news intelligence design  |
| Batch job principles     | `~/.claude/principles/batch-jobs.md`                   | Core design principles                 |
