# Unified Task System: Internal Tasks as Jobs

**Date**: 2026-02-26
**Status**: Design (V2 — post multi-model critique)
**Replaces**: `supervisor/todos/queue.yaml` flat file, `supervisor/autonomous/planner.py` keyword-based classifier

## Review History

V1 reviewed by GPT 5.2, Grok 4.1, Gemini 3.1 Pro. Full critiques preserved in `/tmp/critique_{gpt,grok,gemini}.txt`. Key consensus:
- Schema over-normalized (3 tables for one unit of work)
- Fingerprint-as-identity is wrong (text rewrites lose history)
- Cache invalidation circular (depends on LLM output)
- JSON blobs unqueryable (need junction tables)
- Priority float 0.0-1.0 is false precision
- lib/runner/ plugin platform is premature yak-shaving
- ClickUp sync is premature (system doesn't exist yet)
- LLM scoping treated as oracle instead of unreliable hint

All addressed below.

## Problem

The current autonomous work system has three design flaws:

1. **Flat queue conflates checks and findings**. `sup-001` "Fix stale port refs" is a *finding* that should be an output of a recurring doc-health check, not a peer of that check in the queue.

2. **Priority without scoping is guesswork**. The planner reads TODO text and assigns `priority=3, effort=medium` from keywords. Meaningful prioritization requires understanding what implementation entails.

3. **No decomposition**. Complex TODOs need sub-tasks. The queue treats multi-step items as atomic.

## Design

### Core Model

Two domains, deliberately separate:

```
┌──────────────────────────────────┐  ┌──────────────────────────────────┐
│        Project Management        │  │          Monitoring              │
│                                  │  │                                  │
│  work_items (todos + assessments │  │  checks (recurring sweep defs)   │
│             merged, versioned)   │  │  check_runs (execution log)      │
│                                  │  │                                  │
│  tasks (executable sub-items)    │  │  Findings emit work_items with   │
│  item_files (junction)           │  │  origin_type = 'check'           │
│  item_deps (junction)            │  │                                  │
└──────────────────────────────────┘  └──────────────────────────────────┘
```

**Key V2 changes**: todos and assessments merged (1:1 relationship was unnecessary JOIN). Findings eliminated as separate table — checks emit `work_items` with `origin_type='check'`. JSON blobs normalized to junction tables. Priority is an integer bucket (P0-P3).

### Schema

```sql
-- Enable WAL for concurrent readers + single writer
PRAGMA journal_mode = WAL;
PRAGMA foreign_keys = ON;

-- ─── Work Items (unified: raw todos + scoped assessments) ─────────────

CREATE TABLE work_items (
    id INTEGER PRIMARY KEY,
    -- Identity: location-based, not text-based
    source_file TEXT NOT NULL,         -- e.g. "learning/TODO.md"
    source_line INTEGER,
    heading TEXT,                       -- parent heading context
    text_hash TEXT NOT NULL,           -- SHA256 of normalized text (dedupe hint, NOT identity)

    -- Raw scan data
    text TEXT NOT NULL,
    first_seen_at INTEGER NOT NULL,    -- unix epoch UTC
    last_seen_at INTEGER NOT NULL,

    -- Origin tracking
    origin_type TEXT NOT NULL DEFAULT 'scan'
        CHECK (origin_type IN ('scan', 'check', 'manual')),
    origin_check_run_id INTEGER REFERENCES check_runs(id),

    -- Assessment (nullable — populated by scoping step)
    title TEXT,                        -- short actionable title (LLM-generated)
    summary TEXT,                      -- what implementation entails
    effort TEXT CHECK (effort IN ('unknown', 'small', 'medium', 'large'))
        DEFAULT 'unknown',
    risk TEXT CHECK (risk IN ('unknown', 'low', 'medium', 'high'))
        DEFAULT 'unknown',
    priority INTEGER  -- NULL until assessed. P0=urgent, P1=high, P2=normal, P3=low
        CHECK (priority IS NULL OR priority BETWEEN 0 AND 3),
    tier TEXT DEFAULT 'unknown'
        CHECK (tier IN ('unknown', 'safe_always', 'code_change')),
    type TEXT,  -- NULL until assessed (feature, fix, refactor, docs, etc.)
    project TEXT DEFAULT 'rivus',

    -- Assessment metadata
    assessed_at INTEGER,               -- unix epoch UTC
    assessed_by TEXT,                  -- model that did the assessment
    assessment_version INTEGER DEFAULT 0,  -- bumped on re-assessment
    -- Cache: hash of (text + source_file + heading + repo HEAD at assess time)
    -- Does NOT include files_involved (chicken-egg problem)
    input_hash TEXT,

    -- Lifecycle
    status TEXT DEFAULT 'new'
        CHECK (status IN ('new', 'triaged', 'assessed', 'in_progress', 'done', 'wontfix', 'stale')),
    accepted INTEGER DEFAULT 0,        -- human confirmed this is worth doing
    parent_id INTEGER REFERENCES work_items(id),  -- for sub-items within TODO.md

    -- External sync (deferred to later phase)
    clickup_task_id TEXT,
    clickup_synced_at INTEGER
);

CREATE INDEX idx_work_items_status ON work_items(status, priority);
CREATE INDEX idx_work_items_source ON work_items(source_file, source_line);
CREATE INDEX idx_work_items_text_hash ON work_items(text_hash);
CREATE INDEX idx_work_items_origin ON work_items(origin_type, origin_check_run_id);

-- Dedupe: same file + similar position + same text = same item
-- But text_hash is a hint, not PK. Location drift (line numbers change)
-- is handled by fuzzy matching in the scanner.

-- ─── Junction: files involved in a work item ─────────────────────────

CREATE TABLE item_files (
    work_item_id INTEGER NOT NULL REFERENCES work_items(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    PRIMARY KEY (work_item_id, file_path)
);

-- ─── Junction: dependencies between work items ───────────────────────

CREATE TABLE item_deps (
    blocked_id INTEGER NOT NULL REFERENCES work_items(id) ON DELETE CASCADE,
    blocking_id INTEGER NOT NULL REFERENCES work_items(id) ON DELETE CASCADE,
    PRIMARY KEY (blocked_id, blocking_id)
);

-- ─── Tasks: executable sub-items from decomposition ──────────────────

CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    work_item_id INTEGER NOT NULL REFERENCES work_items(id),
    title TEXT NOT NULL,
    description TEXT,
    status TEXT DEFAULT 'pending'
        CHECK (status IN ('pending', 'in_progress', 'done', 'failed', 'skipped')),
    priority INTEGER  -- inherited from work_item, NULL until parent is prioritized
        CHECK (priority IS NULL OR priority BETWEEN 0 AND 3),
    effort TEXT CHECK (effort IN ('small', 'medium', 'large')),
    sort_order INTEGER DEFAULT 0,      -- ordering within a work item

    -- Execution tracking
    started_at INTEGER,
    completed_at INTEGER,
    attempts INTEGER DEFAULT 0,
    last_error TEXT,
    leased_by TEXT,                     -- runner ID holding this task
    leased_until INTEGER,              -- lease expiry (unix epoch)
    result TEXT                         -- JSON outcome
);

CREATE INDEX idx_tasks_status ON tasks(status, priority);
CREATE INDEX idx_tasks_work_item ON tasks(work_item_id);

-- ─── Checks: recurring sweep definitions ─────────────────────────────

CREATE TABLE checks (
    id TEXT PRIMARY KEY,               -- e.g. "doc-health", "convention-scan"
    title TEXT NOT NULL,
    description TEXT,
    handler TEXT NOT NULL,             -- Python callable path
    schedule_type TEXT DEFAULT 'daily'
        CHECK (schedule_type IN ('daily', 'weekly', 'on_commit', 'on_idle')),
    next_run_at INTEGER,               -- unix epoch, computed from schedule
    enabled INTEGER DEFAULT 1
);

-- ─── Check Runs: execution log ───────────────────────────────────────

CREATE TABLE check_runs (
    id INTEGER PRIMARY KEY,
    check_id TEXT NOT NULL REFERENCES checks(id),
    started_at INTEGER NOT NULL,
    completed_at INTEGER,
    status TEXT DEFAULT 'running'
        CHECK (status IN ('running', 'completed', 'failed')),
    handler_version TEXT,              -- for stale detection
    findings_count INTEGER DEFAULT 0,
    new_findings_count INTEGER DEFAULT 0,  -- not previously seen
    error TEXT,
    summary TEXT
);

CREATE INDEX idx_check_runs_check ON check_runs(check_id, started_at);
```
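One operational note on the PRAGMAs above: `journal_mode = WAL` persists in the database file once set, but `foreign_keys` defaults to off in SQLite and must be enabled on every new connection. A minimal connection helper (function name and defaults are illustrative):

```python
import sqlite3

def connect(path: str = "tasks.db") -> sqlite3.Connection:
    """Open the task DB. WAL is a persistent property of the file,
    but foreign_keys is a per-connection setting (off by default)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")  # no-op after first call
    conn.execute("PRAGMA foreign_keys = ON")   # must run on every connection
    conn.row_factory = sqlite3.Row
    return conn
```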

### Key Design Decisions (V2)

**1. Location-based identity, not text-based.**
V1 used `SHA256(text)` as the primary key. All three reviewers flagged this: text rewrites lose history, and the same text in different files is different work. V2 uses `INTEGER PRIMARY KEY` with `(source_file, source_line, heading)` for fuzzy matching and `text_hash` as a dedupe hint only.
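A sketch of what the scanner's fuzzy matching could look like. The function names, the normalization, the 20-line tolerance, and the fallback order are assumptions, not part of the schema:

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial edits keep the same hash."""
    return " ".join(text.lower().split())

def text_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode()).hexdigest()

def match_existing(conn, source_file: str, line: int, heading: str, text: str,
                   line_tolerance: int = 20):
    """Resolve a scanned TODO to an existing work_item id, or None.
    Exact text_hash in the same file wins; otherwise fall back to
    same file + heading + nearby line (the text was rewritten in place)."""
    row = conn.execute(
        "SELECT id FROM work_items WHERE source_file = ? AND text_hash = ?",
        (source_file, text_hash(text))).fetchone()
    if row:
        return row[0]
    row = conn.execute(
        """SELECT id FROM work_items
           WHERE source_file = ? AND heading IS ?
             AND ABS(source_line - ?) <= ?""",
        (source_file, heading, line, line_tolerance)).fetchone()
    return row[0] if row else None
```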

**2. Assessments are columns, not a separate table.**
V1 had `todos` → `assessments` as a 1:1 relationship requiring JOINs. V2 inlines assessment fields into `work_items`. Assessment history tracked via `assessment_version` bump (previous versions can be logged to a JSONL file if needed, not a table).

**3. Findings are work items, not a separate entity.**
V1 had a `findings` table. All three reviewers said: findings that need action are just tasks/work items. V2: checks emit `work_items` with `origin_type='check'` and `origin_check_run_id`. Finding dedup happens via `text_hash` + `source_file` matching — if the same issue is found again, existing item's `last_seen_at` is bumped instead of creating a duplicate.
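The bump-or-insert logic can be a guarded upsert: try the `last_seen_at` update first, and only insert when nothing matched. A sketch (`upsert_finding` and its column list are illustrative):

```python
import time

def upsert_finding(conn, check_run_id: int, source_file: str,
                   text: str, h: str) -> None:
    """Record a check finding: bump last_seen_at on a known item,
    otherwise create a new work_item with origin_type='check'."""
    now = int(time.time())
    cur = conn.execute(
        "UPDATE work_items SET last_seen_at = ? "
        "WHERE source_file = ? AND text_hash = ?",
        (now, source_file, h))
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO work_items (source_file, text_hash, text, "
            " first_seen_at, last_seen_at, origin_type, origin_check_run_id) "
            "VALUES (?, ?, ?, ?, ?, 'check', ?)",
            (source_file, h, text, now, now, check_run_id))
    conn.commit()
```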

**4. Priority is P0-P3 integer, not 0.0-1.0 float.**
All three reviewers called float priority "false precision" — LLMs can't produce stable absolute scores. V2 uses 4 buckets (P0=urgent, P1=high, P2=normal, P3=low) with `sort_order` for fine-grained ordering within a bucket. Manual override is trivial.

**5. Cache invalidation uses broad context, not self-referential files.**
V1 hashed `files_involved` — but that's LLM output (chicken-egg). V2 uses `input_hash = hash(text + source_file + heading + repo_head_sha)`. Re-assessment is triggered when the TODO text itself changes or on demand, not whenever arbitrary code changes; the repo HEAD component records which repo state the assessment was made against.

**6. Junction tables for files and dependencies.**
V1 stored `files_involved` and `dependencies` as JSON blobs. V2 normalizes to `item_files` and `item_deps` for queryability ("show all work items touching file X").

**7. Acceptance gate before execution.**
V2 adds `accepted` boolean. Not all scanned TODOs should be auto-scoped and auto-executed. Triage step (human or rules-based) marks items worth investing assessment cost on.

**8. Task leasing for crash recovery.**
V1 had no concept of stuck tasks. V2 adds `leased_by`, `leased_until`, `attempts`, `last_error`. If a runner crashes, expired leases are reclaimed.
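Lease claiming can use a guarded `UPDATE` so two runners never hold the same task; the sketch below also skips NULL-priority (unscoped) tasks, per the pipeline rule that they are never auto-executed. Function name and the 600-second default lease are assumptions:

```python
import time

def claim_task(conn, runner_id: str, lease_seconds: int = 600):
    """Claim the best runnable task: pending, or in_progress with an
    expired lease (crashed runner). Returns a task id or None."""
    now = int(time.time())
    row = conn.execute(
        """SELECT id FROM tasks
           WHERE priority IS NOT NULL
             AND (status = 'pending'
                  OR (status = 'in_progress' AND leased_until < ?))
           ORDER BY priority, sort_order
           LIMIT 1""", (now,)).fetchone()
    if row is None:
        return None
    # Guarded update: if another runner claimed the task between the
    # SELECT and here, rowcount is 0 and we report no task rather than
    # double-executing it.
    cur = conn.execute(
        """UPDATE tasks
           SET status = 'in_progress', leased_by = ?, leased_until = ?,
               attempts = attempts + 1
           WHERE id = ? AND (status = 'pending' OR leased_until < ?)""",
        (runner_id, now + lease_seconds, row[0], now))
    conn.commit()
    return row[0] if cur.rowcount else None
```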

### Pipeline

```
1. SCAN       TODO.md files → work_items (origin_type='scan')
              Dedupe by text_hash + source_file. Bump last_seen_at for known items.
              Mark items not seen in N scans as 'stale'.

2. TRIAGE     New items default to status='new'.
              Rules or human review → status='triaged', accepted=1.
              Skip scoping for items not accepted (saves LLM cost).

3. SCOPE      For accepted items → LLM reads nearby code → fills assessment fields
              (effort, risk, files_involved, sub_tasks, dependencies).
              Output treated as SUGGESTIONS — files_involved validated by
              ripgrep (do the files exist? are they referenced?).
              assessment_version bumped. input_hash recorded.
              Priority is NOT assigned yet — we need the scope to set it.

4. PRIORITIZE Scoped items get priority (P0-P3) based on assessed effort,
              risk, dependencies, and project goals. Can be LLM-assigned
              or human-assigned. Unscoped items have NULL priority and
              are never auto-executed.

5. DECOMPOSE  Two levels:
              a) Per-item: complex items → child tasks in tasks table.
                 Simple items → single task auto-created.
                 Tasks inherit priority from parent work_item.
              b) Cross-item (architectural): after scoping, look across
                 multiple items for shared foundations. If N items all need
                 the same capability, factor it out as its own work_item
                 with the others depending on it. This prevents one-off
                 towers and builds reusable architecture. The scoper can
                 flag "this overlaps with items X, Y" during scoping;
                 a separate consolidation pass groups them.

6. EXECUTE    Runner picks tasks by priority + sort_order.
              Checks run on schedule → check_runs log → findings become work_items.
              Leasing prevents double-execution.

7. SYNC       (Deferred) Tasks materialize to ClickUp via clickup-python-sdk.
              Build only after the DB-based system is stable.
```

### Scoping: The Key Difference

The current planner does steps 1 and 6 — scan and (attempt to) execute. The missing piece is **scoping** (step 3), which makes prioritization meaningful.

**Critical caveat** (consensus from all three reviewers): LLM scoping output is **unreliable**. TODOs are written precisely because the developer didn't think through details. The LLM will hallucinate `files_involved`, invent dependencies, and produce non-repeatable priorities.

Mitigations:
- Treat assessment output as **suggestions with provenance** (store `assessed_by` model)
- **Validate** `files_involved` via ripgrep/glob (do these files exist? contain relevant code?)
- **Don't scope everything** — only triage-accepted items. 80% of 528 items are probably junk/stale
- **Scope on demand** when an item is picked for execution, not eagerly for all items
- Store `input_hash` for cache, but don't chase file-content hashing (too volatile)
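The files_involved validation can start with a plain existence/containment check before reaching for ripgrep to confirm the files actually reference the topic. A sketch under those assumptions (`validate_files` is a hypothetical helper):

```python
from pathlib import Path

def validate_files(repo_root: str, suggested: list[str]) -> dict:
    """Split LLM-suggested files_involved into confirmed vs rejected.
    Only confirmed paths would be written to item_files; the rest are
    logged as likely hallucinations."""
    root = Path(repo_root).resolve()
    confirmed, rejected = [], []
    for rel in suggested:
        p = (root / rel).resolve()
        # Reject paths that escape the repo or don't exist on disk.
        if root in p.parents and p.is_file():
            confirmed.append(rel)
        else:
            rejected.append(rel)
    return {"confirmed": confirmed, "rejected": rejected}
```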

Scoping prompt (per work item):

```
Given this TODO item and the relevant source code, analyze what
implementation would entail:

TODO: {item.text}
Source: {item.source_file}:{item.source_line}
Heading context: {item.heading}

Relevant code:
{code_snippets from nearby files and grep results}

Output JSON:
- title: short actionable title
- summary: 2-3 sentences on what the implementation involves
- files_involved: list of files that would need changes
- effort: small (<30min) | medium (30min-2hr) | large (2hr+)
- risk: low | medium | high
- sub_tasks: list of concrete steps (if decomposable)
- dependencies: which other items (by title) must be done first
- confidence: low | medium | high (self-assessed reliability)

NOTE: Priority is NOT part of scoping output. Priority is assigned in a
separate step after scoping, because it depends on project goals and
cross-item comparison — not just the scope of one item in isolation.
```

**Cost**: ~$0.01-0.05 per item (haiku). Scope only accepted items — likely 50-100 of 528, not all.

### Checks Replace One-Shot Queue Items

Current queue → proposed checks:

| Current Queue Item               | Proposed Check        | Finding type          |
| -------------------------------- | --------------------- | --------------------- |
| sup-001 "Fix stale port refs"    | `doc-health`          | port_mismatch         |
| sup-002 "Fix broken links"       | `doc-health`          | broken_link           |
| sup-009 "Convention violations"  | `convention-scan`     | (various)             |
| sup-003 "Audit session.py"       | `code-scrutiny`       | (targeted)            |
| sup-008 "Jobs system health"     | `jobs-health`         | (various)             |
| sup-010 "Principle health"       | `learning-health`     | (various)             |
| sup-011 "CLAUDE.md accuracy"     | `doc-health`          | claude_md_drift       |
| plan-rescan                      | `todo-scan`           | (this scanner)        |

**~8-12 recurring checks** replace the ever-growing flat queue. Findings become `work_items` with `origin_type='check'`, deduplicated across runs.
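A check handler can stay trivially simple if it only *reports* findings and leaves persistence to the runner. A hypothetical `doc-health` fragment for the port_mismatch finding type (the port list, regex, and `Finding` shape are placeholders):

```python
import re
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    source_file: str
    text: str          # description; hashed with source_file for dedupe

def doc_health(repo_root: str) -> list[Finding]:
    """Flag localhost ports in markdown that no known service uses.
    Handlers only report; the check runner converts findings to
    work_items (origin_type='check') and updates check_runs counters."""
    known_ports = {"8000", "8080"}   # placeholder; read from service config
    findings = []
    for md in Path(repo_root).rglob("*.md"):
        for n, line in enumerate(md.read_text().splitlines(), 1):
            for port in re.findall(r"localhost:(\d{4,5})", line):
                if port not in known_ports:
                    findings.append(Finding(
                        source_file=str(md.relative_to(repo_root)),
                        text=f"stale port reference {port} (line {n})"))
    return findings
```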

### Internal Tasks as Jobs

Internal checks use the same **patterns** as the jobs system, but not the same runner:

| Jobs Concept       | Internal Tasks Equivalent                      |
| ------------------ | ---------------------------------------------- |
| `jobs.yaml`        | `checks` table                                 |
| Discovery strategy | TODO scanner, check scheduler                  |
| Stage pipeline     | triage → scope → decompose → execute           |
| SQLite tracker     | `tasks.db`                                     |
| Dashboard          | CLI view (`sup task list`)                      |
| Circuit breaker    | Safety rules (read-only, file limits)          |

Runner unification (shared `lib/runner/`) is a **future consideration**, not a prerequisite. Build the task system first with a simple asyncio loop. Extract shared patterns only if duplication becomes painful.

### ClickUp Materialization (Deferred)

**SDK**: [`clickup-python-sdk`](https://pypi.org/project/clickup-python-sdk/) — Python, API key auth, covers task CRUD.

**Why deferred**: All three reviewers flagged this as premature. Building external sync before the local system is stable creates dual source-of-truth problems. One-way push guarantees drift (humans edit ClickUp). Build and stabilize the DB first.

**When to add**: After the task system has been in use for 2+ weeks and the schema has stabilized. Start with create-only (no updates), markdown/CSV export as interim.

**Open questions for later**:
- One ClickUp list vs per-project lists
- If ClickUp is added, should it become source of truth for status? (Gemini's recommendation)
- Bidirectional sync complexity vs value

## Implementation Strategy: Prove Value First

**V2 reversal**: V1 proposed building a generic runner platform first, then porting. All three reviewers identified this as backwards — spending the most time on the least validated component. V2: build the simple task system, prove it works, generalize only if needed.

### Phases

**Phase 1: Task DB + Scanner + CLI** (the minimum useful system)

```
supervisor/tasks/
├── db.py           # SQLite schema (above), CRUD operations
├── scanner.py      # TODO.md scanning (from planner.py, improved dedupe)
├── scoper.py       # LLM-based scoping (on-demand, not batch)
├── cli.py          # sup task list/scope/accept/run
└── tasks.db        # SQLite database
```

Deliverables:
- `sup task scan` — scan TODO.md files, populate work_items
- `sup task list` — show items by status/priority
- `sup task accept ID` — mark item for scoping
- `sup task scope ID` — run LLM scoping on one item
- `sup task scope --batch` — scope all accepted items
- Migrate 14 existing queue items

**Phase 2: Checks + Scheduling**

```
supervisor/tasks/checks/
├── doc_health.py
├── convention_scan.py
├── todo_scan.py
└── ...
```

Deliverables:
- `sup check run doc-health` — run a check, emit findings as work_items
- `sup check list` — show check status
- Simple asyncio scheduler (not a generic runner)
- Finding dedup via text_hash + source_file matching
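The "simple asyncio scheduler" could be as small as one loop per check. In this sketch the registry decorator stands in for rows in the `checks` table, and fixed intervals stand in for the richer `schedule_type` values; both are assumptions:

```python
import asyncio

CHECKS = {}   # check id -> (interval seconds, async handler)

def register(check_id: str, interval: float):
    """Decorator standing in for rows in the checks table."""
    def deco(fn):
        CHECKS[check_id] = (interval, fn)
        return fn
    return deco

async def run_check_forever(check_id: str, interval: float, handler):
    # One loop per check; a failing handler is recorded and never
    # takes down the other checks.
    while True:
        try:
            await handler()
        except Exception as exc:
            print(f"[{check_id}] failed: {exc}")  # would land in check_runs.error
        await asyncio.sleep(interval)

async def scheduler():
    await asyncio.gather(*(run_check_forever(cid, iv, fn)
                           for cid, (iv, fn) in CHECKS.items()))
```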

**Phase 3: Task Execution**

- `sup task exec ID` — execute a task (with leasing)
- Simple priority-based task picker
- Crash recovery via lease expiry

**Phase 4: External Sync + Runner Unification** (only if proven valuable)

- ClickUp sync
- Consider `lib/runner/` extraction if jobs and tasks show clear overlap
- Dashboard

### Migration Path

1. **Create `tasks.db`** with schema above, alongside existing `queue.yaml`
2. **Migrate 14 manual queue items** into work_items
3. **Run scanner** on all TODO.md files (populate work_items, no LLM cost)
4. **Triage**: Human review of ~528 items → accept the valuable ones
5. **Scope accepted items** incrementally (~50-100, ~$1-5)
6. **Implement 2-3 checks** (doc-health, todo-scan) to prove the pattern
7. **Retire `queue.yaml`** once DB is source of truth

## File Layout

```
supervisor/
├── tasks/
│   ├── db.py           # SQLite schema, CRUD operations
│   ├── scanner.py      # TODO.md scanning (from planner.py)
│   ├── scoper.py       # LLM-based implementation scoping
│   ├── checks/         # Check implementations
│   │   ├── doc_health.py
│   │   ├── convention_scan.py
│   │   ├── jobs_health.py
│   │   ├── learning_health.py
│   │   └── todo_scan.py
│   ├── sync/
│   │   └── clickup.py  # ClickUp materialization (Phase 4)
│   └── cli.py          # sup task list, sup task scope, sup check run
├── autonomous/         # (existing — deprecated after Phase 2)
│   ├── planner.py      # → migrates to tasks/scanner.py + tasks/scoper.py
│   └── todo.py         # → migrates to tasks/db.py
└── todos/
    └── queue.yaml      # → retired after Phase 1 migration
```

## Runner Unification (Future, Not Prerequisite)

The jobs runner (`jobs/runner.py`) solves hard problems: resumability, pacing, circuit breakers, error classification. These patterns should eventually be shared. But building a generic plugin platform before the task system exists is premature.

**When to revisit**: After Phase 3 (task execution works end-to-end). If the task scheduler and the jobs runner share >50% of their code, extract shared patterns to `lib/runner/`. Not before.

**What would be shared**: SQLite tracking, leasing, error classification, stage-level status. What stays domain-specific: discovery strategies, handlers, scheduling logic, dashboard views.

## Open Questions

- **Triage automation**: Can rules (e.g., "items in actively-developed directories", "items touching files changed in last 30 days") auto-accept items? Or is human triage required?
- **Assessment drift**: When should old assessments be re-run? Time-based (90 days)? Or only when text/location changes?
- **Check scheduling**: Simple interval (daily/weekly) or smarter triggers (on-commit for doc-health, on-idle for code-scrutiny)?
- **Scope vs execute ordering**: Should scoping happen lazily (when an item is about to be executed) instead of as a separate batch step?
