Design Critique: Unified Task System

Document: docs/plans/2026-02-26-unified-task-system-design.md (V1)
Reviewed by: GPT 5.2, Grok 4.1, Gemini 3.1 Pro
Synthesized by: Sonnet 4.6
Generated: 2026-02-26 09:18 PT

Themes: Schema, Architecture, Failure Modes, Sequencing, Process

=== GPT ===

1) Overengineering (you built a mini Jira + a mini orchestrator + a mini ETL platform)

A. “Unified runner platform” is a massive detour

Concrete alternative:
Keep autonomy separate. Implement tasks.db + scanner + scoper + simple scheduler loop. If later you want convergence with jobs, do it when autonomy is stable and proves value. Right now “lib/runner/” is pure yak shaving.

B. Checks + findings + tasks is 3 overlapping concepts

Concrete alternative:
Use one “work_items” table with a kind column (todo, check_run, finding, task) and a parent_id. Or simpler: keep findings but don’t invent tasks for check-driven work—treat “fix” as a field on finding (resolution_status, resolution_task_id).

C. Numerical priority 0.0–1.0 is fake precision

Concrete alternative:
Use an integer priority_rank (or p0/p1/p2) plus a separate sort_key computed at view time. Or store urgency/impact/effort and compute priority dynamically so you can change weighting without invalidating all assessments.
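The store-factors-compute-at-view-time variant can be sketched in a few lines. Field names, ranges, and weights below are illustrative, not from the design:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    urgency: int  # 0-3, higher = more urgent (illustrative scale)
    impact: int   # 0-3
    effort: int   # 0-3, higher = more work

def sort_key(a: Assessment) -> int:
    # Integer weights live in code, not in stored rows, so reweighting
    # never invalidates existing assessments.
    return 4 * a.urgency + 3 * a.impact - 2 * a.effort

backlog = [Assessment(urgency=1, impact=3, effort=3),
           Assessment(urgency=3, impact=1, effort=0)]
ranked = sorted(backlog, key=sort_key, reverse=True)
```

Changing the weights is a one-line code edit and a re-sort, not a migration.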

D. ClickUp sync is premature integration pain

Concrete alternative:
Export a markdown/CSV view first. If you must integrate, start with “create only” and never update; treat ClickUp as a read-only mirror. Or don’t do it until you have stable IDs and lifecycle.


2) Wrong abstractions (boundaries are in the wrong place)

A. assessments keyed by fingerprint assumes “one assessment per todo forever”

Concrete alternative:
assessments(id PK, fingerprint FK, assessed_at, model, prompt_version, input_hash, output_json, is_current) and keep history. Set todos.status based on existence of a current assessment.
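A minimal SQLite sketch of that versioned table, with a partial unique index enforcing "at most one current assessment per fingerprint" (column types and the helper name are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessments (
    id             INTEGER PRIMARY KEY,
    fingerprint    TEXT NOT NULL,
    assessed_at    TEXT NOT NULL,   -- ISO8601 UTC
    model          TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    input_hash     TEXT NOT NULL,
    output_json    TEXT NOT NULL,
    is_current     INTEGER NOT NULL DEFAULT 1
);
-- At most one current assessment per todo; older rows remain as history.
CREATE UNIQUE INDEX idx_assessments_current
    ON assessments(fingerprint) WHERE is_current = 1;
""")

def record_assessment(conn, fingerprint, assessed_at, model,
                      prompt_version, input_hash, output_json):
    # Supersede the previous current row, then insert the new one,
    # in a single transaction so history is never ambiguous.
    with conn:
        conn.execute("UPDATE assessments SET is_current = 0"
                     " WHERE fingerprint = ? AND is_current = 1",
                     (fingerprint,))
        conn.execute("INSERT INTO assessments (fingerprint, assessed_at,"
                     " model, prompt_version, input_hash, output_json)"
                     " VALUES (?, ?, ?, ?, ?, ?)",
                     (fingerprint, assessed_at, model,
                      prompt_version, input_hash, output_json))
```

Re-assessing a todo appends a row instead of overwriting, so "why did the plan change?" is answerable from the DB.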

B. Fingerprint = SHA256(normalized text) is a dumb identity

Concrete alternative:
Identity should include location (source_file, line, heading path) or a stable anchor (like a UUID comment TODO[rivus:abc123]). Keep fingerprint as a similarity/dedupe hint, not the primary key.
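If the anchor-comment route is taken, extraction is a one-regex affair. The `TODO[rivus:...]` format is the text's own example; the helper name and hex-id assumption are hypothetical:

```python
import re
from typing import Optional

# Assumed anchor format from the example above: TODO[rivus:abc123]
ANCHOR = re.compile(r"TODO\[rivus:([0-9a-f]{6,})\]")

def extract_anchor(line: str) -> Optional[str]:
    """Return the stable anchor id if present, else None.

    Callers fall back to the text fingerprint (as a dedupe hint only)
    when a TODO has not been given an anchor yet.
    """
    m = ANCHOR.search(line)
    return m.group(1) if m else None
```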

C. tasks.id mixes two unrelated ID schemes

Concrete alternative:
Use INTEGER PRIMARY KEY (SQLite rowid) or UUID. Add explicit columns: origin_type, origin_id, check_run_id, etc.

D. “Internal tasks as jobs” is a forced metaphor


3) Missing failure modes (this will break immediately)

A. SQLite concurrency/locking is ignored

Fix: WAL mode, short transactions, a single writer queue, or separate DBs per subsystem.
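A minimal sketch of those settings via Python's sqlite3 (timeout values are illustrative):

```python
import sqlite3

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, timeout=5.0)  # block up to 5s on a locked DB
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=5000")   # retry instead of raising 'database is locked'
    conn.execute("PRAGMA foreign_keys=ON")     # SQLite leaves FK enforcement off by default
    return conn
```

WAL plus a busy timeout handles one writer and many readers; multiple concurrent writers still need a single writer queue on top.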

B. No concept of “stuck” / “timed out” / “abandoned”

Fix: add leasing: leased_by, leased_until, attempts, last_error, last_update.
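One way to sketch leasing as a single atomic claim (assumes SQLite 3.35+ for `RETURNING`; column and status names are illustrative):

```python
import sqlite3
import time

LEASE_SECONDS = 600  # assumed lease length; tune per task type

def claim_task(conn: sqlite3.Connection, worker: str):
    """Atomically lease the oldest runnable task.

    A crashed worker's lease simply expires (leased_until < now), so the
    task becomes claimable again instead of poisoning the queue forever.
    """
    now = int(time.time())
    with conn:
        row = conn.execute(
            """
            UPDATE tasks
            SET leased_by = ?, leased_until = ?, attempts = attempts + 1
            WHERE id = (
                SELECT id FROM tasks
                WHERE status = 'pending'
                  AND (leased_until IS NULL OR leased_until < ?)
                ORDER BY id LIMIT 1
            )
            RETURNING id
            """,
            (worker, now + LEASE_SECONDS, now),
        ).fetchone()
    return row[0] if row else None
```

Pair this with an `attempts` ceiling that moves a task to a dead-letter status after N failed leases.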

C. Check runs aren’t a real entity

Fix: create check_runs(id PK, check_id, started_at, completed_at, status, handler_version, error) and reference it from findings.

D. Your cache invalidation is broken by design

Fix: hash broader inputs:
- repo HEAD commit SHA (or diff range)
- plus a bounded “context set” (e.g., directory-level), not just self-reported files_involved
- store prompt_version and model in the hash too

E. Findings explode forever (no dedupe/suppression)

Fix: add finding_fingerprint (e.g., type+file+line+message normalized) and an “open/closed” lifecycle. Or store only deltas per run.
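A sketch of such a fingerprint. The normalization strategy is a judgment call: here digits in the message are masked so line-number drift still dedupes; the stricter variant suggested above would include the line number in the key:

```python
import hashlib
import re

def finding_fingerprint(check_id: str, finding_type: str,
                        path: str, message: str) -> str:
    # Mask digits so "broken link on line 42" and "... line 57"
    # collapse to the same open finding across runs.
    normalized = re.sub(r"\d+", "N", message.strip().lower())
    key = "|".join((check_id, finding_type, path, normalized))
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```

On each check run: upsert by fingerprint, mark matched findings as still open, and close any previously open fingerprint the run did not emit.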

F. Schedule semantics are hand-wavy

Fix: next_run_at, schedule_type, schedule_value, and enforce single active run per check.
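A sketch of interval-style schedule semantics; the `"6h"`-style `schedule_value` grammar is an assumption, not from the design:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def compute_next_run(schedule_type: str, schedule_value: str,
                     now: Optional[datetime] = None) -> str:
    """Compute next_run_at as an ISO8601 UTC string after a run completes."""
    now = now or datetime.now(timezone.utc)
    if schedule_type == "interval":
        # schedule_value like "30m", "6h", "1d" (assumed grammar)
        unit = {"m": "minutes", "h": "hours", "d": "days"}[schedule_value[-1]]
        delta = timedelta(**{unit: int(schedule_value[:-1])})
        return (now + delta).isoformat()
    raise ValueError(f"unsupported schedule_type: {schedule_type!r}")
```

The scheduler then selects checks with `next_run_at <= now` and no currently active run, which is what makes "single active run per check" enforceable.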

G. ClickUp sync failure modes ignored

Fix: store clickup_sync_state, last_sync_error, idempotency_key, backoff.


4) Premature generalization (designed for imaginary users/scale)

A. “Publishable runner others would want”

B. LLM scoping for 528 items up front is waste

Alternative:
Scope only:
- items touched recently (git blame / last modified)
- items in high-signal directories
- top N by simple heuristic (tags, prefixes, owner)


5) Schema problems (painful querying, migration, correctness)

A. TEXT timestamps everywhere = garbage data

Fix: store as INTEGER unix epoch or ISO8601 UTC with validation. Add indices.

B. JSON stuffed into TEXT columns guarantees query pain

Fix: normalize:
- assessment_files(assessment_id, path)
- assessment_deps(assessment_id, depends_on_fingerprint)
- task_results if you need structure, or keep result JSON but index key fields separately
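The first two join tables might look like this in SQLite (index choice is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessment_files (
    assessment_id INTEGER NOT NULL,
    path          TEXT NOT NULL,
    PRIMARY KEY (assessment_id, path)
);
CREATE TABLE assessment_deps (
    assessment_id          INTEGER NOT NULL,
    depends_on_fingerprint TEXT NOT NULL,
    PRIMARY KEY (assessment_id, depends_on_fingerprint)
);
-- "Which assessments touch this file?" becomes one indexed lookup
-- instead of a JSON scan over every row.
CREATE INDEX idx_assessment_files_path ON assessment_files(path);
""")

conn.execute("INSERT INTO assessment_files VALUES (1, 'lib/scanner.py')")
rows = conn.execute(
    "SELECT assessment_id FROM assessment_files WHERE path = ?",
    ("lib/scanner.py",)).fetchall()
```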

C. No indexes declared where you obviously need them

D. todos.status new|assessed|stale is underspecified

E. Foreign keys are half-baked

Fix: add origin_finding_id or a generic origin_type/origin_id.


6) Sequencing errors (you’re building the hard stuff first)

A. Phase plan is backwards

Better sequence:
1) Implement tasks.db + scanner + simple “scope on demand” command
2) Add checks with proper check_runs + finding dedupe
3) Add task execution
4) Only then consider runner unification if duplication is painful

B. “Schema compatible enough with jobs.db” is a lie


7) The author’s biggest blind spot (baked-in assumption that will ruin this)

You assume LLM outputs are stable, correct, and cacheable “indefinitely”

Concrete fix:
- Treat assessment output as suggestions with confidence + provenance.
- Store: model, prompt_version, temperature, confidence, evidence (file snippets used).
- Require human confirmation for dependencies and decomposition before execution.
- Add a “verification” stage: run static checks (ripgrep references, imports) to validate files_involved and dependency claims before trusting them.
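The cheapest possible verification pass is just checking that claimed paths exist; function name and signature are hypothetical, and a ripgrep cross-reference check could layer on top:

```python
from pathlib import Path
from typing import List

def verify_files_involved(repo_root: str, files_involved: List[str]) -> List[str]:
    """Return only the LLM-claimed paths that actually exist in the repo.

    Run before trusting an assessment: a low verified/claimed ratio is a
    signal to re-prompt, lower confidence, or flag the row for human review.
    """
    root = Path(repo_root)
    return [f for f in files_involved if (root / f).is_file()]
```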


High-impact fixes you should make immediately (before writing any runner platform)

1) Add check_runs table and make findings reference it.
2) Add finding dedupe + lifecycle (open/closed/ignored) so checks don’t spam forever.
3) Stop using todo-text fingerprint as primary identity; introduce stable IDs.
4) Make assessments versioned (history), not 1:1 with todo fingerprint.
5) Normalize files_involved and dependencies into join tables.
6) Add leasing/attempt/error fields to tasks so crashes don’t poison the queue.
7) Delete the “publishable plugin runner” ambition from this design until the task system works end-to-end.

=== GROK ===

1. Overengineering — what is more complex than it needs to be?

2. Wrong abstractions — where are the abstraction boundaries in the wrong place?

3. Missing failure modes — what will break that is not addressed?

4. Premature generalization — what is designed for hypothetical users/scale that does not exist?

5. Schema problems — what will be painful to query, migrate, or maintain?

6. Sequencing errors — what is in the wrong phase or depends on something not yet built?

7. The thing the author is most blind to — what assumption is so baked in they cannot see it?

The baked-in assumption: LLMs magically produce accurate scoping/estimates/dependencies from TODO text + “relevant code snippets.” Reality: TODOs like “wire up principle application tracking” are vague handwaves; no code snippets exist pre-assessment (nothing “mentions” the relevant files), so the LLM hallucinates files_involved=["guesses.py"], a wrong effort estimate ('medium'), and phantom dependencies. This “key difference” (Scoping section) crumbles: prioritization stays “guesswork” (Problem #2), but now with $0.02 hallucinations per item and invalid cache churn. Alternative: a human skims the 528 items once (1-2 hours) and sets effort/priority in YAML; rescan deltas only. LLMs excel at generation, not at estimation without training data.

=== GEMINI ===


1. Overengineering

The generic runner platform (lib/runner) is a massive distraction. You are setting out to build an internal task list and somehow talked yourself into building a competitor to Prefect/Dagster. Section "Runner Architecture: Core + Plugins" discusses publishing this as an open-source library, defining plugin contracts, and solving "hard problems that every pipeline builder faces."
* The reality: You need to run 8-12 internal Python scripts (doc-health, convention-scan). You do not need a generic, plugin-based, multi-stage DAG orchestrator for this.
* Alternative: Use standard asyncio tasks or a basic cron-like loop for your 12 checks. Do not build a generic pipeline framework.

2. Wrong Abstractions

Conflating recurring checks with one-off code tasks. You are trying to force two fundamentally different concepts into the same execution model. A recurring doc-health check that yields 50 "broken link" findings is a continuous monitoring process. A TODO like "Implement reduce image iteration" is a discrete, stateful project management entity. Forcing both through the same tasks.db pipeline just so you can use the same runner will result in a bloated schema that serves neither well.
* Alternative: Split the system. Monitoring/Linting (Checks/Findings) belongs in a CI/CD or background worker model. Project Management (TODOs/Tasks) belongs in an issue tracker.

The todos → assessments → tasks hierarchy is bureaucratic. You have abstracted a single unit of work across three tables. If a developer manually deletes a TODO from the code, what happens to the execution state of the 5 sub-tasks in the tasks table? The state machine spanning these three tables is going to be a nightmare to keep synchronized.

3. Missing Failure Modes

The assessment cache invalidation will bankrupt you or DDoS your LLM. You state: cached_hash is a hash of (todo text + contents of files_involved). If a TODO item touches a core file (e.g., main.py or utils.py), any subsequent commit to that file will change the file hash, instantly invalidating the assessment and triggering a re-read and re-assessment by the LLM on the next scan. Your assumption of "$5-25 one-time, then incremental" is dead wrong: you will be re-assessing dozens of unchanged TODOs on almost every git push.
* Alternative: Do not hash file contents. Assess TODOs on demand when picked up for execution, or only invalidate if the line number/text of the TODO itself changes.

Split-brain state with ClickUp. You are doing a "one-way push to start" to ClickUp. The moment a human looks at ClickUp and clicks "Done" or changes a priority, your DB and ClickUp are out of sync. Because the DB is the source of truth, the next sync will likely overwrite the human's changes or ignore them, making the ClickUp UI untrustworthy.
* Alternative: If you use ClickUp, ClickUp must be the source of truth for task state. Sync TODOs into ClickUp, and let the autonomous worker poll ClickUp for its queue.

4. Premature Generalization

Floating-point, composite-scored priorities. You define priority as a REAL (0.0-1.0) calculated via 0.4 * urgency + 0.3 * impact + 0.3 * effort_inverse. This is an academic fantasy. When a production bug hits, or you just really want to do a specific task, how do you force a float-based composite scoring system to put it at the top? You will end up writing hacky overrides.
* Alternative: Stick to integers or discrete buckets (P0, P1, P2, P3). Manual override is a hard requirement for any task system.

5. Schema Problems

Storing relational graph data in JSON strings. In assessments, you have dependencies TEXT, -- JSON list of fingerprints. You cannot efficiently query a SQLite JSON list to find unblocked tasks, perform topological sorts, or cascade deletes. Every time the planner needs to find the next executable task, it will have to load the entire table into memory and parse JSON to build the dependency graph.
* Alternative: Create a standard junction table: task_dependencies (blocked_fingerprint, blocking_fingerprint).
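With a junction table, "find unblocked tasks" collapses to a single query (schema pared down to the relevant columns; names taken from the alternative above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (fingerprint TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE task_dependencies (
    blocked_fingerprint  TEXT NOT NULL,
    blocking_fingerprint TEXT NOT NULL,
    PRIMARY KEY (blocked_fingerprint, blocking_fingerprint)
);
""")

# A pending task is ready when none of its blockers is still undone.
UNBLOCKED = """
SELECT t.fingerprint FROM tasks t
WHERE t.status = 'pending'
  AND NOT EXISTS (
    SELECT 1 FROM task_dependencies d
    JOIN tasks b ON b.fingerprint = d.blocking_fingerprint
    WHERE d.blocked_fingerprint = t.fingerprint
      AND b.status != 'done'
  )
"""

conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("a", "pending"), ("b", "pending")])
conn.execute("INSERT INTO task_dependencies VALUES ('a', 'b')")  # a blocked by b
ready = [r[0] for r in conn.execute(UNBLOCKED)]
```

No table scan into memory, no JSON parsing, and cascade deletes become ordinary foreign-key behavior.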

Polymorphic composite primary keys. In tasks, id TEXT PRIMARY KEY, -- plan-{hex8} or check-{name}-{date}. Do not embed business logic (whether it's a plan or a check, and its date) into a primary key string. It makes joins slower, foreign keys brittle, and migrations painful.
* Alternative: Use standard UUIDs or auto-incrementing integers for PKs. Put type, name, and date in their own typed columns.

6. Sequencing Errors

Blocking the task system on a massive legacy rewrite. Look at your Implementation Strategy. Phase 1 is rewriting the production jobs runner into a domain-free generic library. Phase 2 is building the task system on top of it. You are blocking the exact thing you want to build (the unified task system) behind a massive, risky refactor of your existing, working production infrastructure. If Phase 1 hits a snag, your task system is delayed indefinitely.
* Alternative: Build the task system natively first using a simple loop. Prove the LLM scoping and task decomposition actually works. If it proves valuable, then refactor the old jobs system to use the new pattern.

7. The Ultimate Blind Spot

Assuming code-comment TODOs contain enough context for project planning. Your entire "Scoping" step relies on an LLM reading a TODO comment like // TODO: refactor session handling and generating a concrete, 3-day multi-file implementation plan with accurate effort estimates and sub-tasks. You are entirely blind to the fact that developers write TODOs precisely because they don't want to think through the implementation details at that moment. The code snippets surrounding a TODO rarely contain the systemic context required to estimate "real effort" or "risk". The LLM will confidently hallucinate files involved, invent dependencies, and generate garbage sub-tasks. You are building an elaborate, automated pipeline to feed garbage into ClickUp.