Design Critique: Unified Task System

Document: docs/plans/2026-02-26-unified-task-system-design.md (V1)
Reviewed by: GPT 5.2, Grok 4.1, Gemini 3.1 Pro
Synthesized by: Sonnet 4.6
Generated: 2026-02-26 09:18 PT

Themes: Schema, Architecture, Failure Modes, Sequencing, Process

=== GPT ===

1) Overengineering (you built a mini Jira + a mini orchestrator + a mini ETL platform)

A. “Unified runner platform” is a massive detour

Concrete alternative:
Keep autonomy separate. Implement tasks.db + scanner + scoper + simple scheduler loop. If later you want convergence with jobs, do it when autonomy is stable and proves value. Right now “lib/runner/” is pure yak shaving.

B. Checks + findings + tasks is 3 overlapping concepts

Concrete alternative:
Use one “work_items” table with a kind column (todo, check_run, finding, task) and a parent_id. Or simpler: keep findings but don’t invent tasks for check-driven work—treat “fix” as a field on finding (resolution_status, resolution_task_id).

C. Numerical priority 0.0–1.0 is fake precision

Concrete alternative:
Use an integer priority_rank (or p0/p1/p2) plus a separate sort_key computed at view time. Or store urgency/impact/effort and compute priority dynamically so you can change weighting without invalidating all assessments.
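The store-factors-compute-at-view-time variant can be sketched in a few lines. Field names, ranges, and weights below are illustrative, not from the design:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    urgency: int  # 0-3, higher = more urgent (illustrative scale)
    impact: int   # 0-3
    effort: int   # 0-3, higher = more work

def sort_key(a: Assessment) -> int:
    # Integer weights live in code, not in stored rows, so reweighting
    # never invalidates existing assessments.
    return 4 * a.urgency + 3 * a.impact - 2 * a.effort

backlog = [Assessment(urgency=1, impact=3, effort=3),
           Assessment(urgency=3, impact=1, effort=0)]
ranked = sorted(backlog, key=sort_key, reverse=True)
```

Changing the weights is a one-line code edit and a re-sort, not a migration.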

D. ClickUp sync is premature integration pain

Concrete alternative:
Export a markdown/CSV view first. If you must integrate, start with “create only” and never update; treat ClickUp as a read-only mirror. Or don’t do it until you have stable IDs and lifecycle.


2) Wrong abstractions (boundaries are in the wrong place)

A. assessments keyed by fingerprint assumes “one assessment per todo forever”

Concrete alternative:
assessments(id PK, fingerprint FK, assessed_at, model, prompt_version, input_hash, output_json, is_current) and keep history. Set todos.status based on existence of a current assessment.
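A minimal SQLite sketch of that versioned table, with a partial unique index enforcing "at most one current assessment per fingerprint" (column types and the helper name are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessments (
    id             INTEGER PRIMARY KEY,
    fingerprint    TEXT NOT NULL,
    assessed_at    TEXT NOT NULL,   -- ISO8601 UTC
    model          TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    input_hash     TEXT NOT NULL,
    output_json    TEXT NOT NULL,
    is_current     INTEGER NOT NULL DEFAULT 1
);
-- At most one current assessment per todo; older rows remain as history.
CREATE UNIQUE INDEX idx_assessments_current
    ON assessments(fingerprint) WHERE is_current = 1;
""")

def record_assessment(conn, fingerprint, assessed_at, model,
                      prompt_version, input_hash, output_json):
    # Supersede the previous current row, then insert the new one,
    # in a single transaction so history is never ambiguous.
    with conn:
        conn.execute("UPDATE assessments SET is_current = 0"
                     " WHERE fingerprint = ? AND is_current = 1",
                     (fingerprint,))
        conn.execute("INSERT INTO assessments (fingerprint, assessed_at,"
                     " model, prompt_version, input_hash, output_json)"
                     " VALUES (?, ?, ?, ?, ?, ?)",
                     (fingerprint, assessed_at, model,
                      prompt_version, input_hash, output_json))
```

Re-assessing a todo appends a row instead of overwriting, so "why did the plan change?" is answerable from the DB.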

B. Fingerprint = SHA256(normalized text) is a dumb identity

Concrete alternative:
Identity should include location (source_file, line, heading path) or a stable anchor (like a UUID comment TODO[rivus:abc123]). Keep fingerprint as a similarity/dedupe hint, not the primary key.
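If the anchor-comment route is taken, extraction is a one-regex affair. The `TODO[rivus:...]` format is the text's own example; the helper name and hex-id assumption are hypothetical:

```python
import re
from typing import Optional

# Assumed anchor format from the example above: TODO[rivus:abc123]
ANCHOR = re.compile(r"TODO\[rivus:([0-9a-f]{6,})\]")

def extract_anchor(line: str) -> Optional[str]:
    """Return the stable anchor id if present, else None.

    Callers fall back to the text fingerprint (as a dedupe hint only)
    when a TODO has not been given an anchor yet.
    """
    m = ANCHOR.search(line)
    return m.group(1) if m else None
```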

C. tasks.id mixes two unrelated ID schemes

Concrete alternative:
Use INTEGER PRIMARY KEY (SQLite rowid) or UUID. Add explicit columns: origin_type, origin_id, check_run_id, etc.

D. “Internal tasks as jobs” is a forced metaphor


3) Missing failure modes (this will break immediately)

A. SQLite concurrency/locking is ignored

Fix: WAL mode, short transactions, a single writer queue, or separate DBs per subsystem.
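A minimal sketch of those settings via Python's sqlite3 (timeout values are illustrative):

```python
import sqlite3

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, timeout=5.0)  # block up to 5s on a locked DB
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=5000")   # retry instead of raising 'database is locked'
    conn.execute("PRAGMA foreign_keys=ON")     # SQLite leaves FK enforcement off by default
    return conn
```

WAL plus a busy timeout handles one writer and many readers; multiple concurrent writers still need a single writer queue on top.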

B. No concept of “stuck” / “timed out” / “abandoned”

Fix: add leasing: leased_by, leased_until, attempts, last_error, last_update.
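One way to sketch leasing as a single atomic claim (assumes SQLite 3.35+ for `RETURNING`; column and status names are illustrative):

```python
import sqlite3
import time

LEASE_SECONDS = 600  # assumed lease length; tune per task type

def claim_task(conn: sqlite3.Connection, worker: str):
    """Atomically lease the oldest runnable task.

    A crashed worker's lease simply expires (leased_until < now), so the
    task becomes claimable again instead of poisoning the queue forever.
    """
    now = int(time.time())
    with conn:
        row = conn.execute(
            """
            UPDATE tasks
            SET leased_by = ?, leased_until = ?, attempts = attempts + 1
            WHERE id = (
                SELECT id FROM tasks
                WHERE status = 'pending'
                  AND (leased_until IS NULL OR leased_until < ?)
                ORDER BY id LIMIT 1
            )
            RETURNING id
            """,
            (worker, now + LEASE_SECONDS, now),
        ).fetchone()
    return row[0] if row else None
```

Pair this with an `attempts` ceiling that moves a task to a dead-letter status after N failed leases.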

C. Check runs aren’t a real entity

Fix: create check_runs(id PK, check_id, started_at, completed_at, status, handler_version, error) and reference it from findings.

D. Your cache invalidation is broken by design

Fix: hash broader inputs:
- repo HEAD commit SHA (or diff range)
- plus a bounded “context set” (e.g., directory-level), not just self-reported files_involved
- store prompt_version and model in the hash too

E. Findings explode forever (no dedupe/suppression)

Fix: add finding_fingerprint (e.g., type+file+line+message normalized) and an “open/closed” lifecycle. Or store only deltas per run.
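A sketch of such a fingerprint. The normalization strategy is a judgment call: here digits in the message are masked so line-number drift still dedupes; the stricter variant suggested above would include the line number in the key:

```python
import hashlib
import re

def finding_fingerprint(check_id: str, finding_type: str,
                        path: str, message: str) -> str:
    # Mask digits so "broken link on line 42" and "... line 57"
    # collapse to the same open finding across runs.
    normalized = re.sub(r"\d+", "N", message.strip().lower())
    key = "|".join((check_id, finding_type, path, normalized))
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```

On each check run: upsert by fingerprint, mark matched findings as still open, and close any previously open fingerprint the run did not emit.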

F. Schedule semantics are hand-wavy

Fix: next_run_at, schedule_type, schedule_value, and enforce single active run per check.
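A sketch of interval-style schedule semantics; the `"6h"`-style `schedule_value` grammar is an assumption, not from the design:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def compute_next_run(schedule_type: str, schedule_value: str,
                     now: Optional[datetime] = None) -> str:
    """Compute next_run_at as an ISO8601 UTC string after a run completes."""
    now = now or datetime.now(timezone.utc)
    if schedule_type == "interval":
        # schedule_value like "30m", "6h", "1d" (assumed grammar)
        unit = {"m": "minutes", "h": "hours", "d": "days"}[schedule_value[-1]]
        delta = timedelta(**{unit: int(schedule_value[:-1])})
        return (now + delta).isoformat()
    raise ValueError(f"unsupported schedule_type: {schedule_type!r}")
```

The scheduler then selects checks with `next_run_at <= now` and no currently active run, which is what makes "single active run per check" enforceable.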

G. ClickUp sync failure modes ignored

Fix: store clickup_sync_state, last_sync_error, idempotency_key, backoff.


4) Premature generalization (designed for imaginary users/scale)

A. “Publishable runner others would want”

B. LLM scoping for 528 items up front is waste

Alternative:
Scope only:
- items touched recently (git blame / last modified)
- items in high-signal directories
- top N by simple heuristic (tags, prefixes, owner)


5) Schema problems (painful querying, migration, correctness)

A. TEXT timestamps everywhere = garbage data

Fix: store as INTEGER unix epoch or ISO8601 UTC with validation. Add indices.

B. JSON stuffed into TEXT columns guarantees query pain

Fix: normalize:
- assessment_files(assessment_id, path)
- assessment_deps(assessment_id, depends_on_fingerprint)
- task_results if you need structure, or keep result JSON but index key fields separately
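The first two join tables might look like this in SQLite (index choice is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessment_files (
    assessment_id INTEGER NOT NULL,
    path          TEXT NOT NULL,
    PRIMARY KEY (assessment_id, path)
);
CREATE TABLE assessment_deps (
    assessment_id          INTEGER NOT NULL,
    depends_on_fingerprint TEXT NOT NULL,
    PRIMARY KEY (assessment_id, depends_on_fingerprint)
);
-- "Which assessments touch this file?" becomes one indexed lookup
-- instead of a JSON scan over every row.
CREATE INDEX idx_assessment_files_path ON assessment_files(path);
""")

conn.execute("INSERT INTO assessment_files VALUES (1, 'lib/scanner.py')")
rows = conn.execute(
    "SELECT assessment_id FROM assessment_files WHERE path = ?",
    ("lib/scanner.py",)).fetchall()
```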

C. No indexes declared where you obviously need them

D. todos.status new|assessed|stale is underspecified

E. Foreign keys are half-baked

Fix: add origin_finding_id or a generic origin_type/origin_id.


6) Sequencing errors (you’re building the hard stuff first)

A. Phase plan is backwards

Better sequence:
1) Implement tasks.db + scanner + simple “scope on demand” command
2) Add checks with proper check_runs + finding dedupe
3) Add task execution
4) Only then consider runner unification if duplication is painful

B. “Schema compatible enough with jobs.db” is a lie


7) The author’s biggest blind spot (baked-in assumption that will ruin this)

You assume LLM outputs are stable, correct, and cacheable “indefinitely”

Concrete fix:
- Treat assessment output as suggestions with confidence + provenance.
- Store: model, prompt_version, temperature, confidence, evidence (file snippets used).
- Require human confirmation for dependencies and decomposition before execution.
- Add a “verification” stage: run static checks (ripgrep references, imports) to validate files_involved and dependency claims before trusting them.
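The cheapest possible verification pass is just checking that claimed paths exist; function name and signature are hypothetical, and a ripgrep cross-reference check could layer on top:

```python
from pathlib import Path
from typing import List

def verify_files_involved(repo_root: str, files_involved: List[str]) -> List[str]:
    """Return only the LLM-claimed paths that actually exist in the repo.

    Run before trusting an assessment: a low verified/claimed ratio is a
    signal to re-prompt, lower confidence, or flag the row for human review.
    """
    root = Path(repo_root)
    return [f for f in files_involved if (root / f).is_file()]
```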


High-impact fixes you should make immediately (before writing any runner platform)

1) Add check_runs table and make findings reference it.
2) Add finding dedupe + lifecycle (open/closed/ignored) so checks don’t spam forever.
3) Stop using todo-text fingerprint as primary identity; introduce stable IDs.
4) Make assessments versioned (history), not 1:1 with todo fingerprint.
5) Normalize files_involved and dependencies into join tables.
6) Add leasing/attempt/error fields to tasks so crashes don’t poison the queue.
7) Delete the “publishable plugin runner” ambition from this design until the task system works end-to-end.

=== GROK ===

1. Overengineering — what is more complex than it needs to be?

2. Wrong abstractions — where are the abstraction boundaries in the wrong place?

3. Missing failure modes — what will break that is not addressed?

4. Premature generalization — what is designed for hypothetical users/scale that does not exist?

5. Schema problems — what will be painful to query, migrate, or maintain?

6. Sequencing errors — what is in the wrong phase or depends on something not yet built?

7. The thing the author is most blind to — what assumption is so baked in they cannot see it?

The baked-in assumption: LLMs magically produce accurate scoping/estimates/dependencies from TODO text + “relevant code snippets.” Reality: TODOs like “wire up principle application tracking” are vague handwaves; no code snippets exist pre-assessment (nothing “mentions” the relevant files), so the LLM hallucinates files_involved=["guesses.py"], a wrong effort estimate ('medium'), and phantom dependencies. This “key difference” (Scoping section) crumbles: prioritization stays “guesswork” (Problem #2), but now with $0.02 hallucinations per item and invalid cache churn. Alternative: a human skims the 528 items once (1-2 hours) and sets effort/priority in YAML; rescan deltas only. LLMs excel at generation, not at estimation without training data.

=== GEMINI ===


1. Overengineering

The generic runner platform (lib/runner) is a massive distraction. You are setting out to build an internal task list and somehow talked yourself into building a competitor to Prefect/Dagster. Section "Runner Architecture: Core + Plugins" discusses publishing this as an open-source library, defining plugin contracts, and solving "hard problems that every pipeline builder faces."
* The reality: You need to run 8-12 internal Python scripts (doc-health, convention-scan). You do not need a generic, plugin-based, multi-stage DAG orchestrator for this.
* Alternative: Use standard asyncio tasks or a basic cron-like loop for your 12 checks. Do not build a generic pipeline framework.

2. Wrong Abstractions

Conflating recurring checks with one-off code tasks. You are trying to force two fundamentally different concepts into the same execution model. A recurring doc-health check that yields 50 "broken link" findings is a continuous monitoring process. A TODO like "Implement reduce image iteration" is a discrete, stateful project management entity. Forcing both through the same tasks.db pipeline just so you can use the same runner will result in a bloated schema that serves neither well.
* Alternative: Split the system. Monitoring/Linting (Checks/Findings) belongs in a CI/CD or background worker model. Project Management (TODOs/Tasks) belongs in an issue tracker.

The todos → assessments → tasks hierarchy is bureaucratic. You have abstracted a single unit of work across three tables. If a developer manually deletes a TODO from the code, what happens to the execution state of the 5 sub-tasks in the tasks table? The state machine spanning these three tables is going to be a nightmare to keep synchronized.

3. Missing Failure Modes

The assessment cache invalidation will bankrupt you or DDoS your LLM. You state: cached_hash is a hash of (todo text + contents of files_involved). If a TODO item touches a core file (e.g., main.py or utils.py), any subsequent commit to that file will change the file hash, instantly invalidating the assessment and triggering a re-read and re-assessment by the LLM on the next scan. Your assumption of "$5-25 one-time, then incremental" is dead wrong: you will be re-assessing dozens of unchanged TODOs on almost every git push.
* Alternative: Do not hash file contents. Assess TODOs on demand when picked up for execution, or only invalidate if the line number/text of the TODO itself changes.

Split-brain state with ClickUp. You are doing a "one-way push to start" to ClickUp. The moment a human looks at ClickUp and clicks "Done" or changes a priority, your DB and ClickUp are out of sync. Because the DB is the source of truth, the next sync will likely overwrite the human's changes or ignore them, making the ClickUp UI untrustworthy.
* Alternative: If you use ClickUp, ClickUp must be the source of truth for task state. Sync TODOs into ClickUp, and let the autonomous worker poll ClickUp for its queue.

4. Premature Generalization

Floating-point, composite-scored priorities. You define priority as a REAL (0.0-1.0) calculated via 0.4 * urgency + 0.3 * impact + 0.3 * effort_inverse. This is an academic fantasy. When a production bug hits, or you just really want to do a specific task, how do you force a float-based composite scoring system to put it at the top? You will end up writing hacky overrides.
* Alternative: Stick to integers or discrete buckets (P0, P1, P2, P3). Manual override is a hard requirement for any task system.

5. Schema Problems

Storing relational graph data in JSON strings. In assessments, you have dependencies TEXT, -- JSON list of fingerprints. You cannot efficiently query a SQLite JSON list to find unblocked tasks, perform topological sorts, or cascade deletes. Every time the planner needs to find the next executable task, it will have to load the entire table into memory and parse JSON to build the dependency graph.
* Alternative: Create a standard junction table: task_dependencies (blocked_fingerprint, blocking_fingerprint).
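With a junction table, "find unblocked tasks" collapses to a single query (schema pared down to the relevant columns; names taken from the alternative above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (fingerprint TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE task_dependencies (
    blocked_fingerprint  TEXT NOT NULL,
    blocking_fingerprint TEXT NOT NULL,
    PRIMARY KEY (blocked_fingerprint, blocking_fingerprint)
);
""")

# A pending task is ready when none of its blockers is still undone.
UNBLOCKED = """
SELECT t.fingerprint FROM tasks t
WHERE t.status = 'pending'
  AND NOT EXISTS (
    SELECT 1 FROM task_dependencies d
    JOIN tasks b ON b.fingerprint = d.blocking_fingerprint
    WHERE d.blocked_fingerprint = t.fingerprint
      AND b.status != 'done'
  )
"""

conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("a", "pending"), ("b", "pending")])
conn.execute("INSERT INTO task_dependencies VALUES ('a', 'b')")  # a blocked by b
ready = [r[0] for r in conn.execute(UNBLOCKED)]
```

No table scan into memory, no JSON parsing, and cascade deletes become ordinary foreign-key behavior.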

Polymorphic composite primary keys. In tasks, id TEXT PRIMARY KEY, -- plan-{hex8} or check-{name}-{date}. Do not embed business logic (whether it's a plan or a check, and its date) into a primary key string. It makes joins slower, foreign keys brittle, and migrations painful.
* Alternative: Use standard UUIDs or auto-incrementing integers for PKs. Put type, name, and date in their own typed columns.

6. Sequencing Errors

Blocking the task system on a massive legacy rewrite. Look at your Implementation Strategy. Phase 1 is rewriting the production jobs runner into a domain-free generic library. Phase 2 is building the task system on top of it. You are blocking the exact thing you want to build (the unified task system) behind a massive, risky refactor of your existing, working production infrastructure. If Phase 1 hits a snag, your task system is delayed indefinitely.
* Alternative: Build the task system natively first using a simple loop. Prove the LLM scoping and task decomposition actually works. If it proves valuable, then refactor the old jobs system to use the new pattern.

7. The Ultimate Blind Spot

Assuming code-comment TODOs contain enough context for project planning. Your entire "Scoping" step relies on an LLM reading a TODO comment like // TODO: refactor session handling and generating a concrete, 3-day multi-file implementation plan with accurate effort estimates and sub-tasks. You are entirely blind to the fact that developers write TODOs precisely because they don't want to think through the implementation details at that moment. The code snippets surrounding a TODO rarely contain the systemic context required to estimate "real effort" or "risk". The LLM will confidently hallucinate files involved, invent dependencies, and generate garbage sub-tasks. You are building an elaborate, automated pipeline to feed garbage into ClickUp.