Document: docs/plans/2026-02-26-unified-task-system-design.md (V1)
Reviewed by: GPT 5.2, Grok 4.1, Gemini 3.1 Pro
Synthesized by: Sonnet 4.6
Generated: 2026-02-26 09:18 PT
Consolidated recommendations:

- Normalize `files_involved`, `dependencies`, and results into junction tables.
- Index `status`, `priority`, `check_id`, `created_at` at minimum.
- Replace `plan-{hex8}`/`check-{name}-{date}` IDs with UUID + typed columns.
- Version assessments: `id` PK, `assessed_at`, `model`, `prompt_version`, `is_current` columns.
- Add CHECK constraints or integer codes for all status/effort/tier fields.
- `todos.status` is missing "ignored/wontfix" — add `ignored` and `wontfix` as valid status values immediately.
- Add `origin_type` + `origin_id` columns to tasks for full traceability.
- Drop the `lib/runner/` plugin architecture; use a simple asyncio loop or cron.
- Split into `checks`/`findings`/`tasks`, or use a single `work_items` table with a `kind` column.
- The `todos → assessments → tasks` hierarchy is over-abstracted — flatten: inline assessment fields onto todos, spawn tasks only for decomposition.
- Do not hash `files_involved` contents for cache invalidation.
- Store `model`, `prompt_version`, `confidence`; require human confirmation before execution.
- Add `idempotency_key`, `last_sync_error`, `clickup_sync_state`, and retry backoff.
- Add a `finding_fingerprint` unique key and open/closed/ignored status to prevent infinite duplicates.
- Add `leased_by`, `leased_until`, `attempts`, `last_error` fields to tasks.
- `check_runs` is not a real entity — create a `check_runs(id, check_id, started_at, completed_at, status, handler_version, error)` table.
- Add `next_run_at`, enforce a single active run per check, define on_commit/on_idle triggers concretely.
- Build `tasks.db` + scanner + minimal CLI first; unify the runner only after autonomy proves value.
- Replace the REAL 0.0–1.0 priority with P0/P1/P2/P3 buckets; support manual override as a hard requirement.
- Verify `files_involved` with static analysis (ripgrep/imports) before trusting or caching it.

=== GPT ===
Concrete alternative:
Keep autonomy separate. Implement tasks.db + scanner + scoper + simple scheduler loop. If later you want convergence with jobs, do it when autonomy is stable and proves value. Right now “lib/runner/” is pure yak shaving.
The checks/findings/tasks split is over-modeled: `checks` (definitions), `findings` (issues detected), `tasks` (things to execute) — why three tables when a single row could carry origin=check_run and kind=finding? Concrete alternative:
Use one “work_items” table with a kind column (todo, check_run, finding, task) and a parent_id. Or simpler: keep findings but don’t invent tasks for check-driven work—treat “fix” as a field on finding (resolution_status, resolution_task_id).
Concrete alternative:
Use an integer priority_rank (or p0/p1/p2) plus a separate sort_key computed at view time. Or store urgency/impact/effort and compute priority dynamically so you can change weighting without invalidating all assessments.
Concrete alternative:
Export a markdown/CSV view first. If you must integrate, start with “create only” and never update; treat ClickUp as a read-only mirror. Or don’t do it until you have stable IDs and lifecycle.
`assessments` keyed by fingerprint assumes "one assessment per todo forever": `assessments(fingerprint PRIMARY KEY REFERENCES todos)` leaves no room for re-assessment history or for `files_involved` to evolve. Concrete alternative:
assessments(id PK, fingerprint FK, assessed_at, model, prompt_version, input_hash, output_json, is_current) and keep history. Set todos.status based on existence of a current assessment.
`todos(fingerprint TEXT PRIMARY KEY)` makes the text-derived fingerprint the primary identity, so identity breaks the moment the todo text changes. Concrete alternative:
Identity should include location (source_file, line, heading path) or a stable anchor (like a UUID comment TODO[rivus:abc123]). Keep fingerprint as a similarity/dedupe hint, not the primary key.
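A sketch of that identity rule, assuming the anchor syntax the review proposes (`TODO[rivus:abc123]`); the function name and fallback format are illustrative:

```python
import re

# Hypothetical stable-anchor syntax from the review: TODO[rivus:<hex id>]
ANCHOR_RE = re.compile(r"TODO\[rivus:([0-9a-f]{6,})\]")

def todo_identity(source_file: str, line_no: int, line: str) -> str:
    """Prefer the embedded stable anchor; fall back to location.
    The text-hash fingerprint stays a dedupe hint, never the key."""
    m = ANCHOR_RE.search(line)
    if m:
        return f"anchor:{m.group(1)}"
    return f"loc:{source_file}:{line_no}"
```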
`tasks.id` mixes two unrelated ID schemes: `tasks(id TEXT PRIMARY KEY, -- plan-{hex8} or check-{name}-{date})`. Concrete alternative:
Use INTEGER PRIMARY KEY (SQLite rowid) or UUID. Add explicit columns: origin_type, origin_id, check_run_id, etc.
A single shared `tasks.db` invites `database is locked` errors, especially with long transactions or the Gradio dashboard reading concurrently. Fix: WAL mode, short transactions, a single writer queue, or separate DBs per subsystem.
`tasks.status` is `pending|in_progress|done|failed`; when a worker dies, `in_progress` stays forever. There is no heartbeat, no `leased_until`, no `attempt_count`, no `last_error`. Fix: add leasing: `leased_by`, `leased_until`, `attempts`, `last_error`, `last_update`.
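The leasing fix can be sketched in a few lines of SQLite. The schema here is a minimal stand-in for the design's `tasks` table, and the lease duration is arbitrary; the point is that an expired lease makes a crashed task claimable again:

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
# Minimal stand-in schema with the leasing columns the review asks for.
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY, status TEXT, priority INTEGER,
    leased_by TEXT, leased_until INTEGER,
    attempts INTEGER DEFAULT 0, last_error TEXT)""")

def claim_next_task(worker_id: str, lease_s: int = 300):
    """Atomically lease one pending task, or reclaim an expired lease
    so a crashed worker cannot poison the queue."""
    now = int(time.time())
    with conn:  # one transaction: select + claim
        row = conn.execute(
            """SELECT id FROM tasks
               WHERE status = 'pending'
                  OR (status = 'in_progress' AND leased_until < ?)
               ORDER BY priority LIMIT 1""", (now,)).fetchone()
        if row is None:
            return None
        conn.execute(
            """UPDATE tasks SET status = 'in_progress', leased_by = ?,
               leased_until = ?, attempts = attempts + 1 WHERE id = ?""",
            (worker_id, now + lease_s, row[0]))
        return row[0]
```

A second worker calling `claim_next_task` before the lease expires gets nothing; after expiry it reclaims the task with `attempts` incremented.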
Runs exist only as `checks.last_run` and `findings.run_id` (string), plus a "runs table" mentioned in the diagram but not defined. Fix: create `check_runs(id PK, check_id, started_at, completed_at, status, handler_version, error)` and reference it from findings.
`cached_hash = hash(todo text + contents of files_involved)`, but `files_involved` is produced by the assessment itself. Catch-22: you need an assessment to know which files feed the hash that decides whether to re-assess. Fix: hash broader inputs:
- repo HEAD commit SHA (or diff range)
- plus a bounded “context set” (e.g., directory-level) not just self-reported files_involved
- store prompt_version and model in the hash too
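The broader cache key above can be sketched as a single function. The signature and separator are illustrative; `head_sha` is assumed to come from `git rev-parse HEAD`, and `context_dir` stands in for the bounded directory-level context set:

```python
import hashlib

def assessment_input_hash(todo_text: str, context_dir: str,
                          head_sha: str, model: str, prompt_version: str) -> str:
    """Cache key over todo text + bounded context + repo state + prompt
    identity, instead of the assessment's own self-reported files.
    Changing the model or prompt_version invalidates the cache."""
    parts = [todo_text, context_dir, head_sha, model, prompt_version]
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()
```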
`findings` has no fingerprint/dedupe key, so doc-health will create duplicate findings endlessly. Fix: add `finding_fingerprint` (e.g., type+file+line+message normalized) and an "open/closed" lifecycle. Or store only deltas per run.
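The normalized fingerprint suggested above could look like this sketch (the SHA-1 choice and whitespace/case normalization are my assumptions about what "normalized" means):

```python
import hashlib

def finding_fingerprint(ftype: str, file: str, line: int, message: str) -> str:
    """Dedupe key per the review's type+file+line+message recipe:
    the same issue reported on every run maps to one fingerprint."""
    norm = " ".join(message.lower().split())   # collapse case and whitespace
    raw = f"{ftype}|{file}|{line}|{norm}"
    return hashlib.sha1(raw.encode()).hexdigest()
```

A `UNIQUE` constraint on this column plus an `INSERT ... ON CONFLICT` upsert then turns each re-run into a status refresh rather than a new row.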
`checks.schedule` is `daily|weekly|on_commit|on_idle`, and `last_run` as TEXT is not enough for correctness. Fix: add `next_run_at`, `schedule_type`, `schedule_value`, and enforce a single active run per check.
The ClickUp sync has no etag/version handling. Fix: store `clickup_sync_state`, `last_sync_error`, `idempotency_key`, backoff.
Alternative:
Scope only:
- items touched recently (git blame / last modified)
- items in high-signal directories
- top N by simple heuristic (tags, prefixes, owner)
Timestamps (`first_seen`, `last_seen`, `assessed_at`, `started_at`, `completed_at`, `created_at`, `last_run`) have no consistent format. Fix: store as INTEGER unix epoch or ISO8601 UTC with validation. Add indices.
`assessments.files_involved TEXT`, `dependencies TEXT`, and `tasks.result TEXT` stuff structured data into strings. Fix: normalize:
- assessment_files(assessment_id, path)
- assessment_deps(assessment_id, depends_on_fingerprint)
- task_results if you need structure, or keep result JSON but index key fields separately.
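Those junction tables buy indexed reverse lookups that a JSON TEXT column cannot provide. A minimal SQLite sketch (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessment_files (assessment_id INTEGER, path TEXT);
CREATE TABLE assessment_deps  (assessment_id INTEGER, depends_on_fingerprint TEXT);
CREATE INDEX af_by_path ON assessment_files(path);
""")

def assessments_touching(path: str) -> list[int]:
    """'Which assessments claim this file?' -- an indexed query here,
    a full scan plus JSON parsing with files_involved TEXT."""
    rows = conn.execute(
        "SELECT assessment_id FROM assessment_files WHERE path = ?", (path,))
    return sorted(r[0] for r in rows)
```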
Missing indices on tasks(`status`, `priority`), todos(`status`, `last_seen`), and findings(`check_id`, `created_at`, `severity`).
`todos.status new|assessed|stale` is underspecified — what moves an item to stale, `last_seen`? Then you need transitions defined and automation.
`tasks.parent_fingerprint` references todos, but tasks can also come from findings/checks. You allow `check_id` but no link to finding. Fix: add `origin_finding_id` or a generic `origin_type`/`origin_id`.
Better sequence:
1) Implement tasks.db + scanner + simple “scope on demand” command
2) Add checks with proper check_runs + finding dedupe
3) Add task execution
4) Only then consider runner unification if duplication is painful
LLM output is written into todos/assessments/tasks/checks/findings as if it were ground truth. Concrete fix:
- Treat assessment output as suggestions with confidence + provenance.
- Store: model, prompt_version, temperature, confidence, evidence (file snippets used).
- Require human confirmation for dependencies and decomposition before execution.
- Add “verification” stage: run static checks (ripgrep references, imports) to validate files_involved and dependency claims before trusting them.
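A minimal sketch of that verification stage, covering only the cheapest check (claimed files actually exist); the function name and return shape are illustrative, and a real version would add the ripgrep/import cross-referencing the review describes:

```python
import os

def verify_files_involved(claimed: list[str], repo_root: str) -> dict:
    """Cheap static check before trusting LLM-claimed files_involved:
    paths that do not exist on disk are flagged, never cached."""
    verified, suspect = [], []
    for path in claimed:
        bucket = verified if os.path.isfile(os.path.join(repo_root, path)) else suspect
        bucket.append(path)
    return {"verified": verified, "suspect": suspect}
```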
1) Add check_runs table and make findings reference it.
2) Add finding dedupe + lifecycle (open/closed/ignored) so checks don’t spam forever.
3) Stop using todo-text fingerprint as primary identity; introduce stable IDs.
4) Make assessments versioned (history), not 1:1 with todo fingerprint.
5) Normalize files_involved and dependencies into join tables.
6) Add leasing/attempt/error fields to tasks so crashes don’t poison the queue.
7) Delete the “publishable plugin runner” ambition from this design until the task system works end-to-end.
=== GROK ===
- A single `tasks` table with a type enum (`'todo'`, `'check'`, `'finding'`), `parent_id`, `status`, `priority`, and metadata JSONB handles 90% of this without scoping/assessments as separate tables — just inline LLM-generated fields on demand.
- Put the handlers in a plain `checks.py`, run via cron/loop, no plugins needed.
- Flatten onto the `todos` table: add columns `title`, `summary`, `effort`, `priority`, etc., directly. One table for "scannable work items," spawn tasks only for decomposed ones.
- `check_id REFERENCES checks(id)` on tasks mixes lifecycles — findings should spawn new todos/findings, not tasks. Boundary: todos/assessments/checks produce events that trigger tasks, not direct refs.
- Use `source_file` + `line` as a composite PK, with the hash only as a quick dup-check index.
- You cannot tune `0.4 * urgency + 0.3 * impact` reliably without ground truth data.
- A hallucinated `files_involved` JSON leads to a wrong `cached_hash` and endless re-scoping loops. No fallback: if the LLM fails (rate limit, OOM), the todo stays 'new' forever.
- Tasks are listed via the CLI (`sup task list`); ClickUp sync updates `clickup_task_id` during execution, asynchronously — but what if a partial failure leaves orphan tasks? No cleanup for deleted ClickUp tasks.
- Add `task_dependencies(parent_fingerprint TEXT, child_fingerprint TEXT)`.
- A `current_fingerprint TEXT` column could paper over fingerprint drift, but it is a migration nightmare: scan all history.
- `SELECT * FROM tasks WHERE status='pending' ORDER BY priority` on 1000s of rows scans the whole table. Missing: INDEX on tasks(status, priority), todos(status), findings(check_id, run_id).
- Add `runs(id TEXT PK, check_id, started_at, completed_at)`.
- The design assumes LLMs magically produce accurate scoping/estimates/dependencies from TODO text + "relevant code snippets." Reality: TODOs like "wire up principle application tracking" are vague handwaves; no code snippets exist pre-assessment (nothing "mentions" files), so the LLM hallucinates `files_involved=["guesses.py"]`, a wrong effort ('medium'), and phantom deps. This "key difference" (Scoping section) crumbles — prioritization stays "guesswork" (Problem #2), but now with $0.02 hallucinations per item and invalid cache churn.
Alternative: Human skims 528 items once (1-2 hours), sets effort/priority in YAML; rescan deltas only. LLMs excel at generation, not estimation without training data.
=== GEMINI ===
Here is a brutal, direct teardown of the system design document.
The generic runner platform (lib/runner) is a massive distraction.
You are setting out to build an internal task list and somehow talked yourself into building a competitor to Prefect/Dagster. Section "Runner Architecture: Core + Plugins" discusses publishing this as an open-source library, defining plugin contracts, and solving "hard problems that every pipeline builder faces."
* The reality: You need to run 8-12 internal Python scripts (doc-health, convention-scan). You do not need a generic, plugin-based, multi-stage DAG orchestrator for this.
* Alternative: Use standard asyncio tasks or a basic cron-like loop for your 12 checks. Do not build a generic pipeline framework.
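That "basic cron-like loop" can be sketched in ~25 lines. Everything here is illustrative (the registry, decorator, and tick interval are not from the design); the one deliberate choice is that a failing check logs and continues rather than killing the loop:

```python
import asyncio, time
from typing import Awaitable, Callable

# name -> (interval seconds, handler); names and intervals are illustrative
CHECKS: dict[str, tuple[int, Callable[[], Awaitable[None]]]] = {}

def check(name: str, interval_s: int):
    """Decorator registering a check, e.g. @check("doc-health", 86400)."""
    def register(fn):
        CHECKS[name] = (interval_s, fn)
        return fn
    return register

async def run_checks_forever():
    next_run = {name: 0.0 for name in CHECKS}
    while True:
        now = time.time()
        for name, (interval, fn) in CHECKS.items():
            if now >= next_run[name]:
                try:
                    await fn()
                except Exception as exc:  # one failing check must not kill the loop
                    print(f"check {name} failed: {exc}")
                next_run[name] = now + interval
        await asyncio.sleep(0.05)          # tick; use a longer sleep in practice
```

For 8-12 checks this replaces the entire plugin runner: registration is a decorator, scheduling is a dict of deadlines, and there is nothing to publish.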
Conflating recurring checks with one-off code tasks.
You are trying to force two fundamentally different concepts into the same execution model. A recurring doc-health check that yields 50 "broken link" findings is a continuous monitoring process. A TODO like "Implement reduce image iteration" is a discrete, stateful project management entity. Forcing both through the same tasks.db pipeline just so you can use the same runner will result in a bloated schema that serves neither well.
* Alternative: Split the system. Monitoring/Linting (Checks/Findings) belongs in a CI/CD or background worker model. Project Management (TODOs/Tasks) belongs in an issue tracker.
The todos → assessments → tasks hierarchy is bureaucratic.
You have abstracted a single unit of work across three tables. If a developer manually deletes a TODO from the code, what happens to the execution state of the 5 sub-tasks in the tasks table? The state machine spanning these three tables is going to be a nightmare to keep synchronized.
The assessment cache invalidation will bankrupt you or DDOS your LLM.
You state: cached_hash is a hash of (todo text + contents of files_involved).
If a TODO item touches a core file (e.g., main.py or utils.py), any subsequent commit to that file will change the file hash. This will instantly invalidate the assessment for that TODO, triggering a re-read and re-assessment by the LLM on the next scan. Your assumption of "$5-25 one-time, then incremental" is dead wrong. You will be re-assessing dozens of unchanged TODOs on almost every git push.
* Alternative: Do not hash file contents. Assess TODOs on demand when picked up for execution, or only invalidate if the line number/text of the TODO itself changes.
Split-brain state with ClickUp.
You are doing a "one-way push to start" to ClickUp. The moment a human looks at ClickUp and clicks "Done" or changes a priority, your DB and ClickUp are out of sync. Because the DB is the source of truth, the next sync will likely overwrite the human's changes or ignore them, making the ClickUp UI untrustworthy.
* Alternative: If you use ClickUp, ClickUp must be the source of truth for task state. Sync TODOs into ClickUp, and let the autonomous worker poll ClickUp for its queue.
Floating-point, composite-scored priorities.
You define priority as a REAL (0.0-1.0) calculated via 0.4 * urgency + 0.3 * impact + 0.3 * effort_inverse. This is an academic fantasy. When a production bug hits, or you just really want to do a specific task, how do you force a float-based composite scoring system to put it at the top? You will end up writing hacky overrides.
* Alternative: Stick to integers or discrete buckets (P0, P1, P2, P3). Manual override is a hard requirement for any task system.
Storing relational graph data in JSON strings.
In assessments, you have dependencies TEXT, -- JSON list of fingerprints.
You cannot efficiently query a SQLite JSON list to find unblocked tasks, perform topological sorts, or cascade deletes. Every time the planner needs to find the next executable task, it will have to load the entire table into memory and parse JSON to build the dependency graph.
* Alternative: Create a standard junction table: task_dependencies (blocked_fingerprint, blocking_fingerprint).
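With that junction table, "find the next executable task" becomes one SQL query instead of a full-table JSON parse. A minimal sketch (schema reduced to the columns the query needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (fingerprint TEXT PRIMARY KEY, status TEXT);
CREATE TABLE task_dependencies (
    blocked_fingerprint TEXT, blocking_fingerprint TEXT);
""")

def unblocked_pending() -> list[str]:
    """Pending tasks whose blockers are all done -- a single query the
    planner can run, with no dependency graph built in memory."""
    return [r[0] for r in conn.execute("""
        SELECT t.fingerprint FROM tasks t
        WHERE t.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM task_dependencies d
              JOIN tasks b ON b.fingerprint = d.blocking_fingerprint
              WHERE d.blocked_fingerprint = t.fingerprint
                AND b.status != 'done')""")]
```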
Polymorphic composite primary keys.
In tasks, id TEXT PRIMARY KEY, -- plan-{hex8} or check-{name}-{date}.
Do not embed business logic (whether it's a plan or a check, and its date) into a primary key string. It makes joins slower, foreign keys brittle, and migrations painful.
* Alternative: Use standard UUIDs or auto-incrementing integers for PKs. Put type, name, and date in their own typed columns.
Blocking the task system on a massive legacy rewrite.
Look at your Implementation Strategy. Phase 1 is rewriting the production jobs runner into a domain-free generic library. Phase 2 is building the task system on top of it. You are blocking the exact thing you want to build (the unified task system) behind a massive, risky refactor of your existing, working production infrastructure. If Phase 1 hits a snag, your task system is delayed indefinitely.
* Alternative: Build the task system natively first using a simple loop. Prove the LLM scoping and task decomposition actually works. If it proves valuable, then refactor the old jobs system to use the new pattern.
Assuming code-comment TODOs contain enough context for project planning.
Your entire "Scoping" step relies on an LLM reading a TODO comment like // TODO: refactor session handling and generating a concrete, 3-day multi-file implementation plan with accurate effort estimates and sub-tasks.
You are entirely blind to the fact that developers write TODOs precisely because they don't want to think through the implementation details at that moment. The code snippets surrounding a TODO rarely contain the systemic context required to estimate "real effort" or "risk". The LLM will confidently hallucinate files involved, invent dependencies, and generate garbage sub-tasks. You are building an elaborate, automated pipeline to feed garbage into ClickUp.