# Doctor Upgrade Design

Date: 2026-03-07, revised 2026-03-09
Status: approved (v3 — thin architecture, signal library hardened)
Reviewed by: Vario maxthink x3 — fat→thin pivot, thin design critique, signal library + full design review

## Vision

Doctor is a portable project health monitor you can drop into any codebase. It watches logs, triages errors, and — uniquely — can spawn an AI coding agent to fix bugs autonomously.

**Zero-config baseline**: Point doctor at a project with a `logs/` directory and it starts finding and fixing problems. No instrumentation required.

**Upgraded path**: Instrumented services push structured signals (crashes, test failures, classified errors) for faster, higher-quality triage. The baseline log watcher keeps running as a catch-all.

**Doctor's unique value**: It's the only system that can go from "error detected" to "code fix committed" without human intervention. Everything else (process supervision, health checks, retries) already exists in other tools. Doctor fills the gap between "something is broken" and "someone needs to fix the code."

## Architecture Decision: Thin Doctor

**Question**: Should doctor be a fat monitor (reimplements monitoring) or a thin fix dispatcher (receives signals from existing systems)?

**Verdict** (all 4 models agreed): Thin doctor. Doctor only owns what nothing else can do — LLM triage and autonomous code fixes via Claude Code. Detection stays in systems that already do it.

**Rationale**:
- launchd already supervises processes, ops CLI already manages services, jobs framework already classifies errors — don't reimplement
- ~50% less code to write and maintain
- Push-based signals are instant (no 30s polling lag) and more reliable
- Doctor becomes testable in isolation (mock the signals directory)

### Two Signal Tiers

Doctor accepts signals from two tiers — both feed into the same triage pipeline:

| Tier | Source | Setup | Latency | Context | Use case |
|------|--------|-------|---------|---------|----------|
| **Baseline** | Log watching | None — just needs `logs/` dir | 30s scan | Low (regex/heuristic) | Drop-in for any project, zero instrumentation |
| **Upgraded** | Signal files | Add loguru sink or system hook | Instant | High (structured, classified) | Instrumented services in rivus |

**Log watching is not legacy to remove** — it's the zero-config baseline that makes doctor useful out of the box on any project. Signal files are the fast path for services that opt in.

The daemon runs both: watchdog on `signals/incoming/` for instant response, plus periodic log scan for uninstrumented projects. Both produce the same internal event format downstream.

### Boundary

| Layer            | Owner              | Notes                                         |
|------------------|--------------------|-----------------------------------------------|
| Detection        | Existing systems   | launchd (crashes), jobs (errors), ops (health) |
| Classification   | Jobs + Doctor LLM  | Jobs handles 80% (transient vs code bug). Doctor LLM triages ambiguous cases |
| Signaling        | Existing → Doctor  | Push via signal files (see §1)                |
| Triage + Fix     | **Doctor**         | LLM triage → Claude Code in worktrees        |
| Notifications    | Doctor             | Push/dashboard per policy (see §3)            |

### What was dropped from v1

- ~~Git commit detection~~ → CI/test runner signals test failures to doctor
- ~~Service crash snapshots~~ → launchd/ops signals crashes to doctor
- ~~Startup error parsing~~ → service itself signals startup failures

## 1. Signal Protocol

Doctor watches a signals directory. Other systems drop atomic JSON files when they detect problems doctor might fix.

### Directory

`doctor/signals/` (in monorepo). Subdirectories:
- `incoming/` — new signals (writers create here)
- `processed/` — doctor moves files here after handling (audit trail, pruned after 7 days)

**Safety limits**:
- `incoming/` capped at 1000 files. If exceeded, new signals are dropped rather than filling the filesystem (doctor is down or overwhelmed — dropping signals is safer than disk exhaustion)
- `processed/` auto-pruned: delete files older than 7 days
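Both safety limits can be sketched with the stdlib alone. Function names (`incoming_has_room`, `prune_processed`) are illustrative, not the real library's API:

```python
import time
from pathlib import Path

MAX_INCOMING = 1000            # cap from the design above
PRUNE_AFTER_SECS = 7 * 24 * 3600  # 7-day audit trail

def incoming_has_room(incoming: Path, cap: int = MAX_INCOMING) -> bool:
    """Cheap cap check: writers drop new signals when incoming/ is full."""
    try:
        return sum(1 for _ in incoming.iterdir()) < cap
    except FileNotFoundError:
        return False

def prune_processed(processed: Path, max_age: float = PRUNE_AFTER_SECS) -> int:
    """Delete processed signals older than max_age seconds; returns count."""
    cutoff = time.time() - max_age
    removed = 0
    for f in processed.glob("*.json"):
        if f.stat().st_mtime < cutoff:
            f.unlink(missing_ok=True)
            removed += 1
    return removed
```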

### File Format

Filename: `{service}_{timestamp}_{short-uuid}.json`

```json
{
  "event_id": "abc123",
  "timestamp": "2026-03-08T12:00:00Z",
  "service": "auth-svc",
  "type": "crash|test_fail|health_fail|startup_error|code_bug",
  "severity": "low|med|high",
  "fingerprint": "content-based hash (see §1.5)",
  "details": {
    "error_msg": "KeyError: 'user_id'",
    "log_tail": ["last", "10", "lines"],
    "exit_code": 1,
    "git_hash": "abc1234",
    "traceback": "full traceback text if available"
  },
  "source": "launchd|jobs|ops-cli|loguru"
}
```

Writers use atomic write-then-rename (tempfile → rename). **Both paths must be on the same filesystem**: `os.rename()`/`os.replace()` is only atomic within a single filesystem. Across filesystems the rename fails with `EXDEV`, and common fallbacks such as `shutil.move` silently degrade to copy-then-delete, losing atomicity. Symlinks that resolve across mounts trigger the same failure.
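A minimal sketch of the pattern (function name illustrative). Creating the temp file inside the destination directory is what guarantees the same-filesystem requirement:

```python
import json
import os
import tempfile
from pathlib import Path

def write_signal_atomic(signals_dir: Path, filename: str, payload: dict) -> Path:
    """Write payload to a temp file in the SAME directory, then rename.
    Readers never observe a half-written JSON file."""
    signals_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=signals_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # flush to disk before the rename
        target = signals_dir / filename
        os.replace(tmp, target)  # atomic within one filesystem
        return target
    except BaseException:
        os.unlink(tmp)  # don't leave temp debris behind
        raise
```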

### Signal Library: `lib/doctor_signal.py`

**Critical constraint**: ZERO dependencies on doctor or any third-party package. Stdlib only (`json`, `tempfile`, `os`, `datetime`, `uuid`, `hashlib`, `traceback`, `subprocess`). Any project can use it.

**API**:
```python
def signal_error(
    service: str,
    error_type: str,
    message: str,
    *,
    traceback: str | None = None,
    exit_code: int | None = None,
    log_tail: list[str] | None = None,
    git_hash: str | None = None,
    extra: dict | None = None,
) -> Path | None:
    """Write a signal file. Returns path or None if rate-limited/capped."""

def doctor_loguru_sink(message) -> None:
    """Plain function (not a class) that duck-types loguru's message.record.
    Does NOT import loguru — works via duck typing on the message object.
    Classifies error, writes signal if it looks like a code bug."""

def signal_on_exit(service: str) -> None:
    """Register atexit + SIGTERM handler for crash-on-exit signaling.
    Best-effort only — see crash coverage table for limitations."""

def get_git_hash() -> str | None:
    """Best-effort git rev-parse HEAD via subprocess. Returns None on failure."""

def make_fingerprint(
    service: str,
    error_type: str,
    message: str,
    traceback: str | None = None,
) -> str:
    """Content-based fingerprint. See §1.5 for algorithm."""
```

Common fields are explicit params (self-documenting for 80% case). `extra: dict` is the escape hatch for everything else.

**`doctor_loguru_sink` is a plain function, not a class.** It does NOT import loguru at module level. It duck-types the `message.record` dict that loguru passes to sink functions. This means `lib/doctor_signal.py` can be imported in any stdlib-only context without error — the loguru sink just won't be called unless loguru is the one calling it.

### Crash Coverage

| Crash type | Who signals | How | Reliability |
|------------|-------------|-----|-------------|
| Unhandled Python exception | `doctor_loguru_sink` | Loguru catches it, sink writes signal | High (if loguru installed) |
| Clean exit with error (`sys.exit(1)`) | `signal_on_exit()` via atexit | Checks exit code on shutdown | **Best-effort** (see caveats) |
| SIGTERM (graceful shutdown) | `signal_on_exit()` via signal handler | Writes signal before chaining to previous handler | Medium |
| Hard crash (segfault, OOM, kill -9) | **Process manager** (launchd) | External: post-crash wrapper script | High |
| Hang/timeout | **Process manager** (launchd) | External: watchdog timeout kills process | High |
| Silent failure (exit 0 but stopped working) | **Log staleness detection** (§6) | Detects stale logs (no writes in 5x expected interval) | Medium |
| Asyncio background task exception | Not caught by default | **Must** set `loop.set_exception_handler()` | Requires manual setup |
| Daemon thread crash | Not caught by atexit | Thread-local finally or `threading.excepthook` | Requires manual setup |
| Multiprocessing child death | Not caught by parent atexit | Pool error callbacks or child-level setup | Requires manual setup |

**`atexit` caveats** (documented explicitly):
- Does NOT run on: `os._exit()`, `SIGKILL`, unhandled signals
- SIGTERM only works if `signal_on_exit()` installs a handler (it does, but service may override)
- Daemon threads killed before atexit runs
- Multiprocessing children have separate atexit registries

**Design decision**: launchd wrapper is the **authoritative** crash reporter for hard/unexpected exits. `atexit`/`signal_on_exit()` is an optimization that adds richer context *when it fires*. Services that use asyncio/threads/multiprocessing need additional integration (documented in signal library README).

### Signal Writers

| System         | When                              | How                                    |
|----------------|-----------------------------------|----------------------------------------|
| **loguru sink** | ERROR+ with traceback            | `doctor_loguru_sink` classifies + writes signal |
| **atexit/SIGTERM** | Process exiting with error    | `signal_on_exit()` — best-effort      |
| **launchd**    | Service crashes/exits unexpectedly | Post-crash wrapper script writes signal (authoritative) |
| **Jobs framework** | Error classified as `code_bug` | Calls `signal_error()` directly       |
| **ops CLI**    | Health check fails                | `ops health-fail <svc>` writes signal |
| **Test runner**| Tests fail after commit           | Writes signal with failing test names |

### Loguru Sink Details

`doctor_loguru_sink` (plain function in `lib/doctor_signal.py`):
1. Fires on ERROR+ level with exception info
2. Classifies locally: transient (timeout, connection refused) vs potential code bug (KeyError, AttributeError, TypeError, assertion)
3. If code bug → writes signal file with traceback, service name, git hash
4. Rate-limited: max 1 signal per fingerprint per 10 minutes (prevent storms)
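The rate limit in step 4 can be sketched as a process-local cooldown map (names illustrative; the real library might persist state across restarts instead):

```python
import time

# fingerprint -> last-signaled monotonic timestamp; process-local is enough
# for per-process storm prevention.
_last_signal: dict[str, float] = {}

def should_signal(fingerprint: str, window_secs: float = 600.0) -> bool:
    """True if this fingerprint hasn't signaled within the window (10 min default)."""
    now = time.monotonic()
    last = _last_signal.get(fingerprint)
    if last is not None and now - last < window_secs:
        return False  # still cooling down: suppress the duplicate
    _last_signal[fingerprint] = now
    return True
```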

Better than external log scanning because:
- Service has context (startup vs request vs background task)
- No parsing heuristics — loguru provides structured exception data
- Instant, not delayed by scan interval
- False positive rate drops (service knows what's transient for it)

**Coexistence**: Log polling stays as the universal baseline for uninstrumented projects. Services that add the sink get faster, higher-quality triage. Both feed the same pipeline.

### 1.5. Fingerprint Algorithm

Everything depends on fingerprinting: dedup across tiers, cooldowns, attempt limits. The algorithm must produce **identical fingerprints** for the same bug whether it comes from a signal file or the log watcher.

```python
import hashlib

def make_fingerprint(service, error_type, message, traceback=None):
    """Content-based fingerprint for cross-tier dedup."""
    parts = [service, error_type, _normalize_message(message)]
    if traceback:
        # keep only the top frames, reduced to filename:function
        frames = _extract_top_frames(traceback, n=3)
        parts.extend(frames)
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

**Normalization**:
- `_normalize_message`: strip line numbers, variable values, timestamps, UUIDs, hex addresses
- `_extract_top_frames`: extract top 3 stack frames, strip line numbers and local variable names, keep only `filename:function_name`

**Cross-tier dedup**: Both the signal writer and the log watcher must produce the same fingerprint for the same error. The log watcher maps its regex-parsed output through the same `make_fingerprint()` function. Dedup checks fingerprint (not event_id) — so duplicate signals from both tiers are caught.

**Granularity tradeoffs**:
- Too coarse (service + error_type only): different bugs grouped, second suppressed
- Too fine (full traceback with line numbers): same bug with different call depths treated as separate
- Sweet spot: service + error_type + normalized message + top 3 frame signatures

### Doctor Reader

Event-driven via `watchdog` on `doctor/signals/incoming/`:
- On notification → scan directory for **all** new files (FSEvents may coalesce multiple events into one notification)
- Process through a `queue.Queue()` — single consumer thread writes to SQLite (eliminates contention from burst signals)
- Dedup by **content fingerprint** (not event_id) in SQLite
- After processing → move to `processed/`

**Startup sequence** (handles race window):
1. Register watchdog on `incoming/`
2. **Then** scan `incoming/` for all unprocessed files (catches anything that landed during registration)
3. Dedup by fingerprint handles any files processed twice
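The scan-on-notify and queue-backed parts of the reader can be sketched with the stdlib alone. The watchdog event handler would just call `on_notification`; `fingerprint_of` stands in for real fingerprint extraction, and the in-memory set stands in for SQLite dedup:

```python
import queue
from pathlib import Path

class SignalReader:
    """Sketch of the reader loop: a filesystem notification only pokes the
    queue; the single consumer thread re-scans the whole directory, so
    coalesced FSEvents notifications are safe."""

    def __init__(self, incoming: Path):
        self.incoming = incoming
        self.wake = queue.Queue()            # consumer thread blocks on this
        self.seen_fingerprints: set[str] = set()  # stands in for SQLite dedup
        self.processed: list[Path] = []

    def on_notification(self) -> None:
        # Called by the watchdog handler; the event payload doesn't matter.
        self.wake.put(None)

    def drain_once(self, fingerprint_of) -> None:
        """Scan for ALL pending files, not just the one named in the event."""
        for f in sorted(self.incoming.glob("*.json")):
            fp = fingerprint_of(f)
            if fp not in self.seen_fingerprints:
                self.seen_fingerprints.add(fp)
                self.processed.append(f)
            f.unlink()  # real code moves the file to processed/ instead
```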

## 2. Fix Lifecycle

### 2a. Branch Naming

`dr/{project}-{slug}-{HHMMSS}-{fp8}` — encodes provenance and avoids collisions.

- Slug sanitized: lowercase, `[^a-z0-9-]` stripped, truncated to 30 chars
- `{fp8}`: first 8 chars of fingerprint hash (more meaningful than random suffix — identifies the bug)
- Commit messages prefixed with `[doctor]` for filtering
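A sketch of the naming scheme as code (function name illustrative):

```python
import re
from datetime import datetime, timezone

def doctor_branch(project: str, slug: str, fingerprint: str) -> str:
    """Build dr/{project}-{slug}-{HHMMSS}-{fp8} per the naming scheme."""
    slug = re.sub(r"[^a-z0-9-]", "", slug.lower())[:30]  # sanitize + truncate
    hhmmss = datetime.now(timezone.utc).strftime("%H%M%S")
    return f"dr/{project}-{slug}-{hhmmss}-{fingerprint[:8]}"
```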

### 2b. Auto-Merge (restricted)

**Phase 1 (dry-run)**: Doctor does everything except `git merge` + `git push`. Logs what it *would* have done and notifies. Run for 1-2 weeks to build confidence.

**Phase 2 (live)**: Auto-merge to main only when ALL conditions met:
- Fix is in **allowlisted categories**: import fixes, type errors, config typos, test-only fixes
- Category detected via **static checks** (path patterns, `git diff --stat`), not LLM classification
- ≤3 files changed AND ≤50 lines changed
- **No forbidden paths touched**: `migrations/`, `auth/`, `security/`, deployment configs, `__init__.py`
- Full test suite passes (not just previously-failing tests)
- Fresh rebase on latest main (fast-forward only)
- **Circuit breaker**: max 3 auto-merges per hour **per project**. If exceeded, stop and notify.
- **Post-merge verification**: run tests again after merge. If fail, retry test once (flake protection). If still fail, `git revert HEAD` automatically (also tagged `[doctor]`) and notify. Write "poison pill" to issues table — that fingerprint is permanently blocked from auto-merge.
- **No new fix starts while post-merge verification is running.** Verification is part of the fix lifecycle.
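The static size and path gates can be sketched as a pure function over parsed `git diff --stat` output. Thresholds and patterns mirror the conditions above; parsing the diff itself is omitted:

```python
from fnmatch import fnmatch

# Illustrative forbidden-path patterns and limits from the conditions above.
FORBIDDEN = ["migrations/*", "auth/*", "security/*", "*__init__.py", "deploy/*"]
MAX_FILES, MAX_LINES = 3, 50

def automerge_allowed(changed: dict[str, int]) -> bool:
    """changed maps file path -> lines changed (e.g. parsed from
    `git diff --stat`). Pure static checks, no LLM involved."""
    if len(changed) > MAX_FILES or sum(changed.values()) > MAX_LINES:
        return False  # too big to auto-merge
    for path in changed:
        if any(fnmatch(path, pat) for pat in FORBIDDEN):
            return False  # touches a forbidden path
    return True
```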

**Doctor-caused regressions**: If a signal's `git_hash` points to a `[doctor]` commit, **always route to human review** — never auto-fix. Don't let the robot fix its own bugs in a loop.

### 2c. Pending Review with TTL

Larger fixes stay at `pending_review`. After 48h (weekend-safe):
- Delete worktree (save disk)
- **Keep branch** on remote (preserve work)
- Archive diff + test output to `doctor/fixes/archive/`
- Error stays tracked in issues table
- **Send notification** that fix expired and needs review

### 2d. Worktree Cleanup on Startup

On daemon start:
- List all `dr/` worktrees
- Check each for: lease in DB with `fix_started_at` timestamp, active PID, recent file modifications (< 1h)
- Only clean worktrees that are: lease expired (fix_started_at + timeout + 5min buffer), no active PID, no recent activity
- Log every cleanup action to events table

### 2e. Fix Attempt Limits

- **Timeout**: 10 minutes per fix attempt. Kill via `os.killpg()` (process group kill) to ensure Claude Code and all subprocesses are terminated. Clean up worktree after kill.
- **Max concurrent**: 1 fix at a time (solo dev machine)
- **Max attempts per issue**: 2 (after 2 failures, escalate to human via push notification)
- **Cooldown**: no re-attempt of same fingerprint within 30 minutes
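The timeout-plus-group-kill pattern can be sketched as below (POSIX-only; `run_with_timeout` is an illustrative name, and the SIGTERM-then-SIGKILL escalation is an assumption):

```python
import os
import signal
import subprocess

def run_with_timeout(cmd: list[str], timeout: float) -> int:
    """Run cmd in its own process group; on timeout, kill the WHOLE group
    so anything the fix agent spawned dies too."""
    proc = subprocess.Popen(cmd, start_new_session=True)  # new process group
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # escalate
            proc.wait()
        return proc.returncode
```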

## 3. Notifications

Notifications via Pushover (`lib/notify.push()`).

### 3a. Policy

| Event                            | Channel    | Condition                                |
|----------------------------------|------------|------------------------------------------|
| Doctor starts fixing             | Push       | First attempt only, not retries          |
| Fix merged (auto or reviewed)    | Push       | Always                                   |
| Fix failed after max attempts    | Push       | Always                                   |
| Fix expired (48h TTL)            | Push       | Always                                   |
| Service crash detected (signal)  | Push       | Debounce: 1 per service per 5 min       |
| `main` tests failing (signal)    | Push       | First occurrence only, then dashboard    |
| Persistent test failure          | Push       | Escalate if dashboard-only persists >30 min |
| Doctor is down (stale heartbeat) | Push (cron)| Heartbeat file checked by cron job       |

### 3b. Deduplication

- Per-fingerprint cooldown: same error → max 1 push per hour
- Per-service crash cooldown: max 1 push per 5 minutes
- Escalation: dashboard-only events that persist >30 min get promoted to push

### 3c. Attention Budget

Target: ≤5 pushes per day in steady state. If doctor is sending more, something systemic is wrong — auto-mute non-critical pushes and send a single "Doctor is noisy, check dashboard" meta-notification. Manual override: `ops doctor unmute`.

## 4. Visibility

### 4a. doctor_events Table

```sql
CREATE TABLE doctor_events (
    id          INTEGER PRIMARY KEY,
    ts          TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%f', 'now')),
    event_type  TEXT NOT NULL,  -- signal_received, fix_start, fix_complete, fix_failed, merge, revert, ...
    project     TEXT NOT NULL,
    summary     TEXT NOT NULL,
    detail_json TEXT,           -- structured payload
    status      TEXT DEFAULT 'info',  -- info, warning, error, resolved
    fingerprint TEXT,           -- link to issues table
    branch      TEXT,
    commit_hash TEXT,
    duration_ms INTEGER,
    signal_type TEXT,           -- crash, test_fail, health_fail, code_bug
    correlation_id TEXT,        -- group related events (signal → fix_start → fix_complete)
    actor       TEXT DEFAULT 'doctor'  -- doctor, human, session_id
);

CREATE INDEX idx_events_project_ts ON doctor_events(project, ts DESC);
CREATE INDEX idx_events_type_ts ON doctor_events(event_type, ts DESC);
CREATE INDEX idx_events_correlation ON doctor_events(correlation_id);
CREATE INDEX idx_events_fingerprint ON doctor_events(fingerprint);
```

WAL mode + busy_timeout=5000ms (already set in doctor_db).

**Retention**: Delete events older than 30 days. Periodic `VACUUM` (weekly) to prevent fragmentation from INSERT+DELETE patterns.
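The retention rule as a sketch, assuming `ts` keeps the format the table default produces, so string comparison orders correctly:

```python
import sqlite3

def prune_events(conn: sqlite3.Connection, days: int = 30) -> int:
    """Delete doctor_events rows older than `days`; caller VACUUMs weekly."""
    cur = conn.execute(
        "DELETE FROM doctor_events "
        "WHERE ts < strftime('%Y-%m-%dT%H:%M:%f', 'now', ?)",
        (f"-{days} days",),
    )
    conn.commit()
    return cur.rowcount
```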

### 4b. Integration Points

- **Watch dashboard**: "Doctor Activity" tab reading from doctor_events via SSE
- **CLI**: `ops doctor events [--project X] [--tail] [--json]`
- **CLI**: `ops doctor status` — current issues, active fixes, last signal time
- **CLI**: `ops doctor review <branch>` — show diff + test results for pending fix
- **CLI**: `ops doctor unmute` — override attention budget auto-mute

### 4c. Cleanup

- Remove unused `doctor/notifications/` markdown directory

## 5. Coordination

YAML coordination in `~/.coord/` stays for session registration. Doctor-specific state (fix attempts, worktree leases, issue dedup) lives entirely in SQLite.

Doctor acquires a per-project advisory lock (via issues table `status='fixing'` + `fix_started_at` timestamp) before spawning a fix. If another session or doctor instance is already fixing that project, skip. **Stale lock recovery**: on startup and periodically, scan for `fixing` rows where `fix_started_at` is older than timeout (10 min) + buffer (5 min). Reset stale locks to `open`.
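Stale lock recovery as a sketch; the `issues` column names are assumptions based on this section:

```python
import sqlite3

FIX_TIMEOUT_SECS = 10 * 60  # per-fix timeout
BUFFER_SECS = 5 * 60        # grace buffer before reclaiming

def recover_stale_locks(conn: sqlite3.Connection) -> int:
    """Reset 'fixing' rows whose fix_started_at is past timeout + buffer."""
    cur = conn.execute(
        """UPDATE issues SET status = 'open', fix_started_at = NULL
           WHERE status = 'fixing'
             AND fix_started_at < datetime('now', ?)""",
        (f"-{FIX_TIMEOUT_SECS + BUFFER_SECS} seconds",),
    )
    conn.commit()
    return cur.rowcount
```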

Before pushing to main, write a coordination event so other sessions can detect the push.

## 6. Log Staleness Detection

The log watcher baseline gets one enhancement: **staleness detection**. If a service normally logs every N seconds and hasn't logged in 5x that window, signal it as a potential silent failure.

- Track `last_log_ts` per service (per log file) during normal scanning
- Build a baseline of expected log frequency per service (rolling average over 1 hour)
- If no new log lines for 5x the expected interval → write signal with `type: "stale_logs"`, `severity: "med"`
- Combined with ops health checks (service hung) and launchd (service exited), this covers most silent failure modes without service instrumentation

This stays in the log watcher — no signal library changes needed.
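A sketch of the tracker, using an exponentially weighted moving average as a stand-in for the rolling one-hour average:

```python
class StalenessTracker:
    """Per-service average of log intervals; flags a service when no
    writes appear within `factor` times its typical interval."""

    def __init__(self, factor: float = 5.0, alpha: float = 0.2):
        self.factor = factor
        self.alpha = alpha  # EWMA weight approximating a rolling average
        self.last_ts: dict[str, float] = {}
        self.avg_interval: dict[str, float] = {}

    def observe(self, service: str, ts: float) -> None:
        """Record a log write; update the service's expected interval."""
        prev = self.last_ts.get(service)
        if prev is not None:
            interval = ts - prev
            avg = self.avg_interval.get(service, interval)
            self.avg_interval[service] = (1 - self.alpha) * avg + self.alpha * interval
        self.last_ts[service] = ts

    def stale_services(self, now: float) -> list[str]:
        """Services silent for more than factor * expected interval."""
        return [
            svc for svc, ts in self.last_ts.items()
            if svc in self.avg_interval
            and now - ts > self.factor * self.avg_interval[svc]
        ]
```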

## 7. Known Gaps (accepted)
- **Multiprocessing child crashes**: not caught by parent's signal library setup. Services using multiprocessing need child-level integration. Documented in signal library README.
- **Asyncio/thread exceptions**: require manual `loop.set_exception_handler()` / `threading.excepthook`. Documented.
- **LLM triage quality**: the triage prompt and how doctor decides a signal is actionable vs noise is underspecified. Will be iterated on with real signals.

## Implementation Priority

1. **Signal library** (`lib/doctor_signal.py`) — zero-dep, with fingerprint algorithm
2. **Signals directory + reader** — watchdog-based, queue-backed, with dedup
3. **Events table** — record everything doctor does
4. **Branch naming + commit tagging** — low risk, high value
5. **Fix attempt limits + timeout** (with `os.killpg`) — safety
6. **Stale lock recovery** — prevent permanently stuck projects
7. **Signal writers**: jobs framework `code_bug` → doctor signal (highest-value source)
8. **Signal writers**: launchd crash wrapper, ops health-fail
9. **Auto-merge Phase 1: dry-run** — log what would merge, don't actually merge
10. **Auto-merge Phase 2: live** — after 1-2 weeks of dry-run confidence
11. **Notification policy + deduplication** — polish
12. **Dashboard/CLI integration** — visibility
