```yaml
# Defaults: status=open, needs=autonomous, effort=M
items:
  - id: no-silent-cost-drops
    title: "★ Never silently drop cost data — warn/raise when stream completes without cost metadata"
    meta: {tags: [cost, reliability, priority], effort: S, scope: [lib/llm/stream.py]}
    notes: >
      LLMStream._intercept_meta silently swallows missing cost metadata — if the provider
      doesn't return usage/cost in the stream, the call is invisible to cost_log. This caused
      $254 of xAI grok-reasoning calls to go completely untracked (2026-03-30). Fix: after
      stream completes, if cost is None, log a warning with model + caller. For non-streaming
      call_llm, same check. No silent drops — every LLM call must be visible in cost_log.
      Related: stream_subscription (direct-sub-metadata below) also lacks cost tracking.

  - id: per-caller-call-count-limit
    title: "Add per-caller call-count limit (e.g. 1000/10min) — cheap-model runaway protection"
    meta: {tags: [cost, reliability], effort: S, scope: [lib/cost_log.py]}
    notes: >
      The $15/10min kill limit is calibrated for expensive models. At gemini-flash prices,
      88K calls fit under $15. Add a call-count limit alongside the dollar limit.

  - id: scanner-file-count-cap
    title: "Add MAX_FILES per scanner to prevent cold-cache cost explosion"
    meta: {tags: [cost, autodo], effort: S, scope: [helm/autodo/scanner/_core.py]}
    notes: >
      LLM scanners do 1 call per file with no upper bound. Cold cache across 8 scanners
      × 500+ files = 4000+ calls. Add a cap per scanner and per aggregate scan run.

  - id: direct-sub-metadata
    title: "Make stream_subscription return LLMResponse with full metadata (usage, cost, finish_reason)"
    meta: {tags: [infra, data], effort: S, scope: [lib/llm/claude_oauth.py]}
    notes: >
      stream_subscription (claude_oauth.py) yields plain text — no tokens, cost, or finish_reason.
      Fix: parse message_start event for input_tokens, message_delta for output_tokens + stop_reason.
      Then wrap in LLMResponse and log cost via _log_cost(). Would make the fast direct-httpx path
      a drop-in replacement for the litellm subscription path.
  - id: anthropic-usage-tracking
    title: "Extract token usage/cost from Anthropic OAuth subscription streaming responses"
    meta: {tags: [infra, data], scope: [lib/llm/stream.py]}
  # gemini-usage-tracking: DONE — usageMetadata extracted in both subscription and API billing paths
  - id: codex-usage-tracking
    title: "Extract usage from x-codex-* headers in Codex streaming responses"
    meta: {tags: [infra, data], effort: S, scope: [lib/llm/]}
  # litellm-monkeypatch-fix: DONE — monkey-patch removed; OAuth tokens work standalone via extra_headers (no api_key needed). Verified 2026-03-31.
  - id: gemini-31-pro-rollout
    title: "Complete Gemini 3.1 Pro rollout — run /model-update, test custom-tools variant"
    meta: {tags: [infra], effort: S, scope: [lib/llm/models.py]}
  - id: gpt53-codex-standard-api
    title: "Update codex alias when GPT-5.3-Codex standard API goes live"
    meta: {tags: [infra], effort: S, needs: blocked, scope: [lib/llm/models.py]}
  - id: codex-spark-access
    title: "Evaluate ChatGPT Pro subscription for gpt-5.3-codex-spark access"
    meta: {tags: [research], effort: S, needs: human, scope: [lib/llm/models.py]}
  - id: gpt54-subscription-test
    title: "Test GPT-5.4 via Codex OAuth subscription route"
    meta: {tags: [infra], effort: S, scope: [lib/llm/]}
  - id: gpt54-benchmarks
    title: "Run benchmarks for GPT-5.4 — compare latency/quality vs 5.3-chat-latest"
    meta: {tags: [research], scope: [lib/llm/benchmarks/]}
  - id: grok-420-ga
    title: "Wait for Grok 4.20 GA and evaluate as grok alias upgrade"
    meta: {tags: [infra], effort: S, needs: blocked, scope: [lib/llm/models.py]}
  - id: grok-imagine-video
    title: "Evaluate Grok Imagine Video capability vs Sora/Runway"
    meta: {tags: [feature, research], needs: research, scope: [lib/llm/]}
  - id: grok-code-v2-watch
    title: "Watch for Grok Code v2 (multimodal + parallel tools) release"
    meta: {tags: [infra], effort: S, needs: blocked, scope: [lib/llm/models.py]}
  - id: cli-pool-cleanup
    title: "Remove unused CLI pool management logic if OAuth APIs prove fully reliable"
    meta: {tags: [cleanup], effort: S, scope: [lib/llm/cli.py]}
  - id: multi-turn-session-persistence
    title: "Support stateful conversations across process restarts via persisted thread/conversation IDs"
    meta: {tags: [feature], scope: [lib/llm/]}
  - id: advanced-web-search-controls
    title: "Support domain allow/exclude filters for Grok/OpenAI native web search"
    meta: {tags: [feature], effort: S, scope: [lib/llm/web_tools.py]}
  - id: per-call-ttft
    title: "Measure TTFT (time to first token) in stream.py and attach to LLMResult"
    meta: {tags: [infra, data], effort: S, scope: [lib/llm/stream.py]}
  - id: per-call-token-rate
    title: "Calculate and record tokens/sec on every LLM call"
    meta: {tags: [infra, data], effort: S, scope: [lib/llm/stream.py], depends: [per-call-ttft]}
  - id: per-call-byte-counts
    title: "Track raw input/output byte counts for cost sanity checks"
    meta: {tags: [infra, data], effort: S, scope: [lib/llm/]}
  - id: per-call-stats-storage
    title: "Store per-call LLM stats (model, ttft, tps, tokens, cost, caller) to JSONL/SQLite"
    meta: {tags: [infra, data], scope: [lib/llm/], depends: [per-call-ttft, per-call-token-rate, per-call-byte-counts]}
  - id: per-call-stats-dashboard
    title: "Surface per-call LLM stats in /health endpoint and hub.localhost"
    meta: {tags: [feature, polish], scope: [lib/llm/, watch/], depends: [per-call-stats-storage]}
  - id: monitoring-health-stats
    title: "Add request counts, success rates, latency, quota utilization to /health endpoint"
    meta: {tags: [infra], scope: [lib/llm/]}
  - id: grok-fast-replace-haiku
    title: "Evaluate replacing haiku with grok-fast for badge/route/principles/session-tree tasks"
    meta: {tags: [research], scope: [lib/llm/, helm/]}
  - id: shadow-model-testing
    title: "Implement shadow model testing — run candidate models in parallel on live traffic for async eval"
    meta: {tags: [architecture, research], effort: L, scope: [lib/llm/]}
```

# lib/llm TODO

## High Priority: OAuth API Enhancements

### Usage Tracking for Subscription Calls
Extract token usage and cost from OAuth API responses (currently missing for subscription calls).
- [ ] **Anthropic:** Parse `message_start` and `message_delta` events for usage.
- [x] **Gemini:** Extract `usageMetadata` from response.
- [ ] **Codex:** Extract usage from `x-codex-*` headers in streaming responses.

### ~~Robust Monkey-patching~~ (DONE)
Monkey-patch removed — OAuth tokens work standalone via `extra_headers={'authorization': 'Bearer TOKEN', 'anthropic-beta': 'oauth-2025-04-20'}`. No API key needed. Verified 2026-03-31.
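
For reference, the working header set reduces to a tiny helper — `oauth_kwargs` is a hypothetical name; the header values are the ones verified above, passed through litellm's `extra_headers`:

```python
# Sketch: build the standalone-OAuth call kwargs described above.
# Header values are the verified ones; the helper name is illustrative.
def oauth_kwargs(token: str) -> dict:
    """extra_headers for calling Anthropic with an OAuth token, no api_key."""
    return {
        "extra_headers": {
            "authorization": f"Bearer {token}",
            "anthropic-beta": "oauth-2025-04-20",
        }
    }

# Usage (illustrative):
# litellm.completion(model="anthropic/...", messages=msgs, **oauth_kwargs(tok))
```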

## Model Rollout Tracking

Track new models that are announced but not yet accessible via API.

### Gemini 3.1 Pro (`gemini-3.1-pro-preview`) — 2026-02-19
- [ ] **Last checked**: 2026-02-26T17:03-0800
- [ ] **API billing (litellm)**: WORKS — tested, returns responses. Heavy on reasoning tokens (190 reasoning / 5 text for a simple prompt)
- [ ] **Subscription (Google One OAuth)**: WORKS — tested 2026-02-26, returns responses ("Hello there, friend!"). Was 404 on Feb 19; the flag has flipped since.
- [ ] **Try manually**: [Google AI Studio](https://aistudio.google.com/), Gemini app, [Vertex AI](https://console.cloud.google.com/)
- [ ] **Action**: Ready to run `/model-update` skill — API route works, update `gemini` alias
- [ ] **Pricing**: Same as 3 Pro ($2.00/$12.00 input/output per 1M)
- [ ] **Key improvement**: ARC-AGI-2 77.1% (vs 31.1% for 3 Pro) — 2.5x reasoning jump
- [ ] **Also check**: `gemini-3.1-pro-preview-customtools` variant (tool prioritization)
- [ ] **Details**: `docs/model-updates/2026-02-19-gemini-3.1-pro.md`
- [x] Run `/model-update` to swap `gemini` alias to 3.1 Pro (done 2026-02-26)
- [x] Re-test subscription route periodically — WORKS as of 2026-02-26
- [x] Multi-account failover — primary (G1 AI Pro) + timc (standard tier via quantjoy GCP project)
- [ ] **Subscription model availability by tier** (tested 2026-03-04):
  - G1 AI Pro (primary): gemini-2.5-pro, 2.5-flash, 2.5-flash-lite, 3.1-pro-preview, 3-pro-preview, 3-flash-preview
  - Standard tier (timc): gemini-2.5-pro, 2.5-flash, 2.5-flash-lite only (3.x models = 404)
- [x] Check if Gemini 3.1 Flash is announced → Nano Banana 2 (`gemini-3.1-flash-image-preview`) launched 2026-02-26, added to image_gen

### GPT-5.3-Codex (`gpt-5.3-codex`) — subscription works, standard API pending
- [ ] **Last checked**: 2026-02-19T10:08-0800
- [ ] **Subscription (Codex OAuth)**: WORKS — tested, "Hello to you." via `codex_oauth.py`
- [ ] **Standard API (`api.openai.com`)**: Not yet available — OpenAI says "coming soon", delayed for cybersecurity gating
- [ ] **Try manually**: [Codex CLI](https://github.com/openai/codex) (`codex exec`), [Codex app](https://codex.openai.com), ChatGPT, IDE extensions
- [ ] **Current alias**: `codex` → `openai/gpt-5.2-codex` — the alias uses the standard API route
- [ ] **Note**: `codex_oauth.py` already uses `gpt-5.3-codex` as default for subscription calls
- [ ] **Action when standard API live**: Update `codex` alias in models.py to `openai/gpt-5.3-codex`, add `gpt-5.2-codex` to LEGACY, update TEMPERATURE_1_REQUIRED
- [ ] **Also announced**: `gpt-5.3-codex-spark` — NOT available via our ChatGPT subscription ("Pro only"), no standard API either
- [ ] Check OpenAI API changelog for `gpt-5.3-codex` standard API availability
- [ ] Consider adding `codex-spark` alias when Spark gets API access
- [ ] Evaluate upgrading to ChatGPT Pro subscription — needed for `gpt-5.3-codex-spark` (real-time coding model, currently Pro-only)

### GPT-5.4 / GPT-5.4-Pro — 2026-03-05
- [x] **API billing**: WORKS — both `gpt-5.4` and `gpt-5.4-pro` tested
- [ ] **Subscription (Codex OAuth)**: Not tested yet
- [x] **Registered**: aliases `gpt` → `gpt-5.4`, `gpt-pro` → `gpt-5.4-pro`
- [x] **Pricing**: $2.50/$15.00 (5.4), $15.00/$120.00 (5.4-pro)
- [x] **Context**: 1M tokens (2× pricing for inputs beyond 272K)
- [ ] **Benchmarks**: Not yet run — compare latency/quality vs 5.3-chat-latest

### Grok 4.20 Beta (`grok-4.20-experimental-beta-0304`) — API early access only
- [ ] **Last checked**: 2026-03-05T12:30-0800
- [ ] **API billing**: In API model list but early access only — not generally available
- [ ] **Variants**: reasoning, non-reasoning, multi-agent
- [ ] **Try manually**: [X / Grok](https://x.com/i/grok) — select "Grok 4.20 Beta"
- [ ] **Action**: Wait for GA, then evaluate as `grok` alias upgrade
- [ ] Watch [xAI release notes](https://docs.x.ai/developers/release-notes)

### Grok Imagine Video (`grok-imagine-video`) — new
- [ ] **Last checked**: 2026-03-05T12:30-0800
- [ ] **API billing**: Listed in API, pricing TBD (credit-based: 2-10 credits per video)
- [ ] **Action**: New capability — needs video gen module (not image_gen)
- [ ] Evaluate if worth adding vs. existing Sora/Runway options

### Grok Code v2 (multimodal + parallel tools) — in training
- [ ] **Last checked**: 2026-02-19T09:53-0800 — not announced yet
- [ ] **Try manually**: N/A (still in training, no preview available)
- [ ] **Current**: `xai/grok-code-fast-1` is stable ($0.20/$1.50 per 1M)
- [ ] **Upcoming**: New variant with multimodal inputs, parallel tool calling, extended context
- [ ] Watch [xAI release notes](https://docs.x.ai/developers/release-notes) for grok-code-fast-2 or similar

## CLI Pool (Low Priority / Deprecated)

The CLI subprocess pool (`cli.py`) is now secondary to direct OAuth APIs.

### Cleanup
- Remove unused CLI pool management logic if OAuth APIs prove fully reliable.
- Fix the Opus hang in `cli.py` if it's still needed as a fallback.

## Future Work

### Multi-turn Session Persistence
Support stateful conversations in `Conversation` class across process restarts by persisting `thread_id` (Codex) or `conversation_id`.

### Advanced Web Search Controls
Support filters (allowed/excluded domains) for Grok/OpenAI native web search.

### Per-Call Stats Tracking

Track TTFT, token rate, and I/O size on every production call (not just benchmarks). Currently `LLMResult` carries `.usage` (prompt_tokens, completion_tokens) and `.cost`, but no timing. Add:

- [ ] **TTFT** (time to first token) — measured in stream.py, attached to LLMResult
- [ ] **Token rate** (tokens/sec) — output_tokens / (total_time - ttft)
- [ ] **Input/output bytes** — raw byte counts (before/after tokenization) for cost sanity checks
- [ ] **Storage**: Append to `~/.coord/llm_stats.jsonl` (or SQLite) — per-call rows with model, timestamp, ttft_ms, tps, input_bytes, output_bytes, input_tokens, output_tokens, cost, caller (badge_worker, brain, etc.)
- [ ] **Dashboard**: Surface in `/health` endpoint and hub.localhost

This enables: model comparison on real workloads, regression detection, cost attribution by caller, data-driven model selection.
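
A minimal sketch of the timing wrapper, assuming stream.py exposes a chunk iterator it can wrap; `StreamStats` and `timed_stream` are illustrative names, and 1 chunk ≈ 1 token is a simplification (real token counts come from usage metadata):

```python
# Sketch: wrap a token/chunk iterator to record TTFT and tokens/sec
# using the formula above: tps = output_tokens / (total_time - ttft).
import time
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamStats:
    ttft_s: float = 0.0
    tokens_per_sec: float = 0.0
    output_tokens: int = 0


def timed_stream(chunks: Iterable[str], stats: StreamStats) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and token rate."""
    start = time.monotonic()
    first = None
    n = 0
    for chunk in chunks:
        if first is None:
            first = time.monotonic()
            stats.ttft_s = first - start  # time to first token
        n += 1
        yield chunk
    total = time.monotonic() - start
    stats.output_tokens = n  # sketch approximation: 1 chunk ≈ 1 token
    gen_time = total - stats.ttft_s
    stats.tokens_per_sec = n / gen_time if gen_time > 0 else 0.0
```

The filled-in `StreamStats` would then be attached to `LLMResult` and appended as a JSONL row per the storage item above.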

### Monitoring & Stats
Add detailed stats to `/health`:
- Request counts per provider/billing route.
- Success rates and average latency.
- Subscription quota utilization (from headers/usage endpoints).
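
A minimal sketch of the aggregation behind such a `/health` payload — class and field names are illustrative:

```python
# Sketch: per-(provider, route) counters that roll up into a /health payload.
from collections import defaultdict


class LLMHealthStats:
    """Accumulate request counts, errors, and latency per provider/route."""

    def __init__(self):
        self._counts = defaultdict(
            lambda: {"requests": 0, "errors": 0, "latency_s": 0.0}
        )

    def record(self, provider: str, route: str, ok: bool, latency_s: float):
        c = self._counts[(provider, route)]
        c["requests"] += 1
        c["errors"] += 0 if ok else 1
        c["latency_s"] += latency_s

    def snapshot(self) -> dict:
        """Shape the counters for a /health response."""
        return {
            f"{provider}/{route}": {
                "requests": c["requests"],
                "success_rate": 1 - c["errors"] / c["requests"],
                "avg_latency_s": c["latency_s"] / c["requests"],
            }
            for (provider, route), c in self._counts.items()
        }
```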

## Grok 4.1 Fast (non-reasoning) — Use More Aggressively

`grok-fast` alias → `xai/grok-4-1-fast-non-reasoning`. Benchmarks show ~120 tps, ~400-700ms TTFT, $0.20/M input. This is significantly cheaper than haiku ($1.00/M input, $5.00/M output) with competitive quality for structured tasks.

**Evaluate replacing haiku with grok-fast for fast tasks:**
- [ ] Badge worker (currently haiku) — JSON extraction, topic classification
- [ ] Route worker (currently haiku) — project routing
- [ ] Principles worker (currently haiku) — principle matching
- [ ] helm.api session tree updates (new, see session intelligence design)
- [ ] Any other classification/extraction tasks using haiku

**Caveats to test:**
- xAI auto-cache (75% off, 99% hit rate) may make it even cheaper for repeated system prompts
- Quality on structured JSON output — does it follow schemas as reliably as haiku?
- Subscription routing: no xAI subscription route exists (API billing only) — cost comparison should account for haiku being free via Claude subscription

**Quick test**: Run the badge gym with `grok-fast` as a variant alongside `haiku` — the gym already supports multi-model comparison.

## Shadow Model Testing & Async Eval 🎤 (present this!)

Run shadow calls to candidate models in production to find improvement opportunities without disrupting live behavior.

**Pattern:**
1. On each fast-task LLM call (badge, route, etc.), fire the primary model (haiku) AND a shadow model (grok-fast, gemini-flash, etc.) in parallel
2. Primary result is used immediately — shadow result is logged but not acted on
3. Async eval compares shadow vs primary: accuracy, latency, cost
4. Accumulate stats → periodic report: "grok-fast matched haiku 94% of the time at 1/5 the cost"

**Implementation sketch:**
- [ ] Add `shadow_models: list[str]` option to `fast_llm_call()` / hot server
- [ ] Hot server fires shadow calls as background tasks (asyncio.create_task)
- [ ] Log results to `~/.coord/shadow_eval.jsonl` — primary_response, shadow_response, model, latency, match_score
- [ ] LLM judge (cheap model) scores agreement between primary and shadow periodically (batch, not per-call)
- [ ] Report: `inv llm.shadow-report` → HTML showing model comparison on real production data
- [ ] When a shadow model consistently matches/beats primary → promote it (update alias or config)
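
The pattern can be sketched with `asyncio.create_task`; `call_model` and the in-memory `log` list stand in for the real `fast_llm_call` plumbing and the JSONL append. In this simplification the shadows start after the primary completes, so each row can carry `primary_response`:

```python
# Sketch of the shadow pattern: primary result returns immediately,
# shadow calls are scheduled as background tasks and only logged.
import asyncio
import time


async def call_with_shadows(call_model, prompt: str, primary: str,
                            shadow_models: list[str], log: list) -> str:
    """Return the primary result; fire shadow calls off the hot path."""

    async def shadow(model: str, primary_response: str):
        t0 = time.monotonic()
        response = await call_model(model, prompt)
        log.append({  # in production: append a JSONL row instead
            "model": model,
            "primary_response": primary_response,
            "shadow_response": response,
            "latency_s": time.monotonic() - t0,
        })

    result = await call_model(primary, prompt)
    for m in shadow_models:
        asyncio.create_task(shadow(m, result))  # not awaited on the hot path
    return result
```

Agreement scoring and the `inv llm.shadow-report` rollup would then run over the accumulated rows in batch, as the checklist describes.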

**Key principle**: The gym tests prompt variants on replayed sessions. Shadow testing evaluates model variants on live traffic. Together they cover both axes of optimization (prompt × model).