# LLM Task Gym — Design

**Date**: 2026-02-23
**Goal**: Optimize prompt + model quality for learning_worker pipeline stages, starting with condensation, then extraction. Reusable pattern for any LLM call.

## Problem

The learning_worker pipeline has 4 stages, each an LLM call with a system prompt and model choice. Currently:
- Stage 1 (condensation): haiku, hardcoded prompt
- Stage 2 (learning extraction): opus, hardcoded prompt
- Stage 3 (follow-up extraction): opus, hardcoded prompt
- Stage 4 (dropped commitments): opus, hardcoded prompt

There is no systematic way to test whether different prompts or lighter/faster models produce comparable quality. Grok-fast and Gemini Flash are likely viable for Stage 1 but remain untested.

## Core Abstraction

A **task** is any LLM call: (input, system_prompt, model) → output. The gym tests **prompt variants × model variants** on a **corpus of real inputs**, judges quality, and produces a comparison report.

```
Corpus (real inputs)
    ↓
Vario fan-out: prompt × model matrix
    ↓
Judge: score each output (reference comparison + rubric)
    ↓
Report: model × prompt score matrix
```
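The abstraction above can be sketched as a couple of small dataclasses plus a fan-out helper. This is illustrative only; the names (`Variant`, `Candidate`, `fan_out`) are not the actual gym API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    model: str    # e.g. "haiku", "grok-fast"
    prompt: str   # resolved system prompt text

@dataclass
class Candidate:
    corpus_id: str
    variant: Variant
    output: str

def fan_out(corpus, variants, call):
    """Run every (model, prompt) variant over every corpus item.

    `call(model, system_prompt, input) -> output` abstracts the LLM;
    in the real gym this would wrap lib.llm.call_llm.
    """
    return [
        Candidate(item["id"], v, call(v.model, v.prompt, item["input"]))
        for item in corpus
        for v in variants
    ]
```

With a 15-item corpus, 4 models, and 3 prompts, this yields 180 candidates per run.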

## Architecture

### Directory

```
learning/gyms/llm_task/
├── gym.py               # GymBase subclass — prepare, run, judge, report
├── report.py            # HTML report generation
├── corpus/              # test data (auto-generated)
│   └── corpus.jsonl     # {id, input, reference_output, metadata}
├── results/             # timestamped run outputs
└── tasks/               # per-task YAML configs
    ├── condensation.yaml
    └── extraction.yaml  # next after condensation
```

### Task Config Format

```yaml
task: condensation
description: Compress session transcript preserving reasoning signal

corpus:
  source: session_transcripts
  count: 15
  min_user_messages: 5

reference:
  model: opus
  prompt: default   # from learning_worker.CONDENSE_SYSTEM

models: [haiku, grok-fast, flash, gemini-lite]

prompts:
  default: null     # uses the production prompt
  tighter: |
    Condense to ~1500 words. Focus on decisions, errors, pivots.
  structured: |
    Condense into: ## Decisions ## Errors ## Outcomes ## Open Items

judge:
  model: opus
  criteria:
    - name: signal_preservation
      weight: 40
      description: Keeps reasoning, decisions, errors, user corrections
    - name: compression
      weight: 25
      description: Meaningfully shorter while retaining signal
    - name: downstream_utility
      weight: 20
      description: A downstream extractor could find learnings and dropped commitments
    - name: coherence
      weight: 15
      description: Reads as narrative, not choppy fragments
```
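The criterion weights are percentages and should sum to 100, with the aggregate score being a weighted mean. A minimal sketch of that aggregation, using the condensation criteria above (the function name is hypothetical):

```python
def weighted_score(scores: dict, criteria: list) -> float:
    """Combine per-criterion 0-100 judge scores into one number.

    `scores` maps criterion name -> judge score; `criteria` comes from
    the task YAML. Weights are validated to sum to 100.
    """
    total_weight = sum(c["weight"] for c in criteria)
    assert total_weight == 100, f"weights sum to {total_weight}, expected 100"
    return sum(scores[c["name"]] * c["weight"] for c in criteria) / 100

# Criteria as they appear in the condensation config above.
CONDENSATION_CRITERIA = [
    {"name": "signal_preservation", "weight": 40},
    {"name": "compression", "weight": 25},
    {"name": "downstream_utility", "weight": 20},
    {"name": "coherence", "weight": 15},
]
```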

### Workflow

1. **`gym.py prepare <task>`** — Load real session transcripts, generate reference outputs (opus), save corpus
2. **`gym.py run <task>`** — Fan out: each (model, prompt) processes each corpus item via Vario. Store all outputs.
3. **`gym.py judge <task>`** — Opus judges each candidate vs reference on criteria. Store scores.
4. **`gym.py report <task>`** — HTML report: model × prompt matrix, per-criterion breakdown, best/worst examples
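The four verbs suggest a simple CLI front end. A minimal sketch assuming argparse (the real gym.py presumably dispatches these into GymBase methods):

```python
import argparse

def parse_cli(argv=None):
    """Parse `gym.py <command> <task>` into (command, task).

    The task name maps to tasks/<task>.yaml, e.g. "condensation".
    """
    parser = argparse.ArgumentParser(prog="gym.py")
    parser.add_argument("command",
                        choices=["prepare", "run", "judge", "report"])
    parser.add_argument("task", help="task name, e.g. condensation")
    args = parser.parse_args(argv)
    return args.command, args.task
```

Keeping `prepare` separate from `run` means the (expensive) opus reference generation happens once per corpus, not once per experiment.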

### Judging Strategy

Pair-judge: for each corpus item, the judge sees (reference_output, candidate_output, original_input) and scores the candidate on each criterion 0-100. This is the same pattern as `learning/session_review/pair_judge.py`.
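A sketch of how the pair-judge prompt might be assembled from a task's criteria. The wording and the JSON reply schema are assumptions, not the actual `pair_judge.py` prompt:

```python
def build_judge_prompt(original_input, reference, candidate, criteria):
    """Render one pair-judging request for a single corpus item.

    The judge sees the original input, the opus reference, and the
    candidate, and is asked for a 0-100 score per criterion.
    """
    crit_lines = "\n".join(
        f"- {c['name']} ({c['weight']}%): {c['description']}"
        for c in criteria
    )
    return (
        "Score the CANDIDATE against the REFERENCE on each criterion, "
        "0-100.\n"
        f"Criteria:\n{crit_lines}\n\n"
        f"ORIGINAL INPUT:\n{original_input}\n\n"
        f"REFERENCE:\n{reference}\n\n"
        f"CANDIDATE:\n{candidate}\n\n"
        'Reply as JSON: {"<criterion_name>": <score>, ...}'
    )
```

Including the original input (not just the two outputs) lets the judge catch candidates that read well but dropped signal the reference kept.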

### Integration with Vario

- Use `lib.llm.call_llm` for all model calls (same as Vario internals)
- Fan-out uses `asyncio.gather` across model×prompt combinations
- No dependency on Vario UI — just the parallel execution pattern
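The fan-out pattern is plain `asyncio.gather` over the model × prompt matrix. In this sketch `call_llm` is a stand-in for `lib.llm.call_llm`, whose real signature may differ:

```python
import asyncio

async def run_matrix(corpus, models, prompts, call_llm):
    """Run every (model, prompt) pair over every corpus item concurrently.

    `prompts` maps prompt name -> system prompt text;
    `call_llm(model, system, user)` is an async stand-in for the real
    lib.llm.call_llm.
    """
    async def one(item, model, prompt_name):
        out = await call_llm(model, prompts[prompt_name], item["input"])
        return {"id": item["id"], "model": model,
                "prompt": prompt_name, "output": out}

    tasks = [
        one(item, model, name)
        for item in corpus
        for model in models
        for name in prompts
    ]
    # gather preserves task order, so results align with the matrix.
    return await asyncio.gather(*tasks)
```

A production version would likely add a semaphore to cap concurrent calls per provider, but the shape is the same.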

## Next Task: Learning Extraction

Same harness, different config:

```yaml
task: extraction
corpus:
  source: condensed_transcripts  # output from condensation
  count: 15
reference:
  model: opus
  prompt: default   # EXTRACT_SYSTEM
models: [haiku, sonnet, grok-fast, flash]
judge:
  criteria:
    - name: relevance
      weight: 35
    - name: completeness
      weight: 30
    - name: specificity
      weight: 20
    - name: false_positives
      weight: 15
```

## What's Reusable

Adding a new task = new YAML file. The gym code, judge pattern, corpus management, and reporting are all generic. Future targets: badge generation, dropped commitment detection, follow-up extraction, or any LLM call anywhere in the system.

## Non-Goals

- Not optimizing for cost — optimizing for quality, then checking which lighter models are "good enough"
- Not A/B testing in production — this is offline evaluation
- Not replacing Vario — complementing it with corpus management and task-specific judging
