# LLM Task Gym — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a reusable gym for optimizing prompt + model quality on any LLM task, starting with session condensation.

**Architecture:** Task configs (YAML) define corpus source, model/prompt variants, and judge criteria. A single `gym.py` loads corpus, fans out across model×prompt combinations via `asyncio.gather` + `lib.llm.call_llm`, judges each output against an opus reference, and generates an HTML comparison report.

**Tech Stack:** `lib/gym/base.py` (GymBase, Candidate, CorpusStore), `lib.llm.call_llm` (async LLM), click CLI, loguru, YAML config, HTML report.
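
The fan-out pattern described above can be sketched standalone (model and prompt names are illustrative; `fake_call_llm` stands in for the real `lib.llm.call_llm`):

```python
import asyncio

# Minimal sketch of the fan-out: asyncio.gather over model × prompt
# combinations with a concurrency cap via a semaphore.
async def fake_call_llm(model: str, prompt_name: str) -> str:
    await asyncio.sleep(0)  # stand-in for a real LLM call
    return f"{model}:{prompt_name}"

async def fan_out(models: list[str], prompts: list[str], limit: int = 6) -> list[str]:
    sem = asyncio.Semaphore(limit)

    async def limited(m: str, p: str) -> str:
        async with sem:
            return await fake_call_llm(m, p)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[limited(m, p) for m in models for p in prompts])

results = asyncio.run(fan_out(["haiku", "flash"], ["default", "tighter"]))
print(results)
```

The semaphore caps in-flight calls without changing result ordering, which is what lets `gym.py` zip results back to their `(session, model, prompt)` metadata by index.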

---

### Task 1: Create gym directory and condensation task config

**Files:**
- Create: `learning/gyms/llm_task/__init__.py`
- Create: `learning/gyms/llm_task/tasks/condensation.yaml`

**Step 1: Create directory structure**

```bash
mkdir -p learning/gyms/llm_task/tasks learning/gyms/llm_task/corpus learning/gyms/llm_task/results
```

**Step 2: Write condensation task config**

Create `learning/gyms/llm_task/tasks/condensation.yaml`:

```yaml
task: condensation
description: Compress session transcript preserving reasoning signal

corpus:
  source: session_transcripts
  project_dir: ~/.claude/projects/-Users-tchklovski-all-code-rivus
  count: 15
  min_user_messages: 5
  min_duration_seconds: 300

reference:
  model: opus
  max_tokens: 3000
  timeout: 120.0
  # prompt: null = uses default from learning_worker.CONDENSE_SYSTEM

models:
  - haiku
  - grok-fast
  - flash
  - gemini-lite

prompts:
  default: null  # uses CONDENSE_SYSTEM from learning_worker.py
  tighter: |
    Condense this Claude Code session transcript to ~1500 words.
    Focus ONLY on: decisions made, errors hit, approaches abandoned, user corrections.
    Drop all routine operations. Use plain text, not markdown.
  structured: |
    Condense this Claude Code session transcript into these sections:
    ## Decisions — Key choices about approach, architecture, tools
    ## Errors — What went wrong and how it was resolved
    ## Pivots — Approaches abandoned and why
    ## Outcomes — What was accomplished vs what was intended
    ## Open Items — Things discussed but not completed
    Keep total under 2000 words. Plain text within sections.

judge:
  model: opus
  max_tokens: 500
  timeout: 120.0
  criteria:
    - name: signal_preservation
      weight: 40
      description: Preserves reasoning, decisions, errors, and user corrections
    - name: compression
      weight: 25
      description: Meaningfully shorter while retaining important signal
    - name: downstream_utility
      weight: 20
      description: A downstream extractor could find learnings and dropped commitments from this
    - name: coherence
      weight: 15
      description: Reads as a coherent narrative, not choppy fragments
```
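
The judge weights above sum to 100, so a weighted overall is a dot product with the subscores. A sketch with hypothetical subscore values (the judge model returns its own `overall`; this only illustrates the intended weighting):

```python
# Criteria and weights taken from condensation.yaml above.
criteria = [
    ("signal_preservation", 40),
    ("compression", 25),
    ("downstream_utility", 20),
    ("coherence", 15),
]
# Example subscores (hypothetical values for illustration).
subscores = {
    "signal_preservation": 90,
    "compression": 80,
    "downstream_utility": 70,
    "coherence": 60,
}
overall = sum(subscores[name] * weight for name, weight in criteria) / 100
print(overall)
```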

**Step 3: Write `__init__.py`**

Create `learning/gyms/llm_task/__init__.py` — empty file.
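
A shell one-liner works (the `mkdir -p` is a no-op if Step 1 already ran):

```shell
mkdir -p learning/gyms/llm_task
touch learning/gyms/llm_task/__init__.py
```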

**Step 4: Commit**

```bash
git add learning/gyms/llm_task/
git commit -m "feat(gym): add llm_task gym directory and condensation config"
```

---

### Task 2: Corpus preparation — load and parse session transcripts

**Files:**
- Create: `learning/gyms/llm_task/corpus_prep.py`

**Step 1: Write corpus_prep.py**

This module loads real session transcripts, parses them (reusing `learning_worker.parse_transcript`), filters by quality, and saves as corpus JSONL.

```python
#!/usr/bin/env python
"""Corpus preparation — load session transcripts for gym evaluation."""

import json
import time
from pathlib import Path

import click
from loguru import logger

from supervisor.sidekick.hooks.learning_worker import (
    CONDENSE_SYSTEM,
    call_llm,
    parse_transcript,
)


PROJECTS_DIR = Path("~/.claude/projects/-Users-tchklovski-all-code-rivus").expanduser()


def load_candidate_sessions(
    project_dir: Path = PROJECTS_DIR,
    min_user_messages: int = 5,
    min_duration_seconds: int = 300,
    count: int = 15,
) -> list[dict]:
    """Load session transcripts suitable for gym evaluation.

    Returns list of {session_id, raw_text, user_message_count, duration_seconds}.
    """
    jsonl_files = sorted(
        project_dir.glob("*.jsonl"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )

    sessions = []
    for jf in jsonl_files:
        if len(sessions) >= count:
            break

        # Skip tiny files (< 50KB unlikely to be interesting)
        if jf.stat().st_size < 50_000:
            continue

        try:
            parsed = parse_transcript(jf)
        except Exception as e:
            logger.debug("Skip {}: {}", jf.name[:12], e)
            continue

        if parsed["user_message_count"] < min_user_messages:
            continue
        if parsed["duration_seconds"] < min_duration_seconds:
            continue

        sessions.append({
            "session_id": jf.stem,
            "raw_text": parsed["raw_text"],
            "user_message_count": parsed["user_message_count"],
            "duration_seconds": parsed["duration_seconds"],
            "has_errors": parsed["has_errors"],
            "has_edits": parsed["has_edits"],
            "file_size_mb": round(jf.stat().st_size / 1e6, 1),
        })
        logger.info(
            "Loaded {}: {} msgs, {:.0f}s, {:.1f}MB",
            jf.stem[:8],
            parsed["user_message_count"],
            parsed["duration_seconds"],
            jf.stat().st_size / 1e6,
        )

    return sessions


def generate_reference(raw_text: str, model: str = "opus", timeout: float = 120.0) -> str:
    """Generate reference condensation using the production prompt and a strong model."""
    return call_llm(
        prompt=raw_text,
        system=CONDENSE_SYSTEM,
        model=model,
        max_tokens=3000,
        timeout=timeout,
    )


def save_corpus(sessions: list[dict], corpus_dir: Path):
    """Save prepared corpus to JSONL."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    corpus_file = corpus_dir / "corpus.jsonl"

    with open(corpus_file, "w") as f:
        for s in sessions:
            f.write(json.dumps(s) + "\n")

    logger.info("Saved {} sessions to {}", len(sessions), corpus_file)


@click.command()
@click.option("--count", default=15, help="Number of sessions to load")
@click.option("--with-reference/--no-reference", default=False, help="Also generate opus reference outputs")
def prepare(count: int, with_reference: bool):
    """Prepare corpus from real session transcripts."""
    sessions = load_candidate_sessions(count=count)
    logger.info("Loaded {} candidate sessions", len(sessions))

    if with_reference:
        for s in sessions:
            logger.info("Generating reference for {}...", s["session_id"][:8])
            start = time.time()
            s["reference_output"] = generate_reference(s["raw_text"])
            logger.info("Reference done ({:.1f}s, {} chars)", time.time() - start, len(s["reference_output"]))

    corpus_dir = Path(__file__).parent / "corpus"
    save_corpus(sessions, corpus_dir)


if __name__ == "__main__":
    prepare()
```

**Step 2: Test corpus loading**

```bash
python -m learning.gyms.llm_task.corpus_prep --count 3 --no-reference
```

Expected: loads 3 sessions, saves to `learning/gyms/llm_task/corpus/corpus.jsonl`.

**Step 3: Commit**

```bash
git add learning/gyms/llm_task/corpus_prep.py
git commit -m "feat(gym): corpus preparation for llm_task gym"
```

---

### Task 3: Core gym — fan-out generation across model × prompt variants

**Files:**
- Create: `learning/gyms/llm_task/gym.py`

**Step 1: Write gym.py**

```python
#!/usr/bin/env python
"""LLM Task Gym — test prompt × model quality on real session data.

Usage:
    python -m learning.gyms.llm_task.gym prepare condensation --count 10
    python -m learning.gyms.llm_task.gym run condensation
    python -m learning.gyms.llm_task.gym judge condensation
    python -m learning.gyms.llm_task.gym report condensation
"""

import asyncio
import json
import time
from datetime import datetime
from pathlib import Path

import click
import yaml
from loguru import logger

from lib.gym.base import Candidate, CorpusStore, GymBase
from lib.llm import call_llm
from supervisor.sidekick.hooks.learning_worker import CONDENSE_SYSTEM


# Map task names to their default system prompts
DEFAULT_PROMPTS = {
    "condensation": CONDENSE_SYSTEM,
}

JUDGE_SYSTEM = """\
You are evaluating the quality of a condensed session transcript.

You will see:
1. The ORIGINAL raw transcript (or a snippet)
2. A REFERENCE condensation (produced by a strong model)
3. A CANDIDATE condensation (to evaluate)

Score the candidate on the given criteria, each 0-100.
Also provide an overall weighted score and brief reasoning.

Return JSON only:
{
  "overall": <0-100>,
  "reason": "brief explanation of strengths and weaknesses",
  "subscores": {<criterion_name>: <0-100>, ...}
}
"""


class LLMTaskGym(GymBase):
    """Gym for testing prompt × model quality on any LLM task."""

    def __init__(self, task_name: str):
        self.task_name = task_name
        self.task_config = self._load_task_config(task_name)
        super().__init__(gym_dir=Path(__file__).parent)
        self.results_dir = Path(__file__).parent / "results"
        self.results_dir.mkdir(exist_ok=True)

    def _load_task_config(self, task_name: str) -> dict:
        config_path = Path(__file__).parent / "tasks" / f"{task_name}.yaml"
        if not config_path.exists():
            raise FileNotFoundError(f"No task config: {config_path}")
        return yaml.safe_load(config_path.read_text())

    def load_corpus(self) -> list[dict]:
        """Load corpus from JSONL."""
        corpus_file = Path(__file__).parent / "corpus" / "corpus.jsonl"
        if not corpus_file.exists():
            return []
        entries = []
        for line in corpus_file.read_text().splitlines():
            if line.strip():
                entries.append(json.loads(line))
        return entries

    async def _run_single(self, raw_text: str, model: str, system_prompt: str, max_tokens: int = 3000) -> str:
        """Run a single model call."""
        return await call_llm(
            model, raw_text,
            system=system_prompt,
            max_tokens=max_tokens,
            stream=False,
            timeout=self.task_config.get("reference", {}).get("timeout", 120.0),
        )

    async def generate(self, input_data=None, feedback: str | None = None) -> list[Candidate]:
        """Fan out: each corpus item × each model × each prompt variant."""
        corpus = self.load_corpus()
        if not corpus:
            logger.error("No corpus loaded. Run 'prepare' first.")
            return []

        models = self.task_config.get("models", ["haiku"])
        prompts_config = self.task_config.get("prompts", {"default": None})
        default_prompt = DEFAULT_PROMPTS.get(self.task_name, "")

        tasks = []
        task_meta = []

        for entry in corpus:
            raw_text = entry["raw_text"]
            session_id = entry["session_id"]

            for model in models:
                for prompt_name, prompt_text in prompts_config.items():
                    system_prompt = prompt_text or default_prompt
                    tasks.append(self._run_single(raw_text, model, system_prompt))
                    task_meta.append({
                        "session_id": session_id,
                        "model": model,
                        "prompt": prompt_name,
                    })

        logger.info("Running {} tasks ({} corpus × {} models × {} prompts)",
                     len(tasks), len(corpus), len(models), len(prompts_config))

        # Fan out with concurrency limit
        semaphore = asyncio.Semaphore(6)

        async def limited(coro, idx):
            async with semaphore:
                start = time.time()
                try:
                    result = await coro
                    elapsed = time.time() - start
                    meta = task_meta[idx]
                    logger.info("Done {}/{}:{} ({:.1f}s, {} chars)",
                                meta["session_id"][:8], meta["model"], meta["prompt"],
                                elapsed, len(result))
                    return result
                except Exception as e:
                    logger.error("Failed {}: {}", task_meta[idx], e)
                    return f"ERROR: {e}"

        results = await asyncio.gather(*[limited(t, i) for i, t in enumerate(tasks)])

        candidates = []
        for i, result in enumerate(results):
            meta = task_meta[i]
            candidates.append(Candidate(
                id=f"{meta['model']}:{meta['prompt']}:{meta['session_id'][:8]}",
                content=result,
                variant=f"{meta['model']}:{meta['prompt']}",
                metadata=meta,
            ))

        # Save run
        run_ts = datetime.now().strftime("%Y%m%d_%H%M%S")
        run_dir = self.results_dir / f"{self.task_name}_{run_ts}"
        run_dir.mkdir(parents=True, exist_ok=True)
        with open(run_dir / "candidates.jsonl", "w") as f:
            for c in candidates:
                f.write(json.dumps(c.to_dict()) + "\n")
        logger.info("Saved {} candidates to {}", len(candidates), run_dir)

        return candidates

    async def evaluate(self, candidates: list[Candidate]) -> list[Candidate]:
        """Judge each candidate against reference using opus."""
        corpus = {e["session_id"]: e for e in self.load_corpus()}
        judge_config = self.task_config.get("judge", {})
        judge_model = judge_config.get("model", "opus")
        criteria = judge_config.get("criteria", [])

        criteria_text = "\n".join(
            f"- {c['name']} ({c['weight']}%): {c['description']}"
            for c in criteria
        )

        semaphore = asyncio.Semaphore(4)

        async def judge_one(candidate: Candidate) -> Candidate:
            async with semaphore:
                session_id = candidate.metadata["session_id"]
                entry = corpus.get(session_id, {})
                reference = entry.get("reference_output", "(no reference available)")
                # Use first 2000 chars of raw as context for judge
                raw_snippet = entry.get("raw_text", "")[:2000]

                judge_prompt = f"""## Original transcript (snippet)
{raw_snippet}

## Reference condensation
{reference}

## Candidate condensation ({candidate.metadata['model']}, prompt: {candidate.metadata['prompt']})
{candidate.content}

## Criteria
{criteria_text}

Score the candidate 0-100 on each criterion. Return JSON only."""

                try:
                    response = await call_llm(
                        judge_model, judge_prompt,
                        system=JUDGE_SYSTEM,
                        max_tokens=judge_config.get("max_tokens", 500),
                        temperature=0.1,
                        stream=False,
                        timeout=judge_config.get("timeout", 120.0),
                    )

                    # Extract the first {...} span from the judge response
                    import re
                    m = re.search(r'\{[\s\S]*\}', response)
                    if m:
                        data = json.loads(m.group())
                        candidate.score = data.get("overall", 0)
                        candidate.reason = data.get("reason", "")
                        candidate.subscores = data.get("subscores", {})
                    else:
                        candidate.score = 0
                        candidate.reason = "No JSON in judge response"
                except Exception as e:
                    logger.error("Judge error for {}: {}", candidate.id, e)
                    candidate.score = 0
                    candidate.reason = f"Error: {e}"

                return candidate

        judged = await asyncio.gather(*[judge_one(c) for c in candidates])
        return list(judged)

    def print_summary(self, candidates: list[Candidate]):
        """Print model × prompt score matrix to console."""
        # Group by variant (model:prompt)
        by_variant: dict[str, list[float]] = {}
        for c in candidates:
            by_variant.setdefault(c.variant, []).append(c.score)

        print(f"\n{'='*60}")
        print(f"  LLM Task Gym: {self.task_name}")
        print(f"  {len(candidates)} candidates evaluated")
        print(f"{'='*60}\n")

        # Model × Prompt matrix
        models = sorted(set(c.metadata["model"] for c in candidates))
        prompts = sorted(set(c.metadata["prompt"] for c in candidates))

        # Header
        header = f"{'model':<15}"
        for p in prompts:
            header += f" {p:<12}"
        print(header)
        print("-" * len(header))

        for model in models:
            row = f"{model:<15}"
            for prompt in prompts:
                key = f"{model}:{prompt}"
                scores = by_variant.get(key, [])
                avg = sum(scores) / len(scores) if scores else 0
                row += f" {avg:>5.1f}       "
            print(row)

        # Best single candidate (don't mutate the caller's list)
        best = max(candidates, key=lambda c: c.score)
        print(f"\nBest candidate: {best.variant} (score {best.score:.1f})")
        print(f"  {best.reason}")
        if best.subscores:
            for k, v in best.subscores.items():
                print(f"  {k}: {v}")


@click.group()
def cli():
    """LLM Task Gym — optimize prompt × model quality."""
    pass


@cli.command()
@click.argument("task")
@click.option("--count", default=15, help="Number of sessions")
@click.option("--with-reference/--no-reference", default=True)
def prepare(task: str, count: int, with_reference: bool):
    """Prepare corpus for a task."""
    from learning.gyms.llm_task.corpus_prep import (
        generate_reference,
        load_candidate_sessions,
        save_corpus,
    )

    sessions = load_candidate_sessions(count=count)
    logger.info("Loaded {} sessions", len(sessions))

    if with_reference:
        for s in sessions:
            logger.info("Generating reference for {}...", s["session_id"][:8])
            start = time.time()
            s["reference_output"] = generate_reference(s["raw_text"])
            elapsed = time.time() - start
            logger.info("Reference: {:.1f}s, {} chars", elapsed, len(s["reference_output"]))

    corpus_dir = Path(__file__).parent / "corpus"
    save_corpus(sessions, corpus_dir)
    logger.info("Corpus ready: {} sessions", len(sessions))


@cli.command()
@click.argument("task")
def run(task: str):
    """Run generation for all model × prompt combinations."""
    gym = LLMTaskGym(task)
    candidates = asyncio.run(gym.generate())
    logger.info("Generated {} candidates", len(candidates))


@cli.command()
@click.argument("task")
@click.option("--run-dir", default=None, help="Specific run dir (default: latest)")
def judge(task: str, run_dir: str | None):
    """Judge candidates against reference."""
    gym = LLMTaskGym(task)

    # Load latest run
    results_dir = Path(__file__).parent / "results"
    if run_dir:
        rd = results_dir / run_dir
    else:
        runs = sorted(results_dir.glob(f"{task}_*"), reverse=True)
        if not runs:
            logger.error("No runs found for task {}. Run 'run' first.", task)
            return
        rd = runs[0]

    candidates_file = rd / "candidates.jsonl"
    if not candidates_file.exists():
        logger.error("No candidates file at {}. Run 'run' first.", candidates_file)
        return
    candidates = []
    for line in candidates_file.read_text().splitlines():
        if line.strip():
            d = json.loads(line)
            candidates.append(Candidate(
                id=d["id"], content=d["content"], variant=d["variant"],
                metadata=d.get("metadata", {}),
            ))

    logger.info("Judging {} candidates from {}", len(candidates), rd.name)
    judged = asyncio.run(gym.evaluate(candidates))

    # Save judged results
    with open(rd / "judged.jsonl", "w") as f:
        for c in judged:
            f.write(json.dumps(c.to_dict()) + "\n")

    gym.print_summary(judged)

    # Save to corpus store
    gym.corpus.append_many(judged)


@cli.command()
@click.argument("task")
@click.option("--count", default=5, help="Number of corpus items")
@click.option("--report/--no-report", default=True, help="Generate HTML report")
def full(task: str, count: int, report: bool):
    """Full pipeline: prepare → run → judge → report."""

    async def _full():
        from learning.gyms.llm_task.corpus_prep import (
            generate_reference,
            load_candidate_sessions,
            save_corpus,
        )

        # Prepare
        sessions = load_candidate_sessions(count=count)
        logger.info("Loaded {} sessions, generating references...", len(sessions))
        for s in sessions:
            s["reference_output"] = generate_reference(s["raw_text"])
        save_corpus(sessions, Path(__file__).parent / "corpus")

        # Run + Judge
        gym = LLMTaskGym(task)
        candidates = await gym.generate()
        if not candidates:
            return
        judged = await gym.evaluate(candidates)
        gym.print_summary(judged)
        gym.corpus.append_many(judged)

        if report:
            from learning.gyms.llm_task.report import generate_html_report
            report_path = generate_html_report(judged, gym.task_config)
            logger.info("Report: {}", report_path)

    asyncio.run(_full())


if __name__ == "__main__":
    cli()
```

**Step 2: Verify import chain works**

```bash
python -c "from learning.gyms.llm_task.gym import LLMTaskGym; print('OK')"
```

**Step 3: Commit**

```bash
git add learning/gyms/llm_task/gym.py
git commit -m "feat(gym): core LLM task gym with model×prompt fan-out and judge"
```

---

### Task 4: HTML report generator

**Files:**
- Create: `learning/gyms/llm_task/report.py`

**Step 1: Write report.py**

Follow the badge gym report pattern — model × prompt score matrix as an HTML table, plus per-candidate detail cards with a criteria breakdown and expandable output.

```python
"""LLM Task Gym HTML report generator."""

from datetime import datetime
from html import escape
from pathlib import Path

from lib.gym.base import Candidate


def generate_html_report(candidates: list[Candidate], task_config: dict) -> Path:
    """Generate HTML comparison report from judged candidates."""
    report_dir = Path(__file__).parent / "reports"
    report_dir.mkdir(exist_ok=True)

    task_name = task_config.get("task", "unknown")
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = report_dir / f"{task_name}_{ts}.html"

    # Build model × prompt matrix
    models = sorted(set(c.metadata["model"] for c in candidates))
    prompts = sorted(set(c.metadata["prompt"] for c in candidates))

    # Aggregate scores
    scores: dict[str, list[float]] = {}
    for c in candidates:
        key = f"{c.metadata['model']}:{c.metadata['prompt']}"
        scores.setdefault(key, []).append(c.score)

    # Matrix rows
    matrix_rows = ""
    for model in models:
        cells = f"<td><strong>{escape(model)}</strong></td>"
        for prompt in prompts:
            key = f"{model}:{prompt}"
            vals = scores.get(key, [])
            avg = sum(vals) / len(vals) if vals else 0
            color = _score_color(avg)
            cells += f'<td style="background:{color};text-align:center">{avg:.1f}</td>'
        matrix_rows += f"<tr>{cells}</tr>\n"

    # Header row
    header_cells = "<th>Model</th>" + "".join(f"<th>{escape(p)}</th>" for p in prompts)

    # Detail cards (top 10 + bottom 5, deduped for runs with <15 candidates)
    candidates_sorted = sorted(candidates, key=lambda c: c.score, reverse=True)
    detail_cards = ""
    show = candidates_sorted[:10]
    show += [c for c in candidates_sorted[-5:] if c not in show]
    for c in show:
        subscores_html = ""
        if c.subscores:
            for k, v in c.subscores.items():
                bar_width = v if isinstance(v, (int, float)) else 0
                subscores_html += f"""
                <div class="criterion">
                    <span class="crit-name">{escape(str(k))}</span>
                    <div class="bar" style="width:{bar_width}%;background:{_score_color(bar_width)}"></div>
                    <span class="crit-score">{v}</span>
                </div>"""

        detail_cards += f"""
        <div class="card">
            <div class="card-header">
                <span class="model-tag">{escape(c.metadata.get('model', '?'))}</span>
                <span class="prompt-tag">{escape(c.metadata.get('prompt', '?'))}</span>
                <span class="session-id">{escape(c.metadata.get('session_id', '?')[:8])}</span>
                <span class="score" style="background:{_score_color(c.score)}">{c.score:.0f}</span>
            </div>
            <div class="reason">{escape(c.reason)}</div>
            {subscores_html}
            <details><summary>Output ({len(c.content)} chars)</summary>
                <pre>{escape(c.content[:2000])}</pre>
            </details>
        </div>"""

    html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8">
<title>LLM Task Gym: {escape(task_name)}</title>
<style>
body {{ font-family: -apple-system, sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; background: #f5f5f5; }}
h1 {{ color: #333; }}
table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
th, td {{ border: 1px solid #ddd; padding: 10px; }}
th {{ background: #333; color: white; }}
.card {{ background: white; border-radius: 8px; padding: 16px; margin: 12px 0; box-shadow: 0 1px 3px rgba(0,0,0,0.1); }}
.card-header {{ display: flex; gap: 8px; align-items: center; margin-bottom: 8px; }}
.model-tag {{ background: #e3f2fd; padding: 2px 8px; border-radius: 4px; font-weight: bold; }}
.prompt-tag {{ background: #f3e5f5; padding: 2px 8px; border-radius: 4px; }}
.session-id {{ color: #666; font-family: monospace; }}
.score {{ padding: 2px 10px; border-radius: 4px; color: white; font-weight: bold; margin-left: auto; }}
.reason {{ color: #555; margin: 8px 0; }}
.criterion {{ display: flex; align-items: center; gap: 8px; margin: 4px 0; }}
.crit-name {{ width: 180px; font-size: 0.9em; }}
.bar {{ height: 16px; border-radius: 3px; min-width: 2px; }}
.crit-score {{ font-weight: bold; font-size: 0.9em; }}
details {{ margin-top: 8px; }}
pre {{ background: #f9f9f9; padding: 12px; border-radius: 4px; white-space: pre-wrap; font-size: 0.85em; max-height: 300px; overflow-y: auto; }}
</style></head><body>
<h1>LLM Task Gym: {escape(task_name)}</h1>
<p>Generated {ts} — {len(candidates)} candidates across {len(models)} models × {len(prompts)} prompts</p>

<h2>Score Matrix (avg per model × prompt)</h2>
<table>
<tr>{header_cells}</tr>
{matrix_rows}
</table>

<h2>Detail (top 10 + bottom 5)</h2>
{detail_cards}
</body></html>"""

    report_path.write_text(html)
    return report_path


def _score_color(score: float) -> str:
    """Green for high, red for low."""
    if score >= 80:
        return "#4caf50"
    if score >= 60:
        return "#ff9800"
    if score >= 40:
        return "#f44336"
    return "#b71c1c"
```

**Step 2: Commit**

```bash
git add learning/gyms/llm_task/report.py
git commit -m "feat(gym): HTML report generator for llm_task gym"
```

---

### Task 5: Smoke test — run condensation gym on 3 sessions

**Step 1: Prepare corpus (3 sessions, no reference yet to test loading)**

```bash
python -m learning.gyms.llm_task.gym prepare condensation --count 3 --no-reference
```

Expected: loads 3 sessions, saves corpus.

**Step 2: Prepare with reference (3 sessions)**

```bash
python -m learning.gyms.llm_task.gym prepare condensation --count 3 --with-reference
```

Expected: 3 sessions with opus reference outputs.

**Step 3: Run generation (2 models only for speed)**

Temporarily edit `condensation.yaml` down to `models: [haiku, grok-fast]` and a single prompt variant, then:

```bash
python -m learning.gyms.llm_task.gym run condensation
```

Expected: 6 candidates (3 sessions × 2 models × 1 prompt).
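
The expected count follows directly from the cross product. A sketch that also shows the candidate ID format `gym.py` builds (session IDs hypothetical):

```python
# Enumerate expected candidate IDs for the smoke-test config.
# ID format model:prompt:session[:8], matching LLMTaskGym.generate.
sessions = ["aaaa1111", "bbbb2222", "cccc3333"]  # hypothetical session IDs
models = ["haiku", "grok-fast"]
prompts = ["default"]
ids = [f"{m}:{p}:{s[:8]}" for s in sessions for m in models for p in prompts]
print(len(ids), ids[0])
```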

**Step 4: Judge**

```bash
python -m learning.gyms.llm_task.gym judge condensation
```

Expected: score matrix printed to console, judged.jsonl saved.
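
Once `judged.jsonl` exists, per-variant averages can be recomputed offline. A sketch over hypothetical records shaped like `Candidate.to_dict` output:

```python
# Group judged records by variant and average scores — the same
# aggregation print_summary performs. Records here are hypothetical.
records = [
    {"variant": "haiku:default", "score": 72},
    {"variant": "haiku:default", "score": 80},
    {"variant": "grok-fast:default", "score": 65},
]
by_variant: dict[str, list[float]] = {}
for r in records:
    by_variant.setdefault(r["variant"], []).append(r["score"])
averages = {v: sum(s) / len(s) for v, s in by_variant.items()}
print(averages)
```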

**Step 5: Full pipeline (if individual steps worked)**

Restore `condensation.yaml` to full config and:

```bash
python -m learning.gyms.llm_task.gym full condensation --count 3 --report
```

Expected: HTML report at `learning/gyms/llm_task/reports/condensation_*.html`.

**Step 6: Commit**

```bash
git add learning/gyms/llm_task/
git commit -m "feat(gym): verified llm_task gym on condensation task"
```

---

### Task 6: Add extraction task config (prep for next gym)

**Files:**
- Create: `learning/gyms/llm_task/tasks/extraction.yaml`

**Step 1: Write extraction config**

```yaml
task: extraction
description: Extract learnings from condensed session transcripts

corpus:
  source: condensed_transcripts
  count: 15

reference:
  model: opus
  max_tokens: 1000
  timeout: 120.0
  # prompt: null = uses EXTRACT_SYSTEM from learning_worker.py

models:
  - haiku
  - sonnet
  - grok-fast
  - flash

prompts:
  default: null  # uses EXTRACT_SYSTEM
  stricter: |
    Extract 0-3 learnings from this session. Be VERY selective.
    Only extract genuinely novel, non-obvious insights that would
    change how someone approaches similar work in the future.

    A learning must pass this test: "Would I tell a colleague about
    this unprompted?" If not, skip it.

    Return JSON: {"learnings": [...]} or {"learnings": []} if nothing qualifies.

judge:
  model: opus
  max_tokens: 500
  timeout: 120.0
  criteria:
    - name: relevance
      weight: 35
      description: Are the extracted learnings genuinely useful and non-obvious?
    - name: completeness
      weight: 30
      description: Did it catch the important learnings present in the session?
    - name: specificity
      weight: 20
      description: Concrete and actionable vs vague platitudes?
    - name: false_positives
      weight: 15
      description: Did it avoid extracting routine or obvious observations?
```
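
The `stricter` prompt asks for JSON output; a tolerant parse, reusing the same brace-extraction regex the judge uses in `gym.py` (response text hypothetical):

```python
import json
import re

# Models often wrap JSON in prose; grab the first {...} span and parse it.
response = 'Sure. {"learnings": ["prefer X over Y when Z"]}'  # hypothetical output
m = re.search(r"\{[\s\S]*\}", response)
learnings = json.loads(m.group()).get("learnings", []) if m else []
print(learnings)
```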

**Step 2: Register extraction prompt in gym.py**

Add to the `DEFAULT_PROMPTS` dict in `gym.py`:

```python
from supervisor.sidekick.hooks.learning_worker import CONDENSE_SYSTEM, EXTRACT_SYSTEM

DEFAULT_PROMPTS = {
    "condensation": CONDENSE_SYSTEM,
    "extraction": EXTRACT_SYSTEM,
}
```

**Step 3: Commit**

```bash
git add learning/gyms/llm_task/tasks/extraction.yaml learning/gyms/llm_task/gym.py
git commit -m "feat(gym): add extraction task config for next gym iteration"
```
