# Session Extraction — Design

**Date**: 2026-02-23
**Goal**: Extract learnings directly from raw session transcripts, replacing the condense-then-extract pipeline with a chunk-and-extract approach that preserves full signal fidelity.

## Problem

The current learning_worker pipeline condenses entire session transcripts before extracting learnings. This is an anti-pattern: condensation removes signal before we know which signal matters. Raw transcript chunks contain the actual errors, user corrections, and tool output that make learnings specific and actionable.

**Principle**: Defer Compression Until Post-Extraction (data-quality/defer-compression-until-post-extraction).

## Architecture

```
Session JSONL files (raw)
    ↓
SessionAdapter.load_document() — parse JSONL → structured turns
    ↓
SessionAdapter.preprocess() — strip noise (low-info prompts, routine tool output)
    ↓
Chunk by episode (topic shifts between conversation turns)
    ↓
Triage: classify episodes (routine / error_recovery / pivot / user_correction / design_decision)
    ↓
Extract per non-routine chunk: episodes → learning instances (parallel, any model)
    ↓
Merge + dedup across chunks
    ↓
Store to learning.db via `learn add` (instances, auto-linked to principles)
```

### Location

`learning/session_extract/` — separate from gyms (this is a pipeline, not a gym) and session_review (which analyzes efficiency patterns, not learnings).

### Reuse from SemanticNet

- `DomainAdapter` base class → `SessionAdapter` implementation
- `extract_batch()` for concurrent extraction across sessions
- Topic-change chunking pattern (adapted from time-based to turn-based)

No new data models — extraction outputs go directly into `learning.db` as instances via the existing `learn add` path.
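The adapter's two pipeline steps could look roughly like this. This is a sketch only: `DomainAdapter`'s real interface, the JSONL field names (`role`, `text`), and the noise rules are assumptions, not the actual SemanticNet API.

```python
import json


class SessionAdapter:  # would subclass SemanticNet's DomainAdapter in practice
    """Parse raw session JSONL into structured turns, then strip noise."""

    def load_document(self, path):
        """Parse one JSONL transcript into a list of turn dicts."""
        turns = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                turns.append({
                    "role": record.get("role", "unknown"),
                    "text": record.get("text", ""),
                })
        return turns

    def preprocess(self, turns):
        """Drop low-info turns; the real rules (routine tool output,
        system reminders, hook output) are richer than this."""
        low_info = {"yes", "ok", "cnp"}
        return [t for t in turns if t["text"].strip().lower() not in low_info]
```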

## Chunking Strategy

Sessions are turn-based, not time-based. An episode is a stretch of conversation about one topic/task.

**Two modes:**
1. **Fixed window** — every N turns (fallback, fast, no LLM)
2. **LLM topic detection** — feed turn summaries to flash, get episode boundaries
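Mode 1 is trivial to sketch; the window size shown is an arbitrary placeholder, not a tuned value:

```python
def chunk_fixed_window(turns, window=8):
    """Fallback chunker: split a turn list into fixed-size episodes.
    No LLM call; the last chunk may be shorter than the window."""
    return [turns[i:i + window] for i in range(0, len(turns), window)]
```

Mode 2 would replace the uniform boundaries with ones returned by a flash call over per-turn summaries.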

### Turn parsing

Each turn includes:
- User messages (full text)
- Assistant decisions and reasoning (summarized)
- Errors encountered (full error text)
- Tool calls (name + target, not full output)

Dropped:
- Routine tool output (file contents, grep results)
- Low-info user messages ("yes", "ok", "cnp")
- System reminders, hook output
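The keep/drop rules above amount to a per-turn renderer. A minimal sketch, with field names (`tool_calls`, `errors`, `name`, `target`) chosen for illustration:

```python
def render_turn(turn):
    """Render one parsed turn for chunking: keep full user/assistant text
    and full error text, compress tool calls to name + target (no output)."""
    label = "USER" if turn["role"] == "user" else "ASSISTANT"
    parts = [f"{label}: {turn['text']}"]
    for call in turn.get("tool_calls", []):
        parts.append(f"TOOL: {call['name']} -> {call.get('target', '')}")
    for err in turn.get("errors", []):
        parts.append(f"ERROR: {err}")  # full error text, never truncated
    return "\n".join(parts)
```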

### Episode triage

Before extraction, a cheap classifier (haiku/flash) tags each episode:

| Tag | Meaning | Action |
|-----|---------|--------|
| `routine` | Standard read/edit/commit | Skip extraction |
| `error_recovery` | Something broke, got fixed | Extract |
| `pivot` | Approach abandoned, new direction | Extract |
| `user_correction` | AI was wrong, user redirected | Extract |
| `design_decision` | Architectural choice made | Extract |

For a typical 20-episode session, only ~5-8 episodes make it to extraction.
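The triage step is a filter around the cheap classifier. In this sketch, `classify` is a hypothetical wrapper around the haiku/flash call that returns one tag per episode:

```python
# Tags from the table above that warrant extraction; `routine` is skipped.
EXTRACT_TAGS = {"error_recovery", "pivot", "user_correction", "design_decision"}


def triage(episodes, classify):
    """Tag each episode and keep only the non-routine ones, preserving
    order. Returns (tag, episode) pairs for the extraction stage."""
    kept = []
    for episode in episodes:
        tag = classify(episode)
        if tag in EXTRACT_TAGS:
            kept.append((tag, episode))
    return kept
```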

## Output

No new data model. Extraction outputs the same JSON shape that `learn add` already accepts:

```json
{
  "observation": "what specifically happened",
  "generalization": "broader pattern",
  "project": "rivus",
  "type": "pattern",
  "tags": ["episode:auth-bug-fix", "session:abc123"]
}
```

Episode topic and session ID go in tags. Evidence turn indices can go in tags too (`turn:5`, `turn:6`). This feeds directly into `learning.db` instances and gets auto-linked to principles — no translation layer.
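Shaping an extracted insight into that payload is a small pure function. The helper name and the `insight` dict shape are illustrative, but the output matches the JSON above:

```python
def to_learn_add_payload(insight, episode_topic, session_id, turn_indices):
    """Shape one extracted insight into the JSON `learn add` accepts.
    Provenance (episode, session, evidence turns) rides in tags, so no
    new data model or translation layer is needed."""
    tags = [f"episode:{episode_topic}", f"session:{session_id}"]
    tags += [f"turn:{i}" for i in turn_indices]
    return {
        "observation": insight["observation"],
        "generalization": insight["generalization"],
        "project": insight["project"],
        "type": insight["type"],
        "tags": tags,
    }
```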

## Extraction

Per-episode LLM call with the raw episode text (full fidelity). Extracts 0-2 insights per episode.

Prompt receives:
- Full episode text (actual errors, actual user corrections, actual tool calls)
- Taxonomy guidance: error patterns, efficiency losses, conventions discovered, approaches that worked
- Context: what project, session duration, adjacent episode topics
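Assembling those three ingredients into the per-episode prompt might look like this; the exact wording is illustrative, not the production prompt:

```python
def build_extraction_prompt(episode_text, project, duration, adjacent_topics):
    """Join instruction, taxonomy guidance, context, and the raw episode
    text (full fidelity, verbatim) into one extraction prompt."""
    return "\n\n".join([
        "Extract 0-2 concrete learnings from this raw session episode.",
        "Look for: error patterns, efficiency losses, conventions "
        "discovered, approaches that worked.",
        f"Context: project={project}, session duration={duration}, "
        f"adjacent episode topics={', '.join(adjacent_topics)}",
        "EPISODE (verbatim):\n" + episode_text,
    ])
```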

## Eval (not gym)

Model × prompt sweep on extraction quality. Test whether flash, grok-fast, or haiku produces insights comparable to opus's. Reuse the eval infrastructure from `learning/gyms/llm_task/` (judge scoring, HTML reporting).

A real gym (active learning, curriculum) can come later — for now, eval sweeps answer the practical question of model selection.
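The sweep itself is just a grid over the Cartesian product. Here `run_extraction` and `score` are hypothetical wrappers around the existing llm_task harness and judge, and the prompt-variant names are placeholders:

```python
import itertools

MODELS = ["flash", "grok-fast", "haiku", "opus"]  # candidates from the doc
PROMPTS = ["v1", "v2"]                            # hypothetical prompt variants


def sweep(run_extraction, score):
    """Run every model x prompt combination and collect judge scores,
    keyed by (model, prompt) for the report."""
    results = {}
    for model, prompt in itertools.product(MODELS, PROMPTS):
        insights = run_extraction(model=model, prompt=prompt)
        results[(model, prompt)] = score(insights)
    return results
```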

## Deployment

1. Build as offline batch tool (`inv learning.extract` or `python -m learning.session_extract`)
2. Validate quality against current learning_worker output
3. Migrate into SessionEnd hook later if quality is better

## Watch for Reuse

SemanticNet does similar chunking + extraction for other domains (VIC writeups, YouTube transcripts). As session extraction matures, factor out shared patterns:
- Turn-based chunking (vs time-based) could generalize to any conversational transcript
- Episode triage pattern could apply to any long document with variable signal density
- The "extract from chunks, not from summary" pattern is domain-agnostic
