# SemanticNet — Logbook

Generic YouTube channel processing: pool → pipeline → topic labeling → frame grabs → portal.

## 2026-02-24: Initial Build + First Iteration on Topic Granularity

### What Was Built
- **Pool resolver** (`pool/resolve.py`) — ranked video list from jobs DB, sorted by recency, age cutoffs
- **Generic pipeline** (`pipeline.py`) — chunk → embed → store, any channel via `pool/config.yaml`
- **Search API** (`api.py`) — FastAPI: text search, semantic search, autocomplete, videos, concepts, themes endpoints
- **Topic labeling** (`concepts.py`) — chronological topic walkthrough per video via LLM
- **Frame extraction** (`frames.py`) — yt-dlp URL resolve + ffmpeg frame grab at topic midpoints (~0.5s/frame)
- **Portal generator** (`portal/generate.py`) — static HTML with thumbnails, external links, topic badges, topic transcript view with frame grabs
- **Theme clustering** (`concepts.py cluster_themes`) — HDBSCAN on concept embeddings, LLM-named themes
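
The frame extraction step above (one frame at each topic midpoint) can be sketched roughly as follows. This is a minimal illustration, not the contents of `frames.py`: the function names and the `-q:v 2` quality setting are assumptions, and `stream_url` stands for whatever direct media URL yt-dlp resolved.

```python
from typing import List


def midpoint_seconds(start: float, end: float) -> float:
    """Return the midpoint of a topic span, in seconds."""
    if end < start:
        raise ValueError("end must not precede start")
    return (start + end) / 2.0


def ffmpeg_frame_command(
    stream_url: str, start: float, end: float, out_path: str
) -> List[str]:
    """Build an ffmpeg argument list that grabs one frame at the span midpoint.

    Hypothetical helper: the real frames.py may build its command differently.
    """
    ts = midpoint_seconds(start, end)
    return [
        "ffmpeg",
        "-ss", f"{ts:.2f}",  # seek before -i so ffmpeg jumps instead of decoding from 0
        "-i", stream_url,    # direct media URL (assumed to come from yt-dlp)
        "-frames:v", "1",    # write exactly one video frame
        "-q:v", "2",         # JPEG quality (assumed value)
        "-y",                # overwrite an existing output file
        out_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Placing `-ss` before `-i` is what keeps a single grab fast on long videos.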

### First Run: a16z Channel
- 5 videos processed (of 959 available), 127 chunks, 115 topic labels, 115 frames extracted
- DB: `lib/semnet/data/content.db` (unified, all channels share one DB filtered by `channel` column)
- Portal at `projects/a16z/portal/` — index, 5 video pages, 5 topic pages, 115 frame JPEGs
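
A sketch of what "one DB, filtered by `channel`" looks like in practice. Only the `channel` column is confirmed by the notes above; the `chunks` table name and the other columns are assumptions, and an in-memory database stands in for `lib/semnet/data/content.db`.

```python
import sqlite3

# In-memory stand-in for lib/semnet/data/content.db.
conn = sqlite3.connect(":memory:")

# Assumed schema: only the `channel` column is known from the logbook.
conn.execute(
    "CREATE TABLE chunks ("
    "id INTEGER PRIMARY KEY, channel TEXT, video_id TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO chunks (channel, video_id, text) VALUES (?, ?, ?)",
    [
        ("a16z", "vid1", "first chunk"),
        ("a16z", "vid1", "second chunk"),
        ("other", "vid9", "unrelated"),
    ],
)


def chunks_for_channel(conn: sqlite3.Connection, channel: str):
    """Return (video_id, text) rows for one channel, in insertion order."""
    rows = conn.execute(
        "SELECT video_id, text FROM chunks WHERE channel = ? ORDER BY id",
        (channel,),
    )
    return rows.fetchall()
```

Every query in a shared DB has to carry the `channel` filter, so routing reads through one helper like this keeps channels from leaking into each other.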

### Problem: Topic Granularity Too Fine
- **115 concepts from 5 videos** = ~23 per video. Way too many.
- LLM creates a new topic label for every minor conversational shift
- Examples of over-splitting from one 15-min video:
  - "cosmic microwave background discovery" vs "cosmic microwave background research" vs "significance of cmb map" — really one topic
  - "john mather's career path" vs "john mather's early life" — same topic
- Result: transcript view is a stack of small folded cards, each with 1-2 paragraphs. Fragmented, not browsable.

### Problem: Topics Not Navigable
- Topic labels are just text annotations — not clickable, no per-topic pages
- No way to see "everything about X across all videos"
- No related topics or cross-video connections visible

### Design Direction (Agreed)

**Core model:** Topics are the primary browsable unit, not chunks or videos.

**Hierarchy:**
```
Index → Concept page → Video at timestamp
```

**Per-concept pages** (`concepts/{slug}.html`):
- All occurrences across videos — frame grab + excerpt + "watch at 5:15" link
- Stats: "discussed in 3 videos, 7 minutes total"
- Related topics: co-occurring or adjacent topics
- Source links back to video at exact timestamp
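
The stats line and the "watch at" label could be derived from a concept's occurrences along these lines. The `Occurrence` type and its field names are hypothetical; the real occurrence records may be shaped differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Occurrence:
    """One appearance of a concept in a video (hypothetical shape)."""
    video_id: str
    start: float  # seconds
    end: float    # seconds


def concept_stats(occurrences: Iterable[Occurrence]) -> dict:
    """Count distinct videos, segments, and total seconds for one concept."""
    occ = list(occurrences)
    return {
        "videos": len({o.video_id for o in occ}),
        "segments": len(occ),
        "seconds": sum(o.end - o.start for o in occ),
    }


def stats_line(stats: dict) -> str:
    """Format the stats as a short human-readable line."""
    v = stats["videos"]
    minutes = round(stats["seconds"] / 60)
    return (
        f"discussed in {v} video{'s' if v != 1 else ''}, "
        f"{minutes} minute{'s' if minutes != 1 else ''} total"
    )


def watch_label(seconds: float) -> str:
    """Format a timestamp as 'watch at M:SS'."""
    m, s = divmod(int(seconds), 60)
    return f"watch at {m}:{s:02d}"
```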

**Topics everywhere are clickable** — transcript view, index badges, search results all link to concept page.

**Fix granularity at the source:**
- Retune labeling prompt: 5-8 broad topics per video, not 20+
- Chapter-level, not paragraph-level: "John Mather's background", "cosmic microwave background", "James Webb telescope" — done for a 15-min video
- Theme clustering exists as backup for cross-channel grouping

**Transcript view: continuous flow**
- Not separate cards per topic
- Continuous text with topic labels as colored sidebar markers or inline headers
- Small topics don't get same visual weight as 3-minute topics

### Iteration 2: Coarser Topic Labels
Retuned prompt: "4-8 big subjects, chapter-level not paragraph-level, merging is better than splitting."
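
For illustration, a prompt builder in this spirit might look like the sketch below. The wording is a paraphrase assembled from the summary above, not the actual prompt in `concepts.py`; the function name, parameters, and the `(start_seconds, text)` chunk shape are all assumptions.

```python
from typing import List, Tuple


def build_labeling_prompt(
    chunks: List[Tuple[float, str]], min_topics: int = 4, max_topics: int = 8
) -> str:
    """Assemble a chapter-level labeling prompt (hypothetical wording).

    chunks: (start_seconds, text) pairs in chronological order.
    """
    lines = [
        f"Label this transcript with {min_topics}-{max_topics} big subjects.",
        "Work at chapter level, not paragraph level.",
        "When in doubt, merge: merging is better than splitting.",
        "Return one label per chunk, reusing a label across consecutive chunks.",
        "",
    ]
    for i, (start, text) in enumerate(chunks):
        m, s = divmod(int(start), 60)
        lines.append(f"[{i}] {m}:{s:02d} {text}")  # index, timestamp, chunk text
    return "\n".join(lines)
```

Stating the topic budget as an explicit range gives the model a target to merge toward, instead of leaving granularity implicit.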

- **Before:** 115 concepts from 5 videos (~23/video)
- **After:** 29 concepts from 5 videos (~6/video)

Example — "Can You Prove The Big Bang Theory?" (15 min):
- Before: 9 topics (cosmic microwave background discovery, john mather's career path, john mather's early life, cosmic microwave background research, origins of cosmic structure, quantum mechanics and atomic structure, gravity's role in cosmic evolution, significance of cmb map, james webb space telescope mission)
- After: 5 topics (innovations through space exploration, John Mather's early life, proving the big bang theory, cosmic microwave background discovery, James Webb Space Telescope)

The 5-topic version reads like a natural chapter outline. The main segment, "proving the big bang theory", now spans 7 chunks (3:45-10:15); before, the same stretch was split into 4 separate micro-topics.
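
A topic "spanning" several chunks means that consecutive chunks with the same label collapse into one span. A minimal sketch, assuming each chunk is a `(start, end, label)` tuple in chronological order (the real chunk records may differ):

```python
from itertools import groupby
from typing import Dict, List, Tuple

Chunk = Tuple[float, float, str]  # (start_seconds, end_seconds, label), assumed shape


def collapse_labels(chunks: List[Chunk]) -> List[Dict]:
    """Merge runs of consecutive chunks that share a label into topic spans."""
    spans = []
    for label, run in groupby(chunks, key=lambda c: c[2]):
        run = list(run)
        spans.append(
            {
                "label": label,
                "start": run[0][0],   # start of the first chunk in the run
                "end": run[-1][1],    # end of the last chunk in the run
                "chunks": len(run),
            }
        )
    return spans
```

Because `groupby` only merges adjacent items, a label that recurs later in the video yields a second span rather than being folded into the first, which preserves the chronological walkthrough.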

### Iteration 3: Per-Concept Pages + Clickable Topics

Built `concepts/{id}-{slug}.html` pages with:
- Stats bar: video count, segment count, total duration
- Occurrence cards: video title (→ video page), timestamp range (→ YouTube), frame thumbnail, text excerpt
- Related Topics section: other concepts that co-occur in the same videos, all clickable
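
The Related Topics section rests on co-occurrence within the same videos. One way to compute it, assuming a hypothetical mapping from video ID to the set of concept names in that video:

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def related_topics(
    concept: str, video_topics: Dict[str, Set[str]]
) -> List[Tuple[str, int]]:
    """Rank other concepts by how many videos they share with `concept`.

    video_topics: video_id -> set of concept names (assumed shape).
    """
    counts: Counter = Counter()
    for topics in video_topics.values():
        if concept in topics:
            counts.update(t for t in topics if t != concept)
    # Most shared videos first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

With only 5 videos, most related topics will share exactly one video, so the ranking becomes more informative as more of the channel is processed.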

Made all topic names clickable throughout the portal:
- **Index**: 29 topic badges link to concept pages
- **Topic transcript view**: topic headers in colored sections link to concept pages
- **Video chunk view**: topic labels on each chunk link to concept pages
- **Concept pages**: related topics link to sibling concept pages
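
Since every view links to the same concept pages, a single link helper keeps the `concepts/{id}-{slug}.html` scheme consistent. The slug rules below are an assumption (lowercase, non-alphanumeric runs become hyphens), and the path is relative to the portal root, so pages in subdirectories would need a prefix.

```python
import re
from html import escape


def slugify(name: str) -> str:
    """Lowercase the name and replace non-alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "topic"  # fallback for names with no usable characters


def concept_href(concept_id: int, name: str) -> str:
    """Build the concept page path, relative to the portal root."""
    return f"concepts/{concept_id}-{slugify(name)}.html"


def concept_link(concept_id: int, name: str) -> str:
    """Render a topic name as an HTML link to its concept page."""
    return f'<a href="{concept_href(concept_id, name)}">{escape(name)}</a>'
```

Including the ID in the filename means two concepts whose names slugify identically still get distinct pages.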

Navigation hierarchy now complete:
```
Index (29 clickable topics) → Concept page (occurrences + related) → Video at timestamp (YouTube)
                            ↘ Video page (chunk view)
                            ↘ Topic transcript view (continuous colored sections)
```

Cleared stale theme data: the themes were built from the old 115-concept run, so the index falls back to a flat topic list until themes are re-clustered.

### Next Steps
1. Redesign transcript view as continuous flow (sidebar markers instead of cards)
2. Re-run theme clustering with new 29 concepts
3. Process more a16z videos (only 5 of 959 done)
4. Add concept nav (prev/next) to concept pages