# PLTR Content Discovery — Expansion Design Task

## Context

Current discovery is narrow: YouTube searches of the form `{leader} × {month}`,
covering only two leaders. This misses most of the valuable content: employee
interviews, bootcamp footage, customer case studies, podcast appearances,
conference panels, and written interviews.

## Goal

Build a **broad discovery funnel** that finds all substantive Palantir content,
scores it for insightfulness, and prioritizes the most valuable pieces for
transcription and analysis.

## Discovery Sources (beyond YouTube search)

| Source                | Content Type                          | Method                    |
|-----------------------|---------------------------------------|---------------------------|
| YouTube channels      | Official Palantir channel, AIPCon     | Channel scan (yt-dlp)     |
| YouTube search        | Interviews, panels, keynotes          | Query expansion           |
| Podcasts              | Long-form interviews (Lex, All-In)    | RSS feeds, podcast APIs   |
| Conference sites      | Davos, Web Summit, defense conferences| Web scraping (brain/)     |
| News sites            | Bloomberg, CNBC, Reuters interviews   | Web search + extraction   |
| Substack/blogs        | Written long-form interviews          | RSS, web search           |
| X (Twitter)           | Interview links, customer mentions    | Search API, Bright Data   |
| Text publications     | In-depth culture/strategy interviews  | Web search + brain/       |
| Reddit/HN             | AMA threads, employee posts           | API search                |
| LinkedIn              | Employee thought pieces               | Hard to scrape            |
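The podcast row above can be served by plain RSS parsing. A minimal sketch, using an inline sample feed; the feed structure follows the standard iTunes podcast namespace, but the keyword filter and episode tuple shape are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Sample feed standing in for a real podcast RSS URL (illustrative only).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Interview with a Palantir engineer</title>
      <link>https://example.com/ep42</link>
      <itunes:duration>01:23:45</itunes:duration>
    </item>
    <item>
      <title>Markets roundup</title>
      <link>https://example.com/ep43</link>
      <itunes:duration>00:30:00</itunes:duration>
    </item>
  </channel>
</rss>"""

ITUNES = "{http://www.itunes.com/dtds/podcast-1.0.dtd}"

def parse_feed(xml_text, keyword="palantir"):
    """Yield (title, link, duration) for episodes mentioning the keyword."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        title = item.findtext("title", "")
        if keyword.lower() not in title.lower():
            continue
        yield (title, item.findtext("link", ""),
               item.findtext(f"{ITUNES}duration", ""))

episodes = list(parse_feed(SAMPLE_FEED))
```

In practice the same parser runs over fetched feed bodies for each known show (Lex, All-In, etc.), with the keyword filter widened to the query-expansion terms below.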

## Query Expansion

Beyond named leaders, discover content about:
- **Employees by role**: "Palantir engineer interview", "Palantir deployment strategist"
- **Products**: "Palantir AIP demo", "Palantir Foundry walkthrough"
- **Customers**: "Palantir customer case study", "[company] using Palantir"
- **Events**: "AIPCon", "Palantir bootcamp", "Palantir FedStart"
- **Culture**: "working at Palantir", "Palantir culture", "Palantir hiring"
- **Founders & origins**: "Peter Thiel Palantir", "Joe Lonsdale Palantir", "Palantir founding story", "Palantir PayPal mafia"
- **Thiel on Palantir**: "Thiel Palantir vision", "Thiel intelligence software", "Thiel Palantir interview" (spans 2004–present)
- **Customer perspectives**: "Palantir customer interview", "[customer] Palantir experience", "Palantir deployment results"
- **In-depth text interviews**: "Palantir interview site:wired.com OR site:forbes.com OR site:ft.com", "Karp interview long-form"
- **X/Twitter discovery**: "Palantir interview link", "Palantir customer from:@handle", "[leader] interview thread"
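The expansion above is mechanical: cross entity lists with query templates, plus a set of fixed-string queries. A sketch, where the specific entities and templates are illustrative subsets of the lists above:

```python
from itertools import product

LEADERS = ["Alex Karp", "Shyam Sankar"]
PRODUCTS = ["AIP", "Foundry", "Gotham"]
TEMPLATES = {
    "leader": ["{e} interview", "{e} keynote", "{e} podcast"],
    "product": ["Palantir {e} demo", "Palantir {e} walkthrough"],
    "culture": ["working at Palantir", "Palantir culture", "Palantir hiring"],
}

def expand_queries():
    """Cross each entity list with its templates; append fixed queries."""
    queries = []
    for entity, tmpl in product(LEADERS, TEMPLATES["leader"]):
        queries.append(tmpl.format(e=entity))
    for entity, tmpl in product(PRODUCTS, TEMPLATES["product"]):
        queries.append(tmpl.format(e=entity))
    queries.extend(TEMPLATES["culture"])  # fixed-string queries, no entity
    return queries

qs = expand_queries()
```

Adding a category (customers, events, founders) is then a data change, not a code change: new entity list, new template set.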

## Scoring & Prioritization

Two-phase scoring:

### Phase 1: Cheap metadata scoring (before download)
- Duration (longer = likely more substantive)
- Title/description keyword analysis
- Source quality (known good channels/podcasts)
- Recency
- Dedup against already-processed content
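A sketch of how these signals might combine, assuming a flat additive score; the weights, keyword list, and trusted-source set are illustrative starting points meant to be refined by the learning loop:

```python
from datetime import date

KEYWORDS = {"interview": 2.0, "panel": 1.5, "keynote": 1.5, "bootcamp": 2.0}
TRUSTED_SOURCES = {"palantir_official", "lex_fridman", "all_in"}

def metadata_score(item, seen_urls):
    """Cheap pre-download score from metadata alone."""
    if item["url"] in seen_urls:  # dedup against already-processed content
        return 0.0
    score = 0.0
    # Longer = likely more substantive; benefit capped at ~1 hour.
    score += min(item.get("duration_min", 0) / 60.0, 1.0) * 3.0
    title = item.get("title", "").lower()
    score += sum(w for kw, w in KEYWORDS.items() if kw in title)
    if item.get("source") in TRUSTED_SOURCES:
        score += 2.0
    # Mild recency bonus only, since historical interviews can be high value.
    age_days = (date.today() - item["published"]).days
    score += max(0.0, 1.0 - age_days / 3650.0)
    return score

item = {"url": "https://youtu.be/abc", "duration_min": 90,
        "title": "Alex Karp interview at Davos", "source": "lex_fridman",
        "published": date.today()}
s = metadata_score(item, seen_urls=set())
```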

### Phase 2: LLM scoring (after transcript exists)
- Candor vs PR talking points
- Novel information density
- Strategic insight level
- Specificity (concrete examples vs generalities)

Scoring function is **subject to learning** — human ratings refine weights over time.
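One way the learning could work, sketched under the assumption that the final score is a weighted sum of the Phase 2 features: each human rating nudges the weights toward agreement via a simple online least-squares step. Feature names and the learning rate are illustrative:

```python
def weighted_score(features, weights):
    """Final score as a weighted sum of per-item feature values."""
    return sum(weights[k] * v for k, v in features.items())

def update_weights(weights, features, human_rating, lr=0.05):
    """Move weights toward the human rating (one online gradient step)."""
    error = human_rating - weighted_score(features, weights)
    return {k: w + lr * error * features.get(k, 0.0)
            for k, w in weights.items()}

weights = {"candor": 1.0, "novelty": 1.0, "specificity": 1.0}
feats = {"candor": 0.9, "novelty": 0.2, "specificity": 0.5}
# A human rated this item higher than the model scored it, so the weights
# on its strongest features (candor here) move up the most.
new_w = update_weights(weights, feats, human_rating=3.0)
```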

## Architecture Decision

Two separate jobs (see Two-Job Architecture below). Discovery is its own job with
multi-source strategies; processing is a second job that reads discovery's completed items.
The jobs framework handles both — no new infrastructure needed.

## Implementation Phases

1. **YouTube expansion** — more queries, official channel scan, broader terms
2. **Podcast integration** — RSS feed parsing for known podcasts
3. **Web search** — brain/ integration for news sites and conferences
4. **Scoring pipeline** — metadata scoring → priority → LLM re-scoring post-transcript
5. **Learning loop** — feedback UI, weight adjustment

## Dependencies

- `brain/` for web content extraction (paywalls, JS rendering)
- LLM API for scoring (haiku/flash-lite, cheap)
- Podcast API or RSS parsing
- Possibly Bright Data for geo-restricted content

## Content Types Beyond Video

### Text Interviews & Articles
- Long-form written interviews (Wired, Forbes, FT, Bloomberg Businessweek, NYT)
- Culture deep-dives, employee profiles, "what it's like to work at Palantir"
- Customer testimonials and case studies (often on vendor/consulting sites)
- Stages: `discover → fetch → extract → score` (brain/ handles paywall/JS)
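The four stages compose linearly. A sketch with the stage bodies stubbed out; in practice `fetch` and `extract` would call into brain/ for paywall handling and JS rendering:

```python
def discover(query):
    """Stub discovery: in practice, web search results for the query."""
    slug = query.replace(" ", "-")
    return [{"url": f"https://example.com/{slug}", "stage": "discovered"}]

def fetch(item):
    item["html"] = "<html>stub page</html>"  # brain/ fetch in practice
    item["stage"] = "fetched"
    return item

def extract(item):
    item["text"] = "extracted article body"  # brain/ extraction in practice
    item["stage"] = "extracted"
    return item

def score(item):
    item["score"] = min(len(item["text"]) / 1000.0, 1.0)  # placeholder metric
    item["stage"] = "scored"
    return item

def run_pipeline(query):
    return [score(extract(fetch(it))) for it in discover(query)]

items = run_pipeline("palantir karp interview")
```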

### Customer Perspectives
- Add to both YT and X discovery: customer names, deployment stories, before/after
- Known customer verticals: defense, healthcare (NHS, HHS), energy, finance, manufacturing
- Search for "[customer org] Palantir" across all sources
- Customer conference talks (e.g., NHS staff presenting Palantir Foundry results)

### Founding Story & Architectural Vision
- **Thiel as architect**: His original thesis (PayPal fraud detection → intelligence community), how his thinking evolved over 20 years. Interviews, Stanford lectures, book excerpts (Zero to One chapter on secrets/Palantir).
- **Lonsdale**: Co-founder perspective, early product decisions, why he left, how he talks about it now (8VC context). Often more candid than Karp on early days.
- **Origin arc**: PayPal mafia → In-Q-Tel funding → CIA/intelligence roots → commercial pivot → AIP/LLM era. Track how the narrative shifts across eras.
- **Key search terms**: "Palantir founding", "Palantir origin story", "Thiel Palantir Stanford", "Lonsdale Palantir early days", "In-Q-Tel Palantir"
- **Historical interviews are high value** — a 2010 Thiel interview about Palantir's purpose is more revealing than a 2025 earnings call

### X (Twitter) as Link Discovery
- X posts often link to in-depth interviews elsewhere — treat X as a **link source**, not just content
- Search patterns: `palantir interview url:`, `karp long-form`, `palantir customer thread`
- Extract linked URLs → feed into brain/ for full extraction
- Also capture substantive X threads themselves (multi-tweet analysis threads)
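The link-source treatment can be sketched as a two-output pass over post text: outbound URLs (excluding links back into X itself) go to brain/, and sufficiently long threads are kept as content. The URL regex, field names, and thread-length cutoff are illustrative assumptions:

```python
import re

URL_RE = re.compile(r"https?://[^\s\)\]]+")

def extract_links(posts):
    """Return (urls, threads): outbound links plus threads worth keeping."""
    urls, threads = [], []
    for post in posts:
        found = URL_RE.findall(post["text"])
        # Keep only links pointing off-platform.
        urls.extend(u for u in found
                    if "x.com" not in u and "twitter.com" not in u)
        if post.get("thread_len", 1) >= 3:  # multi-tweet analysis thread
            threads.append(post)
    return urls, threads

posts = [
    {"text": "Great Karp interview: https://example.com/karp-2024",
     "thread_len": 1},
    {"text": "Thread on Palantir's NHS deployment (1/7)", "thread_len": 7},
]
urls, threads = extract_links(posts)
```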

## Two-Job Architecture

Split into two jobs rather than one job with a viewer bolted on:

| Job                        | Responsibility                        | Output                              |
|----------------------------|---------------------------------------|-------------------------------------|
| `pltr_discovery`           | Find content, metadata-score, dedup   | Items with URL, title, score, source|
| `pltr_content_processing`  | Fetch, transcribe/extract, LLM-score  | Transcripts, analysis, knowledge    |

The discovery job's completed ("done") items become the processing job's discovery source: a tracker query for items above the score threshold.

### Handoff: Discovery → Processing
Processing job's discovery strategy: query the tracker for `pltr_discovery` items with status `done` and metadata score above threshold. Each discovered item's metadata (URL, source type, title) becomes the processing item's input. This is just a new discovery strategy (`tracker_query`) — no new plumbing.
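A sketch of that strategy, with the tracker stubbed as a list of dicts; the field names (`job`, `status`, `meta_score`) are assumptions about the jobs framework's tracker schema:

```python
def tracker_query(tracker, job="pltr_discovery", min_score=5.0):
    """Yield processing-job inputs from the discovery job's completed items."""
    for item in tracker:
        if (item["job"] == job and item["status"] == "done"
                and item["meta_score"] >= min_score):
            yield {"url": item["url"], "source_type": item["source_type"],
                   "title": item["title"]}

tracker = [
    {"job": "pltr_discovery", "status": "done", "meta_score": 7.2,
     "url": "https://youtu.be/abc", "source_type": "yt",
     "title": "Karp at Davos"},
    {"job": "pltr_discovery", "status": "done", "meta_score": 2.1,
     "url": "https://youtu.be/def", "source_type": "yt",
     "title": "Short clip"},
]
inputs = list(tracker_query(tracker))  # only the above-threshold item passes
```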

### Dashboard: Custom Item Renderers

The existing jobs dashboard already shows items per job — but each job needs a **custom grid renderer** for its items rather than a generic row:

**Discovery job grid columns**:
- Title (clickable link to source)
- Source type (YT / X / Web / Podcast)
- Query that found it
- Metadata score
- Duration / word count
- Date discovered

**Processing job grid columns**:
- Title (linked)
- Stage progress (fetch ✓ → extract ✓ → score ⏳)
- Content preview (first 100 chars of transcript)
- LLM score (post-processing)

**Implementation**: Each handler defines a `dashboard_columns()` method returning a list of column defs (name, accessor, format), and the dashboard renders items as table rows with those columns. Handlers that don't define it get the generic key/status columns. This is just table rows with custom columns and formatting — no special grid or card layouts.
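A sketch of that contract; the handler class, item field names, and `(name, accessor, formatter)` tuple shape are assumptions about the dashboard's handler API:

```python
# Fallback columns for handlers that don't define dashboard_columns().
GENERIC_COLUMNS = [("Key", lambda it: it["key"], str),
                   ("Status", lambda it: it["status"], str)]

class DiscoveryHandler:
    def dashboard_columns(self):
        return [
            ("Title", lambda it: it["title"], str),
            ("Source", lambda it: it["source_type"], str.upper),
            ("Score", lambda it: it["meta_score"], lambda s: f"{s:.1f}"),
        ]

def render_rows(handler, items):
    """Map items to table cells using the handler's columns, if any."""
    cols = getattr(handler, "dashboard_columns", lambda: GENERIC_COLUMNS)()
    return [[fmt(accessor(it)) for _, accessor, fmt in cols] for it in items]

rows = render_rows(DiscoveryHandler(),
                   [{"title": "Karp at Davos", "source_type": "yt",
                     "meta_score": 7.2}])
```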

## Open Questions

- How aggressively to expand? Hundreds of items? Thousands? (The score threshold controls this.)
- What's the end use — knowledge base? Investment research? Both?
- Should processing job auto-ingest above threshold, or require human approval in dashboard?
