Computationally assess the originality, quality, and fertility of ideas in writing — with a live demo on a16z's "Big Ideas 2026"
You read a piece of writing—an essay, a strategy doc, a research paper—and you want to answer: Are these ideas any good? Not "is the prose nice" (that's a style question), but: Are the claims specific enough to test? Are they actually novel, or just well-packaged conventional wisdom? Do they open productive lines of thinking, or are they dead ends?
This system decomposes that vague judgment into 9 measurable dimensions, scores each one using calibrated LLM evaluation, and reports a profile that separates substance from style from originality.
The pipeline runs in four stages:

1. **Extract** (Stage 0): pull claims & theses from your text
2. **Landscape** (Stage 1): search the web, curate a reference corpus
3. **Corpus Extract** (Stage 2): extract ideas from confirmed sources
4. **Score** (Stage 3): evaluate against 9 dimensions
Stage 0 uses an LLM to extract two types of ideas: claims (atomic, falsifiable assertions like "X causes Y because Z") and theses (higher-level framings, mental models, surprising connections). Each gets a confidence score.
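A minimal sketch of what Stage 0's parsed output might look like. The field names here (`text`, `kind`, `confidence`) are illustrative assumptions, not the repo's actual `models.py` schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Idea:
    # Hypothetical shape; the real models.py may differ.
    text: str                          # the claim or thesis itself
    kind: Literal["claim", "thesis"]   # atomic falsifiable assertion vs. higher-level framing
    confidence: float                  # extractor's confidence, 0-1

# Stage 0 parses the extractor LLM's JSON into records like these:
raw = [
    {"text": "X causes Y because Z", "kind": "claim", "confidence": 0.9},
    {"text": "Reframe A as a special case of B", "kind": "thesis", "confidence": 0.75},
]
ideas = [Idea(**r) for r in raw]
```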
Stage 1 is interactive: the system auto-generates search terms from your ideas, searches the web via the Serper API (a Google Search API that returns structured organic, news, and scholar results for programmatic querying), and presents candidate sources. You accept or reject them before any analysis happens. This gives you control over what "the existing landscape" means.
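The accept/reject step reduces to a simple filter. `curate` and `decide` are hypothetical names standing in for whatever prompt the UI actually shows:

```python
def curate(candidates, decide):
    """Keep only the sources the user approves; nothing else enters the corpus."""
    return [c for c in candidates if decide(c)]

# Example: a decide function that only trusts one domain.
candidates = [
    {"url": "https://example.com/essay", "domain": "example.com"},
    {"url": "https://spam.example/post", "domain": "spam.example"},
]
approved = curate(candidates, lambda c: c["domain"] == "example.com")
```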
Stage 2 fetches your approved sources and extracts ideas from them too, building the reference corpus that novelty is measured against.
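Once the corpus exists, semantic novelty can be computed as distance to the nearest corpus idea in embedding space. The real system uses sqlite-vec over actual embeddings; this is a toy sketch with 2-D vectors and an illustrative 1–10 scaling:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_novelty(idea_vec, corpus_vecs):
    """Novelty = distance to the NEAREST corpus idea, mapped onto 1-10 (assumed scaling)."""
    if not corpus_vecs:
        return 10.0                            # empty landscape: everything looks novel
    nearest = max(cosine(idea_vec, v) for v in corpus_vecs)
    return round(1 + 9 * (1 - nearest), 1)     # identical -> 1, orthogonal -> 10
```

An idea whose embedding sits on top of a corpus idea scores 1; one orthogonal to everything scores 10.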
Stage 3 scores each of your ideas on 9 dimensions using batched LLM calls: 2–3 dimensions per call, a research-backed sweet spot (CheckEval found that evaluating 1–3 criteria per LLM call produces more accurate scores than dumping all criteria into one prompt). Hard dimensions (novelty, fertility) get multi-model consensus: 3 models, take the median.
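The batching and consensus logic can be sketched in a few lines. The dimension identifiers are illustrative, and the set of "hard" dimensions is inferred from the novelty and fertility groups:

```python
import statistics

DIMENSIONS = [
    "claim_precision", "internal_coherence", "evidential_grounding",
    "semantic_novelty", "framing_novelty",
    "clarity", "rhetorical_force",
    "generativity", "composability",
]
HARD = {"semantic_novelty", "framing_novelty", "generativity", "composability"}

def batches(dims, size=3):
    """Split the 9 dimensions into small per-call batches (CheckEval-style)."""
    return [dims[i:i + size] for i in range(0, len(dims), size)]

def consensus(model_scores):
    """Median of the 3 models' scores: robust to one outlier judge."""
    return statistics.median(model_scores)
```

So 9 dimensions become 3 LLM calls per model, and each hard dimension's final score is the median of 3 models rather than one model's opinion.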
9 dimensions in 4 groups. Each scored 1–10. Per-group aggregation uses floor scores (minimum, not average)—because a highly novel idea with zero coherence is worthless, and averaging hides that.
| Group | Dimension | What It Measures |
|---|---|---|
| Substance (does it say something real?) | Claim Precision | How specific and falsifiable is the core claim? |
| | Internal Coherence | Does the argument's structure hold together? |
| | Evidential Grounding | Is it anchored in evidence, not just assertion? |
| Novelty (does it add something new?) | Semantic Novelty | How distant is it from the existing landscape corpus? |
| | Framing Novelty | Does it reframe the problem to open new avenues? |
| Expression (how well communicated?) | Clarity | Can a reader extract the core claim unambiguously? |
| | Rhetorical Force | Does the expression serve the idea effectively? |
| Fertility (does it open further thinking?) | Generativity | Does it suggest new questions or research directions? |
| | Composability | Can it be productively combined with other ideas? |
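The floor-score aggregation described above is a one-liner. Group and dimension names follow the table; the exact identifiers in `models.py` are an assumption:

```python
GROUPS = {
    "substance": ["claim_precision", "internal_coherence", "evidential_grounding"],
    "novelty": ["semantic_novelty", "framing_novelty"],
    "expression": ["clarity", "rhetorical_force"],
    "fertility": ["generativity", "composability"],
}

def group_scores(dim_scores):
    """Floor (min), not mean: one fatal weakness caps its whole group."""
    return {group: min(dim_scores[d] for d in dims) for group, dims in GROUPS.items()}

# A novel but incoherent idea: substance is capped by coherence = 2,
# even though precision (9) and grounding (8) are strong.
scores = {
    "claim_precision": 9, "internal_coherence": 2, "evidential_grounding": 8,
    "semantic_novelty": 9, "framing_novelty": 8,
    "clarity": 7, "rhetorical_force": 6,
    "generativity": 8, "composability": 5,
}
```

Averaging would give this idea's substance group a respectable 6.3; the floor reports the 2 that actually matters.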
We ran the system on a16z's Big Ideas 2026: Part 1—a dense piece where a16z's investment teams predict what's coming in AI, enterprise, and consumer tech.
Extraction took ~50 seconds (Sonnet, single pass over 21K chars of article text). The 5 highest-confidence theses were then scored on all 9 dimensions. Here are the results, ranked by overall quality:
```
ideas/
+-- models.py     # Idea, IdeaScore, EvaluationProfile, 9 DIMENSIONS, 4 GROUPS
+-- extract.py    # Stage 0 + 2: LLM-based claim/thesis extraction
+-- landscape.py  # Stage 1: search term gen + Serper search + source fetching
+-- eval.py       # Stage 3: batched 9-dim scoring + multi-model consensus
+-- store.py      # SQLite (structured) + sqlite-vec (embeddings) dual store
+-- ui/
|   +-- app.py    # UI at kb.localhost/ideas (consolidated into kb hub)
+-- demo.py       # CLI demo script
+-- data/         # SQLite DB + vector store + demo results
+-- tests/        # 27 tests, all passing
```
| Decision | Why |
|---|---|
| Batched eval (2–3 dims per LLM call) | CheckEval research: small batches outperform monolithic "score everything at once" prompts |
| Multi-model consensus for novelty + fertility | These are the hardest dimensions to score—3 models (Opus, Sonnet, Gemini), take median |
| Floor scores per group, not averages | Averages hide fatal flaws. A novel but incoherent idea should score low, not medium |
| Interactive landscape (Stage 1) | Novelty is relative to a corpus. You should validate what "existing work" means before scoring against it |
| Claims vs theses distinction | Different idea types need different evaluation emphasis—claims need precision, theses need framing novelty |
| Component | Source | What We Reuse |
|---|---|---|
| LLM calls | lib/llm | call_llm with model aliases, prompt caching, cost tracking |
| Web search | lib/discovery_ops | serper_search for landscape building |
| Content fetching | lib/ingest | fetch with caching, PDF/HTML/YouTube support |
| Vector search | lib/vectors | sqlite-vec for embedding similarity (novelty computation) |
| Batched eval pattern | draft/style/evaluate.py | 2–3 criteria per call with prompt cache optimization |
Generated 2026-03-03 · Source: a16z Big Ideas 2026: Part 1 · Evaluation model: Claude Sonnet 4.6 · 38 ideas extracted, 5 evaluated (20 LLM calls) · kb.localhost/ideas