Idea Evaluation System

Computationally assess the originality, quality, and fertility of ideas in writing — with a live demo on a16z's "Big Ideas 2026"

The Problem

You read a piece of writing—an essay, a strategy doc, a research paper—and you want to answer: Are these ideas any good? Not "is the prose nice" (that's a style question), but: Are the claims specific enough to test? Are they actually novel, or just well-packaged conventional wisdom? Do they open productive lines of thinking, or are they dead ends?

This system decomposes that vague judgment into 9 measurable dimensions, scores each one using calibrated LLM evaluation, and reports a profile that separates substance from style from originality.

How It Works

Stage 0 · Extract: Pull claims & theses from your text
Stage 1 · Landscape: Search web, curate reference corpus
Stage 2 · Corpus Extract: Extract ideas from confirmed sources
Stage 3 · Score: Evaluate against 9 dimensions

Stage 0 uses an LLM to extract two types of ideas: claims (atomic, falsifiable assertions like "X causes Y because Z") and theses (higher-level framings, mental models, surprising connections). Each gets a confidence score.
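A minimal sketch of the Stage 0 output shape described above. The names (`Idea`, `parse_extraction`) are illustrative, not the actual models.py definitions:

```python
# Hypothetical shape of a Stage 0 extraction result: each idea is either a
# claim (atomic, falsifiable) or a thesis (higher-level framing), with a
# confidence score attached by the extractor.
from dataclasses import dataclass
from typing import Literal

@dataclass
class Idea:
    kind: Literal["claim", "thesis"]  # claim = falsifiable assertion; thesis = framing
    text: str                         # the extracted assertion
    confidence: float                 # extractor confidence, 0.0-1.0

def parse_extraction(items: list[dict]) -> list[Idea]:
    """Convert the LLM's JSON output into typed Idea records, dropping
    anything that is neither a claim nor a thesis."""
    return [
        Idea(kind=i["kind"], text=i["text"], confidence=float(i["confidence"]))
        for i in items
        if i.get("kind") in ("claim", "thesis")
    ]
```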

Stage 1 is interactive: the system auto-generates search terms from your ideas, searches the web via the Serper API (a Google Search API that returns structured results: organic, news, scholar), and presents candidate sources. You accept or reject them before any analysis happens. This gives you control over what "the existing landscape" means.
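A hedged sketch of the Serper call behind Stage 1. It assumes the public Serper endpoint and response shape; the real `serper_search` in lib/discovery_ops may differ. Stdlib-only, with the response parsing split out so it can be tested without a network call:

```python
import json
import os
import urllib.request

def parse_serper(payload: dict) -> list[dict]:
    """Pick out title/url/snippet from Serper's 'organic' result list."""
    return [
        {"title": r.get("title"), "url": r.get("link"), "snippet": r.get("snippet")}
        for r in payload.get("organic", [])
    ]

def serper_search(query: str, num: int = 10) -> list[dict]:
    """POST a query to Serper and return candidate landscape sources.
    Assumes SERPER_API_KEY is set in the environment."""
    req = urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": query, "num": num}).encode(),
        headers={
            "X-API-KEY": os.environ["SERPER_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_serper(json.load(resp))
```

Each returned dict is one candidate source for you to accept or reject before corpus building.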

Stage 2 fetches your approved sources and extracts ideas from them too, building the reference corpus that novelty is measured against.

Stage 3 scores each of your ideas on 9 dimensions using batched LLM calls, 2–3 dimensions per call (the sweet spot identified by CheckEval, research showing that evaluating 1–3 criteria per LLM call produces more accurate scores than dumping all criteria at once). Hard dimensions (novelty, fertility) get multi-model consensus: 3 models, take the median.
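The batching and consensus steps above can be sketched as follows. The dimension names come from the taxonomy; the helper names are illustrative, not the real eval.py API:

```python
# Sketch of Stage 3 mechanics: split the 9 dimensions into CheckEval-style
# small batches, and resolve multi-model scores by taking the median.
from statistics import median

DIMENSIONS = [
    "claim_precision", "internal_coherence", "evidential_grounding",
    "semantic_novelty", "framing_novelty",
    "clarity", "rhetorical_force",
    "generativity", "composability",
]

def batch_dimensions(dims: list[str], size: int = 3) -> list[list[str]]:
    """Group dimensions into batches of 2-3 for separate LLM calls."""
    return [dims[i:i + size] for i in range(0, len(dims), size)]

def consensus(scores_by_model: dict[str, int]) -> int:
    """Multi-model consensus for hard dimensions: the median of 3 scores,
    which discards a single outlier model."""
    return int(median(scores_by_model.values()))
```

Taking the median rather than the mean means one over- or under-enthusiastic model cannot drag the score.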

The Evaluation Taxonomy

9 dimensions in 4 groups. Each scored 1–10. Per-group aggregation uses floor scores (minimum, not average)—because a highly novel idea with zero coherence is worthless, and averaging hides that.

Substance · Does it say something real?
  Claim Precision: How specific and falsifiable is the core claim?
  Internal Coherence: Does the argument's structure hold together?
  Evidential Grounding: Is it anchored in evidence, not just assertion?
Novelty · Does it add something new?
  Semantic Novelty: How distant from the existing landscape corpus?
  Framing Novelty: Does it reframe the problem to open new avenues?
Expression · How well communicated?
  Clarity: Can a reader extract the core claim unambiguously?
  Rhetorical Force: Does the expression serve the idea effectively?
Fertility · Does it open further thinking?
  Generativity: Does it suggest new questions or research directions?
  Composability: Can it be productively combined with other ideas?
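Semantic novelty is the one dimension above that is computed against the corpus rather than judged by an LLM. A minimal sketch, assuming unit-scale embeddings and a linear mapping from nearest-neighbor similarity to the 1–10 scale (the real system uses sqlite-vec and may map differently):

```python
# Illustrative semantic-novelty score: how far is this idea's embedding
# from its nearest neighbor in the landscape corpus?
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_novelty(idea_vec: list[float], corpus: list[list[float]]) -> int:
    """1 = a near-duplicate exists in the corpus, 10 = nothing remotely close."""
    nearest = max(cosine(idea_vec, v) for v in corpus)
    # Map similarity in [-1, 1] onto the 1-10 scale (higher = more novel).
    return round(1 + 9 * (1 - nearest) / 2)
```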
Why floor scores instead of averages? A brilliant thesis (novelty=9) that makes no falsifiable claim (substance=2) is intellectually exciting but practically useless. The floor score exposes the weakest link. Averages would give it a comfortable 5.5 and hide the fatal flaw.
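The floor-score aggregation is a one-liner per group. A minimal sketch, with group membership following the taxonomy (the function name is assumed, not the actual models.py API):

```python
# Per-group aggregation: each group's score is the MINIMUM of its
# dimensions, so the weakest link sets the score.
GROUPS = {
    "substance": ["claim_precision", "internal_coherence", "evidential_grounding"],
    "novelty": ["semantic_novelty", "framing_novelty"],
    "expression": ["clarity", "rhetorical_force"],
    "fertility": ["generativity", "composability"],
}

def group_scores(dim_scores: dict[str, int]) -> dict[str, int]:
    """Floor aggregation: min, not mean, over each group's dimensions."""
    return {g: min(dim_scores[d] for d in dims) for g, dims in GROUPS.items()}
```

Run on the top-ranked demo thesis (precision 5, coherence 7, evidence 4, …), this reproduces its group profile of Substance 4 / Novelty 5 / Expression 5 / Fertility 7.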

Live Demo: a16z "Big Ideas 2026"

We ran the system on a16z's Big Ideas 2026: Part 1—a dense piece where a16z's investment teams predict what's coming in AI, enterprise, and consumer tech.

38 ideas extracted · 23 claims · 15 theses · 5 evaluated in depth

Extraction took ~50 seconds (Sonnet, single pass over 21K chars of article text). The 5 highest-confidence theses were then scored on all 9 dimensions. Here are the results, ranked by overall quality:

THESIS · confidence 0.80 · #1 overall
"As AI applications deliver value with minimal screen time, screen time will cease to be a meaningful KPI, requiring more complex methods of measuring ROI."
Substance 4 (precision 5 · coherence 7 · evidence 4) · Novelty 5 (semantic 5 · framing 6) · Expression 5 (clarity 8 · rhetoric 5) · Fertility 7 (generativity 7 · composability 8)

THESIS · confidence 0.90 · #2 overall
"Cybersecurity teams create labor scarcity by buying products that detect everything, forcing teams to review everything, creating a vicious cycle of false scarcity."
Substance 4 (precision 4 · coherence 6 · evidence 4) · Novelty 4 (semantic 4 · framing 6) · Expression 6 (clarity 7 · rhetoric 6) · Fertility 7 (generativity 7 · composability 8)

THESIS · confidence 0.80 · #3 overall
"Vertical AI is evolving from information retrieval (2024) to reasoning (2025) to multiplayer, multi-party collaboration (2026)."
Substance 3 (precision 4 · coherence 6 · evidence 3) · Novelty 4 (semantic 4 · framing 5) · Expression 6 (clarity 7 · rhetoric 6) · Fertility 6 (generativity 6 · composability 7)

THESIS · confidence 0.85 · #4 overall
"Startups that build platforms to extract structure from documents, images, and videos will hold the key to enterprise knowledge and process."
Substance 3 (precision 3 · coherence 6 · evidence 4) · Novelty 3 (semantic 3 · framing 4) · Expression 6 (clarity 7 · rhetoric 6) · Fertility 6 (generativity 6 · composability 7)

THESIS · confidence 0.80 · #5 overall
"Data and AI infrastructure are becoming inextricably linked as we move toward a truly AI-native data architecture."
Substance 2 (precision 3 · coherence 6 · evidence 2) · Novelty 2 (semantic 2 · framing 3) · Expression 4 (clarity 7 · rhetoric 4) · Fertility 5 (generativity 5 · composability 6)

What the Demo Reveals

Finding 1: Expression consistently strong, substance consistently weak. Every idea scored 6–8 on clarity but 2–5 on evidential grounding. This is exactly what you'd expect from a VC thought-leadership piece: polished prose wrapping under-evidenced claims. The system correctly separates "sounds good" from "is good."
Finding 2: Fertility outperforms novelty across the board. All 5 ideas scored higher on generativity/composability (5–8) than semantic novelty (2–5). These aren't new ideas—they're repackaged trends. But they're productive repackagings that open follow-up questions. The system distinguishes "new" from "useful."
Finding 3: High author confidence ≠ high quality. The cybersecurity thesis had the highest extraction confidence (0.90—asserted boldly in the article) but only middling substance. The "screen time" thesis had lower confidence (0.80) but the best overall profile. The system correctly separates assertion strength from argument quality.
Finding 4: The platitude detector works. "Data and AI infrastructure are becoming inextricably linked" scored substance=2, novelty=2. This is a platitude dressed as insight, and the system caught it. Compare with the cybersecurity "false scarcity" thesis (substance=4, novelty=4)—which at least proposes a mechanism.

Architecture

ideas/
+-- models.py          # Idea, IdeaScore, EvaluationProfile, 9 DIMENSIONS, 4 GROUPS
+-- extract.py         # Stage 0 + 2: LLM-based claim/thesis extraction
+-- landscape.py       # Stage 1: search term gen + Serper search + source fetching
+-- eval.py            # Stage 3: batched 9-dim scoring + multi-model consensus
+-- store.py           # SQLite (structured) + sqlite-vec (embeddings) dual store
+-- ui/
|   +-- app.py         # UI at kb.localhost/ideas (consolidated into kb hub)
+-- demo.py            # CLI demo script
+-- data/              # SQLite DB + vector store + demo results
+-- tests/             # 27 tests, all passing

Key Design Decisions

Batched eval (2–3 dims per LLM call): CheckEval research shows small batches outperform monolithic "score everything at once" prompts
Multi-model consensus for novelty + fertility: these are the hardest dimensions to score; 3 models (Opus, Sonnet, Gemini), take the median
Floor scores per group, not averages: averages hide fatal flaws; a novel but incoherent idea should score low, not medium
Interactive landscape (Stage 1): novelty is relative to a corpus; you should validate what "existing work" means before scoring against it
Claims vs. theses distinction: different idea types need different evaluation emphasis; claims need precision, theses need framing novelty

Reused Infrastructure

LLM calls (lib/llm): call_llm with model aliases, prompt caching, cost tracking
Web search (lib/discovery_ops): serper_search for landscape building
Content fetching (lib/ingest): fetch with caching, PDF/HTML/YouTube support
Vector search (lib/vectors): sqlite-vec for embedding similarity (novelty computation)
Batched eval pattern (draft/style/evaluate.py): 2–3 criteria per call with prompt cache optimization

What's Next


Generated 2026-03-03 · Source: a16z Big Ideas 2026: Part 1 · Evaluation model: Claude Sonnet 4.6 · 38 ideas extracted, 5 evaluated (20 LLM calls) · kb.localhost/ideas