Idea Evaluation System

Computationally assess the originality, quality, and fertility of ideas in writing — with a live demo on a16z's "Big Ideas 2026"

The Problem

You read a piece of writing—an essay, a strategy doc, a research paper—and you want to answer: Are these ideas any good? Not "is the prose nice" (that's a style question), but: Are the claims specific enough to test? Are they actually novel, or just well-packaged conventional wisdom? Do they open productive lines of thinking, or are they dead ends?

This system decomposes that vague judgment into 9 measurable dimensions, scores each one using calibrated LLM evaluation, and reports a profile that separates substance from style from originality.

How It Works

Stage 0

Extract

Pull claims & theses from your text

Stage 1

Landscape

Search web, curate reference corpus

Stage 2

Corpus Extract

Extract ideas from confirmed sources

Stage 3

Score

Evaluate against 9 dimensions

Stage 0 uses an LLM to extract two types of ideas: claims (atomic, falsifiable assertions like "X causes Y because Z") and theses (higher-level framings, mental models, surprising connections). Each gets a confidence score.

Stage 1 is interactive: the system auto-generates search terms from your ideas, searches the web via Serper APIA Google Search API that returns structured results (organic, news, scholar) for programmatic querying., and presents candidate sources. You accept or reject them before any analysis happens. This gives you control over what "the existing landscape" means.

Stage 2 fetches your approved sources and extracts ideas from them too, building the reference corpus that novelty is measured against.

Stage 3 scores each of your ideas on 9 dimensions using batched LLM calls (2-3 dimensions per call, research-backed sweet spot from CheckEvalResearch showing that evaluating 1-3 criteria per LLM call produces more accurate scores than dumping all criteria at once.). Hard dimensions (novelty, fertility) get multi-model consensus (3 models, take median).

The Evaluation Taxonomy

9 dimensions in 4 groups. Each scored 1–10. Per-group aggregation uses floor scores (minimum, not average)—because a highly novel idea with zero coherence is worthless, and averaging hides that.

Group	Dimension	What It Measures
Substance Does it say something real?	Claim Precision	How specific and falsifiable is the core claim?
	Internal Coherence	Does the argument's structure hold together?
	Evidential Grounding	Is it anchored in evidence, not just assertion?
Novelty Does it add something new?	Semantic Novelty	How distant from existing landscape corpus?
Novelty Does it add something new?	Framing Novelty	Does it reframe the problem to open new avenues?
Expression How well communicated?	Clarity	Can a reader extract the core claim unambiguously?
Expression How well communicated?	Rhetorical Force	Does the expression serve the idea effectively?
Fertility Does it open further thinking?	Generativity	Does it suggest new questions or research directions?
Fertility Does it open further thinking?	Composability	Can it be productively combined with other ideas?

Why floor scores instead of averages? A brilliant thesis (novelty=9) that makes no falsifiable claim (substance=2) is intellectually exciting but practically useless. The floor score exposes the weakest link. Averages would give it a comfortable 5.5 and hide the fatal flaw.

Live Demo: a16z "Big Ideas 2026"

We ran the system on a16z's Big Ideas 2026: Part 1—a dense piece where a16z's investment teams predict what's coming in AI, enterprise, and consumer tech.

ideas extracted

claims

theses

evaluated in depth

Extraction took ~50 seconds (Sonnet, single pass over 21K chars of article text). The 5 highest-confidence theses were then scored on all 9 dimensions. Here are the results, ranked by overall quality:

THESIS · confidence 0.80 · #1 overall

"As AI applications deliver value with minimal screen time, screen time will cease to be a meaningful KPI, requiring more complex methods of measuring ROI."

Substance

precision 5 · coherence 7 · evidence 4

Novelty

semantic 5 · framing 6

Expression

clarity 8 · rhetoric 5

Fertility

generativity 7 · composability 8

claim precision

internal coherence

evidential grounding

semantic novelty

framing novelty

clarity

rhetorical force

generativity

composability

THESIS · confidence 0.90 · #2 overall

"Cybersecurity teams create labor scarcity by buying products that detect everything, forcing teams to review everything, creating a vicious cycle of false scarcity."

Substance

precision 4 · coherence 6 · evidence 4

Novelty

semantic 4 · framing 6

Expression

clarity 7 · rhetoric 6

Fertility

generativity 7 · composability 8

THESIS · confidence 0.80 · #3 overall

"Vertical AI is evolving from information retrieval (2024) to reasoning (2025) to multiplayer multi-party collaboration (2026)."

Substance

precision 4 · coherence 6 · evidence 3

Novelty

semantic 4 · framing 5

Expression

clarity 7 · rhetoric 6

Fertility

generativity 6 · composability 7

THESIS · confidence 0.85 · #4 overall

"Startups that build platforms to extract structure from documents, images, and videos will hold the key to enterprise knowledge and process."

Substance

precision 3 · coherence 6 · evidence 4

Novelty

semantic 3 · framing 4

Expression

clarity 7 · rhetoric 6

Fertility

generativity 6 · composability 7

THESIS · confidence 0.80 · #5 overall

"Data and AI infrastructure are becoming inextricably linked as we move toward a truly AI-native data architecture."

Substance

precision 3 · coherence 6 · evidence 2

Novelty

semantic 2 · framing 3

Expression

clarity 7 · rhetoric 4

Fertility

generativity 5 · composability 6

What the Demo Reveals

Finding 1: Expression consistently strong, substance consistently weak. Every idea scored 6–8 on clarity but 2–5 on evidential grounding. This is exactly what you'd expect from a VC thought-leadership piece: polished prose wrapping under-evidenced claims. The system correctly separates "sounds good" from "is good."

Finding 2: Fertility outperforms novelty across the board. All 5 ideas scored higher on generativity/composability (5–8) than semantic novelty (2–5). These aren't new ideas—they're repackaged trends. But they're productive repackagings that open follow-up questions. The system distinguishes "new" from "useful."

Finding 3: High author confidence ≠ high quality. The cybersecurity thesis had the highest extraction confidence (0.90—asserted boldly in the article) but only middling substance. The "screen time" thesis had lower confidence (0.80) but the best overall profile. The system correctly separates assertion strength from argument quality.

Finding 4: The platitude detector works. "Data and AI infrastructure are becoming inextricably linked" scored substance=2, novelty=2. This is a platitude dressed as insight, and the system caught it. Compare with the cybersecurity "false scarcity" thesis (substance=4, novelty=4)—which at least proposes a mechanism.

Architecture

ideas/
+-- models.py          # Idea, IdeaScore, EvaluationProfile, 9 DIMENSIONS, 4 GROUPS
+-- extract.py         # Stage 0 + 2: LLM-based claim/thesis extraction
+-- landscape.py       # Stage 1: search term gen + Serper search + source fetching
+-- eval.py            # Stage 3: batched 9-dim scoring + multi-model consensus
+-- store.py           # SQLite (structured) + sqlite-vec (embeddings) dual store
+-- ui/
|   +-- app.py         # UI at kb.localhost/ideas (consolidated into kb hub)
+-- demo.py            # CLI demo script
+-- data/              # SQLite DB + vector store + demo results
+-- tests/             # 27 tests, all passing

Key Design Decisions

Decision	Why
Batched eval (2–3 dims per LLM call)	CheckEval research: small batches outperform monolithic "score everything at once" prompts
Multi-model consensus for novelty + fertility	These are the hardest dimensions to score—3 models (Opus, Sonnet, Gemini), take median
Floor scores per group, not averages	Averages hide fatal flaws. A novel but incoherent idea should score low, not medium
Interactive landscape (Stage 1)	Novelty is relative to a corpus. You should validate what "existing work" means before scoring against it
Claims vs theses distinction	Different idea types need different evaluation emphasis—claims need precision, theses need framing novelty

Reused Infrastructure

Component	Source	What We Reuse
LLM calls	`lib/llm`	call_llm with model aliases, prompt caching, cost tracking
Web search	`lib/discovery_ops`	serper_search for landscape building
Content fetching	`lib/ingest`	fetch with caching, PDF/HTML/YouTube support
Vector search	`lib/vectors`	sqlite-vec for embedding similarity (novelty computation)
Batched eval pattern	`draft/style/evaluate.py`	2–3 criteria per call with prompt cache optimization

What's Next

Landscape-relative novelty: Currently scoring without a reference corpus (no Stage 1/2 in the demo). With a curated corpus, novelty scores become much more meaningful—measuring distance from what's already been said.
Move core to lib/ideas/: Extraction, evaluation, and models should be a shared library so vario-ng and other tools can use them without importing the UI.
Embedding-based novelty: Compute semantic novelty via actual embedding distance (topic-space vs claim-space decomposition) instead of LLM judgment alone.
Calibration: Run on a diverse set of texts (academic papers, blog posts, VC memos, student essays) to calibrate score distributions and identify dimension-specific biases.

Generated 2026-03-03 · Source: a16z Big Ideas 2026: Part 1 · Evaluation model: Claude Sonnet 4.6 · 38 ideas extracted, 5 evaluated (20 LLM calls) · kb.localhost/ideas