Surface-Specific Eval Metrics Design
Context
We're comparing content pillar strategies (merged_chunks vs docling_chunks) across three surfaces: flashcards, quiz, summary. The oracle judge does holistic A/B comparison. Auto-metrics provide specific, quantitative signals explaining why one variant is better.
Flashcards already have 8 metrics. Quiz and summary have zero. This doc designs pragmatic metrics for both.
Quiz Metrics
1. quiz_answer_correctness
What it tests: Is the marked correct answer actually correct? Are marked wrong answers actually wrong?
Why it matters: A quiz with wrong answers is worse than no quiz. This is the #1 quality signal.
How it works:
- For each question, send Q + all choices + source chunk text to LLM
- LLM verifies: (a) correct answer is factually right, (b) each distractor is actually wrong
- Classification: CORRECT, INCORRECT_ANSWER, INCORRECT_DISTRACTOR, ERROR
- Returns good_rate = % of questions where the answer and all distractors are valid
Inputs: Quiz questions with choices, content pillars (for source chunk text)
Surface-specific nuances:
- true_false: Verify the statement's truth value matches the marked answer
- fill_in_the_blank: Verify the answer fits the blank correctly
- single_choice/multiple_choice: Verify the correct answer(s) AND each distractor
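A minimal sketch of this check, assuming a generic llm callable (prompt string in, text reply out) and a hypothetical QuizQuestion shape; the real prompt wording, question schema, and per-type handling would live in metrics/quiz_correctness.py.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed interface: any callable that maps a prompt string to the model's
# text reply; the real metric would use the project's existing LLM client.
LLM = Callable[[str], str]

VERDICTS = {"CORRECT", "INCORRECT_ANSWER", "INCORRECT_DISTRACTOR"}


@dataclass
class QuizQuestion:  # hypothetical shape, not the project's actual schema
    question: str
    choices: List[str]
    correct_indices: List[int]
    source_chunk: str  # text of the content-pillar chunk the question came from


def classify_question(llm: LLM, q: QuizQuestion) -> str:
    """Verify the marked answer and every distractor against the source chunk."""
    prompt = (
        "Using ONLY the source text below, check this quiz question.\n"
        f"Source:\n{q.source_chunk}\n\nQuestion: {q.question}\n"
        + "\n".join(f"{i}. {c}" for i, c in enumerate(q.choices))
        + f"\nMarked correct: {q.correct_indices}\n"
        "Reply with exactly one label: CORRECT, INCORRECT_ANSWER, or INCORRECT_DISTRACTOR."
    )
    try:
        verdict = llm(prompt).strip().upper()
        return verdict if verdict in VERDICTS else "ERROR"
    except Exception:
        return "ERROR"


def quiz_answer_correctness(llm: LLM, questions: List[QuizQuestion]) -> float:
    """good_rate: % of questions where the answer and all distractors check out."""
    if not questions:
        return 0.0
    good = sum(classify_question(llm, q) == "CORRECT" for q in questions)
    return 100.0 * good / len(questions)
```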
2. quiz_distractor_quality
What it tests: Are distractors plausible enough to challenge a student who doesn't know the material?
Why it matters: Trivially wrong distractors make quizzes useless ("What's 2+2? A:4, B:Banana, C:Purple"). Good distractors are the difference between useful and useless MCQs.
How it works:
- For each MCQ question, send Q + correct answer + distractors to LLM
- LLM rates each distractor: PLAUSIBLE (would fool someone who didn't study), WEAK (obviously wrong to most), ABSURD (nonsensical)
- Score = % of distractors rated PLAUSIBLE
- Extra data: which distractors are weak, suggestions for improvement
Inputs: Quiz questions (MCQ types only, skip true_false and fill_in_the_blank)
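A sketch under the same assumed llm callable; the MCQ item shape ({"question", "correct", "distractors"}) is illustrative, not the project's actual quiz schema.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

RATINGS = {"PLAUSIBLE", "WEAK", "ABSURD"}


def rate_distractors(llm: LLM, question: str, correct: str,
                     distractors: List[str]) -> Dict[str, str]:
    """Rate each distractor individually; unrecognized replies fall back to WEAK."""
    ratings: Dict[str, str] = {}
    for d in distractors:
        prompt = (
            f"Question: {question}\nCorrect answer: {correct}\nDistractor: {d}\n"
            "Would this distractor plausibly fool a student who did not study?\n"
            "Reply with one word: PLAUSIBLE, WEAK, or ABSURD."
        )
        reply = llm(prompt).strip().upper()
        ratings[d] = reply if reply in RATINGS else "WEAK"
    return ratings


def quiz_distractor_quality(llm: LLM, mcq_items: List[dict]) -> float:
    """Score = % of all distractors rated PLAUSIBLE across MCQ questions."""
    all_ratings: List[str] = []
    for item in mcq_items:  # item: {"question", "correct", "distractors"}
        all_ratings.extend(
            rate_distractors(llm, item["question"], item["correct"],
                             item["distractors"]).values()
        )
    if not all_ratings:
        return 0.0
    return 100.0 * all_ratings.count("PLAUSIBLE") / len(all_ratings)
```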
3. quiz_coverage
What it tests: Do quiz questions cover the important concepts from the source?
Why it matters: A quiz that only tests one chapter is incomplete.
How it works: Reuse the coverage_v2 pattern:
- Extract key concepts from content pillars
- For each concept, LLM judges depth of coverage by quiz questions (0-3 scale)
- Score = weighted mean of depths
Inputs: Quiz questions + content pillars
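A sketch of the reuse, assuming the same llm callable and uniform concept weights; the real coverage_v2 implementation may weight concepts by importance and use structured prompts.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable


def extract_concepts(llm: LLM, pillar_text: str, max_concepts: int = 20) -> List[str]:
    """Ask the LLM for the key concepts in the source material, one per line."""
    reply = llm(
        f"List up to {max_concepts} key concepts from this material, one per line:\n"
        f"{pillar_text}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def judge_depth(llm: LLM, concept: str, questions: List[str]) -> int:
    """Depth of coverage for one concept on a 0-3 scale (0 = untested, 3 = thorough)."""
    reply = llm(
        f"Concept: {concept}\nQuiz questions:\n" + "\n".join(questions) +
        "\nOn a 0-3 scale, how thoroughly do these questions test the concept? "
        "Reply with a single digit."
    )
    try:
        return max(0, min(3, int(reply.strip()[0])))
    except (ValueError, IndexError):
        return 0


def quiz_coverage(llm: LLM, pillar_text: str, questions: List[str]) -> float:
    """Mean of per-concept depths (uniform weights here), normalized to 0-100."""
    concepts = extract_concepts(llm, pillar_text)
    if not concepts:
        return 0.0
    depths = [judge_depth(llm, c, questions) for c in concepts]
    return 100.0 * sum(depths) / (3 * len(depths))
```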
4. quiz_clarity
What it tests: Are questions clear, unambiguous, and self-contained?
Why it matters: An ambiguous question frustrates students regardless of answer quality.
How it works:
- Batch quiz questions (10 per batch)
- LLM classifies each: CLEAR, AMBIGUOUS, CONTEXT_DEPENDENT, ERROR
- Score = % CLEAR
Inputs: Quiz questions (no source material needed)
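A sketch of the batched classification, again assuming a plain llm callable; the label parsing is deliberately naive and would be hardened in the real implementation.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

LABELS = {"CLEAR", "AMBIGUOUS", "CONTEXT_DEPENDENT"}


def classify_batch(llm: LLM, questions: List[str]) -> List[str]:
    """Classify one batch of questions; expects one label per line in the reply."""
    prompt = (
        "For each question below, reply with one label per line: "
        "CLEAR, AMBIGUOUS, or CONTEXT_DEPENDENT.\n\n"
        + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    )
    raw = [l.strip().upper() for l in llm(prompt).splitlines() if l.strip()]
    # Tolerate numbered replies like "3. AMBIGUOUS"; anything else becomes ERROR.
    labels = [l.split()[-1] if l.split()[-1] in LABELS else "ERROR" for l in raw]
    labels += ["ERROR"] * (len(questions) - len(labels))
    return labels[: len(questions)]


def quiz_clarity(llm: LLM, questions: List[str], batch_size: int = 10) -> float:
    """Score = % of questions classified CLEAR."""
    labels: List[str] = []
    for i in range(0, len(questions), batch_size):
        labels.extend(classify_batch(llm, questions[i : i + batch_size]))
    return 100.0 * labels.count("CLEAR") / len(labels) if labels else 0.0
```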
Summary Metrics
1. summary_faithfulness
What it tests: Does the summary contain only factually accurate claims derivable from the source?
Why it matters: Hallucinated content in a summary is actively harmful for studying.
How it works (inspired by RAGAS faithfulness):
- Step 1: LLM extracts atomic claims from summary (each claim = one fact)
- Step 2: For each claim, LLM verifies against source markdown: SUPPORTED, UNSUPPORTED, CONTRADICTED
- Score = % SUPPORTED out of total non-trivial claims
- Extra data: list of unsupported/contradicted claims with reasoning
Inputs: Summary text + source markdown (from parsing_result.md_url)
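A sketch of the two-step flow with the same assumed llm callable; filtering out trivial claims is omitted here for brevity, and the prompt wording is illustrative.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

VERDICTS = {"SUPPORTED", "UNSUPPORTED", "CONTRADICTED"}


def extract_claims(llm: LLM, summary: str) -> List[str]:
    """Step 1: break the summary into atomic, independently checkable claims."""
    reply = llm(
        "Rewrite this summary as a list of atomic factual claims, one per line:\n"
        f"{summary}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def verify_claim(llm: LLM, claim: str, source_md: str) -> str:
    """Step 2: check a single claim against the source markdown."""
    reply = llm(
        f"Source document:\n{source_md}\n\nClaim: {claim}\n"
        "Is the claim SUPPORTED, UNSUPPORTED, or CONTRADICTED by the source? "
        "Reply with one word."
    ).strip().upper()
    return reply if reply in VERDICTS else "UNSUPPORTED"


def summary_faithfulness(llm: LLM, summary: str, source_md: str) -> Dict:
    """Score = % SUPPORTED claims; also returns the problematic claims."""
    claims = extract_claims(llm, summary)
    results = [(c, verify_claim(llm, c, source_md)) for c in claims]
    flagged = [(c, v) for c, v in results if v != "SUPPORTED"]
    score = 100.0 * (len(results) - len(flagged)) / len(results) if results else 0.0
    return {"score": score, "flagged_claims": flagged}
```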
2. summary_coverage
What it tests: Does the summary cover all major topics from the source?
Why it matters: A summary that misses half the content is incomplete.
How it works:
- Extract main topics from content pillars
- For each topic, LLM judges: COVERED (mentioned and explained), MENTIONED (briefly touched), MISSING (not addressed)
- Score = weighted (COVERED=1, MENTIONED=0.5, MISSING=0) / total topics
Inputs: Summary text + content pillars
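A sketch of the topic-by-topic judgment, assuming the same llm callable; topic extraction and the exact weighting live wherever coverage_v2 keeps them today.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

WEIGHTS = {"COVERED": 1.0, "MENTIONED": 0.5, "MISSING": 0.0}


def extract_topics(llm: LLM, pillar_text: str) -> List[str]:
    """Pull the main topics out of the content pillars, one per line."""
    reply = llm(f"List the main topics in this material, one per line:\n{pillar_text}")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def judge_topic(llm: LLM, topic: str, summary: str) -> str:
    """Classify one topic; unrecognized replies are treated as MISSING."""
    reply = llm(
        f"Topic: {topic}\nSummary:\n{summary}\n"
        "Is the topic COVERED (explained), MENTIONED (briefly touched), or MISSING? "
        "Reply with one word."
    ).strip().upper()
    return reply if reply in WEIGHTS else "MISSING"


def summary_coverage(llm: LLM, summary: str, pillar_text: str) -> float:
    """Weighted topic coverage (COVERED=1, MENTIONED=0.5, MISSING=0), 0-100."""
    topics = extract_topics(llm, pillar_text)
    if not topics:
        return 0.0
    total = sum(WEIGHTS[judge_topic(llm, t, summary)] for t in topics)
    return 100.0 * total / len(topics)
```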
3. summary_coherence
What it tests: Is the summary well-organized, logically structured, and readable?
Why it matters: A coherent summary is easier to study from.
How it works:
- LLM rates the summary on a 1-5 scale across 3 dimensions:
- Organization: Clear structure, logical section ordering
- Flow: Smooth transitions, no abrupt jumps
- Completeness: Each section is self-contained, no dangling references
- Score = mean of 3 ratings, normalized to 0-100
Inputs: Summary text only (no source needed)
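A sketch of the three-dimension rating, assuming the same llm callable; mapping the 1-5 mean to 0-100 as (mean - 1) / 4 * 100 is one reasonable reading of "normalized to 0-100".

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # assumed prompt -> response callable

DIMENSIONS = ("organization", "flow", "completeness")


def rate_dimension(llm: LLM, summary: str, dimension: str) -> int:
    """One 1-5 rating per dimension; falls back to 1 on unparseable replies."""
    reply = llm(
        f"Rate the {dimension} of this summary on a 1-5 scale. "
        f"Reply with a single digit.\n\n{summary}"
    )
    try:
        return max(1, min(5, int(reply.strip()[0])))
    except (ValueError, IndexError):
        return 1


def summary_coherence(llm: LLM, summary: str) -> Dict:
    """Mean of the three 1-5 ratings, rescaled so 1 -> 0 and 5 -> 100."""
    ratings = {d: rate_dimension(llm, summary, d) for d in DIMENSIONS}
    mean = sum(ratings.values()) / len(DIMENSIONS)
    return {"score": (mean - 1) / 4 * 100, "ratings": ratings}
```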
4. summary_information_density
What it tests: Is the summary appropriately concise? Does it avoid filler and repetition?
Why it matters: Verbose summaries waste study time. Dense summaries are more useful.
How it works:
- LLM identifies: filler phrases, repeated information, unnecessary padding
- Score = (total sentences - filler sentences) / total sentences * 100
- Extra data: highlighted filler/repetition passages
Inputs: Summary text only
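A sketch of the filler detection, with the same assumed llm callable; splitting on ". " is a placeholder for a proper sentence tokenizer, and the reply parsing is intentionally loose.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable


def flag_filler_sentences(llm: LLM, sentences: List[str]) -> List[int]:
    """Ask the LLM which sentence numbers are filler, repetition, or padding."""
    prompt = (
        "List the numbers of sentences that are filler, repetition, or padding, "
        "comma-separated (or NONE):\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    )
    reply = llm(prompt).strip()
    if reply.upper().startswith("NONE"):
        return []
    return [int(tok) - 1 for tok in reply.replace(",", " ").split() if tok.isdigit()]


def summary_information_density(llm: LLM, summary: str) -> Dict:
    """Score = (total sentences - filler sentences) / total sentences * 100."""
    sentences = [s.strip() for s in summary.replace("\n", " ").split(". ") if s.strip()]
    if not sentences:
        return {"score": 0.0, "filler": []}
    idx = [i for i in flag_filler_sentences(llm, sentences) if 0 <= i < len(sentences)]
    score = 100.0 * (len(sentences) - len(idx)) / len(sentences)
    return {"score": score, "filler": [sentences[i] for i in idx]}
```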
Implementation Priority
Phase 1 (essential for meaningful experiment results):
1. quiz_answer_correctness - Without this, quiz comparison is guesswork
2. summary_faithfulness - Without this, summary comparison is guesswork
3. summary_coverage - Direct quality signal using existing infrastructure
Phase 2 (improves experiment depth):
4. quiz_distractor_quality - Unique to MCQ, differentiating signal
5. quiz_coverage - Reuses coverage_v2 pattern
6. summary_coherence - Quick win, simple implementation
Phase 3 (nice to have):
7. quiz_clarity - Simple but less discriminating
8. summary_information_density - Subtle signal
Integration Plan
- New files: metrics/quiz_correctness.py, metrics/quiz_distractor_quality.py, metrics/quiz_coverage.py, metrics/summary_faithfulness.py, metrics/summary_coverage.py, metrics/summary_coherence.py
- New functions in sdk.py: evaluate_quiz(), evaluate_summary()
- Hook into experiment_runner.py after generation (parallel with flashcard eval)
- New MetricName literals for each surface
- Frontend: surface-specific metric display in ExperimentOverview
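A sketch of how the new sdk.py entry points could be shaped so experiment_runner.py runs quiz and summary evaluation with the same loop it uses for flashcards; the registry-style signature (each metric wrapped to a uniform (llm, items, source) callable, keyed by MetricName) is an assumption, not the existing SDK convention.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

# Each surface registers MetricName -> metric function; experiment_runner.py
# can then iterate the returned score dicts the same way it does for flashcards.
QuizMetric = Callable[[LLM, List[dict], str], float]
SummaryMetric = Callable[[LLM, str, str], float]


def evaluate_quiz(llm: LLM, questions: List[dict], pillar_text: str,
                  metrics: Dict[str, QuizMetric]) -> Dict[str, float]:
    """Run every registered quiz metric; scores are keyed by MetricName."""
    return {name: fn(llm, questions, pillar_text) for name, fn in metrics.items()}


def evaluate_summary(llm: LLM, summary: str, source_text: str,
                     metrics: Dict[str, SummaryMetric]) -> Dict[str, float]:
    """Run every registered summary metric against the summary and its source."""
    return {name: fn(llm, summary, source_text) for name, fn in metrics.items()}
```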