Surface-Specific Eval Metrics Design
Context
We're comparing content pillar strategies (merged_chunks vs docling_chunks) across three surfaces: flashcards, quiz, summary. The oracle judge does holistic A/B comparison. Auto-metrics provide specific, quantitative signals explaining why one variant is better.
Flashcards already have 8 metrics. Quiz and summary have zero. This doc designs pragmatic metrics for both.
Quiz Metrics
1. quiz_answer_correctness
What it tests: Is the marked correct answer actually correct? Are marked wrong answers actually wrong?
Why it matters: A quiz with wrong answers is worse than no quiz. This is the #1 quality signal.
How it works:
- For each question, send Q + all choices + source chunk text to LLM
- LLM verifies: (a) correct answer is factually right, (b) each distractor is actually wrong
- Classification: CORRECT, INCORRECT_ANSWER, INCORRECT_DISTRACTOR, ERROR
- Returns good_rate = % of questions where the answer and all distractors are valid
Inputs: Quiz questions with choices, content pillars (for source chunk text)
Surface-specific nuances:
- true_false: Verify the statement's truth value matches the marked answer
- fill_in_the_blank: Verify the answer fits the blank correctly
- single_choice/multiple_choice: Verify the correct answer(s) AND each distractor
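A minimal sketch of this check, assuming a generic llm callable (prompt string in, text reply out) and a hypothetical QuizQuestion shape; the real prompt wording, question schema, and per-type handling would live in metrics/quiz_correctness.py.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed interface: any callable that maps a prompt string to the model's
# text reply; the real metric would use the project's existing LLM client.
LLM = Callable[[str], str]

VERDICTS = {"CORRECT", "INCORRECT_ANSWER", "INCORRECT_DISTRACTOR"}


@dataclass
class QuizQuestion:  # hypothetical shape, not the project's actual schema
    question: str
    choices: List[str]
    correct_indices: List[int]
    source_chunk: str  # text of the content-pillar chunk the question came from


def classify_question(llm: LLM, q: QuizQuestion) -> str:
    """Verify the marked answer and every distractor against the source chunk."""
    prompt = (
        "Using ONLY the source text below, check this quiz question.\n"
        f"Source:\n{q.source_chunk}\n\nQuestion: {q.question}\n"
        + "\n".join(f"{i}. {c}" for i, c in enumerate(q.choices))
        + f"\nMarked correct: {q.correct_indices}\n"
        "Reply with exactly one label: CORRECT, INCORRECT_ANSWER, or INCORRECT_DISTRACTOR."
    )
    try:
        verdict = llm(prompt).strip().upper()
        return verdict if verdict in VERDICTS else "ERROR"
    except Exception:
        return "ERROR"


def quiz_answer_correctness(llm: LLM, questions: List[QuizQuestion]) -> float:
    """good_rate: % of questions where the answer and all distractors check out."""
    if not questions:
        return 0.0
    good = sum(classify_question(llm, q) == "CORRECT" for q in questions)
    return 100.0 * good / len(questions)
```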
2. quiz_distractor_quality
What it tests: Are distractors plausible enough to challenge a student who doesn't know the material?
Why it matters: Trivially wrong distractors make quizzes useless ("What's 2+2? A:4, B:Banana, C:Purple"). Good distractors are the difference between useful and useless MCQs.
How it works:
- For each MCQ question, send Q + correct answer + distractors to LLM
- LLM rates each distractor: PLAUSIBLE (would fool someone who didn't study), WEAK (obviously wrong to most), ABSURD (nonsensical)
- Score = % of distractors rated PLAUSIBLE
- Extra data: which distractors are weak, suggestions for improvement
Inputs: Quiz questions (MCQ types only, skip true_false and fill_in_the_blank)
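A sketch under the same assumed llm callable; the MCQ item shape ({"question", "correct", "distractors"}) is illustrative, not the project's actual quiz schema.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

RATINGS = {"PLAUSIBLE", "WEAK", "ABSURD"}


def rate_distractors(llm: LLM, question: str, correct: str,
                     distractors: List[str]) -> Dict[str, str]:
    """Rate each distractor individually; unrecognized replies fall back to WEAK."""
    ratings: Dict[str, str] = {}
    for d in distractors:
        prompt = (
            f"Question: {question}\nCorrect answer: {correct}\nDistractor: {d}\n"
            "Would this distractor plausibly fool a student who did not study?\n"
            "Reply with one word: PLAUSIBLE, WEAK, or ABSURD."
        )
        reply = llm(prompt).strip().upper()
        ratings[d] = reply if reply in RATINGS else "WEAK"
    return ratings


def quiz_distractor_quality(llm: LLM, mcq_items: List[dict]) -> float:
    """Score = % of all distractors rated PLAUSIBLE across MCQ questions."""
    all_ratings: List[str] = []
    for item in mcq_items:  # item: {"question", "correct", "distractors"}
        all_ratings.extend(
            rate_distractors(llm, item["question"], item["correct"],
                             item["distractors"]).values()
        )
    if not all_ratings:
        return 0.0
    return 100.0 * all_ratings.count("PLAUSIBLE") / len(all_ratings)
```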
3. quiz_coverage
What it tests: Do quiz questions cover the important concepts from the source?
Why it matters: A quiz that only tests one chapter is incomplete.
How it works: Reuse the coverage_v2 pattern:
- Extract key concepts from content pillars
- For each concept, LLM judges depth of coverage by quiz questions (0-3 scale)
- Score = weighted mean of depths
Inputs: Quiz questions + content pillars
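A sketch of the reuse, assuming the same llm callable and uniform concept weights; the real coverage_v2 implementation may weight concepts by importance and use structured prompts.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable


def extract_concepts(llm: LLM, pillar_text: str, max_concepts: int = 20) -> List[str]:
    """Ask the LLM for the key concepts in the source material, one per line."""
    reply = llm(
        f"List up to {max_concepts} key concepts from this material, one per line:\n"
        f"{pillar_text}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def judge_depth(llm: LLM, concept: str, questions: List[str]) -> int:
    """Depth of coverage for one concept on a 0-3 scale (0 = untested, 3 = thorough)."""
    reply = llm(
        f"Concept: {concept}\nQuiz questions:\n" + "\n".join(questions) +
        "\nOn a 0-3 scale, how thoroughly do these questions test the concept? "
        "Reply with a single digit."
    )
    try:
        return max(0, min(3, int(reply.strip()[0])))
    except (ValueError, IndexError):
        return 0


def quiz_coverage(llm: LLM, pillar_text: str, questions: List[str]) -> float:
    """Mean of per-concept depths (uniform weights here), normalized to 0-100."""
    concepts = extract_concepts(llm, pillar_text)
    if not concepts:
        return 0.0
    depths = [judge_depth(llm, c, questions) for c in concepts]
    return 100.0 * sum(depths) / (3 * len(depths))
```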
4. quiz_clarity
What it tests: Are questions clear, unambiguous, and self-contained?
Why it matters: An ambiguous question frustrates students regardless of answer quality.
How it works:
- Batch quiz questions (10 per batch)
- LLM classifies each: CLEAR, AMBIGUOUS, CONTEXT_DEPENDENT, ERROR
- Score = % CLEAR
Inputs: Quiz questions (no source material needed)
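A sketch of the batched classification, again assuming a plain llm callable; the label parsing is deliberately naive and would be hardened in the real implementation.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

LABELS = {"CLEAR", "AMBIGUOUS", "CONTEXT_DEPENDENT"}


def classify_batch(llm: LLM, questions: List[str]) -> List[str]:
    """Classify one batch of questions; expects one label per line in the reply."""
    prompt = (
        "For each question below, reply with one label per line: "
        "CLEAR, AMBIGUOUS, or CONTEXT_DEPENDENT.\n\n"
        + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    )
    raw = [l.strip().upper() for l in llm(prompt).splitlines() if l.strip()]
    # Tolerate numbered replies like "3. AMBIGUOUS"; anything else becomes ERROR.
    labels = [l.split()[-1] if l.split()[-1] in LABELS else "ERROR" for l in raw]
    labels += ["ERROR"] * (len(questions) - len(labels))
    return labels[: len(questions)]


def quiz_clarity(llm: LLM, questions: List[str], batch_size: int = 10) -> float:
    """Score = % of questions classified CLEAR."""
    labels: List[str] = []
    for i in range(0, len(questions), batch_size):
        labels.extend(classify_batch(llm, questions[i : i + batch_size]))
    return 100.0 * labels.count("CLEAR") / len(labels) if labels else 0.0
```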
Summary Metrics
1. summary_faithfulness
What it tests: Does the summary contain only factually accurate claims derivable from the source?
Why it matters: Hallucinated content in a summary is actively harmful for studying.
How it works (inspired by RAGAS faithfulness):
- Step 1: LLM extracts atomic claims from summary (each claim = one fact)
- Step 2: For each claim, LLM verifies against source markdown: SUPPORTED, UNSUPPORTED, CONTRADICTED
- Score = % SUPPORTED out of total non-trivial claims
- Extra data: list of unsupported/contradicted claims with reasoning
Inputs: Summary text + source markdown (from parsing_result.md_url)
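A sketch of the two-step flow with the same assumed llm callable; filtering out trivial claims is omitted here for brevity, and the prompt wording is illustrative.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

VERDICTS = {"SUPPORTED", "UNSUPPORTED", "CONTRADICTED"}


def extract_claims(llm: LLM, summary: str) -> List[str]:
    """Step 1: break the summary into atomic, independently checkable claims."""
    reply = llm(
        "Rewrite this summary as a list of atomic factual claims, one per line:\n"
        f"{summary}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def verify_claim(llm: LLM, claim: str, source_md: str) -> str:
    """Step 2: check a single claim against the source markdown."""
    reply = llm(
        f"Source document:\n{source_md}\n\nClaim: {claim}\n"
        "Is the claim SUPPORTED, UNSUPPORTED, or CONTRADICTED by the source? "
        "Reply with one word."
    ).strip().upper()
    return reply if reply in VERDICTS else "UNSUPPORTED"


def summary_faithfulness(llm: LLM, summary: str, source_md: str) -> Dict:
    """Score = % SUPPORTED claims; also returns the problematic claims."""
    claims = extract_claims(llm, summary)
    results = [(c, verify_claim(llm, c, source_md)) for c in claims]
    flagged = [(c, v) for c, v in results if v != "SUPPORTED"]
    score = 100.0 * (len(results) - len(flagged)) / len(results) if results else 0.0
    return {"score": score, "flagged_claims": flagged}
```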
2. summary_coverage
What it tests: Does the summary cover all major topics from the source?
Why it matters: A summary that misses half the content is incomplete.
How it works:
- Extract main topics from content pillars
- For each topic, LLM judges: COVERED (mentioned and explained), MENTIONED (briefly touched), MISSING (not addressed)
- Score = weighted (COVERED=1, MENTIONED=0.5, MISSING=0) / total topics
Inputs: Summary text + content pillars
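A sketch of the topic-by-topic judgment, assuming the same llm callable; topic extraction and the exact weighting live wherever coverage_v2 keeps them today.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

WEIGHTS = {"COVERED": 1.0, "MENTIONED": 0.5, "MISSING": 0.0}


def extract_topics(llm: LLM, pillar_text: str) -> List[str]:
    """Pull the main topics out of the content pillars, one per line."""
    reply = llm(f"List the main topics in this material, one per line:\n{pillar_text}")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def judge_topic(llm: LLM, topic: str, summary: str) -> str:
    """Classify one topic; unrecognized replies are treated as MISSING."""
    reply = llm(
        f"Topic: {topic}\nSummary:\n{summary}\n"
        "Is the topic COVERED (explained), MENTIONED (briefly touched), or MISSING? "
        "Reply with one word."
    ).strip().upper()
    return reply if reply in WEIGHTS else "MISSING"


def summary_coverage(llm: LLM, summary: str, pillar_text: str) -> float:
    """Weighted topic coverage (COVERED=1, MENTIONED=0.5, MISSING=0), 0-100."""
    topics = extract_topics(llm, pillar_text)
    if not topics:
        return 0.0
    total = sum(WEIGHTS[judge_topic(llm, t, summary)] for t in topics)
    return 100.0 * total / len(topics)
```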
3. summary_coherence
What it tests: Is the summary well-organized, logically structured, and readable?
Why it matters: A coherent summary is easier to study from.
How it works:
- LLM rates the summary on a 1-5 scale across 3 dimensions:
- Organization: Clear structure, logical section ordering
- Flow: Smooth transitions, no abrupt jumps
- Completeness: Each section is self-contained, no dangling references
- Score = mean of 3 ratings, normalized to 0-100
Inputs: Summary text only (no source needed)
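A sketch of the three-dimension rating, assuming the same llm callable; mapping the 1-5 mean to 0-100 as (mean - 1) / 4 * 100 is one reasonable reading of "normalized to 0-100".

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # assumed prompt -> response callable

DIMENSIONS = ("organization", "flow", "completeness")


def rate_dimension(llm: LLM, summary: str, dimension: str) -> int:
    """One 1-5 rating per dimension; falls back to 1 on unparseable replies."""
    reply = llm(
        f"Rate the {dimension} of this summary on a 1-5 scale. "
        f"Reply with a single digit.\n\n{summary}"
    )
    try:
        return max(1, min(5, int(reply.strip()[0])))
    except (ValueError, IndexError):
        return 1


def summary_coherence(llm: LLM, summary: str) -> Dict:
    """Mean of the three 1-5 ratings, rescaled so 1 -> 0 and 5 -> 100."""
    ratings = {d: rate_dimension(llm, summary, d) for d in DIMENSIONS}
    mean = sum(ratings.values()) / len(DIMENSIONS)
    return {"score": (mean - 1) / 4 * 100, "ratings": ratings}
```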
4. summary_information_density
What it tests: Is the summary appropriately concise? Does it avoid filler and repetition?
Why it matters: Verbose summaries waste study time. Dense summaries are more useful.
How it works:
- LLM identifies: filler phrases, repeated information, unnecessary padding
- Score = (total sentences - filler sentences) / total sentences * 100
- Extra data: highlighted filler/repetition passages
Inputs: Summary text only
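A sketch of the filler detection, with the same assumed llm callable; splitting on ". " is a placeholder for a proper sentence tokenizer, and the reply parsing is intentionally loose.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable


def flag_filler_sentences(llm: LLM, sentences: List[str]) -> List[int]:
    """Ask the LLM which sentence numbers are filler, repetition, or padding."""
    prompt = (
        "List the numbers of sentences that are filler, repetition, or padding, "
        "comma-separated (or NONE):\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    )
    reply = llm(prompt).strip()
    if reply.upper().startswith("NONE"):
        return []
    return [int(tok) - 1 for tok in reply.replace(",", " ").split() if tok.isdigit()]


def summary_information_density(llm: LLM, summary: str) -> Dict:
    """Score = (total sentences - filler sentences) / total sentences * 100."""
    sentences = [s.strip() for s in summary.replace("\n", " ").split(". ") if s.strip()]
    if not sentences:
        return {"score": 0.0, "filler": []}
    idx = [i for i in flag_filler_sentences(llm, sentences) if 0 <= i < len(sentences)]
    score = 100.0 * (len(sentences) - len(idx)) / len(sentences)
    return {"score": score, "filler": [sentences[i] for i in idx]}
```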
Implementation Priority
Phase 1 (essential for meaningful experiment results):
1. quiz_answer_correctness - Without this, quiz comparison is guesswork
2. summary_faithfulness - Without this, summary comparison is guesswork
3. summary_coverage - Direct quality signal using existing infrastructure
Phase 2 (improves experiment depth):
4. quiz_distractor_quality - Unique to MCQ, differentiating signal
5. quiz_coverage - Reuses coverage_v2 pattern
6. summary_coherence - Quick win, simple implementation
Phase 3 (nice to have):
7. quiz_clarity - Simple but less discriminating
8. summary_information_density - Subtle signal
Integration Plan
- New files: metrics/quiz_correctness.py, metrics/quiz_distractor_quality.py, metrics/quiz_coverage.py, metrics/summary_faithfulness.py, metrics/summary_coverage.py, metrics/summary_coherence.py
- New functions in sdk.py: evaluate_quiz(), evaluate_summary()
- Hook into experiment_runner.py after generation (parallel with flashcard eval)
- New MetricName literals for each surface
- Frontend: surface-specific metric display in ExperimentOverview
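A sketch of how the new sdk.py entry points could be shaped so experiment_runner.py runs quiz and summary evaluation with the same loop it uses for flashcards; the registry-style signature (each metric wrapped to a uniform (llm, items, source) callable, keyed by MetricName) is an assumption, not the existing SDK convention.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed prompt -> response callable

# Each surface registers MetricName -> metric function; experiment_runner.py
# can then iterate the returned score dicts the same way it does for flashcards.
QuizMetric = Callable[[LLM, List[dict], str], float]
SummaryMetric = Callable[[LLM, str, str], float]


def evaluate_quiz(llm: LLM, questions: List[dict], pillar_text: str,
                  metrics: Dict[str, QuizMetric]) -> Dict[str, float]:
    """Run every registered quiz metric; scores are keyed by MetricName."""
    return {name: fn(llm, questions, pillar_text) for name, fn in metrics.items()}


def evaluate_summary(llm: LLM, summary: str, source_text: str,
                     metrics: Dict[str, SummaryMetric]) -> Dict[str, float]:
    """Run every registered summary metric against the summary and its source."""
    return {name: fn(llm, summary, source_text) for name, fn in metrics.items()}
```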