
Surface-Specific Eval Metrics Design


Context

We're comparing content pillar strategies (merged_chunks vs docling_chunks) across three surfaces: flashcards, quiz, and summary. The oracle judge does a holistic A/B comparison. Auto-metrics provide specific, quantitative signals explaining why one variant is better.

Flashcards already have 8 metrics. Quiz and summary have zero. This doc designs pragmatic metrics for both.


Quiz Metrics

1. quiz_answer_correctness

What it tests: Is the marked correct answer actually correct? Are marked wrong answers actually wrong?

Why it matters: A quiz with wrong answers is worse than no quiz. This is the #1 quality signal.

How it works:

  • For each question, send Q + all choices + source chunk text to LLM
  • LLM verifies: (a) correct answer is factually right, (b) each distractor is actually wrong
  • Classification: CORRECT, INCORRECT_ANSWER, INCORRECT_DISTRACTOR, ERROR
  • Returns good_rate = % of questions where answer + all distractors are valid
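
A minimal sketch of the scoring side, assuming per-question verdicts have already been collected from the judge (the Verdict enum and good_rate names are illustrative, not settled API):

```python
from enum import Enum

class Verdict(str, Enum):
    """Per-question classification; labels match the design above."""
    CORRECT = "CORRECT"                            # answer right, every distractor wrong
    INCORRECT_ANSWER = "INCORRECT_ANSWER"          # marked answer is factually wrong
    INCORRECT_DISTRACTOR = "INCORRECT_DISTRACTOR"  # a distractor is actually correct
    ERROR = "ERROR"                                # judge call failed or was unparseable

def good_rate(verdicts: list[Verdict]) -> float:
    """% of questions where the answer and all distractors check out."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(v is Verdict.CORRECT for v in verdicts) / len(verdicts)
```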

Inputs: Quiz questions with choices, content pillars (for source chunk text)

Surface-specific nuances:

  • true_false: Verify the statement's truth value matches the marked answer
  • fill_in_the_blank: Verify the answer fits the blank correctly
  • single_choice / multiple_choice: Verify correct answer(s) AND each distractor

2. quiz_distractor_quality

What it tests: Are distractors plausible enough to challenge a student who doesn't know the material?

Why it matters: Trivially wrong distractors make quizzes useless ("What's 2+2? A:4, B:Banana, C:Purple"). Good distractors are the difference between useful and useless MCQs.

How it works:

  • For each MCQ question, send Q + correct answer + distractors to LLM
  • LLM rates each distractor: PLAUSIBLE (would fool someone who didn't study), WEAK (obviously wrong to most), ABSURD (nonsensical)
  • Score = % of distractors rated PLAUSIBLE
  • Extra data: which distractors are weak, suggestions for improvement
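
A sketch of the aggregation, assuming each question dict carries a type field and one judge rating per distractor under distractor_ratings (both field names are assumptions):

```python
MCQ_TYPES = {"single_choice", "multiple_choice"}  # true_false / fill_in_the_blank skipped

def plausible_rate(questions: list[dict]) -> float:
    """% of distractors rated PLAUSIBLE across all MCQ questions."""
    ratings = [
        rating
        for q in questions if q["type"] in MCQ_TYPES  # "type" field is assumed
        for rating in q["distractor_ratings"]         # one LLM label per distractor
    ]
    return 100.0 * ratings.count("PLAUSIBLE") / len(ratings) if ratings else 0.0
```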

Inputs: Quiz questions (MCQ types only, skip true_false and fill_in_the_blank)

3. quiz_coverage

What it tests: Do quiz questions cover the important concepts from the source?

Why it matters: A quiz that only tests one chapter is incomplete.

How it works (reuses the coverage_v2 pattern):

  • Extract key concepts from content pillars
  • For each concept, LLM judges depth of coverage by quiz questions (0-3 scale)
  • Score = weighted mean of depths
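
The scoring step, sketched under the assumption that the coverage_v2-style judge returns one 0-3 depth per concept, with optional importance weights (uniform by default):

```python
def quiz_coverage_score(depths: list[int], weights: list[float] | None = None) -> float:
    """Weighted mean of per-concept depths (0-3), scaled to 0-100."""
    if not depths:
        return 0.0
    weights = weights or [1.0] * len(depths)  # uniform unless importance is supplied
    mean_depth = sum(d * w for d, w in zip(depths, weights)) / sum(weights)
    return 100.0 * mean_depth / 3.0  # depth 3 = concept fully tested
```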

Inputs: Quiz questions + content pillars

4. quiz_clarity

What it tests: Are questions clear, unambiguous, and self-contained?

Why it matters: An ambiguous question frustrates students regardless of answer quality.

How it works:

  • Batch quiz questions (10 per batch)
  • LLM classifies each: CLEAR, AMBIGUOUS, CONTEXT_DEPENDENT, ERROR
  • Score = % CLEAR
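
A sketch of the batching and scoring; the batched helper is illustrative, and labels are assumed to be collected per question across all batches:

```python
from itertools import islice

BATCH_SIZE = 10

def batched(questions: list[dict], size: int = BATCH_SIZE):
    """Yield questions in chunks of `size`, one judge call per chunk."""
    it = iter(questions)
    while batch := list(islice(it, size)):
        yield batch

def clarity_score(labels: list[str]) -> float:
    """% of questions labeled CLEAR (vs AMBIGUOUS / CONTEXT_DEPENDENT / ERROR)."""
    return 100.0 * labels.count("CLEAR") / len(labels) if labels else 0.0
```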

Inputs: Quiz questions (no source material needed)


Summary Metrics

1. summary_faithfulness

What it tests: Does the summary contain only factually accurate claims derivable from the source?

Why it matters: Hallucinated content in a summary is actively harmful for studying.

How it works (inspired by RAGAS faithfulness):

  • Step 1: LLM extracts atomic claims from summary (each claim = one fact)
  • Step 2: For each claim, LLM verifies against source markdown: SUPPORTED, UNSUPPORTED, CONTRADICTED
  • Score = % SUPPORTED out of total non-trivial claims
  • Extra data: list of unsupported/contradicted claims with reasoning
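
A sketch of the scoring side, assuming trivial claims are filtered out during extraction and each verified claim comes back with a label plus reasoning (the ClaimVerdict shape is an assumption):

```python
from dataclasses import dataclass

@dataclass
class ClaimVerdict:
    claim: str
    label: str      # SUPPORTED | UNSUPPORTED | CONTRADICTED
    reasoning: str  # judge explanation, surfaced as extra data

def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """% of non-trivial claims marked SUPPORTED by the source markdown."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(v.label == "SUPPORTED" for v in verdicts) / len(verdicts)

def flagged_claims(verdicts: list[ClaimVerdict]) -> list[ClaimVerdict]:
    """Unsupported/contradicted claims for the extra-data payload."""
    return [v for v in verdicts if v.label != "SUPPORTED"]
```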

Inputs: Summary text + source markdown (from parsing_result.md_url)

2. summary_coverage

What it tests: Does the summary cover all major topics from the source?

Why it matters: A summary that misses half the content is incomplete.

How it works:

  • Extract main topics from content pillars
  • For each topic, LLM judges: COVERED (mentioned and explained), MENTIONED (briefly touched on), MISSING (not addressed)
  • Score = weighted sum (COVERED=1, MENTIONED=0.5, MISSING=0) / total topics
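
The weighting, sketched assuming one label per extracted topic:

```python
TOPIC_WEIGHTS = {"COVERED": 1.0, "MENTIONED": 0.5, "MISSING": 0.0}

def summary_coverage_score(topic_labels: list[str]) -> float:
    """Weighted topic coverage as a 0-100 percentage."""
    if not topic_labels:
        return 0.0
    return 100.0 * sum(TOPIC_WEIGHTS[label] for label in topic_labels) / len(topic_labels)
```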

Inputs: Summary text + content pillars

3. summary_coherence

What it tests: Is the summary well-organized, logically structured, and readable?

Why it matters: A coherent summary is easier to study from.

How it works:

  • LLM rates the summary on a 1-5 scale across 3 dimensions:
    • Organization: Clear structure, logical section ordering
    • Flow: Smooth transitions, no abrupt jumps
    • Completeness: Each section is self-contained, no dangling references
  • Score = mean of 3 ratings, normalized to 0-100
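
The exact normalization is left open above; one plausible mapping (1 → 0, 5 → 100):

```python
def coherence_score(organization: int, flow: int, completeness: int) -> float:
    """Mean of three 1-5 ratings mapped to 0-100."""
    mean = (organization + flow + completeness) / 3
    return (mean - 1) / 4 * 100  # assumed mapping; the design only says "normalized to 0-100"
```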

Inputs: Summary text only (no source needed)

4. summary_information_density

What it tests: Is the summary appropriately concise? Does it avoid filler and repetition?

Why it matters: Verbose summaries waste study time. Dense summaries are more useful.

How it works:

  • LLM identifies: filler phrases, repeated information, unnecessary padding
  • Score = (total sentences - filler sentences) / total sentences * 100
  • Extra data: highlighted filler/repetition passages
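
A direct transcription of the formula, assuming the LLM pass reports a filler-sentence count:

```python
def density_score(total_sentences: int, filler_sentences: int) -> float:
    """(total - filler) / total * 100, per the formula above."""
    if total_sentences == 0:
        return 0.0
    return (total_sentences - filler_sentences) / total_sentences * 100
```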

Inputs: Summary text only


Implementation Priority

Phase 1 (essential for meaningful experiment results):

  1. quiz_answer_correctness - Without this, quiz comparison is guesswork
  2. summary_faithfulness - Without this, summary comparison is guesswork
  3. summary_coverage - Direct quality signal using existing infrastructure

Phase 2 (improves experiment depth):

  4. quiz_distractor_quality - Unique to MCQ, differentiating signal
  5. quiz_coverage - Reuses coverage_v2 pattern
  6. summary_coherence - Quick win, simple implementation

Phase 3 (nice to have):

  7. quiz_clarity - Simple but less discriminating
  8. summary_information_density - Subtle signal


Integration Plan

  1. New files: metrics/quiz_correctness.py, metrics/quiz_distractor_quality.py, metrics/quiz_coverage.py, metrics/summary_faithfulness.py, metrics/summary_coverage.py, metrics/summary_coherence.py
  2. New functions in sdk.py: evaluate_quiz(), evaluate_summary()
  3. Hook into experiment_runner.py after generation (parallel with flashcard eval)
  4. New MetricName literals for each surface
  5. Frontend: surface-specific metric display in ExperimentOverview
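
Hypothetical signatures for the new sdk.py entry points; the SurfaceEvalResult shape, parameter names, and types are all assumptions to be settled during implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SurfaceEvalResult:
    """Assumed result shape; mirrors whatever the flashcard evaluator returns."""
    scores: dict[str, float] = field(default_factory=dict)  # MetricName -> 0-100
    extra: dict[str, object] = field(default_factory=dict)  # per-metric details

def evaluate_quiz(questions: list[dict], content_pillars: list[dict]) -> SurfaceEvalResult:
    """Runs quiz_answer_correctness, quiz_distractor_quality, quiz_coverage, quiz_clarity."""
    ...

def evaluate_summary(summary_md: str, source_md: str,
                     content_pillars: list[dict]) -> SurfaceEvalResult:
    """Runs summary_faithfulness, summary_coverage, summary_coherence,
    summary_information_density."""
    ...
```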