apps/learning-api/evals-playground/metrics/README.md

Quiz eval metrics — canonical rubrics

Single source of truth for what each metric measures. When you change a rubric here, update BOTH the workflow judge's prompt (in the metric's .py) AND the cross-check bundle prep (in cli.py) so the two judges stay aligned. Drift between them makes agreement numbers meaningless.

quiz_correctness

What it measures: whether each question's marked-correct answer is supported by the source text the question was generated from, and whether its distractors are not.

Inputs: questions with chunk_id pointing to a topic-merged chunk whose full text is supplied.

Classes (4-way, objective):

  • CORRECT — source text EXPLICITLY supports the marked correct answer AND contradicts each distractor. Judge must be able to quote the supporting passage.
  • INCORRECT_ANSWER — source contradicts a choice marked correct, or supports a choice marked as a distractor more strongly than the marked correct one.
  • INCORRECT_DISTRACTOR — a distractor is also a valid correct answer per the source.
  • UNSUPPORTED — source does not contain enough information to verify either way. Do NOT give benefit of the doubt.

Notes: 4-way because the classes are categorically distinct (not ordinal degrees), so they stay multi-class for both workflow and agent judges. Agent calibration on sonnet: ~94% agreement with workflow across 5 decks.
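
For orientation, the judge's per-question output can be pictured as a 4-way label plus the supporting quote. A minimal sketch; the class and field names are illustrative, not the actual types in the metric's .py:

```python
from dataclasses import dataclass
from enum import Enum


class CorrectnessVerdict(str, Enum):
    CORRECT = "CORRECT"
    INCORRECT_ANSWER = "INCORRECT_ANSWER"
    INCORRECT_DISTRACTOR = "INCORRECT_DISTRACTOR"
    UNSUPPORTED = "UNSUPPORTED"


@dataclass
class QuestionVerdict:
    question_id: str
    chunk_id: str                  # topic-merged chunk the question came from
    verdict: CorrectnessVerdict
    supporting_quote: str | None   # quoted passage; expected when verdict is CORRECT
    reasoning: str
```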

quiz_distractor_quality

What it measures: whether distractors are plausible enough to tempt a student who has NOT studied the material.

Inputs: MCQ / single_choice questions only (fill_in_the_blank and true_false skipped).

Classes (binary, subjective):

  • PLAUSIBLE — a student without subject knowledge could reasonably be tempted to pick this. Common misconceptions, adjacent domain concepts with correct terminology, inverted causality using real facts, partial truths.
  • NOT_PLAUSIBLE — eliminable without studying: contradicts an adjective in the question stem, contradicts culturally-embedded common knowledge, off-topic, or too generic to answer the question.

Notes: the subjective middle class (WEAK) was dropped, collapsing the rubric to binary, because agent-vs-agent self-agreement on the 3-way version was ~72%, making any >70% workflow-vs-agent target impossible. Binary + opus gets agent self-agreement to ~91%.
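
The agreement figures in these notes are read here as plain label-match rates between two judging runs; a minimal sketch under that assumption (not the actual calibration code):

```python
def agreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two judging runs assign the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# e.g. two agent runs over the same distractors:
#   ~0.72 on the old 3-way rubric, ~0.91 on the binary rubric with opus
```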

quiz_redundancy

What it measures: whether two questions in the same deck test the same underlying fact; different phrasing or choice sets still count as redundant if the tested fact is the same.

Inputs: all questions from a single generation.

How: adapter over metrics/redundancy.py. Questions become Flashcard-shaped (question + joined correct answers). Embedding-threshold retrieval (BM25 + cosine + RRF) produces candidate pairs; LLM verification confirms true redundancy. Returns redundancy groups.
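
A minimal sketch of the adapter step, assuming dict-shaped questions with stem/choices/is_correct fields (the real mapping lives in the quiz_redundancy adapter over metrics/redundancy.py):

```python
def to_flashcard_shape(question: dict) -> dict:
    """Map a quiz question onto the flashcard-shaped input the shared
    redundancy metric expects: stem as the question, joined correct-answer
    texts as the answer."""
    correct = [c["text"] for c in question["choices"] if c["is_correct"]]
    return {
        "id": question["id"],
        "question": question["stem"],
        "answer": "; ".join(correct),
    }


# metrics/redundancy.py then proposes candidate pairs via BM25 + cosine
# similarity fused with RRF, and an LLM verification pass keeps only pairs
# that truly test the same fact, returning redundancy groups.
```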

Notes: per-deck signal. Shares prompt + threshold with flashcard redundancy so the rubric stays consistent across surfaces. Update metrics/redundancy.py to change behavior; quiz_redundancy is a thin adapter.

quiz_relevance

What it measures: whether each question tests meaningful, generalisable subject-matter knowledge versus administrative/logistical trivia or non-generalisable illustrative-example details.

Inputs: questions only, plus an optional document_summary (pulled from content_pillars["document_summary"]) so the judge knows what "on-topic" means for this deck.

Classes (binary, subjective):

  • GOOD — tests a concept, definition, mechanism, relationship, or key fact from the subject matter.
  • BAD — tests admin/logistics (professor email, deadlines, slide numbers, textbook ISBN, URLs), document/source metadata (section headings as trivia), or one-off details from illustrative anecdotes.

Notes: binary for the same reason as quiz_distractor_quality — a middle "debatable" class makes self-agreement fall below any usable threshold. good_rate = good / total. Mirrors the flashcard relevance LLM path.
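
A minimal sketch of the aggregation and the optional summary lookup; function names are illustrative:

```python
def document_summary(content_pillars: dict) -> str | None:
    """Optional deck-level summary handed to the judge so it knows what
    "on-topic" means; decks without one simply omit that context."""
    return content_pillars.get("document_summary")


def relevance_good_rate(verdicts: list[str]) -> float:
    """good_rate = GOOD verdicts / total questions judged."""
    return verdicts.count("GOOD") / len(verdicts) if verdicts else 0.0
```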

quiz_difficulty

What it measures: whether the stem→correct-choice path is too easy — i.e. a student who has NOT studied the source can pick the right answer via stem-echo, tautology, grammatical cueing, or common cultural knowledge. Orthogonal to quiz_distractor_quality, which judges the wrong-answer elimination path.

Inputs: questions only (no source grounding). Judge sees stem + all choices with correctness marks.

Classes (binary, subjective):

  • TRIVIAL — a student who has not studied the source could arrive at the correct answer. Tells:
    • Stem-echo — distinctive stem term paraphrased in the correct choice but absent from distractors.
    • Grammatical cueing — stem grammar (a/an, singular/plural, tense) fits only one choice.
    • Common-knowledge — educated-adult general knowledge suffices (famous geography, basic biology).
    • Tautology — the question's own terms give the answer (e.g. "the dual aspect" → "two").
  • NOT_TRIVIAL — picking the correct answer requires specific subject knowledge.

Notes: binary; reasoning MUST name the specific tell when TRIVIAL. Complements quiz_distractor_quality (right-answer angle vs wrong-answer angle of the too_easy user complaint).
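
One way to picture the "must name the tell" constraint is a verdict record that refuses TRIVIAL without a tell. A sketch with illustrative names, not the actual judge schema:

```python
from dataclasses import dataclass
from typing import Literal

Tell = Literal["stem_echo", "grammatical_cueing", "common_knowledge", "tautology"]


@dataclass
class DifficultyVerdict:
    question_id: str
    verdict: Literal["TRIVIAL", "NOT_TRIVIAL"]
    tell: Tell | None   # which giveaway made it trivial
    reasoning: str

    def validate(self) -> None:
        if self.verdict == "TRIVIAL" and self.tell is None:
            raise ValueError("TRIVIAL verdicts must name the specific tell")
```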

quiz_leakage

What it measures: whether the stem LITERALLY reveals the correct-choice text. Strict literal-overlap check; paraphrased stem-echo belongs to quiz_difficulty.

Inputs: questions only (stem + choices with correctness marks).

Classes (binary, objective):

  • BAD — a verbatim token or multi-word phrase from the correct-choice text appears in the stem in a way that distinguishes the correct choice from distractors. Includes fill-in-the-blank questions where the answer word appears outside the blank position.
  • GOOD — no literal overlap that gives the correct choice away.

Notes: port of the flashcard metrics/leakage.py rubric to quiz inputs; kept deliberately narrow (literal match only) so the LLM judge stays consistent. Conceptual/semantic hints and shared domain vocabulary are NOT leakage — they fall under difficulty.
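
To make "literal overlap" concrete, here is an illustrative token-level check. It is not how the metric is computed (the metric is an LLM judge applying this rubric); it is just a picture of what counts:

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def looks_like_leakage(stem: str, correct_text: str, distractor_texts: list[str]) -> bool:
    """Naive picture of the rubric: a token from the correct choice appears in
    the stem while appearing in none of the distractors. No stop-word filtering,
    no multi-word phrase matching; illustrative only."""
    distractor_tokens = set().union(*(tokens(d) for d in distractor_texts)) if distractor_texts else set()
    giveaway = tokens(correct_text) & (tokens(stem) - distractor_tokens)
    return bool(giveaway)
```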

quiz_clarity

What it measures: whether a question is self-contained — a prepared student can answer it from the question text + choices alone.

Inputs: questions only (no source).

Classes (3-way):

  • CLEAR — self-contained, one unambiguous interpretation.
  • AMBIGUOUS — multiple plausible readings of what's being asked, or the right answer depends on how the student interprets the phrasing.
  • CONTEXT_DEPENDENT — references material the student can't see ("according to the passage", "the author argues", dangling pronouns).

Notes: no source grounding needed — clarity is a property of the question text itself. good_rate = clear / total.

quiz_structural (static, no LLM)

rejection_rate: rejected_count / raw_question_count. Post-LLM-parse validation failures (invalid_choices, missing correct answer, wrong T/F prefix). Flags malformed structured output from the generation prompt.

chunk_yield_rate: chunks_with_≥1_valid_question / total_chunks. Catches silent LLM-parse failures where a chunk yielded zero questions.
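
Both rates are plain ratios; a minimal sketch (function names illustrative):

```python
def rejection_rate(rejected_count: int, raw_question_count: int) -> float:
    """Post-parse validation failures / raw questions emitted by the LLM."""
    return rejected_count / raw_question_count if raw_question_count else 0.0


def chunk_yield_rate(chunks_with_valid_question: int, total_chunks: int) -> float:
    """Chunks that produced at least one valid question / all chunks sent."""
    return chunks_with_valid_question / total_chunks if total_chunks else 0.0
```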

quiz_structural_bias (static, no LLM)

Measures surface-cue giveaways a student could exploit without knowing the material. Computed only on single-correct questions (multi-correct MCQ reported as a side channel); a sketch of both checks follows the list.

  • length_outlier_rate: fraction of single-correct questions where the correct answer is uniquely longest or shortest. good_rate = 1 - length_outlier_rate.
  • tf_true_rate: fraction of true_false questions where "True" is the correct answer. Balanced = 0.5; flagged outside [0.35, 0.65].
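
A minimal sketch of both checks, assuming dict-shaped questions with choices/is_correct/correct_answer fields (names are illustrative):

```python
def length_outlier_rate(single_correct_questions: list[dict]) -> float:
    """Fraction of single-correct questions whose correct choice is uniquely
    the longest or uniquely the shortest choice by character count."""
    outliers = 0
    for q in single_correct_questions:
        lengths = [len(c["text"]) for c in q["choices"]]
        correct_len = next(len(c["text"]) for c in q["choices"] if c["is_correct"])
        is_unique = lengths.count(correct_len) == 1
        if is_unique and correct_len in (max(lengths), min(lengths)):
            outliers += 1
    return outliers / len(single_correct_questions) if single_correct_questions else 0.0


def tf_true_rate(true_false_questions: list[dict]) -> float:
    """Fraction of true_false questions whose correct answer is "True".
    Balanced decks sit near 0.5; values outside [0.35, 0.65] get flagged."""
    trues = sum(1 for q in true_false_questions if q["correct_answer"] == "True")
    return trues / len(true_false_questions) if true_false_questions else 0.0
```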

Calibration workflow

Before cross-checking any LLM metric against subagents, confirm the rubric in this doc matches the rubric embedded in:

  • the workflow prompt (search for _PROMPT = in the metric's .py)
  • the bundle preparation (search for prepare-*-review in cli.py)

If they've drifted, update all three to match before trusting any agreement numbers.