Quiz Evaluation Metrics
Date: 2026-04-24
Eight LLM-judged metrics plus two static diagnostics. The first three are dealbreakers — if any of them fails, the quiz is broken. The rest are quality signals. The two static metrics (structural health and structural bias) run without any LLM calls and catch pipeline/formatting issues the judges can't see.
1. quiz_correctness — source-grounded answer verification
"Is the marked answer actually correct?"
This is the only metric with source grounding at judge time — the judge sees the full chunk text the question was generated from.
4 classes (not a scale — distinct failure modes):
| Class | What it means |
|---|---|
| CORRECT | Source explicitly supports the marked answer AND contradicts each distractor. Judge must quote the passage. |
| INCORRECT_ANSWER | Source contradicts the "correct" answer, or supports a distractor better. |
| INCORRECT_DISTRACTOR | A distractor is also valid per the source (two right answers). |
| UNSUPPORTED | Source doesn't have enough info to verify. No benefit of the doubt. |
Why 4-way instead of binary?
Each failure mode has a different fix:
- INCORRECT_ANSWER → fix the generation prompt to ground better
- INCORRECT_DISTRACTOR → fix distractor generation to check for overlap
- UNSUPPORTED → the generator is hallucinating beyond the source
Score:
correct_rate = CORRECT / total × 100
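A minimal sketch of the scoring step. Verdict strings match the table above; the function names are illustrative, not the pipeline's actual API:

```python
from collections import Counter

VERDICTS = ("CORRECT", "INCORRECT_ANSWER", "INCORRECT_DISTRACTOR", "UNSUPPORTED")

def correct_rate(verdicts: list[str]) -> float:
    """CORRECT / total × 100."""
    return Counter(verdicts)["CORRECT"] / len(verdicts) * 100

def failure_breakdown(verdicts: list[str]) -> dict[str, float]:
    """Per-class share, in percent. Worth tracking separately because
    each failure mode points at a different fix."""
    counts = Counter(verdicts)
    return {v: counts[v] / len(verdicts) * 100 for v in VERDICTS}
```

Keeping the breakdown alongside the headline rate is what makes the 4-way classification pay off: a deck at 93% correct with mostly UNSUPPORTED failures needs a different intervention than one with mostly INCORRECT_DISTRACTOR failures.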
Across variants: Tight range (92–95%). All models are good here — correctness is largely a solved problem at this point. The biggest differentiator is not the model but the chunking strategy.

Real failure examples (biology deck)
Failure 1 — INCORRECT_DISTRACTOR (two right answers)
Q: Which of the following are examples of transcellular fluid?
| Choice | Marked as | Verdict |
|---|---|---|
| Cerebrospinal fluid | ✓ Correct | Supported |
| Synovial fluid | ✓ Correct | Supported |
| Aqueous humor | ✓ Correct | Supported |
| Blood plasma | ✗ Distractor | Problem |
Why it failed: Source says "blood plasma, a special ECF compartment" — so plasma is ALSO a valid answer. The question has two correct answers but only one is marked correct.
Failure 2 — INCORRECT_DISTRACTOR (distractor is also valid)
Q: A patient presents with significant edema. Which could be a contributing factor?
| Choice | Marked as | Verdict |
|---|---|---|
| Decreased plasma proteins → reduced colloid osmotic pressure | ✓ Correct | Supported |
| Increased capillary permeability | ✗ Distractor | Problem |
Why it failed: Source explicitly says "Increased capillary permeability is the hallmark of the inflammatory response... forming an oedema." The distractor is actually a valid cause of edema too!
Failure 3 — UNSUPPORTED (inference beyond source)
Q: If a small, non-polar molecule needs to move against a steep concentration gradient, which transport mechanism would be most efficient?
Marked correct: Active transport
Why it failed: Source says active transport moves things against gradients, and that small non-polar molecules pass through membranes easily — but it never says active transport is used for non-polar molecules or compares "efficiency." The answer is a reasonable inference but isn't actually in the source.
Note from Ahmed: The third failure mode (UNSUPPORTED) is a misstep on my part. It sometimes penalizes reasonable inferences, which hurts evals when the LLM generates higher-order questions (UNDERSTAND/APPLY/ANALYZE) rather than pure recall. A question that requires the student to apply knowledge will naturally go beyond verbatim source text — but the metric flags it as unsupported. This creates tension between quiz_correctness and blooms_score: we want deeper questions, but the correctness metric punishes them. Need to revisit this.
2. quiz_triviality — can you solve it without studying?
"Does the question give away its own answer?"
If a student can pick the correct answer without knowing the subject — through test-taking tricks or common knowledge — the question is useless. It gives false confidence.
5 tell categories (first match wins):
| Tell | What it catches |
|---|---|
LITERAL_OVERLAP | Answer text appears verbatim in the stem |
CATEGORY_CUE | Stem asks for a type, only one choice fits |
GRAMMAR_CUE | Article/tense/plural fits only one choice |
TAUTOLOGY | Stem contains a word whose definition IS the answer |
COMMON_KNOWLEDGE | Educated adult knows it without studying |
What it doesn't check:
- Distractor quality (that's quiz_distractor_quality)
- Whether the answer is correct (that's quiz_correctness)
Score:
good_rate = NOT_TRIVIAL / total × 100
Across variants: Nano is worst (75–76%), flashlite is best (87%). The weaker the model, the more it leaks answers into its own stems.

Real failure examples
Example 1 — LITERAL_OVERLAP
Q: Defense mechanisms such as repression and denial are examples of __________. Answer: defense mechanisms
The answer is literally the first two words of the stem. Zero studying required.
— generated by gpt-5.4-nano
Example 2 — TAUTOLOGY
Q: Comprehensiveness refers to whether a personality theory explains most or all known facts and observations within its domain. True/False? Answer: True
The statement IS the definition of the term. It's tautological — can't be false.
— generated by gpt-5.4-nano
Example 3 — COMMON_KNOWLEDGE
Q: Which statement best explains the primary purpose of criminal law?
- ✓ To protect society and maintain social order
- ✗ To compensate private individuals for personal losses
- ✗ To regulate only business transactions
- ✗ To resolve family disagreements without state involvement
Any adult knows criminal law is about protecting society. No legal education needed.
— generated by gpt-5.4
Example 4 — COMMON_KNOWLEDGE (the "always" giveaway)
Q: Nudging is always highly effective for addressing large-scale environmental issues. True/False? Answer: False
The word "always" gives it away — nothing is "always" effective. A test-taking trick, not knowledge.
— generated by gemini-2.5-flash
3. blueprint_coverage — per-chunk learning-objective coverage
"Did the quiz actually test what the source material teaches?"
Step 1 — Extract learning objectives (once per deck)
Before any quiz variant runs, an LLM reads each chunk and extracts a short list of learning objectives (LOs). These are cached in blueprint_cache/ so every variant gets judged against the same blueprint.
Example from a chunk about mitosis:
LO1: Identify the four phases of mitosis
LO2: Explain the role of spindle fibers
LO3: Distinguish between mitosis and meiosis
Step 2 — Map questions to LOs
For each generated question, the judge checks: can the stem → correct-answer path be traced back to one of those LOs?
- "What are the four phases of mitosis?" → covers LO1 ✓
- "What color are spindle fibers under a microscope?" → doesn't cover any LO ✗
Step 3 — Score
blueprint_coverage = covered_LOs / total_LOs × 100
If the deck has 20 LOs across all chunks and the quiz hits 14 of them → 70%.
Why LOs instead of phrase extraction (like flashcard coverage)?
Quiz questions span multiple sentences — a single stem might synthesize info from three places in the chunk. Phrase-level matching undercounts that. LOs capture what the student should be able to do, which is the right unit for quizzes.
Across variants: The biggest spread of any metric. 5.4 dominates at 72%, flash models are worst at 37–51%. More capable models cover more learning objectives; topic chunking doesn't help flash here.

4. quiz_relevance — subject matter vs. admin trivia
"Is this question testing real knowledge, or is it asking what slide number something was on?"
Students upload all kinds of documents — lecture slides full of professor contact info, deadlines, textbook ISBNs. The generator sometimes latches onto that junk and turns it into quiz questions. This metric catches those.
Binary:
- GOOD — tests a concept, definition, mechanism, relationship
- BAD — tests admin logistics, document metadata, or one-off details that don't generalize
The judge receives the deck's document_summary so it knows what "on-topic" means for each deck. A question about Phoenix Park is trivia for a biology deck but might be relevant for a history deck — the summary provides that context.
Score:
good_rate = GOOD / total × 100
Across variants: Near-ceiling for everyone (96–99%). Relevance is mostly solved — generators rarely produce "what's the professor's email" questions. The remaining failures are edge cases like researcher names and peripheral dates.

Real failure examples
Name trivia (colonial policing deck — the doc IS about colonial policing, but knowing a specific park name isn't the point):
Q: The training of colonial police officers often took place at headquarters in __________ Park in Dublin. Answer: Phoenix
Understanding the Irish model's influence = GOOD. Memorizing the park name = BAD.
— generated by gemini-3.1-pro-preview
Researcher surname (personality psychology deck):
Q: The Five Factor Model originated by Tupes & Christal (1958) was later evolved by Digman & __________. Answer: Goldberg
Understanding the Five Factor Model = GOOD. Knowing a co-author's surname = BAD.
— generated by gemini-2.5-flash
Random decade (personality psychology deck):
Q: Initial attempts to link personality types to health conditions were made in the __________. Answer: 1960s
A peripheral date with zero pedagogical value.
— generated by gemini-2.5-flash
Formatting abbreviation (English analysis guide — the doc teaches analytical writing skills, not citation formatting):
Q: When referring to more than one line in an analysis, you should use the abbreviation '__________'. Answer: ll.
Testing a formatting convention, not analytical skill.
— generated by gemini-3.1-flash-lite-preview
5. quiz_distractor_quality — are the wrong answers good fakes?
"Can a student eliminate distractors without studying?"
Triviality asks if the stem gives away the right answer. Distractor quality asks the opposite: are the wrong answers obviously wrong? Same coin, opposite side.
The judge evaluates each distractor individually, applying 4 checks in order:
- Stem-contradiction — does the stem itself rule this out? → WEAK
- Cultural-elimination — would a non-student adult reject this? → WEAK
- Category mismatch — wrong subject area entirely? → ABSURD
- Otherwise → PLAUSIBLE
Key instruction: "Grade from the perspective of a student who has NOT studied. A distractor that an expert sees through is still PLAUSIBLE if a confused student could be tempted."
Only applies to MCQ / single-choice. True/false and fill-in-the-blank have no distractors to judge.
Score:
plausible_rate = PLAUSIBLE distractors / total distractors × 100
Real failure examples
The nano model generates distractors addicted to absolutes — "always", "guarantees", "eliminates all", "never". Any test-wise student crosses those out immediately.
Q: Which combination of stated pros for cap-and-trade is most accurate?
- ✗ "It guarantees no complexity and no price uncertainty" → WEAK
- ✗ "It is always cheaper without trade-offs" → WEAK
- ✗ "It removes any need for program design" → WEAK
All three eliminable on surface logic alone. No cap-and-trade knowledge needed.
— generated by gpt-5.4-nano
Q: Which policy argument is framed around intergenerational fairness?
- ✗ "Intergenerational fairness is the same as eliminating unemployment" → ABSURD
Completely off-topic. Different policy domain entirely.
— generated by gpt-5.4-nano
Q: Which con of cap-and-trade is described as tied to who may obtain permits?
- ✗ "It eliminates the risk of loopholes by fixing permits at zero" → WEAK
"Fixing permits at zero" is internally self-defeating / nonsensical.
— generated by gpt-5.4-nano
Across variants: Nano is worst (73–74%), flash/pro models cluster around 80–84%. Tracks closely with triviality — the models that leak answers into stems also produce weak distractors.

Note: This metric overlaps heavily with quiz_triviality. Triviality catches giveaways in the stem→answer path; distractor quality catches eliminable wrong answers. In practice they often fire on the same questions — if the distractors are garbage, the question is also trivially solvable. We may want to merge these or drop one in a future iteration.
6. blooms_score — how deep do the questions go?
"Is this quiz all recall, or does it actually make students think?"
Each question is classified on a 4-level Bloom's taxonomy subset:
| Level | Weight | What it tests | Example |
|---|---|---|---|
| REMEMBER | 1 | Recall a fact, term, date | "What is the capital of France?" |
| UNDERSTAND | 2 | Explain, compare, paraphrase | "Why is the left ventricle thicker than the right?" |
| APPLY | 3 | Use knowledge in a new situation | "A patient presents with X — which drug?" |
| ANALYZE | 4 | Examine relationships, multi-step reasoning | "Compare structural changes in CHF vs HCM" |
We stop at ANALYZE because generators essentially never produce EVALUATE/CREATE-level items. Including unused classes just inflates disagreement.
Score:
blooms_score = mean_weight / 4 × 100
A quiz that's 100% REMEMBER scores 25%. A healthy mix of 60% REMEMBER + 30% UNDERSTAND + 10% APPLY has a mean weight of 1.5 and scores 37.5%. Recall isn't a failure — it's the baseline. The score rewards variety, not the absence of recall.
The blooms_distribution field records the breakdown (e.g. 30% remember, 60% understand, 10% apply) — useful for spotting generators that collapse everything onto one level.
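Both numbers fall out of the same tally. A minimal sketch (weights match the table above; helper names are illustrative):

```python
from collections import Counter

BLOOM_WEIGHTS = {"REMEMBER": 1, "UNDERSTAND": 2, "APPLY": 3, "ANALYZE": 4}

def blooms_score(levels: list[str]) -> float:
    """mean_weight / 4 × 100 over one quiz's judged levels."""
    mean_weight = sum(BLOOM_WEIGHTS[lv] for lv in levels) / len(levels)
    return mean_weight / 4 * 100

def blooms_distribution(levels: list[str]) -> dict[str, float]:
    """Per-level share in percent, including zero-count levels."""
    counts = Counter(levels)
    return {lv: counts[lv] / len(levels) * 100 for lv in BLOOM_WEIGHTS}
```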
Across variants: All generators cluster in the 34–37% range. The distribution tells the story: ~60% REMEMBER, ~35% UNDERSTAND, ~2% APPLY, ~0% ANALYZE. There's a healthy REMEMBER/UNDERSTAND base but almost no APPLY or ANALYZE questions. gem31pro-high leads at 37.4% with the best UNDERSTAND share (40%). Pushing generators toward more APPLY-level questions is the main lever left.

7. difficulty_good_rate — right level for the audience?
"Is this quiz appropriately hard for who's taking it?"
Bloom's alone isn't enough — two ANALYZE questions can differ wildly in solve effort. So this metric combines two axes:
score = bloom(1–4) × complexity(1–5) → range 1–20
Bloom = cognitive level (same as above).
Complexity = how hard the subject matter is for the target audience. The judge is told the audience (high school / university / professional) and subject, so "What are the four chambers of the heart?" scores low complexity for a med student but higher for a middle-schooler.
| Complexity | Meaning |
|---|---|
| 1 | Basic fact any student in the field would know early on |
| 2 | Standard introductory curriculum concept |
| 3 | Connecting multiple concepts, moderate depth |
| 4 | Deep understanding, nuanced reasoning |
| 5 | Expert-level, synthesis across sub-domains |
Score:
difficulty_good_rate = questions with score >= 4 / total × 100
Questions scoring below 4 are both REMEMBER-level AND basic complexity — the easiest possible questions. difficulty_mean (the raw 1–20 average) is kept alongside as a diagnostic.
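The combined score and its threshold are simple enough to sketch directly (function names are illustrative):

```python
def difficulty_score(bloom: int, complexity: int) -> int:
    """bloom (1–4) × complexity (1–5) → range 1–20."""
    return bloom * complexity

def difficulty_good_rate(scores: list[int], floor: int = 4) -> float:
    """Share of questions at or above the floor, in percent.
    Scores below 4 are the easiest bloom × complexity combinations."""
    return sum(s >= floor for s in scores) / len(scores) * 100
```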
Across variants: Flat range (36–42%). Closely correlated with blooms — since most questions are REMEMBER, bloom(1) × complexity caps the score. gem31pro-high edges ahead at 42% but the difference is small.

8. uniqueness_rate — does the quiz ask the same thing twice?
"Did the generator produce 10 questions or 7 questions and 3 repeats?"
Two-stage pipeline:
Stage 1 — Embedding filter. Embed each question (stem + correct answer) with gemini-embedding-001. Compute pairwise cosine similarity. Surface pairs above 0.90.
Stage 2 — LLM arbitration. For each surfaced pair, a judge classifies:
- LAZY_DUPLICATE — same fact, rephrased. A student who answers one learns nothing from the other.
- REINFORCEMENT — same topic, different angle or cognitive level. Answering both strengthens understanding.
Examples:
- "What is photosynthesis?" + "Define the process of photosynthesis" → LAZY_DUPLICATE (same fact, surface rephrasing)
- "What is photosynthesis?" + "How would photosynthesis be affected if CO₂ levels doubled?" → REINFORCEMENT (recall vs. application)
Score:
uniqueness_rate = (1 - lazy_duplicate_questions / total) × 100
Reinforcement pairs do NOT count against the score.
Why two stages? Cosine alone flags too many reinforcement pairs as duplicates. LLM-only on every pair is expensive. The 0.90 threshold catches almost every true duplicate while the LLM only arbitrates the ambiguous ones.
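Stage 1 is a pairwise cosine scan over precomputed embeddings. A minimal sketch, assuming the (n_questions × dim) embedding matrix has already been produced elsewhere (e.g. by gemini-embedding-001, which this sketch does not call):

```python
import numpy as np

def surface_near_duplicates(embeddings: np.ndarray, threshold: float = 0.90):
    """Return index pairs whose cosine similarity exceeds the threshold.
    Only these pairs are sent to the LLM arbiter in Stage 2."""
    # Normalize rows so a plain dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    pairs = []
    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i, j] > threshold:
                pairs.append((i, j))
    return pairs

def uniqueness_rate(total: int, lazy_duplicates: int) -> float:
    """(1 - lazy_duplicate_questions / total) × 100."""
    return (1 - lazy_duplicates / total) * 100
```

The O(n²) scan is fine at quiz scale (tens of questions per deck); only the surfaced pairs incur LLM cost.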
Across variants: The widest split. Nano and 5.4 generate tons of duplicates (60–61%), while flashlite barely produces any (96%). This correlates with question count — models that generate more questions per chunk tend to repeat themselves more.

Side-channel diagnostics (static, no LLM)
The eight metrics above are all LLM-judged. Two additional static checks run without any LLM calls — they catch structural problems in the generator's output that a student could exploit or that indicate pipeline health issues.
quiz_structural — pipeline parse health
"Did the generator produce valid, parseable questions?"
Two sub-metrics that flag problems between the LLM's raw output and the question format we need:
rejection_rate — what fraction of the LLM's raw output failed post-parse validation?
rejection_rate = rejected_count / raw_question_count
Rejection reasons: invalid choice count (e.g. single_choice with <2 or >4 choices), missing correct answer, multiple correct answers on a single_choice question, wrong True/False prefix. These are structured-output failures — the generation prompt asked for one format, the model produced another.
chunk_yield_rate — what fraction of source chunks produced at least one valid question?
chunk_yield_rate = chunks_with_≥1_valid_question / total_chunks
A low yield means the LLM is silently failing on some chunks — either generating nothing or generating only questions that fail validation. Since process_chunk swallows schema-validation failures, chunk yield is the only way to catch these silent drops.
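Both sub-metrics come from one pass over the raw output. A minimal sketch with an illustrative question schema (the real validator also checks True/False prefixes and other reasons listed above):

```python
def is_valid(q: dict) -> bool:
    """Post-parse validation for one question (simplified)."""
    if q["type"] == "single_choice":
        choices = q["choices"]
        if not (2 <= len(choices) <= 4):
            return False  # invalid choice count
        correct = [c for c in choices if c["is_correct"]]
        return len(correct) == 1  # exactly one marked answer
    return True

def structural_health(raw_by_chunk: dict[str, list[dict]]) -> tuple[float, float]:
    """Returns (rejection_rate, chunk_yield_rate) over raw generator output."""
    total = sum(len(qs) for qs in raw_by_chunk.values())
    rejected = sum(1 for qs in raw_by_chunk.values() for q in qs if not is_valid(q))
    yielding = sum(1 for qs in raw_by_chunk.values() if any(is_valid(q) for q in qs))
    return rejected / total, yielding / len(raw_by_chunk)
```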
Across variants: Chunk yield is near-100% for all variants — models rarely fail to produce something for a chunk. Rejection rate is more interesting: 0–9% depending on deck complexity. 100% of rejections in the baseline were single_choice questions with multiple correct answers — the type contract isn't enforced strongly enough in the prompt.
quiz_structural_bias — surface-cue giveaways
"Can a student guess the right answer just by looking at the shape of the choices?"
No LLM, no subject knowledge — just character counts and True/False tallies. If the correct answer is always the longest option, a test-wise student picks longest every time and beats random chance.
Length outlier rate
For each single-correct question: is the correct answer uniquely the longest or shortest choice? If yes, a student can exploit the "pick the longest" heuristic without knowing anything.
length_outlier_rate = (longest_correct + shortest_correct) / single_correct_count
good_rate = (1 - length_outlier_rate) × 100
Only computed on questions with exactly one correct choice — the "pick the longest" trick only works cleanly when there's one answer to pick. Multi-correct MCQs are tracked separately as a side-channel.
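The check is pure character counting. A minimal sketch with an illustrative question schema; "uniquely" means no other choice ties the correct answer's length:

```python
def length_outlier_rate(questions: list[dict]) -> float:
    """Fraction of single-correct questions whose correct answer is
    uniquely the longest or shortest choice. good_rate = (1 - rate) × 100."""
    outliers, n = 0, 0
    for q in questions:
        correct = [c["text"] for c in q["choices"] if c["is_correct"]]
        if len(correct) != 1:
            continue  # multi-correct: tracked separately
        n += 1
        lengths = [len(c["text"]) for c in q["choices"]]
        clen = len(correct[0])
        is_unique = lengths.count(clen) == 1
        if is_unique and clen in (max(lengths), min(lengths)):
            outliers += 1
    return outliers / n if n else 0.0
```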
Across variants: This is the worst metric across the board. Every variant scores between 23–39% good_rate, meaning 61–77% of single-correct questions have the correct answer as a unique length outlier. The generation prompt already tells the model to match answer lengths — the model ignores that instruction ~6–7 times out of 10. flash-topic and 5.4 are worst (23–24%), gem31pro and nano-high are least bad (~38%) but still failing majority of the time. This is the single biggest structural fix remaining — a post-generation length-rewrite pass or a self-check step before emission.
T/F true-rate balance
For True/False questions: what fraction have "True" as the correct answer?
tf_true_rate = tf_true_count / tf_count
Balanced decks sit near 50%. Outside the [0.35, 0.65] band → flagged. If a student notices the pattern leans one way, "always pick False" (or "always pick True") beats random.
Supports multilingual T/F values — English "True", German "Wahr", French "Vrai", Spanish "Verdadero", etc.
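A minimal sketch of the tally; the accepted spellings mirror the languages named above (extend the set as needed):

```python
TRUE_VALUES = {"true", "wahr", "vrai", "verdadero"}

def tf_true_rate(tf_answers: list[str]) -> float:
    """Fraction of T/F questions whose correct answer is 'True' (any language)."""
    trues = sum(1 for a in tf_answers if a.strip().lower() in TRUE_VALUES)
    return trues / len(tf_answers)

def is_balanced(rate: float, lo: float = 0.35, hi: float = 0.65) -> bool:
    """Flag decks whose true-rate drifts outside the band."""
    return lo <= rate <= hi
```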
Across variants: Most models produce reasonably balanced T/F splits (81–95% balance score). flashlite and gem31pro are most balanced (93–95%), 5.4 is solid (92%). The outlier is nano-none at 65% — noticeably skewed, meaning "always pick True" (or False) gives an edge. nano-high is better at 76%, suggesting reasoning effort helps with T/F balance. flash variants cluster around 81–83%.

Note: Neither of these metrics uses an LLM. They run on the raw question JSON in milliseconds and catch problems that the LLM-judged metrics miss entirely — quiz_triviality can catch "the longest answer is always right" as a pattern only if the judge notices it across many questions, but structural_bias catches it deterministically on every single question. They complement each other: triviality catches semantic giveaways, structural bias catches statistical ones.
Baseline analysis & easy gains
The current production configuration is flash-topic — gemini-2.5-flash with topic-level chunking, no reasoning effort.
Where the baseline sits
| Metric | Score | Rank | Best variant | Gap |
|---|---|---|---|---|
| Correctness | 93.5% | 5/8 | gem31pro (95.1%) | +1.6pp |
| Non-triviality | 82.3% | 5/8 | flashlite-high (87.6%) | +5.3pp |
| Blueprint Coverage | 41.5% | 6/8 | 5.4 (72.3%) | +30.8pp |
| Relevance | 97.5% | 5/8 | 5.4 (98.9%) | +1.4pp |
| Distractor Quality | 84.1% | 2/8 | gem31pro (84.3%) | +0.2pp |
| Bloom's Depth | 34.7% | 6/8 | gem31pro (37.4%) | +2.7pp |
| Difficulty | 35.7% | 8/8 | gem31pro (42.3%) | +6.6pp |
| Uniqueness | 91.0% | 3/8 | flashlite-low (96.0%) | +5.0pp |
| Length Balance | 23.2% | 8/8 | nano-high (38.5%) | +15.3pp |
Dead last on difficulty and length balance. Near-bottom on coverage and Bloom's depth. Strong on distractor quality and uniqueness. Mediocre on everything else.

Easy gain #1 — switch from topic → subtopic chunking (free, no model change)
Same model, same cost, same speed. Just change the chunking parameter.
| Metric | flash-topic | flash (subtopic) | Delta |
|---|---|---|---|
| Blueprint Coverage | 41.5% | 50.7% | +9.2pp |
| Non-triviality | 82.3% | 83.0% | +0.8pp |
| Questions per gen | 42.6 | 68.7 | +26.1 |
| Uniqueness | 91.0% | 87.3% | -3.7pp |
| Distractor Quality | 84.1% | 83.5% | -0.6pp |
| Generation time | 20.7s | 21.8s | +1.1s |
Subtopic chunking gives the model smaller, more focused chunks → +9.2pp coverage and +61% more questions. The uniqueness drop (-3.7pp) is the tradeoff — more questions from the same material means more overlap risk — but 87% is still solid. Essentially free lunch.
Easy gain #2 — length-balancing rewrite pass
23.2% length balance is the worst score across all metrics for the baseline. The generation prompt already says "match answer lengths" — models ignore it. You can't fix this with pure string manipulation — trimming "The mitochondria converts glucose into ATP through oxidative phosphorylation" to match "Nucleus" destroys the content. This needs a cheap LLM rewrite pass (nano-tier model, just rewording distractors to roughly match the correct answer's length) or a self-check step where the generator reviews and rewrites its own choices before emission. Every variant suffers here (best is 38.5%), so this fix benefits any model choice.
Easy gain #3 — T/F balance enforcement (pure code)
81.4% T/F balance is fine but could be better. During generation, count the True/False split. If it drifts outside [0.35, 0.65], flip some question polarities ("X is true" → "X is false" with the answer inverted). Pure post-processing, no model cost.
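The enforcement step above could look like this. A sketch only: `negate_stem` is a hypothetical helper (a template rule or a tiny LLM call) that rewrites the statement's polarity, which is the one part that isn't pure code:

```python
def enforce_tf_balance(tf_qs: list[dict], negate_stem,
                       lo: float = 0.35, hi: float = 0.65) -> list[dict]:
    """Flip majority-polarity questions until the true-rate is inside [lo, hi].

    Each question dict has illustrative fields "stem" and "answer"
    ("True"/"False"). `negate_stem` inverts a statement's polarity.
    """
    n = len(tf_qs)
    trues = sum(q["answer"] == "True" for q in tf_qs)
    for q in tf_qs:
        rate = trues / n
        if lo <= rate <= hi:
            break  # balanced — stop flipping
        majority = "True" if rate > hi else "False"
        if q["answer"] == majority:
            q["stem"] = negate_stem(q["stem"])
            q["answer"] = "False" if majority == "True" else "True"
            trues += -1 if majority == "True" else 1
    return tf_qs
```

Flipping stops as soon as the rate re-enters the band, so the pass touches as few questions as possible.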
Medium effort — upgrade to gem31pro
| Metric | flash-topic | gem31pro | Delta |
|---|---|---|---|
| Blueprint Coverage | 41.5% | 60.1% | +18.6pp |
| Difficulty | 35.7% | 42.3% | +6.6pp |
| Non-triviality | 82.3% | 85.9% | +3.6pp |
| Bloom's Depth | 34.7% | 37.4% | +2.7pp |
| Correctness | 93.5% | 95.1% | +1.6pp |
| Questions per gen | 42.6 | 71.4 | +28.8 |
| Generation time | 20.7s | 91.8s | +71.1s (4.4×) |
Gains across the board — coverage, difficulty, triviality, correctness all improve. But 4.4× slower and more expensive. Could work as a premium tier or for high-stakes decks only.
Interesting outlier — the overgenerate-then-filter family (5.4, nano)
Both 5.4 and nano generate far more questions than flash — but a lot of them are duplicates. The question is whether the raw volume + dedup beats flash's smaller-but-cleaner output.
| Metric | flash-topic | nano | nano-high | 5.4 |
|---|---|---|---|---|
| Blueprint Coverage | 41.5% | 59.3% | 64.8% | 72.3% |
| Uniqueness | 91.0% | 61.5% | 70.4% | 60.2% |
| Raw questions | 42.6 | 113.0 | 104.4 | 158.3 |
| Effective unique qs | 39 | 69 | 73 | 95 |
| Distractor Quality | 84.1% | 74.0% | 73.5% | 74.1% |
| Non-triviality | 82.3% | 75.6% | 76.0% | 78.8% |
| Generation time | 20.7s | 76.3s | 232.1s | 63.1s |
Even after dedup, nano-none produces ~69 effective unique questions vs flash-topic's 39 — and covers +17.9pp more learning objectives. But the quality gap is real: -10pp distractor quality, -6.7pp triviality. Nano produces more questions, covering more material, but those questions are easier to game.
5.4 has the same quality problems plus costs more. Its advantage over nano is pure coverage volume (+13pp), but nano gets most of the coverage gain at a fraction of 5.4's cost.
The interesting experiment: nano + deduplication + distractor-rewrite pass. Generate with nano's volume (cheap, fast-ish), dedup the lazy duplicates, then run a quick rewrite pass on the surviving questions to fix distractor length/quality. If the rewrite pass is cheap (and it should be — it's reformulating existing distractors, not generating new knowledge), this pipeline could combine nano's coverage with flash-level quality.
Priority stack
- Switch to subtopic chunking — free, +9pp coverage, ship today
- Add length-balancing rewrite pass — cheap LLM call, fixes the worst metric across all variants
- Add T/F balance enforcement — pure code, minor but free
- Evaluate gem31pro for premium tier — big quality jump, need to price the latency cost
- Test nano + dedup + rewrite pipeline — nano's coverage at ~1/10th the cost of 5.4, with a quality cleanup pass on top
Language effects
The eval set covers 4 languages: English (826 evals), German (368), French (289), Dutch (176).
Dutch correctness drops across the board
Dutch correctness averages 88.8% vs 94.6% for English — a -5.8pp gap that holds across every variant (85–93% Dutch vs 92–95% English) and both judges (GPT-5.4: 87% vs 95%, Gemini: 90% vs 95%). This is a real generator problem, not judge noise — models produce more incorrect answers in Dutch. Only 4 Dutch decks though, so small sample.
Judges are lenient on non-English triviality and distractor quality
Both judges consistently rate French and German as less trivial and better distractors than English. French non-triviality: 90% vs English 85% (GPT-5.4), 86% vs 74% (Gemini). The same pattern holds for German.
This is almost certainly a judge blind spot — catching "common knowledge" giveaways and grammar cues is harder in a language the judge is less fluent in. A French student would spot a giveaway that neither judge flags. This means non-English triviality and distractor quality scores are probably inflated — the real quality gap between languages may be larger than the numbers show.

Cost analysis
Actual costs from llm_usage_logs (generation) and pipeline eval_cost fields (evaluation): 830 generation runs × 2 judges ≈ 1,659 evals.
Generation cost per variant
| Variant | $/run | Total | Runs | Notes |
|---|---|---|---|---|
| flashlite-low | $0.0042 | $0.48 | 113 | Cheapest. No reasoning tokens. |
| flashlite-high | $0.0068 | $0.76 | 112 | +62% over low — reasoning effort bump |
| flash-topic | $0.0164 | $1.84 | 112 | Baseline. 2.4× flashlite-high |
| nano | $0.0184 | $1.95 | 106 | Comparable to flash-topic |
| flash | $0.0265 | $2.97 | 112 | Same model, subtopic chunks = +61% cost |
| nano-high | $0.0472 | $3.02 | 64 | Reasoning effort = 2.6× nano |
| 5.4 | $0.1081 | $11.35 | 105 | 6.6× baseline |
| gem31pro | $0.4457 | $47.25 | 106 | 27× baseline. Reasoning tokens dominate. |
Total generation cost: $69.61 across all 830 runs.
gem31pro is in a different universe — $0.45/run vs $0.016 for the baseline. Its coverage lead (+18.6pp) costs 27× more per generation. Nano gets most of that coverage gain (+17.8pp) at 1.1× the baseline cost — orders of magnitude better cost-efficiency.
Evaluation cost
| Judge | Total | Evals | $/eval |
|---|---|---|---|
| Gemini 3.1 Pro | $185.85 | 830 | $0.224 |
| GPT-5.4 | $40.06 | 829 | $0.048 |
| Total | $225.91 | 1,659 | $0.136 |
Gemini is 4.6× more expensive per eval than GPT-5.4. Since both judges preserve variant rankings on the metrics that matter (coverage ρ=0.98, triviality ρ=0.95), the obvious move for future evals is to drop to a single GPT-5.4 judge and cut eval cost by ~80%.
Total pipeline cost
| Component | Cost |
|---|---|
| Generation (830 runs) | $69.61 |
| Evaluation (1,659 evals) | $225.91 |
| Total | $295.52 |
Eval cost is 3.2× generation cost. For future experiments: single judge + fewer reps (2 instead of 3) would cut eval cost to ~$27 while preserving ranking accuracy.
Cost-efficiency: where to spend the next dollar
The question isn't "which model is best" — it's "which model gives the most quality per dollar."
| Variant | Gen $/run | Coverage | Triviality | Coverage per $0.01 |
|---|---|---|---|---|
| nano | $0.018 | 59.3% | 75.6% | 32.2pp |
| flash-topic | $0.016 | 41.5% | 82.3% | 25.3pp |
| nano-high | $0.047 | 64.8% | 76.0% | 13.7pp |
| flash | $0.027 | 50.7% | 83.0% | 19.2pp |
| 5.4 | $0.108 | 72.3% | 78.8% | 6.7pp |
| gem31pro | $0.446 | 60.1% | 85.9% | 1.3pp |
Nano delivers the best coverage-per-dollar by a wide margin. gem31pro's coverage is actually lower than 5.4 despite costing 4× more — its strength is triviality and difficulty, not coverage. The interesting frontier is nano + quality cleanup pass: nano's coverage volume at nano's price, with a cheap rewrite pass to fix the distractor/triviality gap.

