apps/learning-api/evals-playground/reports/2026-04-24-quiz-eval-metrics.md

Quiz Evaluation Metrics

Date: 2026-04-24

Eight LLM-judged metrics plus two static diagnostics. The first three are dealbreakers — if any of them fails, the quiz is broken. The rest are quality signals. The two static metrics (structural health and structural bias) run without any LLM calls and catch pipeline/formatting issues the judges can't see.


1. quiz_correctness — source-grounded answer verification

"Is the marked answer actually correct?"

This is the only metric with source grounding at judge time — the judge sees the full chunk text the question was generated from.

4 classes (not a scale — distinct failure modes):

| Class | What it means |
|---|---|
| CORRECT | Source explicitly supports the marked answer AND contradicts each distractor. Judge must quote the passage. |
| INCORRECT_ANSWER | Source contradicts the "correct" answer, or supports a distractor better. |
| INCORRECT_DISTRACTOR | A distractor is also valid per the source (two right answers). |
| UNSUPPORTED | Source doesn't have enough info to verify. No benefit of the doubt. |

Why 4-way instead of binary?

Each failure mode has a different fix:

  • INCORRECT_ANSWER → fix the generation prompt to ground better
  • INCORRECT_DISTRACTOR → fix distractor generation to check for overlap
  • UNSUPPORTED → the generator is hallucinating beyond the source

Score:

correct_rate = CORRECT / total × 100
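A minimal sketch of the aggregation, assuming the judge output has already been reduced to one verdict string per question (the input shape is illustrative, not the pipeline's actual schema):

```python
from collections import Counter

# Verdict strings follow the four classes above.
def correctness_summary(verdicts: list[str]) -> dict:
    counts = Counter(verdicts)
    total = len(verdicts)
    classes = ("CORRECT", "INCORRECT_ANSWER", "INCORRECT_DISTRACTOR", "UNSUPPORTED")
    return {
        "correct_rate": 100.0 * counts["CORRECT"] / total if total else 0.0,
        # Per-class counts: each failure class points at a different fix.
        "breakdown": {cls: counts.get(cls, 0) for cls in classes},
    }
```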

Across variants: Tight range (92–95%). All models are good here — correctness is largely a solved problem at this point. The biggest differentiator is not the model but the chunking strategy.

Correctness by variant

Real failure examples (biology deck)

Failure 1 — INCORRECT_DISTRACTOR (two right answers)

Q: Which of the following are examples of transcellular fluid?

| Choice | Marked as | Verdict |
|---|---|---|
| Cerebrospinal fluid | ✓ Correct | Supported |
| Synovial fluid | ✓ Correct | Supported |
| Aqueous humor | ✓ Correct | Supported |
| Blood plasma | ✗ Distractor | Problem |

Why it failed: Source says "blood plasma, a special ECF compartment" — so plasma is ALSO a valid answer. The question has two correct answers but only one is marked correct.


Failure 2 — INCORRECT_DISTRACTOR (distractor is also valid)

Q: A patient presents with significant edema. Which could be a contributing factor?

| Choice | Marked as | Verdict |
|---|---|---|
| Decreased plasma proteins → reduced colloid osmotic pressure | ✓ Correct | Supported |
| Increased capillary permeability | ✗ Distractor | Problem |

Why it failed: Source explicitly says "Increased capillary permeability is the hallmark of the inflammatory response... forming an oedema." The distractor is actually a valid cause of edema too!


Failure 3 — UNSUPPORTED (inference beyond source)

Q: If a small, non-polar molecule needs to move against a steep concentration gradient, which transport mechanism would be most efficient?

Marked correct: Active transport

Why it failed: Source says active transport moves things against gradients, and that small non-polar molecules pass through membranes easily — but it never says active transport is used for non-polar molecules or compares "efficiency." The answer is a reasonable inference but isn't actually in the source.


Note from Ahmed: The third failure mode (UNSUPPORTED) is a fuck-up on my part. It sometimes penalizes reasonable inferences, which hurts evals when the LLM generates higher-order questions (UNDERSTAND/APPLY/ANALYZE) rather than pure recall. A question that requires the student to apply knowledge will naturally go beyond verbatim source text — but the metric flags it as unsupported. This creates tension between quiz_correctness and blooms_score: we want deeper questions, but the correctness metric punishes them. Need to revisit this.


2. quiz_triviality — can you solve it without studying?

"Does the question give away its own answer?"

If a student can pick the correct answer without knowing the subject — through test-taking tricks or common knowledge — the question is useless. It gives false confidence.

5 tell categories (first match wins):

| Tell | What it catches |
|---|---|
| LITERAL_OVERLAP | Answer text appears verbatim in the stem |
| CATEGORY_CUE | Stem asks for a type, only one choice fits |
| GRAMMAR_CUE | Article/tense/plural fits only one choice |
| TAUTOLOGY | Stem contains a word whose definition IS the answer |
| COMMON_KNOWLEDGE | Educated adult knows it without studying |

What it doesn't check:

  • Distractor quality (that's quiz_distractor_quality)
  • Whether the answer is correct (that's quiz_correctness)

Score:

good_rate = NOT_TRIVIAL / total × 100

Across variants: Nano is worst (75–76%), flashlite is best (87%). The weaker the model, the more it leaks answers into its own stems.

Triviality by variant

Real failure examples

Example 1 — LITERAL_OVERLAP

Q: Defense mechanisms such as repression and denial are examples of __________. Answer: defense mechanisms

The answer is literally the first two words of the stem. Zero studying required.

— generated by gpt-5.4-nano


Example 2 — TAUTOLOGY

Q: Comprehensiveness refers to whether a personality theory explains most or all known facts and observations within its domain. True/False? Answer: True

The statement IS the definition of the term. It's tautological — can't be false.

— generated by gpt-5.4-nano


Example 3 — COMMON_KNOWLEDGE

Q: Which statement best explains the primary purpose of criminal law?

  • ✓ To protect society and maintain social order
  • ✗ To compensate private individuals for personal losses
  • ✗ To regulate only business transactions
  • ✗ To resolve family disagreements without state involvement

Any adult knows criminal law is about protecting society. No legal education needed.

— generated by gpt-5.4


Example 4 — COMMON_KNOWLEDGE (the "always" giveaway)

Q: Nudging is always highly effective for addressing large-scale environmental issues. True/False? Answer: False

The word "always" gives it away — nothing is "always" effective. A test-taking trick, not knowledge.

— generated by gemini-2.5-flash


3. blueprint_coverage — per-chunk learning-objective coverage

"Did the quiz actually test what the source material teaches?"

Step 1 — Extract learning objectives (once per deck)

Before any quiz variant runs, an LLM reads each chunk and extracts a short list of learning objectives (LOs). These are cached in blueprint_cache/ so every variant gets judged against the same blueprint.

Example from a chunk about mitosis:

LO1: Identify the four phases of mitosis
LO2: Explain the role of spindle fibers
LO3: Distinguish between mitosis and meiosis

Step 2 — Map questions to LOs

For each generated question, the judge checks: can the stem → correct-answer path be traced back to one of those LOs?

  • "What are the four phases of mitosis?" → covers LO1 ✓
  • "What color are spindle fibers under a microscope?" → doesn't cover any LO ✗

Step 3 — Score

blueprint_coverage = covered_LOs / total_LOs × 100

If the deck has 20 LOs across all chunks and the quiz hits 14 of them → 70%.
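A sketch of the Step 3 arithmetic, assuming the judge's per-question mappings have already been reduced to sets of LO ids (field shapes are illustrative):

```python
def blueprint_coverage(blueprint_lo_ids: set[str],
                       question_lo_ids: list[set[str]]) -> float:
    """blueprint_lo_ids: every LO in the deck's cached blueprint.
    question_lo_ids: for each question, the LO ids the judge mapped it to."""
    if not blueprint_lo_ids:
        return 0.0
    covered = set().union(*question_lo_ids) & blueprint_lo_ids
    return 100.0 * len(covered) / len(blueprint_lo_ids)

# A deck with 20 LOs where the quiz hits 14 of them -> 70.0
```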

Why LOs instead of phrase extraction (like flashcard coverage)?

Quiz questions span multiple sentences — a single stem might synthesize info from three places in the chunk. Phrase-level matching undercounts that. LOs capture what the student should be able to do, which is the right unit for quizzes.

Across variants: The biggest spread of any metric. 5.4 dominates at 72%, flash models are worst at 37–51%. More capable models cover more learning objectives; topic chunking doesn't help flash here.

Blueprint coverage by variant


4. quiz_relevance — subject matter vs. admin trivia

"Is this question testing real knowledge, or is it asking what slide number something was on?"

Students upload all kinds of documents — lecture slides full of professor contact info, deadlines, textbook ISBNs. The generator sometimes latches onto that junk and turns it into quiz questions. This metric catches those.

Binary:

  • GOOD — tests a concept, definition, mechanism, relationship
  • BAD — tests admin logistics, document metadata, or one-off details that don't generalize

The judge receives the deck's document_summary so it knows what "on-topic" means for each deck. A question about Phoenix Park is trivia for a biology deck but might be relevant for a history deck — the summary provides that context.

Score:

good_rate = GOOD / total × 100

Across variants: Near-ceiling for everyone (96–99%). Relevance is mostly solved — generators rarely produce "what's the professor's email" questions. The remaining failures are edge cases like researcher names and peripheral dates.

Relevance by variant

Real failure examples

Name trivia (colonial policing deck — the doc IS about colonial policing, but knowing a specific park name isn't the point):

Q: The training of colonial police officers often took place at headquarters in __________ Park in Dublin. Answer: Phoenix

Understanding the Irish model's influence = GOOD. Memorizing the park name = BAD.

— generated by gemini-3.1-pro-preview


Researcher surname (personality psychology deck):

Q: The Five Factor Model originated by Tupes & Christal (1958) was later evolved by Digman & __________. Answer: Goldberg

Understanding the Five Factor Model = GOOD. Knowing a co-author's surname = BAD.

— generated by gemini-2.5-flash


Random decade (personality psychology deck):

Q: Initial attempts to link personality types to health conditions were made in the __________. Answer: 1960s

A peripheral date with zero pedagogical value.

— generated by gemini-2.5-flash


Formatting abbreviation (English analysis guide — the doc teaches analytical writing skills, not citation formatting):

Q: When referring to more than one line in an analysis, you should use the abbreviation '__________'. Answer: ll.

Testing a formatting convention, not analytical skill.

— generated by gemini-3.1-flash-lite-preview


5. quiz_distractor_quality — are the wrong answers good fakes?

"Can a student eliminate distractors without studying?"

Triviality asks if the stem gives away the right answer. Distractor quality asks the opposite: are the wrong answers obviously wrong? Same coin, opposite side.

The judge evaluates each distractor individually, applying 4 checks in order:

  1. Stem-contradiction — does the stem itself rule this out? → WEAK
  2. Cultural-elimination — would a non-student adult reject this? → WEAK
  3. Category mismatch — wrong subject area entirely? → ABSURD
  4. Otherwise → PLAUSIBLE

Key instruction: "Grade from the perspective of a student who has NOT studied. A distractor that an expert sees through is still PLAUSIBLE if a confused student could be tempted."

Only applies to MCQ / single-choice. True/false and fill-in-the-blank have no distractors to judge.

Score:

plausible_rate = PLAUSIBLE distractors / total distractors × 100

Real failure examples

The nano model generates distractors addicted to absolutes — "always", "guarantees", "eliminates all", "never". Any test-wise student crosses those out immediately.

Q: Which combination of stated pros for cap-and-trade is most accurate?

  • ✗ "It guarantees no complexity and no price uncertainty" → WEAK
  • ✗ "It is always cheaper without trade-offs" → WEAK
  • ✗ "It removes any need for program design" → WEAK

All three eliminable on surface logic alone. No cap-and-trade knowledge needed.

— generated by gpt-5.4-nano


Q: Which policy argument is framed around intergenerational fairness?

  • ✗ "Intergenerational fairness is the same as eliminating unemployment" → ABSURD

Completely off-topic. Different policy domain entirely.

— generated by gpt-5.4-nano


Q: Which con of cap-and-trade is described as tied to who may obtain permits?

  • ✗ "It eliminates the risk of loopholes by fixing permits at zero" → WEAK

"Fixing permits at zero" is internally self-defeating / nonsensical.

— generated by gpt-5.4-nano


Across variants: Nano is worst (73–74%), flash/pro models cluster around 80–84%. Tracks closely with triviality — the models that leak answers into stems also produce weak distractors.

Distractor quality by variant

Note: This metric overlaps heavily with quiz_triviality. Triviality catches giveaways in the stem→answer path; distractor quality catches eliminable wrong answers. In practice they often fire on the same questions — if the distractors are garbage, the question is also trivially solvable. We may want to merge these or drop one in a future iteration.


6. blooms_score — how deep do the questions go?

"Is this quiz all recall, or does it actually make students think?"

Each question is classified on a 4-level Bloom's taxonomy subset:

| Level | Weight | What it tests | Example |
|---|---|---|---|
| REMEMBER | 1 | Recall a fact, term, date | "What is the capital of France?" |
| UNDERSTAND | 2 | Explain, compare, paraphrase | "Why is the left ventricle thicker than the right?" |
| APPLY | 3 | Use knowledge in a new situation | "A patient presents with X — which drug?" |
| ANALYZE | 4 | Examine relationships, multi-step reasoning | "Compare structural changes in CHF vs HCM" |

We stop at ANALYZE because generators essentially never produce EVALUATE/CREATE-level items. Including unused classes just inflates disagreement.

Score:

blooms_score = mean_weight / 4 × 100

A quiz that's 100% REMEMBER scores 25%. A healthy mix of 60% REMEMBER + 30% UNDERSTAND + 10% APPLY scores 37.5%. Recall isn't a failure — it's the baseline. The score rewards variety, not the absence of recall.

The blooms_distribution field records the breakdown (e.g. 30% remember, 60% understand, 10% apply) — useful for spotting generators that collapse everything onto one level.
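A sketch of both numbers, assuming one Bloom label per question from the judge and the weights from the table above:

```python
BLOOM_WEIGHTS = {"REMEMBER": 1, "UNDERSTAND": 2, "APPLY": 3, "ANALYZE": 4}

def blooms_metrics(levels: list[str]) -> tuple[float, dict[str, float]]:
    """levels: one Bloom label per question, as classified by the judge."""
    if not levels:
        return 0.0, {}
    mean_weight = sum(BLOOM_WEIGHTS[lvl] for lvl in levels) / len(levels)
    score = 100.0 * mean_weight / 4
    distribution = {lvl: 100.0 * levels.count(lvl) / len(levels) for lvl in BLOOM_WEIGHTS}
    return score, distribution

# All REMEMBER -> 25.0; a 60/30/10 REMEMBER/UNDERSTAND/APPLY split -> 37.5
```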

Across variants: All generators cluster in the 34–37% range. The distribution tells the story: ~60% REMEMBER, ~35% UNDERSTAND, ~2% APPLY, ~0% ANALYZE. There's a healthy REMEMBER/UNDERSTAND base but almost no APPLY or ANALYZE questions. gem31pro-high leads at 37.4% with the best UNDERSTAND share (40%). Pushing generators toward more APPLY-level questions is the main lever left.

Bloom's distribution


7. difficulty_good_rate — right level for the audience?

"Is this quiz appropriately hard for who's taking it?"

Bloom's alone isn't enough — two ANALYZE questions can differ wildly in solve effort. So this metric combines two axes:

score = bloom(1–4) × complexity(1–5) → range 1–20

Bloom = cognitive level (same as above).

Complexity = how hard the subject matter is for the target audience. The judge is told the audience (high school / university / professional) and subject, so "What are the four chambers of the heart?" scores low complexity for a med student but higher for a middle-schooler.

| Complexity | Meaning |
|---|---|
| 1 | Basic fact any student in the field would know early on |
| 2 | Standard introductory curriculum concept |
| 3 | Connecting multiple concepts, moderate depth |
| 4 | Deep understanding, nuanced reasoning |
| 5 | Expert-level, synthesis across sub-domains |

Score:

difficulty_good_rate = questions with score >= 4 / total × 100

Questions scoring below 4 are both REMEMBER-level AND basic complexity — the easiest possible questions. difficulty_mean (the raw 1–20 average) is kept alongside as a diagnostic.
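A sketch of the computation, assuming each question has already been judged with a Bloom weight (1–4) and a complexity rating (1–5):

```python
def difficulty_metrics(judged: list[tuple[int, int]]) -> dict:
    """judged: (bloom_weight 1-4, complexity 1-5) per question."""
    if not judged:
        return {"difficulty_good_rate": 0.0, "difficulty_mean": 0.0}
    scores = [bloom * complexity for bloom, complexity in judged]   # each in 1-20
    return {
        "difficulty_good_rate": 100.0 * sum(s >= 4 for s in scores) / len(scores),
        "difficulty_mean": sum(scores) / len(scores),               # kept as a diagnostic
    }
```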

Across variants: Flat range (36–42%). Closely correlated with blooms — since most questions are REMEMBER, bloom(1) × complexity caps the score. gem31pro-high edges ahead at 42% but the difference is small.

Difficulty by variant


8. uniqueness_rate — does the quiz ask the same thing twice?

"Did the generator produce 10 questions or 7 questions and 3 repeats?"

Two-stage pipeline:

Stage 1 — Embedding filter. Embed each question (stem + correct answer) with gemini-embedding-001. Compute pairwise cosine similarity. Surface pairs above 0.90.

Stage 2 — LLM arbitration. For each surfaced pair, a judge classifies:

  • LAZY_DUPLICATE — same fact, rephrased. A student who answers one learns nothing from the other.
  • REINFORCEMENT — same topic, different angle or cognitive level. Answering both strengthens understanding.

Examples:

  • "What is photosynthesis?" + "Define the process of photosynthesis" → LAZY_DUPLICATE (same fact, surface rephrasing)
  • "What is photosynthesis?" + "How would photosynthesis be affected if CO₂ levels doubled?" → REINFORCEMENT (recall vs. application)

Score:

uniqueness_rate = (1 - lazy_duplicate_questions / total) × 100

Reinforcement pairs do NOT count against the score.

Why two stages? Cosine alone flags too many reinforcement pairs as duplicates. LLM-only on every pair is expensive. The 0.90 threshold catches almost every true duplicate while the LLM only arbitrates the ambiguous ones.
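A sketch of the arithmetic for both stages, assuming the embeddings have already been fetched (the gemini-embedding-001 call is abstracted away) and Stage 2 has returned a count of LAZY_DUPLICATE questions:

```python
import numpy as np

def candidate_duplicate_pairs(embeddings: np.ndarray, threshold: float = 0.90) -> list[tuple[int, int]]:
    """Stage 1: surface question pairs above the cosine-similarity threshold.
    embeddings: (n_questions, dim), one row per embedded stem + correct answer."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sims[i, j] > threshold]

def uniqueness_rate(total_questions: int, lazy_duplicate_questions: int) -> float:
    """Stage 2 feeds this: only LAZY_DUPLICATE verdicts count, REINFORCEMENT pairs are ignored."""
    if total_questions == 0:
        return 0.0
    return 100.0 * (1 - lazy_duplicate_questions / total_questions)
```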

Across variants: The widest split. Nano and 5.4 generate tons of duplicates (60–61%), while flashlite barely produces any (96%). This correlates with question count — models that generate more questions per chunk tend to repeat themselves more.

Uniqueness by variant


Side-channel diagnostics (static, no LLM)

The eight metrics above are all LLM-judged. Two additional static checks run without any LLM calls — they catch structural problems in the generator's output that a student could exploit or that indicate pipeline health issues.


quiz_structural — pipeline parse health

"Did the generator produce valid, parseable questions?"

Two sub-metrics that flag problems between the LLM's raw output and the question format we need:

rejection_rate — what fraction of the LLM's raw output failed post-parse validation?

rejection_rate = rejected_count / raw_question_count

Rejection reasons: invalid choice count (e.g. single_choice with <2 or >4 choices), missing correct answer, multiple correct answers on a single_choice question, wrong True/False prefix. These are structured-output failures — the generation prompt asked for one format, the model produced another.

chunk_yield_rate — what fraction of source chunks produced at least one valid question?

chunk_yield_rate = chunks_with_≥1_valid_question / total_chunks

A low yield means the LLM is silently failing on some chunks — either generating nothing or generating only questions that fail validation. Since process_chunk swallows schema-validation failures, chunk yield is the only way to catch these silent drops.
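A sketch of both sub-metrics, assuming per-chunk parse counts are available (the field names are illustrative):

```python
def structural_health(chunk_results: list[dict]) -> dict:
    """chunk_results: one entry per source chunk with illustrative fields
    "raw_questions", "rejected", "valid_questions"."""
    raw = sum(c["raw_questions"] for c in chunk_results)
    rejected = sum(c["rejected"] for c in chunk_results)
    yielding = sum(1 for c in chunk_results if c["valid_questions"] >= 1)
    return {
        "rejection_rate": rejected / raw if raw else 0.0,
        "chunk_yield_rate": yielding / len(chunk_results) if chunk_results else 0.0,
    }
```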

Across variants: Chunk yield is near-100% for all variants — models rarely fail to produce something for a chunk. Rejection rate is more interesting: 0–9% depending on deck complexity. 100% of rejections in the baseline were single_choice questions with multiple correct answers — the type contract isn't enforced strongly enough in the prompt.


quiz_structural_bias — surface-cue giveaways

"Can a student guess the right answer just by looking at the shape of the choices?"

No LLM, no subject knowledge — just character counts and True/False tallies. If the correct answer is always the longest option, a test-wise student picks longest every time and beats random chance.

Length outlier rate

For each single-correct question: is the correct answer uniquely the longest or shortest choice? If yes, a student can exploit the "pick the longest" heuristic without knowing anything.

length_outlier_rate = (longest_correct + shortest_correct) / single_correct_count
good_rate = (1 - length_outlier_rate) × 100

Only computed on questions with exactly one correct choice — the "pick the longest" trick only works cleanly when there's one answer to pick. Multi-correct MCQ are tracked separately as a side-channel.
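A sketch of the length-outlier check, assuming each single-correct question exposes its choices and the index of the correct one (field names are illustrative; choice counts are already validated upstream):

```python
def length_bias(questions: list[dict]) -> dict:
    """questions: single-correct questions only; each has "choices" (list of str)
    and "correct_index"."""
    outliers = 0
    for q in questions:
        lengths = [len(c) for c in q["choices"]]
        others = [l for i, l in enumerate(lengths) if i != q["correct_index"]]
        correct_len = lengths[q["correct_index"]]
        # Counts only if the correct answer is *uniquely* the longest or shortest choice.
        if correct_len > max(others) or correct_len < min(others):
            outliers += 1
    rate = outliers / len(questions) if questions else 0.0
    return {"length_outlier_rate": rate, "good_rate": 100.0 * (1 - rate)}
```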

Across variants: This is the worst metric across the board. Every variant scores between 23% and 39% good_rate, meaning 61–77% of single-correct questions have the correct answer as a unique length outlier. The generation prompt already tells the model to match answer lengths — the model ignores that instruction roughly 6–7 times out of 10. flash-topic and 5.4 are worst (23–24%), gem31pro and nano-high are least bad (~38%) but still fail the majority of the time. This is the single biggest structural fix remaining — a post-generation length-rewrite pass or a self-check step before emission.

T/F true-rate balance

For True/False questions: what fraction have "True" as the correct answer?

tf_true_rate = tf_true_count / tf_count

Balanced decks sit near 50%. Outside the 0.35–0.65 band → flagged. If a student notices the pattern leans one way, "always pick False" (or "always pick True") beats random.

Supports multilingual T/F values — English "True", German "Wahr", French "Vrai", Spanish "Verdadero", etc.
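A sketch of the check, with an illustrative (not exhaustive) set of localized "True" values:

```python
TRUE_VALUES = {"true", "wahr", "vrai", "verdadero"}  # extend per supported language

def tf_balance(tf_answers: list[str]) -> dict:
    """tf_answers: the marked-correct answer string of every True/False question."""
    if not tf_answers:
        return {"tf_true_rate": 0.0, "flagged": False}
    rate = sum(a.strip().lower() in TRUE_VALUES for a in tf_answers) / len(tf_answers)
    return {"tf_true_rate": rate, "flagged": not (0.35 <= rate <= 0.65)}
```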

Across variants: Most models produce reasonably balanced T/F splits (81–95% balance score). flashlite and gem31pro are most balanced (93–95%), 5.4 is solid (92%). The outlier is nano-none at 65% — noticeably skewed, meaning "always pick True" (or False) gives an edge. nano-high is better at 76%, suggesting reasoning effort helps with T/F balance. flash variants cluster around 81–83%.

Structural bias — length and T/F balance

Note: Neither of these metrics uses an LLM. They run on the raw question JSON in milliseconds and catch problems that the LLM-judged metrics miss entirely — quiz_triviality can catch "the longest answer is always right" as a pattern only if the judge notices it across many questions, but structural_bias catches it deterministically on every single question. They complement each other: triviality catches semantic giveaways, structural bias catches statistical ones.


Baseline analysis & easy gains

The current production configuration is flash-topic: gemini-2.5-flash with topic-level chunking, no reasoning effort.

Where the baseline sits

| Metric | Score | Rank | Best variant | Gap |
|---|---|---|---|---|
| Correctness | 93.5% | 5/8 | gem31pro (95.1%) | +1.6pp |
| Non-triviality | 82.3% | 5/8 | flashlite-high (87.6%) | +5.3pp |
| Blueprint Coverage | 41.5% | 6/8 | 5.4 (72.3%) | +30.8pp |
| Relevance | 97.5% | 5/8 | 5.4 (98.9%) | +1.4pp |
| Distractor Quality | 84.1% | 2/8 | gem31pro (84.3%) | +0.2pp |
| Bloom's Depth | 34.7% | 6/8 | gem31pro (37.4%) | +2.7pp |
| Difficulty | 35.7% | 8/8 | gem31pro (42.3%) | +6.6pp |
| Uniqueness | 91.0% | 3/8 | flashlite-low (96.0%) | +5.0pp |
| Length Balance | 23.2% | 8/8 | nano-high (38.5%) | +15.3pp |

Dead last on difficulty and length balance. Near-bottom on coverage and Bloom's depth. Strong on distractor quality and uniqueness. Mediocre on everything else.

Baseline vs alternatives

Easy gain #1 — switch from topic → subtopic chunking (free, no model change)

Same model, same cost, same speed. Just change the chunking parameter.

| Metric | flash-topic | flash (subtopic) | Delta |
|---|---|---|---|
| Blueprint Coverage | 41.5% | 50.7% | +9.2pp |
| Non-triviality | 82.3% | 83.0% | +0.8pp |
| Questions per gen | 42.6 | 68.7 | +26.1 |
| Uniqueness | 91.0% | 87.3% | -3.7pp |
| Distractor Quality | 84.1% | 83.5% | -0.6pp |
| Generation time | 20.7s | 21.8s | +1.1s |

Subtopic chunking gives the model smaller, more focused chunks → +9.2pp coverage and +61% more questions. The uniqueness drop (-3.7pp) is the tradeoff — more questions from the same material means more overlap risk — but 87% is still solid. Essentially a free lunch.

Easy gain #2 — length-balancing rewrite pass

23.2% length balance is the worst score across all metrics for the baseline. The generation prompt already says "match answer lengths" — models ignore it. You can't fix this with pure string manipulation — trimming "The mitochondria converts glucose into ATP through oxidative phosphorylation" to match "Nucleus" destroys the content. This needs a cheap LLM rewrite pass (nano-tier model, just rewording distractors to roughly match the correct answer's length) or a self-check step where the generator reviews and rewrites its own choices before emission. Every variant suffers here (best is 38.5%), so this fix benefits any model choice.

Easy gain #3 — T/F balance enforcement (pure code)

81.4% T/F balance is fine but could be better. During generation, count the True/False split. If it drifts outside the 0.35–0.65 band, flip some question polarities ("X is true" → "X is false" with the answer inverted). Pure post-processing, no model cost.
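A rough selection sketch, assuming each T/F question carries a boolean for whether "True" is the marked answer (illustrative field name); the actual stem negation is left out, since that rewording is what the flip ultimately requires:

```python
def pick_questions_to_flip(tf_questions: list[dict], low: float = 0.35, high: float = 0.65) -> list[dict]:
    """Pick a minimal set of T/F questions to flip so the True-rate lands inside the band."""
    n = len(tf_questions)
    if n == 0:
        return []
    trues = [q for q in tf_questions if q["answer_is_true"]]
    falses = [q for q in tf_questions if not q["answer_is_true"]]
    rate = len(trues) / n
    if low <= rate <= high:
        return []
    if rate > high:
        excess = len(trues) - int(high * n)   # flip this many True-answer questions
        return trues[:excess]
    deficit = int(low * n) + 1 - len(trues)   # flip this many False-answer questions
    return falses[:deficit]

# Each selected question then needs its statement negated and its answer inverted
# ("X is true" -> "X is false"), via template rewording or a cheap rewrite call.
```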

Medium effort — upgrade to gem31pro

| Metric | flash-topic | gem31pro | Delta |
|---|---|---|---|
| Blueprint Coverage | 41.5% | 60.1% | +18.6pp |
| Difficulty | 35.7% | 42.3% | +6.6pp |
| Non-triviality | 82.3% | 85.9% | +3.6pp |
| Bloom's Depth | 34.7% | 37.4% | +2.7pp |
| Correctness | 93.5% | 95.1% | +1.6pp |
| Questions per gen | 42.6 | 71.4 | +28.8 |
| Generation time | 20.7s | 91.8s | +71.1s (4.4×) |

Gains across the board — coverage, difficulty, triviality, correctness all improve. But 4.4× slower and more expensive. Could work as a premium tier or for high-stakes decks only.

Interesting outlier — the overgenerate-then-filter family (5.4, nano)

Both 5.4 and nano generate far more questions than flash — but a lot of them are duplicates. The question is whether the raw volume + dedup beats flash's smaller-but-cleaner output.

| Metric | flash-topic | nano | nano-high | 5.4 |
|---|---|---|---|---|
| Blueprint Coverage | 41.5% | 59.3% | 64.8% | 72.3% |
| Uniqueness | 91.0% | 61.5% | 70.4% | 60.2% |
| Raw questions | 42.6 | 113.0 | 104.4 | 158.3 |
| Effective unique qs | 39 | 69 | 73 | 95 |
| Distractor Quality | 84.1% | 74.0% | 73.5% | 74.1% |
| Non-triviality | 82.3% | 75.6% | 76.0% | 78.8% |
| Generation time | 20.7s | 76.3s | 232.1s | 63.1s |

Even after dedup, nano-none produces ~69 effective unique questions vs flash-topic's 39 — and covers +17.9pp more learning objectives. But the quality gap is real: -10pp distractor quality, -6.7pp triviality. Nano produces more questions, covering more material, but those questions are easier to game.

5.4 has the same quality problems plus costs more. Its advantage over nano is pure coverage volume (+13pp), but nano gets most of the coverage gain at a fraction of 5.4's cost.

The interesting experiment: nano + deduplication + distractor-rewrite pass. Generate with nano's volume (cheap, fast-ish), dedup the lazy duplicates, then run a quick rewrite pass on the surviving questions to fix distractor length/quality. If the rewrite pass is cheap (and it should be — it's reformulating existing distractors, not generating new knowledge), this pipeline could combine nano's coverage with flash-level quality.

Priority stack

  1. Switch to subtopic chunking — free, +9pp coverage, ship today
  2. Add length-balancing rewrite pass — cheap LLM call, fixes the worst metric across all variants
  3. Add T/F balance enforcement — pure code, minor but free
  4. Evaluate gem31pro for premium tier — big quality jump, need to price the latency cost
  5. Test nano + dedup + rewrite pipeline — nano's coverage at roughly 1/6th the per-run cost of 5.4, with a quality cleanup pass on top

Language effects

The eval set covers 4 languages: English (826 evals), German (368), French (289), Dutch (176).

Dutch correctness drops across the board

Dutch correctness averages 88.8% vs 94.6% for English — a -5.8pp gap that holds across every variant (85–93% Dutch vs 92–95% English) and both judges (GPT-5.4: 87% vs 95%, Gemini: 90% vs 95%). This is a real generator problem, not judge noise — models produce more incorrect answers in Dutch. Only 4 Dutch decks though, so small sample.

Judges are lenient on non-English triviality and distractor quality

Both judges consistently rate French and German as less trivial and better distractors than English. French non-triviality: 90% vs English 85% (GPT-5.4), 86% vs 74% (Gemini). The same pattern holds for German.

This is almost certainly a judge blind spot — catching "common knowledge" giveaways and grammar cues is harder in a language the judge is less fluent in. A French student would spot a giveaway that neither judge flags. This means non-English triviality and distractor quality scores are probably inflated — the real quality gap between languages may be larger than the numbers show.

Language effects — correctness drop and judge leniency


Cost analysis

Actual costs from llm_usage_logs (generation) and pipeline eval_cost fields (evaluation). 830 generation runs × 2 judges ≈ 1,659 evals.

Generation cost per variant

| Variant | $/run | Total | Runs | Notes |
|---|---|---|---|---|
| flashlite-low | $0.0042 | $0.48 | 113 | Cheapest. No reasoning tokens. |
| flashlite-high | $0.0068 | $0.76 | 112 | +62% over low — reasoning effort bump |
| flash-topic | $0.0164 | $1.84 | 112 | Baseline. 2.4× flashlite-high |
| nano | $0.0184 | $1.95 | 106 | Comparable to flash-topic |
| flash | $0.0265 | $2.97 | 112 | Same model, subtopic chunks = +61% cost |
| nano-high | $0.0472 | $3.02 | 64 | Reasoning effort = 2.6× nano |
| 5.4 | $0.1081 | $11.35 | 105 | 6.6× baseline |
| gem31pro | $0.4457 | $47.25 | 106 | 27× baseline. Reasoning tokens dominate. |

Total generation cost: $69.61 across all 830 runs.

gem31pro is in a different universe — $0.45/run vs $0.016 for the baseline. Its coverage lead (+18.6pp) costs 27× more per generation. Nano gets most of that coverage gain (+17.9pp) at 1.1× the baseline cost — orders of magnitude better cost-efficiency.

Evaluation cost

| Judge | Total | Evals | $/eval |
|---|---|---|---|
| Gemini 3.1 Pro | $185.85 | 830 | $0.224 |
| GPT-5.4 | $40.06 | 829 | $0.048 |
| Total | $225.91 | 1,659 | $0.136 |

Gemini is 4.6× more expensive per eval than GPT-5.4. Since both judges preserve variant rankings on the metrics that matter (coverage ρ=0.98, triviality ρ=0.95), the obvious move for future evals is to drop to a single GPT-5.4 judge and cut eval cost by ~80%.

Total pipeline cost

| Component | Cost |
|---|---|
| Generation (830 runs) | $69.61 |
| Evaluation (1,659 evals) | $225.91 |
| Total | $295.52 |

Eval cost is 3.2× generation cost. For future experiments: single judge + fewer reps (2 instead of 3) would cut eval cost to ~$27 while preserving ranking accuracy.

Cost-efficiency: where to spend the next dollar

The question isn't "which model is best" — it's "which model gives the most quality per dollar."

| Variant | Gen $/run | Coverage | Non-triviality | Coverage per $0.01 |
|---|---|---|---|---|
| nano | $0.018 | 59.3% | 75.6% | 32.2pp |
| flash-topic | $0.016 | 41.5% | 82.3% | 25.3pp |
| nano-high | $0.047 | 64.8% | 76.0% | 13.7pp |
| flash | $0.027 | 50.7% | 83.0% | 19.2pp |
| 5.4 | $0.108 | 72.3% | 78.8% | 6.7pp |
| gem31pro | $0.446 | 60.1% | 85.9% | 1.3pp |

Nano delivers the best coverage-per-dollar by a wide margin. gem31pro's coverage is actually lower than 5.4 despite costing 4× more — its strength is triviality and difficulty, not coverage. The interesting frontier is nano + quality cleanup pass: nano's coverage volume at nano's price, with a cheap rewrite pass to fix the distractor/triviality gap.
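The last column is just coverage divided by generation cost expressed in cents; for reference:

```python
def coverage_per_cent(coverage_pct: float, gen_cost_per_run: float) -> float:
    """Coverage percentage points delivered per $0.01 of generation spend."""
    return coverage_pct / (gen_cost_per_run / 0.01)

# coverage_per_cent(59.3, 0.0184) -> ~32.2 (the nano row above)
```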

Cost breakdown — generation vs evaluation

Cost efficiency — quality per dollar