Quiz Eval Current State
Date: 2026-05-01
Purpose: Durable summary of the current quiz eval baseline, retained reports, and immediate product implications.
What This Consolidation Keeps
This consolidation keeps the parts that are useful on main:
- quiz eval framework and cross-check workflows
- clean reports, charts, and dashboard
- the latest April 24 report summarizing the 39-deck eval run
- current-state feedback summary for quiz and summary complaints
It intentionally does not keep raw result JSON, raw logs, generated question dumps, binary test-set documents, broad experiment scratch files, or unrelated mobile/infra changes that came along for the ride in stacked branches.
Baseline
The production baseline evaluated in the April reports is:
- model: `gemini-2.5-flash`
- chunking: topic-level merge, called `flash-topic` in the reports
- prompt: production quiz generation prompt
- no extra post-generation rewrite pass
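For orientation, here is the same baseline expressed as a config sketch. The class and field names are hypothetical, not the pipeline's actual config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuizGenConfig:
    # Hypothetical field names; the values mirror the baseline above.
    model: str = "gemini-2.5-flash"
    chunking: str = "topic"      # topic-level merge ("flash-topic" in reports)
    prompt: str = "production"   # the production quiz generation prompt
    rewrite_pass: bool = False   # no extra post-generation rewrite

BASELINE = QuizGenConfig()
```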
The baseline is reliable on core correctness and distractor quality, but weak on blueprint coverage, cognitive depth, and structural answer cues such as length balance.
Key baseline findings from `2026-04-24-quiz-eval-metrics.md`:
- correctness: 93.5%
- non-triviality: 82.3%
- blueprint coverage: 41.5%
- distractor quality: 84.1%
- Bloom's depth: 34.7%
- difficulty: 35.7%
- uniqueness: 91.0%
- length balance: 23.2%
Latest Eval Report
The latest retained report is:
`reports/2026-04-24-quiz-eval-metrics.md`
It summarizes the final April 23 pipeline run:
- 39 decks
- 8 generator variants
- 2 judges
- 1,659 evals
- cost data from generation and eval runs
Raw result JSON files are generated artifacts and are intentionally gitignored under `reports/*.json`.
Current Best Understanding
The biggest immediate product win is not a model swap. It is changing how quiz generation chunks the source material.
Priority stack:
- switch from topic-level to subtopic-level chunking
- add a length-balancing rewrite/self-check pass
- enforce T/F balance in code (see the sketch after this list)
- evaluate expensive stronger models only for premium/high-stakes cases
- test a cheap overgenerate/dedup/rewrite path before adopting nano-style high-volume generation
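One way to enforce the T/F balance in code is a post-generation filter. A minimal sketch, assuming a hypothetical `TFQuestion` shape for generated true/false items (not the pipeline's actual data model):

```python
import random
from dataclasses import dataclass

@dataclass
class TFQuestion:
    statement: str
    answer: bool  # True means the statement is correct as written

def enforce_tf_balance(questions: list[TFQuestion],
                       rng: random.Random | None = None) -> list[TFQuestion]:
    """Trim the overrepresented answer class so the kept set
    has an exact 50/50 true/false split."""
    rng = rng or random.Random(0)
    trues = [q for q in questions if q.answer]
    falses = [q for q in questions if not q.answer]
    n = min(len(trues), len(falses))
    kept = rng.sample(trues, n) + rng.sample(falses, n)
    rng.shuffle(kept)
    return kept
```

Trimming is the simplest policy; routing the surplus class through the rewrite pass instead would preserve question volume.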
Subtopic chunking improved blueprint coverage by about 9 points and produced about 61% more questions in the April 24 report, with little latency cost. The tradeoff is slightly lower uniqueness, which should be handled by deduplication or redundancy filtering.
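The dedup step could be as simple as a greedy token-overlap filter. A sketch, using token Jaccard similarity as a stand-in (embedding similarity would catch paraphrases better):

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, for rough overlap checks."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def dedup_questions(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy near-duplicate filter: keep a question only if its token
    Jaccard similarity to every already-kept question stays below threshold."""
    kept: list[tuple[str, set[str]]] = []
    for q in questions:
        toks = _tokens(q)
        if all(len(toks & kt) / max(1, len(toks | kt)) < threshold
               for _, kt in kept):
            kept.append((q, toks))
    return [q for q, _ in kept]
```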
The worst structural failure is answer-length leakage. Across variants, most single-correct questions still make the correct answer a unique length outlier. Prompt instructions alone did not fix this; it needs a rewrite pass or self-check stage.
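A cheap detector for this failure can gate the rewrite pass. A heuristic sketch (the 25% tolerance is an assumption, not the report's metric definition):

```python
def length_leaks(options: list[str], correct_idx: int,
                 rel_tolerance: float = 0.25) -> bool:
    """Flag a single-correct question whose correct option is a length
    outlier: shorter or longer than every distractor by a clear margin."""
    correct_len = len(options[correct_idx])
    distractor_lens = [len(o) for i, o in enumerate(options) if i != correct_idx]
    lo, hi = min(distractor_lens), max(distractor_lens)
    margin = rel_tolerance * max(hi, 1)
    return correct_len < lo - margin or correct_len > hi + margin
```

Questions that trip the check go to the rewrite/self-check pass rather than straight into the deck.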
Language And Judge Caveats
In the latest report, Dutch correctness underperformed English. Non-English triviality and distractor-quality scores may also be inflated, because judge models can miss language-specific giveaways.
Future evals should keep enough language coverage to distinguish generator issues from judge blind spots.
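Concretely, that means reports should always break metrics out per language. A sketch, assuming each eval row is a dict with hypothetical `language` and `correct` keys:

```python
from collections import defaultdict

def correctness_by_language(evals: list[dict]) -> dict[str, float]:
    """Per-language correctness rates, so a drop in (say) Dutch is
    visible instead of being averaged into the overall score."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for row in evals:
        buckets[row["language"]].append(int(row["correct"]))
    return {lang: sum(v) / len(v) for lang, v in buckets.items()}
```

A generator issue shows up as one language dropping across both judges; a judge blind spot shows up as one judge inflating non-English scores.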