Quiz Eval Current State
Date: 2026-05-01
Purpose: Durable summary of the current quiz eval baseline, retained reports, and immediate product implications.
What This Consolidation Keeps
This consolidation keeps the parts that are useful on main:
- quiz eval framework and cross-check workflows
- clean reports, charts, and dashboard
- the latest April 24 report summarizing the 39-deck eval run
- current-state feedback summary for quiz and summary complaints
It intentionally does not keep raw result JSON, raw logs, generated question dumps, binary test-set documents, broad experiment scratch files, or unrelated mobile/infra changes that came along for the ride in stacked branches.
Baseline
The production baseline evaluated in the April reports is:
- model: `gemini-2.5-flash`
- chunking: topic-level merge, called `flash-topic` in the reports
- prompt: production quiz generation prompt
- no extra post-generation rewrite pass
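For orientation, here is the same baseline expressed as a config sketch. The class and field names are hypothetical, not the pipeline's actual config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuizGenConfig:
    # Hypothetical field names; the values mirror the baseline above.
    model: str = "gemini-2.5-flash"
    chunking: str = "topic"      # topic-level merge ("flash-topic" in reports)
    prompt: str = "production"   # the production quiz generation prompt
    rewrite_pass: bool = False   # no extra post-generation rewrite

BASELINE = QuizGenConfig()
```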
The baseline is reliable on core correctness and distractor quality, but weak on blueprint coverage, cognitive depth, and structural answer cues such as length balance.
Key baseline findings from `2026-04-24-quiz-eval-metrics.md`:
- correctness: 93.5%
- non-triviality: 82.3%
- blueprint coverage: 41.5%
- distractor quality: 84.1%
- Bloom's depth: 34.7%
- difficulty: 35.7%
- uniqueness: 91.0%
- length balance: 23.2%
Latest Eval Report
The latest retained report is:
`reports/2026-04-24-quiz-eval-metrics.md`
It summarizes the final April 23 pipeline run:
- 39 decks
- 8 generator variants
- 2 judges
- 1,659 evals
- cost data from generation and eval runs
Raw result JSON files are generated artifacts and are intentionally gitignored under `reports/*.json`.
Current Best Understanding
The biggest immediate product win is not a model swap. It is changing how quiz generation chunks the source material.
Priority stack:
- switch from topic-level to subtopic-level chunking
- add a length-balancing rewrite/self-check pass
- enforce T/F balance in code (see the sketch after this list)
- evaluate expensive stronger models only for premium/high-stakes cases
- test a cheap overgenerate/dedup/rewrite path before adopting nano-style high-volume generation
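One way to enforce the T/F balance in code is a post-generation filter. A minimal sketch, assuming a hypothetical `TFQuestion` shape for generated true/false items (not the pipeline's actual data model):

```python
import random
from dataclasses import dataclass

@dataclass
class TFQuestion:
    statement: str
    answer: bool  # True means the statement is correct as written

def enforce_tf_balance(questions: list[TFQuestion],
                       rng: random.Random | None = None) -> list[TFQuestion]:
    """Trim the overrepresented answer class so the kept set
    has an exact 50/50 true/false split."""
    rng = rng or random.Random(0)
    trues = [q for q in questions if q.answer]
    falses = [q for q in questions if not q.answer]
    n = min(len(trues), len(falses))
    kept = rng.sample(trues, n) + rng.sample(falses, n)
    rng.shuffle(kept)
    return kept
```

Trimming is the simplest policy; routing the surplus class through the rewrite pass instead would preserve question volume.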
Subtopic chunking improved blueprint coverage by about 9 points and produced about 61% more questions in the April 24 report, with little latency cost. The tradeoff is slightly lower uniqueness, which should be handled by deduplication or redundancy filtering.
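The dedup step could be as simple as a greedy token-overlap filter. A sketch, using token Jaccard similarity as a stand-in (embedding similarity would catch paraphrases better):

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, for rough overlap checks."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def dedup_questions(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy near-duplicate filter: keep a question only if its token
    Jaccard similarity to every already-kept question stays below threshold."""
    kept: list[tuple[str, set[str]]] = []
    for q in questions:
        toks = _tokens(q)
        if all(len(toks & kt) / max(1, len(toks | kt)) < threshold
               for _, kt in kept):
            kept.append((q, toks))
    return [q for q, _ in kept]
```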
The worst structural failure is answer-length leakage. Across variants, most single-correct questions still make the correct answer a unique length outlier. Prompt instructions alone did not fix this; it needs a rewrite pass or self-check stage.
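A cheap detector for this failure can gate the rewrite pass. A heuristic sketch (the 25% tolerance is an assumption, not the report's metric definition):

```python
def length_leaks(options: list[str], correct_idx: int,
                 rel_tolerance: float = 0.25) -> bool:
    """Flag a single-correct question whose correct option is a length
    outlier: shorter or longer than every distractor by a clear margin."""
    correct_len = len(options[correct_idx])
    distractor_lens = [len(o) for i, o in enumerate(options) if i != correct_idx]
    lo, hi = min(distractor_lens), max(distractor_lens)
    margin = rel_tolerance * max(hi, 1)
    return correct_len < lo - margin or correct_len > hi + margin
```

Questions that trip the check go to the rewrite/self-check pass rather than straight into the deck.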
Language And Judge Caveats
In the latest report, Dutch correctness underperformed English. Non-English triviality and distractor-quality scores may also be inflated, because judge models can miss language-specific giveaways.
Future evals should keep enough language coverage to distinguish generator issues from judge blind spots.
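Concretely, that means reports should always break metrics out per language. A sketch, assuming each eval row is a dict with hypothetical `language` and `correct` keys:

```python
from collections import defaultdict

def correctness_by_language(evals: list[dict]) -> dict[str, float]:
    """Per-language correctness rates, so a drop in (say) Dutch is
    visible instead of being averaged into the overall score."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for row in evals:
        buckets[row["language"]].append(int(row["correct"]))
    return {lang: sum(v) / len(v) for lang, v in buckets.items()}
```

A generator issue shows up as one language dropping across both judges; a judge blind spot shows up as one judge inflating non-English scores.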