apps/learning-api/evals-playground/reports/2026-04-12-quiz-summary-feedback-current-state.md

Quiz and Summary Feedback Current State

Date: 2026-04-12
Source: unified_feedback_enriched.csv — all negative-sentiment feedback with body text ≥5 chars.
Method: every row read and classified by LLM (no keyword heuristics). 2,309 entries total.

Quiz Feedback (605 negative entries)

| Category | Count | % | Description |
|---|---:|---:|---|
| too_few_questions | 222 | 36.7% | User requested N questions, got far fewer (often 1–3) |
| too_easy | 75 | 12.4% | Distractors too obvious, answers identifiable by length/position |
| content_mismatch | 72 | 11.9% | Questions not about the uploaded material |
| repetitive_questions | 51 | 8.4% | Same questions repeated across or within quizzes |
| not_working | 47 | 7.8% | Quiz didn't generate or load at all |
| unclear_questions | 35 | 5.8% | Poorly worded, confusing, or overly long questions |
| incorrect_answers | 30 | 5.0% | Wrong answer marked correct, factual errors |
| too_superficial | 25 | 4.1% | Surface-level, not exam-relevant, covers meta/admin info |
| other | 22 | 3.6% | Doesn't fit above (too hard, feature requests, etc.) |
| wrong_language | 18 | 3.0% | Quiz generated in wrong language |
| rendering_bug | 8 | 1.3% | Formulas/formatting broken or displayed incorrectly |

Key Findings — Quiz

  1. 36.7% of complaints = too few questions. This is the single largest issue by far. Users set a max question count and receive far fewer — sometimes just 1. This is likely a generation pipeline issue (content extraction → question generation throughput).
  2. Quantity + generation failures account for ~53%. Combining too_few_questions (36.7%), repetitive_questions (8.4%), and not_working (7.8%) — over half of all negative feedback is about not getting enough usable quiz content.
  3. Quality issues account for ~39%. too_easy (12.4%), content_mismatch (11.9%), unclear_questions (5.8%), incorrect_answers (5.0%), and too_superficial (4.1%) together indicate quiz quality problems — distractors that don't work, questions from the wrong part of the material, and factual errors.
  4. too_easy is structurally exploitable. Multiple users report that the longest answer option is always correct, or that true/false answers always appear on the same side. This is a prompt/generation bias, not a content issue.

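The structural tells behind too_easy are cheap to measure offline. A minimal sketch of such a check, assuming each generated question is a dict with an `options` list and a `correct` index — a hypothetical shape, not the actual pipeline schema:

```python
from collections import Counter

def answer_bias_report(questions):
    """Flag the two structural tells users report: the longest option
    being the correct one, and the correct answer clustering in one
    position. questions: list of {'options': list[str], 'correct': int}.
    """
    longest_correct = sum(
        1 for q in questions
        if len(q["options"][q["correct"]]) == max(len(o) for o in q["options"])
    )
    positions = Counter(q["correct"] for q in questions)
    return {
        "longest_option_correct_rate": longest_correct / len(questions),
        "position_counts": dict(positions),
    }
```

A rate near 1.0, or position counts piled onto one index, would confirm the bias users describe and give a regression metric once the prompt is fixed.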
Summary Feedback (1,704 negative entries)

| Category | Count | % | Description |
|---|---:|---:|---|
| wants_bullets_structure | 298 | 17.5% | Wants bullet points, headings, organized layout — not prose |
| missing_content | 293 | 17.2% | Summary skipped chapters/sections from source material |
| too_long | 251 | 14.7% | Summary too lengthy/verbose |
| too_short | 216 | 12.7% | Summary too brief, not detailed enough |
| other | 168 | 9.9% | Vague complaints, feature requests, unclear feedback |
| wrong_language | 109 | 6.4% | Summary in wrong language |
| not_generated | 93 | 5.5% | Empty page, nothing generated |
| content_mismatch | 84 | 4.9% | Summary not about the uploaded document at all |
| too_complex_language | 81 | 4.8% | Language too difficult, hard to understand for user's level |
| rendering_formatting | 68 | 4.0% | Font size, layout, display issues, broken rendering |
| missing_visuals | 22 | 1.3% | Diagrams/images from source not included |
| unwanted_sections | 21 | 1.2% | Reflection questions or sections user didn't ask for |

Key Findings — Summary

  1. Format is the #1 complaint. 17.5% of users explicitly want bullet points or structured layout instead of continuous prose. Many EU students (Dutch, French, German) use summaries for exam prep and need scannable, structured content.
  2. Coverage is the #2 complaint. 17.2% say the summary skipped chapters or sections. Common pattern: user uploads 50–80 page document, summary only covers the first few pages/chapters.
  3. Length calibration is broken in both directions. 14.7% say too long, 12.7% say too short — together 27.4%. The system isn't matching user expectations for detail level. Users who want "detailed" get walls of text; users who want "concise" get multi-page output.
  4. Wrong language is significant at 6.4%. Many non-English EU users (Dutch, Swedish, French, German) receive summaries in English or a different language than their source material.
  5. Generation failures at 5.5%. Users report empty pages or nothing generated — a reliability issue.
  6. too_complex_language (4.8%) reveals a user-level mismatch. Secondary school students (vmbo, 6de leerjaar) receive university-level language. The system doesn't adapt to the user's educational level.
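One common mitigation for the coverage failure in finding 2 is map-reduce summarization: split the source at chapter boundaries and summarize each section separately, so later chapters cannot be silently dropped. A minimal sketch; the heading pattern and the `summarize_section` callable are placeholders, not the actual pipeline:

```python
import re

# Hypothetical heading pattern; real documents would need per-format detection.
HEADING = re.compile(r"(?m)^(?:Chapter\s+\d+\b.*|#{1,3}\s+.*)$")

def split_sections(source: str):
    """Split source text at heading boundaries so every section gets
    its own summarization pass (preventing 'only the first chapters
    covered'). Text before the first heading becomes its own section."""
    starts = [m.start() for m in HEADING.finditer(source)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(source)]
    return [source[a:b].strip() for a, b in zip(bounds, bounds[1:]) if source[a:b].strip()]

def summarize_document(source, summarize_section):
    # summarize_section stands in for the model call; one call per
    # section guarantees every chapter contributes to the output.
    return "\n\n".join(summarize_section(s) for s in split_sections(source))
```

The same split also gives a cheap coverage metric: count sections in versus sections represented in the final summary.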

Actionable Priorities

Quiz — High Impact

  1. Fix question count generation — investigate why the pipeline produces far fewer questions than requested
  2. Improve distractor quality — eliminate structural tells (longest answer = correct, positional bias in T/F)
  3. Improve content coverage — questions should span the full uploaded material, not just the first section
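For priority 1, one standard mitigation is a top-up loop: when a generation round returns fewer questions than requested, ask again for the shortfall, deduplicating as you go (which also addresses repetitive_questions). A sketch assuming a hypothetical `generate_batch(material, n, exclude)` model call — not the real pipeline API:

```python
def generate_quiz(material, requested, generate_batch, max_rounds=3):
    """Keep requesting questions until the user's count is met or
    rounds run out. generate_batch is a placeholder for the model
    call and is assumed to return a list of question dicts."""
    questions = []
    seen = set()
    for _ in range(max_rounds):
        missing = requested - len(questions)
        if missing <= 0:
            break
        for q in generate_batch(material, missing, exclude=seen):
            key = q["question"].strip().lower()
            if key not in seen:  # drop duplicates across rounds
                seen.add(key)
                questions.append(q)
    return questions[:requested]
```

If the loop still falls short, the shortfall itself becomes a measurable signal for where the extraction → generation throughput breaks down.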

Summary — High Impact

  1. Default to structured/bullet format — or respect user format preferences more reliably
  2. Improve full-document coverage — ensure all chapters/sections are represented
  3. Better length calibration — map user detail preferences to actual output length
  4. Fix language detection/respect — generate in the source material's language

Source Data

  • quiz_feedback_classified.csv — 605 negative quiz feedback entries with complaint_category column
  • summary_feedback_classified.csv — 1,704 negative summary feedback entries with complaint_category column

Both files include all original columns from unified_feedback_enriched.csv plus the classification. They are source artifacts and are intentionally not tracked in this PR.
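The category tables above can be reproduced from either classified CSV with a few lines of stdlib Python, assuming only the `complaint_category` column described above:

```python
import csv
from collections import Counter

def category_table(path, column="complaint_category"):
    """Rebuild a Category/Count/% table from a classified CSV,
    sorted by count descending."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1))
            for cat, n in counts.most_common()]
```

Running it against quiz_feedback_classified.csv or summary_feedback_classified.csv should match the tables in this report row for row.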