apps/learning-api/evals-playground/reports/2026-04-12-quiz-summary-feedback-current-state.md
Quiz and Summary Feedback Current State
Date: 2026-04-12
Source: `unified_feedback_enriched.csv` (all negative-sentiment feedback with body text ≥5 chars).
Method: Every row read and classified by LLM (no keyword heuristics). 2,309 entries total.
Quiz Feedback (605 negative entries)
| Category | Count | % | Description |
|---|---|---|---|
| `too_few_questions` | 222 | 36.7% | User requested N questions, got far fewer (often 1–3) |
| `too_easy` | 75 | 12.4% | Distractors too obvious, answers identifiable by length/position |
| `content_mismatch` | 72 | 11.9% | Questions not about the uploaded material |
| `repetitive_questions` | 51 | 8.4% | Same questions repeated across or within quizzes |
| `not_working` | 47 | 7.8% | Quiz didn't generate or load at all |
| `unclear_questions` | 35 | 5.8% | Poorly worded, confusing, or overly long questions |
| `incorrect_answers` | 30 | 5.0% | Wrong answer marked correct, factual errors |
| `too_superficial` | 25 | 4.1% | Surface-level, not exam-relevant, covers meta/admin info |
| `other` | 22 | 3.6% | Doesn't fit above (too hard, feature requests, etc.) |
| `wrong_language` | 18 | 3.0% | Quiz generated in wrong language |
| `rendering_bug` | 8 | 1.3% | Formulas/formatting broken or displayed incorrectly |
Key Findings — Quiz
- 36.7% of complaints = too few questions. This is the single largest issue by far. Users set a max question count and receive far fewer — sometimes just 1. This is likely a generation pipeline issue (content extraction → question generation throughput).
- Quantity + generation failures account for ~53%. Combining `too_few_questions` (36.7%), `repetitive_questions` (8.4%), and `not_working` (7.8%), over half of all negative feedback is about not getting enough usable quiz content.
- Quality issues account for ~39%. `too_easy` (12.4%), `content_mismatch` (11.9%), `unclear_questions` (5.8%), `incorrect_answers` (5.0%), and `too_superficial` (4.1%) together indicate quiz quality problems: distractors that don't work, questions from the wrong part of the material, and factual errors.
- `too_easy` is structurally exploitable. Multiple users report that the longest answer option is always correct, or that true/false answers always appear on the same side. This is a prompt/generation bias, not a content issue.
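
The grouped shares quoted in these findings can be re-derived from the counts in the quiz table. The snippet below is a minimal check of that arithmetic. The names `QUIZ_COUNTS`, `QUANTITY`, `QUALITY`, and `group_share` are illustrative and not part of the evals codebase.

```python
# Counts copied from the quiz feedback table in this report.
QUIZ_COUNTS = {
    "too_few_questions": 222,
    "too_easy": 75,
    "content_mismatch": 72,
    "repetitive_questions": 51,
    "not_working": 47,
    "unclear_questions": 35,
    "incorrect_answers": 30,
    "too_superficial": 25,
    "other": 22,
    "wrong_language": 18,
    "rendering_bug": 8,
}

# Groupings as described in the key findings (names are illustrative).
QUANTITY = ("too_few_questions", "repetitive_questions", "not_working")
QUALITY = (
    "too_easy",
    "content_mismatch",
    "unclear_questions",
    "incorrect_answers",
    "too_superficial",
)


def group_share(counts, categories):
    """Return the percentage of all entries that fall into the given categories."""
    total = sum(counts.values())
    return 100 * sum(counts[c] for c in categories) / total


print(f"quantity + generation: {group_share(QUIZ_COUNTS, QUANTITY):.1f}%")  # 52.9%
print(f"quality: {group_share(QUIZ_COUNTS, QUALITY):.1f}%")  # 39.2%
```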
Summary Feedback (1,704 negative entries)
| Category | Count | % | Description |
|---|---|---|---|
| `wants_bullets_structure` | 298 | 17.5% | Wants bullet points, headings, organized layout, not prose |
| `missing_content` | 293 | 17.2% | Summary skipped chapters/sections from source material |
| `too_long` | 251 | 14.7% | Summary too lengthy/verbose |
| `too_short` | 216 | 12.7% | Summary too brief, not detailed enough |
| `other` | 168 | 9.9% | Vague complaints, feature requests, unclear feedback |
| `wrong_language` | 109 | 6.4% | Summary in wrong language |
| `not_generated` | 93 | 5.5% | Empty page, nothing generated |
| `content_mismatch` | 84 | 4.9% | Summary not about the uploaded document at all |
| `too_complex_language` | 81 | 4.8% | Language too difficult, hard to understand for user's level |
| `rendering_formatting` | 68 | 4.0% | Font size, layout, display issues, broken rendering |
| `missing_visuals` | 22 | 1.3% | Diagrams/images from source not included |
| `unwanted_sections` | 21 | 1.2% | Reflection questions or sections user didn't ask for |
Key Findings — Summary
- Format is the #1 complaint. 17.5% of users explicitly want bullet points or structured layout instead of continuous prose. Many EU students (Dutch, French, German) use summaries for exam prep and need scannable, structured content.
- Coverage is the #2 complaint. 17.2% say the summary skipped chapters or sections. Common pattern: user uploads 50–80 page document, summary only covers the first few pages/chapters.
- Length calibration is broken in both directions. 14.7% say too long, 12.7% say too short — together 27.4%. The system isn't matching user expectations for detail level. Users who want "detailed" get walls of text; users who want "concise" get multi-page output.
- Wrong language is significant at 6.4%. Many non-English EU users (Dutch, Swedish, French, German) receive summaries in English or a different language than their source material.
- Generation failures at 5.5%. Users report empty pages or nothing generated — a reliability issue.
- `too_complex_language` (4.8%) reveals a user-level mismatch. Secondary school students (vmbo, 6de leerjaar) receive university-level language. The system doesn't adapt to the user's educational level.
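
The coverage pattern described above (a long document whose summary only reflects its opening chapters) could be caught with a simple post-generation check. The sketch below assumes the pipeline can expose the summary as text keyed by source section. That structure, and the name `uncovered_sections`, are assumptions for illustration, not the current implementation.

```python
def uncovered_sections(section_titles, summary_by_section):
    """Return source sections with missing or empty summary text, in source order."""
    return [
        title
        for title in section_titles
        if not summary_by_section.get(title, "").strip()
    ]


# Illustrative input: three source sections, only the first one summarized.
titles = ["Chapter 1", "Chapter 2", "Chapter 3"]
summary = {"Chapter 1": "Key terms and definitions.", "Chapter 2": "   "}

missing = uncovered_sections(titles, summary)
coverage = 1 - len(missing) / len(titles)
print(missing)  # ['Chapter 2', 'Chapter 3']
print(f"coverage: {coverage:.0%}")  # coverage: 33%
```

A non-empty result could trigger regeneration for the missing sections or at least be logged, so coverage gaps become measurable instead of surfacing only through feedback.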
Actionable Priorities
Quiz — High Impact
- Fix question count generation — investigate why the pipeline produces far fewer questions than requested
- Improve distractor quality — eliminate structural tells (longest answer = correct, positional bias in T/F)
- Improve content coverage — questions should span the full uploaded material, not just the first section
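
The structural tells named above can be measured on generated quizzes before they reach users. The sketch below assumes each question is a dict with an `options` list and a `correct_index`. That schema and both function names are hypothetical and not taken from the quiz pipeline.

```python
from collections import Counter


def longest_option_is_correct_rate(questions):
    """Fraction of questions whose correct option is strictly the longest one."""
    if not questions:
        return 0.0
    hits = 0
    for q in questions:
        lengths = [len(option) for option in q["options"]]
        correct_len = lengths[q["correct_index"]]
        # Strictly longest: it is the maximum and no other option ties with it.
        if correct_len == max(lengths) and lengths.count(correct_len) == 1:
            hits += 1
    return hits / len(questions)


def correct_position_counts(questions):
    """Count how often each option position holds the correct answer."""
    return Counter(q["correct_index"] for q in questions)


# Illustrative sample, not real quiz data.
sample = [
    {"options": ["short", "a much longer and more detailed option", "brief"], "correct_index": 1},
    {"options": ["True", "False"], "correct_index": 0},
    {"options": ["one", "two", "three"], "correct_index": 0},
]

print(longest_option_is_correct_rate(sample))  # 1 of 3 questions
print(correct_position_counts(sample))  # position 0 twice, position 1 once
```

For questions with k options and no length bias, the first rate should sit at or below roughly 1/k. A value far above that, or a position count dominated by one index, would support the bias users describe.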
Summary — High Impact
- Default to structured/bullet format — or respect user format preferences more reliably
- Improve full-document coverage — ensure all chapters/sections are represented
- Better length calibration — map user detail preferences to actual output length
- Fix language detection/respect — generate in the source material's language
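
One way to make length calibration testable is to give each detail preference an explicit word budget relative to the source length. The sketch below shows the idea. The preference names and the percentages in `TARGET_PERCENT` are placeholder assumptions and would need to be tuned against real feedback.

```python
# Hypothetical target ranges as a percentage of the source word count.
# These values are placeholders, not measured or recommended settings.
TARGET_PERCENT = {
    "concise": (2, 5),
    "standard": (5, 10),
    "detailed": (10, 20),
}


def word_budget(detail_level, source_words):
    """Return the (min, max) summary word count for a detail preference."""
    low, high = TARGET_PERCENT[detail_level]
    return source_words * low // 100, source_words * high // 100


def length_verdict(detail_level, source_words, summary_words):
    """Classify a summary as 'too_short', 'too_long', or 'ok' against its budget."""
    low, high = word_budget(detail_level, source_words)
    if summary_words < low:
        return "too_short"
    if summary_words > high:
        return "too_long"
    return "ok"


print(word_budget("concise", 20_000))  # (400, 1000)
print(length_verdict("concise", 20_000, 2_500))  # too_long
print(length_verdict("detailed", 20_000, 900))  # too_short
```

The verdict labels deliberately reuse the `too_long` and `too_short` category names from the summary table, so automated checks and user complaints can be compared on the same terms.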
Source Data
- `quiz_feedback_classified.csv`: 605 negative quiz feedback entries with `complaint_category` column
- `summary_feedback_classified.csv`: 1,704 negative summary feedback entries with `complaint_category` column
Both files include all original columns from unified_feedback_enriched.csv plus the classification. They are source artifacts and are intentionally not tracked in this PR.
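
Because the classified files are not tracked, the category tables can only be reproduced by someone who has them locally. A minimal sketch of that aggregation follows. It relies only on the `complaint_category` column named above. The `body` column and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def category_breakdown(csv_file):
    """Return (category, count, percent) rows sorted by count, largest first."""
    counts = Counter(row["complaint_category"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]


# In-memory stand-in for a classified CSV; the rows are illustrative only.
sample = io.StringIO(
    "body,complaint_category\n"
    "only got 2 questions,too_few_questions\n"
    "asked for 20 and got 3,too_few_questions\n"
    "answers were obvious,too_easy\n"
    "quiz never loaded,not_working\n"
)

for category, count, percent in category_breakdown(sample):
    print(f"| `{category}` | {count} | {percent}% |")
```

With a local copy, the same function could be called on an opened file, for example `open("quiz_feedback_classified.csv", newline="", encoding="utf-8")`. The encoding is an assumption.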