.claude/skills/eval-playground/SKILL.md

Eval Playground — Co-development Skill

Use this skill when working on AI generation features (mindmaps, flashcards, quizzes, summaries, podcasts) that need iteration in the eval playground. It provides CLI commands for inspecting decks, content pillars, and generation outputs, plus guidance on how to co-develop with the user.

Eval Playground — Co-development Skill

You are co-developing AI generation features with the user. The eval playground is a full-stack tool for inspecting and iterating on the AI pipeline: parsing, chunking, content pillars, flashcards, quizzes, summaries, podcasts, and mindmaps.

Architecture

                            Supabase + Celery workers
                                       ▲
                                       │
                               app.py (FastAPI, port 8100)
                                       ▲
                          ┌────────────┴────────────┐
                          │                         │
                Web UI (port 5174)                cli.py
                  (React)                         (Typer; you run this)

app.py is the canonical API — it owns all reads/writes against Supabase and dispatches generation tasks to workers. Web UI and CLI are both thin clients over the API. Inspection (decks/pillars/chunks/mindmap) and orchestration (ingest, generate, evaluate) all flow through HTTP. The CLI is what you (the agent) use to drive the pipeline.

Running the playground

The playground server must be running for any CLI work — poe dev-eval-playground from apps/learning-api/. If it's not up, ask the user to start it before you try anything.

# User runs this once at the start of a session:
cd apps/learning-api && poe dev-eval-playground

# You run CLI commands from the playground dir:
cd apps/learning-api/evals-playground
uv run python cli.py --help

The CLI defaults to http://localhost:8100; override via --api-url or EVAL_PLAYGROUND_URL if the user is pointing at a remote dev/staging server.
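
That precedence, as a minimal sketch (resolve_api_url is illustrative; the real resolution logic lives in cli.py):

import os

# Illustrative only: flag wins, then env var, then the local default
def resolve_api_url(flag_value: str | None = None) -> str:
    return flag_value or os.environ.get("EVAL_PLAYGROUND_URL", "http://localhost:8100")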

Extending the playground (do this; don't work around)

The CLI is intentionally a thin client over the REST API. When you need a capability that isn't there:

  1. Don't drop down to direct Supabase queries, one-off scripts, or a parallel pipeline.
  2. Do extend the chain end-to-end:
    • Add or extend the metric / generation / inspection logic in the right module (metrics/, sdk.py, generator code, etc.)
    • Expose it as a REST endpoint in app.py
    • Add a thin CLI wrapper that calls the endpoint
  3. The web UI gets it for free because it shares the API.

When you do this, tell the user explicitly what you're adding and why so the extension gets reviewed deliberately rather than tacked on. New endpoints go in app.py; new schemas in api_models.py; new CLI commands in cli.py.
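
A minimal sketch of that chain, collapsed into one file so it runs as-is. Every name here (chunk_stats, the /decks/{deck_id}/chunk-stats route, fetch_chunks) is invented for illustration; in the real playground the three pieces live in metrics/ (or sdk.py), app.py, and cli.py:

from fastapi import FastAPI
import httpx
import typer

app = FastAPI()      # stands in for the playground's app.py
cli = typer.Typer()  # stands in for cli.py

# 1. Pure logic: would live in metrics/ or sdk.py
def chunk_stats(chunks: list[dict]) -> dict:
    lengths = [len(c["text"]) for c in chunks]
    return {"count": len(lengths), "mean_chars": sum(lengths) / max(len(lengths), 1)}

def fetch_chunks(deck_id: str) -> list[dict]:
    return []  # stand-in for app.py's real data access (db.py, then Supabase)

# 2. REST endpoint: the only layer that touches data access
@app.get("/decks/{deck_id}/chunk-stats")
def get_chunk_stats(deck_id: str) -> dict:
    return chunk_stats(fetch_chunks(deck_id))

# 3. Thin CLI wrapper: no logic, just the HTTP call
@cli.command("chunk-stats")
def chunk_stats_cmd(deck_id: str) -> None:
    resp = httpx.get(f"http://localhost:8100/decks/{deck_id}/chunk-stats")
    resp.raise_for_status()
    typer.echo(resp.json())

The web UI then only needs a matching call in frontend/src/lib/api.ts against the same route.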

CLI Reference

All commands run from apps/learning-api/evals-playground/. Everything routes through app.py over HTTP.

# Decks
uv run python cli.py decks                          # list decks
uv run python cli.py decks --pillars                # only decks with content pillars

# Content pillars
uv run python cli.py pillars <deck_id>              # content pillar tree (with excluded chunks)
uv run python cli.py chunks <deck_id>               # raw docling chunks with labels/headings
uv run python cli.py pillars-regenerate <deck_id>   # re-run pillar generation (calls LLM)
uv run python cli.py pillars-regenerate <deck_id> --no-exclude --no-summaries
uv run python cli.py pillars-compare <deck_id>      # with vs without exclusion
uv run python cli.py pillars-scan                   # scan all decks for exclusions

# Mindmaps
uv run python cli.py mindmap show <deck_id>         # current mindmap as tree
uv run python cli.py mindmap compare-excluded <deck_id>  # with vs without excluded chunks
uv run python cli.py mindmap json <deck_id>         # Mind Elixir JSON
uv run python cli.py mindmap json <deck_id> --raw   # plain JSON for piping
uv run python cli.py mindmap stats <deck_id>        # quick numbers
uv run python cli.py mindmap pillars-json <deck_id> # raw content pillar input

# Flashcard evals
uv run python cli.py evaluate-flashcards <deck_id>  # run flashcard evals on a deck
uv run python cli.py evaluate-flashcards <deck_id> --metrics blueprint_coverage,relevance
uv run python cli.py evaluate-flashcards <deck_id> --json   # raw JSON for piping

The evaluate-flashcards flow: POST /evaluate/deck with the chosen metrics, then poll GET /runs/<run_id> until done. Default metrics are blueprint_coverage,relevance,redundancy,independence (leakage off by default — opt in via --metrics).
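
Under the hood that is two HTTP calls. A sketch with assumed field names (run_id, status; check app.py and api_models.py for the real schema):

import time

import httpx

API = "http://localhost:8100"

# Kick off the run; payload and response field names are assumptions
run = httpx.post(f"{API}/evaluate/deck", json={
    "deck_id": "<deck_id>",
    "metrics": ["blueprint_coverage", "relevance", "redundancy", "independence"],
}).json()

# Poll the run until the background runner finishes
while True:
    status = httpx.get(f"{API}/runs/{run['run_id']}").json()
    if status.get("status") in ("done", "failed"):
        break
    time.sleep(2)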

Adding a new deck from a file

To parse a local PDF and create a new deck, use the ingest command. This uploads the file to the playground server, triggers the full ingestion pipeline (parsing, chunking, content pillar generation), and waits for it to finish.

# Parse a PDF and wait for the deck to be ready (default)
uv run python cli.py ingest /path/to/file.pdf

# Custom deck name (defaults to filename)
uv run python cli.py ingest /path/to/file.pdf --name "My Deck"

# Start ingestion without waiting
uv run python cli.py ingest /path/to/file.pdf --no-wait

# Check on a deck's ingestion status / wait for it to finish
uv run python cli.py wait <deck_id>

# Wait for all pending decks at once
uv run python cli.py wait

Once ingestion completes, inspect the deck with pillars, chunks, or mindmap, and score the cards with evaluate-flashcards.
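
In HTTP terms, ingest is an upload plus the same polling pattern as evaluate-flashcards. A sketch with the endpoint name and multipart fields assumed (cli.py's ingest command is the source of truth):

import httpx

API = "http://localhost:8100"

# Upload the PDF; "/ingest" and the "file"/"name" fields are assumptions
with open("/path/to/file.pdf", "rb") as f:
    deck = httpx.post(f"{API}/ingest", files={"file": f}, data={"name": "My Deck"}).json()

# The server runs parsing, chunking, and pillar generation in the background;
# `wait` just polls the deck's ingestion status until it settles.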

Key files

  • evals-playground/app.py: FastAPI server — the canonical API. All reads/writes against Supabase, all task dispatch.
  • evals-playground/api_models.py: Pydantic request/response schemas for app.py.
  • evals-playground/sdk.py: In-process orchestration called from app.py (evaluate(), evaluate_quiz(), per-metric wrappers, MetricName registry, composite weights).
  • evals-playground/eval_runner.py: Background runner glue; the METRIC_FNS registry that app.py dispatches against.
  • evals-playground/metrics/: Pure metric implementations (one file per metric). Add new metrics here, then register in sdk.py + eval_runner.METRIC_FNS.
  • evals-playground/cli.py: CLI client over the REST API. Commands map 1:1 to endpoints.
  • evals-playground/db.py: Supabase queries — imported by app.py only; the CLI does not touch this.
  • evals-playground/frontend/src/lib/api.ts: Frontend API client (mirror of the CLI's HTTP calls).
  • evals-playground/frontend/src/components/PillarsView.tsx: Web pillars viewer (with excluded chunks panel).
  • evals-playground/frontend/src/components/MindmapView.tsx: Web mindmap viewer.
  • workers/learning_agents/mindmap_agent/mindmap_generator.py: Production mindmap agent.
  • workers/learning_agents/util/content_pillar_structure.py: Pure content pillar ops (formatting, exclusion, models).
  • shared/src/shared/schemas/content_pillar.py: Content pillar Pydantic schema.

Qualitative review protocol

You and the user iterate together on AI generation quality. You are a co-developer, not a judge — you spot issues, simulate a student, suggest improvements, and bring evidence. The user gives direction. You never give final verdicts.

Two lenses

Every review looks through two lenses:

  1. Student lens: Simulate a learner studying this material. What's useful for exam prep? What key takeaways are missing? What's noise that would waste study time? Would a student find the right section when they need it? Be specific — reference concrete items ("subtopic 'General Background Principles' is too vague — a student wouldn't know what's in there").
  2. Context engineering lens: Is the LLM getting the right amount of signal in its prompt/input? Too much wastes tokens and confuses. Too little loses important info. When output is bad, trace backwards: was the prompt clear? Did the LLM have enough context? Or too much noise? For example: "the structure LLM only sees 60 chars per chunk — is that enough to distinguish 'References to Classical Literature' (educational) from 'References' (bibliography)?"

Layered inspection

Be smart about token usage. Don't dump everything into context at once. Go deeper only when something smells off:

  1. Start with the output (mindmap tree, flashcard list, pillar structure) — scan for obvious issues
  2. Spot-check against pillars — if a topic looks wrong, check what the pillar says
  3. Dig into chunks — if the pillar looks off, check raw docling chunks
  4. Read the PDF — only if chunks seem wrong or missing (layout issues, missing text)

Bring receipts

Never make claims without evidence. If you say "this topic is missing key content," show which chunk has that content and where it went. If you say "the exclusion is wrong," show the chunk text. Use CLI commands to pull data and quote it.

Help the user inspect things in the web playground too — if something would be clearer with a visual (bounding boxes on the PDF, excluded chunks highlighted, side-by-side comparison), say so and suggest what the web view should show.

Request better tooling

If you're struggling to inspect something or the CLI doesn't support what you need, the answer is to extend it (see "Extending the playground" above), not to work around it. Tell the user explicitly: "I need a CLI/REST capability that does X — right now I can't easily check Y. I'll add the endpoint + command unless you'd rather scope it differently." Add the metric/query/etc., expose via REST, wrap with CLI. The CLI was built for you; growing it is part of the work.

Evals

When reviewing outputs, regularly consider: would an eval catch this? Suggest adding an eval when a pattern recurs or when an existing eval misses a problem. Evals are discovered through the work, not predefined — but once discovered, they should be validated against real cases to confirm they actually capture the issue.

Debugging parsing issues

When chunks look wrong (missing text, wrong order, merged content), always look at the actual PDF first before theorizing. Use the Read tool on the PDF file to visually inspect the layout, or download it from Supabase storage:

# Find and download the PDF (assumes get_supabase_client and user_id are in scope)
client = get_supabase_client()
files = client.storage.from_('PDF').list(f'{user_id}/')
url = client.storage.from_('PDF').get_public_url(f"{user_id}/{files[0]['name']}")
# Then curl the URL to /tmp/ and Read the downloaded file

Then compare what you see in the PDF against:

  1. The docling JSON (docling_json_url) — check reading order via texts[] items and their prov[].bbox y-coordinates (sketched after this list)
  2. The markdown (md_url) — check if content exists but is misordered
  3. The docling chunks (docling_chunks_url) — check what the chunker produced
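
For step 1, a sketch of the reading-order check. texts[] and prov[].bbox come from the list above; the remaining key names (page_no, the bbox t coordinate) are assumptions to verify against a real docling file:

import json

with open("/tmp/docling.json") as f:
    doc = json.load(f)

# Print page, top-y, and a text preview for the first items. With docling's
# usual bottom-left origin, in-page reading order should show descending t.
for item in doc["texts"][:50]:
    prov = item["prov"][0]
    print(prov["page_no"], round(prov["bbox"]["t"], 1), item["text"][:60])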

Check evals-playground/known-issues/ for documented cases with PDFs and evidence. When you discover a new issue, add it there following the same format (markdown + PDF).

Cross-checking workflow metrics with subagents

Quiz/flashcard LLM metrics run as workflow judges (Gemini with structured output, cheap, consistent). For trust-critical decisions, cross-check with a Claude Code subagent (agent judge) that reads the full bundle and writes its own JSON verdict.

Step 0 — read the rubric. Before doing anything, read apps/learning-api/evals-playground/metrics/README.md to see what each metric is supposed to measure. The rubric lives in three places (the doc, the workflow prompt in the metric's .py, the bundle prep in cli.py) and they must match. If they've drifted, align all three before the agreement numbers mean anything.

Workflow: CLI prepare-<metric>-review <deck> writes a self-contained bundle.md. Spawn an Agent to produce a JSON report. CLI cross-check-<metric> <deck> diffs agent vs workflow. Use cli.py --help for current command names.

Agent config:

  • subagent_type: "general-purpose", model: "opus".
  • Prompt points at the bundle; agent writes JSON to the path in the bundle's Output contract.
  • Read the bundle fully, never fall back to general knowledge, no prose beyond "wrote report to X".

Binary beats 3-way for subjective metrics. If the middle class is a judgment call, collapse to binary (see metrics/README.md for which metrics are binary). With a squishy middle class the agent-vs-agent ceiling is ~70%, which makes any >70% target impossible.

Two-run consensus. Run the agent twice, different output paths. Agent-vs-agent agreement = noise floor. Workflow is trustworthy if it agrees with the consensus subset (items both agents agreed on) at ≥70%. If agent-vs-agent is under ~85%, fix the rubric before blaming workflow.
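
The arithmetic, as a runnable sketch (verdicts keyed by item id; labels and field names are illustrative):

def agreement(a: dict, b: dict) -> float:
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys)

# Illustrative verdicts from two agent runs and the workflow judge
agent1 = {"card_1": "pass", "card_2": "fail", "card_3": "pass"}
agent2 = {"card_1": "pass", "card_2": "pass", "card_3": "pass"}
workflow = {"card_1": "pass", "card_2": "fail", "card_3": "fail"}

noise_floor = agreement(agent1, agent2)  # agent-vs-agent; want >= ~0.85

# Consensus subset: only items both agents agreed on
consensus = {k: v for k, v in agent1.items() if agent2.get(k) == v}
workflow_agreement = agreement(consensus, workflow)  # trustworthy if >= 0.70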

Verify agent quotes. If an agent verdict flips a workflow call, grep the quoted passage against the bundle. Fabricated quotes = agent failure mode; retry with a stricter prompt.
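
The grep can be as simple as a whitespace-normalized substring test (sketch):

from pathlib import Path

bundle = " ".join(Path("bundle.md").read_text().split())
quote = " ".join("exact passage the agent cited".split())

# Whitespace is normalized on both sides so line wrapping in the bundle
# doesn't cause false negatives
if quote not in bundle:
    print("quote not found in bundle: possible fabrication, retry the agent")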

Principles

  • LLM output should be simple: Have the model output a plain tree. All rendering-library-specific formatting happens in code.
  • No fallback logic: If something fails, raise an error. Don't silently degrade.
  • CLI and web app must stay in sync: Both consume the same backend via app.py over HTTP. Don't build separate logic paths or let the CLI fall back to direct Supabase access.
  • Read before editing: Always read the current file state before modifying — structures evolve.
  • Never over-fit to one example: If a change makes one document look great but would break others, it's a bad change. Test on multiple decks when possible.
  • Show first, then discuss: Before and after any change, show a digestible summary of the output. Pick the right format — tree, table, diff, bullet points.