docs/learning-api-preview-vm-plan.md
# Learning API Preview VM Plan
## Goal
Run per-PR Learning API preview environments on one provisioned VM, with clean integration into existing core-api preview environments.
## Scope
- Host `learning-api` preview stacks (api + worker + redis) per PR.
- Expose each PR stack behind Cloudflare (Tunnel + DNS).
- Wire each PR stack URL into the matching `core-api` preview (`LEARNING_API_BASE_URL`).
- Create/update on PR open/sync; destroy on PR close.
## Assumptions
- Parser service is disabled for default preview stacks.
- No GPU requirement in default previews.
- Supabase preview branch lifecycle remains managed by Supabase GitHub integration.
## Resource Model

Current defaults in the repo are heavy for previews:

- The API Dockerfile uses `gunicorn -w 8` (apps/learning-api/api/Dockerfile).
- The worker Dockerfile uses Celery `--concurrency=16` (apps/learning-api/workers/learning_agents/Dockerfile).
Preview-safe per-PR budget (after tuning):

- CPU: ~1.5 vCPU reserved, ~2.5 vCPU peak
- Memory: ~2.0 GB reserved, ~2.8 GB peak
- Disk: ~2-4 GB writable/log/temp
Single-VM sizing recommendations:

- 3 active PRs: 8 vCPU / 16 GB RAM / 150 GB NVMe
- 6 active PRs: 16 vCPU / 32 GB RAM / 250 GB NVMe
- 10 active PRs: 24 vCPU / 64 GB RAM / 400 GB NVMe
Planning formulas:

- `vCPU ≈ 2 + 1.5 * active_previews`
- `RAM_GB ≈ 6 + 2.5 * active_previews`
- `Disk_GB ≈ 120 + 8 * active_previews`
Recommended starting point:

- 16 vCPU / 32 GB RAM / 250 GB NVMe
- Cap active previews at 6.
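The planning formulas above can be sketched as a small shell helper (a hypothetical script, not part of the repo; it works in tenths to stay within POSIX integer arithmetic):

```shell
# Estimate VM size for N active previews, per the planning formulas.
preview_sizing() {
  n="$1"
  vcpu_tenths=$(( 20 + 15 * n ))   # vCPU ≈ 2 + 1.5 * n, in tenths
  ram_tenths=$(( 60 + 25 * n ))    # RAM_GB ≈ 6 + 2.5 * n, in tenths
  disk_gb=$(( 120 + 8 * n ))       # Disk_GB ≈ 120 + 8 * n
  echo "vCPU=$(( vcpu_tenths / 10 )).$(( vcpu_tenths % 10 )) RAM_GB=$(( ram_tenths / 10 )).$(( ram_tenths % 10 )) Disk_GB=${disk_gb}"
}

preview_sizing 6   # six active previews: vCPU=11.0 RAM_GB=21.0 Disk_GB=168
```

Note that the recommended shapes above deliberately carry headroom beyond these formula outputs (e.g. 250 GB NVMe for 6 PRs versus the formula's ~168 GB).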
## URL and Routing Contract

Per-PR URL pattern:

`https://<preview-base-host>/pr-<PR_NUMBER>`
Core API integration:

- Set `LEARNING_API_BASE_URL` in that PR's core-api preview secrets to the URL above.
## Lifecycle Workflow

### PR Open / Reopen
- Resolve PR number + branch slug.
- Render the stack env file (`.env.pr-<n>`) with:
  - base secrets from Infisical `/learning-api` in staging
  - Supabase preview URL / service role key overrides
  - API key aligned with the core-api preview (`LEARNING_API_KEY`)
  - internal wiring overrides:
    - `API_URL=http://api:8000`
    - `REDIS_URL=redis://redis:6379/0`
    - `CELERY_BROKER_URL=redis://redis:6379/0`
    - `CELERY_RESULT_BACKEND=redis://redis:6379/0`
- Start the stack with a compose project namespace: `docker compose -p pr-<n> up -d`
- Traefik auto-discovers the stack route from Docker labels: `Host(<preview-host>) && PathPrefix(/pr-<n>)`
- Update the core-api preview secret: `LEARNING_API_BASE_URL=https://<preview-base-host>/pr-<n>`
- Comment the preview URL on the PR.
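The open/reopen steps can be sketched as a shell script. This is a hypothetical sketch: the base host, secret fetching, and the exact `gh`/`docker` wiring are assumptions, and the Infisical/Supabase secret rendering is elided.

```shell
#!/bin/sh
# Sketch of the per-PR deploy flow; adapt names and paths to the real repo.
BASE_HOST="preview.example.com"   # placeholder for <preview-base-host>

# Render the per-PR env file with the internal wiring overrides.
# (Base secrets from Infisical and Supabase overrides are elided here.)
render_env_file() {
  pr="$1"
  {
    echo "API_URL=http://api:8000"
    echo "REDIS_URL=redis://redis:6379/0"
    echo "CELERY_BROKER_URL=redis://redis:6379/0"
    echo "CELERY_RESULT_BACKEND=redis://redis:6379/0"
  } > ".env.pr-${pr}"
}

# Bring the stack up under an isolated compose project, then wire the URL
# into the matching core-api preview and comment it on the PR.
deploy_preview() {
  pr="$1"
  render_env_file "$pr"
  docker compose -p "pr-${pr}" --env-file ".env.pr-${pr}" up -d
  url="https://${BASE_HOST}/pr-${pr}"
  gh secret set LEARNING_API_BASE_URL --body "$url"   # core-api preview secret
  gh pr comment "$pr" --body "Learning API preview: $url"
}

# usage: deploy_preview 123
```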
### PR Synchronize
- Pull latest images / build if needed.
- Recreate only that PR's namespace: `docker compose -p pr-<n> up -d --force-recreate`
- Keep same URL and route.
- Refresh core-api preview secret only if URL changes (normally no change).
### PR Close

- Stop and remove the PR namespace: `docker compose -p pr-<n> down -v --remove-orphans`
- Route automatically disappears when container is removed.
- Remove any local env/artifact files for that PR.
- Keep nightly janitor job to clean orphaned stacks/routes.
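The nightly janitor can be sketched as follows. The `gh pr list` and `docker compose ls` invocations are assumptions about the VM's tooling; the stale-detection logic is factored out so it can be reasoned about on its own.

```shell
#!/bin/sh
# Sketch of the nightly janitor for orphaned preview stacks.

# Given newline-separated open PR numbers and running compose project
# names (pr-<n>), print the projects whose PR is no longer open.
stale_projects() {
  open="$1"
  running="$2"
  for project in $running; do
    n="${project#pr-}"
    printf '%s\n' "$open" | grep -qx "$n" || echo "$project"
  done
}

# Tear down every stale stack and its env file.
janitor() {
  open=$(gh pr list --state open --json number --jq '.[].number')
  running=$(docker compose ls --format json | jq -r '.[].Name' | grep '^pr-' || true)
  for project in $(stale_projects "$open" "$running"); do
    echo "removing stale preview stack: $project"
    docker compose -p "$project" down -v --remove-orphans
    rm -f ".env.${project}"
  done
}
```

Because routes are derived from container labels, tearing down the compose project is sufficient to retire the route as well.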
## Preview Runtime Guardrails
- Cap API workers and Celery concurrency for preview to avoid VM starvation.
- Apply per-container memory limits and restart policies.
- Use project-name isolation (`pr-<n>`) for deterministic cleanup.
- Log retention:
  - rotate container logs
  - TTL old logs/artifacts
- Enforce a maximum number of concurrent preview stacks; fail fast with a clear message when the cap is reached.
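One way to apply these guardrails is a compose override file layered onto the base stack. The sketch below writes such an override from shell; the service names, entrypoint commands, and limit values are all assumptions to be checked against the real Dockerfiles.

```shell
#!/bin/sh
# Sketch: generate a compose override with the preview guardrails
# (reduced worker counts, memory limits, restart policy, log rotation).
cat > compose.preview-override.yml <<'EOF'
services:
  api:
    command: gunicorn -w 2 -b 0.0.0.0:8000 app.main:app   # assumed entrypoint
    mem_limit: 768m
    restart: unless-stopped
    logging:
      driver: json-file
      options: { max-size: "10m", max-file: "3" }
  worker:
    command: celery -A learning_agents worker --concurrency=2  # assumed app name
    mem_limit: 1g
    restart: unless-stopped
    logging:
      driver: json-file
      options: { max-size: "10m", max-file: "3" }
  redis:
    mem_limit: 256m
    restart: unless-stopped
EOF
# Apply alongside the base compose file:
#   docker compose -p pr-<n> -f compose.yml -f compose.preview-override.yml up -d
```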
## Parser Strategy
Default preview:
- Disable parser service.
- Use fallback parsing path in preview mode.
Optional parser preview:
- Enable only on-demand (label or manual trigger), not for every PR.
- Plan separate capacity if parser previews become common.
## Rollout Plan
- Add VM bootstrap scripts (Docker, compose plugin, cloudflared, systemd units).
- Add GitHub Actions jobs:
  - `deploy-learning-api-preview` on `opened`/`reopened`/`synchronize`
  - `cleanup-learning-api-preview` on `closed`
- Add URL wiring step into existing core-api preview job.
- Add nightly cleanup workflow for stale PR stacks.
- Add basic observability:
- health check endpoint polling
- stack count + resource usage report in workflow summary.
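A minimal sketch of the health-check polling, assuming each preview API exposes a `/health` endpoint (an assumption; substitute the real path):

```shell
#!/bin/sh
# Poll a preview stack's health endpoint until it responds or we give up.
wait_for_healthy() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url/health" > /dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "unhealthy after ${tries} attempts: $url" >&2
  return 1
}

# Stack count for the workflow summary:
#   docker compose ls --format json | jq '[.[] | select(.Name | startswith("pr-"))] | length'
```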
## Out of Scope
- Marketing/email preview changes.
- Parser-on-every-PR rollout.
- Migrating learning-api runtime to Cloudflare-native compute.