docs/learning-api-preview-vm-plan.md

Learning API Preview VM Plan

Goal

Run per-PR Learning API preview environments on one provisioned VM, with clean integration into existing core-api preview environments.

Scope

  • Host learning-api preview stacks (api + worker + redis) per PR.
  • Expose each PR stack behind Cloudflare (Tunnel + DNS).
  • Wire each PR stack URL into the matching core-api preview (LEARNING_API_BASE_URL).
  • Create or update the stack on PR open, reopen, and synchronize; destroy it on PR close.

Assumptions

  • Parser service is disabled for default preview stacks.
  • No GPU requirement in default previews.
  • Supabase preview branch lifecycle remains managed by Supabase GitHub integration.

Resource Model

Current defaults in the repo are too heavy for previews:

  • API Dockerfile uses gunicorn -w 8 (apps/learning-api/api/Dockerfile).
  • Worker Dockerfile uses Celery --concurrency=16 (apps/learning-api/workers/learning_agents/Dockerfile).

Preview-safe per-PR budget (after tuning):

  • CPU: ~1.5 vCPU reserved, ~2.5 vCPU peak
  • Memory: ~2.0 GB reserved, ~2.8 GB peak
  • Disk: ~2-4 GB for writable layers, logs, and temp files

Single-VM sizing recommendations:

  • 3 active PRs: 8 vCPU / 16 GB RAM / 150 GB NVMe
  • 6 active PRs: 16 vCPU / 32 GB RAM / 250 GB NVMe
  • 10 active PRs: 24 vCPU / 64 GB RAM / 400 GB NVMe

Planning formulas (lower bounds; the sizing recommendations above add headroom on top of these):

  • vCPU ≈ 2 + 1.5 * active_previews
  • RAM_GB ≈ 6 + 2.5 * active_previews
  • Disk_GB ≈ 120 + 8 * active_previews
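
As a quick check on the planning formulas, here is a minimal sketch that evaluates them for a given number of active previews. The names estimate_vm and VmEstimate are illustrative only and do not exist in the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmEstimate:
    """Lower-bound VM resources for a number of active previews."""
    vcpu: float
    ram_gb: float
    disk_gb: float


def estimate_vm(active_previews: int) -> VmEstimate:
    """Apply the planning formulas from this plan (hypothetical helper)."""
    if active_previews < 0:
        raise ValueError("active_previews must be non-negative")
    return VmEstimate(
        vcpu=2 + 1.5 * active_previews,     # vCPU ≈ 2 + 1.5 * active_previews
        ram_gb=6 + 2.5 * active_previews,   # RAM_GB ≈ 6 + 2.5 * active_previews
        disk_gb=120 + 8 * active_previews,  # Disk_GB ≈ 120 + 8 * active_previews
    )


if __name__ == "__main__":
    for n in (3, 6, 10):
        print(n, estimate_vm(n))
```

For 3, 6, and 10 active previews this gives 6.5 / 11 / 17 vCPU, 13.5 / 21 / 31 GB RAM, and 144 / 168 / 200 GB disk, each of which sits below the corresponding single-VM recommendation.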

Recommended starting point:

  • 16 vCPU / 32 GB RAM / 250 GB NVMe
  • Cap active previews at 6.
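
To see how the cap relates to the planning formulas, the formulas can be inverted to ask how many previews a given VM could hold. This is a rough sketch; max_previews is a hypothetical helper, and the result is only as good as the formula constants.

```python
import math


def max_previews(vcpu: float, ram_gb: float, disk_gb: float) -> int:
    """Invert the planning formulas; the tightest resource wins."""
    by_cpu = (vcpu - 2) / 1.5      # from vCPU ≈ 2 + 1.5 * n
    by_ram = (ram_gb - 6) / 2.5    # from RAM_GB ≈ 6 + 2.5 * n
    by_disk = (disk_gb - 120) / 8  # from Disk_GB ≈ 120 + 8 * n
    return max(0, math.floor(min(by_cpu, by_ram, by_disk)))


if __name__ == "__main__":
    print(max_previews(16, 32, 250))
```

For the 16 vCPU / 32 GB / 250 GB starting point the formulas alone would allow 9 previews (CPU is the binding resource), so a cap of 6 leaves roughly a third of that bound as headroom.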

URL and Routing Contract

Per-PR URL pattern:

  • https://<preview-base-host>/pr-<PR_NUMBER>

Core API integration:

  • Set LEARNING_API_BASE_URL in that PR’s core-api preview secrets to the URL above.
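
The URL contract is simple enough to pin down in one function, so that the deploy job and the core-api wiring step cannot drift apart. The sketch below assumes the base host is passed in as a bare hostname; preview_url and core_api_secret are illustrative names, and preview.example.com is a placeholder.

```python
from typing import Dict


def preview_url(preview_base_host: str, pr_number: int) -> str:
    """Build https://<preview-base-host>/pr-<PR_NUMBER> (hypothetical helper)."""
    host = preview_base_host.strip()
    if not host or "/" in host:
        # Expect a bare hostname, not a URL or a host with a path.
        raise ValueError(f"invalid preview base host: {preview_base_host!r}")
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return f"https://{host}/pr-{pr_number}"


def core_api_secret(preview_base_host: str, pr_number: int) -> Dict[str, str]:
    """Secret to set on the matching core-api preview."""
    return {"LEARNING_API_BASE_URL": preview_url(preview_base_host, pr_number)}


if __name__ == "__main__":
    print(core_api_secret("preview.example.com", 42))
```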

Lifecycle Workflow

PR Open / Reopen

  1. Resolve PR number + branch slug.
  2. Render stack env file (.env.pr-<n>) with:
    • base secrets from the Infisical path /learning-api/ in the staging environment
    • Supabase preview URL/service role key overrides
    • API key aligned with core-api preview (LEARNING_API_KEY)
    • internal wiring overrides:
      • API_URL=http://api:8000
      • REDIS_URL=redis://redis:6379/0
      • CELERY_BROKER_URL=redis://redis:6379/0
      • CELERY_RESULT_BACKEND=redis://redis:6379/0
  3. Start stack with compose project namespace:
    • docker compose -p pr-<n> up -d
  4. Traefik auto-discovers the stack route from Docker labels:
    • Host(`<preview-base-host>`) && PathPrefix(`/pr-<n>`)
  5. Update core-api preview secrets:
    • LEARNING_API_BASE_URL=https://<preview-base-host>/pr-<n>
  6. Comment preview URL on PR.
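
Steps 2 and 4 can be sketched as two small pure functions: one that layers the env sources and one that builds the Traefik labels for the api container. This is an illustration under assumptions, not the actual deploy script. The layering order (base secrets, then preview overrides, then internal wiring) is an assumption, as are the function names and the choice of port 8000 for the Traefik service (taken from API_URL). Traefik's rule syntax quotes matcher arguments in backticks.

```python
from typing import Dict, Mapping

# Internal wiring overrides listed in step 2.
INTERNAL_WIRING: Dict[str, str] = {
    "API_URL": "http://api:8000",
    "REDIS_URL": "redis://redis:6379/0",
    "CELERY_BROKER_URL": "redis://redis:6379/0",
    "CELERY_RESULT_BACKEND": "redis://redis:6379/0",
}


def render_env(base: Mapping[str, str], overrides: Mapping[str, str]) -> str:
    """Render .env.pr-<n> content; later layers win (assumed precedence)."""
    merged = {**base, **overrides, **INTERNAL_WIRING}
    lines = []
    for key in sorted(merged):
        value = merged[key]
        if "\n" in value:
            raise ValueError(f"{key} contains a newline; not safe for an env file")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def traefik_labels(pr_number: int, preview_base_host: str,
                   api_port: int = 8000) -> Dict[str, str]:
    """Docker labels that let Traefik discover the route for one PR stack."""
    name = f"pr-{pr_number}"
    rule = f"Host(`{preview_base_host}`) && PathPrefix(`/{name}`)"
    return {
        "traefik.enable": "true",
        f"traefik.http.routers.{name}.rule": rule,
        f"traefik.http.services.{name}.loadbalancer.server.port": str(api_port),
    }


if __name__ == "__main__":
    # LOG_LEVEL and the key value are placeholders, not real secrets.
    print(render_env({"LOG_LEVEL": "info"}, {"LEARNING_API_KEY": "example-key"}))
    print(traefik_labels(42, "preview.example.com"))
```

Putting the internal wiring last means a staging value such as a shared REDIS_URL can never leak into a preview stack, which is the point of those overrides.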

PR Synchronize

  1. Pull latest images / build if needed.
  2. Recreate only that PR namespace:
    • docker compose -p pr-<n> up -d --force-recreate
  3. Keep same URL and route.
  4. Refresh the core-api preview secret only if the URL changes (normally it does not).

PR Close

  1. Stop and remove PR namespace:
    • docker compose -p pr-<n> down -v --remove-orphans
  2. The Traefik route disappears automatically once the stack's containers are removed.
  3. Remove any local env/artifact files for that PR.
  4. Keep a nightly janitor job to clean up orphaned stacks and routes.
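
The three lifecycle phases map onto one compose invocation each, keyed by the GitHub pull_request action. The sketch below builds the argument lists without running them, and adds a helper for the janitor's orphan detection. Function names are illustrative; the list of running projects is assumed to come from something like docker compose ls, and the list of open PRs from the GitHub API.

```python
from typing import Iterable, List


def compose_command(action: str, pr_number: int) -> List[str]:
    """Map a pull_request action to the docker compose call for that PR."""
    base = ["docker", "compose", "-p", f"pr-{pr_number}"]
    if action in ("opened", "reopened"):
        return base + ["up", "-d"]
    if action == "synchronize":
        return base + ["up", "-d", "--force-recreate"]
    if action == "closed":
        return base + ["down", "-v", "--remove-orphans"]
    raise ValueError(f"unsupported pull_request action: {action!r}")


def orphaned_projects(running_projects: Iterable[str],
                      open_prs: Iterable[int]) -> List[str]:
    """Return pr-<n> projects that are running but have no open PR."""
    expected = {f"pr-{n}" for n in open_prs}
    return sorted(
        p for p in set(running_projects)
        if p.startswith("pr-") and p not in expected
    )


if __name__ == "__main__":
    print(compose_command("synchronize", 42))
    print(orphaned_projects(["pr-3", "pr-12", "traefik"], open_prs=[12]))
```

Filtering on the pr- prefix keeps shared projects such as the Traefik stack itself out of the janitor's reach.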

Preview Runtime Guardrails

  • Cap API workers and Celery concurrency for preview to avoid VM starvation.
  • Apply per-container memory limits and restart policies.
  • Use project name isolation (pr-<n>) for deterministic cleanup.
  • Log retention:
    • rotate container logs
    • expire old logs and artifacts after a TTL
  • Enforce a maximum number of concurrent preview stacks; fail fast with a clear message when the VM is full.
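
The concurrency guardrail could look like the following admission check, run before the stack is started. It is a sketch only: the names are hypothetical, the limit of 6 mirrors the recommended cap, and the set of active projects is assumed to be gathered by the deploy job.

```python
from typing import Iterable

MAX_ACTIVE_PREVIEWS = 6  # recommended cap from the resource model


class PreviewCapacityError(RuntimeError):
    """Raised when the VM already hosts the maximum number of stacks."""


def admit_preview(pr_number: int, active_projects: Iterable[str],
                  limit: int = MAX_ACTIVE_PREVIEWS) -> str:
    """Return the project name if the PR may deploy, else fail fast."""
    project = f"pr-{pr_number}"
    active = set(active_projects)
    if project in active:
        # Updating an existing stack does not consume a new slot.
        return project
    if len(active) >= limit:
        raise PreviewCapacityError(
            f"Preview VM is full ({len(active)}/{limit} stacks active). "
            f"Close or clean up a preview before deploying {project}."
        )
    return project


if __name__ == "__main__":
    print(admit_preview(42, ["pr-1", "pr-2"]))
```

Letting an already-running PR through regardless of the count matters for synchronize events, which would otherwise be rejected whenever the VM is exactly at its cap.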

Parser Strategy

Default preview:

  • Disable parser service.
  • Use fallback parsing path in preview mode.

Optional parser preview:

  • Enable only on demand (PR label or manual trigger), not for every PR.
  • Plan separate capacity if parser previews become common.

Rollout Plan

  1. Add VM bootstrap scripts (Docker, compose plugin, cloudflared, systemd units).
  2. Add GitHub Actions jobs:
    • deploy-learning-api-preview on opened/reopened/synchronize
    • cleanup-learning-api-preview on closed
  3. Add URL wiring step into existing core-api preview job.
  4. Add nightly cleanup workflow for stale PR stacks.
  5. Add basic observability:
    • health check endpoint polling
    • stack count + resource usage report in workflow summary.
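
For the health check polling in step 5, a minimal sketch using only the standard library might look like this. The /health path in the usage example is an assumption (the plan does not name the endpoint), as are the function names and the retry defaults. The probe and sleep functions are injectable so the loop can be tested without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, connection failures, and timeouts.
        return False


def wait_until_healthy(url: str,
                       probe: Callable[[str], bool] = http_ok,
                       attempts: int = 10,
                       delay_s: float = 3.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until the probe succeeds or attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False


if __name__ == "__main__":
    # Placeholder host and hypothetical health path.
    ok = wait_until_healthy("https://preview.example.com/pr-42/health", attempts=2)
    print("healthy" if ok else "unhealthy")
```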

Out of Scope

  • Marketing/email preview changes.
  • Parser-on-every-PR rollout.
  • Migrating learning-api runtime to Cloudflare-native compute.