infra/openreplay/README.md

OpenReplay on Hetzner

Self-hosted OpenReplay running on a single Hetzner Cloud VM. The VM userData runs the documented openreplay -i $DOMAIN installer (which sets up K3s with embedded containerd, plus helm/templater/kubectl as standalone binaries — no Docker on the host).

Architecture

Two-domain split:

  • openreplay.studyflash.dev — admin dashboard. CF-proxied A record at the VM's public IP. Inherits the existing *.studyflash.dev Cloudflare Access wildcard, so the dashboard requires SSO.
  • or.studyflash.com — tracker ingest endpoint. Bound via WorkersCustomDomain to a Cloudflare Worker (openreplay-ingest) whose source (scripts/ingest-worker.js) reverse-proxies every path to https://openreplay.studyflash.dev/<same-path>. A path-scoped CF Access bypass on openreplay.studyflash.dev/ingest/* lets the Worker's tracker payloads through without an Access challenge. Neutral host so ad-blocker rules pattern-matching openreplay.* don't break the tracker.
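
The path-preserving proxy described above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/ingest-worker.js: the names `UPSTREAM_ORIGIN`, `toUpstreamUrl`, and `worker` are hypothetical, and the real Worker may handle headers or errors differently.

```typescript
// Hypothetical sketch; the real scripts/ingest-worker.js may differ.
// The upstream host is the dashboard domain named in this README.
const UPSTREAM_ORIGIN = "https://openreplay.studyflash.dev";

// Map an incoming URL on the ingest host to the same path and query
// string on the dashboard host.
function toUpstreamUrl(incoming: string): string {
  const src = new URL(incoming);
  const dst = new URL(UPSTREAM_ORIGIN);
  dst.pathname = src.pathname;
  dst.search = src.search;
  return dst.toString();
}

// Module-style Worker handler: re-issue the request against the upstream,
// carrying over method, headers, and body. A deployed Worker would
// export this object as its default export.
const worker = {
  async fetch(request: Request): Promise<Response> {
    return fetch(new Request(toUpstreamUrl(request.url), request));
  },
};
```

With this shape, a tracker request to https://or.studyflash.com/ingest/... is forwarded to the same /ingest/... path on the dashboard host, which is the path the CF Access bypass covers.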

TLS:

  • User TLS terminates at the Cloudflare edge with CF's Universal SSL cert (browsers see studyflash.dev).
  • CF→origin uses Full (Strict) mode and validates the Cloudflare Origin CA cert the VM presents (15-year validity).
  • Origin CA cert + key are issued out-of-band (see TLS section below) and stored base64-encoded in Infisical. Pulumi reads them at pulumi up time, decodes, marks as secret, and bakes the PEMs into the VM's userData (encrypted in Pulumi state under GCP KMS).
  • bootstrap.sh writes them into the openreplay-ssl Secret in the app namespace before openreplay -i runs, so the OpenReplay Ingress finds the cert immediately on first install. No cert-manager, no Let's Encrypt, no DNS-01.
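
As a rough illustration of the decode step, the helper below turns a base64 value back into a PEM string and rejects anything that does not look like PEM. The function name and error message are assumptions made for this sketch; in the Pulumi program the result would presumably also be wrapped as a secret (for example with `pulumi.secret`) before it reaches userData.

```typescript
// Illustrative helper; name and error text are assumptions, not the stack's code.
function decodePemFromBase64(name: string, b64: string): string {
  const pem = Buffer.from(b64.trim(), "base64").toString("utf8");
  // A PEM block starts with a "-----BEGIN <LABEL>-----" armor line.
  if (!/^-----BEGIN [A-Z ]+-----/.test(pem.trimStart())) {
    throw new Error(`${name} does not decode to a PEM block`);
  }
  return pem;
}
```

Failing at decode time keeps a mis-pasted secret from being baked into the VM, where it would only surface later as a broken Ingress certificate.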

Network lockdown:

  • Hetzner firewall: 22 (SSH) and ICMP open to the world; 443 locked to Cloudflare's published IP ranges (cloudflare.getIpRangesOutput()); port 80 closed entirely. Origin is invisible to direct IP probes from non-CF IPs.
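
The rule set can be pictured as a small pure function over Cloudflare's CIDR list. The `InboundRule` shape below loosely mirrors the hcloud provider's firewall rule arguments, but the field names and the `buildInboundRules` helper are assumptions for illustration, and the CIDRs in the usage note are sample values rather than the full published list.

```typescript
// Rule shape loosely modeled on hcloud firewall rules; treat it as an assumption.
interface InboundRule {
  direction: "in";
  protocol: "tcp" | "icmp";
  port?: string;
  sourceIps: string[];
}

const ANYWHERE = ["0.0.0.0/0", "::/0"];

// Build the inbound rules described above from a list of Cloudflare CIDRs.
function buildInboundRules(cloudflareCidrs: string[]): InboundRule[] {
  return [
    // SSH and ICMP stay open to the world.
    { direction: "in", protocol: "tcp", port: "22", sourceIps: ANYWHERE },
    { direction: "in", protocol: "icmp", sourceIps: ANYWHERE },
    // HTTPS is reachable only from Cloudflare's ranges.
    { direction: "in", protocol: "tcp", port: "443", sourceIps: cloudflareCidrs },
    // Port 80 gets no rule at all, so inbound traffic to it is dropped.
  ];
}
```

For example, `buildInboundRules(["173.245.48.0/20"])` yields three rules, with 443 limited to that one range.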

Tracker init in client apps:

import OpenReplay from "@openreplay/tracker";

const tracker = new OpenReplay({
  projectKey: "...",
  ingestPoint: "https://or.studyflash.com/ingest",
});
tracker.start();

Stack config (Pulumi.<stack>.yaml)

| Key | Default | Notes |
| --- | --- | --- |
| studyflash-openreplay:domain | openreplay.studyflash.dev | Dashboard hostname. CF-proxied A record, Origin CA cert on the VM. |
| studyflash-openreplay:ingestDomain | or.studyflash.com | Tracker ingest hostname. Bound to the openreplay-ingest Worker via WorkersCustomDomain on the studyflash.com zone. |
| studyflash-openreplay:serverType | ccx23 | 4 vCPU dedicated / 16 GB / 160 GB. Bundled disk is below the 240 GB hard minimum, so a Volume is attached (see below). |
| studyflash-openreplay:location | nbg1 | Hetzner DC. Volume is created in the same location. |
| studyflash-openreplay:dataVolumeSize | 100 | GB. Attached as ext4 and bind-mounted at /var/lib/rancher + /var/lib/openreplay so K3s PVCs (replays, MinIO, ClickHouse) land on the volume. 160 + 100 = 260 GB total, comfortably above 240. |
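
One way to picture how these defaults combine with per-stack overrides is the sketch below. The `StackConfig` interface and `resolveConfig` helper are hypothetical names for illustration; the actual program presumably reads these keys through Pulumi's config API instead.

```typescript
// Hypothetical shape for the stack's config keys (without the
// "studyflash-openreplay:" namespace prefix).
interface StackConfig {
  domain: string;
  ingestDomain: string;
  serverType: string;
  location: string;
  dataVolumeSize: number; // GB
}

// Defaults as listed in this README.
const DEFAULTS: StackConfig = {
  domain: "openreplay.studyflash.dev",
  ingestDomain: "or.studyflash.com",
  serverType: "ccx23",
  location: "nbg1",
  dataVolumeSize: 100,
};

// Keys omitted from `overrides` fall back to the defaults.
function resolveConfig(overrides: Partial<StackConfig> = {}): StackConfig {
  return { ...DEFAULTS, ...overrides };
}
```

For instance, `resolveConfig({ location: "fsn1" })` (an example location, not one this stack uses) changes only the location and keeps every other default.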

Secrets / Infisical

Read from Infisical at /infra/openreplay/:

  • HCLOUD_TOKEN — Hetzner Cloud API token
  • CLOUDFLARE_API_TOKEN — Cloudflare API token (DNS, Workers, Access apps, Rulesets, Worker custom domains, all on the studyflash account)
  • PULUMI_BACKEND_URL — R2 backend URL (s3-compatible)
  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY — R2 credentials for the Pulumi backend
  • OPENREPLAY_ORIGIN_CERT_B64 — base64-encoded Origin CA cert PEM
  • OPENREPLAY_ORIGIN_KEY_B64 — base64-encoded private key PEM matching the cert

GCP credentials for the KMS secrets provider come from the operator's local gcloud auth application-default login — not Infisical. Pulumi state lives on the shared R2 backend; secrets are encrypted with the same pulumi-state GCP KMS key used by the other infra stacks.
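
A quick pre-flight check for the variables listed above might look like the sketch below. The `REQUIRED_SECRETS` constant and the `missingSecrets` helper are assumptions made for this example; the stack itself may validate its inputs differently.

```typescript
// Names taken from the list in this README; the helper itself is illustrative.
const REQUIRED_SECRETS = [
  "HCLOUD_TOKEN",
  "CLOUDFLARE_API_TOKEN",
  "PULUMI_BACKEND_URL",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "OPENREPLAY_ORIGIN_CERT_B64",
  "OPENREPLAY_ORIGIN_KEY_B64",
] as const;

// Return the names that are unset or empty in the given environment.
function missingSecrets(env: Record<string, string | undefined>): string[] {
  return REQUIRED_SECRETS.filter((name) => !env[name]);
}
```

Calling `missingSecrets(process.env)` inside an `infisical run` session and aborting when the result is non-empty would surface a missing secret before any resource is touched.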

Volume protection

The data volume has both protect: true (Pulumi-side) and deleteProtection: true (Hetzner API). Replacing only the Server (pulumi up --replace <server-urn>) is fine — new VM mounts the volume, K3s resumes from the existing data dir. Replacing the Volume requires lifting both protections manually first; doing so destroys all OpenReplay data (Postgres, ClickHouse, MinIO, replays).

Hetzner does not have automatic volume backups (unlike AWS EBS); for real durability we should layer on app-level backups (pg_dump → R2, mc mirror → R2, clickhouse-backup) as a follow-up.

First run

cd infra/openreplay
pnpm install

# Log in to the R2 Pulumi backend (Infisical injects PULUMI_BACKEND_URL)
infisical run --env prod --path /infra/openreplay/ -- pulumi login "$PULUMI_BACKEND_URL"

# Initialize the prod stack against the project's GCP KMS secrets provider.
# This populates the `encryptedkey` field in Pulumi.prod.yaml automatically.
infisical run --env prod --path /infra/openreplay/ -- \
  pulumi stack init prod \
    --secrets-provider="gcpkms://projects/studyflash-security/locations/europe-west6/keyRings/pulumi-state/cryptoKeys/pulumi-state"

# Verify types compile
pnpm typecheck

# Provision (will fail loudly if OPENREPLAY_ORIGIN_CERT_B64/KEY_B64 aren't in
# Infisical — see "TLS" section for how to issue + populate them).
pnpm run pulumi:up

SSH access

The VM ships with Ubuntu's defaults: port 22 open, password auth on, no static SSH keys baked in. For first login, regenerate the root password with hcloud server reset-password openreplay (Hetzner returns it inline) or grab it from the Hetzner Cloud panel. After logging in, add your own pubkey to /root/.ssh/authorized_keys so subsequent sessions use key auth.

A short-lived-cert replacement (Infisical SSH or similar) is on the roadmap but not in place — bootstrap.sh does not configure any SSH CA or restrict password auth.

TLS

The Origin CA cert + key are issued once out-of-band, stored in Infisical, and read into Pulumi state as secrets. Pulumi-side issuance via cloudflare.OriginCaCertificate isn't viable: the CF /certificates endpoint doesn't accept modern API tokens (you get 1016 regardless of perms), and the Pulumi cloudflare provider's apiKey field rejects CF's current cfk_… key format because it validates against the legacy 37-hex-char schema. Until both sides catch up, we stay out-of-band.

To issue (one-time bootstrap, then never again):

# 1) In Cloudflare dashboard: SSL/TLS → Origin Server → Create Certificate.
#    Pick RSA, list "openreplay.studyflash.dev", validity 15 years.
#    Copy BOTH the cert PEM and the private key PEM (the key is shown once).

# 2) Base64-encode both (single-line, no quoting issues for Infisical).
#    -w0 disables line wrapping; the flag is specific to GNU coreutils base64.
CERT_B64=$(base64 -w0 < cert.pem)
KEY_B64=$(base64 -w0 < key.pem)

# 3) Store in Infisical at /infra/openreplay/:
infisical secrets set \
  --projectId 0cfec798-5081-4028-b142-a46080728d1f --env prod --path /infra/openreplay/ \
  "OPENREPLAY_ORIGIN_CERT_B64=$CERT_B64" \
  "OPENREPLAY_ORIGIN_KEY_B64=$KEY_B64"

Renewal: not a near-term concern, since the cert is issued with the 15-year validity selected in step 1.

Outputs

  • openreplayUrl: https://<domain> (dashboard, behind CF Access)
  • ingestUrl: https://<ingestDomain> (tracker ingestPoint)
  • vmIp: Hetzner public IPv4 (informational; firewalled to CF only)

Resource sizing

OpenReplay's docs list a 2 vCPU / 8 GB / 50 GB minimum for low-to-moderate traffic. We default to ccx23 (4 vCPU dedicated / 16 GB / 160 GB) plus a 100 GB data volume — total 260 GB, comfortably above the project's 240 GB hard minimum. The volume is bind-mounted under /var/lib/rancher and /var/lib/openreplay before the installer runs, so K3s persistent volumes (session replays, MinIO, ClickHouse, Postgres) live on the volume rather than the bundled root disk.

If you want a single-disk setup, switching serverType to cpx41 (8 vCPU shared / 16 GB / 240 GB) and dropping the volume also works.
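
The disk arithmetic behind both options can be checked with a small helper. The figures come from this README; the lookup table and the function names are hypothetical and cover only the two server types mentioned here.

```typescript
// Bundled root-disk sizes (GB) for the server types named in this README.
const BUNDLED_DISK_GB: Record<string, number> = { ccx23: 160, cpx41: 240 };
// The project's hard minimum for total disk, per this README.
const MIN_TOTAL_DISK_GB = 240;

// Total disk = bundled root disk + attached data volume.
function totalDiskGb(serverType: string, dataVolumeGb: number): number {
  const bundled = BUNDLED_DISK_GB[serverType];
  if (bundled === undefined) {
    throw new Error(`unknown server type: ${serverType}`);
  }
  return bundled + dataVolumeGb;
}

function meetsDiskMinimum(serverType: string, dataVolumeGb: number): boolean {
  return totalDiskGb(serverType, dataVolumeGb) >= MIN_TOTAL_DISK_GB;
}
```

Here ccx23 with the 100 GB volume gives 160 + 100 = 260 GB, and cpx41 with no volume sits exactly at 240 GB, while ccx23 without a volume falls short.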

CI

Intentionally not wired into .github/workflows/infra.yml yet — the auto-up step in that workflow is currently disabled across all stacks, so adding a job here would be dead weight. Run pnpm run pulumi:up locally for now.