39948-vm/documentation/deployment-vm.md
2026-07-03 16:11:24 +02:00

VM Deployment Runbook

Operational notes for the standard Flatlogic VM deployment used by this project. This document describes the VM runtime layout, health checks, and the June 2026 503 Service Unavailable recovery path.

Runtime Topology

The standard VM runs the app behind Apache and Cloudflare:

Cloudflare
  -> Apache :80
    -> Frontend Next.js production server :3001
    -> Backend API :3000

Do not assume older local development ports on the VM. The standard port split is frontend 3001 and backend 3000:

| Component | VM process | Port | Notes |
| --- | --- | --- | --- |
| Apache | apache2 | 80 | Public entrypoint, reverse proxy |
| Frontend | frontend-dev | 3001 | npm run build, then npm run start |
| Backend | backend-dev | 3000 | NODE_ENV=dev_stage npm run start |
| Telemetry | fl-telemetry | 4317/4318 | Executor telemetry daemon |
| Executor | fl-executor | n/a | VM command/executor bridge |

The backend returns 401 Unauthorized for protected API endpoints when no JWT is sent, so a 401 from http://127.0.0.1:3000/api/... means the backend is alive. Under dev_stage the backend listens on port 3000 by default; an explicit PORT env var overrides that when needed.
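
Because a 401 is the healthy answer here, a check script should not treat it as a failure. The following is a minimal sketch, not part of the project: the function name `backend_state` is hypothetical, and it relies on curl's `%{http_code}` write-out, which prints 000 when no connection could be made.

```shell
# Hypothetical helper: interpret the HTTP status code returned by a
# protected backend endpoint when no JWT is sent.
backend_state() {
  case "$1" in
    401)    echo "alive" ;;        # expected: auth required, backend is up
    000|"") echo "unreachable" ;;  # curl could not connect at all
    5??)    echo "error" ;;        # backend or proxy reported a server error
    *)      echo "unexpected" ;;   # anything else deserves a manual look
  esac
}

# Usage on the VM:
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:3000/api/auth/me)
#   backend_state "$code"
```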

Process Manager

PM2 is managed by systemd:

sudo systemctl status pm2-ubuntu --no-pager
pm2 status

Expected PM2 apps:

| Name | Purpose |
| --- | --- |
| frontend-dev | Next.js frontend production server |
| backend-dev | Express API, migrations, seed, watcher |
| fl-telemetry | Local telemetry daemon |
| fl-executor | Standard VM executor bridge |

The frontend PM2 app name may remain frontend-dev for compatibility with the standard VM image, but the process should run the production script. Build the VM frontend with:

cd /home/ubuntu/executor/workspace/frontend
npm run build

Start it with:

FRONT_PORT=3001 npm run start

The production frontend is a Next.js server build served by next start. Do not run the VM frontend with next dev: the dev server shows the Next.js dev indicator, which is visible during presentations.

Frontend Release Deploys

Automatic VM pulls should deploy the frontend as immutable releases instead of rebuilding in the live workspace. The executor VCS layer builds a fresh copy under:

/home/ubuntu/executor/frontend-releases/<timestamp>-<git-sha>/frontend

The deploy order is:

  1. Pull the requested branch into /home/ubuntu/executor/workspace.
  2. Archive HEAD into a new release directory.
  3. Copy frontend env files from the live workspace when present: .env, .env.local, .env.production, .env.production.local.
  4. Run npm ci.
  5. Run npm run build.
  6. Remove non-runtime build caches from the new release: .next, .turbo, build/cache. Production runtime assets stay in build. The .next directory is only used by local next dev --turbopack, which keeps dev output away from the production build manifests, so a release does not need it.
  7. Switch frontend-dev to the new release with FRONT_PORT=3001 pm2 start npm --name frontend-dev -- run start.
  8. Save PM2 and remove old frontend releases.
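
Step 3 copies env files only "when present", which is easy to get wrong with a plain cp. The following is a minimal sketch of that step, not the executor's actual implementation (which is not shown in this document); the function name `copy_frontend_env` and its arguments are hypothetical.

```shell
# Hypothetical helper for step 3: copy frontend env files into a new release,
# skipping any that do not exist in the live workspace.
copy_frontend_env() {
  src=$1   # live frontend directory, e.g. /home/ubuntu/executor/workspace/frontend
  dst=$2   # frontend directory of the new release
  for name in .env .env.local .env.production .env.production.local; do
    if [ -f "$src/$name" ]; then
      cp -p "$src/$name" "$dst/$name"   # -p preserves mode and timestamps
    fi
  done
}

# Usage (release path left as a placeholder):
#   copy_frontend_env /home/ubuntu/executor/workspace/frontend \
#     "/home/ubuntu/executor/frontend-releases/<timestamp>-<git-sha>/frontend"
```

Missing files are skipped silently, so the function succeeds even when the live workspace has no env files at all.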

The active frontend release is the PM2 frontend-dev working directory. Check it with:

pm2 jlist | jq '.[] | select(.name=="frontend-dev") | {
  cwd:.pm2_env.pm_cwd,
  script:.pm2_env.pm_exec_path,
  args:.pm2_env.args,
  env:{FRONT_PORT:.pm2_env.FRONT_PORT}
}'

Retention defaults to the latest 2 release directories. Override it by setting FRONTEND_RELEASES_KEEP for the executor process before deploy. Do not delete the active release directory; next start serves production assets from its build directory.
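
To preview what retention would remove without deleting anything, a dry-run listing can help. This is a sketch under stated assumptions, not the executor's own pruning code: the function name `releases_to_prune` is hypothetical, and it assumes release directory names begin with a timestamp that sorts lexicographically in chronological order.

```shell
# Hypothetical sketch: print release directories that retention would remove.
# Assumes release names start with a timestamp that sorts lexicographically.
releases_to_prune() {
  root=$1                                   # e.g. /home/ubuntu/executor/frontend-releases
  keep=${2:-${FRONTEND_RELEASES_KEEP:-2}}   # default mirrors the documented value of 2
  active=${3:-}                             # PM2 working directory of frontend-dev
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    basename "$dir"
  done | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r name; do
    case "$active" in
      "$root/$name"|"$root/$name"/*) continue ;;  # never list the active release
    esac
    printf '%s\n' "$root/$name"
  done
}

# Usage on the VM:
#   active=$(pm2 jlist | jq -r '.[] | select(.name=="frontend-dev") | .pm2_env.pm_cwd')
#   releases_to_prune /home/ubuntu/executor/frontend-releases 2 "$active"
```

The function only prints candidates; review the list before removing anything by hand.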

Manual rollback is possible by starting frontend-dev from an older retained release:

cd /home/ubuntu/executor/frontend-releases/<release-id>/frontend
pm2 delete frontend-dev
FRONT_PORT=3001 pm2 start npm --name frontend-dev -- run start
pm2 save --force

The PM2 dump is stored at:

~/.pm2/dump.pm2

This file contains environment variables and may contain secrets. Do not paste it into public tools or tickets without redacting tokens, DB passwords, SMTP credentials, API keys, and tunnel credentials.

Health Checks

Use these checks after a deploy or incident:

df -h
df -ih
free -h
sudo ss -ltnp | grep -E ':(80|3001|3000|4317|4318)\b'
curl -I http://127.0.0.1:3001
curl -I http://127.0.0.1:3000/api/auth/me
curl -I http://tbp.flatlogic.app
pm2 status

Expected healthy responses:

  • http://127.0.0.1:3001 returns 200 OK.
  • http://127.0.0.1:3000/api/auth/me returns 401 Unauthorized without JWT.
  • http://tbp.flatlogic.app returns 200 OK.
  • PM2 shows all four apps online.
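
The expected responses above can be checked mechanically instead of by eye. A minimal sketch, assuming `curl -sI` output where the status code is the second field of the first line; the function name `expect_status` is hypothetical.

```shell
# Hypothetical helper: read `curl -sI` output on stdin and compare the status
# code on the first line with the expected one.
expect_status() {
  expected=$1
  # Strip the trailing carriage return, then take the second field.
  actual=$(awk 'NR == 1 { sub(/\r$/, ""); print $2 }')
  if [ "$actual" = "$expected" ]; then
    echo "ok $actual"
  else
    echo "FAIL expected $expected, got ${actual:-no response}"
    return 1
  fi
}

# Usage on the VM:
#   curl -sI http://127.0.0.1:3001 | expect_status 200
#   curl -sI http://127.0.0.1:3000/api/auth/me | expect_status 401
#   curl -sI http://tbp.flatlogic.app | expect_status 200
```

A non-zero return makes the helper usable in a script that should stop at the first failed check.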

Recovering From Apache 503 Service Unavailable

If Apache returns:

Service Unavailable
Apache/2.4.x Server at tbp.flatlogic.app Port 80

first check whether upstream app processes are listening:

sudo ss -ltnp | grep -E ':(80|3001|3000)\b'
curl -I http://127.0.0.1:3001
curl -I http://127.0.0.1:3000/api/auth/me
sudo systemctl status pm2-ubuntu --no-pager

If Apache is listening on port 80 but nothing is listening on 3001 and 3000, PM2 either failed to restore its process list or was stopped. Restart it:

sudo systemctl reset-failed pm2-ubuntu
sudo systemctl restart pm2-ubuntu
pm2 status

Then re-run the health checks.
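
The "Apache up, upstreams down" pattern can be detected from the socket list alone. A minimal sketch, assuming the usual `ss -ltn` column layout (state in column 1, local address and port in column 4); the function name `missing_ports` is hypothetical.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and print each required
# port that has no listening socket. Ports are matched exactly, so :8080 does
# not count as :80.
missing_ports() {
  # Take the text after the last colon of the local address as the port.
  listening=$(awk '$1 == "LISTEN" { n = split($4, parts, ":"); print parts[n] }')
  for port in "$@"; do
    printf '%s\n' "$listening" | grep -qx "$port" || printf '%s\n' "$port"
  done
}

# Usage on the VM:
#   missing=$(sudo ss -ltn | missing_ports 80 3001 3000)
#   [ -z "$missing" ] || echo "not listening: $missing"
```

If the output is 3001 and 3000 but not 80, the PM2 restart described above is the likely fix.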

OOM-Kill Diagnosis

A VM can have enough disk space and still fail if a memory spike makes the kernel kill PM2 or one of its child processes. Check the kernel logs:

journalctl -k --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" \
  | grep -Ei 'oom|killed process|out of memory'

Known June 2026 incident:

  • pm2-ubuntu.service failed with Result: oom-kill.
  • Kernel killed ffmpeg.
  • ffmpeg used about 3.3 GiB RSS on a 3.8 GiB RAM VM.
  • PM2 then stopped frontend-dev, backend-dev, fl-telemetry, and fl-executor.

This points to reverse video generation rather than Apache, disk space, or frontend routing.
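
To see at a glance which process was killed and how much memory it held, the kernel lines can be reduced to a short summary. A minimal sketch, assuming the common kernel wording `Killed process <pid> (<name>) ... anon-rss:<n>kB`; the exact message format can differ between kernel versions, and the function name `oom_victims` is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<process> <pid> <anon-rss in MiB>" for each OOM kill.
# Assumes the kernel wording:
#   ... Killed process <pid> (<name>) ... anon-rss:<n>kB ...
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/\1 \3 \2/p' |
    while read -r pid rss_kb name; do
      # Convert kB to MiB with integer division.
      printf '%s %s %s\n' "$name" "$pid" "$((rss_kb / 1024))"
    done
}

# Usage on the VM:
#   journalctl -k --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" | oom_victims
```

Lines that do not match the pattern are ignored, so an empty result means either no OOM kill in the window or a different message format.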

FFmpeg and Reverse Video Generation

The backend uses bundled ffmpeg-static/ffprobe-static via backend/src/services/videoProcessing.ts; manual OS-level FFmpeg installation is not required for this project.

Reverse video generation can be memory-heavy for large videos. Operational guardrails:

  • FFmpeg reversal is serialized by videoProcessing.reverseVideo(): only one FFmpeg process runs at a time in the backend process, and additional reverse generation requests wait in an in-process queue.
  • FFmpeg reversal uses -threads 1.
  • FFmpeg reversal has a hard timeout (FFMPEG_REVERSE_TIMEOUT_MS, default 600000, exposed as config.resilience.ffmpeg.reverseTimeoutMs) and kills the child process if it exceeds the limit.
  • FFmpeg reversal is protected by an in-process circuit breaker (FFMPEG_BREAKER_FAILURE_THRESHOLD, FFMPEG_BREAKER_COOLDOWN_MS, FFMPEG_BREAKER_SUCCESS_THRESHOLD, exposed under config.resilience.ffmpeg.breaker) so repeated media failures stop launching new heavy jobs during the cooldown window.
  • FFprobe metadata extraction has a timeout (FFPROBE_TIMEOUT_MS, default 30000, exposed as config.resilience.ffmpeg.ffprobeTimeoutMs).
  • TourPagesService deduplicates reverse generation for the same source video storage key.
  • Treat large source videos as risky on small VMs.
  • Check backend PM2 logs for ffmpeg or publish/save background errors.
  • If the VM OOMs, inspect kernel logs before changing Apache or database config.
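
When reproducing a heavy media command by hand on the VM, it helps to apply the same kind of hard timeout. The sketch below is a shell analogue only: the backend enforces its limit in-process in videoProcessing.ts, and this wrapper is not that implementation. It assumes GNU coreutils `timeout`; the function name `run_with_reverse_timeout` and the 5-second kill grace period are illustrative assumptions.

```shell
# Hypothetical wrapper: run a command under a hard time limit taken from
# FFMPEG_REVERSE_TIMEOUT_MS (documented default 600000 ms, i.e. 10 minutes).
run_with_reverse_timeout() {
  timeout_ms=${FFMPEG_REVERSE_TIMEOUT_MS:-600000}
  timeout_s=$((timeout_ms / 1000))
  # GNU timeout treats 0 as "no limit", so never go below one second.
  [ "$timeout_s" -ge 1 ] || timeout_s=1
  # Send TERM at the limit, then KILL 5 seconds later if still running.
  timeout --kill-after=5 "$timeout_s" "$@"
}

# Usage on the VM (command left as a placeholder):
#   run_with_reverse_timeout <command> <args...>
#   echo "exit status: $?"   # 124 means the limit was hit
```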

Remaining hardening work and follow-up:

  • Add input duration/resolution/size checks before reversal.
  • Structured logs now include reverse-video input/output size and probed media metadata. Continue tuning rejection thresholds as real VM media patterns are observed.
  • Consider running media processing in a separate worker with memory limits.

Logs

Useful log commands:

sudo journalctl -u pm2-ubuntu -n 200 --no-pager
pm2 logs frontend-dev --lines 100
pm2 logs backend-dev --lines 100
pm2 logs fl-executor --lines 100
pm2 logs fl-telemetry --lines 100
sudo tail -n 100 /var/log/apache2/error.log

pm2 logs prints the requested lines and then keeps streaming new output. Press Ctrl-C before running the next command.
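
For a one-shot snapshot that does not block the terminal, pm2 logs accepts --nostream. A minimal sketch that collects the recent lines of all four apps in one pass; the function name `collect_pm2_logs` is hypothetical.

```shell
# Hypothetical helper: print the last N log lines of every expected PM2 app
# without entering streaming mode.
collect_pm2_logs() {
  lines=${1:-100}
  for app in frontend-dev backend-dev fl-executor fl-telemetry; do
    echo "== $app =="
    pm2 logs "$app" --lines "$lines" --nostream
  done
}

# Usage on the VM:
#   collect_pm2_logs 100 > /tmp/pm2-logs.txt
```

Application logs can contain secrets, so treat the collected file with the same care as the PM2 dump.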

Executor Notes

The standard VM executor.js in ~/executor is not the web app startup script. It handles VM commands, VCS operations, AI runner prompts, screenshots, and telemetry. Starting it manually does not start the frontend/backend app.

Executor workspace path:

/home/ubuntu/executor/workspace

The executor can perform git operations when commanded, including reset/clean workflows through VCS commands. Do not run executor commands blindly when the goal is only to restore the web app. Use PM2/systemd for process recovery.

Node Version

The project requires Node.js 20.x LTS. On some standard VMs, PM2 may report the interpreter at /usr/bin/node as Node 22. If startup fails after a system update, verify:

node -v
which node
pm2 describe backend-dev
pm2 describe frontend-dev

Changing the VM Node version should be coordinated with PM2 startup paths and a full frontend/backend build check.
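
A quick way to flag a version drift is to compare the major version against the requirement. A minimal sketch, assuming the `vMAJOR.MINOR.PATCH` format printed by node -v; the function name `check_node_major` is hypothetical.

```shell
# Hypothetical helper: compare the major version in `node -v` output
# (for example "v20.11.1") with the required major version.
check_node_major() {
  required=$1
  version=${2#v}          # strip the leading "v"
  major=${version%%.*}    # keep everything before the first dot
  if [ "$major" = "$required" ]; then
    echo "node $major matches"
  else
    echo "node ${major:-unknown} does not match required $required"
    return 1
  fi
}

# Usage on the VM:
#   check_node_major 20 "$(node -v)"
```

This checks the node binary on the shell PATH; the interpreter PM2 launched the apps with can differ, so compare the result with the pm2 describe output as well.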

Persistence

After changing PM2 process definitions, save the process list:

pm2 save

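For an incident-only restart where the process definitions were unchanged, pm2 save is still safe and keeps the expected app list for the next reboot. Confirm that pm2 status shows all four apps first, because pm2 save persists whatever is currently in the process list.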