From 14ab8d0f76cae38ce466c8bba15bf28db420ffe4 Mon Sep 17 00:00:00 2001
From: Konrad du Plessis
Date: Fri, 12 Jun 2026 17:36:45 +0200
Subject: [PATCH] docs: capture 27-29 May incident lessons (two-tier env precedence, --insecure, SSH access) + gitignore .claude.local.md

Co-Authored-By: Claude Fable 5
---
 .gitignore |   1 +
 CLAUDE.md  | 167 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 154 insertions(+), 14 deletions(-)

diff --git a/.gitignore b/.gitignore
index 4fa2755..7dcbc90 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@ media/
 
 # Claude Code / IDE
 .claude/
+.claude.local.md
 .vscode/
 .idea/
 
diff --git a/CLAUDE.md b/CLAUDE.md
index a50c50e..03fea03 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -3,13 +3,36 @@
 ## What's mid-flight - read this first
 **Parked / deferred work:** see `docs/plans/parked-work.md`.
 
-**Production status (17 May 2026):** ✅ **fully caught up & verified.**
-The 36-commit bundle (Manager/Salaried Pay + pay-type filter + Salary
-auto-scope picker + Pay Salary dashboard quick action) is **deployed
-and confirmed working on production** (`https://foxlog.flatlogic.app/`,
-Konrad verified 17 May 2026). `origin/ai-dev` HEAD `80d96d7` == prod
-(the only delta over the functional tip `4c25011` is doc breadcrumbs).
-Migrations `0016`/`0017` applied; `static/css/custom.css` collected.
+**Production status (29 May 2026):** ✅ **fully caught up, verified,
+and recovered from a 27-29 May platform incident.** The 36-commit
+bundle (Manager/Salaried Pay + pay-type filter + Salary auto-scope
+picker + Pay Salary dashboard quick action) is **deployed and
+confirmed working on production** (`https://foxlog.flatlogic.app/`,
+Konrad verified 17 May 2026 and re-verified post-incident 29 May
+2026 via a live test payment + Spark Receipt delivery).
+`origin/ai-dev` HEAD `80d96d7` == prod (the only delta over the
+functional tip `4c25011` is doc breadcrumbs). Migrations
+`0016`/`0017` applied; `static/css/custom.css` collected.
+
+**🔥 Incident 27-29 May 2026 (now closed) - what future sessions
+need to know:** Cloudflare Tunnel error 1033 (27 May) → suspected
+SSH-activate side-effect wiped `/home/ubuntu/executor/.env` → payment
+500s with `ValueError: Invalid address ""` (28 May) → Flatlogic AI
+agent returned only `AI agent failed with exit code -1` for hours
+before Erik restored it (29 May) → discovered
+`/etc/flatcloud/python-secrets.env` overrides `.env` via a systemd
+drop-in → restored Gmail credentials in the secrets file → flipping
+`DJANGO_DEBUG=false` killed static-file serving (runserver requires
+`--insecure` for that with `DEBUG=False`) → added `--insecure` to the
+systemd unit. **Three lessons baked into the sections below:**
+(1) "Where env vars live on Flatlogic" - two-tier env file
+precedence; (2) "Flatlogic/AppWizzy Deployment" - `--insecure` is
+now required in the systemd unit; (3) new "SSH access on the VM"
+section - direct VM shell is now available, key in Konrad's
+password manager. **Strategic side note:** SSH access closes the
+#1 risk identified in the platform-risk memo at
+`C:\Users\konra\.claude\plans\prancy-painting-brook.md` (off-platform
+backup of `media/` is now feasible via `rsync`).
 
 **🔧 In progress - local only, NOT pushed (HARD STOP):** removal of the
 "Log Today's Work" / **SiteReport** feature (Konrad wants to
@@ -889,12 +912,63 @@ than the authenticated one (e.g. "FoxFitt Payroll "), set
 `DEFAULT_FROM_EMAIL` explicitly - but Gmail will likely rewrite it to the
 authenticated user anyway unless you've configured a "Send mail as" alias.
 
-### Where env vars live on Flatlogic
-Flatlogic's platform has no env-var UI. Values are set in a `.env` file at
-`BASE_DIR.parent / ".env"` on the VM (one level up from the repo). Edit via
-Gemini/shell - the user cannot modify via Flatlogic's web editor because
-`.env` is outside the project tree. The file is loaded by
-`python-dotenv` in `config/settings.py` before any `os.getenv()` calls.
+### Where env vars live on Flatlogic - TWO files, second wins (learned the hard way 29 May 2026)
+
+Flatlogic's platform has no env-var UI. The Django service reads env
+from **two** files at runtime, loaded in this order:
+
+1. **`/home/ubuntu/executor/.env`** - at `BASE_DIR.parent / ".env"` on
+   the VM, one level up from the repo. Loaded by `python-dotenv` in
+   `config/settings.py` at process startup. User-editable via
+   Gemini/shell (or now SSH).
+2. **`/etc/flatcloud/python-secrets.env`** - root-owned, managed by
+   Flatlogic's platform image. Loaded by **systemd** itself via a
+   drop-in unit at `/etc/systemd/system/django-dev.service.d/flatcloud-env.conf`
+   that has an `EnvironmentFile=/etc/flatcloud/python-secrets.env`
+   directive.
+
+**Precedence:** when both files set the same key, **the systemd-loaded
+secrets file wins** because systemd injects its variables into the
+process environment BEFORE `manage.py runserver` starts, and
+`python-dotenv`'s default `load_dotenv()` does NOT override existing
+env vars. So `.env` only "wins" for keys the secrets file doesn't set.
+
+**Practical consequence (the bit that cost hours on 29 May 2026):**
+editing `.env` to fix `EMAIL_HOST_USER`/`EMAIL_HOST_PASSWORD`/
+`DJANGO_DEBUG` had ZERO effect because all three were also set in
+`python-secrets.env` (often to wrong values - Flatlogic's recovery
+template installs **AWS SES placeholder credentials** that conflict
+with our Gmail SMTP config). The fix is always: edit the secrets file
+with `sudo`, then `sudo systemctl restart django-dev.service`.
+
+**Diagnostic command to confirm both files are loaded and in what
+order:**
+```bash
+systemctl show django-dev.service --property=EnvironmentFiles
+```
+
+**Recovery playbook if email/payments break after a platform incident:**
+1. `sudo grep EMAIL_HOST /etc/flatcloud/python-secrets.env` - does it
+   show AWS-shaped values (long base64-ish keys) instead of
+   `konrad@foxfitt.co.za` + a Gmail App Password? That's the symptom.
+2. `sudo nano /etc/flatcloud/python-secrets.env` - set:
+   - `EMAIL_HOST=smtp.gmail.com`
+   - `EMAIL_PORT=587`
+   - `EMAIL_USE_TLS=True`
+   - `EMAIL_HOST_USER=konrad@foxfitt.co.za`
+   - `EMAIL_HOST_PASSWORD=<16-char Gmail App Password, no spaces>`
+   - `DEFAULT_FROM_EMAIL=konrad@foxfitt.co.za`
+   - `DJANGO_DEBUG=false`
+3. `sudo systemctl restart django-dev.service`
+4. Verify: `sudo systemctl status django-dev.service` shows
+   `active (running)`; first test payment delivers a payslip to Spark
+   Receipt.
+
+**Security note:** the secrets file contains the Gmail App Password,
+`DJANGO_SECRET_KEY`, and DB credentials. Never `cat` it into the
+Flatlogic AI agent chat or this Claude session - values leak into
+transcripts. Use `grep -c KEY_NAME` to confirm presence without
+printing the value.
 
 ## Flatlogic/AppWizzy Deployment
 - **Branches**: `ai-dev` = development (Flatlogic AI + Claude Code). `master` = deploy target.
@@ -902,7 +976,9 @@ Gemini/shell - the user cannot modify via Flatlogic's web editor because
 - **Deploy from Git** (Settings): Full rebuild from `master` - use for production
 - **Migrations**: Sometimes run automatically during rebuild, but NOT always reliable. If you get "Unknown column" errors after pulling latest, visit `/run-migrate/` in the browser to apply pending migrations manually. This endpoint runs `python manage.py migrate` on the production MySQL database.
 - **Static files**: Flatlogic's rebuild does NOT auto-run `collectstatic`. After CSS/JS changes have Gemini run `python3 manage.py collectstatic --noinput` + restart the service, otherwise Apache keeps serving the previously-collected copy.
-- **Service**: The Django app runs as `django-dev.service` (systemd). Gemini restarts it via `sudo systemctl restart django-dev.service`. It runs `python manage.py runserver 0.0.0.0:8000` - a **development server**, not gunicorn/uwsgi (Flatlogic default, works fine at this scale).
+- **Service**: The Django app runs as `django-dev.service` (systemd). Gemini restarts it via `sudo systemctl restart django-dev.service`. It runs `python manage.py runserver 0.0.0.0:8000 --insecure` - a **development server**, not gunicorn/uwsgi (Flatlogic default, works fine at this scale).
+- **⚠ The `--insecure` flag on runserver is REQUIRED in production (added 29 May 2026).** With `DEBUG=False` (the correct production state), Django's `runserver` refuses to serve `/static/` files by default - every CSS/JS request returns 404, and the dashboard renders as plain unstyled HTML. The `--insecure` flag explicitly opts in to serving static files even with DEBUG off. **If you ever see "everything works but the page looks unstyled" after a deploy:** check the `ExecStart=` line in `/etc/systemd/system/django-dev.service` (or its drop-in directory) - if `--insecure` is missing, add it, then `sudo systemctl daemon-reload && sudo systemctl restart django-dev.service`. The proper long-term fix is an Apache `Alias /static/ → staticfiles/` directive that bypasses Django entirely, but `--insecure` is a stable workaround.
+- **⚠ Cloudflare HIT-caches 404 responses for ~4h.** If a static-file URL returned 404 at any point, Cloudflare will keep serving that 404 even after you fix the underlying problem. To verify a fix without waiting for the TTL: append a random query string (`?cb=$(date +%s)`) - that's a cache key Cloudflare hasn't seen, so it fetches from origin. The Flatlogic preview iframe sometimes shows cached-working CSS while a fresh browser tab shows the cached 404; trust the browser tab, not the iframe.
 - **⚠ DEPLOY ORDERING - pull THEN restart, not the reverse.** Production runs `DEBUG=False`, so Django uses the **cached template loader**: every `.html` template is compiled into memory once at process start and is NEVER re-read from disk until the process restarts. Symptom of getting this wrong: "I pulled the code, `git log` shows the right commit, but the page still looks old." Cause: the `restart` happened *before* the code reached the target commit (e.g. Flatlogic auto-pulled afterward, or Gemini pulled after restarting). **Fix: restart AGAIN, after confirming `git log --oneline -1` is at the target commit.** Correct deploy order is ALWAYS: (1) `git fetch github ai-dev && git reset --hard github/ai-dev`, (2) `/run-migrate/` if there are new migrations, (3) `collectstatic` if `static/` changed, (4) `sudo systemctl restart django-dev.service` **last**. Template-only changes still need the restart (cached loader) - unlike local dev where `DEBUG=True` re-reads templates per request. Bit us 15 May 2026: 14 commits of template fixes were "invisible" on prod until a second restart. `git reset --hard github/ai-dev` (not `git pull`) is preferred because the VM accumulates Flatlogic-editor autosave commits that make a plain pull conflict.
 - **CDN**: All production traffic goes through Cloudflare. Response headers show `cf-ray`/`cf-cache-status`. Static assets are cached at the edge for 4h - see "Static Assets & Cache-Busting" section for how the `deployment_timestamp` token breaks stale caches.
 - **Never edit `ai-dev` directly on GitHub** - Flatlogic pushes overwrite it
@@ -940,6 +1016,69 @@ Either works - pick one and stick to it per change to avoid divergence:
 2. **Flatlogic UI → GitHub**: edit in Flatlogic's file editor; click "Push to GitHub" in their UI; Claude pulls locally with `git pull origin ai-dev`.
 **Don't mix** paths in the same change - that's how divergence (and the "Ver XX.YY screeeewup" commits) happen.
 
+## SSH access on the VM (added 29 May 2026)
+
+Direct SSH access was activated during the 27-29 May incident
+recovery. **The key and SSH command are stored in Konrad's password
+manager - they MUST NOT be committed to git or pasted into any AI
+agent chat.** SSH gives full root-equivalent access to the production
+VM; treat it like a vault credential.
+
+**When to use SSH:**
+- Flatlogic AI agent is broken or unresponsive (happened on 29 May
+  2026 - "AI agent failed with exit code -1" for several hours).
+- Need to inspect logs in real time (`journalctl -f`,
+  `tail -f /var/log/...`).
+- Need to run a true `mysqldump` for a full database backup
+  (`/backup-data/` only dumps via the Django ORM; mysqldump is more
+  complete).
+- Need to `rsync` the `media/` directory off-platform for backup
+  (closes the platform-risk memo's #1 gap - see
+  `C:\Users\konra\.claude\plans\prancy-painting-brook.md`).
+- Need to escape a Cloudflare outage (the SSH host:port is direct, not
+  proxied through Cloudflare).
+
+**When NOT to use SSH:**
+- Routine deploys - keep using the GitHub→Flatlogic pull workflow.
+- Anything the Flatlogic AI agent can do - the agent is normally
+  faster, safer, and produces a chat-transcript audit trail.
+
+**⚠ DO NOT click the "Activate SSH" / "Deactivate SSH" button again
+casually.** The strong hypothesis from the 27 May incident is that
+clicking it triggered Flatlogic's platform-side recovery process,
+which wiped `/home/ubuntu/executor/.env`. SSH is already active -
+leave the button alone. If a deactivation is ever genuinely needed,
+take a fresh `/backup-data/` first and allocate 30 minutes for
+potential recovery.
+
+**Connection (do this from Konrad's laptop only):**
+```bash
+# In Git Bash (Windows) or terminal (Mac/Linux). Key path will
+# differ depending on where Konrad stored the key. NEVER paste the
+# actual key path or the host IP into a Claude/AI session.
+chmod 600 # one-time, only needed on a fresh download
+ssh -i -p @
+```
+
+**Useful off-platform backup commands** (run from laptop, not on VM):
+```bash
+# Full SQL dump of the production DB → laptop
+ssh -i -p @ \
+  "mysqldump --single-transaction " > foxlog_$(date +%Y%m%d).sql
+
+# Sync uploaded media (photos, ID docs, certs, warnings) → laptop
+rsync -avz -e "ssh -i -p " \
+  @:/home/ubuntu/executor/workspace/media/ \
+  ./foxlog_media_backup/
+```
+
+These two commands together produce a **truly complete off-platform
+backup** (DB + uploaded files) - something `/backup-data/` cannot do
+because Django's ORM dump doesn't include `media/`. Run them weekly,
+store in a non-Flatlogic location (Google Drive / external disk /
+S3-compatible bucket), and the single biggest off-platform-readiness
+gap closes without leaving Flatlogic.
+
 ## Security Notes
 - Production: `SESSION_COOKIE_SECURE=True`, `CSRF_COOKIE_SECURE=True`, `SameSite=None` (cross-origin for Flatlogic iframe)
 - Local dev: Secure cookies disabled when `USE_SQLITE=true`
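
The two-file precedence described under "Where env vars live on Flatlogic" can be illustrated with a small shell sketch. It is only a stand-in and does not call `python-dotenv`. `load_env_no_override` and `demo_precedence` are hypothetical helper names, and `ses-placeholder.example` is an invented value. The loader mimics the default `load_dotenv()` behaviour of skipping keys that already exist in the environment, without the quoting and interpolation handling a real `.env` parser provides. It assumes bash with `printenv` and `mktemp` available.

```bash
#!/usr/bin/env bash
# Sketch only: simplified stand-in for a non-overriding .env load.
# Helper names and sample values are hypothetical.

# Load KEY=VALUE lines from a file, skipping keys that are already
# exported, so values injected earlier (systemd-style) keep winning.
load_env_no_override() {
  local file="$1" line key value
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines, comments, and lines without '='.
    case "$line" in
      ''|'#'*) continue ;;
      *=*) ;;
      *) continue ;;
    esac
    key="${line%%=*}"
    value="${line#*=}"
    # printenv exits 0 only when the key is already in the environment.
    if ! printenv "$key" >/dev/null; then
      export "$key=$value"
    fi
  done < "$file"
}

# Demo runs in a subshell so the caller's environment is untouched.
demo_precedence() (
  tmp="$(mktemp)"
  printf 'EMAIL_HOST=smtp.gmail.com\nDJANGO_DEBUG=false\n' > "$tmp"
  unset EMAIL_HOST DJANGO_DEBUG
  # What an EnvironmentFile= directive would inject before startup.
  export EMAIL_HOST="ses-placeholder.example"
  # What the .env load does afterwards, at process startup.
  load_env_no_override "$tmp"
  rm -f "$tmp"
  echo "EMAIL_HOST=$EMAIL_HOST DJANGO_DEBUG=$DJANGO_DEBUG"
)
```

Calling `demo_precedence` prints `EMAIL_HOST=ses-placeholder.example DJANGO_DEBUG=false`. The pre-set value survives and only the key that was not already set is taken from the file, which is consistent with `.env` edits having no effect for keys the secrets file also defines.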