docs: capture 27-29 May incident lessons (two-tier env precedence, --insecure, SSH access) + gitignore .claude.local.md
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent 663b7d98ba
commit 14ab8d0f76
1
.gitignore
vendored
@@ -16,6 +16,7 @@ media/

# Claude Code / IDE
.claude/
.claude.local.md
.vscode/
.idea/

167
CLAUDE.md
@@ -3,13 +3,36 @@
## What's mid-flight: read this first
**Parked / deferred work:** see `docs/plans/parked-work.md`.

**Production status (17 May 2026):** ✅ **fully caught up & verified.**
The 36-commit bundle (Manager/Salaried Pay + pay-type filter + Salary
auto-scope picker + Pay Salary dashboard quick action) is **deployed
and confirmed working on production** (`https://foxlog.flatlogic.app/`,
Konrad verified 17 May 2026). `origin/ai-dev` HEAD `80d96d7` == prod
(the only delta over the functional tip `4c25011` is doc breadcrumbs).
Migrations `0016`/`0017` applied; `static/css/custom.css` collected.

**Production status (29 May 2026):** ✅ **fully caught up, verified,
and recovered from a 27-29 May platform incident.** The 36-commit
bundle (Manager/Salaried Pay + pay-type filter + Salary auto-scope
picker + Pay Salary dashboard quick action) is **deployed and
confirmed working on production** (`https://foxlog.flatlogic.app/`,
Konrad verified 17 May 2026 and re-verified post-incident 29 May
2026 via a live test payment + Spark Receipt delivery).
`origin/ai-dev` HEAD `80d96d7` == prod (the only delta over the
functional tip `4c25011` is doc breadcrumbs). Migrations
`0016`/`0017` applied; `static/css/custom.css` collected.

**🔥 Incident 27-29 May 2026 (now closed), what future sessions
need to know:** Cloudflare Tunnel error 1033 (27 May) → suspected
SSH-activate side-effect wiped `/home/ubuntu/executor/.env` → payment
500s with `ValueError: Invalid address ""` (28 May) → Flatlogic AI
agent returned only `AI agent failed with exit code -1` for hours
before Erik restored it (29 May) → discovered
`/etc/flatcloud/python-secrets.env` overrides `.env` via a systemd
drop-in → restored Gmail credentials in the secrets file → flipping
`DJANGO_DEBUG=false` killed static-file serving (runserver requires
`--insecure` for that with `DEBUG=False`) → added `--insecure` to the
systemd unit. **Three lessons baked into the sections below:**
(1) "Where env vars live on Flatlogic": two-tier env file
precedence; (2) "Flatlogic/AppWizzy Deployment": `--insecure` is
now required in the systemd unit; (3) new "SSH access on the VM"
section: direct VM shell is now available, key in Konrad's
password manager. **Strategic side note:** SSH access closes the
#1 risk identified in the platform-risk memo at
`C:\Users\konra\.claude\plans\prancy-painting-brook.md` (off-platform
backup of `media/` is now feasible via `rsync`).

**🔧 In progress, local only, NOT pushed (HARD STOP):** removal of
the "Log Today's Work" / **SiteReport** feature (Konrad wants to
@@ -889,12 +912,63 @@ than the authenticated one (e.g. "FoxFitt Payroll <payroll@foxfitt.co.za>"),
set `DEFAULT_FROM_EMAIL` explicitly, but Gmail will likely rewrite it to the
authenticated user anyway unless you've configured a "Send mail as" alias.

### Where env vars live on Flatlogic
Flatlogic's platform has no env-var UI. Values are set in a `.env` file at
`BASE_DIR.parent / ".env"` on the VM (one level up from the repo). Edit via
Gemini/shell; the user cannot modify via Flatlogic's web editor because
`.env` is outside the project tree. The file is loaded by
`python-dotenv` in `config/settings.py` before any `os.getenv()` calls.

### Where env vars live on Flatlogic: TWO files, second wins (learned the hard way 29 May 2026)

Flatlogic's platform has no env-var UI. The Django service reads env
from **two** files at runtime, listed here from lowest to highest
precedence:

1. **`/home/ubuntu/executor/.env`**: at `BASE_DIR.parent / ".env"` on
   the VM, one level up from the repo. Loaded by `python-dotenv` in
   `config/settings.py` at process startup. User-editable via
   Gemini/shell (or now SSH).
2. **`/etc/flatcloud/python-secrets.env`**: root-owned, managed by
   Flatlogic's platform image. Loaded by **systemd** itself via a
   drop-in unit at `/etc/systemd/system/django-dev.service.d/flatcloud-env.conf`
   that has an `EnvironmentFile=/etc/flatcloud/python-secrets.env`
   directive.

**Precedence:** when both files set the same key, **the systemd-loaded
secrets file wins** because systemd injects its variables into the
process environment BEFORE `manage.py runserver` starts, and
`python-dotenv`'s default `load_dotenv()` does NOT override existing
env vars. So `.env` only "wins" for keys the secrets file doesn't set.

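
The precedence rule above can be illustrated with a small stdlib-only simulation. This is a sketch, not `python-dotenv` itself: `load_without_override` is a hypothetical stand-in that mimics the default behaviour described above (existing variables are kept), and all values are made up for illustration.

```python
def load_without_override(environ: dict, dotenv_values: dict) -> dict:
    """Mimic load_dotenv()'s default (override=False): keep keys already set."""
    merged = dict(environ)
    for key, value in dotenv_values.items():
        merged.setdefault(key, value)  # only fills keys that are missing
    return merged


# Step 1: systemd injects the secrets file before the process starts.
process_env = {"EMAIL_HOST_USER": "ses-placeholder", "DJANGO_DEBUG": "true"}

# Step 2: settings.py then loads .env, which cannot replace existing keys.
dotenv_file = {
    "EMAIL_HOST_USER": "konrad@foxfitt.co.za",
    "DJANGO_DEBUG": "false",
    "USE_SQLITE": "false",  # example of a key set only in .env
}

effective = load_without_override(process_env, dotenv_file)
print(effective)
```

In this simulation the `.env` values for `EMAIL_HOST_USER` and `DJANGO_DEBUG` are ignored, while `USE_SQLITE` comes through because the secrets file does not set it.
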
**Practical consequence (the bit that cost hours on 29 May 2026):**
editing `.env` to fix `EMAIL_HOST_USER`/`EMAIL_HOST_PASSWORD`/
`DJANGO_DEBUG` had ZERO effect because all three were also set in
`python-secrets.env` (often to wrong values: Flatlogic's recovery
template installs **AWS SES placeholder credentials** that conflict
with our Gmail SMTP config). The fix is always: edit the secrets file
with `sudo`, then `sudo systemctl restart django-dev.service`.

**Diagnostic command to list the env files systemd injects** (the
`.env` file is read later by `python-dotenv`, so it appears here only
if a unit also references it via `EnvironmentFile=`):
```bash
systemctl show django-dev.service --property=EnvironmentFiles
```

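
To see which `.env` keys are being silently overridden, the two files can be compared by key name only. This is a minimal sketch, assuming both files use plain `KEY=value` lines; `parse_env_keys` and `shadowed_keys` are hypothetical helpers, the sample contents are invented, and no values are ever printed.

```python
def parse_env_keys(text: str) -> set:
    """Return the key names from KEY=value lines, skipping blanks and comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def shadowed_keys(dotenv_text: str, secrets_text: str) -> list:
    """Keys set in both files; for these the systemd-loaded value wins."""
    return sorted(parse_env_keys(dotenv_text) & parse_env_keys(secrets_text))


# Invented sample contents standing in for the two real files.
dotenv_sample = "EMAIL_HOST_USER=a@example.com\nDJANGO_DEBUG=false\nUSE_SQLITE=false\n"
secrets_sample = "# managed by platform\nEMAIL_HOST_USER=placeholder\nDJANGO_DEBUG=true\n"

print(shadowed_keys(dotenv_sample, secrets_sample))
```

Any key this reports has to be fixed in the secrets file, because editing it in `.env` has no effect.
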
**Recovery playbook if email/payments break after a platform incident:**
1. `sudo grep EMAIL_HOST /etc/flatcloud/python-secrets.env` (run in a
   private terminal; the match includes `EMAIL_HOST_PASSWORD`): does it
   show AWS-shaped values (long base64-ish keys) instead of
   `konrad@foxfitt.co.za` + a Gmail App Password? That's the symptom.
2. `sudo nano /etc/flatcloud/python-secrets.env` and set:
   - `EMAIL_HOST=smtp.gmail.com`
   - `EMAIL_PORT=587`
   - `EMAIL_USE_TLS=True`
   - `EMAIL_HOST_USER=konrad@foxfitt.co.za`
   - `EMAIL_HOST_PASSWORD=<16-char Gmail App Password, no spaces>`
   - `DEFAULT_FROM_EMAIL=konrad@foxfitt.co.za`
   - `DJANGO_DEBUG=false`
3. `sudo systemctl restart django-dev.service`
4. Verify: `sudo systemctl status django-dev.service` shows
   `active (running)`; first test payment delivers a payslip to Spark
   Receipt.

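
The values in step 2 can also be checked programmatically without echoing any secret. This is a sketch under stated assumptions: `email_env_problems` is a hypothetical helper, the expected values are copied from the playbook above, and the input is a plain dict of key/value pairs that you build yourself from the secrets file.

```python
# Expected fixed values, taken from step 2 of the playbook.
EXPECTED = {
    "EMAIL_HOST": "smtp.gmail.com",
    "EMAIL_PORT": "587",
    "EMAIL_USE_TLS": "True",
}
# Keys that only need to be present and non-empty (values are never shown).
NON_EMPTY = ["EMAIL_HOST_USER", "EMAIL_HOST_PASSWORD", "DEFAULT_FROM_EMAIL"]


def email_env_problems(env: dict) -> list:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = []
    for key, expected in EXPECTED.items():
        if env.get(key) != expected:
            problems.append(f"{key} is not {expected!r}")
    for key in NON_EMPTY:
        if not env.get(key, "").strip():
            problems.append(f"{key} is missing or empty")
    return problems


# Invented example: a wiped config with only the host left in place.
print(email_env_problems({"EMAIL_HOST": "smtp.gmail.com"}))
```

An empty or missing sender address is the kind of state that surfaced as `ValueError: Invalid address ""` during the incident, so a check like this is a cheap pre-flight before the first test payment.
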
**Security note:** the secrets file contains the Gmail App Password,
`DJANGO_SECRET_KEY`, and DB credentials. Never `cat` it into the
Flatlogic AI agent chat or this Claude session, because values leak
into transcripts. Use
`sudo grep -c KEY_NAME /etc/flatcloud/python-secrets.env` to confirm
presence: it prints only the number of matching lines, not the value.

## Flatlogic/AppWizzy Deployment
- **Branches**: `ai-dev` = development (Flatlogic AI + Claude Code). `master` = deploy target.
@@ -902,7 +976,9 @@ Gemini/shell; the user cannot modify via Flatlogic's web editor because
- **Deploy from Git** (Settings): Full rebuild from `master`; use for production
- **Migrations**: Sometimes run automatically during rebuild, but NOT always reliable. If you get "Unknown column" errors after pulling latest, visit `/run-migrate/` in the browser to apply pending migrations manually. This endpoint runs `python manage.py migrate` on the production MySQL database.
- **Static files**: Flatlogic's rebuild does NOT auto-run `collectstatic`. After CSS/JS changes, have Gemini run `python3 manage.py collectstatic --noinput` + restart the service, otherwise Apache keeps serving the previously-collected copy.
- **Service**: The Django app runs as `django-dev.service` (systemd). Gemini restarts it via `sudo systemctl restart django-dev.service`. It runs `python manage.py runserver 0.0.0.0:8000`, a **development server**, not gunicorn/uwsgi (Flatlogic default, works fine at this scale).
- **Service**: The Django app runs as `django-dev.service` (systemd). Gemini restarts it via `sudo systemctl restart django-dev.service`. It runs `python manage.py runserver 0.0.0.0:8000 --insecure`, a **development server**, not gunicorn/uwsgi (Flatlogic default, works fine at this scale).
- **⚠ The `--insecure` flag on runserver is REQUIRED in production (added 29 May 2026).** With `DEBUG=False` (the correct production state), Django's `runserver` refuses to serve `/static/` files by default: every CSS/JS request returns 404, and the dashboard renders as plain unstyled HTML. The `--insecure` flag explicitly opts in to serving static files even with DEBUG off. **If you ever see "everything works but the page looks unstyled" after a deploy:** check the `ExecStart=` line in `/etc/systemd/system/django-dev.service` (or its drop-in directory). If `--insecure` is missing, add it, then `sudo systemctl daemon-reload && sudo systemctl restart django-dev.service`. The proper long-term fix is an Apache `Alias /static/ → staticfiles/` directive that bypasses Django entirely, but `--insecure` is a stable workaround.
- **⚠ Cloudflare HIT-caches 404 responses for ~4h.** If a static-file URL returned 404 at any point, Cloudflare will keep serving that 404 even after you fix the underlying problem. To verify a fix without waiting for the TTL: append a random query string (`?cb=$(date +%s)`). That's a cache key Cloudflare hasn't seen, so it fetches from origin. The Flatlogic preview iframe sometimes shows cached-working CSS while a fresh browser tab shows the cached 404; trust the browser tab, not the iframe.
- **⚠ DEPLOY ORDERING: pull THEN restart, not the reverse.** Production runs `DEBUG=False`, so Django uses the **cached template loader**: every `.html` template is compiled into memory once at process start and is NEVER re-read from disk until the process restarts. Symptom of getting this wrong: "I pulled the code, `git log` shows the right commit, but the page still looks old." Cause: the `restart` happened *before* the code reached the target commit (e.g. Flatlogic auto-pulled afterward, or Gemini pulled after restarting). **Fix: restart AGAIN, after confirming `git log --oneline -1` is at the target commit.** Correct deploy order is ALWAYS: (1) `git fetch github ai-dev && git reset --hard github/ai-dev`, (2) `/run-migrate/` if there are new migrations, (3) `collectstatic` if `static/` changed, (4) `sudo systemctl restart django-dev.service` **last**. Template-only changes still need the restart (cached loader), unlike local dev where `DEBUG=True` re-reads templates per request. Bit us 15 May 2026: 14 commits of template fixes were "invisible" on prod until a second restart. `git reset --hard github/ai-dev` (not `git pull`) is preferred because the VM accumulates Flatlogic-editor autosave commits that make a plain pull conflict.
- **CDN**: All production traffic goes through Cloudflare. Response headers show `cf-ray`/`cf-cache-status`. Static assets are cached at the edge for 4h; see "Static Assets & Cache-Busting" section for how the `deployment_timestamp` token breaks stale caches.
- **Never edit `ai-dev` directly on GitHub**: Flatlogic pushes overwrite it
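
The `ExecStart=` check described in the `--insecure` bullet above can be sketched in a few lines. This is an illustration under assumptions: `runserver_has_insecure` is a hypothetical helper, the interpreter path in the samples is invented, and the input is whatever unit text you paste in (for example the output of `systemctl cat django-dev.service`).

```python
import shlex


def runserver_has_insecure(unit_text: str) -> bool:
    """True if the last non-empty ExecStart= runs runserver with --insecure."""
    command = None
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("ExecStart="):
            value = line[len("ExecStart="):].strip()
            if value:  # an empty "ExecStart=" in a drop-in only resets the list
                command = value
    if command is None:
        return False
    tokens = shlex.split(command)
    return "runserver" in tokens and "--insecure" in tokens


# Invented unit snippets; the interpreter path is a placeholder.
broken_unit = "[Service]\nExecStart=/usr/bin/python3 manage.py runserver 0.0.0.0:8000\n"
fixed_unit = (
    "[Service]\n"
    "ExecStart=\n"
    "ExecStart=/usr/bin/python3 manage.py runserver 0.0.0.0:8000 --insecure\n"
)

print(runserver_has_insecure(broken_unit), runserver_has_insecure(fixed_unit))
```

Taking the last non-empty `ExecStart=` is a deliberate simplification so that a drop-in override is preferred over the base unit when both appear in the pasted text.
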
@@ -940,6 +1016,69 @@ Either works; pick one and stick to it per change to avoid divergence:
2. **Flatlogic UI → GitHub**: edit in Flatlogic's file editor; click "Push to GitHub" in their UI; Claude pulls locally with `git pull origin ai-dev`.
**Don't mix** paths in the same change: that's how divergence (and the "Ver XX.YY screeeewup" commits) happen.

## SSH access on the VM (added 29 May 2026)

Direct SSH access was activated during the 27-29 May incident
recovery. **The key and SSH command are stored in Konrad's password
manager. They MUST NOT be committed to git or pasted into any AI
agent chat.** SSH gives full root-equivalent access to the production
VM; treat it like a vault credential.

**When to use SSH:**
- Flatlogic AI agent is broken or unresponsive (happened on 29 May
  2026: "AI agent failed with exit code -1" for several hours).
- Need to inspect logs in real time (`journalctl -f`,
  `tail -f /var/log/...`).
- Need to run a true `mysqldump` for a full database backup
  (`/backup-data/` only dumps via the Django ORM; mysqldump is more
  complete).
- Need to `rsync` the `media/` directory off-platform for backup
  (closes the platform-risk memo's #1 gap; see
  `C:\Users\konra\.claude\plans\prancy-painting-brook.md`).
- Need to escape a Cloudflare outage (the SSH host:port is direct, not
  proxied through Cloudflare).

**When NOT to use SSH:**
- Routine deploys: keep using the GitHub→Flatlogic pull workflow.
- Anything the Flatlogic AI agent can do: the agent is normally
  faster, safer, and produces a chat-transcript audit trail.

**⚠ DO NOT click the "Activate SSH" / "Deactivate SSH" button again
casually.** The strong hypothesis from the 27 May incident is that
clicking it triggered Flatlogic's platform-side recovery process,
which wiped `/home/ubuntu/executor/.env`. SSH is already active;
leave the button alone. If a deactivation is ever genuinely needed,
take a fresh `/backup-data/` first and allocate 30 minutes for
potential recovery.

**Connection (do this from Konrad's laptop only):**
```bash
# In Git Bash (Windows) or terminal (Mac/Linux). Key path will
# differ depending on where Konrad stored the key. NEVER paste the
# actual key path or the host IP into a Claude/AI session.
chmod 600 <path-to-key> # one-time, only needed on a fresh download
ssh -i <path-to-key> -p <port> <user>@<host>
```

**Useful off-platform backup commands** (run from laptop, not on VM):
```bash
# Full SQL dump of the production DB → laptop
ssh -i <key> -p <port> <user>@<host> \
    "mysqldump --single-transaction <db_name>" > foxlog_$(date +%Y%m%d).sql

# Sync uploaded media (photos, ID docs, certs, warnings) → laptop
rsync -avz -e "ssh -i <key> -p <port>" \
    <user>@<host>:/home/ubuntu/executor/workspace/media/ \
    ./foxlog_media_backup/
```

These two commands together produce a **truly complete off-platform
backup** (DB + uploaded files), something `/backup-data/` cannot do
because Django's ORM dump doesn't include `media/`. Run them weekly,
store in a non-Flatlogic location (Google Drive / external disk /
S3-compatible bucket), and the single biggest off-platform-readiness
gap closes without leaving Flatlogic.

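
For a weekly routine, the two backup commands can be assembled by a small script so the dated filename and flags stay consistent. This is a sketch only: `backup_commands` is a hypothetical helper, and every argument in the example call is a placeholder. The real key path, port, user, host, and database name stay in the password manager and are supplied at run time.

```python
import shlex
from datetime import date


def backup_commands(key, port, user, host, db_name, today=None):
    """Build argv lists mirroring the mysqldump-over-ssh and rsync commands."""
    today = today or date.today()
    target = f"{user}@{host}"
    # Remote mysqldump; the caller redirects stdout into dump_file.
    dump = ["ssh", "-i", key, "-p", str(port), target,
            f"mysqldump --single-transaction {shlex.quote(db_name)}"]
    dump_file = f"foxlog_{today:%Y%m%d}.sql"
    # rsync takes the ssh invocation as one string via -e.
    sync = ["rsync", "-avz", "-e", f"ssh -i {shlex.quote(key)} -p {port}",
            f"{target}:/home/ubuntu/executor/workspace/media/",
            "./foxlog_media_backup/"]
    return dump, dump_file, sync


# Placeholder values for illustration only.
dump, dump_file, sync = backup_commands(
    "/path/to/key", 2222, "someuser", "host.invalid", "somedb",
    today=date(2026, 5, 29),
)
print(dump_file)
print(sync)
```

The lists are shaped for `subprocess.run(dump, stdout=open(dump_file, "wb"), check=True)` and `subprocess.run(sync, check=True)`. Building argv lists rather than shell strings avoids quoting problems on the laptop side.
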
## Security Notes
- Production: `SESSION_COOKIE_SECURE=True`, `CSRF_COOKIE_SECURE=True`, `SameSite=None` (cross-origin for Flatlogic iframe)
- Local dev: Secure cookies disabled when `USE_SQLITE=true`