docs: audit bundle deployed & verified — close breadcrumbs, park leftovers, record gitea-auth break + autosave-diff rule

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Konrad du Plessis 2026-06-12 22:35:51 +02:00
parent abfae69606
commit 4f5f1bbe13
2 changed files with 81 additions and 45 deletions

@@ -3,16 +3,16 @@
## What's mid-flight — read this first
**Parked / deferred work:** see `docs/plans/parked-work.md`.
**Production status (29 May 2026):** ✅ **fully caught up, verified,
and recovered from a 27-29 May platform incident.** The 36-commit
bundle (Manager/Salaried Pay + pay-type filter + Salary auto-scope
picker + Pay Salary dashboard quick action) is **deployed and
confirmed working on production** (`https://foxlog.flatlogic.app/`,
Konrad verified 17 May 2026 and re-verified post-incident 29 May
2026 via a live test payment + Spark Receipt delivery).
`origin/ai-dev` HEAD `80d96d7` == prod (the only delta over the
functional tip `4c25011` is doc breadcrumbs). Migrations
`0016`/`0017` applied; `static/css/custom.css` collected.
**Production status (12 Jun 2026):** ✅ **fully caught up & verified.**
`origin/ai-dev` HEAD `abfae69` == prod (deployed 12 Jun 2026 via
fetch + `reset --hard` + restart; Konrad browser-verified same day).
Live on prod: the Manager/Salaried bundle (17 May), the SiteReport
removal (migration `0018` confirmed applied), the **12 Jun audit-fix
bundle** (see breadcrumb below), and the Flatlogic preview domain in
`ALLOWED_HOSTS` (`abfae69`). Migrations applied through `0018`;
static collected (no changes since May). **Known platform nit:** the
VM can no longer push to Flatlogic's gitea mirror (auth failed since
the May incident — see "Git remotes on the VM" below).
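This deploy followed the "DEPLOY ORDERING" rule in the Deployment section. Here is a minimal sketch of that rule as a pure Python function. It only *lists* the steps and runs nothing. `DeployPlan`, `deploy_steps` and the `pre-deploy-<date>` branch name are hypothetical. "open /run-migrate/" stands in for the manual visit to that URL.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DeployPlan:
    # Hypothetical inputs for one GitHub -> VM deploy (not a real project type).
    target_sha: str                    # short SHA that github/ai-dev should land on
    vm_local_commits: Tuple[str, ...]  # VM-only commits that `reset --hard` would discard
    has_new_migrations: bool
    static_changed: bool
    date_tag: str                      # e.g. "20260612", used to name the safety branch


def deploy_steps(plan: DeployPlan) -> List[str]:
    """List the deploy actions in the documented order; executes nothing."""
    steps: List[str] = []
    # Never discard a VM commit unread: inspect each one, then anchor HEAD.
    for sha in plan.vm_local_commits:
        steps.append(f"git show --stat {sha}")
    if plan.vm_local_commits:
        steps.append(f"git branch pre-deploy-{plan.date_tag}")
    # (1) Reset to GitHub's tip (not pull: VM autosave commits make pulls conflict).
    steps.append("git fetch github ai-dev")
    steps.append("git reset --hard github/ai-dev")
    # Confirm the checkout is at the target BEFORE restarting (cached template loader).
    steps.append(f"git log --oneline -1  # expect {plan.target_sha}")
    # (2) and (3) only when needed.
    if plan.has_new_migrations:
        steps.append("open /run-migrate/")
    if plan.static_changed:
        steps.append("python manage.py collectstatic --noinput")
    # (4) Restart LAST so templates are recompiled from the new code.
    steps.append("sudo systemctl restart django-dev.service")
    return steps
```

Take the 12 Jun deploy as an example. It had one VM-local autosave (`98f66e9`), no new migrations and no `static/` changes. The plan therefore reduces to: inspect, anchor, fetch/reset, confirm, restart.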
**🔥 Incident 27-29 May 2026 (now closed) — what future sessions
need to know:** Cloudflare Tunnel error 1033 (27 May) → suspected
@@ -34,37 +34,33 @@ password manager. **Strategic side note:** SSH access closes the
`C:\Users\konra\.claude\plans\prancy-painting-brook.md` (off-platform
backup of `media/` is now feasible via `rsync`).
**SiteReport removal — pushed to origin (per git, 12 Jun 2026).** The
"Log Today's Work" / SiteReport removal (migration
`0018_delete_sitereport`, commit `7f5e4c9` + follow-ups through
`663b7d9`) is on `origin/ai-dev` — the old "local only, NOT pushed"
HARD STOP here is resolved. Production deploy status of those commits
should be confirmed with Konrad before assuming prod has them
(deploy needs: pull → `/run-migrate/` for `0018` → restart). Design
knowledge for a future rebuild lives in
**SiteReport removal — DEPLOYED (confirmed 12 Jun 2026).** Migration
`0018_delete_sitereport` shows `[X]` applied on production and the
code lineage was already on the VM. Fully resolved — nothing pending.
Design knowledge for a future rebuild lives in
`docs/plans/2026-05-17-site-report-removed-capture.md`; see also the
parked rebuild entry in `docs/plans/parked-work.md`.
**🚀 Pushed to origin, awaiting PRODUCTION deploy: 12 Jun 2026
audit-fix bundle** (`14ab8d0..2d3cc43`, 11 commits, pushed 12 Jun
2026 with Konrad's approval). A comprehensive technical audit (4
parallel review agents + manual verification of every finding)
fixed: email-failure 500 after committed payments (the 28 May
incident class), Batch Pay modal silently re-ticking unticked
workers + swallowing server errors, payments with deductions >
earnings now REFUSED (Konrad's decision — no negative
PayrollRecords), attendance date range capped at 31 days,
worker-report views survive junk query params, **worker batch
report's lifetime "Total Paid" column was inflated by the work-log
join (real display bug — fixed + regression test)**, report-page
N+1s killed, money paths standardised on Decimal. Verified before
push: suite **206 OK** + live browser checks of the Batch Pay
filter behaviour, attendance cost estimator, payslip preview, and
the 31-day range rejection (12 Jun 2026, local dev). **Deploy to
prod still pending:** pull → restart (NO new migrations; no
`static/` changes so no collectstatic; restart required — cached
template loader). Mind the deploy ordering rule: confirm the pull
landed `2d3cc43` BEFORE the restart.
**✅ 12 Jun 2026 audit-fix bundle — DEPLOYED & VERIFIED on
production** (`14ab8d0..abfae69`, 13 commits). A comprehensive
technical audit (4 parallel review agents + manual verification of
every finding) fixed: email-failure 500 after committed payments
(the 28 May incident class — interactive callers now get the warning
toast; only batch re-raises), Batch Pay modal silently re-ticking
unticked workers + swallowing server errors, **payments with
deductions > earnings now REFUSED** (Konrad's decision — no negative
PayrollRecords; exactly-zero net still allowed; see
`DeductionsExceedEarningsError` in `core/views.py`), attendance date
range capped at 31 days, worker-report views survive junk query
params (`_int_param_or_none`), **worker batch report's lifetime
"Total Paid" was inflated by the work-log join** (fixed +
`WorkerReportLifetimeTotalsTests`), report-page N+1s killed, money
display paths standardised on Decimal. Suite **206 OK**; browser
checks + prod smoke test passed (Konrad, 12 Jun 2026). The deploy
also surfaced + properly committed a VM-side autosave fix
(`abfae69`: Flatlogic preview domain in `ALLOWED_HOSTS`) — see the
"reset --hard" warning in the Deployment section. Small leftovers
parked in `docs/plans/parked-work.md` ("Audit leftovers").
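The refusal rule works like this: net pay may reach exactly zero but never go below it. `Decimal` keeps that boundary exact. What follows is a standalone sketch, not the project's code. `NegativeNetPayError` and `net_pay` are hypothetical names. The real guard is `DeductionsExceedEarningsError` in `core/views.py`, and this sketch does not reproduce its signature.

```python
from decimal import Decimal


class NegativeNetPayError(ValueError):
    """Hypothetical: raised when deductions would push net pay below zero."""


def net_pay(earnings: Decimal, deductions: Decimal) -> Decimal:
    """Return earnings - deductions; a zero result is allowed, a negative one is refused."""
    net = earnings - deductions
    if net < 0:
        raise NegativeNetPayError(
            f"deductions {deductions} exceed earnings {earnings}"
        )
    return net
```

Using `Decimal` rather than `float` matters most at the zero boundary. With floats, `0.1 + 0.2` evaluates to `0.30000000000000004`. An exactly-zero net can therefore land a hair below zero and trigger a spurious refusal.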
**🧊 Backburner — do NOT start in `ai-dev`:** Phase A.2 (manual
JournalEntry UI) and Phase B (Letterly inbound webhook) are
@@ -1009,7 +1005,7 @@ printing the value.
- **Service**: The Django app runs as `django-dev.service` (systemd). Gemini restarts it via `sudo systemctl restart django-dev.service`. It runs `python manage.py runserver 0.0.0.0:8000 --insecure` — a **development server**, not gunicorn/uwsgi (Flatlogic default, works fine at this scale).
- **⚠ The `--insecure` flag on runserver is REQUIRED in production (added 29 May 2026).** With `DEBUG=False` (the correct production state), Django's `runserver` refuses to serve `/static/` files by default — every CSS/JS request returns 404, and the dashboard renders as plain unstyled HTML. The `--insecure` flag explicitly opts in to serving static files even with DEBUG off. **If you ever see "everything works but the page looks unstyled" after a deploy:** check the `ExecStart=` line in `/etc/systemd/system/django-dev.service` (or its drop-in directory) — if `--insecure` is missing, add it, then `sudo systemctl daemon-reload && sudo systemctl restart django-dev.service`. The proper long-term fix is an Apache `Alias /static/ → staticfiles/` directive that bypasses Django entirely, but `--insecure` is a stable workaround.
- **⚠ Cloudflare HIT-caches 404 responses for ~4h.** If a static-file URL returned 404 at any point, Cloudflare will keep serving that 404 even after you fix the underlying problem. To verify a fix without waiting for the TTL: append a random query string (`?cb=$(date +%s)`) — that's a cache key Cloudflare hasn't seen, so it fetches from origin. The Flatlogic preview iframe sometimes shows cached-working CSS while a fresh browser tab shows the cached 404; trust the browser tab, not the iframe.
- **⚠ DEPLOY ORDERING — pull THEN restart, not the reverse.** Production runs `DEBUG=False`, so Django uses the **cached template loader**: every `.html` template is compiled into memory once at process start and is NEVER re-read from disk until the process restarts. Symptom of getting this wrong: "I pulled the code, `git log` shows the right commit, but the page still looks old." Cause: the `restart` happened *before* the code reached the target commit (e.g. Flatlogic auto-pulled afterward, or Gemini pulled after restarting). **Fix: restart AGAIN, after confirming `git log --oneline -1` is at the target commit.** Correct deploy order is ALWAYS: (1) `git fetch github ai-dev && git reset --hard github/ai-dev`, (2) `/run-migrate/` if there are new migrations, (3) `collectstatic` if `static/` changed, (4) `sudo systemctl restart django-dev.service` **last**. Template-only changes still need the restart (cached loader) — unlike local dev where `DEBUG=True` re-reads templates per request. Bit us 15 May 2026: 14 commits of template fixes were "invisible" on prod until a second restart. `git reset --hard github/ai-dev` (not `git pull`) is preferred because the VM accumulates Flatlogic-editor autosave commits that make a plain pull conflict. **⚠ BEFORE any reset --hard: `git show --stat` every VM-local commit being discarded, and anchor it with a `pre-*` safety branch.** An "Autosave" commit can contain a REAL fix — on 12 Jun 2026 autosave `98f66e9` held the only copy of the Flatlogic preview domain in `ALLOWED_HOSTS` (a 29 May incident-recovery edit); we read the diff, anchored it, and re-committed it properly via GitHub (`abfae69`). Never discard a VM commit unread.
- **CDN**: All production traffic goes through Cloudflare. Response headers show `cf-ray`/`cf-cache-status`. Static assets are cached at the edge for 4h — see "Static Assets & Cache-Busting" section for how the `deployment_timestamp` token breaks stale caches.
- **Never edit `ai-dev` directly on GitHub** — Flatlogic pushes overwrite it
- **Gemini gotcha**: Flatlogic's Gemini AI reads `__pycache__/*.pyc` and gets confused. Tell it: "Do NOT read .pyc files. Only work with .py source files."
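The `?cb=$(date +%s)` trick in the Cloudflare note adds a query parameter whose cache key Cloudflare has not seen yet. The sketch below does the same thing from a Python check script using only the standard library. `cache_busted` is a hypothetical helper. Like the note, it assumes Cloudflare's cache key includes the query string.

```python
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url: str, token: Optional[str] = None) -> str:
    """Return `url` with a fresh `cb=<token>` query parameter (default: Unix time)."""
    token = token if token is not None else str(int(time.time()))
    parts = urlsplit(url)
    # Keep existing parameters, replace any previous cb value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "cb"]
    query.append(("cb", token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `cache_busted("https://foxlog.flatlogic.app/static/css/custom.css")` returns that URL with `?cb=<unix time>` appended. Fetch the result in a fresh browser tab or with `curl -I`, then check the status line and the `cf-cache-status` header.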
@@ -1026,6 +1022,14 @@ which silently confuses deploys. Flatlogic's UI occasionally commits as
`Flatlogic Bot <support@flatlogic.com>` (autosaves from the in-browser file editor) —
those commits land on gitea but don't propagate to GitHub unless someone pushes.
**⚠ gitea auth BROKEN since the May 2026 incident (found 12 Jun 2026):**
the VM gets `Authentication failed` pushing to gitea — credentials
presumably rotated/lost in the platform recovery. Until Erik/Flatlogic
support restores them, deploys are GitHub→VM one-way and the Flatlogic
dashboard may show a stale commit; don't burn time re-diagnosing a
gitea push failure. Once fixed, re-sync with `git push gitea ai-dev`
(may need `--force` — gitea last saw the discarded autosave `98f66e9`).
### VM-local safety branches
When doing risky deploys (model migrations, branch resets, history rewrites), we
create a safety branch on the VM at the pre-deploy HEAD so Gemini can

@@ -1,12 +1,45 @@
# Parked / deferred work
> Updated 15 May 2026. A small index of features that are designed,
> Updated 12 Jun 2026. A small index of features that are designed,
> half-built, blocked on input, or pending an operator step. When a
> fresh session opens, glance here first to see what's already on
> the workbench.
---
## 🔧 Audit leftovers (12 Jun 2026) — small, deliberate deferrals
The 12 Jun 2026 technical-audit bundle (deployed; see CLAUDE.md
breadcrumb) deliberately left these on the bench:
1. **Team/project batch-report per-row queries.**
`_build_team_report_context` and `_build_project_report_context`
(`core/views.py`) still run a per-team/per-project query loop
(~2 queries × ~6 rows on `/teams/report/` and `/projects/report/`).
Same disease as the fixed worker report; skipped because the pages
are rarely used and the fix pattern is already proven (copy the
batched-GROUP-BY-dict restructure from
`_build_worker_report_context`). Do it next time those pages feel
slow or get touched anyway.
2. **JS vs server decimal-separator inconsistency (cosmetic).**
Browser-side money helpers use `toLocaleString('en-ZA')` → comma
decimals (`R 2 400,00`); server-side `money` filter renders dot
decimals (`R 2 400.00`). Pre-dates the audit; harmonising means
choosing ONE convention and touching both sides. Pure cosmetics —
decide when something user-facing forces the question.
3. **Stale VM safety branches.** Five `pre-*` branches on the VM
(four from Apr 2026 + `pre-audit-deploy-20260612`). After a few
days of confirmed stability post-12-Jun, tell the Flatlogic agent
to delete them all (the 20260612 one anchors autosave `98f66e9`,
whose only real content is now properly committed as `abfae69`).
4. **gitea mirror auth broken** — operator step: Konrad asks Erik
(Flatlogic support) to restore the VM's gitea credentials; then
re-sync with `git push gitea ai-dev` (may need `--force`). Until
then deploys are GitHub→VM one-way. Details in CLAUDE.md "Git
remotes on the VM".
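Leftover 1 says to copy the batched-GROUP-BY-dict restructure from `_build_worker_report_context`. That code uses the Django ORM. The sketch below instead uses `sqlite3` and a made-up schema (`team`, `payment`, `worklog`, with amounts in integer cents) purely to show the shape of the fix. It runs one aggregate query per child table and folds each result into a dict keyed by parent id, so the query count stays fixed however many rows there are. It also illustrates the general fan-out mechanism behind a join-inflated SUM: join two child tables before summing, and every payment row is repeated once per work-log row.

```python
import sqlite3


def build_conn() -> sqlite3.Connection:
    """In-memory stand-in for the real tables (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE payment (team_id INTEGER, amount_cents INTEGER);
        CREATE TABLE worklog (team_id INTEGER, day TEXT);
        INSERT INTO team VALUES (1, 'A'), (2, 'B');
        INSERT INTO payment VALUES (1, 10000), (1, 5000), (2, 7000);
        INSERT INTO worklog VALUES (1, 'mon'), (1, 'tue'), (2, 'mon');
    """)
    return conn


def totals_joined(conn: sqlite3.Connection):
    """Fan-out pitfall: each payment row is repeated once per worklog row."""
    return conn.execute("""
        SELECT t.name, SUM(p.amount_cents)
        FROM team t
        JOIN payment p ON p.team_id = t.id
        JOIN worklog w ON w.team_id = t.id
        GROUP BY t.id ORDER BY t.id
    """).fetchall()


def totals_batched(conn: sqlite3.Connection):
    """One GROUP BY per child table, folded into dicts: 3 queries total, no fan-out."""
    paid = dict(conn.execute(
        "SELECT team_id, SUM(amount_cents) FROM payment GROUP BY team_id"))
    days = dict(conn.execute(
        "SELECT team_id, COUNT(*) FROM worklog GROUP BY team_id"))
    return [
        (name, paid.get(team_id, 0), days.get(team_id, 0))
        for team_id, name in conn.execute("SELECT id, name FROM team ORDER BY id")
    ]
```

On this toy data, team A's joined total comes out double its real 15000 cents, because it has two work-log rows. The batched version reports 15000.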
---
## ⏸ Paused — ready to execute (not started, not pushed)
### Site-progress logging — rebuild from scratch (parked)
@@ -20,11 +53,10 @@ schema-as-Python pattern, recovery pointers, and the now-superseded
`2026-05-15-post-attendance-flow-v2-*` prior thinking, which stays on
disk).
**Removal status (local only, HARD STOP — not pushed):** Tasks 1-3
done — model/table/UI/routes deleted, migration
`0018_delete_sitereport` drops `core_sitereport`, suite **193 OK**.
Un-pushed pending Konrad's local verification (destructive migration
on the daily-use attendance path).
**Removal status: SHIPPED.** Pushed to origin and confirmed deployed
on production (migration `0018_delete_sitereport` shows `[X]` applied
— verified 12 Jun 2026 during the audit-bundle deploy). Nothing
pending; only the future rebuild remains parked.
---