docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade

Captures the wipe+reinit strategy (sidestep the broken DD upgrade path), the IO config bump (innodb_io_capacity 100→2000), root-cause analysis with explicit uncertainty, verification gates, and rollback. Not scheduled yet. Tracked in beads code-963q. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:10:00 +00:00 · 2026-05-19 13:10:00 +00:00 · e4b9e97ac9
commit e4b9e97ac9
parent a048b37f60
2 changed files with 461 additions and 0 deletions
--- a/docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
+++ b/docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
@ -0,0 +1,112 @@
+# MySQL 8.4.8 → 8.4.9 Upgrade — Design
+
+**Date**: 2026-05-19
+**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
+**Beads**: (filed alongside this doc)
+**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
+
+## Background
+
+On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
+`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
+dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
+`mysql.ibd` + redo log after "Server upgrade started", then complete
+silence — no CPU, no flushes, no errors, no completion. The `boot`
+thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
+minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
+with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
+completed.
+
+Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
+wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
+downtime: ~25 min. Forgejo + 7 dependent apps offline during that
+window.
+
+## Root cause — best evidence
+
+We never proved this definitively because we couldn't connect to MySQL
+during the stall, but the strongest hypothesis is **flush starvation
+during the DD upgrade's mandatory checkpoint**:
+
+1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
+   defs) + dirties pages across the system tablespace.
+2. Reaches a point where it must checkpoint before continuing.
+3. The page-cleaner thread can't drain dirty pages fast enough because
+   `innodb_io_capacity=100` (1.6 MB/s effective flush rate, default is
+   200, recommended for SSDs is 2000+) combined with
+   `innodb_page_cleaners=1`.
+4. The `boot` thread waits on a pthread condvar that the flush
+   coordinator should signal but never does within probe timeout.
+
+Why we're not 100 % certain:
+- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
+  contribute its own flush latency.
+- We didn't capture a stack trace from the stalled `boot` thread
+  (`/proc/1/task/118/stack` was `permission denied`).
+- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
+  checking the MySQL bug tracker before retry).
+
+**Organizational root cause** (definitive): the `mysql:8.4` floating
+tag let Keel auto-bump without testing. Already fixed — image pinned
+to `mysql:8.4.8` exactly.
+
+## Decisions
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly — no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
+| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade — current settings are ~10× too conservative for the workload. |
+| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
+| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick `mysql-backup-per-db` CronJob manually. |
+| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
+| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
+| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication — all separate decisions. |
+
+## Verification gates
+
+Before declaring done:
+1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
+2. `SHOW DATABASES;` lists all 20 user databases.
+3. Table count per schema matches the pre-upgrade snapshot (recorded
+   in step 1 of the plan).
+4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
+5. `kubectl get deploy,sts -A` shows no unready workloads.
+6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
+   better PASS/WARN/FAIL ratio as pre-upgrade.
+7. Forgejo integrity probe reports 0 failures (manual trigger).
+8. `RegistryCatalogInaccessible` not firing in Prometheus.
+
+## Risks + mitigations
+
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap — adds 30 min). See plan Phase 1. |
+| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
+| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook — DROP USER + CREATE USER from Vault values immediately after restore. |
+| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
+| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
+| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
+
+## Why not alternatives
+
+- **In-place DD upgrade with bumped IO config**: simpler, but if it
+  still stalls we lose 30–60 min waiting + still fall back to
+  wipe+restore. Same data risk; worse expected time. We *would* learn
+  whether the bumped IO settings fix the upgrade, but the fresh init
+  approach makes that knowledge unnecessary.
+- **Parallel migration (new mysql-standalone-new pod alongside)**:
+  cleanest rollback (instant via service-selector flip), but needs TF
+  surgery to declare two StatefulSets temporarily and isn't worth the
+  complexity when the wipe+restore approach is now proven.
+- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
+  Acceptable for now (we're pinned), but not a permanent answer.
+
+## Out of scope
+
+- A standby/replica MySQL for zero-downtime upgrades (separate
+  initiative — see future planning around CNPG-style HA for MySQL).
+- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
+  encryption is a security requirement; debugging its flush latency is
+  separate).
+- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
+  this upgrade).
--- a/docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
+++ b/docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
@ -0,0 +1,349 @@
+# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
+
+**Date**: 2026-05-19
+**Status**: Drafted, **NOT scheduled**
+**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
+**Estimated downtime**: 25–30 min (all MySQL-dependent apps offline)
+**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
+
+## Pre-flight (before the maintenance window)
+
+### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
+
+In a non-production session, before scheduling the real cutover:
+
+```bash
+# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
+#    same image (mysql:8.4.9), same configmap, brand-new PVC.
+#    Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
+#    so it doesn't pollute the real stack.
+# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
+# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
+# 4. Delete the smoketest StatefulSet + PVC.
+```
+
+Outcome:
+- ✅ Init succeeds → proceed with the real upgrade with high confidence.
+- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
+
+### P.2 Read the MySQL 8.4.9 release notes + bug tracker
+
+Specifically look for issues filed since 8.4.9 GA against the DD upgrade
+path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
+or 8.5.x, consider waiting.
+
+### P.3 Confirm backup pipeline is healthy
+
+```bash
+# Latest per-db dumps exist for all 20 databases
+kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
+    'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
+
+# Pushgateway shows recent success
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
+```
+
+### P.4 Pin maintenance window and notify
+
+Brief the user. Confirm window. Disable any background scrapers /
+schedulers / bots that would create noise during the cutover.
+
+## Execution (inside the maintenance window)
+
+### Step 1 — Pre-flight snapshot
+
+```bash
+ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+
+# Record current state for verification later
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
+    -e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
+        WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
+        GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
+cat /tmp/mysql-pre-upgrade-table-counts.txt
+```
+
+### Step 2 — Trigger a fresh per-db dump
+
+```bash
+kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s)
+# Wait for completion (typically <2 min)
+kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade-<timestamp>
+```
+
+Verify all 20 databases dumped:
+
+```bash
+kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
+    'for d in $(ls /backup/per-db/); do
+       newest=$(ls -t /backup/per-db/$d/ | head -1)
+       echo "$d: $newest"
+     done'
+```
+
+Every entry should have a `dump_<today>_*.sql.gz` listed.
+
+### Step 3 — Bump InnoDB IO config + image pin in Terraform
+
+In `stacks/dbaas/modules/dbaas/main.tf`:
+
+```diff
+-      innodb_io_capacity=100
+-      innodb_io_capacity_max=200
+-      innodb_page_cleaners=1
+      innodb_io_capacity=2000
+      innodb_io_capacity_max=4000
+      innodb_page_cleaners=4
+```
+
+```diff
+-          # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
+-          # repeatedly across multiple attempts. ...
+-          image = "mysql:8.4.8"
+          # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+          # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+          image = "mysql:8.4.9"
+```
+
+Commit but **do not apply yet**.
+
+### Step 4 — Stop MySQL
+
+```bash
+kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
+# Wait for pod deletion
+kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
+```
+
+### Step 5 — Wipe the PVC
+
+```bash
+PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
+kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
+kubectl -n dbaas delete pvc data-mysql-standalone-0
+# Confirm PV vanishes (CSI cleans up the LV)
+kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
+```
+
+### Step 6 — Apply Terraform (8.4.9 + bumped IO)
+
+```bash
+cd stacks/dbaas
+/home/wizard/code/infra/scripts/tg apply
+```
+
+This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
+takes ~30 s. Verify:
+
+```bash
+kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
+# expect: 8.4.9
+```
+
+**If the pod fails to become Ready within 5 min**: this is the
+"root cause was not flush starvation" failure mode. Abort the upgrade,
+revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
+8.4.8 + restore). Total extra downtime ~25 min.
+
+### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
+
+```bash
+ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+
+cat <<YAML | kubectl apply -f -
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: mysql-restore-per-db-$(date +%Y-%m-%d)
+  namespace: dbaas
+spec:
+  ttlSecondsAfterFinished: 3600
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: restore
+        image: mysql:8.4.9
+        command: ["bash","-c"]
+        args:
+        - |
+          set -euo pipefail
+          for db in \$(ls /backup/per-db/); do
+            newest=\$(ls -t /backup/per-db/\$db/ | head -1)
+            echo "=== Restoring \$db from \$newest ==="
+            mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
+                -e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
+            gunzip -c "/backup/per-db/\$db/\$newest" | \
+              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
+          done
+          echo "=== All databases restored ==="
+          mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
+        env:
+        - name: MYSQL_ROOT_PASSWORD
+          valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
+        volumeMounts:
+        - { name: backup, mountPath: /backup, readOnly: true }
+      volumes:
+      - name: backup
+        persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
+YAML
+```
+
+Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
+Expected time: ~3 min for all 20 databases.
+
+### Step 8 — Recreate Vault-rotated + static users
+
+The per-db restore did NOT touch `mysql.user`. Recreate all app users
+fresh:
+
+```bash
+# Static users (forgejo, roundcubemail) from Vault
+FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
+RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
+
+kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
+CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
+CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
+GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
+GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
+FLUSH PRIVILEGES;
+SQL
+
+# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
+# user with the current password held in K8s secrets
+for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
+  echo "Rotating $role"
+  vault write -f "database/rotate-role/$role"
+done
+
+# Technitium has a separate password-sync job — kick it
+kubectl -n technitium create job --from=cronjob/technitium-password-sync \
+    technitium-postupgrade-$(date +%s)
+```
+
+### Step 9 — Restart MySQL-dependent apps
+
+```bash
+for ns_app in \
+    "forgejo:deploy/forgejo" \
+    "nextcloud:deploy/nextcloud" \
+    "hackmd:deploy/hackmd" \
+    "monitoring:deploy/grafana" \
+    "paperless-ngx:deploy/paperless-ngx" \
+    "uptime-kuma:deploy/uptime-kuma" \
+    "url:deploy/shlink" \
+    "phpipam:deploy/phpipam" \
+    "technitium:sts/technitium" \
+    "vikunja:deploy/vikunja" \
+    "freshrss:deploy/freshrss" \
+    "finance:deploy/finance" \
+    "resume:deploy/resume" \
+    "realestate-crawler:deploy/realestate-crawler-api" \
+    "realestate-crawler:deploy/realestate-crawler-celery" \
+    "realestate-crawler:deploy/realestate-crawler-celery-beat" \
+    "realestate-crawler:deploy/realestate-crawler-ui"; do
+  ns=${ns_app%%:*}; app=${ns_app##*:}
+  kubectl -n "$ns" rollout restart "$app" &
+done
+wait
+```
+
+Wait for all to become ready:
+
+```bash
+until [ "$(kubectl get deploy,sts -A -o json | \
+    jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
+    wc -l)" -eq 0 ]; do
+  sleep 5
+done
+echo "All workloads ready"
+```
+
+### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
+
+```bash
+for ns in chrome-service fire-planner freedify; do
+  kubectl -n "$ns" delete pod --all 2>/dev/null || true
+done
+```
+
+### Step 11 — Clean up failed CronJob pods from the outage window
+
+```bash
+kubectl delete pods -A --field-selector=status.phase=Failed
+```
+
+### Step 12 — Verify (matches design §Verification gates)
+
+```bash
+# 1. Version
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
+# expect: 8.4.9
+
+# 2-3. Databases + table counts
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
+    -e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
+        WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
+        GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
+diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
+# expect: no diff (or only counts that grew between snapshots)
+
+# 4. Forgejo
+kubectl -n forgejo get pod
+kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
+# expect: 1/1 Running, "ORM engine initialized"
+
+# 5. Cluster health
+bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
+
+# 6. Registry integrity probe
+kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \
+    postupgrade-$(date +%s)
+kubectl -n monitoring logs job/postupgrade-<timestamp> --tail=5
+# expect: "Probe complete: 0 failures"
+
+# 7. RegistryCatalogInaccessible not firing
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://localhost:9090/api/v1/alerts' | \
+    python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
+# expect: empty / no RegistryCatalogInaccessible
+```
+
+### Step 13 — Commit + push the Terraform change
+
+```bash
+git add stacks/dbaas/modules/dbaas/main.tf
+git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
+
+Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
+The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
+upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
+"
+git push
+```
+
+## Rollback path (if Step 6 or Step 7 fails catastrophically)
+
+The wipe at Step 5 is destructive — once executed, the original disk
+is gone. Rollback is **same procedure, image=8.4.8**:
+
+1. Edit TF: `image = "mysql:8.4.8"`
+2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
+3. Re-wipe (already wiped; just `tg apply`)
+4. Run the Step 7 restore Job again (now on 8.4.8)
+5. Run Step 8-11
+6. Update Terraform comment to reflect retained 8.4.8 pin.
+
+Extra downtime: ~25 min on top of the existing window.
+
+## Post-upgrade follow-ups
+
+- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
+- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
+- Re-evaluate whether the new IO config (2000/4) is overkill for the
+  workload after 1-2 weeks — could drop to 1000/2 if needed.
+- Optional: file a follow-up task to investigate MySQL HA/replication
+  so the next upgrade isn't blocking.