infra/docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
Viktor Barzin 9fd54143c2 docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.

Not scheduled yet. Tracked in beads code-963q.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:10:00 +00:00


MySQL 8.4.8 → 8.4.9 Upgrade — Design

Date: 2026-05-19
Status: Drafted, NOT scheduled. Execute only inside a planned maintenance window with user sign-off.
Beads: (filed alongside this doc)
Related: docs/runbooks/restore-mysql.md, beads code-eme8 / code-k40p (closed in ea475c3d)

Background

On 2026-05-18, Keel auto-bumped the mysql:8.4 floating tag on the mysql-standalone StatefulSet from 8.4.8 to 8.4.9. The in-server data dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to mysql.ibd + redo log after "Server upgrade started", then complete silence — no CPU, no flushes, no errors, no completion. The boot thread sat in user-space sleep (State: S, wchan: 0) for 10+ minutes; the MySQLX socket appeared but mysqld.sock never did. Even with liveness_probe.initial_delay_seconds = 600, the upgrade never completed.

Recovery (commit ea475c3d): pinned image to mysql:8.4.8 exactly, wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total downtime: ~25 min. Forgejo + 7 dependent apps offline during that window.

Root cause — best evidence

We never proved this definitively because we couldn't connect to MySQL during the stall, but the strongest hypothesis is flush starvation during the DD upgrade's mandatory checkpoint:

  1. Upgrade rewrites mysql.st_spatial_reference_systems (5103 SRS defs) + dirties pages across the system tablespace.
  2. Reaches a point where it must checkpoint before continuing.
  3. The page-cleaner thread can't drain dirty pages fast enough because innodb_io_capacity=100 (roughly 1.6 MB/s effective flush rate at the default 16 KiB page size; the pre-8.4 default was 200, MySQL 8.4 itself defaults to 10000, and 2000+ is commonly recommended for SSDs) combined with innodb_page_cleaners=1.
  4. The boot thread waits on a pthread condvar that the flush coordinator should signal but never does within probe timeout.
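The 1.6 MB/s figure in step 3 follows from innodb_io_capacity being a budget of I/O operations per second for background work, with each flushed page being 16 KiB under the default innodb_page_size. A minimal sketch of that arithmetic, treating the result as a rough ceiling rather than a measured rate (the function name is illustrative, not from the repo):

```shell
#!/usr/bin/env bash
# Approximate background flush ceiling in KiB/s for a given innodb_io_capacity.
# Assumption: innodb_page_size is the 16 KiB default; pass a second argument to override.
flush_rate_kib() {
  local io_capacity="$1"
  local page_kib="${2:-16}"
  echo $(( io_capacity * page_kib ))
}

echo "io_capacity=100  -> $(flush_rate_kib 100) KiB/s"
echo "io_capacity=2000 -> $(flush_rate_kib 2000) KiB/s"
```

At 100 this gives 1600 KiB/s, which is the roughly 1.6 MB/s quoted above; at the proposed 2000 it is 32000 KiB/s.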

Why we're not 100 % certain:

  • LUKS2-encrypted block storage (proxmox-lvm-encrypted) may contribute its own flush latency.
  • We didn't capture a stack trace from the stalled boot thread (/proc/1/task/118/stack was permission denied).
  • A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth checking the MySQL bug tracker before retry).
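If the stall recurs, the State and wchan values observed for the boot thread could be collected for every mysqld thread at once, which would at least show whether the page cleaner was asleep too. The sketch below reads only the per-thread status and wchan files, since stack needs elevated privileges. The function name and the optional proc-root argument are illustrative, and whether the container has a shell with awk available is an assumption. Note that wchan can also read 0 when the kernel withholds the value from the reader, so it is a weak signal on its own.

```shell
# Print "TID STATE WCHAN" for each thread of a process.
# $1: PID, $2: proc root (defaults to /proc; overridable for testing)
thread_states() {
  local pid="$1"
  local proc_root="${2:-/proc}"
  local task tid state wchan
  for task in "$proc_root/$pid"/task/*; do
    [ -r "$task/status" ] || continue
    tid="${task##*/}"
    # The State line looks like "State:  S (sleeping)"; keep the one-letter code.
    state="$(awk '/^State:/ {print $2}' "$task/status")"
    wchan="$(cat "$task/wchan" 2>/dev/null || echo '?')"
    echo "$tid $state ${wchan:-?}"
  done
}
```

Inside the container, where mysqld runs as PID 1, the call would be `thread_states 1`.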

Organizational root cause (definitive): the mysql:8.4 floating tag let Keel auto-bump without testing. Already fixed — image pinned to mysql:8.4.8 exactly.
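One way to keep a floating tag from creeping back in is a small guard in CI or a pre-commit hook that rejects image references lacking a full major.minor.patch tag. This is a hypothetical check, not something the repo is known to contain, and the function name is_pinned_image is an assumption. A digest reference is stricter still, because a patch-level tag can in principle be re-pushed.

```shell
# Succeed only when the image reference is pinned: either a digest,
# or a tag of the form x.y.z with an optional suffix such as "-oracle".
is_pinned_image() {
  case "$1" in
    *@sha256:*) return 0 ;;
  esac
  printf '%s\n' "$1" | grep -Eq ':[0-9]+\.[0-9]+\.[0-9]+(-[A-Za-z0-9.]+)?$'
}

for image in mysql:8.4 mysql:8.4.8 mysql:latest; do
  if is_pinned_image "$image"; then
    echo "$image: pinned"
  else
    echo "$image: floating"
  fi
done
```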

Decisions

| # | Decision | Notes |
|---|----------|-------|
| 1 | Approach: wipe + re-init on 8.4.9 (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly, so no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
| 2 | Pre-flight: bump InnoDB IO config | innodb_io_capacity=2000, innodb_io_capacity_max=4000, innodb_page_cleaners=4. These are the long-term-correct values regardless of the upgrade; current settings are ~10× too conservative for the workload. |
| 3 | Restore strategy: per-database dumps, NOT the full --all-databases dump | Per-db dumps at /srv/nfs/mysql-backup/per-db/<db>/ skip the mysql system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
| 4 | Fresh dump immediately before cutover, not yesterday's | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick mysql-backup-per-db CronJob manually. |
| 5 | Maintenance window required | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
| 6 | Single rollback path: re-pin to 8.4.8 + same wipe/restore flow | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
| 7 | Out of scope for this upgrade: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication: all separate decisions. |
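Decision 2 amounts to three server settings. As a sketch, the function below prints them as a my.cnf fragment; how the fragment reaches the pod (ConfigMap, Terraform, or container args) is not specified in this document, so that wiring is left out. The function names are illustrative. One thing worth verifying after the restart: MySQL caps innodb_page_cleaners at innodb_buffer_pool_instances, so on a small buffer pool with a single instance the value 4 may be reduced automatically.

```shell
# Print the pre-flight InnoDB IO settings from decision 2 as a my.cnf fragment.
render_innodb_io_cnf() {
  cat <<'EOF'
[mysqld]
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_page_cleaners=4
EOF
}

# Query the running server for the effective values. Defined but not invoked here:
# it needs cluster access and the root password in $PW, as in the verification gates.
show_innodb_io_settings() {
  kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e \
    "SELECT @@innodb_io_capacity, @@innodb_io_capacity_max, @@innodb_page_cleaners, @@innodb_buffer_pool_instances;"
}

render_innodb_io_cnf
```

Comparing @@innodb_page_cleaners with @@innodb_buffer_pool_instances in the query output shows whether the requested cleaner count actually took effect.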

Verification gates

Before declaring done:

  1. kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();" returns 8.4.9.
  2. SHOW DATABASES; lists all 20 user databases.
  3. Table count per schema matches the pre-upgrade snapshot (recorded in step 1 of the plan).
  4. forgejo logs show successful DB ping; kubectl -n forgejo get pod is 1/1 Running.
  5. kubectl get deploy,sts -A shows no unready workloads.
  6. bash infra/scripts/cluster_healthcheck.sh --quiet returns same or better PASS/WARN/FAIL ratio as pre-upgrade.
  7. Forgejo integrity probe reports 0 failures (manual trigger).
  8. RegistryCatalogInaccessible not firing in Prometheus.
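Gate 3 depends on a per-schema table count captured before the wipe. One possible shape for that snapshot and comparison is sketched below. The function names and snapshot file paths are hypothetical, and the system-schema exclusion list is an assumption about what counts as a user database here.

```shell
# SQL producing "schema<TAB>table_count" rows for user schemas.
TABLE_COUNT_SQL="SELECT table_schema, COUNT(*) FROM information_schema.tables
  WHERE table_type = 'BASE TABLE'
    AND table_schema NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema')
  GROUP BY table_schema ORDER BY table_schema;"

# Write a snapshot to the file given as $1. Defined but not invoked here:
# it needs cluster access and the root password in $PW.
snapshot_table_counts() {
  kubectl -n dbaas exec mysql-standalone-0 -- \
    mysql -uroot -p"$PW" -N -B -e "$TABLE_COUNT_SQL" > "$1"
}

# Compare two snapshots; prints the differing lines and returns non-zero on mismatch.
compare_table_counts() {
  diff "$1" "$2"
}
```

A run would take one snapshot before scale-to-0 and one after the restore, then pass both files to compare_table_counts; an empty diff satisfies the gate.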

Risks + mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| 8.4.9 fresh init has some other unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap: adds 30 min). See plan Phase 1. |
| Per-db dump-restore misses a database the user added recently | Low | Compare SHOW DATABASES against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in /srv/nfs/mysql-backup/per-db/, dump it manually first. |
| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook: DROP USER + CREATE USER from Vault values immediately after restore. |
| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes backup_last_success_timestamp to Pushgateway. Verify timestamp is fresh before proceeding. |
| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: kubectl rollout restart on the affected deployments. Listed exhaustively in runbook §B.8. |
| 8.4.9 fresh init also stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
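Two of the mitigations above are mechanical enough to script before the window opens: comparing the live database list against the per-db dump directories, and checking that the last successful backup is recent. The sketch below assumes one directory per database under the dump root (as the /srv/nfs/mysql-backup/per-db/<db>/ layout suggests) and a backup_last_success_timestamp value expressed as plain Unix-epoch seconds. The function names are illustrative assumptions.

```shell
# List databases present in MySQL but lacking a dump directory.
# $1: file with one database name per line (for example, SHOW DATABASES via mysql -N -B)
# $2: per-db dump root
missing_dumps() {
  local db
  while IFS= read -r db; do
    case "$db" in
      # System schemas are not part of the per-db dumps.
      ''|mysql|sys|information_schema|performance_schema) continue ;;
    esac
    [ -d "$2/$db" ] || echo "$db"
  done < "$1"
}

# Succeed when a Unix-epoch timestamp is at most max_age seconds old.
# $1: timestamp (integer or decimal seconds), $2: max age in seconds,
# $3: "now" in epoch seconds (defaults to the current time)
backup_is_fresh() {
  local ts="${1%.*}"   # drop any fractional part
  local now="${3:-$(date +%s)}"
  [ $(( now - ts )) -le "$2" ]
}
```

Any name printed by missing_dumps needs a manual dump before cutover, and a failing backup_is_fresh means the cutover dump should be re-run rather than trusted.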

Why not alternatives

  • In-place DD upgrade with bumped IO config: simpler, but if it still stalls we lose 30-60 min waiting + still fall back to wipe+restore. Same data risk; worse expected time. We would learn whether the bumped IO settings fix the upgrade, but the fresh init approach makes that knowledge unnecessary.
  • Parallel migration (new mysql-standalone-new pod alongside): cleanest rollback (instant via service-selector flip), but needs TF surgery to declare two StatefulSets temporarily and isn't worth the complexity when the wipe+restore approach is now proven.
  • Wait for 8.4.10 / 8.5 LTS: leaves us stuck on 8.4.8 indefinitely. Acceptable for now (we're pinned), but not a permanent answer.

Out of scope

  • A standby/replica MySQL for zero-downtime upgrades (separate initiative — see future planning around CNPG-style HA for MySQL).
  • Removing proxmox-lvm-encrypted LUKS2 from the equation (the encryption is a security requirement; debugging its flush latency is separate).
  • Replacing MySQL with PostgreSQL (long-term goal for some apps; not this upgrade).