# MySQL 8.4.8 → 8.4.9 Upgrade — Design
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
**Beads**: (filed alongside this doc)
**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
## Background
On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
`mysql.ibd` + redo log after "Server upgrade started", then complete
silence — no CPU, no flushes, no errors, no completion. The `boot`
thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
completed.
Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
downtime: ~25 min. Forgejo + 7 dependent apps offline during that
window.
## Root cause — best evidence
We never proved this definitively because we couldn't connect to MySQL
during the stall, but the strongest hypothesis is **flush starvation
during the DD upgrade's mandatory checkpoint**:
1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
defs) + dirties pages across the system tablespace.
2. Reaches a point where it must checkpoint before continuing.
3. The page-cleaner thread can't drain dirty pages fast enough because
`innodb_io_capacity=100` (about 1.6 MB/s effective flush rate at the
default 16 KiB page size; the pre-8.4 default was 200, MySQL 8.4 raised
it to 10000, and 2000+ is commonly recommended for SSDs) combined with
`innodb_page_cleaners=1`.
4. The `boot` thread waits on a pthread condvar that the flush
coordinator should signal but never does within probe timeout.
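
The 1.6 MB/s figure in step 3 is `innodb_io_capacity` multiplied by the InnoDB page size. A minimal sketch of that arithmetic follows. It assumes the default 16 KiB page size (this doc does not state `innodb_page_size`) and treats `innodb_io_capacity` as a fixed pages-per-second ceiling, which simplifies InnoDB's adaptive flushing. The 50,000-page backlog is hypothetical, not a measured value.

```python
PAGE_SIZE_BYTES = 16 * 1024  # assumption: InnoDB default page size (16 KiB)


def flush_rate_mb_per_s(io_capacity: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Approximate background flush throughput in MB/s (1 MB = 10**6 bytes)."""
    return io_capacity * page_size / 1_000_000


def drain_seconds(dirty_pages: int, io_capacity: int) -> float:
    """Seconds to flush `dirty_pages` at a fixed rate of `io_capacity` pages/s."""
    return dirty_pages / io_capacity


if __name__ == "__main__":
    for capacity in (100, 2000):
        print(f"io_capacity={capacity}: {flush_rate_mb_per_s(capacity):.1f} MB/s")
    # Hypothetical backlog of 50,000 dirty pages (illustrative only).
    backlog = 50_000
    print(f"drain at 100: {drain_seconds(backlog, 100):.0f} s")
    print(f"drain at 2000: {drain_seconds(backlog, 2000):.0f} s")
```

Under these simplifications the same backlog drains 20 times faster at `innodb_io_capacity=2000` than at 100, which is the gap the pre-flight config bump is meant to close.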
Why we're not 100 % certain:
- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
contribute its own flush latency.
- We didn't capture a stack trace from the stalled `boot` thread
(`/proc/1/task/118/stack` was `permission denied`).
- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
checking the MySQL bug tracker before retry).
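
If the stall recurs, the thread state and wait channel can still be sampled even when the kernel stack file is unreadable. The sketch below parses the `State:` line from `/proc/<pid>/task/<tid>/status` and reads the sibling `wchan` file. It assumes a shell with Python in the same PID namespace as `mysqld` (for example a debug container), which this doc does not confirm is available. The function names are illustrative.

```python
from typing import Optional


def parse_thread_state(status_text: str) -> Optional[str]:
    """Return the one-letter scheduler state from /proc/<pid>/task/<tid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def read_thread_snapshot(pid: int, tid: int) -> dict:
    """Read state and wchan for one thread (requires access to that process's /proc entries)."""
    base = f"/proc/{pid}/task/{tid}"
    with open(f"{base}/status") as f:
        state = parse_thread_state(f.read())
    with open(f"{base}/wchan") as f:
        wchan = f.read().strip()
    return {"state": state, "wchan": wchan}
```

Calling `read_thread_snapshot(1, 118)` every few seconds and logging the result would show whether the `boot` thread ever leaves state `S` during the stall.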
**Organizational root cause** (definitive): the `mysql:8.4` floating
tag let Keel auto-bump without testing. Already fixed — image pinned
to `mysql:8.4.8` exactly.
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly — no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade — current settings are ~10× too conservative for the workload. |
| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must be taken less than 60 s before scale-to-0 to minimize data loss. Trigger the `mysql-backup-per-db` CronJob manually. |
| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing the buffer pool, switching to async commits, changing storage class, and replication are all separate decisions. |
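
Decision 4's 60-second bound can be turned into an explicit guard that runs just before scale-to-0. The sketch below assumes the dump completion time is available as a Unix timestamp, for example from the `backup_last_success_timestamp` metric mentioned under Risks. How that value is fetched is left out, and the function name is illustrative.

```python
import time

MAX_DUMP_AGE_S = 60  # decision 4: cutover dump must be less than 60 s old


def dump_is_fresh(dump_epoch: float, now_epoch: float,
                  max_age_s: float = MAX_DUMP_AGE_S) -> bool:
    """True if the dump finished within `max_age_s` seconds before `now_epoch`."""
    age = now_epoch - dump_epoch
    # A negative age means clock skew or a bad timestamp; refuse to proceed.
    return 0 <= age < max_age_s


if __name__ == "__main__":
    # Illustrative only: a dump that finished 30 s ago passes the gate.
    now = time.time()
    print(dump_is_fresh(now - 30, now))
```

Rejecting negative ages is a deliberate choice: a timestamp from the future is more likely a clock or parsing problem than a genuinely fresh dump.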
## Verification gates
Before declaring done:
1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
2. `SHOW DATABASES;` lists all 20 user databases.
3. Table count per schema matches the pre-upgrade snapshot (recorded
in step 1 of the plan).
4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
5. `kubectl get deploy,sts -A` shows no unready workloads.
6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
better PASS/WARN/FAIL ratio as pre-upgrade.
7. Forgejo integrity probe reports 0 failures (manual trigger).
8. `RegistryCatalogInaccessible` not firing in Prometheus.
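
Gate 3 (table count per schema) is easy to get wrong by eye across 20 databases. A minimal sketch of the comparison follows, assuming both snapshots are tab-separated `schema<TAB>count` lines, such as the output of `mysql -N -B` running a `GROUP BY table_schema` query against `information_schema.tables`. The snapshot format and the schema names in the usage example are assumptions, not taken from the plan.

```python
def parse_counts(tsv: str) -> dict[str, int]:
    """Parse 'schema<TAB>count' lines into a {schema: count} mapping."""
    counts = {}
    for line in tsv.strip().splitlines():
        schema, count = line.split("\t")
        counts[schema] = int(count)
    return counts


def diff_table_counts(before: dict[str, int], after: dict[str, int]) -> dict[str, tuple]:
    """Return {schema: (before, after)} for each mismatch; None marks a missing schema."""
    mismatches = {}
    for schema in sorted(set(before) | set(after)):
        b, a = before.get(schema), after.get(schema)
        if b != a:
            mismatches[schema] = (b, a)
    return mismatches


if __name__ == "__main__":
    # Hypothetical snapshots; real ones come from the pre- and post-upgrade queries.
    pre = parse_counts("forgejo\t10\nnextcloud\t5\n")
    post = parse_counts("forgejo\t10\nnextcloud\t4\n")
    print(diff_table_counts(pre, post))
```

An empty result means the gate passes. Any entry names the schema to investigate before declaring the upgrade done.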
## Risks + mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap; adds 30 min). See plan Phase 1. |
| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in the runbook: `DROP USER` + `CREATE USER` from Vault values immediately after restore. |
| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
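
The "misses a database" check in the second row is a set difference between the live database list and the per-db dump directories. The sketch below assumes each subdirectory of `/srv/nfs/mysql-backup/per-db/` is named exactly after its database, and that the four built-in system schemas are intentionally not dumped per-db. Function names are illustrative.

```python
import os

# Built-in MySQL schemas that the per-db dumps are expected to skip.
SYSTEM_SCHEMAS = frozenset({"mysql", "information_schema", "performance_schema", "sys"})


def missing_dumps(live_databases, dumped_databases):
    """Return user databases present in MySQL but absent from the dump set, sorted."""
    live = {db for db in live_databases if db not in SYSTEM_SCHEMAS}
    return sorted(live - set(dumped_databases))


def list_dump_dirs(root: str = "/srv/nfs/mysql-backup/per-db") -> list:
    """List per-db dump directories (assumes directory name == database name)."""
    return sorted(e for e in os.listdir(root) if os.path.isdir(os.path.join(root, e)))
```

Feeding the `SHOW DATABASES` output and `list_dump_dirs()` into `missing_dumps` gives the list of databases to dump manually. An empty list means the cutover can proceed.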
## Why not alternatives
- **In-place DD upgrade with bumped IO config**: simpler, but if it
still stalls we lose 30-60 min waiting + still fall back to
wipe+restore. Same data risk; worse expected time. We *would* learn
whether the bumped IO settings fix the upgrade, but the fresh init
approach makes that knowledge unnecessary.
- **Parallel migration (new mysql-standalone-new pod alongside)**:
cleanest rollback (instant via service-selector flip), but needs TF
surgery to declare two StatefulSets temporarily and isn't worth the
complexity when the wipe+restore approach is now proven.
- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
Acceptable for now (we're pinned), but not a permanent answer.
## Out of scope
- A standby/replica MySQL for zero-downtime upgrades (separate
initiative; see future planning around CNPG-style HA for MySQL).
- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
encryption is a security requirement; debugging its flush latency is
separate).
- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
this upgrade).