# MySQL 8.4.8 → 8.4.9 Upgrade — Plan **Date**: 2026-05-19 **Status**: Drafted, **NOT scheduled** **Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md` **Estimated downtime**: 25–30 min (all MySQL-dependent apps offline) **Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us) ## Pre-flight (before the maintenance window) ### P.1 Optional smoke test on a parallel PVC (recommended, +30 min) In a non-production session, before scheduling the real cutover: ```bash # 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the # same image (mysql:8.4.9), same configmap, brand-new PVC. # Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform — # so it doesn't pollute the real stack. # 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections"). # 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it. # 4. Delete the smoketest StatefulSet + PVC. ``` Outcome: - ✅ Init succeeds → proceed with the real upgrade with high confidence. - ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe. ### P.2 Read the MySQL 8.4.9 release notes + bug tracker Specifically look for issues filed since 8.4.9 GA against the DD upgrade path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10 or 8.5.x, consider waiting. ### P.3 Confirm backup pipeline is healthy ```bash # Latest per-db dumps exist for all 20 databases kubectl -n dbaas exec mysql-standalone-0 -- bash -c \ 'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done' # Pushgateway shows recent success kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \ wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db ``` ### P.4 Pin maintenance window and notify Brief the user. Confirm window. Disable any background scrapers / schedulers / bots that would create noise during the cutover. ## Execution (inside the maintenance window) ### Step 1 — Pre-flight snapshot ```bash ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d) # Record current state for verification later kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \ -e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \ WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \ GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt cat /tmp/mysql-pre-upgrade-table-counts.txt ``` ### Step 2 — Trigger a fresh per-db dump ```bash kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s) # Wait for completion (typically <2 min) kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade- ``` Verify all 20 databases dumped: ```bash kubectl -n dbaas exec mysql-standalone-0 -- bash -c \ 'for d in $(ls /backup/per-db/); do newest=$(ls -t /backup/per-db/$d/ | head -1) echo "$d: $newest" done' ``` Every entry should have a `dump__*.sql.gz` listed. ### Step 3 — Bump InnoDB IO config + image pin in Terraform In `stacks/dbaas/modules/dbaas/main.tf`: ```diff - innodb_io_capacity=100 - innodb_io_capacity_max=200 - innodb_page_cleaners=1 + innodb_io_capacity=2000 + innodb_io_capacity_max=4000 + innodb_page_cleaners=4 ``` ```diff - # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU) - # repeatedly across multiple attempts. ... - image = "mysql:8.4.8" + # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade + # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*). + image = "mysql:8.4.9" ``` Commit but **do not apply yet**. ### Step 4 — Stop MySQL ```bash kubectl -n dbaas scale statefulset mysql-standalone --replicas=0 # Wait for pod deletion kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s ``` ### Step 5 — Wipe the PVC ```bash PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}') kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}' kubectl -n dbaas delete pvc data-mysql-standalone-0 # Confirm PV vanishes (CSI cleans up the LV) kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up" ``` ### Step 6 — Apply Terraform (8.4.9 + bumped IO) ```bash cd stacks/dbaas /home/wizard/code/infra/scripts/tg apply ``` This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init takes ~30 s. Verify: ```bash kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();" # expect: 8.4.9 ``` **If the pod fails to become Ready within 5 min**: this is the "root cause was not flush starvation" failure mode. Abort the upgrade, revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply 8.4.8 + restore). Total extra downtime ~25 min. ### Step 7 — Restore per-db dumps (NOT the full --all-databases dump) ```bash ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d) cat <`. Expected time: ~3 min for all 20 databases. ### Step 8 — Recreate Vault-rotated + static users The per-db restore did NOT touch `mysql.user`. Recreate all app users fresh: ```bash # Static users (forgejo, roundcubemail) from Vault FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor) RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor) kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' < 0) | .metadata.name' | \ wc -l)" -eq 0 ]; do sleep 5 done echo "All workloads ready" ``` ### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline) ```bash for ns in chrome-service fire-planner freedify; do kubectl -n "$ns" delete pod --all 2>/dev/null || true done ``` ### Step 11 — Clean up failed CronJob pods from the outage window ```bash kubectl delete pods -A --field-selector=status.phase=Failed ``` ### Step 12 — Verify (matches design §Verification gates) ```bash # 1. Version kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();" # expect: 8.4.9 # 2-3. Databases + table counts kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \ -e "SELECT table_schema, COUNT(*) FROM information_schema.tables \ WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \ GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt # expect: no diff (or only counts that grew between snapshots) # 4. Forgejo kubectl -n forgejo get pod kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready" # expect: 1/1 Running, "ORM engine initialized" # 5. Cluster health bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet # 6. Registry integrity probe kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \ postupgrade-$(date +%s) kubectl -n monitoring logs job/postupgrade- --tail=5 # expect: "Probe complete: 0 failures" # 7. RegistryCatalogInaccessible not firing kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \ wget -qO- 'http://localhost:9090/api/v1/alerts' | \ python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']" # expect: empty / no RegistryCatalogInaccessible ``` ### Step 13 — Commit + push the Terraform change ```bash git add stacks/dbaas/modules/dbaas/main.tf git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md. The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload. " git push ``` ## Rollback path (if Step 6 or Step 7 fails catastrophically) The wipe at Step 5 is destructive — once executed, the original disk is gone. Rollback is **same procedure, image=8.4.8**: 1. Edit TF: `image = "mysql:8.4.8"` 2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0` 3. Re-wipe (already wiped; just `tg apply`) 4. Run the Step 7 restore Job again (now on 8.4.8) 5. Run Step 8-11 6. Update Terraform comment to reflect retained 8.4.8 pin. Extra downtime: ~25 min on top of the existing window. ## Post-upgrade follow-ups - Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin. - Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9. - Re-evaluate whether the new IO config (2000/4) is overkill for the workload after 1-2 weeks — could drop to 1000/2 if needed. - Optional: file a follow-up task to investigate MySQL HA/replication so the next upgrade isn't blocking.