infra/docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

349 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**
**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
**Estimated downtime**: 2530 min (all MySQL-dependent apps offline)
**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
## Pre-flight (before the maintenance window)
### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
In a non-production session, before scheduling the real cutover:
```bash
# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
# same image (mysql:8.4.9), same configmap, brand-new PVC.
# Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
# so it doesn't pollute the real stack.
# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
# 4. Delete the smoketest StatefulSet + PVC.
```
Outcome:
- ✅ Init succeeds → proceed with the real upgrade with high confidence.
- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
### P.2 Read the MySQL 8.4.9 release notes + bug tracker
Specifically look for issues filed since 8.4.9 GA against the DD upgrade
path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
or 8.5.x, consider waiting.
### P.3 Confirm backup pipeline is healthy
```bash
# Latest per-db dumps exist for all 20 databases
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
# Pushgateway shows recent success
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
```
### P.4 Pin maintenance window and notify
Brief the user. Confirm window. Disable any background scrapers /
schedulers / bots that would create noise during the cutover.
## Execution (inside the maintenance window)
### Step 1 — Pre-flight snapshot
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Record current state for verification later
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
cat /tmp/mysql-pre-upgrade-table-counts.txt
```
### Step 2 — Trigger a fresh per-db dump
```bash
kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s)
# Wait for completion (typically <2 min)
kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade-<timestamp>
```
Verify all 20 databases dumped:
```bash
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do
newest=$(ls -t /backup/per-db/$d/ | head -1)
echo "$d: $newest"
done'
```
Every entry should have a `dump_<today>_*.sql.gz` listed.
### Step 3 — Bump InnoDB IO config + image pin in Terraform
In `stacks/dbaas/modules/dbaas/main.tf`:
```diff
- innodb_io_capacity=100
- innodb_io_capacity_max=200
- innodb_page_cleaners=1
+ innodb_io_capacity=2000
+ innodb_io_capacity_max=4000
+ innodb_page_cleaners=4
```
```diff
- # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
- # repeatedly across multiple attempts. ...
- image = "mysql:8.4.8"
+ # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+ # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+ image = "mysql:8.4.9"
```
Commit but **do not apply yet**.
### Step 4 — Stop MySQL
```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
# Wait for pod deletion
kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
```
### Step 5 — Wipe the PVC
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
# Confirm PV vanishes (CSI cleans up the LV)
kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
```
### Step 6 — Apply Terraform (8.4.9 + bumped IO)
```bash
cd stacks/dbaas
/home/wizard/code/infra/scripts/tg apply
```
This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
takes ~30 s. Verify:
```bash
kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
```
**If the pod fails to become Ready within 5 min**: this is the
"root cause was not flush starvation" failure mode. Abort the upgrade,
revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
8.4.8 + restore). Total extra downtime ~25 min.
### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
cat <<YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: mysql-restore-per-db-$(date +%Y-%m-%d)
namespace: dbaas
spec:
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mysql:8.4.9
command: ["bash","-c"]
args:
- |
set -euo pipefail
for db in \$(ls /backup/per-db/); do
newest=\$(ls -t /backup/per-db/\$db/ | head -1)
echo "=== Restoring \$db from \$newest ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
-e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
gunzip -c "/backup/per-db/\$db/\$newest" | \
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
done
echo "=== All databases restored ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
volumeMounts:
- { name: backup, mountPath: /backup, readOnly: true }
volumes:
- name: backup
persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
Expected time: ~3 min for all 20 databases.
### Step 8 — Recreate Vault-rotated + static users
The per-db restore did NOT touch `mysql.user`. Recreate all app users
fresh:
```bash
# Static users (forgejo, roundcubemail) from Vault
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
# user with the current password held in K8s secrets
for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
echo "Rotating $role"
vault write -f "database/rotate-role/$role"
done
# Technitium has a separate password-sync job — kick it
kubectl -n technitium create job --from=cronjob/technitium-password-sync \
technitium-postupgrade-$(date +%s)
```
### Step 9 — Restart MySQL-dependent apps
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"phpipam:deploy/phpipam" \
"technitium:sts/technitium" \
"vikunja:deploy/vikunja" \
"freshrss:deploy/freshrss" \
"finance:deploy/finance" \
"resume:deploy/resume" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
done
wait
```
Wait for all to become ready:
```bash
until [ "$(kubectl get deploy,sts -A -o json | \
jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
wc -l)" -eq 0 ]; do
sleep 5
done
echo "All workloads ready"
```
### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
```bash
for ns in chrome-service fire-planner freedify; do
kubectl -n "$ns" delete pod --all 2>/dev/null || true
done
```
### Step 11 — Clean up failed CronJob pods from the outage window
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
### Step 12 — Verify (matches design §Verification gates)
```bash
# 1. Version
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
# 2-3. Databases + table counts
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
# expect: no diff (or only counts that grew between snapshots)
# 4. Forgejo
kubectl -n forgejo get pod
kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
# expect: 1/1 Running, "ORM engine initialized"
# 5. Cluster health
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
# 6. Registry integrity probe
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \
postupgrade-$(date +%s)
kubectl -n monitoring logs job/postupgrade-<timestamp> --tail=5
# expect: "Probe complete: 0 failures"
# 7. RegistryCatalogInaccessible not firing
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
# expect: empty / no RegistryCatalogInaccessible
```
### Step 13 — Commit + push the Terraform change
```bash
git add stacks/dbaas/modules/dbaas/main.tf
git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
"
git push
```
## Rollback path (if Step 6 or Step 7 fails catastrophically)
The wipe at Step 5 is destructive — once executed, the original disk
is gone. Rollback is **same procedure, image=8.4.8**:
1. Edit TF: `image = "mysql:8.4.8"`
2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
3. Re-wipe (already wiped; just `tg apply`)
4. Run the Step 7 restore Job again (now on 8.4.8)
5. Run Step 8-11
6. Update Terraform comment to reflect retained 8.4.8 pin.
Extra downtime: ~25 min on top of the existing window.
## Post-upgrade follow-ups
- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
- Re-evaluate whether the new IO config (2000/4) is overkill for the
workload after 1-2 weeks — could drop to 1000/2 if needed.
- Optional: file a follow-up task to investigate MySQL HA/replication
so the next upgrade isn't blocking.