# MySQL 8.4.8 → 8.4.9 Upgrade — Plan

**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**
**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
**Estimated downtime**: 25–30 min (all MySQL-dependent apps offline)
**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)

## Pre-flight (before the maintenance window)

### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)

In a non-production session, before scheduling the real cutover:

```bash
# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
#    same image (mysql:8.4.9), same configmap, brand-new PVC.
#    Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
#    so it doesn't pollute the real stack.
# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
# 4. Delete the smoketest StatefulSet + PVC.
```

Outcome:
- ✅ Init succeeds → proceed with the real upgrade with high confidence.
- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.

### P.2 Read the MySQL 8.4.9 release notes + bug tracker

Look specifically for issues filed since 8.4.9 GA against the data
dictionary (DD) upgrade path or `st_spatial_reference_systems`. If a
known fix landed in 8.4.10 or 8.5.x, consider waiting.

### P.3 Confirm backup pipeline is healthy

```bash
# Latest per-db dumps exist for all 20 databases
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
  'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'

# Pushgateway shows recent success
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
  wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
```

### P.4 Pin maintenance window and notify

Brief the user. Confirm window. Disable any background scrapers /
schedulers / bots that would create noise during the cutover.

## Execution (inside the maintenance window)

### Step 1 — Pre-flight snapshot

```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)

# Record current state for verification later.
# ORDER BY keeps the row order stable so the file can be diffed in Step 12.
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
  -e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
      WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
      GROUP BY table_schema ORDER BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
cat /tmp/mysql-pre-upgrade-table-counts.txt
```

### Step 2 — Trigger a fresh per-db dump

```bash
# Keep the generated job name in a variable so the wait targets the same job
JOB="pre-upgrade-$(date +%s)"
kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db "$JOB"
# Wait for completion (typically <2 min)
kubectl -n dbaas wait --for=condition=complete --timeout=300s "job/$JOB"
```

Verify all 20 databases dumped:

```bash
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
  'for d in $(ls /backup/per-db/); do
     newest=$(ls -t /backup/per-db/$d/ | head -1)
     echo "$d: $newest"
   done'
```

Every entry should have a `dump_<today>_*.sql.gz` listed.

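Scanning 20 lines by eye is easy to get wrong during a night-time window. As a minimal sketch, the `db: filename` listing can be piped through a filter that prints only the databases whose newest dump is not from today. `stale_dumps` is a hypothetical helper, and the `YYYYMMDD` date stamp is an assumption about how `<today>` appears in the file name; adjust the pattern to the real naming scheme.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "db: filename" lines on stdin and prints every
# database whose newest dump does not match dump_<today>_*.sql.gz.
# Assumption: <today> is stamped as YYYYMMDD (e.g. 20260519).
stale_dumps() {
  local today="$1" line db file
  while IFS= read -r line; do
    db=${line%%: *}     # text before the first ": "
    file=${line#*: }    # text after the first ": " (empty if no dump exists)
    case "$file" in
      "dump_${today}_"*.sql.gz) ;;        # fresh dump, nothing to report
      *) printf '%s\n' "$db" ;;           # missing or older dump
    esac
  done
}

# Usage sketch (needs cluster access, so it is left commented out):
# kubectl -n dbaas exec mysql-standalone-0 -- bash -c '...listing loop...' \
#   | stale_dumps "$(date +%Y%m%d)"
```

Empty output means every database has a dump from today; any printed name is a reason to stop before the wipe.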
### Step 3 — Bump InnoDB IO config + image pin in Terraform

In `stacks/dbaas/modules/dbaas/main.tf`:

```diff
- innodb_io_capacity=100
- innodb_io_capacity_max=200
- innodb_page_cleaners=1
+ innodb_io_capacity=2000
+ innodb_io_capacity_max=4000
+ innodb_page_cleaners=4
```

```diff
- # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
- # repeatedly across multiple attempts. ...
- image = "mysql:8.4.8"
+ # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+ # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+ image = "mysql:8.4.9"
```

Commit but **do not apply yet**.

### Step 4 — Stop MySQL

```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
# Wait for pod deletion
kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
```

### Step 5 — Wipe the PVC

```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
# Confirm PV vanishes (CSI cleans up the LV)
kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
```

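The PV check in this step runs once, right after the PVC delete, so it may report the PV as still present while the CSI driver is still cleaning up. A minimal polling sketch follows; `wait_until` and `pv_gone` are hypothetical names, and the 120 s / 5 s values are assumptions rather than measured figures.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-run a command every INTERVAL seconds until it
# succeeds or TIMEOUT seconds have elapsed.
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Usage sketch (needs cluster access, so it is left commented out).
# `kubectl get pv NAME` exits non-zero once the PV no longer exists.
# pv_gone() { ! kubectl get pv "$PV" >/dev/null 2>&1; }
# wait_until 120 5 pv_gone && echo "PV cleaned up" || echo "WARNING: PV still present"
```

The same helper could bound the "wait for all workloads" loop in Step 9, which otherwise has no timeout.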
### Step 6 — Apply Terraform (8.4.9 + bumped IO)

```bash
cd stacks/dbaas
/home/wizard/code/infra/scripts/tg apply
```

This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initialization
takes ~30 s. Verify:

```bash
kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
```

**If the pod fails to become Ready within 5 min**: this is the
"root cause was not flush starvation" failure mode. Abort the upgrade,
revert the image pin to 8.4.8 in TF, and re-run from Step 4 (wipe + apply
8.4.8 + restore). Total extra downtime ~25 min.

### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)

```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)

cat <<YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: mysql-restore-per-db-$(date +%Y-%m-%d)
  namespace: dbaas
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mysql:8.4.9
          command: ["bash","-c"]
          args:
            - |
              set -euo pipefail
              for db in \$(ls /backup/per-db/); do
                newest=\$(ls -t /backup/per-db/\$db/ | head -1)
                echo "=== Restoring \$db from \$newest ==="
                mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
                  -e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
                gunzip -c "/backup/per-db/\$db/\$newest" | \
                  mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
              done
              echo "=== All databases restored ==="
              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
          volumeMounts:
            - { name: backup, mountPath: /backup, readOnly: true }
      volumes:
        - name: backup
          persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```

Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
Expected time: ~3 min for all 20 databases.

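The manifest in this step relies on how bash treats an unquoted heredoc delimiter: `$(date ...)` is expanded by the local shell when the manifest is rendered, while anything written as `\$` survives as a literal `$` for the container's shell to expand later. The sketch below shows that rule in isolation; `render` and its values are illustrative only and not part of the plan.

```bash
#!/usr/bin/env bash
# Demonstrates expansion inside an UNQUOTED heredoc (<<EOF, not <<'EOF'):
#   $var and $(cmd)  -> expanded now, by the shell rendering the text
#   \$               -> emitted as a literal $, expanded later by the consumer
#   \\\`             -> emitted as \` (an escaped backtick for the consumer)
name="restore"
render() {
cat <<EOF
job: ${name}-$(printf '%s' 2026-05-19)
script: echo "\$MYSQL_ROOT_PASSWORD"
sql: CREATE DATABASE \\\`\$db\\\`;
EOF
}
render
```

Piping the rendered manifest through `cat` (or `kubectl apply --dry-run=client -f -`) before applying it for real is a cheap way to confirm that the escapes came out as intended.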
### Step 8 — Recreate Vault-rotated + static users

The per-db restore did NOT touch `mysql.user`. Recreate all app users
fresh:

```bash
# Static users (forgejo, roundcubemail) from Vault
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)

kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL

# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
# user with the current password held in K8s secrets
for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
  echo "Rotating $role"
  vault write -f "database/rotate-role/$role"
done

# Technitium has a separate password-sync job — kick it
kubectl -n technitium create job --from=cronjob/technitium-password-sync \
  technitium-postupgrade-$(date +%s)
```

### Step 9 — Restart MySQL-dependent apps

```bash
for ns_app in \
  "forgejo:deploy/forgejo" \
  "nextcloud:deploy/nextcloud" \
  "hackmd:deploy/hackmd" \
  "monitoring:deploy/grafana" \
  "paperless-ngx:deploy/paperless-ngx" \
  "uptime-kuma:deploy/uptime-kuma" \
  "url:deploy/shlink" \
  "phpipam:deploy/phpipam" \
  "technitium:sts/technitium" \
  "vikunja:deploy/vikunja" \
  "freshrss:deploy/freshrss" \
  "finance:deploy/finance" \
  "resume:deploy/resume" \
  "realestate-crawler:deploy/realestate-crawler-api" \
  "realestate-crawler:deploy/realestate-crawler-celery" \
  "realestate-crawler:deploy/realestate-crawler-celery-beat" \
  "realestate-crawler:deploy/realestate-crawler-ui"; do
  ns=${ns_app%%:*}; app=${ns_app##*:}
  kubectl -n "$ns" rollout restart "$app" &
done
wait
```

Wait for all to become ready:

```bash
until [ "$(kubectl get deploy,sts -A -o json | \
  jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
  wc -l)" -eq 0 ]; do
  sleep 5
done
echo "All workloads ready"
```

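A bare `wait` returns 0 regardless of how the backgrounded commands exited, so a failed `rollout restart` in the loop for this step would go unnoticed. The sketch below is one way to keep the parallelism while reporting failures; `restart_all` and `kubectl_restart` are hypothetical names, and the restart command is passed in as a parameter only so the logic can be exercised without a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: restart every "namespace:kind/name" target in parallel
# and report the ones whose restart command exited non-zero.
# $1 is the command (or function) to run as: CMD <namespace> <kind/name>.
restart_all() {
  local restart_cmd="$1"; shift
  local ns_app ns app i failed=0
  local -a pids=() targets=()

  for ns_app in "$@"; do
    ns=${ns_app%%:*}    # everything before the first colon
    app=${ns_app##*:}   # everything after the last colon
    "$restart_cmd" "$ns" "$app" &
    pids+=("$!")
    targets+=("$ns_app")
  done

  # `wait <pid>` returns that job's exit status, unlike a bare `wait`.
  for i in "${!pids[@]}"; do
    if ! wait "${pids[$i]}"; then
      echo "FAILED: ${targets[$i]}"
      failed=1
    fi
  done
  return "$failed"
}

# Wrapper around the real command (needs cluster access, so the call is commented out):
kubectl_restart() { kubectl -n "$1" rollout restart "$2"; }
# restart_all kubectl_restart "forgejo:deploy/forgejo" "technitium:sts/technitium"
```

A non-zero return plus the `FAILED:` lines gives a short list of targets to retry by hand instead of a silent miss.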
### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)

```bash
for ns in chrome-service fire-planner freedify; do
  kubectl -n "$ns" delete pod --all 2>/dev/null || true
done
```

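Deleting every pod in those namespaces also restarts pods that pulled their image fine. If that matters, a narrower variant is to delete only the pods stuck on an image pull. The sketch below filters the default `kubectl get pods --no-headers` output, whose third column is STATUS; `pull_failed_pods` is a hypothetical helper, and pods failing in an init container (shown as `Init:ImagePullBackOff`) are not matched by it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints the names of pods
# whose STATUS is an image-pull failure.
pull_failed_pods() {
  awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }'
}

# Usage sketch (needs cluster access, so it is left commented out):
# for ns in chrome-service fire-planner freedify; do
#   kubectl -n "$ns" get pods --no-headers | pull_failed_pods |
#     while read -r pod; do kubectl -n "$ns" delete pod "$pod"; done
# done
```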
### Step 11 — Clean up failed CronJob pods from the outage window

```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```

### Step 12 — Verify (matches design §Verification gates)

```bash
# 1. Version
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9

# 2-3. Databases + table counts
# Alias COUNT(*) AS tables so the header row matches the Step 1 snapshot;
# ORDER BY keeps the row order stable for the diff.
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
  -e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
      WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
      GROUP BY table_schema ORDER BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
# expect: no diff (or only counts that grew between snapshots)

# 4. Forgejo
kubectl -n forgejo get pod
kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
# expect: 1/1 Running, "ORM engine initialized"

# 5. Cluster health
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet

# 6. Registry integrity probe
# Keep the job name in a variable so the wait and logs target the same job.
PROBE_JOB="postupgrade-$(date +%s)"
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe "$PROBE_JOB"
# Let the probe finish before reading its logs (the 300s timeout is an assumption)
kubectl -n monitoring wait --for=condition=complete --timeout=300s "job/$PROBE_JOB"
kubectl -n monitoring logs "job/$PROBE_JOB" --tail=5
# expect: "Probe complete: 0 failures"

# 7. RegistryCatalogInaccessible not firing
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | \
  python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
# expect: empty / no RegistryCatalogInaccessible
```

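The table-count gate accepts counts that grew between snapshots, which a plain `diff` cannot distinguish from counts that shrank. As a minimal sketch, the two snapshot files can be compared so that only missing schemas and lower counts are flagged. `compare_counts` is a hypothetical helper, and it assumes both files are the tab-separated batch output of `mysql -e` with one header row.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare two "schema<TAB>count" snapshots (each with a
# header row) and report schemas that disappeared or lost tables.
# Exit status is 0 when nothing regressed, 1 otherwise.
compare_counts() {
  local pre="$1" post="$2"
  awk -F'\t' '
    FNR == 1  { next }                       # skip the header row of each file
    NR == FNR { before[$1] = $2; next }      # first file: pre-upgrade counts
              { after[$1] = $2 }             # second file: post-upgrade counts
    END {
      bad = 0
      for (s in before) {
        if (!(s in after)) {
          print "MISSING: " s; bad = 1
        } else if ((after[s] + 0) < (before[s] + 0)) {
          print "SHRANK: " s " " before[s] " -> " after[s]; bad = 1
        }
      }
      exit bad
    }
  ' "$pre" "$post"
}

# Usage sketch:
# compare_counts /tmp/mysql-pre-upgrade-table-counts.txt \
#                /tmp/mysql-post-upgrade-table-counts.txt && echo "table counts OK"
```

Schemas that exist only in the post-upgrade file are ignored here, since extra tables or databases are not a sign of data loss.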
### Step 13 — Commit + push the Terraform change

```bash
git add stacks/dbaas/modules/dbaas/main.tf
git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade

Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
"
git push
```

## Rollback path (if Step 6 or Step 7 fails catastrophically)

The wipe at Step 5 is destructive: once executed, the original disk
is gone. Rollback is the **same procedure, image=8.4.8**:

1. Edit TF: `image = "mysql:8.4.8"`
2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
3. Re-wipe (already wiped; just `tg apply`)
4. Run the Step 7 restore Job again (now on 8.4.8)
5. Run Steps 8–11
6. Update Terraform comment to reflect retained 8.4.8 pin.

Extra downtime: ~25 min on top of the existing window.

## Post-upgrade follow-ups

- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
- Re-evaluate whether the new IO config (2000/4) is overkill for the
  workload after 1-2 weeks; it could drop to 1000/2 if needed.
- Optional: file a follow-up task to investigate MySQL HA/replication
  so the next upgrade isn't blocking.