dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.
Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
fbae573664
commit
82371d1ef8
2 changed files with 6 additions and 1 deletions
|
|
@ -221,7 +221,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
|
||||
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. **Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
|
||||
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`, `innodb_buffer_pool_size=2Gi` (raised 1→2Gi + mem limit 4→6Gi, code-oflt 2026-06-30 — pod was near-OOM at 3.7/4Gi; needs a restart). ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. **Write-reduction (code-oflt 2026-06-30):** MySQL was the #1 sdc bandwidth writer; ~60% of its writes were nextcloud webcal calendar churn — subscriptions re-importing every cron (Formula-1 `refreshrate=PT0S`, Tripit `PT15M`) thrashing `oc_calendarobjects_props`. Throttled to `P1D`/`PT1H` via `UPDATE nextcloud.oc_calendarsubscriptions.refreshrate` (app-data, NOT IaC — can regress if a CalDAV client re-adds a subscription with an aggressive rate). |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=DETECT_ONLY` (was ON; set dynamically 2026-06-30, code-oflt — stops doublewrite page-content writes ≈ −0.86 MB/s/~half MySQL's page-flush writes, keeps torn-page DETECTION; recovery = restore-from-backup, OK with BBU+UPS+daily mysqldump), `innodb_buffer_pool_size=2Gi` (raised 1→2Gi + mem limit 4→6Gi, code-oflt 2026-06-30 — pod was near-OOM at 3.7/4Gi; needs a restart). ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. **Write-reduction (code-oflt 2026-06-30):** MySQL was the #1 sdc bandwidth writer; ~60% of its writes were nextcloud webcal calendar churn — subscriptions re-importing every cron (Formula-1 `refreshrate=PT0S`, Tripit `PT15M`) thrashing `oc_calendarobjects_props`. Throttled to `P1D`/`PT1H` via `UPDATE nextcloud.oc_calendarsubscriptions.refreshrate` (app-data, NOT IaC — can regress if a CalDAV client re-adds a subscription with an aggressive rate). |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
|
||||
|
||||
## Monitoring & Alerting
|
||||
|
|
|
|||
|
|
@ -112,6 +112,11 @@ resource "kubernetes_config_map" "mysql_standalone_cnf" {
|
|||
innodb_io_capacity_max=200
|
||||
innodb_redo_log_capacity=1073741824
|
||||
innodb_buffer_pool_size=2147483648
|
||||
# DETECT_ONLY: stop writing full page content to the doublewrite buffer
|
||||
# (~halves page-flush writes on the IOPS-bound sdc) while keeping torn-page
|
||||
# DETECTION; recovery is restore-from-backup. OK with BBU+UPS+daily
|
||||
# mysqldump. Dynamic (no restart). code-oflt 2026-06-30.
|
||||
innodb_doublewrite=DETECT_ONLY
|
||||
innodb_flush_neighbors=1
|
||||
innodb_lru_scan_depth=256
|
||||
innodb_page_cleaners=1
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue