docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
webcal calendar churn that was ~60% of MySQL's writes (now throttled
in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
longer needs -target dodging (now ignore_changes'd on the STS VCT).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 82c9e69b77
commit 1afe41880e
1 changed file with 2 additions and 2 deletions
@@ -197,7 +197,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=1024MB`, `effective_cache_size=2560MB`, `work_mem=16MB`, `max_connections=200`, pod memory 3Gi. **Write-reduction (2026-06-29, code-oflt, analysis #6922):** `checkpoint_timeout=15min` + `max_wal_size=4GB` + `min_wal_size=1GB` (checkpoints were 100% timer-driven at the 5-min default, bursting FPIs onto sdc); `archive_timeout=0` (CNPG forces `archive_mode=on` but `.spec.backup` is empty → a 16MB WAL switch every 300s shipped nowhere = ~4.6 GB/day waste; daily `pg_dump` is the real backup); `commit_delay=2500`µs (group-commit fsync coalescing, safe for all DBs incl financial); `wal_compression=zstd` (was pglz). All reloadable (no restart). **Apply gotcha:** the Cluster is a `null_resource.pg_cluster` + local-exec `kubectl apply` — bump its `pg_params` trigger or the YAML edit is inert, and apply with `-target=module.dbaas.null_resource.pg_cluster` to dodge the pre-existing `mysql_standalone` VCT-annotation drift that errors a broad `dbaas` apply.
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=1024MB`, `effective_cache_size=2560MB`, `work_mem=16MB`, `max_connections=200`, pod memory 3Gi. **Write-reduction (2026-06-29, code-oflt, analysis #6922):** `checkpoint_timeout=15min` + `max_wal_size=4GB` + `min_wal_size=1GB` (checkpoints were 100% timer-driven at the 5-min default, bursting FPIs onto sdc); `archive_timeout=0` (CNPG forces `archive_mode=on` but `.spec.backup` is empty → a 16MB WAL switch every 300s shipped nowhere = ~4.6 GB/day waste; daily `pg_dump` is the real backup); `commit_delay=2500`µs (group-commit fsync coalescing, safe for all DBs incl financial); `wal_compression=zstd` (was pglz). All reloadable (no restart). **Apply gotcha:** the Cluster is a `null_resource.pg_cluster` + local-exec `kubectl apply` — bump its `pg_params` trigger or the YAML edit is inert, and apply with `-target=module.dbaas.null_resource.pg_cluster` for fast iteration. (The `mysql_standalone` VCT-annotation drift that used to error broad `dbaas` applies was **fixed 2026-06-30** — `ignore_changes` on the STS `volume_claim_template`, since pvc-autoresizer owns PVC sizing and the VCT is immutable post-creation anyway.)
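The "~4.6 GB/day" figure and the checkpoint cadence quoted above follow from simple arithmetic. A minimal sketch of that calculation, assuming every `archive_timeout` interval actually forces a full 16MB segment switch (an upper bound) and that checkpoints are purely timer-driven; the function names are hypothetical and exist only for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wal_switches_per_day(archive_timeout_s: int) -> int:
    """Forced WAL segment switches per day; archive_timeout=0 disables them."""
    if archive_timeout_s <= 0:
        return 0
    return SECONDS_PER_DAY // archive_timeout_s


def forced_wal_mb_per_day(archive_timeout_s: int, segment_mb: int = 16) -> int:
    """Upper bound on WAL volume (MB/day) produced by forced switches alone."""
    return wal_switches_per_day(archive_timeout_s) * segment_mb


def timed_checkpoints_per_day(checkpoint_timeout_s: int) -> int:
    """Checkpoints per day when every checkpoint is triggered by the timer."""
    return SECONDS_PER_DAY // checkpoint_timeout_s


if __name__ == "__main__":
    print(forced_wal_mb_per_day(300))          # old archive_timeout of 300s
    print(forced_wal_mb_per_day(0))            # archive_timeout=0
    print(timed_checkpoints_per_day(5 * 60))   # 5-min default
    print(timed_checkpoints_per_day(15 * 60))  # checkpoint_timeout=15min
```

With a 300s timeout this gives 288 switches of 16MB, i.e. 4608 MB/day, which matches the ~4.6 GB/day quoted; raising `checkpoint_timeout` from 5 to 15 minutes cuts timer-driven checkpoints from 288 to 96 per day.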
## Networking & Resilience
- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
@@ -221,7 +221,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`, `innodb_buffer_pool_size=2Gi` (raised 1→2Gi + mem limit 4→6Gi, code-oflt 2026-06-30 — pod was near-OOM at 3.7/4Gi; needs a restart). ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. **Write-reduction (code-oflt 2026-06-30):** MySQL was the #1 sdc bandwidth writer; ~60% of its writes were nextcloud webcal calendar churn — subscriptions re-importing every cron (Formula-1 `refreshrate=PT0S`, Tripit `PT15M`) thrashing `oc_calendarobjects_props`. Throttled to `P1D`/`PT1H` via `UPDATE nextcloud.oc_calendarsubscriptions.refreshrate` (app-data, NOT IaC — can regress if a CalDAV client re-adds a subscription with an aggressive rate). |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
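The effect of the webcal `refreshrate` throttle in the MySQL row can be estimated by converting each ISO 8601 duration into re-imports per day. A rough sketch, assuming the Nextcloud background cron fires every 5 minutes (an assumption; the source only says "every cron") and that a subscription refreshes at most once per cron run; `parse_duration` and `refreshes_per_day` are hypothetical helpers, not Nextcloud APIs:

```python
import re
from datetime import timedelta

# Subset of ISO 8601 durations: weeks, days, hours, minutes, seconds.
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Parse values such as 'PT0S', 'PT15M', 'PT1H' or 'P1D' into a timedelta."""
    match = _DURATION.match(text)
    parts = {k: int(v) for k, v in match.groupdict().items() if v} if match else {}
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


def refreshes_per_day(
    refreshrate: str, cron_interval: timedelta = timedelta(minutes=5)
) -> int:
    """Re-imports per day; a rate shorter than the cron interval is capped by it."""
    effective = max(parse_duration(refreshrate), cron_interval)
    return int(timedelta(days=1) / effective)


if __name__ == "__main__":
    for rate in ("PT0S", "PT15M", "PT1H", "P1D"):
        print(rate, refreshes_per_day(rate))
```

Under these assumptions `PT0S` re-imports on every cron run (288 times a day at a 5-minute cron) and `PT15M` 96 times a day, versus 24 for `PT1H` and 1 for `P1D`, which is the kind of reduction the throttle is after.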
## Monitoring & Alerting