Commit graph

73 commits

Author SHA1 Message Date
Viktor Barzin
fb1e47a20a nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor
Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around
both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
  `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
  Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
  CrashLoop): chart_values renders the live tag via a plural
  kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
  fresh install/DR), so a re-render never downgrades below live.

Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and
its background-controller overrides a TF-set value, and patch == minor for
Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations like other workloads.

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
2026-06-01 19:50:41 +00:00
Viktor Barzin
3d28870e25 nextcloud: fix backup retention to sort by name, not mtime
The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used
`ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's
mtime, so the freshest backup didn't sort as newest — the retention step
deleted the new backup and kept a stale one. Sort lexically (chronological
for these names) and keep the last.

Also exclude html/ (the app code, reproducible from the now-pinned image;
the real config lives at config/config.php, html/config is empty) so the
backup is config+data+custom_apps only → ~4.3G (<5G target).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
root
84ab4c998c Woodpecker CI deploy [CI SKIP] 2026-06-01 15:15:26 +00:00
Viktor Barzin
ddd582a28c backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image
The offsite Synology hit 97% — the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.

- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
  (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
  SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
  PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
  now excludes html/ (app code, from image), logs, and preview cache and keeps
  only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
  the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
  32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
  CrashLooped on the downgrade. Pinning eliminates the drift.

Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
root
d0ede3773b Woodpecker CI deploy [CI SKIP] 2026-05-27 18:38:09 +00:00
Viktor Barzin
ee159b02ba nextcloud: disable Keel auto-upgrades
Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on
2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate
the DB schema, which Keel does not run, so Nextcloud landed in
maintenance mode (needsDbUpgrade=true) and stayed there for ~22h —
external probes saw 503, ExternalAccessDivergence kept firing.

Disable Keel for this workload:
- Drop the `keel.sh/enrolled=true` label from the namespace so
  Kyverno's `inject-keel-annotations` policy no longer matches.
- Layer `keel.sh/policy=never` label + annotation onto the
  Helm-managed Deployment via `kubernetes_labels` /
  `kubernetes_annotations` (the chart at 8.8.1 doesn't expose
  Deployment-level commonLabels/commonAnnotations). Keel reads the
  annotation; the label is defense-in-depth for the Kyverno
  exclude rule should the namespace ever get re-enrolled.

Verified: Keel logged `image no longer tracked, removing watcher`
within seconds of the annotation landing, and `tg plan` is clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:37:05 +00:00
Viktor Barzin
c624caf65a nextcloud(external_storage): add per-mount enableSharing option
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Lets admin natively share folders from inside an external mount with
internal users/groups or via public link. The two PVE pool browsers
(visible to admin only) get enableSharing=true so they can act as a
"share-from picker" over /srv/nfs and /srv/nfs-ssd; /anca-elements
stays false so anca manages re-sharing inside her own view.

- Manifest schema gains enableSharing on rootMounts + archiveMounts.
- Bootstrap Job adds sync_option() and reconciles enable_sharing via
  occ files_external:option (idempotent — occ no-ops same-value set).
2026-05-24 11:39:16 +00:00
root
37e563d5a9 Woodpecker CI deploy [CI SKIP] 2026-05-24 11:31:53 +00:00
Viktor Barzin
cb1a34fd00 nextcloud: expose PVE NFS roots + /anca-elements via Files External
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Mounts the Proxmox host NFS exports (/srv/nfs and /srv/nfs-ssd) into
the NC pod and surfaces them through occ files_external:create:

- /PVE NFS Pool      → /mnt/pve-nfs       (admin group only)
- /PVE NFS-SSD Pool  → /mnt/pve-nfs-ssd   (admin group only)
- /anca-elements     → /mnt/pve-nfs/anca-elements  (admin, anca users)

Mount visibility is controlled by occ files_external:applicable; no
Files Access Control. ACL state is reconciled idempotently by a
bootstrap Job that diffs desired vs current applicable_users /
applicable_groups (via files_external:list --output=json).

Bootstrap fixes vs initial design:
- Sync loop used `[ -n "$U" ] && cmd` which returns 1 on empty input,
  triggering set -e on no-op re-runs. Switched to process substitution
  `< <(jq ...)` so empty diff -> loop body never runs -> 0 exit.
- RBAC missed `watch` verb (kubectl wait spammed reflector errors).
- Manifest used display-name "viktor" instead of NC username "admin"
  for the /anca-elements applicable list.

Chart values: added two PV-backed volume mounts at /mnt/pve-nfs[+ssd]
and pinned securityContext to fsGroup=33 with fsGroupChangePolicy:
OnRootMismatch (chart default Always would recurse 600k+ files on
every pod restart).
2026-05-24 11:27:26 +00:00
Viktor Barzin
7bcd3f745c ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
Viktor Barzin
70d0623b21 ci: retrigger apply for pending Keel enrollment (~58 stacks)
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.

Idempotent — re-applying a stack whose state already matches is a no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:42:57 +00:00
8f4b19565c recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:41:05 +00:00
Viktor Barzin
53657d9952 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-11 19:25:48 +00:00
Viktor Barzin
bf752dffa5 fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 21:57:01 +00:00
Viktor Barzin
fecfa211fd fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 19:56:16 +00:00
Viktor Barzin
e4f806abe3 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
Viktor Barzin
a9ee9cee60
ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow 2026-05-09 13:16:24 +00:00
Viktor Barzin
3489621a45 nextcloud(backup): pin backup pod to nextcloud's node via podAffinity
The weekly backup mounts the same RWO PVC (proxmox-lvm-encrypted) as the
main nextcloud deployment. Single-node attach — the backup pod can never
mount the volume if it lands on a different node, and was stuck in
ContainerCreating for 6+ hours when cron fired today.

Add pod_affinity (required, hostname topology) so the backup co-locates
with the nextcloud app pod. Discovered via cluster-health probe; manual
verify run scheduled on k8s-node3 next to nextcloud's pod and completed
the rsync in seconds.
2026-04-26 11:03:20 +00:00
Viktor Barzin
b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover. This must be done on the OLD sentinels too, not just v2 —
   they're the ones that kept fighting our REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
   BullMQ queues and `_kombu.*` Celery queues — the user-stated
   must-survive data class.

Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
   60s — target <3s of actual user-visible disruption.

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
216d4240c9 [infra] Add Cloudflare provider to all stack lock files and generated providers
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:31:36 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
8b004c4c94 feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.

## Context

Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.

## This change

Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
  with 1280Mi limit for LUKS2 Argon2id key derivation

Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates

Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update

Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover

Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs

## Services migrated (18 PVCs across 14 namespaces)

```
vaultwarden     → vaultwarden-data-encrypted
dbaas           → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver      → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud       → nextcloud-data-encrypted
forgejo         → forgejo-data-encrypted
matrix          → matrix-data-encrypted
n8n             → n8n-data-encrypted
affine          → affine-data-encrypted
health          → health-uploads-encrypted
hackmd          → hackmd-data-encrypted
redis           → redis-data-redis-node-{0,1}
headscale       → headscale-data-encrypted
frigate         → frigate-config-encrypted
meshcentral     → meshcentral-{data,files}-encrypted
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
30cdeefb1c chore: sync terraform state after nfsvers=4 convergence
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:20:18 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
a0392a9617 fix(nextcloud): auto-sync DB password from Vault rotation into config.php
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.

Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.

Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
2026-04-10 22:23:52 +01:00
Viktor Barzin
2a10c1821f fix(nextcloud): enable mysql.utf8mb4 to fix collation errors with MySQL 8.4
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.

Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 09:16:48 +00:00
Viktor Barzin
61dc7a6862 nextcloud: refactor chart values and main.tf configuration 2026-04-06 11:57:44 +03:00
Viktor Barzin
ce7b8c2b2e add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip]
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
2026-04-03 23:30:00 +03:00
Viktor Barzin
dd59512153 migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip]
Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all
block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes
the iSCSI network hop for database I/O.

New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart
with StorageClass "proxmox-lvm" using existing local-lvm thin pool.

Migrated PVCs (12 total):
- Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus
- Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2)

All services verified healthy post-migration.
2026-04-02 22:13:04 +03:00
Viktor Barzin
2e016d7df2 fix nextcloud db-username + k8s-dashboard chart repo
- nextcloud: add db-username to ESO secret template and usernameKey
  to chart values (required by newer chart version)
- k8s-dashboard: update chart repo URL to kubernetes-retired.github.io
  (old kubernetes.github.io/dashboard returns 404)
2026-03-22 02:50:48 +02:00
Viktor Barzin
c8b42f78df fix DB password rotation desync in 5 stacks
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.

- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
2026-03-17 07:39:29 +00:00
Viktor Barzin
745e43c983 fix DB password desync + migrate remaining tfvars to Vault
DB desync fix: Stacks with Vault DB engine rotation (24h) now read
the password from vault-database ClusterSecretStore instead of vault-kv.
9 stacks updated with db ExternalSecrets reading from static-creds/*.

Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory,
woodpecker, linkwarden, nextcloud, url.

terraform.tfvars migration:
- plotting-book: google_client_id/secret → Vault KV + secret_key_ref
- tandoor: email_password var removed (was default="", now optional ESO)
- infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password
  → Vault KV at secret/infra + data source
2026-03-15 21:39:45 +00:00
Viktor Barzin
06a0d0599a regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
Viktor Barzin
0f262ceda3 add pod dependency management via Kyverno init container injection
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).

Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
2026-03-15 19:17:57 +00:00
Viktor Barzin
1acf8cc4e8 migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
23019da8e5 equalize memory req=lim across 70+ containers using Prometheus 7d max data
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.

Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes

Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
2026-03-14 21:46:49 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
120f83ce93 Nextcloud performance tuning and fix backup cron job
- Set loglevel=2 (warnings) and disable mail_smtpdebug via configs
- Enable opcache.enable_file_override for faster file checks
- Increase APCu shared memory from 32M to 128M
- Fix broken module.nfs_nextcloud_data reference in backup cron job
  to use the iSCSI PVC directly
2026-03-14 08:20:51 +00:00
Viktor Barzin
af5f6a659b right-size Nextcloud resources after MySQL migration
SQLite caused 4.7 CPU / 2GB usage, now MySQL uses ~95m / 95Mi.
Reduced limits from 16 CPU / 6Gi to 2 CPU / 1Gi.
Reduced requests from 100m / 1Gi to 50m / 256Mi.
Frees ~14 CPU cores and 5Gi memory for other workloads.
2026-03-13 22:21:10 +00:00
Viktor Barzin
3e03fbec63 increase MaxRequestWorkers to 150 now that Nextcloud is on MySQL
With SQLite, 50 workers caused all workers to block on DB locks.
On MySQL, CPU is ~20m and memory ~143Mi — no resource pressure.
The crash-looping was caused by hitting MaxRequestWorkers=50 limit
("server reached MaxRequestWorkers setting"), not by DB contention.
2026-03-13 22:20:52 +00:00
Viktor Barzin
aa3d3d0e66 migrate Nextcloud from SQLite to MySQL InnoDB Cluster
SQLite was causing constant crash-looping (138 restarts/day) due to
write lock contention with concurrent sync clients.

Migration required workarounds for multiple occ db:convert-type bugs:
- GR error 3100: SET GLOBAL sql_generate_invisible_primary_key = ON
- utf8mb3 column creation: stripped 4-byte emoji + invalid UTF-8 from
  SQLite (F1 calendar events, filecache)
- SQLite index corruption: repaired via .dump + INSERT OR IGNORE reimport
- kubectl exec timeouts: used nohup + detached process

Verified: all 136 tables migrated, 100% row count match across 15 key
tables (users, files, calendars, contacts, shares, activity).

Also fixed typo: databse → database in chart values.
2026-03-13 22:20:28 +00:00
OpenClaw
fd6c1cca93 fix(nextcloud): Database corruption recovery and conservative Apache tuning
- Restored clean SQLite database from pre-migration backup
- Fixed severe database corruption (rowid ordering, page corruption, index issues)
- Applied conservative MaxRequestWorkers=15 for SQLite stability
- Database integrity now perfect, all health checks passing
- Ready for future MySQL migration with clean data

[ci skip]
2026-03-12 13:38:37 +00:00
OpenClaw
db1e301eea fix(nextcloud): Increase Apache MaxRequestWorkers to resolve health check timeouts
- Increase MaxRequestWorkers from 10 to 25 for 4 CPU + 3Gi memory container
- Update Apache tuning for Redis + SQLite backend (not pure SQLite)
- Resolves CrashLoopBackOff caused by health probe timeouts
- Allows handling concurrent users without MaxRequestWorkers limit errors

[ci skip]
2026-03-12 13:14:20 +00:00