Commit graph

1672 commits

Author SHA1 Message Date
Viktor Barzin
1acf8cc4e8 migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
Viktor Barzin
cfc30b62e8 enhance devops-engineer agent: deploy + monitor pod health [ci skip]
- Upgrade model from sonnet to opus for subagent orchestration
- Add Write, Edit, Agent tools for spawning monitor subagents
- Add mandatory deployment workflow: pre-deploy snapshot, apply,
  spawn background haiku pod monitor, react to results
- Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck
  Pending, and probe failures within 3 min timeout
- Allow terragrunt apply and kubectl set image as safe operations
2026-03-15 18:44:20 +00:00
Viktor Barzin
a2720f6a4c claude-memory: read DB password from Vault KV instead of tfvars
Vault DB engine rotates the password every 24h, so the static tfvars
value was stale. Now reads from secret/claude-memory db_password key.
2026-03-15 18:22:29 +00:00
Viktor Barzin
90b7d6ebb5 etcd defrag cronjob: add --command-timeout=60s
Default 5s timeout causes defrag to fail on fragmented DBs.
Discovered during manual defrag that took ~7s.
2026-03-15 17:24:24 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
1fe7798609 fix openclaw init container: escape shell vars, fix image path [ci skip]
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation)
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
2026-03-15 17:19:03 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
d17a6e2fd3 fix calendar-query.py: use get_display_name(), URL-decode names, fix search API
- Replace deprecated cal.name with cal_name() helper using get_display_name()
- URL-decode calendar names (Formula+1 → Formula 1)
- Use cal.search(event=True) instead of deprecated date_search()
- Default to showing all calendars instead of filtering to Personal
2026-03-15 16:12:36 +00:00
Viktor Barzin
deeea5edab openclaw: replace cc-config NFS with dotfiles repo clone [ci skip]
- Add init container "install-dotfiles" that clones the dotfiles repo
  and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
  that manages the Mac config, eliminating the dual-sync problem
2026-03-15 16:04:02 +00:00
Viktor Barzin
944d6d3b22 update claude knowledge: resource management learnings from right-sizing session [ci skip] 2026-03-15 15:38:37 +00:00
Viktor Barzin
5beb481dc4 fix immich TF drift from Kyverno ndots injection, right-size nvidia GPU operator
- immich: add lifecycle ignore_changes for dns_config on all 3 deployments
  to prevent perpetual plan drift from Kyverno ndots:2 mutation policy
- nvidia dcgm-exporter: 768Mi → 2560Mi (VPA upper 2091Mi, was under-provisioned)
- nvidia cuda-validator: 1024Mi → 256Mi (one-shot job, vastly over-provisioned)
2026-03-15 15:36:19 +00:00
Viktor Barzin
a6d281dbc6 vaultwarden: upgrade to 1.35.4, use Recreate strategy
- Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug)
- Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)
2026-03-15 15:35:09 +00:00
Viktor Barzin
3f0e8541f6 claude-memory: pin image to :17, fixes URL-decode crash on sync endpoint
The + in +00:00 timezone offsets was being URL-decoded to a space,
causing ValueError on the /api/memories/sync endpoint. Build :17
includes the fix. Using versioned tag instead of :latest to avoid
pull-through cache serving stale images.
2026-03-15 15:32:50 +00:00
Viktor Barzin
194281e527 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-15 15:30:18 +00:00
Viktor Barzin
8bac6db48f add name/description/tools to review-loop agent frontmatter [ci skip] 2026-03-15 11:14:31 +00:00
Viktor Barzin
616370d34c rename planner agent to review-loop [ci skip] 2026-03-15 11:12:14 +00:00
Viktor Barzin
18d012db11 fix: reduce openclaw memory requests for scheduling
- openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi
  (limit 256Mi). Total request 1408Mi fits available capacity.
2026-03-15 10:47:34 +00:00
Viktor Barzin
123e996b04 add planner agent: plan-review-fix convergence loop [ci skip] 2026-03-15 10:46:53 +00:00
Viktor Barzin
307b7f6819 update claude knowledge: infra operational learnings from commit history [ci skip]
Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
2026-03-15 10:46:45 +00:00
Viktor Barzin
56ddee457a fix: openclaw policy violation + reduce memory requests for capacity
- openclaw: fix Kyverno policy violation (node:22-alpine ->
  docker.io/library/node:22-alpine), reduce request to 1536Mi
  with 2Gi limit for overcommit
- rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi)
- stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)
2026-03-15 10:37:58 +00:00
Viktor Barzin
683f255c2e fix: increase memory for OOMKilled services
- hackmd: 64Mi -> 256Mi (Node.js app OOMKilled after 14min)
- n8n: limit 512Mi -> 768Mi (DB timeouts at 88% mem usage)
- speedtest: 128Mi -> 256Mi (OOMKilled during startup)
- shlink: limit 512Mi -> 768Mi (OOMKilled after startup)
2026-03-15 10:26:11 +00:00
Viktor Barzin
b219961bd8 fix: MySQL memory overcommit + shlink OOMKill
- dbaas: MySQL requests 4Gi -> 2Gi (limits stay 4Gi) to free 6Gi
  of request capacity. Actual usage is 1-1.5Gi per instance.
- url/shlink: increase memory limit 512Mi -> 768Mi (OOMKilled)
2026-03-15 03:22:07 +00:00
Viktor Barzin
0a69af618d update claude knowledge: vault KV secrets migration [ci skip] 2026-03-15 03:22:07 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
root
516685c504 Woodpecker CI deploy commit [CI SKIP] 2026-03-15 02:38:30 +00:00
Viktor Barzin
52bce849d7 fix cluster health: resolve 21/23 failures from healthcheck
- nvidia: change GPU taint NoSchedule -> PreferNoSchedule to allow
  overflow scheduling on k8s-node1 (frees ~7Gi capacity)
- kyverno: increase reports-controller memory 256Mi -> 512Mi (OOMKilled)
- speedtest: add missing DB_PORT=3306 env var (nc: service "" unknown)
- realestate-crawler: increase API memory 64Mi -> 256Mi (OOMKilled)
- calibre: increase liveness probe timeout 1s -> 5s (false restarts)
2026-03-15 02:33:46 +00:00
Viktor Barzin
5f71a53b08 add memory-tool instructions to project CLAUDE.md [ci skip]
OpenClaw agents read the project-level CLAUDE.md from the workspace.
Adding explicit memory-tool CLI instructions here ensures the agent
uses exec to call memory-tool instead of looking for non-existent
MCP tools (memory_store, memory_recall).
2026-03-15 02:16:03 +00:00
Viktor Barzin
456e2777f5 update claude knowledge: LinuxServer.io container optimization learnings [ci skip] 2026-03-15 02:04:04 +00:00
Viktor Barzin
7c97902052 fix calibre slow startup: bake calibre binaries into image, skip chown on NFS
Custom Docker image pre-installs the universal-calibre mod at build time,
eliminating ~10 min apt-get on every container start. Added NO_CHOWN=true
to skip recursive chown that hangs on NFS mounts. Tightened startup probe
since pod now starts in ~2 min instead of 15-20 min.
2026-03-15 02:03:19 +00:00
Viktor Barzin
ff83ec3325 add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts
Agents: devops-engineer, dba, security-engineer, sre, network-engineer,
platform-engineer, observability-engineer, home-automation-engineer.
Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status,
authentik-audit, oom-investigator, resource-report, dns-check, network-health,
nfs-health, truenas-status, platform-status, monitoring-health.
Also: known-issues.md suppression list, cluster-health-checker port-forward fix.
2026-03-15 02:01:07 +00:00
Viktor Barzin
fca4e02c54 prometheus: increase memory to 4Gi and probe delays for TSDB compaction
Compaction of 5 years of TSDB blocks was OOM-killing at 3Gi (18 restarts
in 8h), causing sustained IO pressure on the PVE host spinning disk.
Increase liveness probe delay to 300s so WAL replay completes before
the probe kills the pod.
2026-03-15 01:44:52 +00:00
Viktor Barzin
43b49f7f6c cluster recovery: fix resource limits and node1 memory
- nvidia quota: requests.memory 8Gi → 12Gi (unblock cuda-validator)
- calibre: startup probe initial_delay 60→120s, timeout 1→5s,
  wait_for_rollout=false (DOCKER_MODS install takes 10+ min)
- immich ML: memory 2Gi → 4Gi (OOMKilled loading CLIP models)

Also done outside TF (not in this commit):
- node1 VM: 16 GiB → 24 GiB RAM (Proxmox)
- tigera-operator: kubectl patch 128→256Mi
- nvidia-driver-daemonset: kubectl patch 1→4Gi memory
- kyverno reports-controller: kubectl patch 128→256Mi
- CNPG operator: kubectl rollout restart
2026-03-15 01:44:28 +00:00
Viktor Barzin
a3c198e10e add AUTH_SECRET and ALLOWED_ORIGIN env vars to novelapp deployment
AUTH_SECRET sourced from Vault (secret/novelapp) via K8s secret,
ALLOWED_ORIGIN set to https://novelapp.viktorbarzin.me.
2026-03-15 00:33:38 +00:00
Viktor Barzin
29032e0b6b fix vaultwarden backup image: use docker.io/library/alpine for Kyverno 2026-03-15 00:16:31 +00:00
Viktor Barzin
6f562b5da6 add vaultwarden daily backup CronJob to NFS
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
2026-03-15 00:03:59 +00:00
Viktor Barzin
46afa85b01 fix openclaw config mount and OOM: use init container, increase memory to 2Gi
- Replace subPath ConfigMap mount with init container that copies openclaw.json
  to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
2026-03-14 23:42:17 +00:00
Viktor Barzin
916aa6c6cb update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip] 2026-03-14 23:42:17 +00:00
Viktor Barzin
92cc3f01c1 migrate vaultwarden storage from NFS to iSCSI
SQLite on NFS causes DB corruption due to unreliable POSIX fcntl locking.
iSCSI provides a block device with a local filesystem where locking works
correctly. Same approach used for Redis, MySQL, PostgreSQL, etc.
2026-03-14 23:42:17 +00:00
Viktor Barzin
7e72a10848 exclude manifest requests from nginx registry cache
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
2026-03-14 23:42:17 +00:00
root
0d01b3d1f3 Woodpecker CI deploy commit [CI SKIP] 2026-03-14 22:11:47 +00:00
Viktor Barzin
23019da8e5 equalize memory req=lim across 70+ containers using Prometheus 7d max data
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.

Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes

Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
2026-03-14 21:46:49 +00:00
Viktor Barzin
eb0301b02b lower memory limits closer to actual usage
openclaw: 1536Mi -> 768Mi, affine: 256Mi -> 128Mi, rybbit: 512Mi -> 384Mi.
Also patched via kubectl: aiostreams, cloudflared, crowdsec, uptime-kuma,
vaultwarden, pgadmin, phpmyadmin, goflow2, sealed-secrets, ebook2audiobook.
2026-03-14 21:15:26 +00:00
Viktor Barzin
843ba0ae8f scale down unused/over-replicated services
- osm-routing (otp, osrm-bicycle, osrm-foot): replicas=0, 0Mi actual usage
- dashy: replicas=0, redundant with homepage
- echo: 5 -> 1 replica
- networking-toolbox: 3 -> 1 replica
- travel-blog: 3 -> 1 replica
- blog: 3 -> 1 replica

Saves ~3.5Gi memory requests.
2026-03-14 21:12:44 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
Viktor Barzin
2c296d4d7c add novelapp deployment [ci skip]
Deploy NovelApp (web novel reading tracker) to k8s cluster.
- Namespace: novelapp, tier: aux
- iSCSI PVC for SQLite persistence
- Ingress at novelapp.viktorbarzin.me
- Browser scraping disabled
2026-03-14 18:51:14 +00:00
Viktor Barzin
98d7c2a4a5 fix: resolve HCL semicolons and vault-platform dependency cycle
- Replace semicolons with newlines in vault/main.tf variable blocks
  (HCL does not support semicolons)
- Remove dependency "vault" from platform/terragrunt.hcl to break
  cycle (vault already depends on platform)
2026-03-14 17:37:25 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
39b7dac1a9 fix: bump openclaw memory limit to 1536Mi
Was hitting V8 heap OOM at 768Mi during LLM orchestration.
2026-03-14 16:45:57 +00:00
Viktor Barzin
683361b55e fix: bump calibre memory limit to 512Mi
Calibre binary installation was timing out at 256Mi, leaving
the web server unable to start.
2026-03-14 16:38:09 +00:00
Viktor Barzin
8612eb3fc7 fix: bump affine migration init container memory to 512Mi
Init container was OOMKilled (137) with default 128Mi LimitRange limit.
Prisma/Node.js migrations need more memory.
2026-03-14 16:37:06 +00:00