45 commits

Author SHA1 Message Date
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
1acf8cc4e8 migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
Viktor Barzin
1fe7798609 fix openclaw init container: escape shell vars, fix image path [ci skip]
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation)
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
2026-03-15 17:19:03 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
deeea5edab openclaw: replace cc-config NFS with dotfiles repo clone [ci skip]
- Add init container "install-dotfiles" that clones the dotfiles repo
  and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
  that manages the Mac config, eliminating the dual-sync problem
2026-03-15 16:04:02 +00:00
Viktor Barzin
194281e527 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-15 15:30:18 +00:00
Viktor Barzin
18d012db11 fix: reduce openclaw memory requests for scheduling
- openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi
  (limit 256Mi). Total request 1408Mi fits available capacity.
2026-03-15 10:47:34 +00:00
Viktor Barzin
56ddee457a fix: openclaw policy violation + reduce memory requests for capacity
- openclaw: fix Kyverno policy violation (node:22-alpine ->
  docker.io/library/node:22-alpine), reduce request to 1536Mi
  with 2Gi limit for overcommit
- rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi)
- stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)
2026-03-15 10:37:58 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
Viktor Barzin
6f562b5da6 add vaultwarden daily backup CronJob to NFS
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
2026-03-15 00:03:59 +00:00
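The backup described in 6f562b5da6 (Online Backup API plus 30-day rotation) can be sketched roughly as below. This is an illustrative sketch only, not the CronJob's actual script: the function name, the `db-*.sqlite3` file naming, and the directory layout are assumptions, and the real job also copies RSA keys, attachments, sends, and config, which is omitted here.

```python
import sqlite3
import time
from pathlib import Path


def backup_sqlite(src_path: str, backup_dir: str, retention_days: int = 30) -> Path:
    """Copy a live SQLite database via the Online Backup API, then prune old copies.

    Hypothetical helper: names and file layout are assumptions for illustration.
    """
    out_dir = Path(backup_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"db-{time.strftime('%Y%m%d-%H%M%S')}.sqlite3"

    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(str(target))
    try:
        # Connection.backup uses SQLite's Online Backup API, so the copy is
        # consistent even if the source database is being written to.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # Rotation: remove earlier backups whose mtime falls outside the window.
    cutoff = time.time() - retention_days * 86400
    for old in out_dir.glob("db-*.sqlite3"):
        if old != target and old.stat().st_mtime < cutoff:
            old.unlink()
    return target
```

A plain file copy of a SQLite database that is being written to can produce a torn snapshot, which is presumably why the commit uses the backup API rather than cp.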
Viktor Barzin
46afa85b01 fix openclaw config mount and OOM: use init container, increase memory to 2Gi
- Replace subPath ConfigMap mount with init container that copies openclaw.json
  to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
2026-03-14 23:42:17 +00:00
Viktor Barzin
eb0301b02b lower memory limits closer to actual usage
openclaw: 1536Mi -> 768Mi, affine: 256Mi -> 128Mi, rybbit: 512Mi -> 384Mi.
Also patched via kubectl: aiostreams, cloudflared, crowdsec, uptime-kuma,
vaultwarden, pgadmin, phpmyadmin, goflow2, sealed-secrets, ebook2audiobook.
2026-03-14 21:15:26 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
39b7dac1a9 fix: bump openclaw memory limit to 1536Mi
Was hitting V8 heap OOM at 768Mi during LLM orchestration.
2026-03-14 16:45:57 +00:00
Viktor Barzin
2be858f616 fix: eliminate memory overcommit to prevent node OOM crashes
Set requests = limits (Guaranteed QoS) across LimitRange defaults and
explicit pod resources. Node2 crashed 2026-03-14 from 250% memory
overcommit (61GB limits on 24GB node).

Changes:
- LimitRange: default = defaultRequest for all 6 tiers
- Grafana: 3 → 2 replicas
- Grampsweb: document why replicas=0
- Prometheus: 1Gi/4Gi → 3Gi/3Gi
- OpenClaw: 512Mi/2Gi → 768Mi/768Mi
- Immich server: 256Mi/2Gi → 512Mi/512Mi
- Immich postgresql: 256Mi/1Gi → 512Mi/512Mi
- Calibre: 256Mi/1536Mi → 256Mi/256Mi
- Linkwarden: 256Mi/1536Mi → 768Mi/768Mi
- N8N: 256Mi/1Gi → 512Mi/512Mi
- MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi
- pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi
- DBaaS ResourceQuota limits.memory: 64Gi → 12Gi

[ci skip]
2026-03-14 16:01:41 +00:00
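The overcommit figure quoted in 2be858f616 is just total memory limits divided by node capacity. A minimal sketch of that arithmetic follows; the function name is made up for illustration.

```python
def overcommit_percent(total_limits_gb: float, node_capacity_gb: float) -> float:
    """Return memory limits as a percentage of node capacity.

    Anything above 100 means the limits on the node exceed what it can
    physically provide, i.e. the node is overcommitted.
    """
    if node_capacity_gb <= 0:
        raise ValueError("node capacity must be positive")
    return 100.0 * total_limits_gb / node_capacity_gb


# Figures from the commit message: 61GB of limits on a 24GB node.
ratio = overcommit_percent(61, 24)
print(f"{ratio:.0f}%")  # prints "254%"
```

61 / 24 is about 2.54, which matches the roughly 250% stated in the commit. With requests = limits, the scheduler's request accounting caps the sum of limits at node allocatable, so this ratio cannot exceed 100% for pods placed by the scheduler.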
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
76a4987eef [ci skip] add Forgejo task pipeline for OpenClaw AI agent
Forgejo issues as a task queue for OpenClaw:
- Forgejo OAuth2 with Authentik SSO, self-registration disabled
- Webhook-triggered task processing (instant) + CronJob backup (5min poll)
- Tasks processed via Mistral Large 3 (NVIDIA NIM API)
- Results posted as issue comments, auto-labeled and closed
- Comment follow-ups and reopened issues supported
- n8n RBAC for OpenClaw pod exec (future workflow integration)
2026-03-07 21:11:07 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
0abae33c71 [ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
9e4fb23b10 [ci skip] right-size all pod resources based on VPA + live metrics audit
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.

Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi

Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
  excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
  navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
  tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
  cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
  k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
  immich-postgresql, osrm-foot

GPU containers: added CPU/mem alongside GPU limits:
- ollama: removed CPU/mem limits (models vary in size), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU

Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage

Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
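The sizing rule stated at the end of 9e4fb23b10, limits = max(VPA upper * 2, live usage * 2), can be written as a one-line helper. This is a sketch of the formula only; the function name and the `factor` parameter are assumptions, and both inputs must be in the same unit (for example Mi or millicores).

```python
def recommended_limit(vpa_upper: float, live_usage: float, factor: float = 2.0) -> float:
    """Apply the rule from the commit: limit = max(VPA upper * 2, live usage * 2).

    vpa_upper  -- Goldilocks/VPA upper-bound recommendation
    live_usage -- usage observed via kubectl top
    Both values must use the same unit; the result is in that unit too.
    """
    return max(vpa_upper * factor, live_usage * factor)


# Hypothetical container: VPA upper bound 485Mi, live usage 300Mi.
print(recommended_limit(485, 300))  # prints 970.0
```

Taking the max of the two sources means a stale or low VPA recommendation cannot push the limit below twice what the container is actually using.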
Viktor Barzin
ccf0b2232f [ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
80dfc58fea [ci skip] openclaw: fix workspace permissions — chown to node user
Init container clones repo as root but main container runs as node (UID 1000).
Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.
2026-03-01 17:20:36 +00:00
Viktor Barzin
f203e7bd2c [ci skip] openclaw: set workspace + enable elevated + native commands
- Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace)
- Enable tools.elevated for unrestricted access
- Enable commands.native and commands.nativeSkills
- All tools, commands, and skills now fully accessible
2026-03-01 17:12:03 +00:00
Viktor Barzin
b2ac69e12b [ci skip] openclaw: disable sandbox mode for unrestricted execution
- Set agents.defaults.sandbox.mode = off
- Combined with exec.host=gateway and exec.security=full,
  OpenClaw can now run any command on the container host
2026-03-01 16:51:35 +00:00
Viktor Barzin
99881b28e3 [ci skip] openclaw: fix exec host — use gateway instead of node
host=node requires a companion app (not available in container).
host=gateway runs commands directly on the gateway process host.
2026-03-01 16:47:14 +00:00
Viktor Barzin
6efc1e56c0 [ci skip] openclaw: fix exec config — use host=node, security=full
Valid options: host=sandbox|gateway|node, security=deny|allowlist|full.
Using node (run on container host) with full (no command restrictions).
2026-03-01 16:42:22 +00:00
Viktor Barzin
c83f3aab90 [ci skip] openclaw: disable sandbox, run commands on container host
- exec.host: sandbox → local (run directly on container, no Docker sandbox)
- exec.security: full → off (no restrictions on command execution)
2026-03-01 16:18:53 +00:00
Viktor Barzin
b10d43b7a7 [ci skip] openclaw: persist home directory on NFS
- Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home)
- Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state,
  device identity, and all runtime files across pod restarts
- Init container still refreshes openclaw.json and kubeconfig on each start
2026-03-01 16:12:07 +00:00
Viktor Barzin
0f7e7e5969 [ci skip] openclaw: remove all tool/command restrictions
- Set tools.deny = [] (was blocking sessions, subagents, browser)
- All tools now available: sessions, subagents, browser, etc.
2026-03-01 15:58:12 +00:00
Viktor Barzin
f031a6bcf6 [ci skip] openclaw: add modelrelay sidecar as fallback model router
- Deploy modelrelay as sidecar container (auto-routes to fastest free model)
- Configured with NVIDIA NIM + OpenRouter API keys
- Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM),
  Fallback 2: modelrelay/auto-fastest (80+ free models)
- Modelrelay web UI available at pod:7352
2026-03-01 15:57:31 +00:00
Viktor Barzin
207164050c [ci skip] openclaw: fix Telegram, update to v2026.2.26, fix startup issues
- Update OpenClaw from v2026.2.9 to v2026.2.26 (fixes Telegram channel)
- Add gateway.mode=local + wizard block (required for channel startup)
- Add dangerouslyAllowHostHeaderOriginFallback (v2026.2.26 requirement)
- Run doctor --fix at container startup to auto-enable Telegram
- Create required dirs (canvas, devices, cron, sessions, credentials)
- Fix permissions: chown -R 1000:1000 for node user
- Telegram: DM allowlist, user 8281953845 only
2026-03-01 15:47:54 +00:00
Viktor Barzin
0da6f90ad2 [ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off
- Set explicit CPU (2 cores) and memory (2Gi) limits
  Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway
  to take 5+ minutes to start, and 1Gi memory caused OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
  during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off)
2026-03-01 14:44:22 +00:00
Viktor Barzin
e8ff760aff [ci skip] openclaw: cache tools on NFS for fast restarts
- Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools)
- Skip download of kubectl, terraform, terragrunt, pip packages if cached
- Startup time: ~2.5min → ~38s on subsequent restarts
2026-03-01 13:59:07 +00:00
Viktor Barzin
e728f4c106 [ci skip] openclaw: add Telegram channel + install terragrunt in init container
- Add Telegram bot integration (DM allowlist, user 8281953845 only)
- Install terragrunt v0.99.4 in init container alongside terraform
- Remove terraform init from init (terragrunt handles this per-stack)

- Add openclaw_telegram_bot_token variable
2026-03-01 13:44:58 +00:00
Viktor Barzin
014f6cad5a [ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
2026-03-01 13:22:47 +00:00
Viktor Barzin
f64c979ba5 [ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
ddb293b2b7 [ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug
Change cluster-healthcheck CronJob from every 30min to every 8h.
Replace fragile sed-based audit config in apiserver manifest with
idempotent Python script that deduplicates by name/mountPath,
preventing the duplicate volume entries that crashed the API server.
2026-02-22 23:18:30 +00:00
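The idea behind the fix in ddb293b2b7 (add the audit entries, then deduplicate by name/mountPath so reruns are harmless) might look roughly like this. It is not the script from the commit: the function names, the manifest shape handled, and the assumption that the API server is the first container are all illustrative.

```python
from typing import Any


def dedupe_by_key(items: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Keep the first item for each distinct value of `key`, preserving order."""
    seen = set()
    result = []
    for item in items:
        if item[key] in seen:
            continue
        seen.add(item[key])
        result.append(item)
    return result


def ensure_audit_config(manifest: dict, volume: dict, mount: dict) -> dict:
    """Add the audit volume and mount if missing; safe to run repeatedly.

    Assumes a static-pod manifest already parsed into a dict, with the
    API server as the first container (an assumption for this sketch).
    """
    spec = manifest["spec"]
    spec["volumes"] = dedupe_by_key(spec.get("volumes", []) + [volume], "name")
    container = spec["containers"][0]
    container["volumeMounts"] = dedupe_by_key(
        container.get("volumeMounts", []) + [mount], "mountPath"
    )
    return manifest
```

Unlike an append-only sed edit, running this twice yields the same manifest as running it once, and it also collapses duplicates left behind by earlier runs.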
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
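The state rewrite mentioned in c7c7047f1c (strip the module.<name>. prefix) could be sketched as below. The commit does not say how it was done; `terraform state mv` is the supported route. This sketch assumes the version 4 state format, where each resource carries an optional "module" field and instances may list "dependencies" addresses, and the function name is hypothetical.

```python
def strip_module_prefix(state: dict, module_name: str) -> dict:
    """Move resources out of module.<module_name> into the parent module.

    Operates in place on a parsed state dict (assumed format version 4)
    and bumps the serial so the edited state supersedes the old one.
    """
    prefix = f"module.{module_name}"
    for resource in state.get("resources", []):
        module = resource.get("module")
        if module == prefix:
            # Resource was directly inside the wrapper: it is now at the root.
            del resource["module"]
        elif module and module.startswith(prefix + "."):
            # Nested module: drop only the wrapper segment.
            resource["module"] = module[len(prefix) + 1:]
        for instance in resource.get("instances", []):
            deps = instance.get("dependencies")
            if deps:
                instance["dependencies"] = [
                    d[len(prefix) + 1:] if d.startswith(prefix + ".") else d
                    for d in deps
                ]
    state["serial"] = state.get("serial", 0) + 1
    return state
```

Whatever method is used, the check reported in the commit is the right one: a plan showing 0 add, 0 destroy confirms the addresses in state match the flattened configuration.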
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
945a5f35b0 [ci skip] Fix path.root references for git-crypt key in openclaw and drone
Modules used filebase64("${path.root}/.git/git-crypt/keys/default")
which breaks with Terragrunt since path.root is now stacks/<service>/
instead of repo root. Changed to accept git_crypt_key_base64 variable
and resolve the path in the stack wrapper.
2026-02-22 14:01:02 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00