infra

Author	SHA1	Message	Date
Viktor Barzin	f82ece8d1f	add Vault→Woodpecker secret sync CronJob (Part E) Syncs secrets from Vault KV at secret/ci/global to Woodpecker global secrets via REST API every 6 hours. Authenticates via K8s SA JWT (woodpecker-sync role). New repos just add secrets to Vault and use from_secret: in pipeline files. Also removes k8s-dashboard static admin token — use vault write kubernetes/creds/dashboard-admin instead.	2026-03-18 08:04:02 +00:00
Viktor Barzin	850ab5277f	migrate consuming stacks to ESO + remove k8s-dashboard static token Phase 9: ExternalSecret migration across 26 stacks: Fully migrated (vault data source removed, ESO delivers secrets): - speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor - n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge - hackmd (ESO template for DB URL), health (ESO template for DB URL) - trading-bot (ESO template for DATABASE_URL + 7 secret env vars) - forgejo (removed unused vault data source) Partially migrated (vault kept for plan-time, ESO added for runtime): - immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage) - claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs) - woodpecker, openclaw, resume (plan-time in helm values/jobs/modules) 17 stacks unchanged (all plan-time: homepage annotations, configmaps, module inputs) — vault data source works with OIDC auth. Phase 17a: Remove k8s-dashboard static admin token secret. Users now get tokens via: vault write kubernetes/creds/dashboard-admin	2026-03-18 08:04:02 +00:00
Viktor Barzin	0bae93a097	claude-memory: read DB password from Vault KV instead of tfvars Vault DB engine rotates the password every 24h, so the static tfvars value was stale. Now reads from secret/claude-memory db_password key.	2026-03-18 08:04:02 +00:00
Viktor Barzin	91e5d728a2	etcd defrag cronjob: add --command-timeout=60s Default 5s timeout causes defrag to fail on fragmented DBs. Discovered during manual defrag that took ~7s.	2026-03-18 08:04:02 +00:00
Viktor Barzin	c766d849f8	mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief	2026-03-18 08:04:02 +00:00
Viktor Barzin	750da49c80	fix openclaw init container: escape shell vars, fix image path [ci skip] - Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation) - Fix image: docker.io/alpine/git (not library/alpine/git) - Inline command instead of heredoc to avoid Terraform interpolation issues	2026-03-18 08:04:02 +00:00
Viktor Barzin	afe3a8bf8d	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-18 08:04:01 +00:00
Viktor Barzin	ffc04ef9f6	openclaw: replace cc-config NFS with dotfiles repo clone [ci skip] - Add init container "install-dotfiles" that clones the dotfiles repo and installs skills/agents/hooks to OpenClaw's home directory - Remove nfs_cc_config module and its volume mount - Skills/agents now come from the same chezmoi-managed dotfiles repo that manages the Mac config, eliminating the dual-sync problem	2026-03-18 08:04:01 +00:00
Viktor Barzin	0d596c57f5	fix immich TF drift from Kyverno ndots injection, right-size nvidia GPU operator - immich: add lifecycle ignore_changes for dns_config on all 3 deployments to prevent perpetual plan drift from Kyverno ndots:2 mutation policy - nvidia dcgm-exporter: 768Mi → 2560Mi (VPA upper 2091Mi, was under-provisioned) - nvidia cuda-validator: 1024Mi → 256Mi (one-shot job, vastly over-provisioned)	2026-03-18 08:04:01 +00:00
Viktor Barzin	57eccd8a81	vaultwarden: upgrade to 1.35.4, use Recreate strategy - Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug) - Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)	2026-03-18 08:04:01 +00:00
Viktor Barzin	24c14fd0c6	claude-memory: pin image to :17, fixes URL-decode crash on sync endpoint The + in +00:00 timezone offsets was being URL-decoded to a space, causing ValueError on the /api/memories/sync endpoint. Build :17 includes the fix. Using versioned tag instead of :latest to avoid pull-through cache serving stale images.	2026-03-18 08:04:01 +00:00
Viktor Barzin	9f1b3a53d3	right-size cluster memory: reduce overprovisioned, fix under-provisioned services Phase 1 - Quick wins (~4.5 Gi saved): - democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default) - caretta: 768Mi → 600Mi (VPA upper 485Mi) - immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin) - onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi) Phase 2 - Safety fixes (prevent OOMKills): - frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom) - openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement) Phase 3 - Additional right-sizing: - authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi) - shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase) Phase 4 - Burstable QoS for lower tiers: - tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit - tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit Phase 5 - Monitoring: - Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m) - Add ContainerNearOOM alert (>85% limit, 30m) - Add PodUnschedulable alert (5m, critical) Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.	2026-03-18 08:04:01 +00:00
Viktor Barzin	d6d6290fb7	fix: reduce openclaw memory requests for scheduling - openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi (limit 256Mi). Total request 1408Mi fits available capacity.	2026-03-18 08:04:01 +00:00
Viktor Barzin	bc9f5c3cf1	fix: openclaw policy violation + reduce memory requests for capacity - openclaw: fix Kyverno policy violation (node:22-alpine -> docker.io/library/node:22-alpine), reduce request to 1536Mi with 2Gi limit for overcommit - rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi) - stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)	2026-03-18 08:04:01 +00:00
Viktor Barzin	8e3d87587d	fix: increase memory for OOMKilled services - hackmd: 64Mi -> 256Mi (Node.js app OOMKilled after 14min) - n8n: limit 512Mi -> 768Mi (DB timeouts at 88% mem usage) - speedtest: 128Mi -> 256Mi (OOMKilled during startup) - shlink: limit 512Mi -> 768Mi (OOMKilled after startup)	2026-03-18 08:04:01 +00:00
Viktor Barzin	54642a0b94	fix: MySQL memory overcommit + shlink OOMKill - dbaas: MySQL requests 4Gi -> 2Gi (limits stay 4Gi) to free 6Gi of request capacity. Actual usage is 1-1.5Gi per instance. - url/shlink: increase memory limit 512Mi -> 768Mi (OOMKilled)	2026-03-18 08:04:01 +00:00
Viktor Barzin	4872bf2842	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-18 08:04:00 +00:00
root	3189c2bb35	Woodpecker CI deploy commit [CI SKIP]	2026-03-18 08:04:00 +00:00
Viktor Barzin	6f2f4c089c	fix cluster health: resolve 21/23 failures from healthcheck - nvidia: change GPU taint NoSchedule -> PreferNoSchedule to allow overflow scheduling on k8s-node1 (frees ~7Gi capacity) - kyverno: increase reports-controller memory 256Mi -> 512Mi (OOMKilled) - speedtest: add missing DB_PORT=3306 env var (nc: service "" unknown) - realestate-crawler: increase API memory 64Mi -> 256Mi (OOMKilled) - calibre: increase liveness probe timeout 1s -> 5s (false restarts)	2026-03-18 08:04:00 +00:00
Viktor Barzin	af3ba0306c	fix calibre slow startup: bake calibre binaries into image, skip chown on NFS Custom Docker image pre-installs the universal-calibre mod at build time, eliminating ~10 min apt-get on every container start. Added NO_CHOWN=true to skip recursive chown that hangs on NFS mounts. Tightened startup probe since pod now starts in ~2 min instead of 15-20 min.	2026-03-18 08:04:00 +00:00
Viktor Barzin	3be8fff082	prometheus: increase memory to 4Gi and probe delays for TSDB compaction Compaction of 5 years of TSDB blocks was OOM-killing at 3Gi (18 restarts in 8h), causing sustained IO pressure on the PVE host spinning disk. Increase liveness probe delay to 300s so WAL replay completes before the probe kills the pod.	2026-03-18 08:04:00 +00:00
Viktor Barzin	16ad9cd839	cluster recovery: fix resource limits and node1 memory - nvidia quota: requests.memory 8Gi → 12Gi (unblock cuda-validator) - calibre: startup probe initial_delay 60→120s, timeout 1→5s, wait_for_rollout=false (DOCKER_MODS install takes 10+ min) - immich ML: memory 2Gi → 4Gi (OOMKilled loading CLIP models) Also done outside TF (not in this commit): - node1 VM: 16 GiB → 24 GiB RAM (Proxmox) - tigera-operator: kubectl patch 128→256Mi - nvidia-driver-daemonset: kubectl patch 1→4Gi memory - kyverno reports-controller: kubectl patch 128→256Mi - CNPG operator: kubectl rollout restart	2026-03-18 08:04:00 +00:00
Viktor Barzin	a7a9c3d1e7	add AUTH_SECRET and ALLOWED_ORIGIN env vars to novelapp deployment AUTH_SECRET sourced from Vault (secret/novelapp) via K8s secret, ALLOWED_ORIGIN set to https://novelapp.viktorbarzin.me.	2026-03-18 08:04:00 +00:00
Viktor Barzin	fd482c08b2	fix vaultwarden backup image: use docker.io/library/alpine for Kyverno	2026-03-18 08:04:00 +00:00
Viktor Barzin	9acbcc7718	add vaultwarden daily backup CronJob to NFS SQLite backup via Online Backup API + copy of RSA keys, attachments, sends, and config. 30-day retention with rotation. Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.	2026-03-18 08:04:00 +00:00
Viktor Barzin	3c622659d8	fix openclaw config mount and OOM: use init container, increase memory to 2Gi - Replace subPath ConfigMap mount with init container that copies openclaw.json to writable NFS home (OpenClaw writes back to the file at runtime) - Remove invalid memory-api plugin references causing "Config invalid" - Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536 - Fix tg wrapper to inject -auto-approve when apply --non-interactive is used	2026-03-18 08:04:00 +00:00
Viktor Barzin	1ff8cef134	migrate vaultwarden storage from NFS to iSCSI SQLite on NFS causes DB corruption due to unreliable POSIX fcntl locking. iSCSI provides a block device with a local filesystem where locking works correctly. Same approach used for Redis, MySQL, PostgreSQL, etc.	2026-03-18 08:04:00 +00:00
root	2b6055591b	Woodpecker CI deploy commit [CI SKIP]	2026-03-18 08:04:00 +00:00
Viktor Barzin	15b0b26a05	equalize memory req=lim across 70+ containers using Prometheus 7d max data After node2 OOM incident, right-size memory across the cluster by setting requests=limits based on max_over_time(container_memory_working_set_bytes[7d]) with 1.3x headroom. Eliminates ~37Gi overcommit gap. Categories: - Safe equalization (50 containers): set req=lim where max7d well within target - Limit increases (8 containers): raise limits for services spiking above current - No Prometheus data (12 containers): conservatively set lim=req - Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql 4Gi limits across 3 replicas.	2026-03-18 08:04:00 +00:00
Viktor Barzin	dd5020e524	lower memory limits closer to actual usage openclaw: 1536Mi -> 768Mi, affine: 256Mi -> 128Mi, rybbit: 512Mi -> 384Mi. Also patched via kubectl: aiostreams, cloudflared, crowdsec, uptime-kuma, vaultwarden, pgadmin, phpmyadmin, goflow2, sealed-secrets, ebook2audiobook.	2026-03-18 08:04:00 +00:00
Viktor Barzin	5ead49e43e	scale down unused/over-replicated services - osm-routing (otp, osrm-bicycle, osrm-foot): replicas=0, 0Mi actual usage - dashy: replicas=0, redundant with homepage - echo: 5 -> 1 replica - networking-toolbox: 3 -> 1 replica - travel-blog: 3 -> 1 replica - blog: 3 -> 1 replica Saves ~3.5Gi memory requests.	2026-03-18 08:04:00 +00:00
Viktor Barzin	60173ac35c	right-size memory: set requests=limits based on actual usage - Set memory requests = limits across 56 stacks to prevent overcommit - Right-sized limits based on actual pod usage (2x actual, rounded up) - Scaled down trading-bot (replicas=0) to free memory - Fixed OOMKilled services: forgejo, dawarich, health, meshcentral, paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse - Added startup+liveness probes to calibre-web - Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192) Post node2 OOM incident (2026-03-14). Previous kubelet config had no kubeReserved/systemReserved set, allowing pods to starve the kernel.	2026-03-18 08:03:59 +00:00
Viktor Barzin	41131e58ba	add novelapp deployment [ci skip] Deploy NovelApp (web novel reading tracker) to k8s cluster. - Namespace: novelapp, tier: aux - iSCSI PVC for SQLite persistence - Ingress at novelapp.viktorbarzin.me - Browser scraping disabled	2026-03-18 08:03:59 +00:00
Viktor Barzin	0b4370a2d8	fix: resolve HCL semicolons and vault-platform dependency cycle - Replace semicolons with newlines in vault/main.tf variable blocks (HCL does not support semicolons) - Remove dependency "vault" from platform/terragrunt.hcl to break cycle (vault already depends on platform)	2026-03-18 08:03:59 +00:00
Viktor Barzin	63c20f23ed	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-18 08:03:59 +00:00
Viktor Barzin	cf532c50bd	fix: bump openclaw memory limit to 1536Mi Was hitting V8 heap OOM at 768Mi during LLM orchestration.	2026-03-18 08:03:59 +00:00
Viktor Barzin	47c828eb78	fix: bump calibre memory limit to 512Mi Calibre binary installation was timing out at 256Mi, leaving the web server unable to start.	2026-03-18 08:03:59 +00:00
Viktor Barzin	e7f661bd45	fix: bump affine migration init container memory to 512Mi Init container was OOMKilled (137) with default 128Mi LimitRange limit. Prisma/Node.js migrations need more memory.	2026-03-18 08:03:59 +00:00
Viktor Barzin	7078f60c1a	scale down ollama-ui, netbox, tandoor to free cluster memory Disabled after 2026-03-14 node2 OOM incident. Frees ~5GB memory limits.	2026-03-18 08:03:59 +00:00
Viktor Barzin	9490f1afad	fix: eliminate memory overcommit to prevent node OOM crashes Set requests = limits (Guaranteed QoS) across LimitRange defaults and explicit pod resources. Node2 crashed 2026-03-14 from 250% memory overcommit (61GB limits on 24GB node). Changes: - LimitRange: default = defaultRequest for all 6 tiers - Grafana: 3 → 2 replicas - Grampsweb: document why replicas=0 - Prometheus: 1Gi/4Gi → 3Gi/3Gi - OpenClaw: 512Mi/2Gi → 768Mi/768Mi - Immich server: 256Mi/2Gi → 512Mi/512Mi - Immich postgresql: 256Mi/1Gi → 512Mi/512Mi - Calibre: 256Mi/1536Mi → 256Mi/256Mi - Linkwarden: 256Mi/1536Mi → 768Mi/768Mi - N8N: 256Mi/1Gi → 512Mi/512Mi - MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi - pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi - DBaaS ResourceQuota limits.memory: 64Gi → 12Gi [ci skip]	2026-03-18 08:03:59 +00:00
Viktor Barzin	142ed67875	Hide Vault OIDC from main login dropdown OIDC popup flow hangs due to Authentik X-Frame-Options. Keep OIDC accessible via the "Other" tab instead.	2026-03-18 08:03:59 +00:00
Viktor Barzin	673c7adc89	Add Vault OIDC authentication via Authentik Configure Vault to use Authentik as OIDC identity provider for SSO login. Creates OAuth2 provider/application in Authentik, adds OIDC auth backend, admin policy, and maps "authentik Admins" group to full vault-admin access.	2026-03-18 08:03:59 +00:00
Viktor Barzin	240feda408	Reduce downtime during platform stack applies CrowdSec fixes: - Increase ResourceQuota requests.cpu 1→4 (was at 302%, blocking upgrades) - Add LAPI startupProbe: 30 attempts × 10s = 5min startup window (LAPI pods were failing default startup probe during rolling upgrades) - Reduce Helm timeout 3600s→900s with wait=true, wait_for_jobs=true Prometheus startup guard on 8 rate-based alerts: - PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate - Suppresses false positives for 15m after Prometheus restart	2026-03-18 08:03:59 +00:00
Viktor Barzin	a66a8d0de2	Reduce downtime during platform stack applies CrowdSec Helm fix: - Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302% of quota, preventing scheduling during rolling upgrades - Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive - Add wait=true and wait_for_jobs=true for proper readiness checking Prometheus startup guard: - Add startup guard to 8 rate/increase-based alerts that false-fire after Prometheus restarts (needs 2 scrapes for rate() to work): PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate - Guard: and on() (time() - process_start_time_seconds) > 900 suppresses alerts for 15m after Prometheus startup	2026-03-18 08:03:59 +00:00
Viktor Barzin	8557d492db	Fix NFSServerUnresponsive false positives Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile: - rate() returns nothing after Prometheus restarts (needs 2 scrapes) - Individual nodes show zero NFS rate during scrape gaps or low activity - The sum() could hit zero during quiet hours + scrape gaps New expression uses: - changes() instead of rate() — works with a single scrape - Per-instance aggregation: count nodes with any NFS counter change - Threshold < 2 nodes: single-node restarts won't trigger, real NFS outage (all nodes affected) will - Prometheus startup guard: skip first 15m after restart to avoid false positives from empty TSDB - Wider 15m changes() window to smooth out scrape gaps	2026-03-18 08:03:59 +00:00
Viktor Barzin	df44601a36	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards Noise reduction (8 alerts tuned): - PoisonFountainDown: 2m→5m, critical→warning (fail-open service) - NodeExporterDown: 2m→5m (flaps during node restarts) - PowerOutage: add for:1m (debounce transient voltage dips) - New Tailscale client: add for:5m (debounce headscale reauths) - NoNodeLoadData: use absent() instead of OR vector(0)==0 - NodeHighCPUUsage: 30%→60% (normal for 70+ services) - HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading) - PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB) Alert regrouping: - Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health" - Move New Tailscale client → "Infrastructure Health" New alerts (14): - Networking: Cloudflared (2), MetalLB (2), Technitium DNS - Storage: NFS CSI, iSCSI CSI controllers - Critical Services: PgBouncer, CNPG operator, MySQL operator - Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker Inhibit rules: - Consolidate 3 NodeDown rules into 1 comprehensive rule - Extend NFS rule to suppress NFS-dependent services - Add PowerOutage → downstream suppression Dashboard loading: - Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards - Remove duplicate caretta dashboard ConfigMap from caretta.tf	2026-03-18 08:03:59 +00:00
Viktor Barzin	7ee4bbe5b6	feat(claude-memory): add stack and update image to standalone repo - Add claude-memory stack (was previously untracked) - Update Docker image from viktorbarzin/claude-memory to viktorbarzin/claude-memory-mcp (standalone open-source repo) - CI/CD now lives in the standalone repo's .woodpecker.yml	2026-03-18 08:03:58 +00:00
Viktor Barzin	69b513992a	Right-size CPU requests cluster-wide and remove missed CPU limits Increase requests for under-requested pods (dashy 50m→250m, frigate 500m→1500m, clickhouse 100m→500m, otp 100m→300m, linkwarden 25m→50m, authentik worker 50m→100m). Reduce requests for over-requested pods (crowdsec agent/lapi 500m→25m each, prometheus 200m→100m, dbaas mysql 1800m→100m, pg-cluster 250m→50m, shlink-web 250m→10m, gpu-pod-exporter 50m→10m, stirling-pdf 100m→25m, technitium 100m→25m, celery 50m→15m). Reduce crowdsec quota from 8→1 CPU. Remove missed CPU limits in prometheus (cpu: "2") and dbaas (cpu: "3600m") tpl files.	2026-03-18 08:03:58 +00:00
Viktor Barzin	28ac1382d1	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-18 08:03:58 +00:00
Viktor Barzin	1eccf0363e	Nextcloud performance tuning and fix backup cron job - Set loglevel=2 (warnings) and disable mail_smtpdebug via configs - Enable opcache.enable_file_override for faster file checks - Increase APCu shared memory from 32M to 128M - Fix broken module.nfs_nextcloud_data reference in backup cron job to use the iSCSI PVC directly	2026-03-18 08:03:58 +00:00
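
Note on the memory right-sizing commits (9490f1afad, 15b0b26a05): the arithmetic they describe can be sketched as below. This is a minimal illustration, not code from the repository. The function names, the 64Mi rounding step, and the 500Mi sample input are assumptions made for the example; only the 61GB/24GB figures and the 1.3x headroom factor come from the commit messages.

```python
import math


def overcommit_percent(total_limits_gb: float, node_capacity_gb: float) -> float:
    """Sum of memory limits on a node as a percentage of its capacity."""
    return 100.0 * total_limits_gb / node_capacity_gb


def right_sized_limit_mi(max_7d_working_set_mi: float,
                         headroom: float = 1.3,
                         step_mi: int = 64) -> int:
    """Scale the 7-day max working set by a headroom factor.

    Rounding up to a multiple of step_mi is an assumption of this sketch,
    not something the commit messages specify.
    """
    target = max_7d_working_set_mi * headroom
    return int(math.ceil(target / step_mi) * step_mi)


# Figures from commit 9490f1afad: 61GB of limits on a 24GB node.
node2_overcommit = overcommit_percent(61, 24)

# Hypothetical container whose 7-day max working set is 500Mi.
example_limit = right_sized_limit_mi(500)

print(f"node2 overcommit: {node2_overcommit:.0f}%")
print(f"request = limit = {example_limit}Mi")
```

Setting the request equal to the computed limit is what gives a pod the Guaranteed QoS class referred to in 9490f1afad.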

258 commits