infra

Author	SHA1	Message	Date
Viktor Barzin	32762a0916	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	304b5e4b3d	[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting New skill documenting the NFSv4 idmapd UID mapping crisis where all file UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody. Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.	2026-03-01 18:14:37 +00:00
Viktor Barzin	7a467c75ae	[ci skip] add openclaw-k8s-deployment skill from claudeception Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes: - wizard block required for Telegram, exec.host valid values, - VPA resource overrides, file permissions, startup command, - modelrelay sidecar, NFS caching strategy	2026-03-01 18:10:33 +00:00
Viktor Barzin	ae9565c3e6	[ci skip] onlyoffice: revert font cache NFS mounts, rebuild on startup NFS font caching caused issues. Reverted to default GENERATE_FONTS=true with 8 CPU burst limit for fast regeneration on startup.	2026-03-01 18:07:37 +00:00
Viktor Barzin	b8b41a9408	[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings	2026-03-01 18:07:04 +00:00
Viktor Barzin	a82f86b3e4	[ci skip] onlyoffice: cache fonts/themes on NFS for fast restarts Persist font cache (159MB) and theme images (10MB) to NFS volume. Set GENERATE_FONTS=false to skip regeneration on startup since cache is warm. Startup time: ~3 min -> 5 seconds.	2026-03-01 18:02:38 +00:00
Viktor Barzin	81e128acc4	[ci skip] onlyoffice: bump CPU limit to 8, add custom LimitRange/Quota Startup was throttled by allthemesgen and font generation hitting 2 CPU ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU) and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.	2026-03-01 17:58:26 +00:00
Viktor Barzin	07874f8021	[ci skip] onlyoffice: disable VPA to prevent CPU resource override Goldilocks VPA in Initial mode was overriding the explicit 2 CPU limit down to 700m, throttling the document server. Set vpa-update-mode=off.	2026-03-01 17:55:06 +00:00
Viktor Barzin	beec5acbc7	[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota opt-out label and ResourceQuota allowing 32 CPU limits to accommodate the 16 CPU container limit plus sidecar defaults.	2026-03-01 17:41:18 +00:00
Viktor Barzin	ecc3445860	[ci skip] openclaw: fix workspace permissions — chown to node user Init container clones repo as root but main container runs as node (UID 1000). Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.	2026-03-01 17:20:36 +00:00
Viktor Barzin	79af6fff47	[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory - dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD list/watch needed by kopf framework in sidecar containers - kyverno: restrict inject-priority-class-from-tier to CREATE operations only (was blocking pod patches with immutable spec error) - kyverno: add resource-governance/custom-limitrange label opt-out to LimitRange generation policy (mirrors existing custom-quota) - nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange with 8Gi max, opt out of Kyverno-managed LimitRange	2026-03-01 17:16:03 +00:00
Viktor Barzin	a8da2e3790	[ci skip] redis: pin service to master pod to fix read-only errors The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas). Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only replicas, causing write failures. Pin the service to redis-node-0 (master).	2026-03-01 17:13:25 +00:00
Viktor Barzin	d7f031bc5f	[ci skip] openclaw: set workspace + enable elevated + native commands - Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace) - Enable tools.elevated for unrestricted access - Enable commands.native and commands.nativeSkills - All tools, commands, and skills now fully accessible	2026-03-01 17:12:03 +00:00
Viktor Barzin	09024168ba	[ci skip] openclaw: disable sandbox mode for unrestricted execution - Set agents.defaults.sandbox.mode = off - Combined with exec.host=gateway and exec.security=full, OpenClaw can now run any command on the container host	2026-03-01 16:51:35 +00:00
Viktor Barzin	8fddc076d7	[ci skip] openclaw: fix exec host — use gateway instead of node host=node requires a companion app (not available in container). host=gateway runs commands directly on the gateway process host.	2026-03-01 16:47:14 +00:00
Viktor Barzin	7c300186f5	[ci skip] calibre: increase resources to 1 CPU / 1Gi to fix sluggish web UI Was getting LimitRange defaults (250m CPU, 256Mi) causing throttling.	2026-03-01 16:42:35 +00:00
Viktor Barzin	e82bab4b10	[ci skip] openclaw: fix exec config — use host=node, security=full Valid options: host=sandbox|gateway|node, security=deny|allowlist|full. Using node (run on container host) with full (no command restrictions).	2026-03-01 16:42:22 +00:00
Viktor Barzin	66d3b55b23	[ci skip] openclaw: disable sandbox, run commands on container host - exec.host: sandbox → local (run directly on container, no Docker sandbox) - exec.security: full → off (no restrictions on command execution)	2026-03-01 16:18:53 +00:00
Viktor Barzin	ea8d68f3a8	[ci skip] shlink: increase memory limit to 512Mi to prevent OOMKill Shlink with GeoLite2 DB requires more than the 256Mi LimitRange default.	2026-03-01 16:13:50 +00:00
Viktor Barzin	e9dbb0e82e	[ci skip] openclaw: persist home directory on NFS - Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home) - Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state, device identity, and all runtime files across pod restarts - Init container still refreshes openclaw.json and kubeconfig on each start	2026-03-01 16:12:07 +00:00
Viktor Barzin	d71b1ac974	[ci skip] openclaw: remove all tool/command restrictions - Set tools.deny = [] (was blocking sessions, subagents, browser) - All tools now available: sessions, subagents, browser, etc.	2026-03-01 15:58:12 +00:00
Viktor Barzin	13e75711db	[ci skip] openclaw: add modelrelay sidecar as fallback model router - Deploy modelrelay as sidecar container (auto-routes to fastest free model) - Configured with NVIDIA NIM + OpenRouter API keys - Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM), Fallback 2: modelrelay/auto-fastest (80+ free models) - Modelrelay web UI available at pod:7352	2026-03-01 15:57:31 +00:00
Viktor Barzin	30702057f9	[ci skip] openclaw: fix Telegram, update to v2026.2.26, fix startup issues - Update OpenClaw from v2026.2.9 to v2026.2.26 (fixes Telegram channel) - Add gateway.mode=local + wizard block (required for channel startup) - Add dangerouslyAllowHostHeaderOriginFallback (v2026.2.26 requirement) - Run doctor --fix at container startup to auto-enable Telegram - Create required dirs (canvas, devices, cron, sessions, credentials) - Fix permissions: chown -R 1000:1000 for node user - Telegram: DM allowlist, user 8281953845 only	2026-03-01 15:47:54 +00:00
Viktor Barzin	d02d6ad356	[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster	2026-03-01 15:47:11 +00:00
Viktor Barzin	80fbff083b	[ci skip] f1-stream: use latest tag, CI manages image via kubectl set image	2026-03-01 15:15:14 +00:00
Viktor Barzin	3c92c38d51	[ci skip] f1-stream: bump image to v5.2.0	2026-03-01 15:13:38 +00:00
Viktor Barzin	82d63a10ef	[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition	2026-03-01 15:05:57 +00:00
Viktor Barzin	388ba006fa	[ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off - Set explicit CPU (2 cores) and memory (2Gi) limits Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway to take 5+ minutes to start, and 1Gi memory caused OOM crashes - Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway during startup (Traefik was routing before gateway was listening) - Disable Goldilocks VPA via namespace label (vpa-update-mode: off)	2026-03-01 14:44:22 +00:00
Viktor Barzin	d95f753aa8	remove f1-stream CI pipeline (moved to separate repo) The f1-stream project now lives at github.com/ViktorBarzin/f1-stream with its own Woodpecker CI pipeline.	2026-03-01 14:43:06 +00:00
Viktor Barzin	00d3bb2fd1	[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain	2026-03-01 14:35:53 +00:00
Viktor Barzin	b7aa39353c	f1-stream: add real F1 stream extractors and iframe player support Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live F1 streams. Extend ExtractedStream model with stream_type/embed_url fields, skip health checks for embed streams, fix broken Akamai demo stream, add variant playlist validation, and add iframe player support in the frontend for embed-type streams.	2026-03-01 14:35:19 +00:00
Viktor Barzin	e36e192dd0	[ci skip] add Authentik PDB (minAvailable=2)	2026-03-01 14:24:47 +00:00
Viktor Barzin	a1ba1e0d37	[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout	2026-03-01 14:18:54 +00:00
Viktor Barzin	ecbee75275	[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down	2026-03-01 14:13:05 +00:00
Viktor Barzin	6ffc714683	[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down	2026-03-01 14:05:41 +00:00
Viktor Barzin	999005d40f	[ci skip] openclaw: cache tools on NFS for fast restarts - Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools) - Skip download of kubectl, terraform, terragrunt, pip packages if cached - Startup time: ~2.5min → ~38s on subsequent restarts	2026-03-01 13:59:07 +00:00
Viktor Barzin	ef52da6eca	[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress)	2026-03-01 13:57:54 +00:00
Viktor Barzin	a6f11a29ae	[ci skip] add Traefik resilience hardening implementation plan	2026-03-01 13:53:50 +00:00
Viktor Barzin	99fca0b7fa	[ci skip] add Traefik resilience hardening design doc	2026-03-01 13:50:00 +00:00
Viktor Barzin	2dda360b29	[ci skip] openclaw: add Telegram channel + install terragrunt in init container - Add Telegram bot integration (DM allowlist, user 8281953845 only) - Install terragrunt v0.99.4 in init container alongside terraform - Remove terraform init from init (terragrunt handles this per-stack) - Add openclaw_telegram_bot_token variable	2026-03-01 13:44:58 +00:00
Viktor Barzin	d9946506cd	[ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API - Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling - Fallback 1: Nemotron Ultra 253B on NIM - Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience) - 10 models total across 3 providers, all free - Removed: Modal (GLM-5), Gemini, Ollama providers - Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5 - Bumped maxTokens from 8192 to 16384 for agentic output room	2026-03-01 13:22:47 +00:00
Viktor Barzin	e25848be8f	[ci skip] add Kyverno resource governance details to CLAUDE.md	2026-03-01 13:05:57 +00:00
Viktor Barzin	51fb96e906	[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster lacks quorum. Changed service selector from mysqlrouter to mysqld with publishNotReadyAddresses=true to bypass the operator's readiness gate. Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.	2026-03-01 12:16:28 +00:00
Viktor Barzin	a071f08dc8	[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator - MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane) - InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage - Group Replication with automatic failover via MySQL Router - Compatibility service: mysql.dbaas:3306 → Router port 6446 - Images from container-registry.oracle.com (not Docker Hub) - Init containers are slow (~20 min) due to mysqlsh plugin loading - Data restore from mysqldump pending after cluster is ONLINE	2026-03-01 03:00:21 +00:00
Viktor Barzin	a76c72042e	[ci skip] add graceful degradation to CrowdSec bouncer middleware P0: Set updateMaxFailure=-1 (fail-open) Previously defaulted to 0 which blocked ALL traffic on first LAPI failure. Now serves from cached decisions when LAPI is unreachable. P1: Enable Redis cache for CrowdSec decisions Decisions are now shared across all 3 Traefik replicas and survive pod restarts. redisCacheUnreachableBlock=false prevents Redis from becoming another SPOF. P1: Add clientTrustedIPs for internal cluster traffic Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass CrowdSec entirely, preventing internal cascade failures.	2026-03-01 02:36:53 +00:00
Viktor Barzin	cd5d76fb33	[ci skip] add qemu-guest-agent to VM templates and enable agent by default	2026-03-01 01:58:46 +00:00
Viktor Barzin	e22d81275b	[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe - Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling) - Memory request: 512Mi → 1Gi - Startup probe: 30s delay, 10s timeout, 60 failures (10min total) Previous 5min window was too short for NFS-backed SQLite init	2026-02-28 23:32:28 +00:00
Viktor Barzin	d6731d857d	[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub Bitnami MySQL images can't be pulled (not found on Docker Hub, likely moved to a different registry). Reverted MySQL to single instance on NFS as the known-working state. MySQL replication to be revisited once image availability is resolved. PostgreSQL and Redis remain on local disk with replication.	2026-02-28 22:53:33 +00:00
Viktor Barzin	4577ba59ab	[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility VPA Auto mode modifies Deployment specs at runtime, causing conflicts with Terraform on every apply (drift -> reset -> VPA evict loop). Initial mode only mutates Pod resource requests at creation time via the admission webhook, leaving the Deployment spec unchanged. This means terraform plan shows no drift while pods still get VPA-optimized resources on every restart. - 171 VPAs switched from Auto to Initial - 20 VPAs remain Off (tier-0 critical services) - Goldilocks dashboard continues to show recommendations	2026-02-28 22:43:29 +00:00
Viktor Barzin	5685a84c9f	[ci skip] tune resource limits and requests across 10 services Critical OOM fixes (add/increase limits): - netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi) - speedtest: add 512Mi limit (was at 80.9%) - meshcentral: add 384Mi limit (was at 72.7%) - ytdlp: uncomment resources, set 512Mi limit (was at 74.6%) Over-provisioned (reduce limits): - dashy: 2Gi → 512Mi (was using 135Mi) - redis master: 2Gi → 256Mi (was using 14Mi) - redis replica: 1Gi → 256Mi (was using 12Mi) - resume printer: 2Gi → 512Mi (was using 108Mi) - resume app: 1Gi → 384Mi (was using 125Mi) - openclaw: 4Gi → 1Gi (was using 372Mi) Under-provisioned requests (increase): - authentik server: 256Mi → 512Mi request (actual ~560Mi) - authentik worker: 256Mi → 384Mi request (actual ~400Mi) New explicit resources (previously Kyverno defaults): - forgejo: add 512Mi limit, 64Mi request	2026-02-28 21:59:08 +00:00

1843 commits