Commit graph

1843 commits

Author SHA1 Message Date
Viktor Barzin
32762a0916
[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
304b5e4b3d
[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
2026-03-01 18:14:37 +00:00
Viktor Barzin
7a467c75ae
[ci skip] add openclaw-k8s-deployment skill from claudeception
Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes:
- wizard block required for Telegram, exec.host valid values,
- VPA resource overrides, file permissions, startup command,
- modelrelay sidecar, NFS caching strategy
2026-03-01 18:10:33 +00:00
Viktor Barzin
ae9565c3e6
[ci skip] onlyoffice: revert font cache NFS mounts, rebuild on startup
NFS font caching caused issues. Reverted to default GENERATE_FONTS=true
with 8 CPU burst limit for fast regeneration on startup.
2026-03-01 18:07:37 +00:00
Viktor Barzin
b8b41a9408
[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings 2026-03-01 18:07:04 +00:00
Viktor Barzin
a82f86b3e4
[ci skip] onlyoffice: cache fonts/themes on NFS for fast restarts
Persist font cache (159MB) and theme images (10MB) to NFS volume.
Set GENERATE_FONTS=false to skip regeneration on startup since cache
is warm. Startup time: ~3 min -> 5 seconds.
2026-03-01 18:02:38 +00:00
Viktor Barzin
81e128acc4
[ci skip] onlyoffice: bump CPU limit to 8, add custom LimitRange/Quota
Startup was throttled by allthemesgen and font generation hitting 2 CPU
ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU)
and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.
2026-03-01 17:58:26 +00:00
Viktor Barzin
07874f8021
[ci skip] onlyoffice: disable VPA to prevent CPU resource override
Goldilocks VPA in Initial mode was overriding the explicit 2 CPU limit
down to 700m, throttling the document server. Set vpa-update-mode=off.
2026-03-01 17:55:06 +00:00
Viktor Barzin
beec5acbc7
[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
2026-03-01 17:41:18 +00:00
Viktor Barzin
ecc3445860
[ci skip] openclaw: fix workspace permissions — chown to node user
Init container clones repo as root but main container runs as node (UID 1000).
Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.
2026-03-01 17:20:36 +00:00
Viktor Barzin
79af6fff47
[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory
- dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD
  list/watch needed by kopf framework in sidecar containers
- kyverno: restrict inject-priority-class-from-tier to CREATE
  operations only (was blocking pod patches with immutable spec error)
- kyverno: add resource-governance/custom-limitrange label opt-out
  to LimitRange generation policy (mirrors existing custom-quota)
- nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange
  with 8Gi max, opt out of Kyverno-managed LimitRange
2026-03-01 17:16:03 +00:00
Viktor Barzin
a8da2e3790
[ci skip] redis: pin service to master pod to fix read-only errors
The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
2026-03-01 17:13:25 +00:00
Viktor Barzin
d7f031bc5f
[ci skip] openclaw: set workspace + enable elevated + native commands
- Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace)
- Enable tools.elevated for unrestricted access
- Enable commands.native and commands.nativeSkills
- All tools, commands, and skills now fully accessible
2026-03-01 17:12:03 +00:00
Viktor Barzin
09024168ba
[ci skip] openclaw: disable sandbox mode for unrestricted execution
- Set agents.defaults.sandbox.mode = off
- Combined with exec.host=gateway and exec.security=full,
  OpenClaw can now run any command on the container host
2026-03-01 16:51:35 +00:00
Viktor Barzin
8fddc076d7
[ci skip] openclaw: fix exec host — use gateway instead of node
host=node requires a companion app (not available in container).
host=gateway runs commands directly on the gateway process host.
2026-03-01 16:47:14 +00:00
Viktor Barzin
7c300186f5
[ci skip] calibre: increase resources to 1 CPU / 1Gi to fix sluggish web UI
Was getting LimitRange defaults (250m CPU, 256Mi) causing throttling.
2026-03-01 16:42:35 +00:00
Viktor Barzin
e82bab4b10
[ci skip] openclaw: fix exec config — use host=node, security=full
Valid options: host=sandbox|gateway|node, security=deny|allowlist|full.
Using node (run on container host) with full (no command restrictions).
2026-03-01 16:42:22 +00:00
Viktor Barzin
66d3b55b23
[ci skip] openclaw: disable sandbox, run commands on container host
- exec.host: sandbox → local (run directly on container, no Docker sandbox)
- exec.security: full → off (no restrictions on command execution)
2026-03-01 16:18:53 +00:00
Viktor Barzin
ea8d68f3a8
[ci skip] shlink: increase memory limit to 512Mi to prevent OOMKill
Shlink with GeoLite2 DB requires more than the 256Mi LimitRange default.
2026-03-01 16:13:50 +00:00
Viktor Barzin
e9dbb0e82e
[ci skip] openclaw: persist home directory on NFS
- Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home)
- Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state,
  device identity, and all runtime files across pod restarts
- Init container still refreshes openclaw.json and kubeconfig on each start
2026-03-01 16:12:07 +00:00
Viktor Barzin
d71b1ac974
[ci skip] openclaw: remove all tool/command restrictions
- Set tools.deny = [] (was blocking sessions, subagents, browser)
- All tools now available: sessions, subagents, browser, etc.
2026-03-01 15:58:12 +00:00
Viktor Barzin
13e75711db
[ci skip] openclaw: add modelrelay sidecar as fallback model router
- Deploy modelrelay as sidecar container (auto-routes to fastest free model)
- Configured with NVIDIA NIM + OpenRouter API keys
- Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM),
  Fallback 2: modelrelay/auto-fastest (80+ free models)
- Modelrelay web UI available at pod:7352
2026-03-01 15:57:31 +00:00
Viktor Barzin
30702057f9
[ci skip] openclaw: fix Telegram, update to v2026.2.26, fix startup issues
- Update OpenClaw from v2026.2.9 to v2026.2.26 (fixes Telegram channel)
- Add gateway.mode=local + wizard block (required for channel startup)
- Add dangerouslyAllowHostHeaderOriginFallback (v2026.2.26 requirement)
- Run doctor --fix at container startup to auto-enable Telegram
- Create required dirs (canvas, devices, cron, sessions, credentials)
- Fix permissions: chown -R 1000:1000 for node user
- Telegram: DM allowlist, user 8281953845 only
2026-03-01 15:47:54 +00:00
Viktor Barzin
d02d6ad356
[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster 2026-03-01 15:47:11 +00:00
Viktor Barzin
80fbff083b
[ci skip] f1-stream: use latest tag, CI manages image via kubectl set image 2026-03-01 15:15:14 +00:00
Viktor Barzin
3c92c38d51
[ci skip] f1-stream: bump image to v5.2.0 2026-03-01 15:13:38 +00:00
Viktor Barzin
82d63a10ef
[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition 2026-03-01 15:05:57 +00:00
Viktor Barzin
388ba006fa
[ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off
- Set explicit CPU (2 cores) and memory (2Gi) limits
  Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway
  to take 5+ minutes to start, and 1Gi memory caused OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
  during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off)
2026-03-01 14:44:22 +00:00
Viktor Barzin
d95f753aa8
remove f1-stream CI pipeline (moved to separate repo)
The f1-stream project now lives at github.com/ViktorBarzin/f1-stream
with its own Woodpecker CI pipeline.
2026-03-01 14:43:06 +00:00
Viktor Barzin
00d3bb2fd1
[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain 2026-03-01 14:35:53 +00:00
Viktor Barzin
b7aa39353c
f1-stream: add real F1 stream extractors and iframe player support
Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live
F1 streams. Extend ExtractedStream model with stream_type/embed_url
fields, skip health checks for embed streams, fix broken Akamai demo
stream, add variant playlist validation, and add iframe player support
in the frontend for embed-type streams.
2026-03-01 14:35:19 +00:00
Viktor Barzin
e36e192dd0
[ci skip] add Authentik PDB (minAvailable=2) 2026-03-01 14:24:47 +00:00
Viktor Barzin
a1ba1e0d37
[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout 2026-03-01 14:18:54 +00:00
Viktor Barzin
ecbee75275
[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down 2026-03-01 14:13:05 +00:00
Viktor Barzin
6ffc714683
[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down 2026-03-01 14:05:41 +00:00
Viktor Barzin
999005d40f
[ci skip] openclaw: cache tools on NFS for fast restarts
- Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools)
- Skip download of kubectl, terraform, terragrunt, pip packages if cached
- Startup time: ~2.5min → ~38s on subsequent restarts
2026-03-01 13:59:07 +00:00
Viktor Barzin
ef52da6eca
[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress) 2026-03-01 13:57:54 +00:00
Viktor Barzin
a6f11a29ae
[ci skip] add Traefik resilience hardening implementation plan 2026-03-01 13:53:50 +00:00
Viktor Barzin
99fca0b7fa
[ci skip] add Traefik resilience hardening design doc 2026-03-01 13:50:00 +00:00
Viktor Barzin
2dda360b29
[ci skip] openclaw: add Telegram channel + install terragrunt in init container
- Add Telegram bot integration (DM allowlist, user 8281953845 only)
- Install terragrunt v0.99.4 in init container alongside terraform
- Remove terraform init from init (terragrunt handles this per-stack)
- Add openclaw_telegram_bot_token variable
2026-03-01 13:44:58 +00:00
Viktor Barzin
d9946506cd
[ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
2026-03-01 13:22:47 +00:00
Viktor Barzin
e25848be8f
[ci skip] add Kyverno resource governance details to CLAUDE.md 2026-03-01 13:05:57 +00:00
Viktor Barzin
51fb96e906
[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
2026-03-01 12:16:28 +00:00
Viktor Barzin
a071f08dc8
[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
2026-03-01 03:00:21 +00:00
Viktor Barzin
a76c72042e
[ci skip] add graceful degradation to CrowdSec bouncer middleware
P0: Set updateMaxFailure=-1 (fail-open)
  Previously defaulted to 0 which blocked ALL traffic on first LAPI
  failure. Now serves from cached decisions when LAPI is unreachable.

P1: Enable Redis cache for CrowdSec decisions
  Decisions are now shared across all 3 Traefik replicas and survive
  pod restarts. redisCacheUnreachableBlock=false prevents Redis from
  becoming another SPOF.

P1: Add clientTrustedIPs for internal cluster traffic
  Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
  CrowdSec entirely, preventing internal cascade failures.
2026-03-01 02:36:53 +00:00
Viktor Barzin
cd5d76fb33
[ci skip] add qemu-guest-agent to VM templates and enable agent by default 2026-03-01 01:58:46 +00:00
Viktor Barzin
e22d81275b
[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe
- Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling)
- Memory request: 512Mi → 1Gi
- Startup probe: 30s delay, 10s timeout, 60 failures (10min total)
  Previous 5min window was too short for NFS-backed SQLite init
2026-02-28 23:32:28 +00:00
Viktor Barzin
d6731d857d
[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.

PostgreSQL and Redis remain on local disk with replication.
2026-02-28 22:53:33 +00:00
Viktor Barzin
4577ba59ab
[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).

Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.

- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
2026-02-28 22:43:29 +00:00
Viktor Barzin
5685a84c9f
[ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00