Commit graph

1819 commits

Author SHA1 Message Date
Viktor Barzin
80fbff083b
[ci skip] f1-stream: use latest tag, CI manages image via kubectl set image 2026-03-01 15:15:14 +00:00
Viktor Barzin
3c92c38d51
[ci skip] f1-stream: bump image to v5.2.0 2026-03-01 15:13:38 +00:00
Viktor Barzin
82d63a10ef
[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition 2026-03-01 15:05:57 +00:00
Viktor Barzin
388ba006fa
[ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off
- Set explicit CPU (2 cores) and memory (2Gi) limits
  Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway
  to take 5+ minutes to start, and 1Gi memory caused OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
  during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off)
2026-03-01 14:44:22 +00:00
Viktor Barzin
d95f753aa8
remove f1-stream CI pipeline (moved to separate repo)
The f1-stream project now lives at github.com/ViktorBarzin/f1-stream
with its own Woodpecker CI pipeline.
2026-03-01 14:43:06 +00:00
Viktor Barzin
00d3bb2fd1
[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain 2026-03-01 14:35:53 +00:00
Viktor Barzin
b7aa39353c
f1-stream: add real F1 stream extractors and iframe player support
Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live
F1 streams. Extend ExtractedStream model with stream_type/embed_url
fields, skip health checks for embed streams, fix broken Akamai demo
stream, add variant playlist validation, and add iframe player support
in the frontend for embed-type streams.
2026-03-01 14:35:19 +00:00
Viktor Barzin
e36e192dd0
[ci skip] add Authentik PDB (minAvailable=2) 2026-03-01 14:24:47 +00:00
Viktor Barzin
a1ba1e0d37
[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout 2026-03-01 14:18:54 +00:00
Viktor Barzin
ecbee75275
[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down 2026-03-01 14:13:05 +00:00
Viktor Barzin
6ffc714683
[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down 2026-03-01 14:05:41 +00:00
Viktor Barzin
999005d40f
[ci skip] openclaw: cache tools on NFS for fast restarts
- Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools)
- Skip download of kubectl, terraform, terragrunt, pip packages if cached
- Startup time: ~2.5min → ~38s on subsequent restarts
2026-03-01 13:59:07 +00:00
Viktor Barzin
ef52da6eca
[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress) 2026-03-01 13:57:54 +00:00
Viktor Barzin
a6f11a29ae
[ci skip] add Traefik resilience hardening implementation plan 2026-03-01 13:53:50 +00:00
Viktor Barzin
99fca0b7fa
[ci skip] add Traefik resilience hardening design doc 2026-03-01 13:50:00 +00:00
Viktor Barzin
2dda360b29
[ci skip] openclaw: add Telegram channel + install terragrunt in init container
- Add Telegram bot integration (DM allowlist, user 8281953845 only)
- Install terragrunt v0.99.4 in init container alongside terraform
- Remove terraform init from init (terragrunt handles this per-stack)
- Add openclaw_telegram_bot_token variable
2026-03-01 13:44:58 +00:00
Viktor Barzin
d9946506cd
[ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
2026-03-01 13:22:47 +00:00
Viktor Barzin
e25848be8f
[ci skip] add Kyverno resource governance details to CLAUDE.md 2026-03-01 13:05:57 +00:00
Viktor Barzin
51fb96e906
[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
2026-03-01 12:16:28 +00:00
Viktor Barzin
a071f08dc8
[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
2026-03-01 03:00:21 +00:00
Viktor Barzin
a76c72042e
[ci skip] add graceful degradation to CrowdSec bouncer middleware
P0: Set updateMaxFailure=-1 (fail-open)
  Previously defaulted to 0 which blocked ALL traffic on first LAPI
  failure. Now serves from cached decisions when LAPI is unreachable.

P1: Enable Redis cache for CrowdSec decisions
  Decisions are now shared across all 3 Traefik replicas and survive
  pod restarts. redisCacheUnreachableBlock=false prevents Redis from
  becoming another SPOF.

P1: Add clientTrustedIPs for internal cluster traffic
  Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
  CrowdSec entirely, preventing internal cascade failures.
2026-03-01 02:36:53 +00:00
Viktor Barzin
cd5d76fb33
[ci skip] add qemu-guest-agent to VM templates and enable agent by default 2026-03-01 01:58:46 +00:00
Viktor Barzin
e22d81275b
[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe
- Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling)
- Memory request: 512Mi → 1Gi
- Startup probe: 30s delay, 10s timeout, 60 failures (10min total)
  Previous 5min window was too short for NFS-backed SQLite init
2026-02-28 23:32:28 +00:00
Viktor Barzin
d6731d857d
[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.

PostgreSQL and Redis remain on local disk with replication.
2026-02-28 22:53:33 +00:00
Viktor Barzin
4577ba59ab
[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).

Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.

- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
2026-02-28 22:43:29 +00:00
Viktor Barzin
5685a84c9f
[ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
ad8f686559
[ci skip] Phase 3: migrate MySQL from NFS to local disk
- Add local-path PVC for MySQL data (10Gi, actual usage ~27GB)
- Init container seeds data from NFS on first run (cp -a)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
  hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
2026-02-28 20:41:07 +00:00
Viktor Barzin
419581727f
[ci skip] fix nextcloud OOMKilled: increase memory limit to 2Gi 2026-02-28 20:21:00 +00:00
Viktor Barzin
442f5450c4
trigger CI: test registry push with port-preserving Host header 2026-02-28 20:17:47 +00:00
Viktor Barzin
adfa05c8d0
[ci skip] fix: use $http_host in nginx to preserve port in registry redirects 2026-02-28 20:16:03 +00:00
Viktor Barzin
0cce9d350a
[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA
- Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2
- Architecture: replication with Sentinel (1 master + 1 replica + sentinels)
- Automatic failover via Sentinel (quorum=2, masterSet=mymaster)
- Service 'redis.redis' always points at current master (transparent to clients)
- 120 clients connected immediately after deployment
- Sentinel confirmed tracking redis-node-0 as master
- Local-path PVCs for persistence (2Gi per node)
- Auth disabled (matches previous setup)
- Hourly RDB backup CronJob to NFS preserved
- OCI chart pulled via pull-through cache (10.0.20.10:5000)
2026-02-28 19:59:58 +00:00
Viktor Barzin
1d3c9dba99
fix: use inline cache to avoid cache_from comma splitting bug 2026-02-28 19:56:00 +00:00
Viktor Barzin
00717d0c7e
[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
fdd4e3e467
[ci skip] Phase 2: migrate Redis from NFS to local disk
- Switch from redis/redis-stack:latest to redis:7-alpine
  (modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
  (RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
  all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
2026-02-28 19:44:08 +00:00
Viktor Barzin
c6076d275f
trigger CI: test BuildKit caching with TLS registry 2026-02-28 19:42:34 +00:00
Viktor Barzin
3e3699bbc6
[ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
e8bcf21127
[ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
Viktor Barzin
5d745376bf
[ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
849116d08a
[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
5be70fb955
[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG
Phase 1 complete — PostgreSQL fully migrated off NFS:

dbaas module changes:
- Replace old kubernetes_deployment.postgres with null_resource.pg_cluster
  (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues)
- Update postgresql Service selector: app=postgresql → cnpg primary
- Update backup CronJob: use postgres user + read password from CNPG secret
  (pg-cluster-superuser) instead of hardcoded root password
- Add kube_config_path variable for kubectl in null_resource
- Old deployment deleted from cluster (was scaled to 0)

CNPG cluster status:
- 2 instances: primary (k8s-node4), replica (k8s-node2)
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16)
- 20Gi local-path storage per instance
- All 13 dependent services verified running
- Backup CronJob verified working with new endpoint
2026-02-28 19:23:36 +00:00
Viktor Barzin
2d3be0ca74
fix: remove private registry from docker builds, push only to DockerHub
The BuildKit builder cannot push to the insecure HTTP registry at
registry.viktorbarzin.lan:5050 because buildkit_config is not being
applied by the plugin. Simplified to DockerHub-only push for now.
Private registry caching and push can be re-added once buildkit_config
issue is resolved.
2026-02-28 19:15:54 +00:00
Viktor Barzin
39e3fae488
[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
6e7f092c1f
trigger CI: test insecure registry and buildkit_config fixes for build-cli 2026-02-28 19:09:30 +00:00
Viktor Barzin
9d9c8fdc12
[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
7724214054
[ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills
Both services were hitting the 256Mi Kyverno LimitRange default and getting
OOMKilled. Set explicit limits to 1Gi memory for both.
2026-02-28 19:07:59 +00:00
Viktor Barzin
5e7dc5ba4a
[ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
e231dbace4
fix: use cache_images instead of cache_from to avoid comma splitting
The plugin-docker-buildx (Codeberg version) changed CacheFrom from
string to StringSlice, which causes urfave/cli to split on commas.
The cache_images setting properly handles registry refs by generating
both --cache-from and --cache-to flags automatically.
2026-02-28 18:53:08 +00:00
Viktor Barzin
c24ae7e8df
[ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
205d9f9fc4
fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
Viktor Barzin
99980d97f9
fix: use YAML list for cache_from/cache_to to prevent comma splitting 2026-02-28 18:40:55 +00:00