Commit graph

84 commits

Viktor Barzin
04bed0188e
[ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0 2026-03-07 20:39:56 +00:00
Viktor Barzin
c09e1a0b65
[ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint)
Replace wanStatus with temp field. Remove wan interface param since
the pfSense REST API v2 package doesn't expose /status/gateway.
2026-03-07 20:39:56 +00:00
Viktor Barzin
8583cd8578
[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (internal rejects untrusted Host header)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
2026-03-07 20:39:56 +00:00
Viktor Barzin
98b6839dd6
[ci skip] fix widget URLs: use correct k8s service ports
Services expose port 80 via ClusterIP but widgets were using container
target ports (5000, 3001, 4533, 3000). Calibre was using external URL
through Authentik. All now use correct internal service URLs.
2026-03-07 20:39:56 +00:00
Viktor Barzin
2ebbf364d1
[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma
Add API credentials to SOPS and wire homepage_credentials through
stacks. Re-add Uptime Kuma widget with new "infra" status page slug.
2026-03-07 20:39:55 +00:00
Viktor Barzin
d573e07674
[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale
Wire homepage_credentials through servarr parent stack for prowlarr.
Fix paperless-ngx widget to use internal service URL.
2026-03-07 20:39:55 +00:00
Viktor Barzin
ce41f6841f
[ci skip] fix broken Homepage widgets + add service API tokens to SOPS
- Grafana: fix service URL (grafana not monitoring-grafana)
- Uptime Kuma: remove widget (no status page configured)
- Speedtest/Frigate/Immich: use internal k8s service URLs (external
  goes through Authentik forward auth, blocking API calls)
- pfSense: clean up annotations
- SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens
2026-03-07 20:39:55 +00:00
Viktor Barzin
c3df3393a7
[ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant
Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
2026-03-07 20:39:54 +00:00
Viktor Barzin
af74aa297d
[ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
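For reference, Homepage auto-discovery is driven by ingress annotations of roughly this shape (service name, group, and widget values here are illustrative, not taken from the actual repo):

```yaml
# Sketch of gethomepage.dev annotations on a Kubernetes Ingress.
# Annotation keys follow the gethomepage.dev discovery convention;
# the concrete values below are placeholders.
metadata:
  annotations:
    gethomepage.dev/enabled: "true"
    gethomepage.dev/name: "Grafana"
    gethomepage.dev/group: "Monitoring"
    gethomepage.dev/icon: "grafana.png"
    gethomepage.dev/widget.type: "grafana"
    # Internal service URL — external URLs behind Authentik forward
    # auth block the widget's API calls (see commits above).
    gethomepage.dev/widget.url: "http://grafana.monitoring.svc.cluster.local"
```

Homepage's Kubernetes integration (enabled via the serviceAccount/RBAC mentioned in the commit) scans ingresses for these annotations and builds the dashboard tiles automatically.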
Viktor Barzin
ed6a2780fb
[ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.

E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
2026-03-07 15:34:32 +00:00
Viktor Barzin
15f7114c4e
[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push viktorbarzin/k8s-portal:latest

2026-03-07 15:06:26 +00:00
Viktor Barzin
db68067925
[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
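The SOPS decrypt step described in Phase 5 could look like the following Woodpecker step (image name, secret wiring, and file paths are illustrative assumptions, not the repo's actual pipeline):

```yaml
# Sketch of a Woodpecker prepare step that decrypts SOPS-encrypted
# tfvars before terraform runs. Assumes an age key stored as the
# Woodpecker secret `sops_age_key` (as noted in the commit message).
steps:
  prepare:
    image: ghcr.io/getsops/sops:latest   # illustrative image
    environment:
      SOPS_AGE_KEY:
        from_secret: sops_age_key
    commands:
      # Paths are placeholders; the real pipeline scopes git add to
      # stacks/ state/ .woodpecker/ per the commit above.
      - sops -d secrets/terraform.tfvars.enc > terraform.tfvars
```

Until the `sops_age_key` secret exists, decryption fails and the old plaintext terraform.tfvars path keeps working, which matches the fallback note in the commit.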
Viktor Barzin
1824d2be67
[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history
Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead.
PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB.
Historical TSDB data restored from NFS (39.8 GB total including all blocks).
2026-03-06 23:16:32 +00:00
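In the prometheus-community Helm chart, the two knobs changed here map to values roughly like this (key paths assume that chart; adjust if the deployment uses a different one):

```yaml
# Sketch of the retention/PVC sizing from this commit.
# ~10.5 GB/month * 12 months ≈ 125 GB, so 180GB retention under a
# 200Gi PVC leaves headroom for TSDB overhead and compaction churn.
server:
  retentionSize: 180GB
  persistentVolume:
    size: 200Gi
```

`retentionSize` caps the TSDB on disk, so it must stay comfortably below the PVC size or Prometheus will fill the volume before pruning.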
Viktor Barzin
a7f3d432ee
[ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore 2026-03-06 22:51:38 +00:00
Viktor Barzin
63fb6201c8
[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI
- Redis: local-path → iscsi-truenas (master + replica persistence)
- Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data)
- Loki: NFS PV → dynamic iSCSI via storageClass in Helm values
- Deleted 2 orphaned Released iSCSI PVs (31Gi freed)
2026-03-06 20:50:55 +00:00
Viktor Barzin
94dcf22db4
[ci skip] exclude linkwarden from HighService4xxRate alert 2026-03-06 20:15:58 +00:00
Viktor Barzin
1d80c49201
[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
a8e07ad930
[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
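Checking StatefulSet readiness instead of a Deployment means alerting on kube-state-metrics StatefulSet series. A minimal PrometheusRule sketch (alert/label names illustrative, assuming the prometheus-operator CRDs):

```yaml
# Sketch of a "database down" alert keyed on StatefulSet readiness,
# per the MySQLDown/RedisDown fixes above. Metric is from
# kube-state-metrics; statefulset/namespace labels are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: database-alerts
spec:
  groups:
    - name: databases
      rules:
        - alert: MySQLDown
          # Fires when no replica of the StatefulSet is ready.
          expr: kube_statefulset_status_replicas_ready{statefulset="mysql", namespace="dbaas"} < 1
          for: 5m
          labels:
            severity: critical
```

The same pattern (ready replicas of the StatefulSet, not Deployment availability) applies to the Redis and PostgreSQL alerts mentioned in the commit.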
Viktor Barzin
31f3fc0773
[ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi)
- Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default)
- Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default)
- Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)
2026-03-02 21:39:14 +00:00
Viktor Barzin
51d77369de
[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3)
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.

Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).

Requires re-applying all stacks to propagate to existing PVs.
2026-03-02 20:23:36 +00:00
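Because StorageClass mountOptions only apply at dynamic provisioning time, statically created PVs need the options inline. A sketch of the resulting PV spec (server/path/capacity are placeholders):

```yaml
# Static NFS PV with explicit mountOptions — without these the kernel
# defaults to hard,timeo=600, the stale-mount behavior being eliminated.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  capacity:
    storage: 30Gi          # illustrative
  accessModes:
    - ReadWriteMany
  mountOptions:
    - soft                 # fail I/O instead of hanging forever
    - timeo=30             # 3s per retry (deciseconds)
    - retrans=3            # give up after 3 retransmits
  nfs:
    server: truenas.example   # placeholder
    path: /mnt/main/prometheus # placeholder
```

As the commit notes, mountOptions on an existing PV only take effect once the volume is re-mounted, hence the need to re-apply all stacks.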
Viktor Barzin
0e324df545
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
481e4fa46e
[ci skip] add NFS CSI driver + nfs_volume shared module
- Deploy csi-driver-nfs Helm chart as platform module (nfs-csi)
- Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options
- Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)
2026-03-01 23:38:58 +00:00
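The nfs-truenas StorageClass from this commit would look roughly like this with csi-driver-nfs (server/share values are placeholders):

```yaml
# StorageClass backing dynamic NFS provisioning via csi-driver-nfs.
# Unlike static PVs, mountOptions here DO propagate — but only to
# volumes provisioned through this class (see the follow-up fix above).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-truenas
provisioner: nfs.csi.k8s.io
parameters:
  server: truenas.example   # placeholder NFS server
  share: /mnt/main/k8s      # placeholder export path
mountOptions:
  - soft
  - timeo=30
  - retrans=3
reclaimPolicy: Retain
```

The shared nfs_volume module then only has to template the PV/PVC pair and point the PVC at this class.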
Viktor Barzin
ca648ff9bb
[ci skip] right-size all pod resources based on VPA + live metrics audit
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.

Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi

Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
  excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
  navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
  tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
  cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
  k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
  immich-postgresql, osrm-foot

GPU containers: tuned CPU/mem alongside GPU limits:
- ollama: remove CPU/mem limits (model sizes vary too widely), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU

Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage

Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
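The "limits = max(VPA upper * 2, live usage * 2)" methodology produces container resource blocks like this (values illustrative, using the dashy fix above as the example):

```yaml
# Example of a right-sized container after the audit: dashy was
# CPU-throttled at 98% of a 500m limit, so the limit moves to 2 CPU
# while the request stays near observed usage.
resources:
  requests:
    cpu: 250m        # near live usage so scheduling stays honest
    memory: 256Mi
  limits:
    cpu: "2"         # max(VPA upper * 2, live usage * 2)
    memory: 512Mi
```

Keeping requests close to real usage while doubling the observed ceiling for limits avoids both throttling/OOMKills and the over-provisioning this audit clawed back.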
Viktor Barzin
32762a0916
[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
79af6fff47
[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory
- dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD
  list/watch needed by kopf framework in sidecar containers
- kyverno: restrict inject-priority-class-from-tier to CREATE
  operations only (was blocking pod patches with immutable spec error)
- kyverno: add resource-governance/custom-limitrange label opt-out
  to LimitRange generation policy (mirrors existing custom-quota)
- nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange
  with 8Gi max, opt out of Kyverno-managed LimitRange
2026-03-01 17:16:03 +00:00
Viktor Barzin
a8da2e3790
[ci skip] redis: pin service to master pod to fix read-only errors
The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
2026-03-01 17:13:25 +00:00
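Pinning a Service to one StatefulSet pod relies on the automatic per-pod label Kubernetes adds to StatefulSet pods. A sketch (port and namespace assumed from context):

```yaml
# Service pinned to the Redis master pod. The
# statefulset.kubernetes.io/pod-name label is set automatically on
# every StatefulSet pod, so selecting it targets exactly one pod.
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: redis
spec:
  selector:
    statefulset.kubernetes.io/pod-name: redis-node-0
  ports:
    - port: 6379
      targetPort: 6379
```

The trade-off: if Sentinel fails over to another node, this pinned Service keeps pointing at redis-node-0 until the selector is updated — acceptable here since plain `redis://` clients can't follow Sentinel anyway.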
Viktor Barzin
d02d6ad356
[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster 2026-03-01 15:47:11 +00:00
Viktor Barzin
82d63a10ef
[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition 2026-03-01 15:05:57 +00:00
Viktor Barzin
00d3bb2fd1
[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain 2026-03-01 14:35:53 +00:00
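A retry middleware with those parameters maps to a small Traefik CRD (name and apiVersion assume Traefik v3's CRD group):

```yaml
# Retry middleware for the default ingress chain: one re-attempt after
# 100ms if the backend connection fails. Only retries when no response
# was received, so it is safe for non-idempotent routes.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: retry
spec:
  retry:
    attempts: 2
    initialInterval: 100ms
```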
Viktor Barzin
e36e192dd0
[ci skip] add Authentik PDB (minAvailable=2) 2026-03-01 14:24:47 +00:00
Viktor Barzin
a1ba1e0d37
[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout 2026-03-01 14:18:54 +00:00
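The PDB half of this change is a few lines (selector labels assume the standard Helm chart labels):

```yaml
# Keep at least 2 Traefik replicas up through node drains/upgrades.
# Pairs with a topologySpreadConstraint on kubernetes.io/hostname so
# the replicas aren't all on the node being drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: traefik
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik
```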
Viktor Barzin
ecbee75275
[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down 2026-03-01 14:13:05 +00:00
Viktor Barzin
6ffc714683
[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down 2026-03-01 14:05:41 +00:00
Viktor Barzin
51fb96e906
[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
2026-03-01 12:16:28 +00:00
Viktor Barzin
a071f08dc8
[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
2026-03-01 03:00:21 +00:00
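The Oracle operator drives the whole topology from one InnoDBCluster CR; a sketch matching the commit's 3-server/1-router layout (secret name is a placeholder):

```yaml
# InnoDBCluster CR for the MySQL Operator: 3 Group Replication
# members fronted by 1 MySQL Router. The operator creates the
# StatefulSet, Router Deployment, and services from this spec.
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mysql-cluster
  namespace: dbaas
spec:
  instances: 3
  version: "9.2.0"
  router:
    instances: 1
  secretName: mysql-root-credentials   # placeholder Secret with root creds
  tlsUseSelfSigned: true
```

The compatibility service mentioned above then just forwards mysql.dbaas:3306 to the Router's read-write port 6446.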
Viktor Barzin
a76c72042e
[ci skip] add graceful degradation to CrowdSec bouncer middleware
P0: Set updateMaxFailure=-1 (fail-open)
  Previously defaulted to 0 which blocked ALL traffic on first LAPI
  failure. Now serves from cached decisions when LAPI is unreachable.

P1: Enable Redis cache for CrowdSec decisions
  Decisions are now shared across all 3 Traefik replicas and survive
  pod restarts. redisCacheUnreachableBlock=false prevents Redis from
  becoming another SPOF.

P1: Add clientTrustedIPs for internal cluster traffic
  Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
  CrowdSec entirely, preventing internal cascade failures.
2026-03-01 02:36:53 +00:00
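Assuming the middleware is the community CrowdSec bouncer Traefik plugin, the three fixes land in one Middleware spec like this (plugin key and Redis host are illustrative):

```yaml
# CrowdSec bouncer middleware hardened for graceful degradation.
# Plugin option names below follow the commit message; the plugin
# key and redis host are assumptions about this particular setup.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: crowdsec-bouncer
spec:
  plugin:
    crowdsec-bouncer-traefik-plugin:      # illustrative plugin key
      updateMaxFailure: -1                # P0: never block on LAPI failure
      redisCacheEnabled: true             # P1: share decisions across replicas
      redisCacheHost: "redis.redis:6379"  # placeholder
      redisCacheUnreachableBlock: false   # Redis outage must not become a SPOF
      clientTrustedIPs:                   # P1: internal traffic bypasses CrowdSec
        - 10.0.20.0/24                    # node CIDR
        - 10.10.0.0/16                    # pod CIDR
```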
Viktor Barzin
d6731d857d
[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.

PostgreSQL and Redis remain on local disk with replication.
2026-02-28 22:53:33 +00:00
Viktor Barzin
4577ba59ab
[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).

Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.

- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
2026-02-28 22:43:29 +00:00
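The Auto-vs-Initial distinction is a one-field change on each VPA object:

```yaml
# VPA in Initial mode: the admission webhook sets pod resource
# requests at creation time only, so the Deployment spec (and thus
# terraform plan) never drifts, unlike Auto mode which rewrites specs
# at runtime. Target names are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: some-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: some-app
  updatePolicy:
    updateMode: "Initial"   # was "Auto"; tier-0 services use "Off"
```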
Viktor Barzin
5685a84c9f
[ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
ad8f686559
[ci skip] Phase 3: migrate MySQL from NFS to local disk
- Add local-path PVC for MySQL data (10Gi, actual usage ~27GB)

- Init container seeds data from NFS on first run (cp -a)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
  hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
2026-02-28 20:41:07 +00:00
Viktor Barzin
0cce9d350a
[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA
- Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2
- Architecture: replication with Sentinel (1 master + 1 replica + sentinels)
- Automatic failover via Sentinel (quorum=2, masterSet=mymaster)
- Service 'redis.redis' always points at current master (transparent to clients)
- 120 clients connected immediately after deployment
- Sentinel confirmed tracking redis-node-0 as master
- Local-path PVCs for persistence (2Gi per node)
- Auth disabled (matches previous setup)
- Hourly RDB backup CronJob to NFS preserved
- OCI chart pulled via pull-through cache (10.0.20.10:5000)
2026-02-28 19:59:58 +00:00
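The Sentinel topology described above corresponds to Bitnami chart values roughly like these (value paths assume the Bitnami Redis chart; exact keys can differ between chart versions):

```yaml
# Bitnami Redis Helm values: replication + Sentinel, no auth
# (matching the previous plain-redis setup), local-path persistence.
architecture: replication
auth:
  enabled: false
sentinel:
  enabled: true
  masterSet: mymaster
  quorum: 2
master:
  persistence:
    size: 2Gi
replica:
  replicaCount: 1
  persistence:
    size: 2Gi
```

With Sentinel enabled the chart's `redis.redis` service fronts whichever node Sentinel currently reports as master, which is what makes failover transparent to plain-URL clients.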
Viktor Barzin
00717d0c7e
[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
fdd4e3e467
[ci skip] Phase 2: migrate Redis from NFS to local disk
- Switch from redis/redis-stack:latest to redis:7-alpine
  (modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
  (RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
  all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
2026-02-28 19:44:08 +00:00
Viktor Barzin
5d745376bf
[ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
849116d08a
[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
5be70fb955
[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG
Phase 1 complete — PostgreSQL fully migrated off NFS:

dbaas module changes:
- Replace old kubernetes_deployment.postgres with null_resource.pg_cluster
  (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues)
- Update postgresql Service selector: app=postgresql → cnpg primary
- Update backup CronJob: use postgres user + read password from CNPG secret
  (pg-cluster-superuser) instead of hardcoded root password
- Add kube_config_path variable for kubectl in null_resource
- Old deployment deleted from cluster (was scaled to 0)

CNPG cluster status:
- 2 instances: primary (k8s-node4), replica (k8s-node2)
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16)
- 20Gi local-path storage per instance
- All 13 dependent services verified running
- Backup CronJob verified working with new endpoint
2026-02-28 19:23:36 +00:00
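The CNPG Cluster CR behind `null_resource.pg_cluster` is compact; a sketch matching the status described above (metadata names assumed from the commit's secret naming):

```yaml
# CloudNativePG Cluster: 2-instance PostGIS cluster on local-path
# storage. CNPG generates the pg-cluster-rw/-ro services and the
# pg-cluster-superuser secret the backup CronJob now reads.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
  namespace: dbaas
spec:
  instances: 2
  imageName: ghcr.io/cloudnative-pg/postgis:16
  storage:
    size: 20Gi
    storageClass: local-path
```

Applying this via `kubectl apply` in a null_resource (rather than a native Terraform resource) sidesteps the webhook-mutation drift the commit mentions.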
Viktor Barzin
39e3fae488
[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
5e7dc5ba4a
[ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
c24ae7e8df
[ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
205d9f9fc4
fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists: plugin-docker-buildx
  expects each as a single string value, but when given a list the Woodpecker
  settings layer split the comma-separated items into separate --cache-from
  flags (type=registry and ref=... passed as two flags)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
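The plain-string form from this fix looks like the following Woodpecker step (image repo and cache ref are placeholders):

```yaml
# cache_from/cache_to as single quoted strings — passing a YAML list
# here made Woodpecker split "type=registry,ref=..." into two broken
# --cache-from flags. Repo/ref values below are illustrative.
steps:
  build:
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: viktorbarzin/example-app            # placeholder
      cache_from: "type=registry,ref=viktorbarzin/example-app:cache"
      cache_to: "type=registry,ref=viktorbarzin/example-app:cache,mode=max"
```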