Commit graph

171 commits

Author SHA1 Message Date
Viktor Barzin
efe0cdefc8
[ci skip] add Forgejo task pipeline for OpenClaw AI agent
Forgejo issues as a task queue for OpenClaw:
- Forgejo OAuth2 with Authentik SSO, self-registration disabled
- Webhook-triggered task processing (instant) + CronJob backup (5min poll)
- Tasks processed via Mistral Large 3 (NVIDIA NIM API)
- Results posted as issue comments, auto-labeled and closed
- Comment follow-ups and reopened issues supported
- n8n RBAC for OpenClaw pod exec (future workflow integration)
2026-03-07 21:11:07 +00:00
Viktor Barzin
0d03037393
add nginx caching proxy for Homepage widget API requests
Stale-while-revalidate cache in front of Homepage reduces first-paint
latency by serving cached /api/ responses instantly while refreshing
upstream in background. Non-API paths pass through uncached.
2026-03-07 21:11:07 +00:00
root
aeee51a1a3 Woodpecker CI deploy commit [CI SKIP] 2026-03-07 20:47:22 +00:00
Viktor Barzin
2b5e4888e0
add Google OAuth env vars to plotting-book deployment
Deploy GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and GOOGLE_CALLBACK_URL
to the plotting-book container. Update CSP to allow accounts.google.com
for connect-src and form-action directives.
2026-03-07 20:41:08 +00:00
Viktor Barzin
90c3372643
[ci skip] add liveness probe to Send deployment
Prevents stale Redis connections from silently breaking file uploads.
The old Node.js Redis client doesn't auto-reconnect after Redis restarts,
causing all files to appear expired.
2026-03-07 20:39:57 +00:00
Viktor Barzin
04bed0188e
[ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0 2026-03-07 20:39:56 +00:00
Viktor Barzin
c09e1a0b65
[ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint)
Replace wanStatus with temp field. Remove wan interface param since
the pfSense REST API v2 package doesn't expose /status/gateway.
2026-03-07 20:39:56 +00:00
Viktor Barzin
8583cd8578
[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (internal rejects untrusted Host header)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
2026-03-07 20:39:56 +00:00
Viktor Barzin
98b6839dd6
[ci skip] fix widget URLs: use correct k8s service ports
Services expose port 80 via ClusterIP but widgets were using container
target ports (5000, 3001, 4533, 3000). Calibre was using external URL
through Authentik. All now use correct internal service URLs.
2026-03-07 20:39:56 +00:00
Viktor Barzin
31a8eded7d
[ci skip] upgrade Homepage from v1.8.0 to v1.10.1 2026-03-07 20:39:56 +00:00
Viktor Barzin
2ebbf364d1
[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma
Add API credentials to SOPS and wire homepage_credentials through
stacks. Re-add Uptime Kuma widget with new "infra" status page slug.
2026-03-07 20:39:55 +00:00
Viktor Barzin
d573e07674
[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale
Wire homepage_credentials through servarr parent stack for prowlarr.
Fix paperless-ngx widget to use internal service URL.
2026-03-07 20:39:55 +00:00
Viktor Barzin
ce41f6841f
[ci skip] fix broken Homepage widgets + add service API tokens to SOPS
- Grafana: fix service URL (grafana not monitoring-grafana)
- Uptime Kuma: remove widget (no status page configured)
- Speedtest/Frigate/Immich: use internal k8s service URLs (external
  goes through Authentik forward auth, blocking API calls)
- pfSense: clean up annotations
- SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens
2026-03-07 20:39:55 +00:00
Viktor Barzin
c3df3393a7
[ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant
Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
2026-03-07 20:39:54 +00:00
Viktor Barzin
af74aa297d
[ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
e86db29e80
[ci skip] fix false-positive sensitive=true on kube_config_path 2026-03-07 15:48:19 +00:00
Viktor Barzin
ed6a2780fb
[ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.

E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
2026-03-07 15:34:32 +00:00
Viktor Barzin
15f7114c4e
[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push
2026-03-07 15:06:26 +00:00
Viktor Barzin
db68067925
[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
b73b2eac33
fix(actualbudget): raise http-api resources to prevent OOM [ci skip] 2026-03-07 00:28:02 +00:00
Viktor Barzin
7d68be870d
[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
1824d2be67
[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history
Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead.
PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB.
Historical TSDB data restored from NFS (39.8 GB total including all blocks).
2026-03-06 23:16:32 +00:00
Viktor Barzin
a7f3d432ee
[ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore 2026-03-06 22:51:38 +00:00
Viktor Barzin
63fb6201c8
[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI
- Redis: local-path → iscsi-truenas (master + replica persistence)
- Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data)
- Loki: NFS PV → dynamic iSCSI via storageClass in Helm values
- Deleted 2 orphaned Released iSCSI PVs (31Gi freed)
2026-03-06 20:50:55 +00:00
Viktor Barzin
bfa4a3ffdf
[ci skip] reduce resource limits per VPA recommendations
dashy: 4Gi→512Mi mem, 2→500m cpu (actual: 206Mi)
affine: 4Gi→512Mi mem, 2→1 cpu (actual: 186Mi)
rybbit clickhouse: 4Gi→2Gi mem, 2→1 cpu (actual: 618Mi)
2026-03-06 20:23:21 +00:00
Viktor Barzin
94dcf22db4
[ci skip] exclude linkwarden from HighService4xxRate alert 2026-03-06 20:15:58 +00:00
Viktor Barzin
d4400f8283
[ci skip] remove atuin: destroy stack, DNS, NFS export, PostgreSQL credentials 2026-03-06 20:11:14 +00:00
Viktor Barzin
1d80c49201
[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
a8e07ad930
[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
065090dfe0
[ci skip] fix calibre: bump CPU/memory to prevent SIGBUS during calibre_postinstall 2026-03-03 19:48:45 +00:00
Viktor Barzin
31f3fc0773
[ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi)
- Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default)
- Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default)
- Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)
2026-03-02 21:39:14 +00:00
Viktor Barzin
51d77369de
[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3)
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.

Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).

Requires re-applying all stacks to propagate to existing PVs.
2026-03-02 20:23:36 +00:00
Viktor Barzin
395bd94f0f
[ci skip] migrate servarr sub-stacks + actualbudget factory NFS to CSI PV/PVC
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
2026-03-02 02:04:22 +00:00
Viktor Barzin
0e324df545
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
11b3d92684
[ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC
Batch migration of all single-volume and simple multi-volume stacks.
All services verified healthy after migration. Uses nfs-truenas
StorageClass with soft,timeo=30,retrans=3 mount options to eliminate
stale NFS mount hangs.

Services: atuin, audiobookshelf, calibre, changedetection, diun,
excalidraw, forgejo, freshrss, grampsweb, hackmd, health,
isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama,
onlyoffice, owntracks, paperless-ngx, poison-fountain, send,
stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp
2026-03-02 00:15:39 +00:00
Viktor Barzin
8faad47994
[ci skip] migrate privatebin, resume, speedtest NFS volumes to CSI PV/PVC
Pilot migration: replace inline nfs {} volumes with CSI-backed PV/PVC
using nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options).
2026-03-01 23:42:23 +00:00
Viktor Barzin
481e4fa46e
[ci skip] add NFS CSI driver + nfs_volume shared module
- Deploy csi-driver-nfs Helm chart as platform module (nfs-csi)
- Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options
- Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)
2026-03-01 23:38:58 +00:00
Viktor Barzin
00197c931e
[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00
Viktor Barzin
14a5b4d7d5
[ci skip] frigate: add liveness/startup probes for GPU recovery
When the GPU becomes unavailable (overloaded, CUDA context corruption),
Frigate silently falls back to CPU detection burning 4 cores with no
automatic recovery. Add liveness probe checking nvidia-smi + API health
every 60s (3 failures = restart), and startup probe allowing up to 5min
for TensorRT model loading.
2026-03-01 20:36:49 +00:00
Viktor Barzin
78d5aeb5db
[ci skip] f1-stream: add Discord token and channel env vars 2026-03-01 20:17:38 +00:00
Viktor Barzin
6c1dffbfd8
[ci skip] rybbit: add CronJob to truncate ClickHouse system logs every 6h
ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were
growing unboundedly on NFS (~10GiB, 1.3B rows) with no TTL, causing
continuous background merge operations that burned ~920m CPU. Mounting
custom config.d XML files crashes ClickHouse (exit code 36) so instead
add a CronJob that truncates the tables via the HTTP API every 6 hours.
Also removed the broken ConfigMap/volume mount that was causing crashes.
2026-03-01 19:41:39 +00:00
Viktor Barzin
ca648ff9bb
[ci skip] right-size all pod resources based on VPA + live metrics audit
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.

Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi

Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
  excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
  navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
  tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
  cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
  k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
  immich-postgresql, osrm-foot

GPU containers: added CPU/mem alongside GPU limits:
- ollama: removed CPU/mem limits (models vary in size), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU

Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage

Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
Viktor Barzin
32762a0916
[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
ae9565c3e6
[ci skip] onlyoffice: revert font cache NFS mounts, rebuild on startup
NFS font caching caused issues. Reverted to default GENERATE_FONTS=true
with 8 CPU burst limit for fast regeneration on startup.
2026-03-01 18:07:37 +00:00
Viktor Barzin
a82f86b3e4
[ci skip] onlyoffice: cache fonts/themes on NFS for fast restarts
Persist font cache (159MB) and theme images (10MB) to NFS volume.
Set GENERATE_FONTS=false to skip regeneration on startup since cache
is warm. Startup time: ~3 min -> 5 seconds.
2026-03-01 18:02:38 +00:00
Viktor Barzin
81e128acc4
[ci skip] onlyoffice: bump CPU limit to 8, add custom LimitRange/Quota
Startup was throttled by allthemesgen and font generation hitting 2 CPU
ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU)
and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.
2026-03-01 17:58:26 +00:00
Viktor Barzin
07874f8021
[ci skip] onlyoffice: disable VPA to prevent CPU resource override
Goldilocks VPA in Initial mode was overriding the explicit 2 CPU limit
down to 700m, throttling the document server. Set vpa-update-mode=off.
2026-03-01 17:55:06 +00:00
Viktor Barzin
beec5acbc7
[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
2026-03-01 17:41:18 +00:00
Viktor Barzin
ecc3445860
[ci skip] openclaw: fix workspace permissions — chown to node user
Init container clones repo as root but main container runs as node (UID 1000).
Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.
2026-03-01 17:20:36 +00:00
Viktor Barzin
79af6fff47
[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory
- dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD
  list/watch needed by kopf framework in sidecar containers
- kyverno: restrict inject-priority-class-from-tier to CREATE
  operations only (was blocking pod patches with immutable spec error)
- kyverno: add resource-governance/custom-limitrange label opt-out
  to LimitRange generation policy (mirrors existing custom-quota)
- nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange
  with 8Gi max, opt out of Kyverno-managed LimitRange
2026-03-01 17:16:03 +00:00