Commit graph

183 commits

Author SHA1 Message Date
Viktor Barzin
ff03f2b99f tune Nextcloud Apache/PHP to fix constant crash-looping (50 restarts/6d)
Root cause: Apache prefork with 150 MaxRequestWorkers (each ~220MB RSS)
on SQLite DB causes worker exhaustion + lock contention → Apache hangs →
aggressive liveness probe (3 failures × 10s) kills container.

Fixes:
- Apache: MaxRequestWorkers 150→25, MaxConnectionsPerChild 0→200,
  StartServers 5→3 (via ConfigMap mount over mpm_prefork.conf)
- PHP: max_execution_time 0→300s, max_input_time 300s (prevent zombie workers)
- Liveness probe: period 10s→30s, failureThreshold 3→6, timeout 5s→10s
  (180s tolerance vs 30s before)
- Readiness probe: period 10s→30s, timeout 5s→10s
2026-03-08 21:33:27 +00:00
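
The figures quoted in the commit above can be checked with simple arithmetic. This is a minimal sketch, not part of the commit: the function names are hypothetical, it assumes a probe's tolerance is roughly period × failureThreshold, and it assumes every worker uses the quoted ~220MB RSS.

```python
# Illustrative arithmetic only; the function names are hypothetical.

def probe_tolerance_s(period_s: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive probe failures before a restart."""
    return period_s * failure_threshold


def worst_case_worker_rss_mb(max_workers: int, rss_per_worker_mb: int = 220) -> int:
    """Upper bound on prefork memory if every worker is alive at ~220MB RSS."""
    return max_workers * rss_per_worker_mb


print(probe_tolerance_s(10, 3))       # old liveness probe: 30
print(probe_tolerance_s(30, 6))       # new liveness probe: 180
print(worst_case_worker_rss_mb(150))  # old MaxRequestWorkers: 33000
print(worst_case_worker_rss_mb(25))   # new MaxRequestWorkers: 5500
```

Under these assumptions the old worker limit allowed roughly 33GB of worker memory in the worst case, against about 5.5GB after the change.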
Viktor Barzin
ad8b90575e fix noisy JobFailed and duplicate mail server alerts
- JobFailed: only alert on jobs started within the last hour, so stale
  failed CronJob runs don't keep firing after subsequent runs succeed
- Mail server alert: renamed to MailServerDown, now targets the specific
  mailserver deployment instead of all deployments in the namespace
  (was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
2026-03-08 21:22:43 +00:00
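
The JobFailed change above amounts to a recency filter. The real rule is presumably an alerting expression; the Python below only sketches the logic. The `JobRun` type and `should_alert` function are hypothetical names, and the one-hour window is taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """Hypothetical, simplified view of a Kubernetes Job run."""
    name: str
    start_time: float  # Unix timestamp in seconds
    failed: bool


def should_alert(job: JobRun, now: float, window_s: float = 3600.0) -> bool:
    """Alert only for failed jobs that started within the last window_s seconds."""
    return job.failed and (now - job.start_time) <= window_s


now = 100_000.0
stale = JobRun("backup-old", start_time=now - 7200, failed=True)
recent = JobRun("backup-new", start_time=now - 600, failed=True)
print(should_alert(stale, now))   # False: failed, but started two hours ago
print(should_alert(recent, now))  # True: failed and started ten minutes ago
```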
Viktor Barzin
33c7976630 reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts
- NodeDown now suppresses workload and service alerts (PodCrashLooping,
  DeploymentReplicasMismatch, StatefulSetReplicasMismatch, etc.)
- NFSServerUnresponsive suppresses pod-level alerts
- Increased alert "for" durations on transient alerts (e.g. 15m→30m for replica mismatches)
- NodeDown for: 1m→3m to avoid flapping
- Removed all 3 Loki log-based alerts (duplicated Prometheus alerts)
- Downgraded HeadscaleDown critical→warning, mail server page→warning
2026-03-08 21:13:16 +00:00
Viktor Barzin
2fa8ba2038 [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
2026-03-08 20:03:50 +00:00
Viktor Barzin
6b3e84f465 deploy Sealed Secrets controller for encrypted secret management
Adds Sealed Secrets (Bitnami) to the platform stack so cluster users can
encrypt secrets with a public key and commit SealedSecret YAMLs to git.
The in-cluster controller decrypts them into regular K8s Secrets.

- New module: sealed-secrets (namespace + Helm chart v2.18.3, cluster tier)
- k8s-portal setup script: adds kubeseal CLI install for Linux and Mac
2026-03-08 19:49:48 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
ead33b23dd enable MySQL InnoDB Cluster auto-recovery after crashes
Previously manualStartOnBoot=true and exitStateAction=ABORT_SERVER meant
any ungraceful shutdown required manual rebootClusterFromCompleteOutage().

New settings:
- group_replication_start_on_boot=ON: auto-start GR after crash
- autorejoin_tries=2016: retry rejoining for up to ~7 days (2016 tries, 5 minutes apart)
- exit_state_action=OFFLINE_MODE: stay alive on expulsion (don't abort)
- member_expel_timeout=30s: tolerate brief unresponsiveness
- unreachable_majority_timeout=60s: leave group cleanly if majority lost
2026-03-08 17:13:03 +00:00
Viktor Barzin
fffc2ed0ab fix node OOM: reduce memory overcommit ratio and add kubelet eviction thresholds
LimitRange defaults had a 4-8x limit/request ratio causing the scheduler
to overpack nodes. When pods burst, nodes OOM-thrashed and became
unresponsive (k8s-node3 and k8s-node4 both went down today).

Changes:
- Increase default memory requests across all tiers (ratio now 2x):
  - core/cluster: 64Mi → 256Mi request (512Mi limit)
  - gpu: 256Mi → 1Gi request (2Gi limit)
  - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit)
- Add kubelet memory reservation and eviction thresholds:
  - systemReserved: 512Mi, kubeReserved: 512Mi
  - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset)
  - Applied to all nodes and future node template
2026-03-08 10:33:38 +00:00
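
The limit/request ratios in the commit above can be reproduced from the listed numbers. This is a sketch under one assumption that the commit does not state: the limits shown in parentheses are taken to be the same before and after the change. The `TIERS` table and `ratio` helper are hypothetical names.

```python
# Values in Mi, copied from the commit message.
# Assumption: limits are unchanged by the commit.
TIERS = {
    # tier: (old request, new request, limit)
    "core/cluster": (64, 256, 512),
    "gpu": (256, 1024, 2048),
    "edge/aux/fallback": (64, 128, 256),
}


def ratio(limit_mi: int, request_mi: int) -> float:
    """Memory overcommit ratio: how far a pod may burst above its request."""
    return limit_mi / request_mi


for tier, (old_req, new_req, limit) in TIERS.items():
    print(f"{tier}: {ratio(limit, old_req):.0f}x -> {ratio(limit, new_req):.0f}x")
# core/cluster: 8x -> 2x
# gpu: 8x -> 2x
# edge/aux/fallback: 4x -> 2x
```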
Viktor Barzin
193f2e2dc5 add iSCSI persistent volume for plotting-book SQLite database
Create a 1Gi PVC using iscsi-truenas StorageClass, mount at /data,
and set DB_PATH=/data/database.sqlite for persistent storage.
2026-03-07 21:57:22 +00:00
Viktor Barzin
4374e78869 [ci skip] fix Wealthfolio Homepage icon: wealthfolio.png → mdi-finance
2026-03-07 21:32:58 +00:00
Viktor Barzin
9d031290cc [ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks
tandoor.png → tandoor-recipes.png (dashboard-icons), podcast.png →
mdi-podcast, networking.png → mdi-lan, goldilocks.png → mdi-scale-balance
2026-03-07 21:29:51 +00:00
Viktor Barzin
32bd30f56e [ci skip] fix invalid Homepage dashboard icons for 9 services
Use correct dashboard-icons names where available (changedetection,
gramps-web), Material Design Icons for custom apps (city-guesser,
plotting-book, resume, tuya-bridge, trading-bot, poison-fountain),
and Simple Icons for F1 Stream.
2026-03-07 21:14:17 +00:00
Viktor Barzin
76a4987eef [ci skip] add Forgejo task pipeline for OpenClaw AI agent
Forgejo issues as a task queue for OpenClaw:
- Forgejo OAuth2 with Authentik SSO, self-registration disabled
- Webhook-triggered task processing (instant) + CronJob backup (5min poll)
- Tasks processed via Mistral Large 3 (NVIDIA NIM API)
- Results posted as issue comments, auto-labeled and closed
- Comment follow-ups and reopened issues supported
- n8n RBAC for OpenClaw pod exec (future workflow integration)
2026-03-07 21:11:07 +00:00
Viktor Barzin
c2765e890b add nginx caching proxy for Homepage widget API requests
Stale-while-revalidate cache in front of Homepage reduces first-paint
latency by serving cached /api/ responses instantly while refreshing
upstream in background. Non-API paths pass through uncached.
2026-03-07 21:11:07 +00:00
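
The stale-while-revalidate behaviour described in the commit above can be illustrated outside nginx. The Python below is a minimal sketch of the semantics only and is not the nginx configuration from the commit. The `SWRCache` class, its parameters, and the injectable `clock` and `schedule` hooks are all hypothetical.

```python
import threading
import time
from typing import Callable, Dict, Optional, Set, Tuple


class SWRCache:
    """Serve cached values immediately; refresh stale ones in the background."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
        schedule: Optional[Callable[[Callable[[], None]], None]] = None,
    ) -> None:
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._schedule = schedule or self._spawn_thread
        self._entries: Dict[str, Tuple[float, str]] = {}
        self._refreshing: Set[str] = set()
        self._lock = threading.Lock()

    @staticmethod
    def _spawn_thread(task: Callable[[], None]) -> None:
        threading.Thread(target=task, daemon=True).start()

    def _refresh(self, key: str) -> None:
        try:
            value = self._fetch(key)
            with self._lock:
                self._entries[key] = (self._clock(), value)
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key: str) -> str:
        with self._lock:
            entry = self._entries.get(key)
        if entry is None:
            # Cold miss: nothing to serve yet, so fetch synchronously.
            self._refresh(key)
            with self._lock:
                return self._entries[key][1]
        stored_at, value = entry
        if self._clock() - stored_at > self._max_age_s:
            with self._lock:
                start = key not in self._refreshing
                self._refreshing.add(key)
            if start:
                # Stale hit: return the old value now, refresh upstream later.
                self._schedule(lambda: self._refresh(key))
        return value
```

A caller would wrap the upstream request in `fetch`. Only the first request for a key waits on upstream, and later requests get the cached copy even while a refresh is in flight.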
root
b4f9777ecd Woodpecker CI deploy commit [CI SKIP]
2026-03-07 20:47:22 +00:00
Viktor Barzin
33b20ce111 add Google OAuth env vars to plotting-book deployment
Deploy GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and GOOGLE_CALLBACK_URL
to the plotting-book container. Update CSP to allow accounts.google.com
for connect-src and form-action directives.
2026-03-07 20:41:08 +00:00
Viktor Barzin
650c2a6ed4 [ci skip] add liveness probe to Send deployment
Prevents stale Redis connections from silently breaking file uploads.
The old Node.js Redis client doesn't auto-reconnect after Redis restarts,
causing all files to appear expired.
2026-03-07 20:39:57 +00:00
Viktor Barzin
7af0024473 [ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0
2026-03-07 20:39:56 +00:00
Viktor Barzin
3ec643a897 [ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint)
Replace wanStatus with temp field. Remove wan interface param since
the pfSense REST API v2 package doesn't expose /status/gateway.
2026-03-07 20:39:56 +00:00
Viktor Barzin
f3042f318e [ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (internal rejects untrusted Host header)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
2026-03-07 20:39:56 +00:00
Viktor Barzin
17256c8f76 [ci skip] fix widget URLs: use correct k8s service ports
Services expose port 80 via ClusterIP but widgets were using container
target ports (5000, 3001, 4533, 3000). Calibre was using external URL
through Authentik. All now use correct internal service URLs.
2026-03-07 20:39:56 +00:00
Viktor Barzin
c9bb470259 [ci skip] upgrade Homepage from v1.8.0 to v1.10.1
2026-03-07 20:39:56 +00:00
Viktor Barzin
57eed07370 [ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma
Add API credentials to SOPS and wire homepage_credentials through
stacks. Re-add Uptime Kuma widget with new "infra" status page slug.
2026-03-07 20:39:55 +00:00
Viktor Barzin
10acdcd5a2 [ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale
Wire homepage_credentials through servarr parent stack for prowlarr.
Fix paperless-ngx widget to use internal service URL.
2026-03-07 20:39:55 +00:00
Viktor Barzin
1f1700c4ff [ci skip] fix broken Homepage widgets + add service API tokens to SOPS
- Grafana: fix service URL (grafana not monitoring-grafana)
- Uptime Kuma: remove widget (no status page configured)
- Speedtest/Frigate/Immich: use internal k8s service URLs (external
  goes through Authentik forward auth, blocking API calls)
- pfSense: clean up annotations
- SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens
2026-03-07 20:39:55 +00:00
Viktor Barzin
a9daf50142 [ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant
Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
2026-03-07 20:39:54 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
2dc5ab8995 [ci skip] fix false-positive sensitive=true on kube_config_path
2026-03-07 15:48:19 +00:00
Viktor Barzin
b6aacf7b02 [ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.

E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
2026-03-07 15:34:32 +00:00
Viktor Barzin
6f8b48a73c [ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push
viktorbarzin/k8s-portal:latest
2026-03-07 15:06:26 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
5b28319bc3 fix(actualbudget): raise http-api resources to prevent OOM [ci skip]
2026-03-07 00:28:02 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
614d14c47d [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history
Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead.
PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB.
Historical TSDB data restored from NFS (39.8 GB total including all blocks).
2026-03-06 23:16:32 +00:00
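
The sizing in the commit above follows from the quoted ingestion rate. The sketch below only redoes that arithmetic. The function names are hypothetical, and it treats GB and Gi as roughly interchangeable, which is an approximation.

```python
# Illustrative arithmetic only; the function names are hypothetical.

def yearly_ingest_gb(gb_per_month: float, months: int = 12) -> float:
    """Projected TSDB growth over the given number of months."""
    return gb_per_month * months


def headroom_gb(retention_gb: float, projected_gb: float) -> float:
    """Space left under the retention cap after the projected growth."""
    return retention_gb - projected_gb


yearly = yearly_ingest_gb(10.5)
print(yearly)                    # 126.0
print(headroom_gb(180, yearly))  # 54.0
print(180 < 200)                 # True: the retention cap sits below the PVC size
```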
Viktor Barzin
a52a371e35 [ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore
2026-03-06 22:51:38 +00:00
Viktor Barzin
100a876dfe [ci skip] migrate Redis, Prometheus, Loki storage to iSCSI
- Redis: local-path → iscsi-truenas (master + replica persistence)
- Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data)
- Loki: NFS PV → dynamic iSCSI via storageClass in Helm values
- Deleted 2 orphaned Released iSCSI PVs (31Gi freed)
2026-03-06 20:50:55 +00:00
Viktor Barzin
23202fbf13 [ci skip] reduce resource limits per VPA recommendations
dashy: 4Gi→512Mi mem, 2→500m cpu (actual: 206Mi)
affine: 4Gi→512Mi mem, 2→1 cpu (actual: 186Mi)
rybbit clickhouse: 4Gi→2Gi mem, 2→1 cpu (actual: 618Mi)
2026-03-06 20:23:21 +00:00
Viktor Barzin
a48915ee02 [ci skip] exclude linkwarden from HighService4xxRate alert
2026-03-06 20:15:58 +00:00
Viktor Barzin
fb199e2da9 [ci skip] remove atuin: destroy stack, DNS, NFS export, PostgreSQL credentials
2026-03-06 20:11:14 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
946e6f14be [ci skip] fix calibre: bump CPU/memory to prevent SIGBUS during calibre_postinstall
2026-03-03 19:48:45 +00:00
Viktor Barzin
22223ec0fd [ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi)
- Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default)
- Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default)
- Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)
2026-03-02 21:39:14 +00:00
Viktor Barzin
307b356f06 [ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3)
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.

Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).

Requires re-applying all stacks to propagate to existing PVs.
2026-03-02 20:23:36 +00:00
Viktor Barzin
220aa739ce [ci skip] migrate servarr sub-stacks + actualbudget factory NFS to CSI PV/PVC
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
2026-03-02 02:04:22 +00:00
Viktor Barzin
0abae33c71 [ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
79a2aa3784 [ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC
Batch migration of all single-volume and simple multi-volume stacks.
All services verified healthy after migration. Uses nfs-truenas
StorageClass with soft,timeo=30,retrans=3 mount options to eliminate
stale NFS mount hangs.

Services: atuin, audiobookshelf, calibre, changedetection, diun,
excalidraw, forgejo, freshrss, grampsweb, hackmd, health,
isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama,
onlyoffice, owntracks, paperless-ngx, poison-fountain, send,
stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp
2026-03-02 00:15:39 +00:00
Viktor Barzin
853a96cb57 [ci skip] migrate privatebin, resume, speedtest NFS volumes to CSI PV/PVC
Pilot migration: replace inline nfs {} volumes with CSI-backed PV/PVC
using nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options).
2026-03-01 23:42:23 +00:00
Viktor Barzin
c702fd2565 [ci skip] add NFS CSI driver + nfs_volume shared module
- Deploy csi-driver-nfs Helm chart as platform module (nfs-csi)
- Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options
- Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)
2026-03-01 23:38:58 +00:00
Viktor Barzin
de598996f1 [ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00