Commit graph

381 commits

Author SHA1 Message Date
Viktor Barzin
914e0b08e2 add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip] 2026-03-29 12:42:45 +03:00
Viktor Barzin
cbea959966 feat(ebooks): mount calibre-library PVC in book-search for permission fixing
CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as
root. book-search now mounts the library to periodically fix permissions
on recently imported books.
2026-03-29 11:31:41 +03:00
Viktor Barzin
fed9df8c0e feat(ebooks): mount stacks-config PVC in book-search for force re-download
Adds stacks-config volume mount to book-search pod so it can delete
Stacks history entries and force re-downloads when a book was consumed
by CWA but failed to import.
2026-03-29 11:26:30 +03:00
Viktor Barzin
6d44b4292f add /api/download-status to book-search unprotected API ingress [ci skip]
Needed for async polling from iOS Shortcuts — status endpoint
doesn't need Authentik auth (job IDs are unguessable UUIDs).
2026-03-29 10:11:22 +03:00
Viktor Barzin
46444e0306 openclaw: remove install-dotfiles init container to reduce NFS writes
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.

Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
2026-03-29 01:11:33 +02:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
d41211ddd5 add API key + unprotected API ingress for book-search iOS Shortcut
- API_KEY env var from calibre-secrets for /api/download-url auth
- SHORTCUT_ICLOUD_URL env var for /shortcut redirect
- Separate ingress for /api/download-url and /shortcut (bypasses Authentik)
2026-03-29 00:43:34 +02:00
Viktor Barzin
06490b0634 reduce Prometheus cardinality round 3: drop 44k more series
- cadvisor: drop unused network error/dropped counters, unused cpu
  metrics (load_avg, system, user), unused memory metrics (cache,
  failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive)
- kubelet: drop all unused histogram buckets (storage_operation, csi,
  volume_operation, image_pull, http_requests, rest_client, pod_worker,
  volume_metric, cgroup_manager) + kubernetes_feature_enabled
- apiserver: drop flowcontrol/rest_client histograms, longrunning_requests
- traefik: drop all router-level metrics (keep service + entrypoint)
- service-endpoints: drop coredns histograms, node_filesystem_*

Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)
2026-03-29 00:27:23 +02:00
Viktor Barzin
614d3c72bd add liveness probe to annas-archive-stacks deployment
Prevents corrupted SQLite DB from looping errors forever —
K8s will auto-restart the pod if /api/version stops responding.
2026-03-29 00:17:29 +02:00
Viktor Barzin
a9ca65bc31 reduce Prometheus cardinality round 2: drop 137k more series
- fix traefik double-scrape: kubernetes-pods job was scraping traefik
  pods again (43k duplicate series). Added namespace drop rule.
- drop unused cadvisor metrics: container_fs_*, container_blkio_*,
  container_pressure_*, container_spec_*, and misc (30k series)
- drop more apiserver histogram buckets: watch_list, watch_cache,
  response_sizes, watch_events, admission_controller, workqueue (11k)
- drop unused kube-state-metrics: replicaset_*, pod_tolerations,
  pod_labels, endpoint_*, service_*, configmap_*, etc (53k series)

Post-relabel samples: 332k → 142k (-57%)
Ingestion rate: 5,480 → 3,239 samples/sec (-41%)
2026-03-28 23:51:24 +02:00
Viktor Barzin
aceea7db94 increase global rate limit from 10/50 to 50/200
HA frontend loads 30-50 JS bundles on page load, exhausting the burst.
iOS Companion app reconnections also trigger bursts. 172 rate-limited
(429) requests found in Traefik logs causing intermittent connectivity
failures for ha-sofia iOS app.
2026-03-28 23:40:10 +02:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
725fefe565 fix: add Headscale monitoring, alerts, and pin UI image
- Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes,
  HighHTTPLatency, HighErrorRate
- Add Grafana dashboard with node count, map responses, HTTP latency,
  nodestore operations, and memory panels
- Pin headscale-ui to digest sha256:015f5ba0... (was :latest)
- Set disable_check_updates: true to skip GitHub check on startup
- Uptime Kuma monitor already existed (id=19, 300s interval)
2026-03-28 16:07:04 +02:00
Viktor Barzin
f4ff654a69 perf: optimize Headscale for connectivity and latency
- Remove viktorbarzin.me from split DNS (same IPs as public DNS,
  was adding unnecessary tunnel overhead for every DNS query)
- Narrow reverse DNS split scope from 10.0.0.0/8 → 10.0.20.0/24
  and 10.0.10.0/24 only; 192.168.0.0/16 → 192.168.1.0/24 only
- Add extra_records for key internal services (technitium, k8s-master)
  for instant MagicDNS resolution without tunnel roundtrip
- Replace full Tailscale DERP map (29 regions) with curated set:
  home + 8 European + 5 global fallback DERPs (14 total)
- Add custom derp.yaml to ConfigMap, sourced from Vault

Port 80 DERP dropped — Traefik's global HTTP→HTTPS redirect
prevents non-TLS DERP upgrades on the web entrypoint.
2026-03-28 15:44:13 +02:00
Viktor Barzin
8a5a53a832 fix alerts and reduce Prometheus disk write rate
- linkwarden: add Reloader match annotation to DB secret so pods
  auto-restart on Vault credential rotation (was causing 100% 5xx)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
  to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce HDD write rate
  from ~8.8 to ~6.0 MB/s (31% reduction):
  - drop all traefik/apiserver/etcd histogram bucket metrics
  - drop goflow2_flow_process_nf_templates_total (9.3k series)
  - drop container_tasks_state and container_memory_failures_total
  - rewrite HighServiceLatency alert to use avg latency (_sum/_count)
  - update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
2026-03-28 15:42:14 +02:00
Viktor Barzin
7e0b0d9362 fix: headscale VPN setup hardening
- Add SQLite backup CronJob (every 6h to NFS for cloud sync pickup)
- Move headscale-ui secrets (COOKIE_SECRET, ROOT_API_KEY) from hardcoded
  values to Vault-managed secrets
- Add DERP IPv6 address (2001:470:6e:43d::2) for IPv6-capable clients
- Clean up stale test nodes, duplicate users, rename "localhost" nodes

Also updated headscale_config in Vault to include DERP ipv6 field
and headscale_ui_cookie_secret/headscale_ui_api_key secrets.
2026-03-28 14:38:12 +02:00
Viktor Barzin
a42003fb8f fix: add dedicated DERP IngressRoute bypassing middlewares
CrowdSec, rate limiting, anti-AI, and error pages middlewares were
interfering with the Upgrade: DERP protocol handshake. Also updated
Headscale ACL in Vault to allow tailnet DNS traffic to Technitium
(10.0.20.200:53).
2026-03-28 14:26:51 +02:00
Viktor Barzin
04a96955c0 fix: exclude NFS PVs from PVFillingUp alert
NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music
shows 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out
PVs with >1TiB capacity (always NFS mounts; iSCSI PVCs are 10-50Gi).
2026-03-28 01:14:05 +02:00
Viktor Barzin
ae21502698 fix: exclude disabled London Pi cloud sync task from CloudSyncFailing alert
Task 2 (Backup London pi) fails because 192.168.8.102 is unreachable.
Disabled task via TrueNAS, excluded task_id=2 from alert rule.
2026-03-27 15:15:48 +02:00
Viktor Barzin
252b65a574 fix: increase memory limits for OOMKilled pods (immich, clickhouse, speedtest)
- immich-server: limits 1700Mi → 2500Mi (70 restarts from media processing spikes)
- clickhouse: limits 1Gi → 1536Mi, max_server_memory_usage 800Mi → 1200Mi
- speedtest: limits 256Mi → 512Mi, requests 256Mi → 128Mi (daily OOM during test)
2026-03-27 13:57:16 +02:00
Viktor Barzin
1ec480e5fa novelapp: grant vabbit81 (Gheorghe) admin RBAC on novelapp namespace 2026-03-26 17:34:48 +02:00
Viktor Barzin
5e6e71e727 novelapp: add NextAuth + Google OAuth env vars
Replace AUTH_SECRET with NEXTAUTH_URL, NEXTAUTH_SECRET, GOOGLE_CLIENT_ID,
and GOOGLE_CLIENT_SECRET for Google OAuth integration.
2026-03-26 17:14:16 +02:00
Viktor Barzin
70ea01fb6e vault: increase k8s auth token TTLs and add periodic renewal
Stagger token periods across roles (7d/8d/9d/10d) to prevent
bulk lease revocation storms that caused transient 504s.
Periodic tokens auto-renew indefinitely, eliminating mass expiry.
2026-03-26 12:21:47 +02:00
Viktor Barzin
b8a5740138 reduce alert noise: remove 4 memory alerts, raise latency threshold [ci skip]
- Remove ClusterMemoryRequestsHigh, ContainerNearOOM, NodeLowFreeMemory,
  NodeMemoryPressureTrending — all fire regularly due to intentional
  memory overcommit and are not actionable
- Keep ContainerOOMKilled (actionable — container actually died)
- Raise HighServiceLatency p99 threshold from 10s to 30s to ignore
  transient spikes
2026-03-26 01:15:18 +02:00
Viktor Barzin
4e74f816bc cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]
Both services migrated to unified ebooks namespace. Remove:
- Old stack directories and Terraform state
- calibre references from monitoring namespace lists
- calibre/audiobookshelf from operational scripts
2026-03-25 23:56:07 +02:00
Viktor Barzin
95e49134ae cleanup: remove old audiobook-search, superseded by book-search
- Delete servarr/audiobook-search TF module (moved to ebooks/book-search)
- Remove audiobook-search from cloudflare_proxied_names
- Remove commented-out module reference in servarr/main.tf
- Clean up "renamed from" comment in ebooks/main.tf
- K8s resources (deploy/svc/ingress) deleted from servarr namespace
- Cloudflare DNS record already absent
- Import book-search and insta2spotify DNS records into cloudflared state
2026-03-25 23:16:01 +02:00
Viktor Barzin
fe27709fd4 fix email monitor: use internal URL for Uptime Kuma push
Pods can't reach uptime.viktorbarzin.me externally. Switch to
http://uptime-kuma.uptime-kuma.svc.cluster.local for the push endpoint.
2026-03-25 22:59:26 +02:00
Viktor Barzin
78dec8f0ad add e2e email roundtrip monitoring
CronJob (every 30 min) sends test email via Mailgun API to
smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all,
deletes test email, pushes metrics to Pushgateway + Uptime Kuma.

Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale,
EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.
2026-03-25 22:50:22 +02:00
Viktor Barzin
3adaf88f62 add MAM_ID env var to book-search deployment [ci skip] 2026-03-25 15:52:24 +02:00
Viktor Barzin
946ea9e1f3 fix ebooks stack: prefix PV names, add book-search DNS, add secrets symlink [ci skip] 2026-03-25 15:14:08 +02:00
Viktor Barzin
6e1d8c0c8b add ebooks stack: consolidate book services into single namespace [ci skip]
- New ebooks namespace with CWA, Stacks, Audiobookshelf, book-search
- book-search (renamed from audiobook-search) with CWA ingest volume
- Comment out audiobook_search module from servarr
- All NFS volumes and secrets consolidated
2026-03-25 15:04:27 +02:00
Viktor Barzin
1ce8b3d899 remove setup_tls_secret from insta2spotify (Kyverno auto-syncs) 2026-03-25 13:44:34 +02:00
Viktor Barzin
6dda15afa0 add insta2spotify stack: namespace, ESO, NFS, 2-container deploy, split ingress
- Namespace insta2spotify (tier 4-aux)
- ExternalSecret from Vault secret/insta2spotify
- NFS volume at /mnt/main/insta2spotify for SQLite + Spotify cache
- Frontend (128Mi) + backend (512Mi req / 2Gi limit) in one pod
- Split ingress: protected (Authentik) for frontend, unprotected for /api/*
- DNS via Cloudflare (proxied)
2026-03-25 13:03:35 +02:00
Viktor Barzin
009f4b3b89 change qBittorrent torrent port from 6881 to 50000
Port 6881 is blacklisted by MAM and throttled by ISPs.
Also added pfSense NAT rule for 50000 TCP+UDP → 10.0.20.200.
2026-03-25 12:29:00 +02:00
Viktor Barzin
5b5a7d8cb4 add MAM email/password env vars to audiobook-search deployment
Reads mam_email and mam_password from Vault secret/servarr via ESO.
2026-03-25 12:03:12 +02:00
Viktor Barzin
e455bd06f4 state(monitoring): update encrypted state 2026-03-25 11:04:29 +02:00
Viktor Barzin
8c6f238697 add default Homepage annotations to ingress_factory for auto-discovery
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
  (name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
2026-03-25 11:00:38 +02:00
Viktor Barzin
d20c5e5535 add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard
- All 7 backup CronJobs now push backup_output_bytes (file size after backup)
- Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes
- Grafana dashboard: new Output (MiB) table column, Output Size Trend panel,
  Write Throughput panel, Cloud Sync Transfer Volume bargauge
- All timeseries panels use points-only draw style (discrete backup snapshots)
- etcd backup restructured: init_container for etcdctl (distroless image),
  busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS
- Fixed pre-existing curl missing in postgres:16.4-bullseye (immich, dbaas PG)
- Fixed grep -oP not available in alpine/busybox (cloud sync monitor)
2026-03-25 10:44:53 +02:00
Viktor Barzin
f0eb4fae8b fix: openclaw task-processor use internal Forgejo URL
The task-processor CronJob was failing every 5min because
it used https://forgejo.viktorbarzin.me (external, via Cloudflare
tunnel) which is unreachable from within the cluster. Changed to
http://forgejo.forgejo.svc.cluster.local (internal ClusterIP).
2026-03-24 19:40:15 +02:00
Viktor Barzin
42eb85c578 fix: rybbit init port, mysql memory limit, metallb alert selector
- rybbit-client: fix Kyverno wait-for port 3001 → 80 (service port, not targetPort)
- dbaas: increase MySQL memory limit 4Gi → 5Gi (mysql-cluster-1 at 95.9%)
- dbaas: bump ResourceQuota limits.memory 24Gi → 27Gi to accommodate
- monitoring: fix MetalLBControllerDown alert selector for v0.15 (controller → metallb-controller)
2026-03-24 18:55:07 +02:00
Viktor Barzin
c49e4561a3 consolidate MetalLB IPs: 5 → 1 (10.0.20.200)
Migrate all 11 LoadBalancer services to share 10.0.20.200:
- Update annotations: metallb.universe.tf → metallb.io
- Pin all services to 10.0.20.200 with allow-shared-ip: shared
- Standardize externalTrafficPolicy to Cluster (required for IP sharing)
- Remove redundant port 80 (roundcube) from mailserver LB
- Update CoreDNS forward: 10.0.20.204 → 10.0.20.200
- Update cloudflared tunnel target: 10.0.20.202 → 10.0.20.200

Services consolidated: coturn, headscale, kms, qbittorrent, shadowsocks,
torrserver, wireguard, mailserver, traefik, xray, technitium
2026-03-24 18:35:43 +02:00
Viktor Barzin
33037eba46 upgrade MetalLB v0.10.2 → v0.15.3 and update annotations
- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
2026-03-24 17:24:05 +02:00
Viktor Barzin
a644eb1c8e headscale: add STUN port, upgrade to 0.28.0, fix Home DERP connectivity
- Expose STUN port 3479/UDP on container and LoadBalancer service
- Upgrade headscale from 0.23.0 to 0.28.0
- Vault config updated: auto DERP region with ipv4 field, ISP router
  port forward for UDP 3479 added

Home DERP now shows ~3ms latency and is selected as nearest relay.
2026-03-24 14:51:09 +02:00
Viktor Barzin
540d7de807 add wealthfolio-sync CronJob for automated portfolio sync
Monthly CronJob (1st at 08:00 UTC) syncs trades from Schwab, Trading 212,
and InvestEngine into Wealthfolio SQLite DB. Added Kyverno ndots lifecycle
ignore. Removed stale manual sync comment.
2026-03-24 02:07:36 +02:00
Viktor Barzin
4ca7af8818 add audiobook-search service to servarr stack
- New audiobook-search deployment + service + ingress (Authentik-protected)
- qBittorrent: add NFS mount for /audiobooks (shared with Audiobookshelf)
- Cloudflare DNS: add audiobook-search.viktorbarzin.me
- Env vars: QBITTORRENT_URL/PASS, AUDIOBOOKSHELF_URL/TOKEN from ESO
2026-03-24 01:21:49 +02:00
Viktor Barzin
d9eaf42f36 exclude iDRAC from HighServiceLatency alert
iDRAC Redfish exporter is inherently slow, causing noisy alerts.
2026-03-23 22:51:42 +02:00
Viktor Barzin
304f0de43a add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.
2026-03-23 22:24:17 +02:00
Viktor Barzin
16cde1eab5 add Kyverno TLS secret sync + enhance renewal pipeline
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
2026-03-23 22:19:34 +02:00
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
a95d434ff1 fix backup IO stats: use /proc/$$/io instead of /proc/self/io
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
2026-03-23 12:33:52 +02:00