Commit graph

2160 commits

Author SHA1 Message Date
Viktor Barzin
78eff9ab11 fix: bump book-search memory to 512Mi for file upload/email [ci skip]
Downloads and sends ebook files via HTTP — needs more than 128Mi
for large PDFs. Applied live via kubectl, persisting in Terraform.
2026-03-29 13:24:19 +03:00
Viktor Barzin
914e0b08e2 add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip] 2026-03-29 12:42:45 +03:00
Viktor Barzin
cbea959966 feat(ebooks): mount calibre-library PVC in book-search for permission fixing
CWA NETWORK_SHARE_MODE=true skips the post-import chown, leaving files
owned by root. book-search now mounts the library so it can periodically
fix permissions on recently imported books.
2026-03-29 11:31:41 +03:00
Viktor Barzin
fed9df8c0e feat(ebooks): mount stacks-config PVC in book-search for force re-download
Adds stacks-config volume mount to book-search pod so it can delete
Stacks history entries and force re-downloads when a book was consumed
by CWA but failed to import.
2026-03-29 11:26:30 +03:00
Viktor Barzin
6d44b4292f add /api/download-status to book-search unprotected API ingress [ci skip]
Needed for async polling from iOS Shortcuts — status endpoint
doesn't need Authentik auth (job IDs are unguessable UUIDs).
2026-03-29 10:11:22 +03:00
Viktor Barzin
46444e0306 openclaw: remove install-dotfiles init container to reduce NFS writes
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.

Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
2026-03-29 01:11:33 +02:00
Viktor Barzin
d89321ff3b state(openclaw): update encrypted state 2026-03-29 01:10:42 +02:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
d41211ddd5 add API key + unprotected API ingress for book-search iOS Shortcut
- API_KEY env var from calibre-secrets for /api/download-url auth
- SHORTCUT_ICLOUD_URL env var for /shortcut redirect
- Separate ingress for /api/download-url and /shortcut (bypasses Authentik)
2026-03-29 00:43:34 +02:00
Viktor Barzin
06490b0634 reduce Prometheus cardinality round 3: drop 44k more series
- cadvisor: drop unused network error/dropped counters, unused cpu
  metrics (load_avg, system, user), unused memory metrics (cache,
  failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive)
- kubelet: drop all unused histogram buckets (storage_operation, csi,
  volume_operation, image_pull, http_requests, rest_client, pod_worker,
  volume_metric, cgroup_manager) + kubernetes_feature_enabled
- apiserver: drop flowcontrol/rest_client histograms, longrunning_requests
- traefik: drop all router-level metrics (keep service + entrypoint)
- service-endpoints: drop coredns histograms, node_filesystem_*

Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)
2026-03-29 00:27:23 +02:00
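
The percentages quoted in this commit can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the figures in the message. The helper name `percent_change` is an illustrative assumption, not something from the repository.

```python
def percent_change(before: float, after: float) -> float:
    """Return the signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Figures copied from the commit message above.
series_change = percent_change(332_000, 99_000)  # post-relabel count
ingest_change = percent_change(5_480, 1_659)     # samples/sec

print(f"post-relabel: {series_change:.1f}%")
print(f"ingestion:    {ingest_change:.1f}%")
```

Both values round to -70%, which matches the summary line of the commit.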
Viktor Barzin
614d3c72bd add liveness probe to annas-archive-stacks deployment
Prevents corrupted SQLite DB from looping errors forever —
K8s will auto-restart the pod if /api/version stops responding.
2026-03-29 00:17:29 +02:00
Viktor Barzin
a9ca65bc31 reduce Prometheus cardinality round 2: drop 137k more series
- fix traefik double-scrape: kubernetes-pods job was scraping traefik
  pods again (43k duplicate series). Added namespace drop rule.
- drop unused cadvisor metrics: container_fs_*, container_blkio_*,
  container_pressure_*, container_spec_*, and misc (30k series)
- drop more apiserver histogram buckets: watch_list, watch_cache,
  response_sizes, watch_events, admission_controller, workqueue (11k)
- drop unused kube-state-metrics: replicaset_*, pod_tolerations,
  pod_labels, endpoint_*, service_*, configmap_*, etc (53k series)

Post-relabel samples: 332k → 142k (-57%)
Ingestion rate: 5,480 → 3,239 samples/sec (-41%)
2026-03-28 23:51:24 +02:00
Viktor Barzin
aceea7db94 increase global rate limit from 10/50 to 50/200
The HA frontend loads 30-50 JS bundles on page load, which exhausts the
burst allowance. iOS Companion app reconnections also trigger bursts.
Traefik logs showed 172 rate-limited (429) requests, which caused
intermittent connectivity failures for the ha-sofia iOS app.
2026-03-28 23:40:10 +02:00
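
To illustrate why a small burst allowance rejects a page load like the one described, here is a minimal token-bucket sketch. It assumes "10/50" and "50/200" mean average requests per second and burst size, which the commit does not state explicitly. It is a generic token bucket, not Traefik's implementation, and the 60 simultaneous requests are a hypothetical load.

```python
class TokenBucket:
    """Generic token bucket: refills at `average` tokens/sec up to `burst`."""

    def __init__(self, average: float, burst: int) -> None:
        self.average = average
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.average)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def rejected(average: float, burst: int, requests: int) -> int:
    """Count rejections when `requests` arrive at the same instant."""
    bucket = TokenBucket(average, burst)
    return sum(1 for _ in range(requests) if not bucket.allow(0.0))


# Hypothetical load: 60 requests at once (bundles plus reconnects).
old_rejected = rejected(average=10, burst=50, requests=60)
new_rejected = rejected(average=50, burst=200, requests=60)
print(old_rejected, new_rejected)
```

Under these assumptions the old limit rejects every request beyond the first 50, while the new limit absorbs the whole burst.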
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
9c49d4c39b state(headscale): update encrypted state 2026-03-28 16:19:09 +02:00
Viktor Barzin
725fefe565 fix: add Headscale monitoring, alerts, and pin UI image
- Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes,
  HighHTTPLatency, HighErrorRate
- Add Grafana dashboard with node count, map responses, HTTP latency,
  nodestore operations, and memory panels
- Pin headscale-ui to digest sha256:015f5ba0... (was :latest)
- Set disable_check_updates: true to skip GitHub check on startup
- Uptime Kuma monitor already existed (id=19, 300s interval)
2026-03-28 16:07:04 +02:00
Viktor Barzin
972edf4d30 state(headscale): update encrypted state 2026-03-28 16:05:24 +02:00
Viktor Barzin
069af6517e state(freedify): update encrypted state 2026-03-28 16:05:11 +02:00
Viktor Barzin
f4ff654a69 perf: optimize Headscale for connectivity and latency
- Remove viktorbarzin.me from split DNS (same IPs as public DNS,
  was adding unnecessary tunnel overhead for every DNS query)
- Narrow reverse DNS split scope from 10.0.0.0/8 to 10.0.20.0/24
  and 10.0.10.0/24 only, and from 192.168.0.0/16 to 192.168.1.0/24 only
- Add extra_records for key internal services (technitium, k8s-master)
  for instant MagicDNS resolution without tunnel roundtrip
- Replace full Tailscale DERP map (29 regions) with curated set:
  home + 8 European + 5 global fallback DERPs (14 total)
- Add custom derp.yaml to ConfigMap, sourced from Vault

Port 80 DERP dropped — Traefik's global HTTP→HTTPS redirect
prevents non-TLS DERP upgrades on the web entrypoint.
2026-03-28 15:44:13 +02:00
Viktor Barzin
29fe56aa68 state(headscale): update encrypted state 2026-03-28 15:43:54 +02:00
Viktor Barzin
8a5a53a832 fix alerts and reduce Prometheus disk write rate
- linkwarden: add Reloader match annotation to DB secret so pods
  auto-restart on Vault credential rotation (was causing 100% 5xx)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
  to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce HDD write rate
  from ~8.8 to ~6.0 MB/s (31% reduction):
  - drop all traefik/apiserver/etcd histogram bucket metrics
  - drop goflow2_flow_process_nf_templates_total (9.3k series)
  - drop container_tasks_state and container_memory_failures_total
  - rewrite HighServiceLatency alert to use avg latency (_sum/_count)
  - update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
2026-03-28 15:42:14 +02:00
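
Once the histogram bucket series are dropped, quantiles can no longer be computed from them, but a histogram's `_sum` and `_count` series still give an average. The sketch below shows that calculation over one window. The function name and the sample values are illustrative assumptions, and the real alert is presumably expressed as a Prometheus rule rather than Python.

```python
from typing import Optional


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> Optional[float]:
    """Average seconds per request over a window, from counter deltas.

    sum_*   : cumulative observed latency (the `_sum` series)
    count_* : cumulative number of observations (the `_count` series)
    """
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic in the window, average is undefined
    return (sum_end - sum_start) / requests


# Hypothetical window: 100 requests took 30 seconds in total.
avg = average_latency(120.0, 150.0, 1000, 1100)
print(avg)
```

An average hides tail latency that a p99 would expose, so this trades detail for the lower series count.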
Viktor Barzin
7267e53e2f state(headscale): update encrypted state 2026-03-28 15:41:32 +02:00
Viktor Barzin
85bbc67722 state(linkwarden): update encrypted state 2026-03-28 14:55:55 +02:00
Viktor Barzin
e79b996624 state(authentik): update encrypted state 2026-03-28 14:51:24 +02:00
Viktor Barzin
7e0b0d9362 fix: headscale VPN setup hardening
- Add SQLite backup CronJob (every 6h to NFS for cloud sync pickup)
- Move headscale-ui secrets (COOKIE_SECRET, ROOT_API_KEY) from hardcoded
  values to Vault-managed secrets
- Add DERP IPv6 address (2001:470:6e:43d::2) for IPv6-capable clients
- Clean up stale test nodes, duplicate users, rename "localhost" nodes

Also updated headscale_config in Vault to include DERP ipv6 field
and headscale_ui_cookie_secret/headscale_ui_api_key secrets.
2026-03-28 14:38:12 +02:00
Viktor Barzin
b339d454dd state(headscale): update encrypted state 2026-03-28 14:37:16 +02:00
Viktor Barzin
a42003fb8f fix: add dedicated DERP IngressRoute bypassing middlewares
CrowdSec, rate limiting, anti-AI, and error pages middlewares were
interfering with the Upgrade: DERP protocol handshake. Also updated
Headscale ACL in Vault to allow tailnet DNS traffic to Technitium
(10.0.20.200:53).
2026-03-28 14:26:51 +02:00
Viktor Barzin
1ec11cdab4 state(headscale): update encrypted state 2026-03-28 14:22:44 +02:00
Viktor Barzin
eadc266691 state(headscale): update encrypted state 2026-03-28 14:06:03 +02:00
Viktor Barzin
04a96955c0 fix: exclude NFS PVs from PVFillingUp alert
NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music
shows 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out
PVs with >1TiB capacity (always NFS mounts; iSCSI PVCs are 10-50Gi).
2026-03-28 01:14:05 +02:00
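
The capacity filter described in this commit can be sketched as follows. The real rule presumably lives in a Prometheus alert expression. The 90% threshold, the field names, and the `example-*` volumes are hypothetical, and only the navidrome-music figures come from the message.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# navidrome-music reflects the commit message; the rest is made up.
pvs = [
    {"name": "navidrome-music", "capacity_bytes": int(5.3 * TIB), "used_ratio": 0.97},
    {"name": "example-iscsi-full", "capacity_bytes": 20 * GIB, "used_ratio": 0.93},
    {"name": "example-iscsi-ok", "capacity_bytes": 50 * GIB, "used_ratio": 0.40},
]


def filling_up(volumes, threshold=0.90, max_capacity=TIB):
    """Names of volumes above `threshold`, ignoring any larger than 1 TiB.

    Volumes above `max_capacity` are treated as NFS mounts whose usage
    reflects the whole server filesystem, so they are skipped.
    """
    return [
        v["name"]
        for v in volumes
        if v["capacity_bytes"] <= max_capacity and v["used_ratio"] > threshold
    ]


alerts = filling_up(pvs)
print(alerts)
```

The size heuristic works only while no block-backed volume exceeds 1 TiB. Filtering on storage class would be a more explicit alternative.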
Viktor Barzin
ae21502698 fix: exclude disabled London Pi cloud sync task from CloudSyncFailing alert
Task 2 (Backup London pi) fails because 192.168.8.102 is unreachable.
Disabled the task in TrueNAS and excluded task_id=2 from the alert rule.
2026-03-27 15:15:48 +02:00
Viktor Barzin
252b65a574 fix: increase memory limits for OOMKilled pods (immich, clickhouse, speedtest)
- immich-server: limits 1700Mi → 2500Mi (70 restarts from media processing spikes)
- clickhouse: limits 1Gi → 1536Mi, max_server_memory_usage 800Mi → 1200Mi
- speedtest: limits 256Mi → 512Mi, requests 256Mi → 128Mi (daily OOM during test)
2026-03-27 13:57:16 +02:00
Viktor Barzin
399f0e2bd0 state(rybbit): update encrypted state 2026-03-27 13:56:54 +02:00
Viktor Barzin
44a1c3a155 state(immich): update encrypted state 2026-03-27 13:54:19 +02:00
Viktor Barzin
e23202399e state(speedtest): update encrypted state 2026-03-27 13:54:09 +02:00
Viktor Barzin
1ec480e5fa novelapp: grant vabbit81 (Gheorghe) admin RBAC on novelapp namespace 2026-03-26 17:34:48 +02:00
Viktor Barzin
2dc27ca128 state(novelapp): update encrypted state 2026-03-26 17:34:44 +02:00
Viktor Barzin
64d1a3bd24 state(woodpecker): update encrypted state 2026-03-26 17:34:18 +02:00
Viktor Barzin
e774d486fd state(rbac): add vabbit81 RBAC resources 2026-03-26 17:33:16 +02:00
Viktor Barzin
e65647edb4 state(vault): add vabbit81 user resources 2026-03-26 17:32:34 +02:00
Viktor Barzin
4e8d087b24 state(novelapp): update encrypted state 2026-03-26 17:23:34 +02:00
Viktor Barzin
5e6e71e727 novelapp: add NextAuth + Google OAuth env vars
Replace AUTH_SECRET with NEXTAUTH_URL, NEXTAUTH_SECRET, GOOGLE_CLIENT_ID,
and GOOGLE_CLIENT_SECRET for Google OAuth integration.
2026-03-26 17:14:16 +02:00
Viktor Barzin
70ea01fb6e vault: increase k8s auth token TTLs and add periodic renewal
Stagger token periods across roles (7d/8d/9d/10d) to prevent
bulk lease revocation storms that caused transient 504s.
Periodic tokens auto-renew indefinitely, eliminating mass expiry.
2026-03-26 12:21:47 +02:00
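
The effect of staggering can be illustrated by counting how many roles would hit the end of their period on the same day. This is a simplified sketch: the role names are hypothetical, all tokens are assumed to be issued at the same moment, and renewal before expiry is ignored.

```python
from collections import Counter
from datetime import datetime, timedelta


def worst_day(issued: datetime, periods_days: dict) -> int:
    """Largest number of roles whose period ends on the same calendar day."""
    days = Counter(
        (issued + timedelta(days=d)).date() for d in periods_days.values()
    )
    return max(days.values())


issued = datetime(2026, 3, 26, 12, 0)

# Hypothetical role names; the 7d/8d/9d/10d periods are from the commit.
uniform = {"role-a": 7, "role-b": 7, "role-c": 7, "role-d": 7}
staggered = {"role-a": 7, "role-b": 8, "role-c": 9, "role-d": 10}

uniform_peak = worst_day(issued, uniform)
staggered_peak = worst_day(issued, staggered)
print(uniform_peak, staggered_peak)
```

With identical periods all four roles line up on one day, while the staggered periods spread them across four days.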
Viktor Barzin
b6ac68d7f2 state(vault): update encrypted state 2026-03-26 12:21:23 +02:00
Viktor Barzin
b8a5740138 reduce alert noise: remove 4 memory alerts, raise latency threshold [ci skip]
- Remove ClusterMemoryRequestsHigh, ContainerNearOOM, NodeLowFreeMemory,
  NodeMemoryPressureTrending — all fire regularly due to intentional
  memory overcommit and are not actionable
- Keep ContainerOOMKilled (actionable — container actually died)
- Raise HighServiceLatency p99 threshold from 10s to 30s to ignore
  transient spikes
2026-03-26 01:15:18 +02:00
Viktor Barzin
2445edea8f state(freedify): update encrypted state 2026-03-26 01:13:29 +02:00
Viktor Barzin
30d58bc4c8 state(freedify): update encrypted state 2026-03-26 01:11:16 +02:00
Viktor Barzin
9e99c14a77 state(freedify): update encrypted state 2026-03-26 00:36:47 +02:00
Viktor Barzin
9bc37bf257 state(freedify): update encrypted state 2026-03-26 00:15:49 +02:00
Viktor Barzin
c732e92613 state(reverse-proxy): update encrypted state 2026-03-26 00:07:46 +02:00