Commit graph

339 commits

Author SHA1 Message Date
Viktor Barzin
a644eb1c8e headscale: add STUN port, upgrade to 0.28.0, fix Home DERP connectivity
- Expose STUN port 3479/UDP on container and LoadBalancer service
- Upgrade headscale from 0.23.0 to 0.28.0
- Vault config updated: auto DERP region with ipv4 field, ISP router
  port forward for UDP 3479 added

Home DERP now shows ~3ms latency and is selected as nearest relay.
2026-03-24 14:51:09 +02:00
Viktor Barzin
540d7de807 add wealthfolio-sync CronJob for automated portfolio sync
Monthly CronJob (1st at 08:00 UTC) syncs trades from Schwab, Trading 212,
and InvestEngine into Wealthfolio SQLite DB. Added Kyverno ndots lifecycle
ignore. Removed stale manual sync comment.
2026-03-24 02:07:36 +02:00
Viktor Barzin
4ca7af8818 add audiobook-search service to servarr stack
- New audiobook-search deployment + service + ingress (Authentik-protected)
- qBittorrent: add NFS mount for /audiobooks (shared with Audiobookshelf)
- Cloudflare DNS: add audiobook-search.viktorbarzin.me
- Env vars: QBITTORRENT_URL/PASS, AUDIOBOOKSHELF_URL/TOKEN from ESO
2026-03-24 01:21:49 +02:00
Viktor Barzin
d9eaf42f36 exclude iDRAC from HighServiceLatency alert
iDRAC Redfish exporter is inherently slow, causing noisy alerts.
2026-03-23 22:51:42 +02:00
Viktor Barzin
304f0de43a add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.
2026-03-23 22:24:17 +02:00
Viktor Barzin
16cde1eab5 add Kyverno TLS secret sync + enhance renewal pipeline
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
2026-03-23 22:19:34 +02:00
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
a95d434ff1 fix backup IO stats: use /proc/$$/io instead of /proc/self/io
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
2026-03-23 12:33:52 +02:00
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
Viktor Barzin
0b595751c5 move Frigate cache to tmpfs to eliminate disk writes on node1
Add 512Mi tmpfs emptyDir for /tmp/cache — Frigate writes 10s MP4
segments here continuously for all cameras. With motion-only retention,
segments without events are deleted immediately anyway, so losing them
on pod restart is acceptable.

Node1 disk writes: 3.55 MB/s → 2.08 MB/s (previous commit) → 96 KB/s (now)
2026-03-23 11:52:49 +02:00
Viktor Barzin
1639910043 ingress latency: add histogram buckets, fix restarts, right-size memory
- Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99
- Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts
- Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi)
- Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi)
- ActualBudget: add explicit resources to prevent silent LimitRange defaults
- Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)
2026-03-23 10:52:43 +02:00
Viktor Barzin
5652972c53 fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
  exist, use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart
2026-03-23 10:31:44 +02:00
Viktor Barzin
9527f62c2e fix network traffic dashboard: use only available GoFlow2 metrics
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) — no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced with flowset records by type, separated bytes and flows
into own panels to avoid scale issues.
2026-03-23 10:16:46 +02:00
Viktor Barzin
d401568317 fix CrowdSec collection names and increase Helm timeout
- Fix: crowdsecurity/pf → crowdsecurity/pfsense + firewallservices/pf
- Move syslog acquisition to custom ConfigMap (Helm schema validation)
- Increase Helm timeout to 1200s for DaemonSet rollout
2026-03-23 03:41:13 +02:00
Viktor Barzin
55246c8b5d add network traffic monitoring and adversary detection
- CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514),
  add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections
- Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection,
  pushes metrics to Pushgateway)
- Grafana: add "Network Traffic & Adversary Detection" dashboard
  (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics)

pfSense changes applied live: syslog forwarding to 10.0.20.202:30514,
Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled
2026-03-23 03:06:56 +02:00
Viktor Barzin
877cd15b45 fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
2026-03-23 03:04:33 +02:00
Viktor Barzin
e9311915cb add agent route to k8s-portal 2026-03-23 02:25:08 +02:00
Viktor Barzin
6bfade3013 update infra stack terraform lock file (helm/kubernetes/vault providers) 2026-03-23 02:24:47 +02:00
Viktor Barzin
2dcdc65db5 add weekly SQLite backup for plotting-book to NFS 2026-03-23 02:24:43 +02:00
Viktor Barzin
e4cf0dee83 add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach
2026-03-23 02:24:39 +02:00
Viktor Barzin
e463281205 optimize backup schedules: compress dumps, stagger to weekly, extend retention
- dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed
- infra-maintenance: etcd backup daily→weekly Sunday 1am
- redis: backup hourly→weekly Sunday 3am, retention 7→28 days
- vault: raft backup daily→weekly Sunday 2am
2026-03-23 02:24:34 +02:00
Viktor Barzin
644562454c add IPv6 connectivity via Hurricane Electric 6in4 tunnel
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
  socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules
2026-03-23 02:22:00 +02:00
Viktor Barzin
1f4e8cb278 use registry.viktorbarzin.me hostname for private images + protect ingress
- Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me
- Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template
  (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip)
- Enable Authentik protection on priority-pass ingress
2026-03-23 01:02:27 +02:00
Viktor Barzin
e9919d8fc9 fix priority-pass: bump backend memory to 512Mi (OOM with OpenCV) 2026-03-23 00:58:39 +02:00
Viktor Barzin
0674d6e538 deploy priority-pass app to cluster via private registry
- SvelteKit frontend + FastAPI backend in single pod with sidecar pattern
- Images pushed to 10.0.20.10:5050 private registry (v4/v1)
- SvelteKit server route proxies /api/transform to backend on 127.0.0.1:8000
- Exposed at priority-pass.viktorbarzin.me (Cloudflare-proxied, no auth)
- Uses imagePullSecrets for authenticated registry pulls
2026-03-23 00:55:41 +02:00
Viktor Barzin
311ff5dd9e add hourly SQLite integrity check for vaultwarden with Prometheus alerting
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning)
- Prevents running for days on a corrupted DB unnoticed
2026-03-23 00:50:15 +02:00
Viktor Barzin
3b89a7d7e4 add VaultwardenDown alert and tighten backup staleness threshold
- Add dedicated VaultwardenDown Prometheus alert (critical, 5m)
- Reduce backup staleness threshold from 8d to 24h to match 6h schedule
- Fixes monitoring gap where VW downtime went undetected
2026-03-23 00:47:00 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
ab7e18c07c fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify
- Grant kyverno-admission-controller and kyverno-background-controller
  permissions to manage Secrets (required for generate clone rules)
- Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true
  (wildcard cert doesn't cover IP SANs) — applied to all nodes + template
2026-03-22 23:47:29 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
2026-03-22 22:10:10 +02:00
Viktor Barzin
e4f478b490 switch claude-memory server to multi-user API_KEYS auth
Enables isolated memory namespaces per user (wizard/emo) by switching
from single API_KEY to API_KEYS JSON map env var.
2026-03-22 20:08:07 +02:00
Viktor Barzin
c103a1ee05 fix OOMKilled containers: bump immich/actualbudget memory, disable changedetection, cap clickhouse
- immich-server: 512Mi/1Gi → 1700Mi/1700Mi (VPA upperBound 1.39Gi, 34 OOM restarts)
- actualbudget http-api: 384Mi → 768Mi (VPA upperBound 615Mi, 3 OOM restarts)
- changedetection: replicas 1 → 0 (chronic OOM at 64Mi, not worth memory cost)
- rybbit clickhouse: add ConfigMap capping max_server_memory_usage to 800Mi (within 1Gi limit)
2026-03-22 15:22:29 +02:00
Viktor Barzin
ad689076d8 scale down non-critical services to free cluster memory
- authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1
- tuya-bridge: 3→1
- realestate-crawler-api: 2→1
- claude-memory: 2→1
- grafana: 2→1 (config only, apply pending)
- alertmanager: 2→1 (config only, apply pending)

Estimated savings: ~1.2 Gi total
2026-03-22 03:10:12 +02:00
Viktor Barzin
bd98b84ded scale grafana and alertmanager to 1 replica to free cluster memory
Grafana: 2 → 1 (saves ~312 Mi)
Alertmanager: 2 → 1 (saves ~150 Mi)
Matrix already scaled to 0 (saves ~212 Mi)
2026-03-22 03:02:17 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
2e016d7df2 fix nextcloud db-username + k8s-dashboard chart repo
- nextcloud: add db-username to ESO secret template and usernameKey
  to chart values (required by newer chart version)
- k8s-dashboard: update chart repo URL to kubernetes-retired.github.io
  (old kubernetes.github.io/dashboard returns 404)
2026-03-22 02:50:48 +02:00
Viktor Barzin
3d22599f7f fix infra stack: use overwrite strategy for provider generation
The child generate block needs if_exists="overwrite" to properly
override the root terragrunt's k8s_providers block.
2026-03-22 01:28:25 +02:00
Viktor Barzin
728fbcd3bd fix infra stack: add vault provider to terragrunt generate block
The infra stack's provider override only included proxmox but main.tf
uses data "vault_kv_secret_v2" which requires the vault provider.
2026-03-22 01:17:00 +02:00
Viktor Barzin
1d1549b8af state(descheduler): update encrypted state 2026-03-21 15:13:02 +00:00
Viktor Barzin
21cfa8c072 bump memory limits for OOM-prone services
FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi)
Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi)
n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi)
Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi)
Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)
2026-03-21 11:12:12 +00:00
Viktor Barzin
b3c9c45a17 multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip]
- Template: bump default memory from 128Mi to 256Mi (matches deploy-app skill guidance)
- ResourceQuota: add requests.storage (20Gi) and persistentvolumeclaims (5) defaults
- CONTRIBUTING.md: agent-friendly contributor guide for namespace-owners
2026-03-19 23:49:15 +00:00
Viktor Barzin
6b8ce04d44 fix(openclaw): change agent workspace from /workspace/infra to /workspace
Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.
2026-03-19 23:32:28 +00:00
Viktor Barzin
e823b795f7 fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory
- Add docker.io/library/ prefix to mysql and postgres backup images
  to satisfy Kyverno require-trusted-registries policy (both CronJobs
  were blocked for 46h, triggering MySQLBackupStale alert)
- Document mysql-operator chart ignoring resources values key — the
  LimitRange default (256Mi) was silently applied, putting the operator
  at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
  (also fixed NFS export: vault-backup was a separate ZFS dataset not
  in the TrueNAS NFS share — destroyed dataset, created directory)
2026-03-19 23:26:05 +00:00
Viktor Barzin
250a058627 feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
d95144bd05 fix(immich): bump postgres memory 512Mi → 1Gi for v2.6.1 geodata migration
v2.6.1 bulk-inserts into geodata_places on first boot, OOM-killing
postgres at 512Mi. Raise to 1Gi to accommodate the migration.
2026-03-19 22:50:36 +00:00
Viktor Barzin
da630b8869 upgrade immich v2.5.6 → v2.6.1 2026-03-19 22:45:04 +00:00
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
Viktor Barzin
e54bc016ba reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)
2026-03-19 20:25:36 +00:00
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
01eb9dd121 fix(monitoring): patch idrac-redfish-exporter to restore PSU voltage metric
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
  mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)

Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 13:37:14 +00:00