Commit graph

1974 commits

Author SHA1 Message Date
Viktor Barzin
6cdee231cd state(shadowsocks): update encrypted state 2026-03-24 18:08:04 +02:00
Viktor Barzin
842e870971 state(headscale): update encrypted state 2026-03-24 18:08:02 +02:00
Viktor Barzin
33037eba46 upgrade MetalLB v0.10.2 → v0.15.3 and update annotations
- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
2026-03-24 17:24:05 +02:00
Viktor Barzin
957f13dfd6 state(headscale): update encrypted state 2026-03-24 17:23:34 +02:00
Viktor Barzin
7478f545e0 state(metallb): update encrypted state 2026-03-24 17:23:18 +02:00
Viktor Barzin
dd46252d17 state(metallb): update encrypted state 2026-03-24 17:23:01 +02:00
Viktor Barzin
7ef390f14e state(metallb): update encrypted state 2026-03-24 17:22:53 +02:00
Viktor Barzin
1defd711fe state(metallb): update encrypted state 2026-03-24 17:15:06 +02:00
Viktor Barzin
793490eaf4 state(metallb): update encrypted state 2026-03-24 17:14:14 +02:00
Viktor Barzin
d079666d34 state(metallb): update encrypted state 2026-03-24 17:11:26 +02:00
Viktor Barzin
b68f778c5a state(headscale): update encrypted state 2026-03-24 16:47:26 +02:00
Viktor Barzin
3ecb792a44 state(headscale): update encrypted state 2026-03-24 15:30:25 +02:00
Viktor Barzin
0ee6cade38 state(headscale): update encrypted state 2026-03-24 15:12:01 +02:00
Viktor Barzin
a644eb1c8e headscale: add STUN port, upgrade to 0.28.0, fix Home DERP connectivity
- Expose STUN port 3479/UDP on container and LoadBalancer service
- Upgrade headscale from 0.23.0 to 0.28.0
- Vault config updated: auto DERP region with ipv4 field, ISP router
  port forward for UDP 3479 added

Home DERP now shows ~3ms latency and is selected as nearest relay.
2026-03-24 14:51:09 +02:00
Viktor Barzin
fafea4b110 state(headscale): update encrypted state 2026-03-24 14:45:31 +02:00
Viktor Barzin
2cbcf00b8e state(headscale): update encrypted state 2026-03-24 14:36:30 +02:00
Viktor Barzin
20b0d564f1 state(headscale): update encrypted state 2026-03-24 14:32:12 +02:00
Viktor Barzin
78f302d6c0 state(headscale): update encrypted state 2026-03-24 14:30:02 +02:00
Viktor Barzin
d2c50be088 state(headscale): update encrypted state 2026-03-24 12:49:23 +02:00
Viktor Barzin
5161f77118 state(headscale): update encrypted state 2026-03-24 12:05:34 +02:00
Viktor Barzin
4aa0e97e1d remove terraform.tfvars from terragrunt loading — complete Vault migration
All 148 secret variables were migrated to Vault KV / SOPS / ESO.
The legacy terraform.tfvars silently overrode config.tfvars values
(e.g. stale postgresql_host), creating override risk. [ci skip]
2026-03-24 11:14:06 +02:00
Viktor Barzin
540d7de807 add wealthfolio-sync CronJob for automated portfolio sync
Monthly CronJob (1st at 08:00 UTC) syncs trades from Schwab, Trading 212,
and InvestEngine into Wealthfolio SQLite DB. Added Kyverno ndots lifecycle
ignore. Removed stale manual sync comment.
2026-03-24 02:07:36 +02:00
Viktor Barzin
5d12f92816 state(wealthfolio): update encrypted state 2026-03-24 02:07:17 +02:00
Viktor Barzin
4ca7af8818 add audiobook-search service to servarr stack
- New audiobook-search deployment + service + ingress (Authentik-protected)
- qBittorrent: add NFS mount for /audiobooks (shared with Audiobookshelf)
- Cloudflare DNS: add audiobook-search.viktorbarzin.me
- Env vars: QBITTORRENT_URL/PASS, AUDIOBOOKSHELF_URL/TOKEN from ESO
2026-03-24 01:21:49 +02:00
Viktor Barzin
dbff547741 remove docs/backup-strategy.md, absorbed into architecture/backup-dr.md [ci skip] 2026-03-24 01:08:06 +02:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
Viktor Barzin
31767ed8e7 state(headscale): update encrypted state 2026-03-24 00:03:03 +02:00
Viktor Barzin
2adf68ae03 state(platform): update encrypted state 2026-03-23 23:48:38 +02:00
Viktor Barzin
28f349a8f6 state(servarr): update encrypted state 2026-03-23 23:46:08 +02:00
Viktor Barzin
d9eaf42f36 exclude iDRAC from HighServiceLatency alert
iDRAC Redfish exporter is inherently slow, causing noisy alerts.
2026-03-23 22:51:42 +02:00
root
eeae58861b Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:38:38 +00:00
Viktor Barzin
3bca7a97c2 fix(renew-tls): update TLS secret in ALL namespaces, not just kyverno
Kyverno generate+synchronize only manages secrets it created itself.
Existing Terraform-managed secrets in ~70 namespaces weren't updated.
The pipeline now loops through all namespaces and applies the new cert with kubectl apply.
2026-03-23 22:36:31 +02:00
root
dadbec0eb4 Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:34:36 +00:00
Viktor Barzin
2dcb4b7fa4 fix(renew-tls): clean stale _acme-challenge TXT records before certbot
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
2026-03-23 22:32:27 +02:00
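
The cleanup step described in this commit amounts to picking the challenge records out of a zone's record list before certbot runs. A minimal Python sketch of that selection follows. It assumes each record is a dict with `id`, `type`, and `name` keys (the shape Cloudflare's DNS record objects use). The function name `stale_challenge_ids` and the example.com names are hypothetical, not taken from the pipeline itself.

```python
def stale_challenge_ids(records: list[dict]) -> list[str]:
    """Return the IDs of TXT records whose name starts with _acme-challenge.

    Assumption: every record dict carries 'id', 'type', and 'name' keys.
    """
    return [
        record["id"]
        for record in records
        if record.get("type") == "TXT"
        and record.get("name", "").startswith("_acme-challenge")
    ]


# Hypothetical zone contents: two leftover challenges and two unrelated records.
records = [
    {"id": "a1", "type": "TXT", "name": "_acme-challenge.example.com"},
    {"id": "b2", "type": "TXT", "name": "example.com"},
    {"id": "c3", "type": "A", "name": "www.example.com"},
    {"id": "d4", "type": "TXT", "name": "_acme-challenge.sub.example.com"},
]
to_delete = stale_challenge_ids(records)
```

Only the returned IDs would then be passed to whatever delete call the pipeline uses, so unrelated TXT records such as SPF entries are left alone.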
Viktor Barzin
b7409cea4e fix(renew-tls): use alpine+curl for kubectl step to avoid permission denied
bitnami/kubectl runs as non-root UID 1001, cannot read git-crypt
decrypted secrets owned by root. Switch to alpine (runs as root)
with kubectl downloaded directly.
2026-03-23 22:28:56 +02:00
root
b5dd43aeab Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:27:00 +00:00
Viktor Barzin
304f0de43a add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.
2026-03-23 22:24:17 +02:00
Viktor Barzin
0c307f4d3d state(kyverno): update encrypted state 2026-03-23 22:20:18 +02:00
Viktor Barzin
16cde1eab5 add Kyverno TLS secret sync + enhance renewal pipeline
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
2026-03-23 22:19:34 +02:00
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
b6bc51b42b state(platform): update encrypted state 2026-03-23 22:04:06 +02:00
Viktor Barzin
a95d434ff1 fix backup IO stats: use /proc/$$/io instead of /proc/self/io
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
2026-03-23 12:33:52 +02:00
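
The fix matters because `/proc/self` always refers to whichever process opens the file, so the counters have to be read for the PID that is being measured. The measurement itself is a difference between two snapshots of that PID's `io` file. A minimal Python sketch, assuming the usual `key: value` layout of `/proc/<pid>/io`; `parse_proc_io` and `io_delta` are hypothetical helper names, not the functions used in the backup scripts.

```python
def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines that are not 'name: <non-negative integer>'.
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value.strip())
    return counters


def io_delta(before: str, after: str) -> dict[str, int]:
    """Return bytes read and written between two snapshots of the same PID."""
    start, end = parse_proc_io(before), parse_proc_io(after)
    return {
        "read_bytes": end["read_bytes"] - start["read_bytes"],
        "written_bytes": end["write_bytes"] - start["write_bytes"],
    }
```

Both snapshots must come from the same PID. Mixing a snapshot of the parent shell with one of a short-lived subprocess is the bug this commit removes.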
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
Viktor Barzin
0b595751c5 move Frigate cache to tmpfs to eliminate disk writes on node1
Add 512Mi tmpfs emptyDir for /tmp/cache — Frigate writes 10s MP4
segments here continuously for all cameras. With motion-only retention,
segments without events are deleted immediately anyway, so losing them
on pod restart is acceptable.

Node1 disk writes: 3.55 MB/s → 2.08 MB/s (previous commit) → 96 KB/s (now)
2026-03-23 11:52:49 +02:00
Viktor Barzin
2855da2a3c state(frigate): update encrypted state 2026-03-23 11:49:40 +02:00
Viktor Barzin
3f0ecda737 harden pull-through cache: intercept errors, reduce lock timeout, add healthz
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
  to prevent caching truncated upstream responses (root cause of repeated
  ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
  retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
2026-03-23 11:33:06 +02:00
Viktor Barzin
1639910043 ingress latency: add histogram buckets, fix restarts, right-size memory
- Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99
- Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts
- Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi)
- Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi)
- ActualBudget: add explicit resources to prevent silent LimitRange defaults
- Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)
2026-03-23 10:52:43 +02:00
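
Why finer buckets give more meaningful P50/P99 values: a quantile read from a Prometheus histogram is interpolated linearly inside the bucket that contains it, so a wide bucket yields a coarse estimate. The Python sketch below is a simplified version of that interpolation (no `+Inf` bucket handling). The function name and the bucket bounds in the example are illustrative assumptions, not the values configured for Traefik.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound. Observations are assumed to be
    spread evenly inside each bucket (linear interpolation).
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if count == lower_count:
                # Nothing to interpolate over (empty histogram or q == 0).
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# The same 100 hypothetical requests, recorded with coarse and fine buckets.
coarse = [(0.1, 0), (1.0, 100)]
fine = [(0.1, 0), (0.2, 90), (0.5, 99), (1.0, 100)]
p50_coarse = histogram_quantile(0.5, coarse)
p50_fine = histogram_quantile(0.5, fine)
```

With the coarse buckets the median is estimated at about 0.55 s, because all that is known is that every request fell somewhere between 0.1 s and 1.0 s. With the finer buckets the estimate drops to about 0.16 s, which is much closer to where most of the requests actually landed.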
Viktor Barzin
5652972c53 fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
  exist, use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart
2026-03-23 10:31:44 +02:00
Viktor Barzin
45d48e7ce7 state(headscale): update encrypted state 2026-03-23 10:27:04 +02:00
Viktor Barzin
9527f62c2e fix network traffic dashboard: use only available GoFlow2 metrics
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) — no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced them with flowset records by type, and separated bytes and flows
into their own panels to avoid scale issues.
2026-03-23 10:16:46 +02:00