Commit graph

1950 commits

Author SHA1 Message Date
Viktor Barzin
dbff547741 remove docs/backup-strategy.md, absorbed into architecture/backup-dr.md [ci skip] 2026-03-24 01:08:06 +02:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
Viktor Barzin
31767ed8e7 state(headscale): update encrypted state 2026-03-24 00:03:03 +02:00
Viktor Barzin
2adf68ae03 state(platform): update encrypted state 2026-03-23 23:48:38 +02:00
Viktor Barzin
28f349a8f6 state(servarr): update encrypted state 2026-03-23 23:46:08 +02:00
Viktor Barzin
d9eaf42f36 exclude iDRAC from HighServiceLatency alert
iDRAC Redfish exporter is inherently slow, causing noisy alerts.
2026-03-23 22:51:42 +02:00
root
eeae58861b Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:38:38 +00:00
Viktor Barzin
3bca7a97c2 fix(renew-tls): update TLS secret in ALL namespaces, not just kyverno
Kyverno generate+synchronize only manages secrets it created itself.
Existing Terraform-managed secrets in ~70 namespaces weren't updated.
Now loops through all namespaces and kubectl apply the new cert.
2026-03-23 22:36:31 +02:00
root
dadbec0eb4 Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:34:36 +00:00
Viktor Barzin
2dcb4b7fa4 fix(renew-tls): clean stale _acme-challenge TXT records before certbot
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
2026-03-23 22:32:27 +02:00
Viktor Barzin
b7409cea4e fix(renew-tls): use alpine+curl for kubectl step to avoid permission denied
bitnami/kubectl runs as non-root UID 1001, cannot read git-crypt
decrypted secrets owned by root. Switch to alpine (runs as root)
with kubectl downloaded directly.
2026-03-23 22:28:56 +02:00
root
b5dd43aeab Woodpecker CI Update TLS Certificates Commit 2026-03-23 20:27:00 +00:00
Viktor Barzin
304f0de43a add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.
2026-03-23 22:24:17 +02:00
Viktor Barzin
0c307f4d3d state(kyverno): update encrypted state 2026-03-23 22:20:18 +02:00
Viktor Barzin
16cde1eab5 add Kyverno TLS secret sync + enhance renewal pipeline
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
2026-03-23 22:19:34 +02:00
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
b6bc51b42b state(platform): update encrypted state 2026-03-23 22:04:06 +02:00
Viktor Barzin
a95d434ff1 fix backup IO stats: use /proc/$$/io instead of /proc/self/io
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
2026-03-23 12:33:52 +02:00
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
Viktor Barzin
0b595751c5 move Frigate cache to tmpfs to eliminate disk writes on node1
Add 512Mi tmpfs emptyDir for /tmp/cache — Frigate writes 10s MP4
segments here continuously for all cameras. With motion-only retention,
segments without events are deleted immediately anyway, so losing them
on pod restart is acceptable.

Node1 disk writes: 3.55 MB/s → 2.08 MB/s (previous commit) → 96 KB/s (now)
2026-03-23 11:52:49 +02:00
Viktor Barzin
2855da2a3c state(frigate): update encrypted state 2026-03-23 11:49:40 +02:00
Viktor Barzin
3f0ecda737 harden pull-through cache: intercept errors, reduce lock timeout, add healthz
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
  to prevent caching truncated upstream responses (root cause of repeated
  ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
  retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
2026-03-23 11:33:06 +02:00
Viktor Barzin
1639910043 ingress latency: add histogram buckets, fix restarts, right-size memory
- Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99
- Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts
- Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi)
- Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi)
- ActualBudget: add explicit resources to prevent silent LimitRange defaults
- Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)
2026-03-23 10:52:43 +02:00
Viktor Barzin
5652972c53 fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
  exist, use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart
2026-03-23 10:31:44 +02:00
Viktor Barzin
45d48e7ce7 state(headscale): update encrypted state 2026-03-23 10:27:04 +02:00
Viktor Barzin
9527f62c2e fix network traffic dashboard: use only available GoFlow2 metrics
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) — no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced with flowset records by type, separated bytes and flows
into own panels to avoid scale issues.
2026-03-23 10:16:46 +02:00
Viktor Barzin
9db2714393 state(crowdsec): update encrypted state 2026-03-23 03:41:33 +02:00
Viktor Barzin
d401568317 fix CrowdSec collection names and increase Helm timeout
- Fix: crowdsecurity/pf → crowdsecurity/pfsense + firewallservices/pf
- Move syslog acquisition to custom ConfigMap (Helm schema validation)
- Increase Helm timeout to 1200s for DaemonSet rollout
2026-03-23 03:41:13 +02:00
Viktor Barzin
850f73ab4d state(crowdsec): update encrypted state 2026-03-23 03:40:22 +02:00
Viktor Barzin
aa8695278b state(crowdsec): update encrypted state 2026-03-23 03:38:00 +02:00
Viktor Barzin
44a9d19085 state(actualbudget): update encrypted state 2026-03-23 03:12:20 +02:00
Viktor Barzin
449884c537 state(forgejo): update encrypted state 2026-03-23 03:12:08 +02:00
Viktor Barzin
dd8b4b3775 state(novelapp): update encrypted state 2026-03-23 03:11:35 +02:00
Viktor Barzin
fbd0cc675b state(calibre): update encrypted state 2026-03-23 03:11:03 +02:00
Viktor Barzin
55246c8b5d add network traffic monitoring and adversary detection
- CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514),
  add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections
- Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection,
  pushes metrics to Pushgateway)
- Grafana: add "Network Traffic & Adversary Detection" dashboard
  (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics)

pfSense changes applied live: syslog forwarding to 10.0.20.202:30514,
Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled
2026-03-23 03:06:56 +02:00
Viktor Barzin
877cd15b45 fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
2026-03-23 03:04:33 +02:00
Viktor Barzin
20d0404a42 state(headscale): update encrypted state 2026-03-23 03:02:50 +02:00
Viktor Barzin
2639456978 state(platform): update encrypted state 2026-03-23 02:56:29 +02:00
Viktor Barzin
65fcd68181 state(headscale): update encrypted state 2026-03-23 02:48:28 +02:00
Viktor Barzin
2e8fbb51bf state(headscale): update encrypted state 2026-03-23 02:37:22 +02:00
Viktor Barzin
6bfb5e3285 state(headscale): update encrypted state 2026-03-23 02:29:03 +02:00
Viktor Barzin
a1b4a2106d add NFS/CSI cascade failure post-mortem [ci skip] 2026-03-23 02:25:11 +02:00
Viktor Barzin
e9311915cb add agent route to k8s-portal 2026-03-23 02:25:08 +02:00
Viktor Barzin
6bfade3013 update infra stack terraform lock file (helm/kubernetes/vault providers) 2026-03-23 02:24:47 +02:00
Viktor Barzin
2dcdc65db5 add weekly SQLite backup for plotting-book to NFS 2026-03-23 02:24:43 +02:00
Viktor Barzin
e4cf0dee83 add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach
2026-03-23 02:24:39 +02:00
Viktor Barzin
e463281205 optimize backup schedules: compress dumps, stagger to weekly, extend retention
- dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed
- infra-maintenance: etcd backup daily→weekly Sunday 1am
- redis: backup hourly→weekly Sunday 3am, retention 7→28 days
- vault: raft backup daily→weekly Sunday 2am
2026-03-23 02:24:34 +02:00
Viktor Barzin
6e661fdfc5 add backup & DR strategy documentation with ASCII diagrams
Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
2026-03-23 02:24:02 +02:00
Viktor Barzin
644562454c add IPv6 connectivity via Hurricane Electric 6in4 tunnel
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
  socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules
2026-03-23 02:22:00 +02:00
Viktor Barzin
813f523170 docs: add private registry usage to infra CLAUDE.md [ci skip] 2026-03-23 01:08:57 +02:00