infra

Author	SHA1	Message	Date
Viktor Barzin	6a2bee93b5	fix(monitoring): use patched idrac exporter with PSU input voltage metric The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC never emits idrac_power_supply_input_voltage. Switch to the patched viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.	2026-03-23 22:07:36 +02:00
Viktor Barzin	b6bc51b42b	state(platform): update encrypted state	2026-03-23 22:04:06 +02:00
Viktor Barzin	a95d434ff1	fix backup IO stats: use /proc/$$/io instead of /proc/self/io /proc/self/io inside $(awk ...) resolves to the awk subprocess PID, not the parent bash shell. Use $$ (bash PID) to read the correct process IO counters.	2026-03-23 12:33:52 +02:00
Viktor Barzin	0a294a30a6	add backup IO logging, Pushgateway metrics, and Grafana dashboard - Add /proc/self/io read/write tracking to vault raft-backup and etcd backup - Push backup_duration_seconds, backup_read_bytes, backup_written_bytes, backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs (etcd skipped — distroless image has no wget/curl) - Add cloudsync_duration_seconds metric to cloudsync-monitor - New "Backup Health" Grafana dashboard with 8 panels: time since last backup, overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule	2026-03-23 12:19:01 +02:00
Viktor Barzin	0b595751c5	move Frigate cache to tmpfs to eliminate disk writes on node1 Add 512Mi tmpfs emptyDir for /tmp/cache — Frigate writes 10s MP4 segments here continuously for all cameras. With motion-only retention, segments without events are deleted immediately anyway, so losing them on pod restart is acceptable. Node1 disk writes: 3.55 MB/s → 2.08 MB/s (previous commit) → 96 KB/s (now)	2026-03-23 11:52:49 +02:00
Viktor Barzin	2855da2a3c	state(frigate): update encrypted state	2026-03-23 11:49:40 +02:00
Viktor Barzin	3f0ecda737	harden pull-through cache: intercept errors, reduce lock timeout, add healthz - Add proxy_intercept_errors + error_page for 502/503/504 on blob locations to prevent caching truncated upstream responses (root cause of repeated ImagePullBackOff across services) - Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd retry instead of all concurrent pulls waiting on a failed first download - Add proxy_cache_valid any 0 — never cache error responses - Add /healthz endpoints on Docker Hub and GHCR servers - Add draintimeout and proxy.ttl to registry proxy configs	2026-03-23 11:33:06 +02:00
Viktor Barzin	1639910043	ingress latency: add histogram buckets, fix restarts, right-size memory - Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99 - Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts - Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi) - Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi) - ActualBudget: add explicit resources to prevent silent LimitRange defaults - Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)	2026-03-23 10:52:43 +02:00
Viktor Barzin	5652972c53	fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric - Added refId to all targets (required by Grafana) - Added explicit panel IDs for stable references - Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't exist, use cs_lapi_route_requests_total instead - Added drawStyle/showPoints to all timeseries panels - Updated via MySQL + ConfigMap + Grafana restart	2026-03-23 10:31:44 +02:00
Viktor Barzin	45d48e7ce7	state(headscale): update encrypted state	2026-03-23 10:27:04 +02:00
Viktor Barzin	9527f62c2e	fix network traffic dashboard: use only available GoFlow2 metrics GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total, process_nf_total, delay_seconds) — no per-source/dest labels. Removed panels referencing non-existent src_addr/dst_port labels. Replaced with flowset records by type, separated bytes and flows into own panels to avoid scale issues.	2026-03-23 10:16:46 +02:00
Viktor Barzin	9db2714393	state(crowdsec): update encrypted state	2026-03-23 03:41:33 +02:00
Viktor Barzin	d401568317	fix CrowdSec collection names and increase Helm timeout - Fix: crowdsecurity/pf → crowdsecurity/pfsense + firewallservices/pf - Move syslog acquisition to custom ConfigMap (Helm schema validation) - Increase Helm timeout to 1200s for DaemonSet rollout	2026-03-23 03:41:13 +02:00
Viktor Barzin	850f73ab4d	state(crowdsec): update encrypted state	2026-03-23 03:40:22 +02:00
Viktor Barzin	aa8695278b	state(crowdsec): update encrypted state	2026-03-23 03:38:00 +02:00
Viktor Barzin	44a9d19085	state(actualbudget): update encrypted state	2026-03-23 03:12:20 +02:00
Viktor Barzin	449884c537	state(forgejo): update encrypted state	2026-03-23 03:12:08 +02:00
Viktor Barzin	dd8b4b3775	state(novelapp): update encrypted state	2026-03-23 03:11:35 +02:00
Viktor Barzin	fbd0cc675b	state(calibre): update encrypted state	2026-03-23 03:11:03 +02:00
Viktor Barzin	55246c8b5d	add network traffic monitoring and adversary detection - CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514), add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections - Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection, pushes metrics to Pushgateway) - Grafana: add "Network Traffic & Adversary Detection" dashboard (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics) pfSense changes applied live: syslog forwarding to 10.0.20.202:30514, Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled	2026-03-23 03:06:56 +02:00
Viktor Barzin	877cd15b45	fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert - Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich ML pods scheduling headroom (was at 96% utilization) - Add critical NvidiaExporterDown Prometheus alert that fires when GPU metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)	2026-03-23 03:04:33 +02:00
Viktor Barzin	20d0404a42	state(headscale): update encrypted state	2026-03-23 03:02:50 +02:00
Viktor Barzin	2639456978	state(platform): update encrypted state	2026-03-23 02:56:29 +02:00
Viktor Barzin	65fcd68181	state(headscale): update encrypted state	2026-03-23 02:48:28 +02:00
Viktor Barzin	2e8fbb51bf	state(headscale): update encrypted state	2026-03-23 02:37:22 +02:00
Viktor Barzin	6bfb5e3285	state(headscale): update encrypted state	2026-03-23 02:29:03 +02:00
Viktor Barzin	a1b4a2106d	add NFS/CSI cascade failure post-mortem [ci skip]	2026-03-23 02:25:11 +02:00
Viktor Barzin	e9311915cb	add agent route to k8s-portal	2026-03-23 02:25:08 +02:00
Viktor Barzin	6bfade3013	update infra stack terraform lock file (helm/kubernetes/vault providers)	2026-03-23 02:24:47 +02:00
Viktor Barzin	2dcdc65db5	add weekly SQLite backup for plotting-book to NFS	2026-03-23 02:24:43 +02:00
Viktor Barzin	e4cf0dee83	add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout - New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway - Increase Prometheus Helm timeout to 900s for slow iSCSI reattach	2026-03-23 02:24:39 +02:00
Viktor Barzin	e463281205	optimize backup schedules: compress dumps, stagger to weekly, extend retention - dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed - infra-maintenance: etcd backup daily→weekly Sunday 1am - redis: backup hourly→weekly Sunday 3am, retention 7→28 days - vault: raft backup daily→weekly Sunday 2am	2026-03-23 02:24:34 +02:00
Viktor Barzin	6e661fdfc5	add backup & DR strategy documentation with ASCII diagrams Covers all 3 protection layers (ZFS snapshots, app-level backups, offsite sync), the hybrid cloud sync architecture, iSCSI hardening, monitoring alerts, and service protection matrix.	2026-03-23 02:24:02 +02:00
Viktor Barzin	644562454c	add IPv6 connectivity via Hurricane Electric 6in4 tunnel - Add public_ipv6 variable and AAAA records for all 34 non-proxied services - Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel) - Update SPF record with current IPv4/IPv6 addresses - Add AAAA update support to Technitium DNS updater CLI - Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT - pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6), socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules	2026-03-23 02:22:00 +02:00
Viktor Barzin	813f523170	docs: add private registry usage to infra CLAUDE.md [ci skip]	2026-03-23 01:08:57 +02:00
Viktor Barzin	1f4e8cb278	use registry.viktorbarzin.me hostname for private images + protect ingress - Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me - Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip) - Enable Authentik protection on priority-pass ingress	2026-03-23 01:02:27 +02:00
Viktor Barzin	e9919d8fc9	fix priority-pass: bump backend memory to 512Mi (OOM with OpenCV)	2026-03-23 00:58:39 +02:00
Viktor Barzin	0674d6e538	deploy priority-pass app to cluster via private registry - SvelteKit frontend + FastAPI backend in single pod with sidecar pattern - Images pushed to 10.0.20.10:5050 private registry (v4/v1) - SvelteKit server route proxies /api/transform to backend on 127.0.0.1:8000 - Exposed at priority-pass.viktorbarzin.me (Cloudflare-proxied, no auth) - Uses imagePullSecrets for authenticated registry pulls	2026-03-23 00:55:41 +02:00
Viktor Barzin	d78be951b3	state(vaultwarden): update encrypted state	2026-03-23 00:51:56 +02:00
Viktor Barzin	311ff5dd9e	add hourly SQLite integrity check for vaultwarden with Prometheus alerting - New CronJob runs PRAGMA integrity_check every hour - Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway - VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical) - VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning) - Prevents running for days on a corrupted DB unnoticed	2026-03-23 00:50:15 +02:00
Viktor Barzin	3b89a7d7e4	add VaultwardenDown alert and tighten backup staleness threshold - Add dedicated VaultwardenDown Prometheus alert (critical, 5m) - Reduce backup staleness threshold from 8d to 24h to match 6h schedule - Fixes monitoring gap where VW downtime went undetected	2026-03-23 00:47:00 +02:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 */6 * * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	469fcb12b5	remove duplicate deploy-app skill, now global agent [ci skip]	2026-03-23 00:17:53 +02:00
Viktor Barzin	ab7e18c07c	fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify - Grant kyverno-admission-controller and kyverno-background-controller permissions to manage Secrets (required for generate clone rules) - Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true (wildcard cert doesn't cover IP SANs) — applied to all nodes + template	2026-03-22 23:47:29 +02:00
Viktor Barzin	c111799831	remove duplicated agents, update CLAUDE.md references [ci skip] All agents now live globally in ~/.claude/agents/ (shared via dotfiles). Deleted 11 duplicates, moved sev-*/deploy-app to global scope.	2026-03-22 23:44:27 +02:00
Viktor Barzin	36171bcda4	add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me - Add auth.htpasswd section to config-private.yml - Mount htpasswd file in registry-private container, fix healthcheck for 401 - Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me - Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body) - Add docker to cloudflare_proxied_names (registry stays non-proxied) - Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces - Update infra provisioning to install apache2-utils and generate htpasswd from Vault	2026-03-22 22:10:10 +02:00
Viktor Barzin	e4f478b490	switch claude-memory server to multi-user API_KEYS auth Enables isolated memory namespaces per user (wizard/emo) by switching from single API_KEY to API_KEYS JSON map env var.	2026-03-22 20:08:07 +02:00
Viktor Barzin	a53b9438eb	state(claude-memory): update encrypted state	2026-03-22 20:01:00 +02:00
Viktor Barzin	29aa6c95f0	state(redis): update encrypted state	2026-03-22 15:23:58 +02:00
Viktor Barzin	c103a1ee05	fix OOMKilled containers: bump immich/actualbudget memory, disable changedetection, cap clickhouse - immich-server: 512Mi/1Gi → 1700Mi/1700Mi (VPA upperBound 1.39Gi, 34 OOM restarts) - actualbudget http-api: 384Mi → 768Mi (VPA upperBound 615Mi, 3 OOM restarts) - changedetection: replicas 1 → 0 (chronic OOM at 64Mi, not worth memory cost) - rybbit clickhouse: add ConfigMap capping max_server_memory_usage to 800Mi (within 1Gi limit)	2026-03-22 15:22:29 +02:00
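
Commits 0a294a30a6 and a95d434ff1 describe reading /proc/<pid>/io counters around a backup and pushing the deltas to Pushgateway. The following is a minimal Python sketch of that idea, not the repository's actual script (which the messages indicate is bash). The metric names come from the commit message; the function names and the sample counter values are hypothetical, and the output is assumed to be plain Prometheus text exposition lines suitable as a Pushgateway request body.

```python
def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank or malformed lines
            counters[key.strip()] = int(value.strip())
    return counters


def render_backup_metrics(
    duration_s: float,
    before: dict[str, int],
    after: dict[str, int],
    finished_at: int,
) -> str:
    """Render backup metrics as Prometheus text exposition lines.

    `before` and `after` are snapshots of the SAME process's io counters
    (the parent shell in the commit's case, not a short-lived subprocess).
    """
    read_delta = after["read_bytes"] - before["read_bytes"]
    written_delta = after["write_bytes"] - before["write_bytes"]
    lines = [
        f"backup_duration_seconds {duration_s}",
        f"backup_read_bytes {read_delta}",
        f"backup_written_bytes {written_delta}",
        f"backup_last_success_timestamp {finished_at}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical snapshots; a real job would read /proc/<pid>/io twice.
    start = parse_proc_io("read_bytes: 4096\nwrite_bytes: 0\n")
    end = parse_proc_io("read_bytes: 8192\nwrite_bytes: 1024\n")
    print(render_backup_metrics(12.5, start, end, 1700000000), end="")
```

Taking both snapshots from one fixed PID mirrors the fix in a95d434ff1: reading /proc/self/io from inside a command substitution reports the counters of the subprocess doing the read, so the delta would not reflect the backup itself.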


1935 commits