infra

Viktor Barzin 150f196095	2026-04-19 15:23:05 +00:00

[redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts

Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet; HAProxy still
points at redis-node-{0,1}.

Architecture:
- 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
- podManagementPolicy=Parallel + init container that writes fresh
  sentinel.conf on every boot by probing peer sentinels and redis for
  consensus master (priority: sentinel vote > role:master with slaves >
  pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
- redis.conf `include /shared/replica.conf`: init container writes
  `replicaof <master> 6379` for non-master pods so they come up already in
  the correct role. No bootstrap race.
- master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork COW
  headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
- RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
- PodDisruptionBudget minAvailable=2.

Also:
- HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
  Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes the
  sole client-facing path for all 17 consumers.
- New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
  RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
  RedisReplicasMissing. Updated RedisDown to cover both statefulsets during
  the migration.
- databases.md updated to describe the interim parallel-cluster state.

Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress; Phase 3-7 await maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
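
The master-selection priority described in the commit message (sentinel vote > role:master with slaves > pod-0 fallback) could be sketched roughly as follows. This is an illustrative sketch only, not the repository's init container: the function names, the input shapes, the strict-majority rule for the sentinel vote, and the empty replica.conf for the master pod are all assumptions. Only the priority order, the `replicaof <master> 6379` line, and the redis-v2-0 pod name come from the commit message.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Pod-0 name taken from the commit message; using it as a bare hostname is an assumption.
FALLBACK_MASTER = "redis-v2-0"

# (host, role, connected_slaves) as it might be parsed from `INFO replication` (hypothetical shape).
RoleReport = Tuple[str, str, int]


def pick_master(
    sentinel_votes: Sequence[Optional[str]],
    role_reports: Sequence[RoleReport],
    fallback: str = FALLBACK_MASTER,
) -> str:
    """Choose a master using the priority order from the commit message."""
    # 1. Sentinel vote: accept a host only if a strict majority of the
    #    sentinels that answered name it (the majority rule is an assumption).
    answered = [v for v in sentinel_votes if v]
    if answered:
        host, count = Counter(answered).most_common(1)[0]
        if count * 2 > len(answered):
            return host

    # 2. A redis instance that reports role:master and has attached replicas.
    masters = [r for r in role_reports if r[1] == "master" and r[2] > 0]
    if masters:
        return max(masters, key=lambda r: r[2])[0]

    # 3. Nothing conclusive: fall back to pod-0.
    return fallback


def render_replica_conf(self_host: str, master: str, port: int = 6379) -> str:
    """Contents for /shared/replica.conf; empty for the master (assumed)."""
    if self_host == master:
        return ""
    return f"replicaof {master} {port}\n"


if __name__ == "__main__":
    master = pick_master([None, None], [("redis-v2-1", "master", 2)])
    print(master)
    print(render_replica_conf("redis-v2-2", master), end="")
```

Keeping the decision in a pure function separates it from the network probing, so the fallback order can be exercised without a live cluster.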
agent-task-tracking.md	Add agent task tracking documentation	2026-04-15 17:11:26 +00:00
authentication.md	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2	2026-04-15 06:41:56 +00:00
automated-upgrades.md	[docs] automated-upgrades: document long-lived OAuth + expiry monitoring	2026-04-18 13:00:07 +00:00
backup-dr.md	[infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps	2026-04-17 05:51:52 +00:00
ci-cd.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
compute.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
databases.md	[redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts	2026-04-19 15:23:05 +00:00
dns.md	[dns] Fix CoreDNS serve_stale syntax — 24h TTL, no refresh-mode arg	2026-04-19 15:18:43 +00:00
homepage.md	add homepage auto-discovery documentation [ci skip]	2026-03-25 13:06:43 +02:00
incident-response.md	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP	2026-04-18 10:12:02 +00:00
mailserver.md	[docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip]	2026-04-19 12:40:53 +00:00
monitoring.md	[docs] Document external-monitor opt-out mechanism in monitoring.md	2026-04-19 15:19:06 +00:00
multi-tenancy.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00
networking.md	docs: add comprehensive DNS architecture documentation	2026-04-15 18:10:27 +00:00
overview.md	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]	2026-04-13 14:42:07 +00:00
secrets.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
security.md	[docs] Update anti-AI and rybbit docs after rewrite-body removal	2026-04-17 21:43:13 +00:00
storage.md	docs(storage): add encrypted LVM documentation	2026-04-15 21:00:37 +00:00
vpn.md	docs: update Technitium DNS docs after cache optimization	2026-04-12 18:29:25 +01:00