infra

Author	SHA1	Message	Date
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	16c9aafafa	docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2) Records the successful cutover and the key fix that made it safe: decouple cloudflared from the LB IP first (point its tunnel ingress at the in-cluster Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:12:57 +00:00
Viktor Barzin	1473a94f29	docs/plans: Traefik dedicated-IP cutover attempt 1 post-mortem (rolled back) Attempt rolled back to .200 baseline. Root blocker: cloudflared is a token/dashboard-managed tunnel whose ingress targets the Traefik LB IP (10.0.20.200), so moving Traefik to .203 took down all proxied apps. Retry must also repoint the tunnel ingress (Cloudflare API). Also documents the vault-ingress circular dep, SIGPIPE->stuck PG state-lock gotcha, and the ETP=Local hairpin caveat. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 01:27:29 +00:00
Viktor Barzin	09a0c1fad4	docs/plans: Traefik dedicated IP + ETP=Local migration (design + plan) Move Traefik off shared MetalLB IP 10.0.20.200 to a dedicated 10.0.20.203 with externalTrafficPolicy=Local, to (1) restore real client IPs for CrowdSec on the 24 non-proxied apps (currently SNAT'd to a node IP) and (2) enable QUIC. Forced off the shared IP because MetalLB forbids mixed ETP on a shared IP (10.0.20.200 also carries the Terraform state DB). In-place cutover selected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 00:27:04 +00:00
Viktor Barzin	2a7124d266	docs(plans): wealth net-worth projections design Forward 30y net-worth projection on the existing wealth Grafana dashboard: multi-scenario lines (low/base/high + derived historical CAGR), pure-SQL over wealth-pg reusing the dashboard's Modified-Dietz and complete-days patterns, with/without-contributions at base rate, in a collapsed row that sidesteps Grafana's shared-time-range limit. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:15:03 +00:00
Viktor Barzin	3526089457	docs: Talos migration design v7 — staged plan after 6 rounds of critique [ci skip] Handoff artifact for next session. v7 is the converged staged plan (Stage A hardened Ubuntu → B DR primitives → C 6-week soak → D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h) vs v7 (staged, 30-37h to decision point) vs hybrid. Full context in ~/.claude/plans/distributed-humming-sonnet.md.	2026-05-26 19:45:48 +00:00
Viktor Barzin	0025511b6a	docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201 Stragglers from the same drift as commit b288a59 (monorepo) / the 2026-05-22 viktorbarzin.me apex incident — the `.101` references were left over from the NodePort exposure era. Technitium's actual MetalLB LB IP is `.201` (in pool 10.0.20.200-220). - architecture/vpn.md — Technitium component cell + AdGuard forwarder example + nslookup troubleshooting hint - architecture/networking.md — 502 ingress troubleshooting snippet - plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers example	2026-05-23 08:53:52 +00:00
Viktor Barzin	7f63d35d0a	docs/plans: HA control plane — design + plan + deferral Investigated, designed, and planned the 3-master HA control plane migration triggered by 2026-05-21's autonomous k8s upgrade cascade. Locked 14 design decisions across two passes: - 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc) - 4 challenger-pass amendments (cloud-init template bump, rbac stack multi-master refactor, HTTPS /readyz health check, expanded blast radius to include /home/wizard/code/infra/config root kubeconfig, config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector, k8s-version-upgrade chain extension as Phase 7) Plan covers 11 phases end-to-end including panic-mode rollback. DEFERRED before execution. PVE host is 98% RAM-committed (262 GB allocated / 267 GB physical, 1.5 GB swap active); the planned 3 x 32 GB masters would push allocation to 326 GB and OOM the host. k8s-master currently uses only 4.6 GB of its 32 GB allocation (5-6x oversized). Revisit triggers documented in design doc: 1. Second PVE host added → hardware HA becomes possible. 2. Right-sizing pass OR planning masters at 16 GB each. 3. Cumulative manual upgrade nursing > ~10h. Standalone candidate worth lifting independently: Phase 1.5's rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning to loop over k8s_master_hosts list) — future-proofs the cluster without committing to the HA migration. Refs: code-n0ow (open, deferred via bd note). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:32:15 +00:00
Viktor Barzin	9ad52dfd61	openclaw: SSH + tmux task fallback to devvm Give the OpenClaw pod two new capabilities: 1. Host-tools bundle. New init container `install-host-tools` extracts openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq + friends into /tools/host-tools/, with the bookworm-slim libs the binaries need. PATH + LD_LIBRARY_PATH on the main container point ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1 marker; smoke test (ldd-based) fails the init at deploy time if any binary has unresolved deps. Bundle is ~558 MB on the existing /srv/nfs/openclaw/tools NFS. 2. devvm SSH + async task pattern. New init `setup-ssh-config` writes id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main container startup symlinks /home/node/.ssh → there. New /usr/local/bin/openclaw-task wrapper on devvm manages long-running work as tmux sessions on devvm (sessions and logs survive pod restarts — they live on devvm, not in the pod). New init container `seed-devvm-memory-note` drops a markdown note teaching the pattern; main container startup now runs `openclaw memory index --force` so the note is searchable on first boot. Design + verified E2E flow in docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test green: spawned a 50s task from pod A, deleted pod A, new pod B saw the task finish and read its full log. Pre-existing keel.sh annotation drift on openclaw/{openlobster, task_webhook} cleaned up in the same apply. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 10:20:00 +00:00
Viktor Barzin	533a89a010	docs: HA control plane design (3 masters) Captures today's k8s-upgrade-pipeline session findings — root cause of repeated upgrade failures is the single-master apiserver outage window cascading into operator crashloops + storm I/O. HA control plane with 3 masters + apiserver LB removes the cascade entirely. Tracked in beads code-n0ow. Plan doc to follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 09:41:20 +00:00
Viktor Barzin	9fd54143c2	docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade Captures the wipe+reinit strategy (sidestep the broken DD upgrade path), the IO config bump (innodb_io_capacity 100→2000), root-cause analysis with explicit uncertainty, verification gates, and rollback. Not scheduled yet. Tracked in beads code-963q. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 13:10:00 +00:00
Viktor Barzin	dd1ec4f1be	docs/plans: add agent presence implementation plan (2026-05-17) 15-task plan for a shared presence board so Claude Code sessions can see which shared infra resources are being actively mutated by other sessions. Resource-scoped claims on the existing Dolt server, heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.	2026-05-17 21:03:17 +00:00
Viktor Barzin	910167105e	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:19:34 +00:00
Viktor Barzin	016584651e	docs/plans: 2026-04-20 infra audit design (post-research, post-challenge) Adds the infra audit plan: 5 parallel research agents (Reliability, Declarative, Maintenance, Scalability, Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog. Already incorporates the challenger corrections (drops bad metric pulls, reframes intentional-by-design items). Source for several follow-ups already shipped this week (kured-prometheus gating, NFS fsid post-mortem fixes, Authentik outpost postgres-backend).	2026-05-10 17:07:49 +00:00
Viktor Barzin	5d22b449f9	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * ): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 15:51:34 +00:00
Viktor Barzin	51bf38815c	vault: record Phase 3 vault Released-PV cleanup Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the proxmox-lvm/encrypted side stays out of scope.	2026-04-25 23:08:45 +00:00
Viktor Barzin	484b4c7190	vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1 + vault-2 today). The NFS fsync incompatibility identified in the 2026-04-22 raft-leader-deadlock post-mortem is no longer reachable — raft consensus log + audit log live on LUKS2 block storage with real fsync semantics. Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox dropped to zero after the rolling, so the resource is removed from infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster and will be reclaimed in Phase 3 cleanup. Lesson learned (recorded in plan): pvc-protection finalizer races the StatefulSet controller — pod recreates on the OLD PVCs unless the finalizer is patched out before pod delete. Force-finalize technique applied to vault-1 + vault-2 successfully. Closes: code-gy7h	2026-04-25 17:10:00 +00:00
Viktor Barzin	288efa89b3	vault: migrate vault-0 storage to proxmox-lvm-encrypted Phase 2 of the NFS-hostile migration: data + audit storageClass on the vault helm release switches from nfs-proxmox to proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between). vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part is what makes this safe (raft quorum maintained by 2 healthy pods while one is replaced). Also restores chart-default pod securityContext fields. The previous `statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}` block REPLACED (not merged) the chart's defaults — fsGroup, runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS exports were permissive enough to mask the missing fsGroup; ext4 LV volume root is root:root and the vault user (UID 100) couldn't open vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly, survives future chart bumps. vault-1 and vault-2 retained their correct securityContext from when their pod specs were written to etcd, before the partial customization landed — the bug only surfaces when a pod is recreated. Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap (recovery anchor). Refs: code-gy7h Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:19:49 +00:00
Viktor Barzin	43e4f3f68e	immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per commit on NFS contributed to the 2026-04-22 NFS writeback storm (2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS (append-only, NFS-tolerant). The init container that writes postgresql.override.conf is now gated on PG_VERSION presence — on a fresh PVC the file would otherwise make initdb refuse the non-empty PGDATA. First boot skips the override and initdb's cleanly; second boot (after a forced restart) writes the override so vchord/vectors/pg_prewarm load before the dump restore. Idempotent on initialised PVCs. Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC → REINDEX clip_index/face_index → 111,843 assets verified, external HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only). LV created on PVE host, picked up by lvm-pvc-snapshot. See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md. Phase 2 (Vault Raft) follows under code-gy7h. Closes: code-ahr7 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:47:30 +00:00
Viktor Barzin	65b0f30d5e	[docs] Update anti-AI and rybbit docs after rewrite-body removal - Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit) - Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible - Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter) - strip-accept-encoding middleware removed from all references Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 21:43:13 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	b2cac8cc97	add proxmox-csi cleanup TODO for post-migration tasks [ci skip]	2026-04-03 20:02:14 +03:00
Viktor Barzin	d49acebd8e	migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip] - Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm - Update docs/architecture/storage.md: document Proxmox CSI as primary block storage, mark democratic-csi iSCSI as deprecated - Add full migration plan to docs/plans/	2026-04-03 19:45:34 +03:00
Viktor Barzin	6f8b48a73c	[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages) Bug fixes: - CA cert now populated in ConfigMap (was empty → TLS failures) - Remove useless heredoc quote escaping in setup script - Fix homepage: VPN callout, correct verification command (get namespaces) - Fix false-positive sensitive=true on ingress_path, tls_secret_name, truenas_host, ollama_host, client_certificate_secret_name New pages (direct Svelte, no mdsvex dependency): - /onboarding: step-by-step guide (VPN, kubectl, git, first PR) - /architecture: cluster topology, storage, networking, tiers - /services: catalog of 70+ services with URLs - /contributing: PR workflow, what you can/can't change, NEVER list - /troubleshooting: common issues and fixes Navigation bar added to layout. All pages use consistent docs styling. Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files && docker build -t viktorbarzin/k8s-portal:latest . && docker push	2026-03-07 15:06:26 +00:00
Viktor Barzin	91d11e5cda	[ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x) Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption. Operators push PRs → Viktor reviews → CI applies. No encryption keys needed for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations (0 remaining CRITICALs).	2026-03-07 13:55:05 +00:00
Viktor Barzin	197cef7f3f	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache - tiers.tf: Terragrunt-generated tier locals for all standalone stacks - .planning/: resource audit research and plans - docs/plans/: cluster hardening design doc - redis-25.3.2.tgz: Bitnami Redis Helm chart cache	2026-03-06 23:55:57 +00:00
Viktor Barzin	db7ea58d5c	[ci skip] add security observability layer design document Tetragon-centric approach: eBPF runtime security, pfSense syslog collection, CoreDNS query logging, Calico NetworkPolicies, on-demand mitmproxy, unified Grafana security dashboard. ~625MB steady-state, <5GB budget.	2026-03-02 21:13:01 +00:00
Viktor Barzin	910ea5d923	[ci skip] add NFS CSI migration design doc and implementation plan	2026-03-01 23:30:27 +00:00
Viktor Barzin	e50cfa1d19	[ci skip] add Traefik resilience hardening implementation plan	2026-03-01 13:53:50 +00:00
Viktor Barzin	454a48c6ac	[ci skip] add Traefik resilience hardening design doc	2026-03-01 13:50:00 +00:00
Viktor Barzin	a1ba218cd2	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk Major milestone - shared PostgreSQL moved from NFS to CloudNativePG: - CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility - All 20 databases and 19 roles restored from pg_dumpall backup - postgresql.dbaas Service patched to point at CNPG primary - Old PG deployment scaled to 0 (NFS data intact for rollback) - All 12+ dependent services verified running: authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker, rybbit, affine, health, resume, trading-bot, atuin - Authentik PgBouncer working through the switched endpoint TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob	2026-02-28 19:08:06 +00:00
Viktor Barzin	052662540b	[ci skip] add network visualization implementation plan	2026-02-28 18:19:36 +00:00
Viktor Barzin	887075189a	[ci skip] add network traffic visualization design doc	2026-02-28 18:14:42 +00:00
Viktor Barzin	4651b67479	[ci skip] update CI caching plan: add Terraform provisioning for private registry	2026-02-28 17:51:55 +00:00
Viktor Barzin	2adfa86401	[ci skip] add CI build caching implementation plan	2026-02-28 17:46:44 +00:00
Viktor Barzin	5ef03cc0e0	[ci skip] add CI build caching design doc	2026-02-28 17:43:42 +00:00
Viktor Barzin	14b1c43713	[ci skip] expand k8s worker nodes to 256G, update inventory and extend script - k8s-node2: 128G → 256G (160GB free) - k8s-node3: 128G → 256G (135GB free) - k8s-node4: 128G → 256G (127GB free) - k8s-node1: already 256G (51GB free) - extend_vm_storage.sh: increase drain timeout to 300s, add --force flag - Remove Vaultwarden from SQLite migration plan (too risky)	2026-02-28 16:00:16 +00:00
Viktor Barzin	517acd95af	[ci skip] revise storage reliability design based on research agent findings Key changes from v1: - Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL - Remove Headscale from PG migration (project discourages it) - Remove MeshCentral from PG migration (NeDB, not SQLite) - Replace Redis Sentinel with single redis:7 on local disk (modules unused) - Add RAM overcommit warning and mitigation - Add explicit single-host limitation acknowledgment - Add per-component rollback plans - Fix backup strategy (CNPG can't archive WAL to NFS natively) - Reorder migration: low-risk services first, authentik last - Add research gate before each service migration	2026-02-28 14:38:01 +00:00
Viktor Barzin	415d8704d4	[ci skip] add storage reliability design: DB replication + SQLite consolidation	2026-02-28 14:24:42 +00:00
Viktor Barzin	cc7f119578	[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs - Add gpu=true label to Terraform (nvidia null_resource alongside taint) - Improve API server OIDC config to detect value changes, not just flag presence - Add policy_hash trigger to audit-policy so rule changes auto-reapply - Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook - Document full node rebuild procedure in CLAUDE.md - Save Talos Linux migration evaluation for future reference	2026-02-22 22:59:38 +00:00
Viktor Barzin	5bc1a47cb8	[ci skip] Add anti-AI scraping implementation plan	2026-02-22 19:41:39 +00:00
Viktor Barzin	4a9fe474c6	[ci skip] Add anti-AI scraping system design doc	2026-02-22 19:37:29 +00:00
Viktor Barzin	116c4d9c30	[ci skip] Remove legacy files and orphaned modules Delete 20 orphaned module directories and 3 stray files from modules/kubernetes/ that are no longer referenced by any stack. Remove 7 root-level legacy files including the empty tfstate, 27MB terraform zip, commented-out main.tf, and migration notes. Clean up commented-out dockerhub_secret and oauth-proxy references in blog, travel_blog, and city-guesser stacks. Remove stale frigate config.yaml entry from .gitignore. Remove ephemeral docs/plans/ directory.	2026-02-22 15:23:27 +00:00
Viktor Barzin	c1ee757c6b	[ci skip] Add Terragrunt migration implementation plan	2026-02-22 00:51:00 +00:00
Viktor Barzin	209355d1af	[ci skip] Add Terragrunt migration design document	2026-02-22 00:46:57 +00:00
Viktor Barzin	f41e2ca969	[ci skip] Add OpenClaw cluster health agent implementation plan	2026-02-21 23:48:36 +00:00
Viktor Barzin	51cb045f12	[ci skip] Add OpenClaw cluster management agent design doc	2026-02-21 23:45:30 +00:00
Viktor Barzin	85581923f6	[ci skip] Add multi-user Kubernetes access implementation plan	2026-02-17 20:49:14 +00:00
Viktor Barzin	cf146f5980	[ci skip] Add multi-user Kubernetes access design document	2026-02-17 20:44:23 +00:00
Viktor Barzin	69aae2ec9d	[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc	2026-02-13 23:08:44 +00:00

1 2

52 commits