Commit graph

27 commits

Author SHA1 Message Date
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00
Viktor Barzin
910ea5d923 [ci skip] add NFS CSI migration design doc and implementation plan 2026-03-01 23:30:27 +00:00
Viktor Barzin
e50cfa1d19 [ci skip] add Traefik resilience hardening implementation plan 2026-03-01 13:53:50 +00:00
Viktor Barzin
454a48c6ac [ci skip] add Traefik resilience hardening design doc 2026-03-01 13:50:00 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan 2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc 2026-02-28 18:14:42 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry 2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan 2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc 2026-02-28 17:43:42 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
517acd95af [ci skip] revise storage reliability design based on research agent findings
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00
Viktor Barzin
415d8704d4 [ci skip] add storage reliability design: DB replication + SQLite consolidation 2026-02-28 14:24:42 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
5bc1a47cb8 [ci skip] Add anti-AI scraping implementation plan 2026-02-22 19:41:39 +00:00
Viktor Barzin
4a9fe474c6 [ci skip] Add anti-AI scraping system design doc 2026-02-22 19:37:29 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
85581923f6 [ci skip] Add multi-user Kubernetes access implementation plan 2026-02-17 20:49:14 +00:00
Viktor Barzin
cf146f5980 [ci skip] Add multi-user Kubernetes access design document 2026-02-17 20:44:23 +00:00
Viktor Barzin
69aae2ec9d [ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
04dd438b01 [ci skip] Add centralized log collection implementation plan 2026-02-13 21:54:55 +00:00
Viktor Barzin
6ac8d549cb [ci skip] Add centralized log collection design doc 2026-02-13 21:53:04 +00:00