Commit graph

32 commits

Author SHA1 Message Date
Viktor Barzin
6e661fdfc5 add backup & DR strategy documentation with ASCII diagrams
Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
2026-03-23 02:24:02 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
Viktor Barzin
6f8b48a73c [ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push
2026-03-07 15:06:26 +00:00
Viktor Barzin
91d11e5cda [ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)
Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption.
Operators push PRs → Viktor reviews → CI applies. No encryption keys needed
for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations
(0 remaining CRITICALs).
2026-03-07 13:55:05 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00
Viktor Barzin
910ea5d923 [ci skip] add NFS CSI migration design doc and implementation plan 2026-03-01 23:30:27 +00:00
Viktor Barzin
e50cfa1d19 [ci skip] add Traefik resilience hardening implementation plan 2026-03-01 13:53:50 +00:00
Viktor Barzin
454a48c6ac [ci skip] add Traefik resilience hardening design doc 2026-03-01 13:50:00 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan 2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc 2026-02-28 18:14:42 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry 2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan 2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc 2026-02-28 17:43:42 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
517acd95af [ci skip] revise storage reliability design based on research agent findings
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00
Viktor Barzin
415d8704d4 [ci skip] add storage reliability design: DB replication + SQLite consolidation 2026-02-28 14:24:42 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
5bc1a47cb8 [ci skip] Add anti-AI scraping implementation plan 2026-02-22 19:41:39 +00:00
Viktor Barzin
4a9fe474c6 [ci skip] Add anti-AI scraping system design doc 2026-02-22 19:37:29 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
85581923f6 [ci skip] Add multi-user Kubernetes access implementation plan 2026-02-17 20:49:14 +00:00
Viktor Barzin
cf146f5980 [ci skip] Add multi-user Kubernetes access design document 2026-02-17 20:44:23 +00:00
Viktor Barzin
69aae2ec9d [ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
04dd438b01 [ci skip] Add centralized log collection implementation plan 2026-02-13 21:54:55 +00:00
Viktor Barzin
6ac8d549cb [ci skip] Add centralized log collection design doc 2026-02-13 21:53:04 +00:00