52 commits

Author SHA1 Message Date
Viktor Barzin
c740ed1301 docs: update Technitium DNS docs after cache optimization
- Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md)
- Fix PDB minAvailable: 1 → 2 (networking.md)
- Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs
- Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only
- Update config storage: was generic PVC, now NFS path
2026-04-12 18:29:25 +01:00
Viktor Barzin
73531c12e0 docs(vpn): update with dual-stack WG, GL-iNet AllowedIPs fix, and troubleshooting [ci skip]
Document fixes from 2026-04-10 London network debugging session:
- pfSense WG now dual-stack (IPv4+IPv6 via HE tunnel gif0 pf rule)
- GL-iNet AllowedIPs must be single comma-separated UCI entry (parse bug)
- AdGuardHome/carrier-monitor must not use 1.1.1.1 (conntrack + rate limit)
- Expanded troubleshooting for site-to-site tunnel disconnects
2026-04-10 22:24:19 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
98aaba98da docs: add Split Horizon hairpin NAT fix to networking architecture
[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:45:53 +00:00
Viktor Barzin
cfa7a50cb5 docs: update networking architecture for DNS consolidation
- Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200)
- Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging
- Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN)
- Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability
- Update MetalLB table with per-service IP assignments
- Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:49:33 +00:00
Viktor Barzin
fea8519f51 update VPN architecture docs and Authentik state reference
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
  interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
  names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
  that was blocking all unauthenticated users from login flows

[ci skip]
2026-04-06 16:26:21 +03:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfSense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
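
The MetalLB pool noted in the commit above can be sanity-checked with a short sketch. It assumes "10.0.20.200-220" means the inclusive range 10.0.20.200 through 10.0.20.220; the helper name `pool_addresses` is hypothetical and not part of the repository.

```python
import ipaddress


def pool_addresses(first, last):
    """Return every IPv4 address from first to last, inclusive."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    return [ipaddress.IPv4Address(n) for n in range(start, end + 1)]


# Assumption: the pool in the commit message is an inclusive range.
pool = pool_addresses("10.0.20.200", "10.0.20.220")

# 10.0.20.201 is the address other commits in this log give to Technitium DNS;
# 10.0.20.101 is the typo that a later commit corrects.
technitium_in_pool = ipaddress.IPv4Address("10.0.20.201") in pool
typo_in_pool = ipaddress.IPv4Address("10.0.20.101") in pool

print(len(pool), technitium_in_pool, typo_in_pool)
```

Under that assumption, a check like this would have flagged the 10.0.20.101 typo as an address outside the pool.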
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
72d832fee7 add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs
- Healthcheck: add entity availability, integration health, automation
  status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
2026-04-06 11:57:36 +03:00
Viktor Barzin
b2cac8cc97 add proxmox-csi cleanup TODO for post-migration tasks [ci skip]
2026-04-03 20:02:14 +03:00
Viktor Barzin
d49acebd8e migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip]
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
  block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
2026-04-03 19:45:34 +03:00
Viktor Barzin
2d8aa5ed89 docs: update hardware inventory for R730 RAM upgrade to 272GB
Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400.
Added physical DIMM slot diagram, channel layout, and BIOS speed override
notes. Updated compute architecture with correct CPU (single socket),
VM memory values, and capacity figures.
2026-04-02 00:48:13 +03:00
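
The DIMM totals quoted in the commit above can be verified with a few lines of arithmetic. This is only an illustrative check; the helper name `total_gb` is hypothetical.

```python
def total_gb(dimm_groups):
    """Sum capacity over (count, size_gb) DIMM groups."""
    return sum(count * size_gb for count, size_gb in dimm_groups)


# Configurations as stated in the commit message.
before = total_gb([(4, 32), (2, 8)])   # 4x32G + 2x8G
after = total_gb([(8, 32), (2, 8)])    # 8x32G + 2x8G

print(before, after, after - before)
```

The difference corresponds to the four additional 32G modules.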
Viktor Barzin
10f22350c5 exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip]
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.
2026-03-29 13:44:32 +03:00
Viktor Barzin
78dec8f0ad add e2e email roundtrip monitoring
CronJob (every 30 min) sends test email via Mailgun API to
smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all,
deletes test email, pushes metrics to Pushgateway + Uptime Kuma.

Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale,
EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.
2026-03-25 22:50:22 +02:00
Viktor Barzin
fe109d9f96 add homepage auto-discovery documentation [ci skip]
2026-03-25 13:06:43 +02:00
Viktor Barzin
6af47c7c89 docs: update networking architecture for single MetalLB IP
Reflect consolidation of all 11 LB services onto 10.0.20.200.
Add service port table, MetalLB v0.15 sharing key requirements,
and ETP matching troubleshooting guidance.
2026-03-24 18:44:47 +02:00
Viktor Barzin
dbff547741 remove docs/backup-strategy.md, absorbed into architecture/backup-dr.md [ci skip]
2026-03-24 01:08:06 +02:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
Viktor Barzin
6e661fdfc5 add backup & DR strategy documentation with ASCII diagrams
Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
2026-03-23 02:24:02 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
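
To make the schedule `0 */6 * * *` from the commit above concrete, the sketch below expands the minute and hour fields into daily run times. It handles only plain integers, `*`, and `*/N` steps, not the full cron syntax, and the helper name `expand_field` is hypothetical.

```python
def expand_field(field, low, high):
    """Expand a cron field ('*', '*/N', or a single integer) into its values."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


# Schedule taken from the commit message; the last three fields are ignored here.
minute, hour, *_ = "0 */6 * * *".split()

run_times = [
    f"{h:02d}:{m:02d}"
    for h in expand_field(hour, 0, 23)
    for m in expand_field(minute, 0, 59)
]

print(run_times)
```

The result is four runs per day, which matches the "every 6 hours" wording in the commit.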
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
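
The Phase 2 idea of timestamped backups with a 7-day retention purge can be sketched as follows. This is not the repository's purge logic; the file names and the helper `expired` are hypothetical, and only the selection step is shown (no files are deleted).

```python
from datetime import datetime, timedelta


def expired(snapshots, now, keep_days=7):
    """Return names of snapshots taken more than keep_days before now."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name, taken in snapshots.items() if taken < cutoff)


# Hypothetical timestamped RDB files mapped to their creation times.
snapshots = {
    "dump-20260310.rdb": datetime(2026, 3, 10),
    "dump-20260313.rdb": datetime(2026, 3, 13),
    "dump-20260318.rdb": datetime(2026, 3, 18),
}

to_purge = expired(snapshots, now=datetime(2026, 3, 19), keep_days=7)
print(to_purge)
```

Writing a new timestamped file per run, rather than overwriting a single file, is what makes a retention window like this possible.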
Viktor Barzin
6f8b48a73c [ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push
2026-03-07 15:06:26 +00:00
Viktor Barzin
91d11e5cda [ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)
Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption.
Operators push PRs → Viktor reviews → CI applies. No encryption keys needed
for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations
(0 remaining CRITICALs).
2026-03-07 13:55:05 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00
Viktor Barzin
910ea5d923 [ci skip] add NFS CSI migration design doc and implementation plan
2026-03-01 23:30:27 +00:00
Viktor Barzin
e50cfa1d19 [ci skip] add Traefik resilience hardening implementation plan
2026-03-01 13:53:50 +00:00
Viktor Barzin
454a48c6ac [ci skip] add Traefik resilience hardening design doc
2026-03-01 13:50:00 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan
2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc
2026-02-28 18:14:42 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry
2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan
2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc
2026-02-28 17:43:42 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
517acd95af [ci skip] revise storage reliability design based on research agent findings
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00
Viktor Barzin
415d8704d4 [ci skip] add storage reliability design: DB replication + SQLite consolidation
2026-02-28 14:24:42 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
5bc1a47cb8 [ci skip] Add anti-AI scraping implementation plan
2026-02-22 19:41:39 +00:00
Viktor Barzin
4a9fe474c6 [ci skip] Add anti-AI scraping system design doc
2026-02-22 19:37:29 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan
2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document
2026-02-22 00:46:57 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan
2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc
2026-02-21 23:45:30 +00:00
Viktor Barzin
85581923f6 [ci skip] Add multi-user Kubernetes access implementation plan
2026-02-17 20:49:14 +00:00
Viktor Barzin
cf146f5980 [ci skip] Add multi-user Kubernetes access design document
2026-02-17 20:44:23 +00:00
Viktor Barzin
69aae2ec9d [ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc
2026-02-13 23:08:44 +00:00