Commit graph

52 commits

Author SHA1 Message Date
Viktor Barzin
e9046e5a26 traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge
Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client
as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so
real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2
only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients
(ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through
the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC
over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh
(config.xml shellcmd), keeping the nginx-off-[::] patch.

Also fixes stale networking.md: Traefik was still documented on the
shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 09:51:23 +00:00
Viktor Barzin
16c9aafafa docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2)
Records the successful cutover and the key fix that made it safe: decouple
cloudflared from the LB IP first (point its tunnel ingress at the in-cluster
Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer
breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking
notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 08:12:57 +00:00
Viktor Barzin
1473a94f29 docs/plans: Traefik dedicated-IP cutover attempt 1 post-mortem (rolled back)
Attempt rolled back to .200 baseline. Root blocker: cloudflared is a
token/dashboard-managed tunnel whose ingress targets the Traefik LB IP
(10.0.20.200), so moving Traefik to .203 took down all proxied apps. Retry
must also repoint the tunnel ingress (Cloudflare API). Also documents the
vault-ingress circular dep, SIGPIPE->stuck PG state-lock gotcha, and the
ETP=Local hairpin caveat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 01:27:29 +00:00
Viktor Barzin
09a0c1fad4 docs/plans: Traefik dedicated IP + ETP=Local migration (design + plan)
Move Traefik off shared MetalLB IP 10.0.20.200 to a dedicated 10.0.20.203
with externalTrafficPolicy=Local, to (1) restore real client IPs for CrowdSec
on the 24 non-proxied apps (currently SNAT'd to a node IP) and (2) enable QUIC.
Forced off the shared IP because MetalLB forbids mixed ETP on a shared IP
(10.0.20.200 also carries the Terraform state DB). In-place cutover selected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 00:27:04 +00:00
Viktor Barzin
2a7124d266 docs(plans): wealth net-worth projections design
Forward 30y net-worth projection on the existing wealth Grafana
dashboard: multi-scenario lines (low/base/high + derived historical
CAGR), pure-SQL over wealth-pg reusing the dashboard's Modified-Dietz
and complete-days patterns, with/without-contributions at base rate,
in a collapsed row that sidesteps Grafana's shared-time-range limit.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 22:15:03 +00:00
Viktor Barzin
3526089457 docs: Talos migration design v7 — staged plan after 6 rounds of critique [ci skip]
Handoff artifact for next session. v7 is the converged staged plan
(Stage A hardened Ubuntu → B DR primitives → C 6-week soak →
D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h)
vs v7 (staged, 30-37h to decision point) vs hybrid.

Full context in ~/.claude/plans/distributed-humming-sonnet.md.
2026-05-26 19:45:48 +00:00
0025511b6a docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).

- architecture/vpn.md — Technitium component cell + AdGuard forwarder
  example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
  example
2026-05-23 08:53:52 +00:00
Viktor Barzin
7f63d35d0a docs/plans: HA control plane — design + plan + deferral
Investigated, designed, and planned the 3-master HA control plane
migration triggered by 2026-05-21's autonomous k8s upgrade cascade.

Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack
  multi-master refactor, HTTPS /readyz health check, expanded blast
  radius to include /home/wizard/code/infra/config root kubeconfig,
  config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector,
  k8s-version-upgrade chain extension as Phase 7)

Plan covers 11 phases end-to-end including panic-mode rollback.

DEFERRED before execution. PVE host is 98% RAM-committed
(262 GB allocated / 267 GB physical, 1.5 GB swap active); the
planned 3 x 32 GB masters would push allocation to 326 GB and OOM
the host. k8s-master currently uses only 4.6 GB of its 32 GB
allocation (5-6x oversized).

Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.

Standalone candidate worth lifting independently: Phase 1.5's
rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning
to loop over k8s_master_hosts list) — future-proofs the cluster
without committing to the HA migration.

Refs: code-n0ow (open, deferred via bd note).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:32:15 +00:00
Viktor Barzin
9ad52dfd61 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:20:00 +00:00
Viktor Barzin
533a89a010 docs: HA control plane design (3 masters)
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.

Tracked in beads code-n0ow. Plan doc to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:41:20 +00:00
Viktor Barzin
9fd54143c2 docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.

Not scheduled yet. Tracked in beads code-963q.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:10:00 +00:00
Viktor Barzin
dd1ec4f1be docs/plans: add agent presence implementation plan (2026-05-17)
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
2026-05-17 21:03:17 +00:00
Viktor Barzin
910167105e Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:19:34 +00:00
Viktor Barzin
016584651e docs/plans: 2026-04-20 infra audit design (post-research, post-challenge)
Adds the infra audit plan: 5 parallel research agents (Reliability,
Declarative, Maintenance, Scalability, Security) → 91 raw findings →
2 independent challengers → filtered/corrected/ranked backlog.

Already incorporates the challenger corrections (drops bad metric
pulls, reframes intentional-by-design items). Source for several
follow-ups already shipped this week (kured-prometheus gating, NFS
fsid post-mortem fixes, Authentik outpost postgres-backend).
2026-05-10 17:07:49 +00:00
Viktor Barzin
5d22b449f9 [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:51:34 +00:00
Viktor Barzin
51bf38815c vault: record Phase 3 vault Released-PV cleanup
Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed
their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit
log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the
proxmox-lvm/encrypted side stays out of scope.
2026-04-25 23:08:45 +00:00
Viktor Barzin
484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.

Closes: code-gy7h
2026-04-25 17:10:00 +00:00
Viktor Barzin
288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).

Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly,
survives future chart bumps. vault-1 and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:19:49 +00:00
Viktor Barzin
43e4f3f68e immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted
Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).

The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
initdb's cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.

Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.

See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.

Closes: code-ahr7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:47:30 +00:00
Viktor Barzin
65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal
- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:43:13 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
b2cac8cc97 add proxmox-csi cleanup TODO for post-migration tasks [ci skip] 2026-04-03 20:02:14 +03:00
Viktor Barzin
d49acebd8e migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip]
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
  block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
2026-04-03 19:45:34 +03:00
Viktor Barzin
6f8b48a73c [ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages)
Bug fixes:
- CA cert now populated in ConfigMap (was empty → TLS failures)
- Remove useless heredoc quote escaping in setup script
- Fix homepage: VPN callout, correct verification command (get namespaces)
- Fix false-positive sensitive=true on ingress_path, tls_secret_name,
  truenas_host, ollama_host, client_certificate_secret_name

New pages (direct Svelte, no mdsvex dependency):
- /onboarding: step-by-step guide (VPN, kubectl, git, first PR)
- /architecture: cluster topology, storage, networking, tiers
- /services: catalog of 70+ services with URLs
- /contributing: PR workflow, what you can/can't change, NEVER list
- /troubleshooting: common issues and fixes

Navigation bar added to layout. All pages use consistent docs styling.

Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files
&& docker build -t viktorbarzin/k8s-portal:latest . && docker push
2026-03-07 15:06:26 +00:00
Viktor Barzin
91d11e5cda [ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)
Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption.
Operators push PRs → Viktor reviews → CI applies. No encryption keys needed
for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations
(0 remaining CRITICALs).
2026-03-07 13:55:05 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00
Viktor Barzin
910ea5d923 [ci skip] add NFS CSI migration design doc and implementation plan 2026-03-01 23:30:27 +00:00
Viktor Barzin
e50cfa1d19 [ci skip] add Traefik resilience hardening implementation plan 2026-03-01 13:53:50 +00:00
Viktor Barzin
454a48c6ac [ci skip] add Traefik resilience hardening design doc 2026-03-01 13:50:00 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan 2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc 2026-02-28 18:14:42 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry 2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan 2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc 2026-02-28 17:43:42 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
517acd95af [ci skip] revise storage reliability design based on research agent findings
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00
Viktor Barzin
415d8704d4 [ci skip] add storage reliability design: DB replication + SQLite consolidation 2026-02-28 14:24:42 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
5bc1a47cb8 [ci skip] Add anti-AI scraping implementation plan 2026-02-22 19:41:39 +00:00
Viktor Barzin
4a9fe474c6 [ci skip] Add anti-AI scraping system design doc 2026-02-22 19:37:29 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
85581923f6 [ci skip] Add multi-user Kubernetes access implementation plan 2026-02-17 20:49:14 +00:00
Viktor Barzin
cf146f5980 [ci skip] Add multi-user Kubernetes access design document 2026-02-17 20:44:23 +00:00
Viktor Barzin
69aae2ec9d [ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00