infra

Author	SHA1	Message	Date
Viktor Barzin	efd28ccce5	anubis: fix 500 on multi-replica + roll out to 6 more public sites Browser visits to viktorbarzin.me started returning HTTP 500 with `store: key not found: "challenge:..."` in pod logs. Root cause: each Anubis pod stores in-flight challenges in process memory; with 2 replicas behind a ClusterIP, the PoW-solved request can be routed to a different pod than the one that issued the challenge. Anubis upstream documents the same caveat ("when running multiple instances on the same base domain, the key must be the same across all instances" — true for the ed25519 signing key, but the challenge store is still pod-local without a shared backend). Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on pod restart. Real fix (Redis-backed challenge store) noted as a follow-up in CLAUDE.md. Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json), privatebin (pb), homepage (home), real-estate-crawler (wrongmove UI only — `/api` ingress stays direct via path-based ingress carve-out so XHRs from the SPA bypass the challenge). End-state: 9 public hosts now Anubis-fronted (blog, www, kms, travel, f1, cc, json, pb, home, wrongmove). All return the challenge HTML to bare curl/browser; verified-IP search engines and /robots.txt + /.well-known still skip via the strict-policy allowlist.	2026-05-10 00:50:30 +00:00
Viktor Barzin	f48da84770	anubis: per-site PoW reverse proxy on blog + kms + travel-blog Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key) makes a single solve good across every Anubis-fronted subdomain. Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and travel.viktorbarzin.me — each with anti_ai_scraping=false on the ingress so the redundant ai-bot-block forwardAuth is dropped from the chain. Skipped forgejo (Git/API clients can't solve PoW) and resume (replicas=0). Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so any ingress still using the ai-bot-block forwardAuth pays at most ~150 ms when poison-fountain is scaled down, instead of 3 s. End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms. Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to 4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and Forgejo/API caveat.	2026-05-10 00:06:21 +00:00
Viktor Barzin	d62a9dcda1	docs: PVC templates need lifecycle.ignore_changes for autoresizer The canonical proxmox-lvm and proxmox-lvm-encrypted PVC templates were missing `lifecycle { ignore_changes = [spec[0].resources[0].requests] }`. Without it, every PVC created from these templates becomes a drift bomb the moment pvc-autoresizer expands it: the next `tg apply` on that stack will try to shrink the PVC back to the TF-declared size, K8s rejects the shrink, and apply fails. This was latent because pvc-autoresizer was silently broken cluster-wide (commit `9d5da4d8` fixed it by allow-listing kubelet_volume_stats_available_bytes in Prometheus). Now that the autoresizer actually works, every existing proxmox-lvm/encrypted PVC without ignore_changes is at risk. Sweep needed (separate task): grep for kubernetes_persistent_volume_claim across stacks/ and add ignore_changes to any with resize.topolvm.io annotations.	2026-05-09 12:02:18 +00:00
Viktor Barzin	3148d15d5a	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull-through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end-to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 18:30:02 +00:00
Viktor Barzin	cd96fb64a8	phpipam-pfsense-import: every 5min → hourly Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the heaviest single contributor in our hourly fan-out investigation (11.2 MB/s burst when it fired). Kea DDNS still handles real-time DNS auto-registration; phpIPAM inventory just lags by up to 1h, which we don't need fresher. Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.	2026-04-26 22:48:43 +00:00
Viktor Barzin	7e34b67f24	[docs] Architecture docs: registry integrity probe, pin, new CI pipelines Bring the architecture set in line with what's actually deployed after today's registry reliability work (commits `7cb44d72` → `42961a5f`): - docs/architecture/ci-cd.md: expand Infra Pipelines table with build-ci-image (+ verify-integrity step), registry-config-sync, pve-nfs-exports-sync, postmortem-todos, drift-detection, issue-automation, provision-user. Note registry:2.8.3 pin + integrity probe in the image-registry flow section. - docs/architecture/monitoring.md: add Registry Integrity Probe to components table; add 3-alert section (Manifest Integrity Failure / Probe Stale / Catalog Inaccessible). - .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the revision-link-not-blob rule so the next agent knows the right check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:51:26 +00:00
Viktor Barzin	5a0b24f54e	[docs] TrueNAS decommission cleanup — remove references from active docs TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been served by Proxmox host (192.168.1.127) since. This commit scrubs remaining references from active docs. VM 9000 itself remains on PVE in stopped state pending user decision on deletion. In-session cleanup already landed: reverse-proxy ingress + Cloudflare record removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key} purged; homepage_credentials.reverse_proxy.truenas_token removed; truenas_homepage_token variable + module deleted; Loki + Dashy cleaned; config.tfvars deprecated DNS lines removed; historical-name comment added to the nfs-truenas StorageClass (48 bound PVs, immutable name — kept). Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally untouched — they describe state at a point in time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:55:43 +00:00
Viktor Barzin	43fe11fffc	[mailserver] Phase 6 — decommission MetalLB LB path [ci skip] ## Context (bd code-yiu) With Phase 4+5 proven (external mail flows through pfSense HAProxy + PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are obsolete. Phase 6 decommissions them and documents the steady-state architecture. ## This change ### Terraform (stacks/mailserver/modules/mailserver/main.tf) - `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`. - Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation. - Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP). - Port set unchanged — the Service still exposes 25/465/587/993 for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor` CronJob) that hit the stock PROXY-free container listeners. - Inline comment documents the downgrade rationale + companion `mailserver-proxy` NodePort Service that now carries external traffic. ### pfSense (ops, not in git) - `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT rule references it post-Phase-4; keeping it would be misleading dead metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php` companion script (ad-hoc, not checked in — alias is just a Firewall → Aliases → Hosts entry). ### Uptime Kuma (ops) - Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202` → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` / `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy layer liveness). History retained (edit, not delete-recreate). ### Docs - `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten "Current state" section; now reflects steady-state architecture with two-path diagram (external via HAProxy / intra-cluster via ClusterIP). Phase history table marks Phase 6 ✅. Rollback section updated (no one-liner post-Phase-6; need Service-type re-upgrade + alias re-add). 
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound flow, CrowdSec section, Uptime Kuma monitors list, Decisions section (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY v2"), Troubleshooting all updated. - `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph updated with new external path description; references the new runbook. ## What is NOT in this change - Removal of `10.0.20.202` from `cloudflare_proxied_names` or any reserved-IP tracking — wasn't there to begin with. The `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of 19 available after this, confirming `.202` went back to the pool. - Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if someone re-introduces the MetalLB LB (see runbook "Rollback"). ## Test Plan ### Automated (verified pre-commit 2026-04-19) ``` # Service is ClusterIP with no EXTERNAL-IP $ kubectl get svc -n mailserver mailserver mailserver ClusterIP 10.103.108.217 <none> 25/TCP,465/TCP,587/TCP,993/TCP # 10.0.20.202 no longer answers ARP (ping from pfSense) $ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202' 2 packets transmitted, 0 packets received, 100.0% packet loss # MetalLB pool released the IP $ kubectl get ipaddresspool default -n metallb-system \ -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}' 2 of 19 available # E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS $ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver ... Round-trip SUCCESS in 20.3s ... $ kubectl delete job probe-phase6 -n mailserver # pfSense mailserver alias removed $ ssh admin@10.0.20.1 'php -r "..." | grep mailserver' (no output) ``` ### Manual Verification 1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new hostname `10.0.20.1`. 2. Roundcube login works (`https://mail.viktorbarzin.me/`). 3. 
Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in mailserver logs within ~10s. 4. CrowdSec should still see real client IPs in postfix/dovecot parsers (verify with `cscli alerts list` on next auth-fail event). ## Phase history (bd code-yiu) | Phase | Status | Description | |---|---|---| | 1a | ✅ ``ef75c02f`` | k8s alt :2525 listener + NodePort Service | | 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed | | 3 | ✅ ``ba697b02`` | HAProxy config persisted in pfSense XML | | 4+5 | ✅ ``9806d515`` | 4-port alt listeners + HAProxy frontends + NAT flip | | 6 | ✅ this commit | MetalLB LB retired; 10.0.20.202 released; docs updated | Closes: code-yiu	2026-04-19 12:36:11 +00:00
Viktor Barzin	8a99be1194	[infra] Document HCL import {} block convention [ci skip] ## Context Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}` block pattern (Terraform 1.5+) as the canonical way to bring live cluster / Vault / Cloudflare resources under TF management. Historically the repo has used `terraform import` on the CLI for adoptions. That path has three real problems: 1. Not reviewable — it's an out-of-band state mutation that leaves no trace in git beyond the subsequent `resource {}` block. A reviewer sees only the new resource, not the adoption intent. 2. Not plan-safe — if the resource address or ID is wrong, the CLI path commits the mistake to state before anyone can catch it. 3. Not idempotent — a failed apply mid-import leaves state in a confusing half-adopted shape. `import {}` blocks fix all three: the adoption intent is in the PR diff, `scripts/tg plan` shows the import as its own plan line (mistyped IDs fail before apply), and re-applying after a partial failure just retries the import step. Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands so the reviewer of those imports has the rule in front of them. ## This change - `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks, Not the CLI" section sitting right after Execution. Includes the canonical 5-step workflow (write resource → add import stanza → plan to zero → apply → drop stanza), the reasoning, and a per-provider ID format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1, authentik_provider_proxy, cloudflare_record). - `.claude/CLAUDE.md`: one-line cross-reference at the end of the Terraform State two-tier section pointing back to AGENTS.md. Keeps CLAUDE.md's quick-reference density intact while making sure the rule is reachable from the Claude-instructions path. ## What is NOT in this change - Any actual imports — this is a pure docs landing. Wave 5 will demonstrate the pattern on kured + Calico. 
- Replacing the handful of existing `terraform import`-style adoptions in the repo history — `import {}` blocks are delete-after-apply, so retro-documenting them is not useful. Closes: code-[wave8-task] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:10:05 +00:00
Viktor Barzin	c9d221d578	[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] ## Context Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }` snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2 override that prevents NxDomain search-domain flooding). 27 occurrences across 19 stacks. Without this suppression, every pod-owning resource shows perpetual TF plan drift. The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/` module emitting the ignore-paths list as an output that stacks would consume in their `ignore_changes` blocks. That approach is architecturally impossible: Terraform's `ignore_changes` meta-argument accepts only static attribute paths — it rejects module outputs, locals, variables, and any expression (the HCL spec evaluates `lifecycle` before the regular expression graph). So a DRY module cannot exist. The canonical pattern IS the repeated snippet. What the snippet was missing was a discoverability tag so that (a) new resources can be validated for compliance, (b) the existing 27 sites can be grep'd in a single command, and (c) future maintainers understand the convention rather than each reinventing it. ## This change - Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment. Attached inline on every `spec[0].template[0].spec[0].dns_config` line (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27 existing suppression sites. - Documents the convention with rationale and copy-paste snippets in `AGENTS.md` → new "Kyverno Drift Suppression" section. - Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference the marker and explain why the module approach is blocked. - Updates `_template/main.tf.example` so every new stack starts compliant. 
## What is NOT in this change - The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`) — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker. - Behavioral changes — every `ignore_changes` list is byte-identical save for the inline comment. - The fallback module the original plan anticipated — skipped because Terraform rejects expressions in `ignore_changes`. - `terraform fmt` cleanup on adjacent unrelated blocks in three files (claude-agent-service, freedify/factory, hermes-agent). Reverted to keep this commit scoped to the convention rollout. ## Before / after Before (cannot distinguish accidental-forgotten from intentional-convention): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] } ``` After (greppable, self-documenting, discoverable by tooling): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 } ``` ## Test Plan ### Automated ``` $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='.tf' --include='.tf.example' \ | awk -F: '{s+=$2} END {print s}' 27 $ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l 21 # All code-file diffs are 1 insertion + 1 deletion per marker site, # except beads-server (3), ebooks (4), immich (3), uptime-kuma (2). $ git diff --stat stacks/ | tail -1 20 files changed, 45 insertions(+), 28 deletions(-) ``` ### Manual Verification No apply required — HCL comments only. Zero effect on any stack's plan output. Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new pod-owning resources are added. ## Reproduce locally 1. `cd infra && git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files 3. Grep any new `kubernetes_deployment` for the marker; absence = missing suppression. Closes: code-28m Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:15:51 +00:00
Viktor Barzin	d3bdf87676	[docs] Clarify external-monitor auto-annotation in CLAUDE.md ## Context During a false-alarm investigation of terminal.viktorbarzin.me, an Explore agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in config.tfvars (a legacy fallback list) instead of the ingress_factory auto-annotation. Both [External] monitors for terminal/terminal-ro exist and are active — the original agent just looked in the wrong place. ## This change Expands the Monitoring & Alerting bullet to spell out the mechanism: ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when dns_type != "none", and cloudflare_proxied_names is a legacy fallback for the 17 hostnames not yet migrated. Future agents debugging "is this monitored?" questions should not check cloudflare_proxied_names. ## What is NOT in this change No Terraform, no K8s, no service config. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 11:45:56 +00:00
Viktor Barzin	f538115c43	[dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet ## Context Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only ~35 MB of actual data due to Group Replication overhead (binlog, relay log, GR apply log). The operator enforces GR even with serverInstances=1. Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free container images available. Using official mysql:8.4 image instead. ## This change: - Replace helm_release.mysql_cluster service selector with raw kubernetes_stateful_set_v1 using official mysql:8.4 image - ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2, innodb_doublewrite=ON (re-enabled for standalone safety) - Service selector switched to standalone pod labels - Technitium: disable SQLite query logging (18 GB/day write amplification), keep PostgreSQL-only logging (90-day retention) - Grafana datasource and dashboards migrated from MySQL to PostgreSQL - Dashboard SQL queries fixed for PG integer division (::float cast) - Updated CLAUDE.md service-specific notes ## What is NOT in this change: - InnoDB Cluster + operator removal (Phase 4, 7+ days from now) - Stale Vault role cleanup (Phase 4) - Old PVC deletion (Phase 4) Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:01:06 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf | module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list | cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	dcc96f465e	docs(storage): add encrypted LVM documentation Update storage docs to reflect the 2026-04-15 migration of all sensitive services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow documentation, updated architecture diagram, and storage class decision rules. Files updated: - .claude/CLAUDE.md: storage decision table, encrypted PVC template - docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths - AGENTS.md: storage section with encrypted SC as default for sensitive data Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:00:37 +00:00
Viktor Barzin	d31bbc9a18	docs: update monitoring and backup docs for external monitors and per-db backups - CLAUDE.md: document external monitoring (ExternalAccessDivergence alert, external-monitor-sync CronJob) and per-database backup/restore paths - backup-dr.md: add per-db backup CronJobs to inventory table and daily timeline, update restore runbook references - monitoring.md: add External Monitor Sync component and external monitoring architecture section [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:37:07 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	1ef40daeec	docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip]	2026-04-13 23:05:46 +01:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	b45cee5c4a	docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:16:36 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	6ba4878f3a	docs: update storage architecture for NFS migration to Proxmox host [ci skip]	2026-04-11 17:00:10 +01:00
Viktor Barzin	eec6af6aef	docs: add IPAM/DDNS architecture diagram and update docs - networking.md: Add mermaid diagram showing full device discovery pipeline (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync) - networking.md: Add data flow table, DHCP coverage table - networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM (passive import replaces fping), Technitium (192.168.1.2 in ACL) - CLAUDE.md: Update phpIPAM and networking descriptions [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:42:10 +00:00
Viktor Barzin	8cd8743140	docs: add phpIPAM, Kea DDNS, and DNS sync documentation - networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones, Technitium dynamic update policy - CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section - service-catalog.md: Add phpipam, mark netbox as disabled/replaced [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 16:01:32 +00:00
Viktor Barzin	b345b086ef	update backup/DR docs and runbooks for 3-2-1 architecture - Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree). - Update storage.md: 65 proxmox-lvm PVCs, sda backup tier - Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda - Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths - New runbook: restore-pvc-from-backup.md (file-level restore from sda) - Update CLAUDE.md Storage & Backup section for 3-2-1 architecture	2026-04-06 15:06:01 +03:00
Viktor Barzin	64c378d158	add critical instruction to update docs with every infra change [ci skip]	2026-04-06 13:21:49 +03:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	ad7c0d7fc8	docs: add critical "Terraform Only" rule to CLAUDE.md All infrastructure changes must go through Terraform/Terragrunt. kubectl is read-only except for temporary migration steps. If a resource isn't in Terraform, evaluate adding it before making manual changes.	2026-04-05 19:46:07 +03:00
Viktor Barzin	2d5c55f7b1	docs: add storage class decision rule to CLAUDE.md Default to proxmox-lvm for all new services. NFS only for RWX, backup destinations, or shared media libraries. Updated iSCSI backup section to reflect proxmox-lvm migration.	2026-04-04 16:35:12 +03:00
Viktor Barzin	10f22350c5	exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip] Expanded cloud sync excludes to reduce sync time and Synology disk usage. All excluded data is either regenerable or low-value. TrueNAS Task 1 and incremental script already updated live.	2026-03-29 13:44:32 +03:00
Viktor Barzin	78dec8f0ad	add e2e email roundtrip monitoring CronJob (every 30 min) sends test email via Mailgun API to smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale, EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.	2026-03-25 22:50:22 +02:00
Viktor Barzin	1639910043	ingress latency: add histogram buckets, fix restarts, right-size memory - Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99 - Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts - Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi) - Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi) - ActualBudget: add explicit resources to prevent silent LimitRange defaults - Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)	2026-03-23 10:52:43 +02:00
Viktor Barzin	813f523170	docs: add private registry usage to infra CLAUDE.md [ci skip]	2026-03-23 01:08:57 +02:00
Viktor Barzin	c111799831	remove duplicated agents, update CLAUDE.md references [ci skip] All agents now live globally in ~/.claude/agents/ (shared via dotfiles). Deleted 11 duplicates, moved sev-*/deploy-app to global scope.	2026-03-22 23:44:27 +02:00
Viktor Barzin	1c13af142d	sync regenerated providers.tf + upstream changes - Terragrunt-regenerated providers.tf across stacks (vault_root_token variable removed from root generate block) - Upstream monitoring/openclaw/CLAUDE.md changes from rebase	2026-03-22 02:56:04 +02:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	6239e07dd5	docs: add plotting-book to GHA-migrated list and repo IDs [ci skip]	2026-03-17 23:07:32 +00:00
Viktor Barzin	88abbef7c3	update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip]	2026-03-16 07:10:45 +00:00
Viktor Barzin	b87ba5e778	update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip]	2026-03-15 23:21:52 +00:00
Viktor Barzin	c8069f53c8	update claude knowledge: final ESO migration state [ci skip]	2026-03-15 22:32:46 +00:00
Viktor Barzin	23dfaa1ac8	update claude knowledge: vault-native secrets migration decisions [ci skip]	2026-03-15 21:00:07 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	944d6d3b22	update claude knowledge: resource management learnings from right-sizing session [ci skip]	2026-03-15 15:38:37 +00:00
Viktor Barzin	307b7f6819	update claude knowledge: infra operational learnings from commit history [ci skip] Add resource management patterns, networking resilience, service-specific notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.	2026-03-15 10:46:45 +00:00
Viktor Barzin	0a69af618d	update claude knowledge: vault KV secrets migration [ci skip]	2026-03-15 03:22:07 +00:00
Viktor Barzin	4a27345057	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-15 03:22:07 +00:00
Viktor Barzin	5f71a53b08	add memory-tool instructions to project CLAUDE.md [ci skip] OpenClaw agents read the project-level CLAUDE.md from the workspace. Adding explicit memory-tool CLI instructions here ensures the agent uses exec to call memory-tool instead of looking for non-existent MCP tools (memory_store, memory_recall).	2026-03-15 02:16:03 +00:00
Viktor Barzin	456e2777f5	update claude knowledge: LinuxServer.io container optimization learnings [ci skip]	2026-03-15 02:04:04 +00:00
Viktor Barzin	916aa6c6cb	update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip]	2026-03-14 23:42:17 +00:00
Viktor Barzin	4635d3b826	remember: CrowdSec Helm upgrade timeout [ci skip]	2026-03-14 12:04:07 +00:00

115 commits