infra

Author	SHA1	Message	Date
Viktor Barzin	43fe11fffc	[mailserver] Phase 6 — decommission MetalLB LB path [ci skip] ## Context (bd code-yiu) With Phase 4+5 proven (external mail flows through pfSense HAProxy + PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are obsolete. Phase 6 decommissions them and documents the steady-state architecture. ## This change ### Terraform (stacks/mailserver/modules/mailserver/main.tf) - `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`. - Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation. - Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP). - Port set unchanged — the Service still exposes 25/465/587/993 for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor` CronJob) that hit the stock PROXY-free container listeners. - Inline comment documents the downgrade rationale + companion `mailserver-proxy` NodePort Service that now carries external traffic. ### pfSense (ops, not in git) - `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT rule references it post-Phase-4; keeping it would be misleading dead metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php` companion script (ad-hoc, not checked in — alias is just a Firewall → Aliases → Hosts entry). ### Uptime Kuma (ops) - Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202` → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` / `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy layer liveness). History retained (edit, not delete-recreate). ### Docs - `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten "Current state" section; now reflects steady-state architecture with two-path diagram (external via HAProxy / intra-cluster via ClusterIP). Phase history table marks Phase 6 ✅. Rollback section updated (no one-liner post-Phase-6; need Service-type re-upgrade + alias re-add). - `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound flow, CrowdSec section, Uptime Kuma monitors list, Decisions section (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY v2"), Troubleshooting all updated. - `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph updated with new external path description; references the new runbook. ## What is NOT in this change - Removal of `10.0.20.202` from `cloudflare_proxied_names` or any reserved-IP tracking — wasn't there to begin with. The `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of 19 available after this, confirming `.202` went back to the pool. - Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if someone re-introduces the MetalLB LB (see runbook "Rollback"). ## Test Plan ### Automated (verified pre-commit 2026-04-19) ``` # Service is ClusterIP with no EXTERNAL-IP $ kubectl get svc -n mailserver mailserver mailserver ClusterIP 10.103.108.217 <none> 25/TCP,465/TCP,587/TCP,993/TCP # 10.0.20.202 no longer answers ARP (ping from pfSense) $ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202' 2 packets transmitted, 0 packets received, 100.0% packet loss # MetalLB pool released the IP $ kubectl get ipaddresspool default -n metallb-system \ -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}' 2 of 19 available # E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS $ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver ... Round-trip SUCCESS in 20.3s ... $ kubectl delete job probe-phase6 -n mailserver # pfSense mailserver alias removed $ ssh admin@10.0.20.1 'php -r "..." \| grep mailserver' (no output) ``` ### Manual Verification 1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new hostname `10.0.20.1`. 2. Roundcube login works (`https://mail.viktorbarzin.me/`). 3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in mailserver logs within ~10s. 4. CrowdSec should still see real client IPs in postfix/dovecot parsers (verify with `cscli alerts list` on next auth-fail event). ## Phase history (bd code-yiu) \| Phase \| Status \| Description \| \|---\|---\|---\| \| 1a \| ✅ ``ef75c02f`` \| k8s alt :2525 listener + NodePort Service \| \| 2 \| ✅ 2026-04-19 \| pfSense HAProxy pkg installed \| \| 3 \| ✅ ``ba697b02`` \| HAProxy config persisted in pfSense XML \| \| 4+5 \| ✅ ``9806d515`` \| 4-port alt listeners + HAProxy frontends + NAT flip \| \| 6 \| ✅ this commit \| MetalLB LB retired; 10.0.20.202 released; docs updated \| Closes: code-yiu	2026-04-19 12:36:11 +00:00
Viktor Barzin	973f549810	[payslip-ingest] Update extractor agent + dashboard for v2 regex parser ## Context Companion change to payslip-ingest v2 (regex parser + accurate RSU tax attribution). The Grafana dashboard now has 4 more panels powered by the new earnings-decomposition and YTD-snapshot columns, and the Claude fallback agent's prompt is aligned with the new schema so non-Meta payslips still land with the full field set. ## This change ### `.claude/agents/payslip-extractor.md` Rewrites the RSU handling section to match Meta UK's actual template (rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead). Adds a new "Earnings decomposition (v2)" section telling the fallback agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_* and when to use pension_employee vs pension_sacrifice without double-counting. ### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json` - Panel 4 (Effective rate) — SQL switched from the naive `(income_tax + NIC) / cash_gross` to the YTD-effective-rate method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid / ytd_taxable_pay)`. Title updated to "YTD-corrected" so the change is discoverable. - Panel 5 (Table) — adds salary, bonus, pension_sacrifice, taxable_pay columns so row-level debugging against the parser output is trivial. - +Panel 8 (Earnings breakdown) — monthly stacked bars of salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice months show up as a massive negative pension_sacrifice spike paired with a near-zero bonus bar. - +Panel 9 (Accurate cash tax rate) — timeseries of cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU contribution the payslip hides in the single `Tax paid` line. - +Panel 10 (All-in compensation) — stacked bars of cash_gross + rsu_vest per payslip. - +Panel 11 (YTD cumulative cash gross vs total comp) — two lines partitioned by tax_year; the gap between them is the RSU contribution YTD. Total panels go from 7 → 11. ## Test Plan ### Automated Dashboard JSON validity: ``` $ python3 -m json.tool uk-payslip.json > /dev/null && echo ok ok ``` ### Manual Verification After applying `stacks/monitoring/`: 1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels 2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the negative pension_sacrifice bar in panel 8 3. Panel 9 "Accurate cash effective tax rate" shows the cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in RSU-vest months ## Reproduce locally 1. `cd infra/stacks/monitoring && terragrunt plan` 2. Expected: ConfigMap diff on the payslip dashboard with the new panel JSON 3. `terragrunt apply` — Grafana reloads the dashboard automatically (configmap-reload sidecar) Relates to: payslip-ingest commit 9741816 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:54:33 +00:00
Viktor Barzin	238a3f14c9	[payslip-extractor] Add RSU handling section Document what RSU vest / RSU offset look like on Meta UK payslips and tell the agent to populate rsu_vest + rsu_offset fields (new in the payslip-ingest schema) rather than rolling them into gross_pay.	2026-04-18 23:37:33 +00:00
Viktor Barzin	eee694c915	[payslip-extractor] Add PAYSLIP_TEXT fast path payslip-ingest now runs pdftotext locally before calling claude-agent-service, shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT (fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext fails).	2026-04-18 22:48:07 +00:00
Viktor Barzin	8a99be1194	[infra] Document HCL import {} block convention [ci skip] ## Context Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}` block pattern (Terraform 1.5+) as the canonical way to bring live cluster / Vault / Cloudflare resources under TF management. Historically the repo has used `terraform import` on the CLI for adoptions. That path has three real problems: 1. Not reviewable — it's an out-of-band state mutation that leaves no trace in git beyond the subsequent `resource {}` block. A reviewer sees only the new resource, not the adoption intent. 2. Not plan-safe — if the resource address or ID is wrong, the CLI path commits the mistake to state before anyone can catch it. 3. Not idempotent — a failed apply mid-import leaves state in a confusing half-adopted shape. `import {}` blocks fix all three: the adoption intent is in the PR diff, `scripts/tg plan` shows the import as its own plan line (mistyped IDs fail before apply), and re-applying after a partial failure just retries the import step. Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands so the reviewer of those imports has the rule in front of them. ## This change - `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks, Not the CLI" section sitting right after Execution. Includes the canonical 5-step workflow (write resource → add import stanza → plan to zero → apply → drop stanza), the reasoning, and a per-provider ID format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1, authentik_provider_proxy, cloudflare_record). - `.claude/CLAUDE.md`: one-line cross-reference at the end of the Terraform State two-tier section pointing back to AGENTS.md. Keeps CLAUDE.md's quick-reference density intact while making sure the rule is reachable from the Claude-instructions path. ## What is NOT in this change - Any actual imports — this is a pure docs landing. Wave 5 will demonstrate the pattern on kured + Calico. - Replacing the handful of existing `terraform import`-style adoptions in the repo history — `import {}` blocks are delete-after-apply, so retro-documenting them is not useful. Closes: code-[wave8-task] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:10:05 +00:00
Viktor Barzin	43b4e1d372	[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role ## Context New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`) needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana datasource, a dashboard, and a Claude agent definition for PDF extraction. Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace. No ingress, no TLS cert, no DNS record. ## What ### New stack `stacks/payslip-ingest/` - `kubernetes_namespace` payslip-ingest, tier=aux. - ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN, WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`. - ExternalSecret (vault-database) reads rotating password from `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`. - Deployment: single replica, Recreate strategy (matches single-worker queue design), `wait-for postgresql.dbaas:5432` annotation, init container runs `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno dns_config lifecycle ignore. - ClusterIP Service :8080. - Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`, uid `payslips-pg`) reading password from the db-creds K8s Secret. ### Grafana dashboard `uk-payslip.json` (4 panels) - Monthly gross/net/tax/NI (timeseries, currencyGBP). - YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140. - Deductions breakdown (stacked bars). - Effective rate + take-home % (timeseries, percent). ### Vault DB role `pg-payslip-ingest` - Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`. - New `vault_database_secret_backend_static_role.pg_payslip_ingest` (username `payslip_ingest`, 7d rotation). ### DBaaS — DB + role creation - New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`: idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into `pg-cluster-1`. ### Claude agent `.claude/agents/payslip-extractor.md` - Haiku-backed agent invoked by `claude-agent-service`. - Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single JSON object matching the schema to stdout. No network, no file writes outside /tmp, no markdown fences. ## Trade-offs / decisions - Own DB per service (convention), NOT a schema in a shared `app` DB as the plan initially described. The Alembic migration still creates a `payslip_ingest` schema inside the `payslip_ingest` DB for table organisation. - Paperless URL uses port 80 (the Service port), not 8000 (the pod target port). - Grafana datasource uses the primary RW user — separate `_ro` role is aspirational and not yet a pattern in this repo. - No ingress — webhook is cluster-internal; external exposure is unnecessary attack surface. - No Uptime Kuma monitor yet: the internal-monitor list is a static block in `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor auto-creator). ## Test Plan ### Automated ``` terraform init -backend=false && terraform validate Success! The configuration is valid. terraform fmt -check -recursive (exit 0) python3 -c "import json; json.load(open('uk-payslip.json'))" (exit 0) ``` ### Manual Verification (post-merge) Prerequisites: 1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`. 2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`. Apply: 3. `scripts/tg apply vault` → creates pg-payslip-ingest static role. 4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role. 5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret` (first-apply ESO bootstrap). 6. `scripts/tg apply payslip-ingest` (full). 7. `kubectl -n payslip-ingest get pods` → Running 1/1. 8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200. End-to-end: 9. Configure Paperless workflow (README in code repo has steps). 10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s. 11. Grafana → Dashboards → UK Payslip → 4 panels render. Closes: code-do7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:07:05 +00:00
Viktor Barzin	c9d221d578	[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] ## Context Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }` snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2 override that prevents NxDomain search-domain flooding). 27 occurrences across 19 stacks. Without this suppression, every pod-owning resource shows perpetual TF plan drift. The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/` module emitting the ignore-paths list as an output that stacks would consume in their `ignore_changes` blocks. That approach is architecturally impossible: Terraform's `ignore_changes` meta-argument accepts only static attribute paths — it rejects module outputs, locals, variables, and any expression (the HCL spec evaluates `lifecycle` before the regular expression graph). So a DRY module cannot exist. The canonical pattern IS the repeated snippet. What the snippet was missing was a discoverability tag so that (a) new resources can be validated for compliance, (b) the existing 27 sites can be grep'd in a single command, and (c) future maintainers understand the convention rather than each reinventing it. ## This change - Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment. Attached inline on every `spec[0].template[0].spec[0].dns_config` line (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27 existing suppression sites. - Documents the convention with rationale and copy-paste snippets in `AGENTS.md` → new "Kyverno Drift Suppression" section. - Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference the marker and explain why the module approach is blocked. - Updates `_template/main.tf.example` so every new stack starts compliant. ## What is NOT in this change - The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`) — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker. - Behavioral changes — every `ignore_changes` list is byte-identical save for the inline comment. - The fallback module the original plan anticipated — skipped because Terraform rejects expressions in `ignore_changes`. - `terraform fmt` cleanup on adjacent unrelated blocks in three files (claude-agent-service, freedify/factory, hermes-agent). Reverted to keep this commit scoped to the convention rollout. ## Before / after Before (cannot distinguish accidental-forgotten from intentional-convention): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] } ``` After (greppable, self-documenting, discoverable by tooling): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 } ``` ## Test Plan ### Automated ``` $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='.tf' --include='.tf.example' \ \| awk -F: '{s+=$2} END {print s}' 27 $ git diff --stat \| grep -E '\.(tf\|tf\.example\|md)$' \| wc -l 21 # All code-file diffs are 1 insertion + 1 deletion per marker site, # except beads-server (3), ebooks (4), immich (3), uptime-kuma (2). $ git diff --stat stacks/ \| tail -1 20 files changed, 45 insertions(+), 28 deletions(-) ``` ### Manual Verification No apply required — HCL comments only. Zero effect on any stack's plan output. Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ \| wc -l` must grow as new pod-owning resources are added. ## Reproduce locally 1. `cd infra && git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files 3. Grep any new `kubernetes_deployment` for the marker; absence = missing suppression. Closes: code-28m Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:15:51 +00:00
Viktor Barzin	82b7866bc9	[claude-agent-service] Remove orphaned DevVM SSH key wiring ## Context The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run `claude -p` was fully migrated to the in-cluster service `claude-agent-service.claude-agent.svc:8080/execute` in commits `42f1c3cf` and `99180bec` (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker + scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed zero remaining SSH+claude sites. This commit removes two cleanup artifacts left behind by that migration. ## This change 1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived skill doc for the obsolete SSH-based pattern. Already in `archived/`, harmless but noise; deleting prevents anyone copy-pasting the old approach. 2. Removes `kubernetes_secret.ssh_key` from `stacks/claude-agent-service/main.tf`. The Secret was created from the `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted into the agent pod. The pod's `git-init` init container uses HTTPS + `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:` and `https://github.com/` URL via `git config url.insteadOf`, so no downstream `git` invocation could fall through to SSH even if it tried. 3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block — the SSH key resource was its only consumer. ## What is NOT in this change - The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place. Removing it requires read/modify/put of the full secret and the upside is one unused Vault key. Not worth it without strong justification. - DevVM host decommission is out of scope (separate audit needed for non-Claude users of the host). - Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment) left untouched per no-adjacent-refactor rule. ## Test plan ### Automated - `terraform fmt -check stacks/claude-agent-service/main.tf` — only the pre-existing lines 464-505 are flagged; no new fmt warnings introduced by these deletions. ### Manual verification 1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply` 2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`. The `ci_secrets` data source removal is plan-time only; does not appear in resource counts. 3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`. 4. `kubectl -n claude-agent get pod` → both pods Running, no restart events. 5. Submit a synthetic agent job via HTTP API to confirm pipeline still works: curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute with a minimal prompt; expect job completes with `exit_code=0`. Closes: code-bck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:31:15 +00:00
Viktor Barzin	50e8184d99	[uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob ## Context Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via the `uptime-kuma-api` Python library when the dbaas stack migrated from InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older backup, the monitor would be lost. ## This change Adds declarative, self-healing management for internal-service monitors (databases, non-HTTP endpoints) that can't be discovered from ingress annotations. Modelled on the existing `external-monitor-sync` CronJob. - `local.internal_monitors` — list of desired monitors (name, type, connection string, Vault password key, interval, retries). Seeded with the MySQL Standalone monitor. Add new entries here to manage more. - `kubernetes_secret.internal_monitor_sync` — pulls admin password and all referenced DB passwords from Vault `secret/viktor` at apply time. Secret key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`). - `kubernetes_config_map_v1.internal_monitor_targets` — renders the target list to JSON for the sync container. - `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min, looks up monitors by name, creates if missing, patches if drifted, leaves id and history untouched when already in desired state. ## Why this approach (Option B, not a Terraform provider) The `louislam/uptime-kuma` Terraform provider does NOT exist in the public registry (verified — only a CLI tool of the same name). Option A from the task brief was therefore unavailable. Option B (idempotent K8s CronJob) matches the established pattern in the same module for `external-monitor-sync` — no new machinery introduced. ## Monitor 663: no-op on first sync Manual import was not possible (no provider → no state to import). The sync job correctly identifies the existing monitor by name and reports: Monitor MySQL Standalone (dbaas) (id=663) already in desired state Internal monitor sync complete DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and `Rows: 1` responses every 60s — no disruption. ## Vault key — left manual (by design) `secret/viktor` is not Terraform-managed anywhere in the repo (only read via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding 135 keys. The `uptimekuma_db_password` key was added manually yesterday; this change does NOT codify it. Codifying the whole `secret/viktor` entry is out of scope for this task (would need a separate migration + rotation story). The sync job reads the existing value at apply time — so if the value is ever rotated in Vault, the next sync picks it up. ## Plan + apply Plan: 3 to add, 0 to change, 0 to destroy. Apply complete! Resources: 3 added, 0 changed, 0 destroyed. Re-plan: No changes. Your infrastructure matches the configuration. Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern. Closes: code-ed2	2026-04-18 12:04:17 +00:00
Viktor Barzin	d3bdf87676	[docs] Clarify external-monitor auto-annotation in CLAUDE.md ## Context During a false-alarm investigation of terminal.viktorbarzin.me, an Explore agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in config.tfvars (a legacy fallback list) instead of the ingress_factory auto-annotation. Both [External] monitors for terminal/terminal-ro exist and are active — the original agent just looked in the wrong place. ## This change Expands the Monitoring & Alerting bullet to spell out the mechanism: ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when dns_type != "none", and cloudflare_proxied_names is a legacy fallback for the 17 hostnames not yet migrated. Future agents debugging "is this monitored?" questions should not check cloudflare_proxied_names. ## What is NOT in this change No Terraform, no K8s, no service config. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 11:45:56 +00:00
Viktor Barzin	65b0f30d5e	[docs] Update anti-AI and rybbit docs after rewrite-body removal - Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit) - Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible - Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter) - strip-accept-encoding middleware removed from all references Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 21:43:13 +00:00
Viktor Barzin	5e9e487661	feat(setup-project): auto-PR working Dockerfiles back to upstream ## Context The setup-project skill treats "build from a Dockerfile" as priority 6 — "last resort, avoid if possible" — with no formalized path for apps whose upstream lacks a working Dockerfile. When we end up writing one to get the deploy green, that Dockerfile stays private in the infra repo and upstream never benefits. ## This change Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against the upstream repo so the self-hosting community gets the working recipe. Flow: 1. Classify dockerfile_state during research phase (image-used / used-as-is / fixed-broken-upstream / written-from-scratch). Persist to modules/kubernetes/<service>/.contribution-state.json. 2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready + HTTP 200 every 30s x 20 iterations, requires 18/20 successes. 3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the GitHub API dance: fork → merge-upstream → branch → commit Dockerfile / .dockerignore / BUILD.md via Contents API → open PR with body rendered from templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork, existing branch, open PR, upstream landed a Dockerfile mid-deploy). GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token pulled from Vault (`secret/viktor` → `github_pat`). Commits include Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile` for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with timestamp suffix on collision. ## Files - SKILL.md — state classification table, quality bar checklist, §8b stability gate, §10 contribute-upstream step, checklist updates - scripts/stability-gate.sh — 10-minute health probe - scripts/contribute-dockerfile.sh — GitHub API orchestrator - templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description - templates/Dockerfile.README.md — BUILD.md template shipped with the PR ## What is NOT in this change - No Woodpecker / GHA changes (skill-local flow). - No auto-tracking of merge/reject outcomes upstream (manual follow-up). - Not yet exercised end-to-end; first real-world run will validate the API dance. Plan to dry-run against a throwaway sink repo before pointing at a real upstream. ## Test Plan ### Automated - bash -n on both scripts → pass - Manual read-through of SKILL.md — step numbering coherent, existing §1-9 untouched semantics, new §8b/§10 reference real files ### Manual Verification 1. Next time setup-project onboards a Dockerfile-less app: - Confirm .contribution-state.json is written with `written-from-scratch` - Run stability-gate.sh — expect 18/20 passes on a healthy deploy - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin - Verify contribution_pr_url is back-written to the state file 2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent) 3. Upstream-archived case: manually archive a test upstream → re-run → expect SKIP, no PR created [ci skip] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 18:12:13 +00:00
Viktor Barzin	26abd8fe94	[skill] Add /disk-wear skill for periodic disk write analysis ## Context After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day of disk writes, this methodology should be reusable for periodic health reviews. ## This change: Adds `/disk-wear` skill that combines three data sources: - SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health - Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total joined with node_disk_device_mapper_info for dm->LVM mapping) - kubectl for PVC UUID -> pod/namespace mapping Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC. Includes baselines, red flag detection, and annualized wear projections. Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track block device writes per container), so per-app attribution uses the PVE host's dm-device level metrics mapped through Prometheus and kubectl. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 11:15:26 +00:00
Viktor Barzin	f538115c43	[dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet ## Context Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only ~35 MB of actual data due to Group Replication overhead (binlog, relay log, GR apply log). The operator enforces GR even with serverInstances=1. Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free container images available. Using official mysql:8.4 image instead. ## This change: - Replace helm_release.mysql_cluster service selector with raw kubernetes_stateful_set_v1 using official mysql:8.4 image - ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2, innodb_doublewrite=ON (re-enabled for standalone safety) - Service selector switched to standalone pod labels - Technitium: disable SQLite query logging (18 GB/day write amplification), keep PostgreSQL-only logging (90-day retention) - Grafana datasource and dashboards migrated from MySQL to PostgreSQL - Dashboard SQL queries fixed for PG integer division (::float cast) - Updated CLAUDE.md service-specific notes ## What is NOT in this change: - InnoDB Cluster + operator removal (Phase 4, 7+ days from now) - Stale Vault role cleanup (Phase 4) - Old PVC deletion (Phase 4) Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:01:06 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	c33f597111	feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN Pipeline: DIUN detects new image versions every 6h → webhook to n8n → n8n filters (skip databases/custom/infra/:latest) and rate-limits (max 5/6h) → SSH to dev VM → claude -p runs upgrade agent. Agent workflow: resolve GitHub repo → fetch changelogs → classify risk (SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push → wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on failure. Changes: - stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict - stacks/n8n/workflows: version-controlled n8n workflow backup - docs/architecture/automated-upgrades.md: full pipeline documentation - AGENTS.md: add upgrade agent section - service-catalog.md: update DIUN description Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:38:27 +00:00
Viktor Barzin	dcc96f465e	docs(storage): add encrypted LVM documentation Update storage docs to reflect the 2026-04-15 migration of all sensitive services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow documentation, updated architecture diagram, and storage class decision rules. Files updated: - .claude/CLAUDE.md: storage decision table, encrypted PVC template - docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths - AGENTS.md: storage section with encrypted SC as default for sensitive data Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:00:37 +00:00
Viktor Barzin	7bb9ec2934	Add agent task tracking documentation Documents the centralized Beads/Dolt task tracking system used by all Claude Code sessions. Covers architecture, session lifecycle, settings hierarchy, known issues, and E2E test verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:11:26 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	d31bbc9a18	docs: update monitoring and backup docs for external monitors and per-db backups - CLAUDE.md: document external monitoring (ExternalAccessDivergence alert, external-monitor-sync CronJob) and per-database backup/restore paths - backup-dr.md: add per-db backup CronJobs to inventory table and daily timeline, update restore runbook references - monitoring.md: add External Monitor Sync component and external monitoring architecture section [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:37:07 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	bdba15a387	docs: move post-mortems to docs/post-mortems/ Consolidate all outage reports under docs/ for better discoverability. Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/ (repo documentation). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:20:09 +00:00
Viktor Barzin	68c8c5b4a0	fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory mounts from k8s (NFSv4 pseudo-root path resolution). Combined with lockd failure, both NFSv4 and NFSv3 mount paths broken. Cascaded into DNS primary, Vault (2/3 pods), Alertmanager, 20+ services. Changes: - Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted - Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted - Removed NFS module dependency from technitium stack - Added full post-mortem with prevention plan [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:18:59 +00:00
Viktor Barzin	1ef40daeec	docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip]	2026-04-13 23:05:46 +01:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	b45cee5c4a	docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:16:36 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	c740ed1301	docs: update Technitium DNS docs after cache optimization - Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md) - Fix PDB minAvailable: 1 → 2 (networking.md) - Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs - Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only - Update config storage: was generic PVC, now NFS path	2026-04-12 18:29:25 +01:00
Viktor Barzin	cc670d949c	docs: add ha-sofia Version Control add-on to HA skill [ci skip] HomeAssistantVersionControl v1.2.0 installed on ha-sofia for git-based config tracking. Auto-commits on file change, pushes hourly to private GitHub repo ViktorBarzin/ha-sofia-config.	2026-04-12 11:37:02 +01:00
Viktor Barzin	6ba4878f3a	docs: update storage architecture for NFS migration to Proxmox host [ci skip]	2026-04-11 17:00:10 +01:00
Viktor Barzin	eec6af6aef	docs: add IPAM/DDNS architecture diagram and update docs - networking.md: Add mermaid diagram showing full device discovery pipeline (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync) - networking.md: Add data flow table, DHCP coverage table - networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM (passive import replaces fping), Technitium (192.168.1.2 in ACL) - CLAUDE.md: Update phpIPAM and networking descriptions [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:42:10 +00:00
Viktor Barzin	8cd8743140	docs: add phpIPAM, Kea DDNS, and DNS sync documentation - networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones, Technitium dynamic update policy - CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section - service-catalog.md: Add phpipam, mark netbox as disabled/replaced [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 16:01:32 +00:00
Viktor Barzin	1c7998163d	extend backup-verify.sh with full 3-2-1 health inspection 18 checks covering all backup layers: - Layer 1: LVM snapshot freshness/status/count, thin pool free, timer - Layer 2: Weekly backup freshness/status/timer, sda mount/usage, PVC data/NFS mirror/pfsense freshness - Layer 3: Offsite sync freshness/status/timer - DB CronJobs: age check with per-service thresholds - CNPG backups: existing check preserved New --fix flag for conservative auto-remediation: - Re-enable disabled timers - Clear stale lockfiles - Mount /mnt/backup if unmounted	2026-04-09 22:48:49 +01:00
Viktor Barzin	fea8519f51	update VPN architecture docs and Authentik state reference - vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0 interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device names and subnets for London/Valchedrym) - authentik-state.md: Document brute-force-protection policy unbinding fix that was blocking all unauthenticated users from login flows [ci skip]	2026-04-06 16:26:21 +03:00
Viktor Barzin	b345b086ef	update backup/DR docs and runbooks for 3-2-1 architecture - Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree). - Update storage.md: 65 proxmox-lvm PVCs, sda backup tier - Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda - Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths - New runbook: restore-pvc-from-backup.md (file-level restore from sda) - Update CLAUDE.md Storage & Backup section for 3-2-1 architecture	2026-04-06 15:06:01 +03:00
Viktor Barzin	64c378d158	add critical instruction to update docs with every infra change [ci skip]	2026-04-06 13:21:49 +03:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	ad7c0d7fc8	docs: add critical "Terraform Only" rule to CLAUDE.md All infrastructure changes must go through Terraform/Terragrunt. kubectl is read-only except for temporary migration steps. If a resource isn't in Terraform, evaluate adding it before making manual changes.	2026-04-05 19:46:07 +03:00
Viktor Barzin	2d5c55f7b1	docs: add storage class decision rule to CLAUDE.md Default to proxmox-lvm for all new services. NFS only for RWX, backup destinations, or shared media libraries. Updated iSCSI backup section to reflect proxmox-lvm migration.	2026-04-04 16:35:12 +03:00
Viktor Barzin	2d8aa5ed89	docs: update hardware inventory for R730 RAM upgrade to 272GB Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400. Added physical DIMM slot diagram, channel layout, and BIOS speed override notes. Updated compute architecture with correct CPU (single socket), VM memory values, and capacity figures.	2026-04-02 00:48:13 +03:00
Viktor Barzin	10f22350c5	exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip] Expanded cloud sync excludes to reduce sync time and Synology disk usage. All excluded data is either regenerable or low-value. TrueNAS Task 1 and incremental script already updated live.	2026-03-29 13:44:32 +03:00
Viktor Barzin	78dec8f0ad	add e2e email roundtrip monitoring CronJob (every 30 min) sends test email via Mailgun API to smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale, EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.	2026-03-25 22:50:22 +02:00
Viktor Barzin	1639910043	ingress latency: add histogram buckets, fix restarts, right-size memory - Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99 - Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts - Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi) - Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi) - ActualBudget: add explicit resources to prevent silent LimitRange defaults - Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)	2026-03-23 10:52:43 +02:00
Viktor Barzin	a1b4a2106d	add NFS/CSI cascade failure post-mortem [ci skip]	2026-03-23 02:25:11 +02:00
Viktor Barzin	813f523170	docs: add private registry usage to infra CLAUDE.md [ci skip]	2026-03-23 01:08:57 +02:00

1 2 3 4

194 commits