Commit graph

194 commits

Author SHA1 Message Date
Viktor Barzin
43fe11fffc [mailserver] Phase 6 — decommission MetalLB LB path [ci skip]
## Context (bd code-yiu)

With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.

## This change

### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
  intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
  CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
  `mailserver-proxy` NodePort Service that now carries external traffic.

### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
  rule references it post-Phase-4; keeping it would be misleading dead
  metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php`
  companion script (ad-hoc, not checked in — alias is just a
  Firewall → Aliases → Hosts entry).

### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
  → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
  `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
  layer liveness). History retained (edit, not delete-recreate).

### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
  "Current state" section; now reflects steady-state architecture with
  two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
  Phase history table marks Phase 6 . Rollback section updated (no
  one-liner post-Phase-6; need Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
  flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
  (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
  v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
  updated with new external path description; references the new runbook.

## What is NOT in this change

- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
  reserved-IP tracking — wasn't there to begin with. The
  `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of
  19 available after this, confirming `.202` went back to the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
  someone re-introduces the MetalLB LB (see runbook "Rollback").

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver   ClusterIP   10.103.108.217   <none>   25/TCP,465/TCP,587/TCP,993/TCP

# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss

# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
    -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available

# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver

# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```

### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
   hostname `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
   `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
   mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
   (verify with `cscli alerts list` on next auth-fail event).

## Phase history (bd code-yiu)

| Phase | Status | Description |
|---|---|---|
| 1a  |  `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2   |  2026-04-19 | pfSense HAProxy pkg installed |
| 3   |  `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 |  `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6   |  **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |

Closes: code-yiu
2026-04-19 12:36:11 +00:00
Viktor Barzin
973f549810 [payslip-ingest] Update extractor agent + dashboard for v2 regex parser
## Context

Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.

## This change

### `.claude/agents/payslip-extractor.md`

Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.

### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`

- **Panel 4 (Effective rate)** — SQL switched from the naive
  `(income_tax + NIC) / cash_gross` to the YTD-effective-rate
  method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
  ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
  change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
  taxable_pay columns so row-level debugging against the parser
  output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
  salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
  months show up as a massive negative pension_sacrifice spike
  paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
  cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
  contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
  + rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
  lines partitioned by tax_year; the gap between them is the RSU
  contribution YTD.

Total panels go from 7 → 11.

## Test Plan

### Automated

Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```

### Manual Verification

After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
   negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
   cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
   RSU-vest months

## Reproduce locally

1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
   JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
   (configmap-reload sidecar)

Relates to: payslip-ingest commit 9741816

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:54:33 +00:00
238a3f14c9 [payslip-extractor] Add RSU handling section
Document what RSU vest / RSU offset look like on Meta UK payslips and
tell the agent to populate rsu_vest + rsu_offset fields (new in the
payslip-ingest schema) rather than rolling them into gross_pay.
2026-04-18 23:37:33 +00:00
eee694c915 [payslip-extractor] Add PAYSLIP_TEXT fast path
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
2026-04-18 22:48:07 +00:00
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.

Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
43b4e1d372 [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context

New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.

Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.

## What

### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
  WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
  `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
  `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
  design), `wait-for postgresql.dbaas:5432` annotation, init container runs
  `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
  dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
  uid `payslips-pg`) reading password from the db-creds K8s Secret.

### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).

### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
  (username `payslip_ingest`, 7d rotation).

### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
  idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
  `pg-cluster-1`.

### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
  JSON object matching the schema to stdout. No network, no file writes outside /tmp,
  no markdown fences.

## Trade-offs / decisions

- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
  initially described. The Alembic migration still creates a `payslip_ingest`
  schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
  and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
  surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
  `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
  auto-creator).

## Test Plan

### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.

terraform fmt -check -recursive
(exit 0)

python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```

### Manual Verification (post-merge)

Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.

Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
   (first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.

End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.

Closes: code-do7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting prevents anyone copy-pasting the old approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via HTTP API to confirm pipeline still works:
   curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
   with a minimal prompt; expect job completes with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
50e8184d99 [uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob
## Context

Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.

## This change

Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.

- `local.internal_monitors` — list of desired monitors (name, type,
  connection string, Vault password key, interval, retries). Seeded with
  the MySQL Standalone monitor. Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls admin password and all
  referenced DB passwords from Vault `secret/viktor` at apply time. Secret
  key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
  list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
  looks up monitors by name, creates if missing, patches if drifted,
  leaves id and history untouched when already in desired state.

## Why this approach (Option B, not a Terraform provider)

The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.

## Monitor 663: no-op on first sync

Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:

  Monitor MySQL Standalone (dbaas) (id=663) already in desired state
  Internal monitor sync complete

DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.

## Vault key — left manual (by design)

`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The sync job reads the existing value at apply time — so if the
value is ever rotated in Vault, the next sync picks it up.

## Plan + apply

  Plan: 3 to add, 0 to change, 0 to destroy.
  Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
  Re-plan: No changes. Your infrastructure matches the configuration.

Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.

Closes: code-ed2
2026-04-18 12:04:17 +00:00
Viktor Barzin
d3bdf87676 [docs] Clarify external-monitor auto-annotation in CLAUDE.md
## Context
During a false-alarm investigation of terminal.viktorbarzin.me, an Explore
agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in
config.tfvars (a legacy fallback list) instead of the ingress_factory
auto-annotation. Both [External] monitors for terminal/terminal-ro exist and
are active — the original agent just looked in the wrong place.

## This change
Expands the Monitoring & Alerting bullet to spell out the mechanism:
ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when
dns_type != "none", and cloudflare_proxied_names is a legacy fallback for
the 17 hostnames not yet migrated. Future agents debugging "is this
monitored?" questions should not check cloudflare_proxied_names.

## What is NOT in this change
No Terraform, no K8s, no service config. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:45:56 +00:00
Viktor Barzin
65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal
- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:43:13 +00:00
Viktor Barzin
5e9e487661 feat(setup-project): auto-PR working Dockerfiles back to upstream
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.

## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.

Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
   fixed-broken-upstream / written-from-scratch). Persist to
   modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
   HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
   GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
   .dockerignore / BUILD.md via Contents API → open PR with body rendered from
   templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
   existing branch, open PR, upstream landed a Dockerfile mid-deploy).

GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.

## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
  gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR

## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
  dance. Plan to dry-run against a throwaway sink repo before pointing at a
  real upstream.

## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
  §1-9 untouched semantics, new §8b/§10 reference real files

### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
   - Confirm .contribution-state.json is written with `written-from-scratch`
   - Run stability-gate.sh — expect 18/20 passes on a healthy deploy
   - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
   - Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
   expect SKIP, no PR created

[ci skip]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:13 +00:00
Viktor Barzin
26abd8fe94 [skill] Add /disk-wear skill for periodic disk write analysis
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.

## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
  joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping

Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.

Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:15:26 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
c33f597111 feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN
Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:38:27 +00:00
Viktor Barzin
dcc96f465e docs(storage): add encrypted LVM documentation
Update storage docs to reflect the 2026-04-15 migration of all sensitive
services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow
documentation, updated architecture diagram, and storage class decision
rules.

Files updated:
- .claude/CLAUDE.md: storage decision table, encrypted PVC template
- docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths
- AGENTS.md: storage section with encrypted SC as default for sensitive data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:00:37 +00:00
Viktor Barzin
7bb9ec2934 Add agent task tracking documentation
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:11:26 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00
Viktor Barzin
24a23709a5 fix: update healthcheck to report internal and external monitors separately
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:44:20 +00:00
Viktor Barzin
4498f61402 fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with
  detailed comments explaining fsid=0 danger and NFSv3 disable rationale.
  Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
- scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts.
  Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running,
  no active exports. Warns but doesn't abort (block-storage PVC backups can still run).
- .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory,
  fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:08:24 +00:00
Viktor Barzin
8badb8181a feat: post-mortem automation pipeline
E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00
Viktor Barzin
68c8c5b4a0 fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem
SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory
mounts from k8s (NFSv4 pseudo-root path resolution). Combined with
lockd failure, both NFSv4 and NFSv3 mount paths broken. Cascaded
into DNS primary, Vault (2/3 pods), Alertmanager, 20+ services.

Changes:
- Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted
- Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted
- Removed NFS module dependency from technitium stack
- Added full post-mortem with prevention plan

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:18:59 +00:00
Viktor Barzin
1ef40daeec docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip] 2026-04-13 23:05:46 +01:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
c740ed1301 docs: update Technitium DNS docs after cache optimization
- Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md)
- Fix PDB minAvailable: 1 → 2 (networking.md)
- Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs
- Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only
- Update config storage: was generic PVC, now NFS path
2026-04-12 18:29:25 +01:00
Viktor Barzin
cc670d949c docs: add ha-sofia Version Control add-on to HA skill [ci skip]
HomeAssistantVersionControl v1.2.0 installed on ha-sofia for git-based
config tracking. Auto-commits on file change, pushes hourly to private
GitHub repo ViktorBarzin/ha-sofia-config.
2026-04-12 11:37:02 +01:00
Viktor Barzin
6ba4878f3a docs: update storage architecture for NFS migration to Proxmox host [ci skip] 2026-04-11 17:00:10 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
1c7998163d extend backup-verify.sh with full 3-2-1 health inspection
18 checks covering all backup layers:
- Layer 1: LVM snapshot freshness/status/count, thin pool free, timer
- Layer 2: Weekly backup freshness/status/timer, sda mount/usage,
  PVC data/NFS mirror/pfsense freshness
- Layer 3: Offsite sync freshness/status/timer
- DB CronJobs: age check with per-service thresholds
- CNPG backups: existing check preserved

New --fix flag for conservative auto-remediation:
- Re-enable disabled timers
- Clear stale lockfiles
- Mount /mnt/backup if unmounted
2026-04-09 22:48:49 +01:00
Viktor Barzin
fea8519f51 update VPN architecture docs and Authentik state reference
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
  interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
  names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
  that was blocking all unauthenticated users from login flows

[ci skip]
2026-04-06 16:26:21 +03:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
64c378d158 add critical instruction to update docs with every infra change [ci skip] 2026-04-06 13:21:49 +03:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
ad7c0d7fc8 docs: add critical "Terraform Only" rule to CLAUDE.md
All infrastructure changes must go through Terraform/Terragrunt.
kubectl is read-only except for temporary migration steps.
If a resource isn't in Terraform, evaluate adding it before
making manual changes.
2026-04-05 19:46:07 +03:00
Viktor Barzin
2d5c55f7b1 docs: add storage class decision rule to CLAUDE.md
Default to proxmox-lvm for all new services. NFS only for RWX,
backup destinations, or shared media libraries. Updated iSCSI
backup section to reflect proxmox-lvm migration.
2026-04-04 16:35:12 +03:00
Viktor Barzin
2d8aa5ed89 docs: update hardware inventory for R730 RAM upgrade to 272GB
Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400.
Added physical DIMM slot diagram, channel layout, and BIOS speed override
notes. Updated compute architecture with correct CPU (single socket),
VM memory values, and capacity figures.
2026-04-02 00:48:13 +03:00
Viktor Barzin
10f22350c5 exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip]
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.
2026-03-29 13:44:32 +03:00
Viktor Barzin
78dec8f0ad add e2e email roundtrip monitoring
CronJob (every 30 min) sends test email via Mailgun API to
smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all,
deletes test email, pushes metrics to Pushgateway + Uptime Kuma.

Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale,
EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.
2026-03-25 22:50:22 +02:00
Viktor Barzin
1639910043 ingress latency: add histogram buckets, fix restarts, right-size memory
- Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99
- Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts
- Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi)
- Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi)
- ActualBudget: add explicit resources to prevent silent LimitRange defaults
- Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)
2026-03-23 10:52:43 +02:00
Viktor Barzin
a1b4a2106d add NFS/CSI cascade failure post-mortem [ci skip] 2026-03-23 02:25:11 +02:00
Viktor Barzin
813f523170 docs: add private registry usage to infra CLAUDE.md [ci skip] 2026-03-23 01:08:57 +02:00