Commit graph

99 commits

Author SHA1 Message Date
Viktor Barzin
c6784f87b5 [docs] Add NFS prerequisite runbook for nfs_volume module [ci skip]
## Context

`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.

This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.

Addresses bd code-yo4.

## This change

New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).

## What is NOT in this change

- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope

## Test Plan

### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`

The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```

### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. First pod
reaches Ready within a minute of the PV creation.

Closes: code-yo4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:55 +00:00
Viktor Barzin
09c1105648 [mailserver] Delete postfix_cf_reference_DO_NOT_USE dead code [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix `main.cf` layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but actively wastes
reviewer attention on every `git blame`, drives up `variables.tf` read
time, and lets drift calcify.

Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
  (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
  reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
  over deletion (authoritative source is in the image).

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the single live variable `postfix_cf` that is actually
consumed by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately; this commit explicitly avoids conflating cleanup with
  drift resolution.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift
(volume_mount ordering + stale `metallb.io/ip-allocated-from-pool`
annotation). No change is attributable to this commit — the dead
variable was never referenced, so removing it leaves state untouched.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` → expected `0`
3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed
   including the trailing blank after the EOT)
4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains
5. `cd ../..` (stack root) → `terragrunt validate` → expected:
   `Success! The configuration is valid`
6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to
   destroy.` (the 2 are the pre-existing drift, not from this commit).

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:05:44 +00:00
Viktor Barzin
b2d2a5bb1c [docs] Document Fail2ban-disabled rationale (CrowdSec is policy) [ci skip]
## Context

An audit of the mailserver stack raised the question: why is Fail2ban
disabled in the docker-mailserver deployment? The setting
`ENABLE_FAIL2BAN = "0"` lives in the env ConfigMap at
`stacks/mailserver/modules/mailserver/main.tf:68` with no documented
rationale, which made the decision look accidental rather than
deliberate.

The decision is deliberate: CrowdSec is the cluster-wide bouncer for
SSH, HTTP, and SMTP/IMAP brute-force defence. It already tails
`postfix` + `dovecot` logs via the installed collections and enforces
decisions at the LB/firewall tier with real client IPs preserved by
`externalTrafficPolicy: Local` on the dedicated MetalLB IP. Enabling
Fail2ban in-pod would duplicate that response path — two systems
racing to ban the same offender from different enforcement points,
iptables churn inside the container, and a split audit trail across
two decision stores. User decision 2026-04-18: keep disabled, document
the decision so the next auditor doesn't have to re-derive it.

## This change

Adds a new subsection "Fail2ban Disabled (CrowdSec is the Policy)" to
the Security section of `docs/architecture/mailserver.md`, placed
immediately after the existing CrowdSec Integration block. The
paragraph cites `stacks/mailserver/modules/mailserver/main.tf:68`
(where `ENABLE_FAIL2BAN = "0"` lives) and explains why duplicating the
layer would make things worse, not better. Pure docs — no Terraform
touched.

## Test Plan

### Automated
None — docs-only change. No tests, lint, or type checks apply to
markdown prose.

### Manual Verification
1. `less infra/docs/architecture/mailserver.md` — locate the Security
   section; confirm the new "Fail2ban Disabled (CrowdSec is the
   Policy)" subsection appears between "CrowdSec Integration" and
   "Rspamd".
2. Render on GitHub or via a markdown previewer; confirm the inline
   link to `main.tf` resolves and the paragraph reads cleanly.
3. `grep -n 'ENABLE_FAIL2BAN' infra/stacks/mailserver/modules/mailserver/main.tf`
   — confirm it still reports the value on line 68, matching the
   citation in the doc.

Closes: code-zhn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:47:59 +00:00
Viktor Barzin
b9e9c3f084 [mailserver] Update SPF + docs for Brevo migration [ci skip]
## Context

Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when
variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`.
Postfix immediately began using Brevo for user mail — but the SPF TXT record
at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every
Brevo-relayed message failed SPF alignment and was spam-foldered or
DMARC-quarantined by Gmail/Outlook.

Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`:

    "v=spf1 include:mailgun.org -all"  <-- wrong sender network

User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`.
Soft-fail (`~all`) is intentional during cutover — keeps unauthorized Brevo
sends quarantined rather than outright rejected while we validate Brevo's
sending IPs + rate limits for real user mail. Tighten to `-all` once the
relay is proven stable.

The docs in `docs/architecture/mailserver.md` still described the old
Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets
table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change",
those are updated in the same commit.

## This change

Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs):

1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content
   flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`,
   with an inline comment pointing at the mailserver docs for rationale.
2. `docs/architecture/mailserver.md` —
   - Last-updated stamp moved to 2026-04-18 with the cutover note.
   - Overview paragraph now says "relays through Brevo EU" (was Mailgun).
   - DNS table SPF row reflects the new value plus an annotated history
     note ("was include:mailgun.org -all until 2026-04-18").
   - DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua
     target and flags that the current live record still points at
     e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569.
   - Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo
     relay credentials; `mailgun_api_key` annotated as retained for the
     E2E roundtrip probe only (inbound delivery testing, not user mail).

Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf`
to avoid sweeping up two unrelated pre-existing drifts that the Terraform
state shows on this stack: the DMARC + mail._domainkey_rspamd records are
stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and
a naive refresh+apply would normalize them in the state back to single
strings. Those drifts are semantically equivalent (DNS concatenates
adjacent TXT strings at resolution time) and are out of scope for this
commit — they'll be handled under their own ticket.

## What is NOT in this change

- DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1),
  still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses
  in the live record.
- DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and
  `mail_domainkey_rspamd` — pre-existing Cloudflare representation drift,
  untouched here.
- Removal of Mailgun references in history/decision sections of the docs,
  or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API
  on purpose for inbound delivery testing (code-569 scope).
- Mailgun DKIM record `s1._domainkey` — left in place; still consumed by
  the roundtrip probe.
- Other pending items from the 2026-04-18 mail audit plan.

## Test Plan

### Automated

Targeted plan showed exactly one change, no other drift sneaking in:

    module.cloudflared.cloudflare_record.mail_spf will be updated in-place
      ~ content = "\"v=spf1 include:mailgun.org -all\""
             -> "\"v=spf1 include:spf.brevo.com ~all\""
    Plan: 0 to add, 1 to change, 0 to destroy.

Apply result:

    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

DNS propagation verified on three independent resolvers immediately after
apply:

    $ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf   # Technitium primary
    "v=spf1 include:spf.brevo.com ~all"

### Manual Verification

Setup: nothing extra — change is already live (TF applied before commit
per home-lab convention; `[ci skip]` in title).

1. Confirm SPF is the Brevo-only record from an external resolver:

       dig TXT viktorbarzin.me @1.1.1.1 +short

   Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.

2. Send a test email via the mailserver (through Brevo relay) to a Gmail
   account and view the original headers:

       Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me
       ...
       Received-SPF: Pass (google.com: domain of ... designates ... as
       permitted sender)

   Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this
   change because the envelope sender IP was a Brevo IP not covered by
   `include:mailgun.org`).

3. Confirm no live Mailgun references in the mailserver doc:

       grep -n mailgun.org infra/docs/architecture/mailserver.md

   Expected: only annotated-history mentions — SPF "was ... until
   2026-04-18" and DMARC "current live record still points at
   e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active
   Mailgun relay.

## Reproduce locally

    cd infra
    git pull
    dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    # expected: "v=spf1 include:spf.brevo.com ~all"

    # inspect the TF change:
    git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf

    # inspect the doc change:
    git show HEAD -- docs/architecture/mailserver.md

Closes: code-q8p
Closes: code-9pe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:13:47 +00:00
Viktor Barzin
1a7f68fe5b [beads-server] Auto-dispatch agent beads via CronJobs
## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.

The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.

## Flow

```
  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy? skip)         │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘

  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at > 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked
```

## Decisions

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour.
  Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
  Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
  `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
  Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The script copies it into `/tmp/.beads/` because bd
  may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
  `beads-task-runner` never trips the reaper. Failures trip it; pod crashes
  (in-memory job state lost) also trip it.

## What is NOT in this change

- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
  `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
  serializes `notes` as a string (not an array), and every `bd note` bumps
  `updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
  `-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!

$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.

$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`,
   `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
   created. No changes to existing resources.

2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher  */2 * * * *` and `beads-reaper  */10 * * * *`,
   both with `SUSPEND=False`.

3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
       -d "Read /etc/hostname inside the agent sandbox and close." \
       --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and
   `dispatched: job=<uuid>`, status `in_progress`.

4. **Reaper smoke**
   Assign + dispatch a long bead, then
   `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
   30 min + one reaper tick, `bd show <id>` shows `blocked` with a
   `reaper: no progress for Nm — blocking` note.

5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
   nothing happens within 5 min. Re-apply with `=true` to re-enable.

Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
69fbd0ffd6 [docs] Update auto-upgrade docs — new HTTP auth path + n8n expression gotcha
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:

1. The workflow is DB-state, not Terraform-managed — the JSON in the
   repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
   `='Bearer ' + $env.X` does NOT — costs silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
   workflow patches.

Follow-up to 99180bec which fixed the actual pipeline break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:42:11 +00:00
Viktor Barzin
42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00
Viktor Barzin
18338a883f [ci skip] cleanup: remove e2e test file 2026-04-17 22:06:24 +00:00
Claude Agent
842646ea4f [ci skip] e2e: test commit from claude-agent-service 2026-04-17 22:03:50 +00:00
Viktor Barzin
65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal
- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:43:13 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
a237ac97e0 docs(upgrades): add bulk upgrade results from first production run
12 services upgraded in 30 min: audiobookshelf, owntracks, open-webui,
immich, coturn, shlink, phpipam, onlyoffice, paperless-ngx, linkwarden,
synapse, dawarich. Documents auto-rollback behavior, resource awareness
(paperless memory bump), bulk upgrade procedure, and rate limit reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:34:27 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
c33f597111 feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN
Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:38:27 +00:00
Viktor Barzin
dcc96f465e docs(storage): add encrypted LVM documentation
Update storage docs to reflect the 2026-04-15 migration of all sensitive
services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow
documentation, updated architecture diagram, and storage class decision
rules.

Files updated:
- .claude/CLAUDE.md: storage decision table, encrypted PVC template
- docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths
- AGENTS.md: storage section with encrypted SC as default for sensitive data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:00:37 +00:00
Viktor Barzin
69474fae96 docs: add comprehensive DNS architecture documentation
Covers Technitium HA (3-instance AXFR replication), CoreDNS config,
Cloudflare external DNS, Split Horizon hairpin NAT fix, DHCP-DNS
auto-registration, 6 automation CronJobs, and troubleshooting guides.
Also fixes stale NFS reference in networking.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 18:10:27 +00:00
Viktor Barzin
0a448c2bae docs: rewrite incident-response as user contribution guide
Complete rewrite of the user-facing documentation:
- How to report outages and request features
- Mermaid flow diagrams for both incident and feature request paths
- SLA expectations (automated vs human response times)
- Self-service checks before reporting
- Severity level definitions
- Status page explanation
- Full technical architecture section with component inventory
- Safety guardrails, labels, and commit conventions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:59:09 +00:00
Viktor Barzin
7bb9ec2934 Add agent task tracking documentation
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:11:26 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
0256ccdccc feat: add per-database backups for PostgreSQL and MySQL
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)

Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.

Updated restore runbooks with per-database restore procedures.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:39:33 +00:00
Viktor Barzin
30c3450c61 docs: add user-facing issue reporting guide
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:37:37 +00:00
Viktor Barzin
dfad89ef81 fix: remove br tags from mermaid diagram (GitHub compat) [ci skip] 2026-04-14 18:35:13 +00:00
Viktor Barzin
f658b18a50 docs: add incident response & post-mortem pipeline architecture
Documents the full E2E automated incident response pipeline:
- /post-mortem skill for generating structured post-mortems
- Woodpecker pipeline triggered on post-mortem file changes
- Claude Code headless agent for implementing safe TODOs
- Safety guardrails, commit conventions, secrets, limitations

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:31:29 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00
Viktor Barzin
933c562aa9 docs: trigger postmortem pipeline E2E test 2026-04-14 16:49:07 +00:00
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
4e059b138c docs: consolidate all post-mortems under docs/post-mortems/
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:24:36 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
a2fad3f20e docs(mailserver): remove HTML visual, fix probe frequency in diagram 2026-04-12 22:25:34 +01:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
c740ed1301 docs: update Technitium DNS docs after cache optimization
- Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md)
- Fix PDB minAvailable: 1 → 2 (networking.md)
- Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs
- Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only
- Update config storage: was generic PVC, now NFS path
2026-04-12 18:29:25 +01:00
Viktor Barzin
73531c12e0 docs(vpn): update with dual-stack WG, GL-iNet AllowedIPs fix, and troubleshooting [ci skip]
Document fixes from 2026-04-10 London network debugging session:
- pfSense WG now dual-stack (IPv4+IPv6 via HE tunnel gif0 pf rule)
- GL-iNet AllowedIPs must be single comma-separated UCI entry (parse bug)
- AdGuardHome/carrier-monitor must not use 1.1.1.1 (conntrack + rate limit)
- Expanded troubleshooting for site-to-site tunnel disconnects
2026-04-10 22:24:19 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00