Replace the step-band fan curve with a continuous linear ramp — the bands
flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis
is the homelab standard; PID is overkill for this slow thermal loop.
fan% now interpolates between env-tunable anchors:
COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium)
QUIET 68C/20% -> 83C/100% (near-silent until ~70C)
Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric
hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold.
41 bash tests green; deployed + verified live (59C -> 49%, smooth).
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the HA-control feature shipped in 8beca1df: the daemon reads the
ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation,
and the dashboard-it Server-view sensors + control tiles.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).
Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).
Design: docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.
Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
header to bypass Chrome's hardcoded DNS-rebinding protection (no
`--remote-allow-hosts` flag exists in stock Chrome 130; verified by
binary string grep). Bridge also forces Connection: close on HTTP
responses so Node ws opens a fresh TCP for the WS upgrade rather than
trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
CDP, dumps storage_state() (cookies + localStorage), writes atomically
to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.
Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
longer used by code; ExternalSecret kept for symmetry with the snapshot
endpoint).
Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
snapshot via the bearer-gated HTTPS endpoint.
Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.
Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot
Service code: viktor/claude-agent-service @ 66104a3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Dashboard now shows two 'Me' bars: realized gross (~£409k, from
SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k,
levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT
salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
total_value (what the comparison bar uses) must be full TC; document storing
base+bonus+RSU components too so it's verifiable that RSU+bonus are included.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a barchart (panel 10) ranking every company's London p50 total comp
(COALESCE total/base) with the user's current comp shown in line, so it's a
direct "how do I compare" view. The user's figure is NOT hardcoded in the
dashboard JSON — it's a labeled comp_point in the DB (company_slug
'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number
out of git. It's below the £500k alert bar (no Slack ping) and ranks too low
to appear in analyze leaders. Runbook documents the panel + how to update the
baseline.
[ci skip] — dashboard ConfigMap applied locally (targeted).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh):
`python -m job_hunter alert --threshold 500000 --location london --slack`
posts to Slack the companies whose London p50 total comp >= £500k, flagging
any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via
the job-hunter-secrets ExternalSecret from Vault secret/job-hunter
slack_webhook_url (seeded from the shared workspace webhook; repointable to a
dedicated channel). Runbook gains an "above-target Slack alert" section.
[ci skip] — applied locally (stack-scoped).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI now drives the Deployment rollout (kubectl set image to the build SHA in
.woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment
runs whatever CI last set (image ignore_changes keeps TF from fighting it),
and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly
run). Keel stays enrolled in parallel as a redundant net.
Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc (owned
apps drive their own rollout, Keel parallel — upstream stays Keel-only); add
the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section.
[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.
Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The videoConversion enqueue is an async scan; deleting encoded_video rows while a
prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the
first pass). Re-trigger force=false once the queue first drains to waiting:0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap +
the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install,
CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent
[Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer
subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep
repair, the DeepRepairDone marker + in-place-repair escalation, and the
low-disk/guest-agent-drop gotchas hit during the pilot.
Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast +
targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback
streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single
stream needs ~10-13.5 MB/s and stuttered for every client, local and remote.
Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium,
transcode=bitrate. 4K resolution preserved; originals never modified. Existing
oversized transcodes regenerated by deleting their asset_file encoded_video rows
+ videoConversion force=false (concurrency 1).
Document config + add runbook docs/runbooks/immich-transcode-bitrate.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but never landed that
change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled live back to the stale .200, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).
Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The bundled consumer Office removal leaves a pending reboot; a same-run VL
install (or re-run before rebooting) fails with setup.exe 1603. Document the two
guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and
the on-disk completion poll. Record that the uninstall path is now verified on a
real M365 box (O365HomePremRetail removed) and the install needs a reboot first.
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.
Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202).
Docs: kms-public-exposure runbook + service-catalog entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.
Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
posting to immich-server.immich.svc:2283 with Anca's API key (synced
via the existing immich-secrets ExternalSecret from
secret/immich.anca_api_key). Filters to image extensions, bans the
non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
docs, etc.), puts every asset in the album "Poze (Elements)". Default
`--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
procedure (no pve-host TF stack exists for this).
Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
section documenting mount-per-archive + applicable_users model,
why mount-level ACL beats Files Access Control on NC 30/31, the
manifest shape (with current applicableUsers + enableSharing
fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
a new directory under /srv/nfs/* to specific NC users via the
bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
/usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
resumable). The PVE replica is what the NC /anca-elements mount
serves; the offsite-sync pipeline excludes this path (committed
earlier this session) so we don't write it back to Synology
NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
Runbook rewritten for the standalone setup (InnoDB Cluster gone since
2026-04-16) and now covers the full disaster-recovery flow we just
executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain
→ Delete), re-apply TF, restore via in-namespace Job, drop+create
static users with fresh Vault passwords, restart dependents.
CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads
code-8ywc and follow-up commits. Captures:
- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies
with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the
K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7,
S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy
Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1)
and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action
steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity
allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy,
rationale for not adopting canary tokens.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.
Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:
preflight (k8s-node1)
→ master (k8s-node1) drains k8s-master
→ worker × 4 (k8s-node1) drains k8s-node{4,3,2}
→ worker (k8s-master + control-plane toleration) drains k8s-node1
→ postflight (no pinning)
Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).
Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
-> etcd snapshot save
-> optional master containerd skew fix
-> apt repo URL rewrite (minor bumps only)
-> drain/upgrade/uncordon master via ssh < update_k8s.sh
-> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
-> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.
Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).
Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.
DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
(was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202 (added)
- kms.viktorbarzin.lan A 10.0.20.200 (unchanged — still the
Traefik LB for the user-facing website at kms.viktorbarzin.lan/)
vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.
Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
Two coupled fixes for the hourly Slack noise + missing client IPs:
1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
Sharing 10.0.20.200 is blocked because all 10 services there are
ETP=Cluster and MetalLB requires consistent ETP per shared IP.
2. Slack notifier now suppresses Slack posts for bare TCP open/close
pairs (no Application/Activation block) — these are Uptime Kuma's
port monitor and the new kubelet readiness/liveness probes. Probe
counts go to a new metric kms_connection_probes_total{source} where
source classifies the IP as internal_pod / cluster_node / external.
Real activations are unaffected.
Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.
pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
smtps, etc.) untouched
Runbook updated. Tests added for classify_source / is_probe / process_line.
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.
Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.
The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
Earlier I claimed the OAuth Web UI flow was the only way to onboard
new Forgejo repos in Woodpecker. That's wrong.
Two parts to the actual workaround:
1. Woodpecker session JWTs are HS256 signed with the user's per-user
`hash` column from the PG `users` table (NOT the global agent
secret). Mint a session JWT for the Forgejo viktor user (id=2,
forge_id=2), and you're authenticated as that user.
2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls
Forgejo with viktor's stored OAuth access_token to create the
webhook + per-repo signing key. Works.
The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub
admin), whose user row has no Forgejo OAuth token — Woodpecker's
forge-API call fails for that user, surfacing as a 500.
scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow:
extract hash from PG → mint JWT → activate repo. Verified against
viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in
this session — all activated cleanly.
Also updated the runbook with the actual mechanism + the
WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the
'context deadline exceeded' failures, NOT the v3.14 upgrade).
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Companion to forgejo-registry-breakglass.md but for the more common
case: the Forgejo registry is healthy as a whole, but one image's
manifest/blob references are broken (orphan child, half-pushed
upload, retention-vs-pull race). The
RegistryManifestIntegrityFailure alert annotation already points
here.
Mirrors registry-rebuild-image.md (the registry-private equivalent)
in structure: confirm via probe + curl, delete broken version
through Forgejo API, rebuild via Woodpecker manual run, force
consumers to re-pull, verify integrity recovery.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds Forgejo as a second push target on the build-ci-image pipeline
and saves the just-pushed image as a gzipped tarball on the registry
VM disk (/opt/registry/data/private/_breakglass/) so we can recover
infra-ci with `ctr images import` if both registries are down.
* Dual-push: registry.viktorbarzin.me:5050/infra-ci AND
forgejo.viktorbarzin.me/viktor/infra-ci, in the same
woodpeckerci/plugin-docker-buildx step. Same image bytes; the
Forgejo integrity probe (every 15min) catches any divergence.
* Break-glass step: SSHes to 10.0.20.10, docker pulls + saves +
gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant
so a transient registry blip doesn't fail the build pipeline.
* Runbook docs/runbooks/forgejo-registry-breakglass.md documents
the recovery flow (when to use, scp+ctr import, node cordon,
underlying-issue fix).
Tarball mirrors to Synology automatically through the existing
daily offsite-sync-backup job — no new sync wiring needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol
listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr:
... Servname not supported for ai_socktype` — the daemon respawns get throttled by
postfix master, and real client connections that land mid-respawn time out. We saw
this as ~50% timeout rate on public 587 from inside the cluster.
Layer 1 (book-search) — stacks/ebooks/main.tf:
SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local
Internal services should use ClusterIP, not hairpin through pfSense+HAProxy.
12/12 OK in <28ms vs ~6/12 timeouts on the public path.
Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php:
Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc:
30145 → pod 25 (stock postscreen)
30146 → pod 465 (stock smtps)
30147 → pod 587 (stock submission)
HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to
redirect L4 health probes to those ports while real client traffic keeps
going to 30125-30128 with PROXY v2.
Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s.
Inter dropped 120000 → 5000 since log-spam concern is gone.
`option smtpchk EHLO` was tried first but flapped against postscreen (multi-line
greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP).
Plain TCP accept-on-port check is sufficient for both submission and postscreen.
Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck
path and mark the "Known warts" entry as resolved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the manual scp+bounce sequence that landed registry:2.8.3 on
10.0.20.10 today (see commit 7cb44d72 + nginx-DNS-trap in runbook).
Addresses the "no repeat manual fixes" preference — future changes to
docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf /
config-private.yml / cleanup-tags.sh now deploy through CI.
Pipeline (.woodpecker/registry-config-sync.yml) mirrors
pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set,
bounce compose only when compose-visible files changed, always restart
nginx after a compose bounce (critical — nginx caches upstream DNS), end
with a dry-run fix-broken-blobs.sh to catch regressions.
Credentials:
- Woodpecker repo-secret `registry_ssh_key` (events: push, manual)
- Mirror at Vault `secret/woodpecker/registry_ssh_key`
(private_key / public_key / known_hosts_entry)
- Public key on /root/.ssh/authorized_keys on 10.0.20.10
- Key label: woodpecker-registry-config-sync
Runbook updated with "Auto-sync pipeline" section pointing at the new
flow + manual override command.
Closes: code-3vl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered during the 2026-04-19 registry:2.8.3 pin deploy: nginx caches
its upstream DNS at startup and does NOT re-resolve after registry-*
containers are recreated. Symptom was /v2/_catalog returning
{"repositories": []} and /v2/ returning 200 without auth — nginx was
forwarding to a stale IP that a different backend container now owns.
Fix is always 'docker restart registry-nginx' after any registry-*
bounce. Captured in registry-vm.md so future manual operators and the
coming auto-sync pipeline (beads code-3vl) both encode the step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.
In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).
Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.
Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24
got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark.
- After:
- 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
- 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
- 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense
Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2,
94.140.14.14] so the end state is consistent across all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
$cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
services_kea4_configure() implodes the array into "data: a, b" on the
"domain-name-servers" option-data entry (services.inc L1214).
- Verified:
- DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
"nameserver 94.140.14.14" after networkd renew.
- k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved
restart.
- Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`;
dig @10.0.20.1 google.com -> "no servers could be reached"; dig
@94.140.14.14 google.com -> 216.58.204.110; system resolver
(getent hosts) succeeds via the fallback IP. Blackhole route removed.
Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
update vector was latent (DDNS wiring in Kea DHCP4 is actually off
today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
- update = UseSpecifiedNetworkACL
- updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
- updateSecurityPolicies = [{tsigKeyName: kea-ddns,
domain: "*.<zone>", allowedTypes: [ANY]}]
Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
- Signed A-record update from pfSense -> "successfully processed",
dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
hmac-sha256; TSIG Error: NoError; RCODE: NoError").
- Signed PTR update same zone pattern -> dig -x returns tsig-test
FQDN.
- Unsigned update from pfSense IP (in ACL) -> "update failed:
REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
Updates Security Policy").
- Test records cleaned up via signed nsupdate.
Safety
- pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip
(145898 bytes, pre-change snapshot — keep 30d).
- DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
pfSense; not committed to git.
Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
key rotation, emergency bypass, and enforcement-verification.
Closes: code-o6j
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now
has a primary internal resolver + external fallback (AdGuard) so DNS
keeps working if the primary resolver IP is unreachable.
New config:
- Proxmox host (192.168.1.127): plain /etc/resolv.conf with
nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard).
Previously: single nameserver 192.168.1.1 — could not resolve
internal .lan names at all. Documented in
docs/runbooks/proxmox-host.md.
- Registry VM (10.0.20.10): systemd-resolved drop-in at
/etc/systemd/resolved.conf.d/10-internal-dns.conf
(DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan)
plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml.
Previously: 1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan
hostnames would fail to resolve. Documented in
docs/runbooks/registry-vm.md.
- TrueNAS (10.0.10.15): host unreachable during this session
("No route to host" on 10.0.10.0/24). Deferred best-effort per
WS F instructions; noted on the beads task.
Both hosts have pre-change backups at /root/dns-backups/ for
one-command rollback. Fallback behaviour was validated by routing
each primary to a blackhole and confirming dig answered from the
fallback.
Both runbooks include the verified resolvectl / resolv.conf state,
the fallback-test procedure, and the rollback steps.
Closes: code-dw8
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.
**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
(secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
per-instance zone_count gauges to Pushgateway, fail the job on any
create error (was silently passing).
**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
forward, health_check on viktorbarzin.lan forward, serve_stale
3600s/86400s on both cache blocks — pfSense flap no longer takes the
cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
CoreDNSForwardFailureRate.
**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
kubectl rollout status on all 3 deployments (180s), per-pod
/api/stats/get probe, zone-count parity across the 3 instances.
Fails the apply on any check fail. Override: -var skip_readiness=true.
**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
modes, emergency override.
Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still
carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter
component row, stale PVC names (`-proxmox` suffixes that were renamed
`-encrypted` during the LUKS migration), a wrong probe schedule
(claimed 10 min, actually 20 min), and a Mailgun-API claim for the
probe (it's been on Brevo since code-n5l). The two-path architecture
(external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the
current design wasn't visualised at all.
## This change
Rewrote the Architecture Diagram section to show **both ingress paths
in one Mermaid flowchart**, colour-coded:
- External (orange): Sender → pfSense NAT → HAProxy → NodePort →
**alt PROXY listeners** (2525/4465/5587/10993).
- Intra-cluster (blue): Roundcube / probe → ClusterIP Service →
**stock listeners** (25/465/587/993), no PROXY.
- The pod subgraph shows both listener sets feeding the same Postfix /
Rspamd / Dovecot / Maildir pipeline.
- Security dotted edges: Postfix log stream → CrowdSec agent →
LAPI → pfSense bouncer decisions.
- Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP →
Pushgateway/Uptime Kuma.
Added a **sequenceDiagram** for the external SMTP roundtrip — walks
through the wire-level handshake from external MTA → pfSense NAT →
HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod
postscreen parse → smtpd banner. Makes the "how does the pod see the
real IP despite SNAT?" question self-answering.
Added a **Port mapping table** listing all 8 container listeners (4
stock + 4 alt) with their Service, NodePort, PROXY-required flag, and
who uses each path. Replaces the ambiguous prose about "alt ports".
Fixed stale bits:
- Removed Dovecot Exporter row from Components (retired in code-1ik).
- Added pfSense HAProxy row.
- Probe schedule: every 10 min → **every 20 min** (`*/20 * * * *`).
- Probe API: Mailgun → **Brevo HTTP**.
- PVC names: `-proxmox` → **`-encrypted`** (all three); storage class
`proxmox-lvm` → **`proxmox-lvm-encrypted`**.
- Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS
PVCs to the Storage table with backup flow pointer.
- Expanded Troubleshooting → Inbound to include HAProxy health check
+ container-listener verification steps.
- Secrets table: `brevo_api_key` now marked as used by both relay +
probe; `mailgun_api_key` marked historical.
Added a prominent **UPDATE 2026-04-19** header to
`docs/runbooks/mailserver-proxy-protocol.md` pointing future readers
at the implemented state in `mailserver-pfsense-haproxy.md`. Research
doc preserved as a decision record — it's the canonical "why not just
pin the pod?" reference.
## What is NOT in this change
- No Terraform changes; this is docs-only.
- No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was
already rewritten during Phase 6.
## Test Plan
### Automated
```
$ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md
2
$ grep -c '\-encrypted' docs/architecture/mailserver.md
5 # PVC references normalised
$ grep -c '\-proxmox' docs/architecture/mailserver.md
0 # no stale names left
```
### Manual Verification
Render `docs/architecture/mailserver.md` on GitHub or any Mermaid-
capable viewer:
1. Top Architecture Diagram should show two labelled paths into the
pod, colour-coded (orange = external, blue = intra-cluster).
2. Sequence diagram should show 10 numbered steps ending at Rspamd +
Dovecot delivery.
3. Port Mapping table should make it obvious that the 4 alt container
ports are only reachable via `mailserver-proxy` NodePort and require
PROXY v2.