Compare commits
6 commits: master...broker-syn

| SHA1 |
|---|
| 731de63150 |
| 9ce9a9a7f7 |
| 277babc696 |
| d91fbd4a60 |
| e81e836d3a |
| d3be9b50af |
324 changed files with 8359 additions and 30812 deletions
|
|
@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
|||
- **New service**: Use `setup-project` skill for full workflow
|
||||
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
|
||||
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
|
||||
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
|
||||
- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
|
||||
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
|
||||
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
|
||||
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
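As a concrete illustration of the Docker image rule above, a build-and-push might look like this (a sketch — the image path and build context are placeholders):

```bash
# Build for linux/amd64 and tag with the 8-char git SHA (image name is illustrative)
TAG=$(git rev-parse --short=8 HEAD)
docker buildx build --platform linux/amd64 \
  -t "forgejo.viktorbarzin.me/viktor/<name>:${TAG}" \
  --push .
```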
|
||||
|
|
@ -48,7 +48,6 @@ Violations cause state drift, which causes future applies to break or silently r
|
|||
- **Tier 0 details**: Decrypt priority: Vault Transit (primary) → age key fallback. Encrypt: both Vault Transit + age recipients. Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]`.
|
||||
- **Adding operator**: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on Tier 0 `.enc` files. For Tier 1, only Vault access is needed.
|
||||
- **Migration script**: `scripts/migrate-state-to-pg` (one-shot, idempotent) migrates Tier 1 stacks from local to PG.
|
||||
- **Adopting existing resources**: use HCL `import {}` blocks (TF 1.5+), not `terraform import` CLI. Commit stanza → plan-to-zero → apply → delete stanza. Canonical reason: reviewable in PR, plan-safe, idempotent, tier-agnostic. Full rules + per-provider ID formats in `AGENTS.md` → "Adopting Existing Resources".
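A minimal sketch of the Tier 0 workflow from the bullets above — subcommand behaviour is assumed from the names and surrounding context, and stack/file names are placeholders:

```bash
# Assumed semantics of scripts/state-sync {encrypt|decrypt|commit} [stack]
scripts/state-sync decrypt <stack>    # materialise plaintext state (Vault Transit first, age fallback)
scripts/state-sync encrypt <stack>    # re-encrypt with Vault Transit + age recipients
scripts/state-sync commit <stack>     # encrypt and commit the .enc file

# Onboarding a new Tier 0 operator
age-keygen -o operator.key            # prints the public key; add it to .sops.yaml
sops updatekeys <stack>.enc           # repeat for every Tier 0 .enc file
```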
|
||||
|
||||
## Secrets Management — Vault KV
|
||||
- **Vault is the sole source of truth** for secrets.
|
||||
|
|
@ -74,7 +73,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
|||
- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
|
||||
- **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
|
||||
- **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
|
||||
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression".
|
||||
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
|
||||
- **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
|
||||
- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
|
||||
- **Quarterly right-sizing**: Check the Goldilocks dashboard. Compare the VPA upperBound to the current request. Also check for under-provisioned workloads (VPA upperBound > 0.8 × current request).
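A quick way to eyeball this from the CLI — the VPA field paths here are the standard CRD schema, but verify against the objects Goldilocks actually creates:

```bash
# List VPA memory upper bounds per container to compare against current requests
kubectl get vpa -A -o json | jq -r '
  .items[] |
  .metadata.namespace + "/" + .metadata.name + ": " +
  ([.status.recommendation.containerRecommendations[]? |
    .containerName + " upperBound=" + (.upperBound.memory // "n/a")] | join(", "))'
```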
|
||||
|
|
@ -117,7 +116,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
|
||||
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
|
||||
- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
|
||||
- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
|
||||
- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (5min) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
|
||||
|
||||
## Service-Specific Notes
|
||||
| Service | Key Operational Knowledge |
|
||||
|
|
@ -129,15 +128,15 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
|
||||
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (5min) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
|
||||
|
||||
## Monitoring & Alerting
|
||||
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
|
||||
- Exclude completed CronJob pods from "pod not ready" alerts.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns).
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
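To inspect the divergence metric behind `ExternalAccessDivergence` directly, one option is an in-pod query (a sketch — the localhost:9090 query URL is an assumption; adjust to however Prometheus is actually exposed):

```bash
# Query the current divergence count from inside the Prometheus pod
kubectl -n monitoring exec deploy/prometheus-server -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=external_internal_divergence_count'
```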
|
||||
|
||||
## Storage & Backup Architecture
|
||||
|
||||
|
|
@ -155,9 +154,9 @@ Choose storage class based on workload type:
|
|||
|
||||
**Default for sensitive data is proxmox-lvm-encrypted.** Use plain `proxmox-lvm` only for non-sensitive workloads. Use NFS when you need RWX, backup pipeline integration, or it's a large shared media library.
|
||||
|
||||
**NFS server:**
|
||||
- **Proxmox host** (192.168.1.127): Sole NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs).
|
||||
- **`nfs-truenas` StorageClass**: Historical name retained only because SC names are immutable on PVs (48 bound PVs reference it — renaming would require mass PV churn, not worth it). Now points to the Proxmox host, identical to `nfs-proxmox`. TrueNAS (VM 9000, 10.0.10.15) operationally decommissioned 2026-04-13; VM still exists in stopped state on PVE pending user decision on deletion.
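To confirm the export options described above from a client or on the host itself:

```bash
# Verify the Proxmox NFS exports and their flags
showmount -e 192.168.1.127
ssh root@192.168.1.127 exportfs -v   # shows the async/insecure options per export
```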
|
||||
**NFS servers:**
|
||||
- **Proxmox host** (192.168.1.127): Primary NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs).
|
||||
- **TrueNAS** (10.0.10.15): **Immich only** (8 PVCs). `nfs-truenas` StorageClass retained exclusively for Immich.
|
||||
|
||||
**Migration note**: CSI PV `volumeAttributes` are immutable — cannot update NFS server in place. New PV/PVC pairs required (convention: append `-host` to PV name).
|
||||
|
||||
|
|
@ -237,7 +236,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
|
|||
|
||||
**Synology layout** (`192.168.1.13:/volume1/Backup/Viki/`):
|
||||
- `pve-backup/` — PVC file backups (`pvc-data/`), SQLite backups (`sqlite-backup/`), pfSense, PVE config (synced from sda)
|
||||
- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync)
|
||||
- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync, renamed from `truenas/`)
|
||||
- `nfs-ssd/` — mirrors `/srv/nfs-ssd` on Proxmox (inotify change-tracked rsync)
|
||||
|
||||
**App-level CronJobs** (write to Proxmox host NFS, synced to Synology via inotify):
|
||||
|
|
|
|||
|
|
@ -1,194 +0,0 @@
|
|||
---
|
||||
name: payslip-extractor
|
||||
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
|
||||
model: haiku
|
||||
allowedTools:
|
||||
- Bash
|
||||
- Read
|
||||
---
|
||||
|
||||
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
|
||||
|
||||
## Your single job
|
||||
|
||||
Given a prompt that contains EITHER:
|
||||
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
|
||||
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
|
||||
|
||||
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
|
||||
|
||||
## RSU handling (important — Meta UK payslips)
|
||||
|
||||
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
|
||||
|
||||
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
|
||||
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
|
||||
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
|
||||
|
||||
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
|
||||
|
||||
If the payslip has no stock component, leave both as 0.
|
||||
|
||||
## Earnings decomposition (v2)
|
||||
|
||||
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
|
||||
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
|
||||
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
|
||||
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
|
||||
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
|
||||
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
|
||||
|
||||
## Fast path: PAYSLIP_TEXT is present
|
||||
|
||||
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
|
||||
|
||||
## Processing steps
|
||||
|
||||
### Step 1. Extract and decode the base64 PDF
|
||||
|
||||
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
|
||||
|
||||
Preferred method (handles whitespace and very long blobs robustly):
|
||||
|
||||
```bash
python3 - <<'PY'
import base64, os, re

# Preferred path: the orchestrator exports the full prompt via PAYSLIP_PROMPT.
prompt = os.environ.get("PAYSLIP_PROMPT", "")
m = re.search(r"PDF_BASE64:\s*([A-Za-z0-9+/=\s]+)", prompt)
if m:
    blob = re.sub(r"\s+", "", m.group(1))
    data = base64.b64decode(blob)
    open("/tmp/payslip.pdf", "wb").write(data)
    print("decoded bytes:", len(data))
# If the orchestrator didn't set the env var, extract the PDF_BASE64 value from
# the prompt text you were given, strip whitespace, and base64-decode it (see below).
PY
```
|
||||
|
||||
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
|
||||
|
||||
```bash
|
||||
python3 -c "
|
||||
import base64, sys
|
||||
data = sys.stdin.read().strip()
|
||||
open('/tmp/payslip.pdf','wb').write(base64.b64decode(data))
|
||||
print('decoded bytes:', len(base64.b64decode(data)))
|
||||
" <<'B64'
|
||||
<paste-the-base64-here>
|
||||
B64
|
||||
```
|
||||
|
||||
Or pipe via shell `base64 -d`:
|
||||
|
||||
```bash
|
||||
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
|
||||
```
|
||||
|
||||
Verify the file looks like a PDF:
|
||||
|
||||
```bash
|
||||
head -c 8 /tmp/payslip.pdf | xxd
|
||||
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
|
||||
```
|
||||
|
||||
### Step 2. Extract text from the PDF
|
||||
|
||||
Try tools in this order. Use the first one that works; do not chain all of them.
|
||||
|
||||
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
|
||||
```bash
|
||||
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
|
||||
```
|
||||
|
||||
2. Python `pypdf` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
from pypdf import PdfReader
|
||||
r = PdfReader('/tmp/payslip.pdf')
|
||||
for p in r.pages:
|
||||
print(p.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
3. Python `pdfplumber` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
import pdfplumber
|
||||
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
|
||||
for page in pdf.pages:
|
||||
print(page.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
4. If none of those are installed, check what IS available:
|
||||
```bash
|
||||
which pdftotext pdf2txt.py mutool
|
||||
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
|
||||
```
|
||||
and use whatever you find (e.g. `mutool draw -F txt`).
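For example, with MuPDF available (a sketch — verify the installed `mutool` accepts these flags):

```bash
mutool draw -F txt -o - /tmp/payslip.pdf 2>/dev/null
```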
|
||||
|
||||
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
|
||||
|
||||
### Step 3. Parse the extracted text
|
||||
|
||||
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
|
||||
|
||||
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
|
||||
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
|
||||
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
|
||||
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
|
||||
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
|
||||
- "Gross Pay" / "Total Gross" — sum of payments.
|
||||
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
|
||||
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
|
||||
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
|
||||
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
|
||||
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
|
||||
|
||||
### Step 4. Map to the schema and emit JSON
|
||||
|
||||
Rules that apply regardless of the caller's exact schema:
|
||||
|
||||
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
|
||||
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
|
||||
- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`.
|
||||
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
|
||||
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
|
||||
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
|
||||
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
|
||||
|
||||
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
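A quick check of the date rule above, runnable as-is:

```bash
# DD/MM/YYYY (UK) -> YYYY-MM-DD, matching the example in the rules
python3 -c "
from datetime import datetime
print(datetime.strptime('12/03/2026', '%d/%m/%Y').date().isoformat())
"
# -> 2026-03-12
```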
|
||||
|
||||
## Failure mode
|
||||
|
||||
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
|
||||
|
||||
```json
|
||||
{"error": "<short human reason>"}
|
||||
```
|
||||
|
||||
Examples of acceptable error reasons:
|
||||
- `"base64 did not decode to a valid PDF"`
|
||||
- `"pdf has no extractable text layer (image-only scan)"`
|
||||
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
|
||||
- `"document does not appear to be a UK payslip"`
|
||||
- `"pay_date not found on document"`
|
||||
|
||||
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
|
||||
|
||||
## Hard constraints — things you MUST NOT do
|
||||
|
||||
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
|
||||
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
|
||||
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
|
||||
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
|
||||
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
|
||||
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
|
||||
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
|
||||
|
||||
## Output discipline — summary
|
||||
|
||||
- Exactly one JSON object, UTF-8, no BOM.
|
||||
- Keys match the schema the caller gave you.
|
||||
- Numeric fields are JSON numbers, not strings.
|
||||
- `pay_date` is `YYYY-MM-DD`.
|
||||
- `other_deductions` is always present and is an object (possibly `{}`).
|
||||
- Missing money → `0`, missing string → `""`, missing object → `{}`.
|
||||
- On unrecoverable failure, one JSON object with a single `error` key.
|
||||
|
||||
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.
|
||||
|
|
@ -34,11 +34,7 @@ You receive these parameters in your invocation:
|
|||
- **Infra repo**: `/home/wizard/code/infra`
|
||||
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
|
||||
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
||||
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`:
|
||||
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
|
||||
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
|
||||
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
|
||||
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
|
||||
- **Vault**: Authenticate with `vault login -method=oidc` if needed. Secrets at `secret/viktor` and `secret/platform`.
|
||||
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
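Before starting, a defensive sanity check of the env-var contract above might look like this (a sketch, not part of the documented workflow):

```bash
# Fail fast if any of the pre-loaded secrets are missing from the pod environment
for v in GITHUB_TOKEN WOODPECKER_API_TOKEN SLACK_WEBHOOK_URL; do
  [ -n "${!v:-}" ] || { echo "missing env var: $v" >&2; exit 1; }
done
```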
|
||||
|
||||
## NEVER Do
|
||||
|
|
@ -122,6 +118,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
|
|||
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
|
||||
4. If auto-detect fails, verify the repo exists:
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
|
||||
```
|
||||
|
|
@ -131,6 +128,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
|
|||
## Step 3: Fetch Changelogs via GitHub API
|
||||
|
||||
```bash
|
||||
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
|
||||
```
|
||||
|
|
@ -173,9 +171,11 @@ Scan all intermediate release notes for breaking change indicators from the conf
|
|||
## Step 5: Slack Notification — Starting
|
||||
|
||||
```bash
|
||||
SLACK_WEBHOOK=$(vault kv get -field=alertmanager_slack_api_url secret/platform)
|
||||
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
"$SLACK_WEBHOOK"
|
||||
```
|
||||
|
||||
For CAUTION risk, include breaking change excerpts in the Slack message.
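For example, a CAUTION-risk message might append the excerpt like this (`BREAKING_EXCERPT` is a hypothetical variable holding the scanned release notes):

```bash
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\nBreaking changes:\n${BREAKING_EXCERPT}\"}" \
  "$SLACK_WEBHOOK_URL"
```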
|
||||
|
|
@ -266,28 +266,23 @@ UPGRADE_SHA=$(git rev-parse HEAD)
|
|||
|
||||
## Step 9: Wait for Woodpecker CI
|
||||
|
||||
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19).
|
||||
|
||||
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
|
||||
The commit triggers the `app-stacks.yml` pipeline (or `default.yml` for platform stacks).
|
||||
|
||||
```bash
|
||||
# Find the pipeline for our commit
|
||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
|
||||
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
|
||||
# → $PIPELINE_NUMBER
|
||||
|
||||
# Fetch detail (includes workflows[])
|
||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
|
||||
| jq '.workflows[] | select(.name=="default") | .state'
|
||||
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
|
||||
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_token secret/viktor)
|
||||
```
|
||||
|
||||
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes.
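Putting the calls above together, a minimal polling loop might look like:

```bash
# Poll the default workflow every 30s; 30 iterations ≈ the 15-minute timeout
STATE="pending"
for i in $(seq 1 30); do
  STATE=$(curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
    "https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
    | jq -r '.workflows[] | select(.name=="default") | .state')
  case "$STATE" in success|failure|error|killed) break ;; esac
  sleep 30
done
echo "default workflow state: $STATE"
```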
|
||||
Poll for the pipeline triggered by our commit:
|
||||
```bash
|
||||
# Get latest pipeline
|
||||
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=5"
|
||||
```
|
||||
|
||||
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state.
|
||||
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
|
||||
Find the pipeline matching our commit SHA. Poll every 30 seconds until status is `success`, `failure`, `error`, or `killed`. Timeout after 15 minutes.
|
||||
|
||||
**If CI fails** → proceed to Step 10 (rollback).
|
||||
**If CI succeeds** → proceed to verification.
|
||||
|
||||
## Step 10: Verify
|
||||
|
||||
|
|
@ -346,7 +341,7 @@ Re-run verification checks to confirm rollback succeeded. If rollback verificati
|
|||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data '{"text":"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required."}' \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
"$SLACK_WEBHOOK"
|
||||
```
|
||||
|
||||
## Step 11: Report Results
|
||||
|
|
@ -355,14 +350,14 @@ curl -s -X POST -H 'Content-type: application/json' \
|
|||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
"$SLACK_WEBHOOK"
|
||||
```
|
||||
|
||||
### On failure + rollback
|
||||
```bash
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
|
||||
"$SLACK_WEBHOOK_URL"
|
||||
"$SLACK_WEBHOOK"
|
||||
```
|
||||
|
||||
## Edge Cases
|
||||
|
|
|
|||
.claude/cluster-health.sh — executable file, 1728 lines (file diff suppressed because it is too large)
|
|
@ -119,18 +119,3 @@ Removed bindings from:
|
|||
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
|
||||
|
||||
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
|
||||
|
||||
## Session Duration (2026-05-01)
|
||||
|
||||
Pinned via Terraform in `stacks/authentik/`:
|
||||
|
||||
| Knob | Value | Surface | Effect |
|
||||
|------|-------|---------|--------|
|
||||
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
|
||||
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
|
||||
|
||||
Notes:
|
||||
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
|
||||
- Embedded outpost session storage moved from `/dev/shm` → Postgres table `authentik_providers_proxy_proxysession` in authentik 2025.10. The 2026-04-18 `/dev/shm`-fill outage class is no longer load-bearing in 2026.2.2; the `unauthenticated_age` cap is still the right lever for anonymous-session bloat from external monitors.
|
||||
- `ProxyProvider.access_token_validity` and `remember_me_offset` stay UI-managed via `ignore_changes`.
|
||||
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
|
||||
|
|
|
|||
|
|
@ -26,12 +26,11 @@ module "nfs_data" {
|
|||
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
|
||||
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
|
||||
|
||||
## Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||||
## Anti-AI Scraping (5-Layer Defense)
|
||||
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
|
||||
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Tarpit/poison content (standalone at poison.viktorbarzin.me)
|
||||
Trap links (formerly layer 3) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
|
||||
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
|
||||
Key files: `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/platform/modules/traefik/middleware.tf`
|
||||
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Trap links before `</body>`
|
||||
4. Tarpit (~100 bytes/sec) 5. Poison content (CronJob every 6h, `--http1.1` required)
|
||||
Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`
|
||||
|
||||
## Terragrunt Architecture
|
||||
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
|
||||
|
|
|
|||
|
|
@ -122,9 +122,8 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
|
|||
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
|
||||
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
|
||||
|
||||
## GPU Node (currently k8s-node1)
|
||||
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
|
||||
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
|
||||
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
|
||||
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
|
||||
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
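To confirm which host currently carries the card and that the taint and labels followed it:

```bash
# Find the GPU node via the GFD / NFD labels mentioned above, then inspect its taints
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
kubectl describe node <gpu-node> | grep -A2 Taints
```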
|
||||
## GPU Node (k8s-node1)
|
||||
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
|
||||
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
|
||||
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
|
||||
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
|
||||
|
|
|
|||
|
|
@ -19,7 +19,7 @@
|
|||
| Service | Description | Stack |
|
||||
|---------|-------------|-------|
|
||||
| vaultwarden | Bitwarden-compatible password manager | platform |
|
||||
| redis | Shared Redis 8.x via HAProxy at `redis-master.redis.svc.cluster.local` — 3-pod raw StatefulSet `redis-v2` (redis+sentinel+exporter per pod), quorum=2. Clients use HAProxy only, no sentinel fallback. | redis |
|
||||
| redis | Shared Redis at `redis.redis.svc.cluster.local` | redis |
|
||||
| immich | Photo management (GPU) | immich |
|
||||
| nvidia | GPU device plugin | nvidia |
|
||||
| metrics-server | K8s metrics | metrics-server |
|
||||
|
|
@ -45,8 +45,7 @@
|
|||
| nextcloud | File sync/share | nextcloud |
|
||||
| calibre | E-book management (may be merged into ebooks stack) | calibre |
|
||||
| onlyoffice | Document editing | onlyoffice |
|
||||
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier) | f1-stream |
|
||||
| chrome-service | Headed Chromium WebSocket pool (`ws://chrome-service.chrome-service.svc:3000/<token>`) for sibling services driving anti-bot embeds | chrome-service |
|
||||
| f1-stream | F1 streaming | f1-stream |
|
||||
| rybbit | Analytics | rybbit |
|
||||
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
|
||||
| actualbudget | Budgeting (factory pattern) | actualbudget |
|
||||
|
|
@ -138,18 +137,3 @@ jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
|
|||
- `*.viktor.actualbudget` - Actualbudget factory instances
|
||||
- `*.freedify` - Freedify factory instances
|
||||
- `mailserver.*` - Mail server components (antispam, admin)
|
||||
|
||||
## Key Runbooks
|
||||
|
||||
Operational surfaces that aren't k8s services (VMs, pipelines, host-side
|
||||
procedures) are documented in `infra/docs/runbooks/`:
|
||||
|
||||
| Surface | Runbook |
|
||||
|---|---|
|
||||
| Private Docker registry VM (10.0.20.10) | [registry-vm.md](../../docs/runbooks/registry-vm.md) |
|
||||
| Rebuild after orphan-index incident | [registry-rebuild-image.md](../../docs/runbooks/registry-rebuild-image.md) |
|
||||
| PVE host operations (backups, LVM) | [proxmox-host.md](../../docs/runbooks/proxmox-host.md) |
|
||||
| NFS prerequisites and CSI mount options | [nfs-prerequisites.md](../../docs/runbooks/nfs-prerequisites.md) |
|
||||
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
|
||||
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
|
||||
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
|
||||
|
|
|
|||
.claude/skills/archived/setup-remote-executor.md — new file, 102 lines
|
|
@ -0,0 +1,102 @@
|
|||
# Setup Shared Remote Executor
|
||||
|
||||
Skill for setting up Claude Code's shared remote executor in new projects.
|
||||
|
||||
## When to Use
|
||||
- When adding Claude Code support to a new project
|
||||
- When the user says "set up remote executor for this project"
|
||||
- When working on a new project that needs remote command execution
|
||||
|
||||
## Prerequisites
|
||||
- Shared executor already deployed at `~/.claude/` on wizard@10.0.10.10
|
||||
- Project accessible via NFS from both macOS and the remote VM
|
||||
|
||||
## Setup Steps
|
||||
|
||||
### 1. Create .claude Directory
|
||||
```bash
|
||||
mkdir -p .claude/sessions
|
||||
```
|
||||
|
||||
### 2. Create session-exec.sh Wrapper
|
||||
Create `.claude/session-exec.sh` with the following content (adjust PROJECT_ROOT):
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Project-Local Session Helper - Wrapper for shared executor
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SHARED_SESSION_EXEC="/home/wizard/.claude/session-exec.sh"
|
||||
PROJECT_ROOT="/home/wizard/path/to/project" # UPDATE THIS
|
||||
|
||||
if [ -f "$SHARED_SESSION_EXEC" ]; then
|
||||
if [ "${1:-}" = "create" ] || [ -z "${1:-}" ]; then
|
||||
"$SHARED_SESSION_EXEC" create "$PROJECT_ROOT"
|
||||
else
|
||||
"$SHARED_SESSION_EXEC" "$@"
|
||||
fi
|
||||
else
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
SESSIONS_DIR="$SCRIPT_DIR/sessions"
|
||||
SESSION_ID="${1:-$(date +%s)-$$-$RANDOM}"
|
||||
ACTION="${2:-create}"
|
||||
SESSION_DIR="$SESSIONS_DIR/$SESSION_ID"
|
||||
|
||||
case "$ACTION" in
|
||||
create|init|"")
|
||||
mkdir -p "$SESSION_DIR"
|
||||
echo "ready" > "$SESSION_DIR/cmd_status.txt"
|
||||
echo "$PROJECT_ROOT" > "$SESSION_DIR/workdir.txt"
|
||||
> "$SESSION_DIR/cmd_input.txt"
|
||||
> "$SESSION_DIR/cmd_output.txt"
|
||||
echo "$SESSION_ID"
|
||||
;;
|
||||
cleanup|remove|delete)
|
||||
[ -d "$SESSION_DIR" ] && rm -rf "$SESSION_DIR"
|
||||
;;
|
||||
status)
|
||||
[ -d "$SESSION_DIR" ] && cat "$SESSION_DIR/cmd_status.txt"
|
||||
;;
|
||||
list)
|
||||
[ -d "$SESSIONS_DIR" ] && ls -1 "$SESSIONS_DIR" 2>/dev/null
|
||||
;;
|
||||
esac
|
||||
fi
|
||||
```
|
||||
|
||||
Make executable: `chmod +x .claude/session-exec.sh`
|
||||
|
||||
### 3. Link Sessions Directory (on remote VM)
|
||||
Run on the remote VM to add project sessions to the shared executor:
|
||||
|
||||
```bash
|
||||
# Option A: Symlink project sessions (if using project-local sessions)
|
||||
ln -sfn /path/to/project/.claude/sessions ~/.claude/sessions
|
||||
|
||||
# Option B: Use shared sessions (all projects share one directory)
|
||||
# Just ensure ~/.claude/sessions exists
|
||||
```
|
||||
|
||||
### 4. Create CLAUDE.md
|
||||
Add execution instructions to `.claude/CLAUDE.md`:
|
||||
|
||||
```markdown
|
||||
## Remote Command Execution
|
||||
Uses shared executor at `~/.claude/` on wizard@10.0.10.10.
|
||||
|
||||
### Usage
|
||||
\```bash
|
||||
SESSION_ID=$(.claude/session-exec.sh)
|
||||
echo "command" > .claude/sessions/$SESSION_ID/cmd_input.txt
|
||||
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt
|
||||
cat .claude/sessions/$SESSION_ID/cmd_output.txt
|
||||
\```
|
||||
|
||||
Start executor: `~/.claude/remote-executor.sh` (on remote VM)
|
||||
```
|
||||
|
||||
## Shared Executor Location
|
||||
- Scripts: `~/.claude/remote-executor.sh`, `~/.claude/session-exec.sh`
|
||||
- Sessions: `~/.claude/sessions/`
|
||||
- Remote VM: wizard@10.0.10.10
|
||||
|
|
@ -7,314 +7,339 @@ description: |
|
|||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||
(4) User mentions "health check", "cluster status", "cluster health",
|
||||
(5) User asks "is everything running" or "any problems".
|
||||
Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability) with safe auto-fix for evicted pods.
|
||||
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
|
||||
and stuck CrashLoopBackOff pods.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-04-19
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# Cluster Health Check
|
||||
|
||||
## MANDATORY: Run the script first
|
||||
## Overview
|
||||
|
||||
When this skill is invoked, your **first action** must be to run the
|
||||
cluster health check script and reason over its output before doing
|
||||
anything else. Do not improvise individual `kubectl` calls — the
|
||||
script is the authoritative surface.
|
||||
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
|
||||
- **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
|
||||
- **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
|
||||
- **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
|
||||
- **Exit code**: 0 = healthy, 1 = issues found
|
||||
|
||||
## Quick Check
|
||||
|
||||
Run the health check interactively:
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code
|
||||
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
|
||||
# Report only, no Slack notification
|
||||
bash /workspace/infra/.claude/cluster-health.sh --no-slack
|
||||
|
||||
# Full run with Slack notification
|
||||
bash /workspace/infra/.claude/cluster-health.sh
|
||||
|
||||
# Report only, no auto-fix and no Slack
|
||||
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
|
||||
```
|
||||
|
||||
If the session is rooted elsewhere, fall back to the absolute path:
|
||||
## What It Checks
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
|
||||
2. Iterate every FAIL and WARN check, describe what tripped, and propose
|
||||
the remediation path (use the recipes below).
|
||||
3. Only reach for ad-hoc `kubectl` commands when investigating a
|
||||
specific failure beyond what the script reported.
|
||||
|
||||
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
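For scripted callers, the exit code alone gives the verdict (flags as documented below):

```bash
bash infra/scripts/cluster_healthcheck.sh --json --quiet > /tmp/cluster-health.json
case $? in
  0) echo "healthy" ;;
  1) echo "warnings only" ;;
  2) echo "failures present" ;;
esac
```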
|
||||
|
||||
## Quick flags
|
||||
|
||||
```bash
|
||||
# Human-readable report (default), no auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh
|
||||
|
||||
# Machine-readable JSON summary
|
||||
bash infra/scripts/cluster_healthcheck.sh --json
|
||||
|
||||
# Only show WARN + FAIL (suppress PASS noise)
|
||||
bash infra/scripts/cluster_healthcheck.sh --quiet
|
||||
|
||||
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
|
||||
bash infra/scripts/cluster_healthcheck.sh --fix
|
||||
|
||||
# Combined: quiet JSON without auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
||||
|
||||
# Custom kubeconfig
|
||||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
||||
```
|
||||
|
||||
## What It Checks (42 checks)
|
||||
|
||||
| # | Check | Notes |
|
||||
|---|-------|-------|
|
||||
| 1 | Node Status | NotReady nodes, version drift |
|
||||
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
|
||||
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
|
||||
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
|
||||
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
|
||||
| 6 | DaemonSets | desired == ready |
|
||||
| 7 | Deployments | ready == desired replicas |
|
||||
| 8 | PVC Status | all Bound |
|
||||
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
|
||||
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
|
||||
| 11 | CrowdSec Agents | all pods Running |
|
||||
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
|
||||
| 13 | Prometheus Alerts | count of firing alerts |
|
||||
| 14 | Uptime Kuma Monitors | internal + external monitors up |
|
||||
| 15 | ResourceQuota Pressure | any quota >80% used |
|
||||
| 16 | StatefulSets | ready == desired |
|
||||
| 17 | Node Disk Usage | ephemeral-storage <80% |
|
||||
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
|
||||
| 19 | Kyverno Policy Engine | all pods Running |
|
||||
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
|
||||
| 21 | DNS Resolution | Technitium resolves internal + external |
|
||||
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
|
||||
| 23 | GPU Health | nvidia namespace + device-plugin Running |
|
||||
| 24 | Cloudflare Tunnel | pods Running |
|
||||
| 25 | Resource Usage | node CPU/mem headroom |
|
||||
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
|
||||
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
|
||||
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
|
||||
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
|
||||
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
|
||||
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
|
||||
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
|
||||
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
|
||||
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
|
||||
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
|
||||
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
|
||||
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
|
||||
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
|
||||
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
|
||||
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
|
||||
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
|
||||
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
|
||||
| # | Check | Auto-Fix | Alerts |
|
||||
|---|-------|----------|--------|
|
||||
| 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
|
||||
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
|
||||
| 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
|
||||
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
|
||||
| 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
|
||||
| 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
|
||||
| 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
|
||||
| 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |
|
||||
|
||||
## Safe Auto-Fix Rules
|
||||
|
||||
`--fix` only performs operations that are genuinely reversible and
|
||||
observable. Nothing here rewrites Terraform state or mutates the cluster
|
||||
beyond "delete pod".
|
||||
### Safe to auto-fix (the script does these automatically)
|
||||
|
||||
### Done automatically by `--fix`
|
||||
1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
|
||||
- **Evicted / Failed pods** — delete them; the controller recreates.
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
|
||||
backoff timer.
|
||||
2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:
|
||||
```bash
|
||||
kubectl delete pod -n <namespace> <pod-name> --grace-period=0
|
||||
```
|
||||
|
||||
### NEVER auto-fix (requires human investigation)
|
||||
|
||||
- NotReady nodes
|
||||
- MemoryPressure / DiskPressure / PIDPressure
|
||||
- ImagePullBackOff (usually a bad tag / registry credential)
|
||||
- Deployment ready-replica mismatch
|
||||
- Pending PVCs
|
||||
- Node CPU/memory >90%
|
||||
- CronJob failures
|
||||
- DaemonSet desired != ready
|
||||
- Vault sealed
|
||||
- ClusterSecretStore not Ready
|
||||
- cert-manager Certificate failures
|
||||
- Backup freshness regressions
|
||||
- Any external-reachability failure
|
||||
- **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
|
||||
- **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
|
||||
- **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
|
||||
- **Failed deployments** — Could be resource limits, bad config, missing secrets
|
||||
- **Pending PVCs** — Usually NFS export missing or storage class issue
|
||||
- **Resource pressure >90%** — Need to identify which pods are consuming resources
|
||||
- **CronJob failures** — Need to check job logs to understand why it failed
|
||||
- **DaemonSet issues** — Could be node taints, resource limits, or image issues
|
||||
|
||||
## Deep-investigation recipes per failure mode
|
||||
## Deep Investigation
|
||||
|
||||
### Node Issues (checks 1, 3, 17, 25)
|
||||
When the health check reports issues, use these commands to investigate further.
|
||||
|
||||
### Node Issues
|
||||
|
||||
```bash
|
||||
kubectl describe node <node>
|
||||
# Describe the problematic node (events, conditions, capacity)
|
||||
kubectl describe node <node-name>
|
||||
|
||||
# Check resource usage across all nodes
|
||||
kubectl top nodes
|
||||
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
|
||||
# SSH to the node
|
||||
ssh root@10.0.20.10X
|
||||
|
||||
# Check recent events on a specific node
|
||||
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
|
||||
|
||||
# SSH to the node for direct inspection
|
||||
ssh root@<node-ip>
|
||||
systemctl status kubelet
|
||||
journalctl -u kubelet --since "30 minutes ago" | tail -100
|
||||
df -h ; free -h
|
||||
df -h
|
||||
free -h
|
||||
```
|
||||
|
||||
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
|
||||
`.103` node3, `.104` node4.
|
||||
|
||||
### Pod Issues (checks 4, 5, 11, 19)

```bash
# Describe the pod (events, conditions, container statuses)
kubectl describe pod -n <namespace> <pod-name>

# Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100

# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide
```

Common failure causes: OOMKilled (raise the memory limit in Terraform), bad config / missing env var, DB connection failure (check `dbaas` pods), NFS mount failure (`showmount -e 192.168.1.127`), stale imagePullSecret.

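To spot OOMKilled containers across a namespace without describing each pod, one possible jsonpath sweep (a sketch; the path follows the standard pod status fields):

```bash
# Print each pod with its containers' last termination reason (e.g. OOMKilled, Error)
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```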
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)

```bash
# Describe the deployment (strategy, conditions, events)
kubectl describe deployment -n <namespace> <deployment-name>

# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>

# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>

# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>
```

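For the ready-replica mismatch check, it can help to list every deployment whose READY column disagrees with the desired count. A small sketch over the default `kubectl get deploy -A` output:

```bash
# Deployments where ready != desired (READY is column 3 in namespace-wide output)
kubectl get deploy -A --no-headers | awk '{split($3, r, "/"); if (r[1] != r[2]) print $1"/"$2" ready="$3}'
```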
### PVC (check 8)

```bash
# Describe the PVC (events, status, storage class)
kubectl describe pvc -n <namespace> <pvc-name>

# Check PVs
kubectl get pv | grep <pvc-name>

# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'

# Verify the NFS export exists on the Proxmox host
showmount -e 192.168.1.127 | grep <service-name>
```

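A quick way (sketch) to list every PVC that is not Bound cluster-wide before describing individual ones:

```bash
# STATUS is column 3 in `kubectl get pvc -A` output
kubectl get pvc -A --no-headers | awk '$3 != "Bound" {print $1"/"$2" status="$3}'
```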
### Resource Pressure

```bash
# Top nodes (CPU and memory usage)
kubectl top nodes

# Top pods sorted by memory (cluster-wide)
kubectl top pods -A --sort-by=memory | head -20

# Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20

# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
```

### cert-manager (checks 31, 32, 33)

```bash
kubectl get certificate -A
kubectl describe certificate -n <ns> <name>
kubectl get certificaterequest -A
kubectl describe certificaterequest -n <ns> <name>
kubectl logs -n cert-manager deploy/cert-manager | tail -50
```

Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing its DNS provider secret, rate-limit from Let's Encrypt.

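To confirm whether the HTTP-01 challenge path is even reachable from outside, a hedged check (the hostname is whatever the failing Certificate covers; the token path is illustrative, and a Cloudflare-proxied host may behave differently):

```bash
# A 404 from the ingress means the path is reachable; a timeout or block page points at the ingress/CDN layer
curl -sv http://<hostname>/.well-known/acme-challenge/probe -o /dev/null 2>&1 | grep -E 'HTTP/|Connected'
```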
### Backups (checks 34, 35, 36)

```bash
# Per-DB dumps (inside the DB pod)
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/

# Pushgateway metrics
kubectl exec -n monitoring deploy/prometheus-server -- \
  wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
  grep backup_last_success_timestamp

# LVM snapshots on PVE host
ssh -o BatchMode=yes root@192.168.1.127 \
  'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
```

If offsite sync is stale, the common cause is the `offsite-sync-backup.service` systemd unit on the PVE host failing: `ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.

### Persistent CrashLoopBackOff after auto-fix

A pod keeps crashing even after the auto-fix deletes it.

1. **Check logs from the crashed container**:
   ```bash
   kubectl logs -n <namespace> <pod-name> --previous --tail=200
   ```

2. **Check the pod description for clues**:
   ```bash
   kubectl describe pod -n <namespace> <pod-name>
   ```
   Look for:
   - `OOMKilled` in Last State — the container ran out of memory
   - `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
   - `Error` with exit code 137 — killed by the OOM killer or a liveness probe
   - `Error` with exit code 143 — SIGTERM (graceful shutdown failure)

3. **Common causes**:
   - **OOMKilled**: Increase memory limits in Terraform (see the OOMKilled remediation below)
   - **Bad config**: Check environment variables, secrets, config maps
   - **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
   - **NFS mount failure**: Verify the NFS export exists (`showmount -e 192.168.1.127`)
   - **Missing secret**: Check whether the TLS secret or other required secrets exist in the namespace

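One extra check that often pays off for the bad-config / missing-secret causes: list every Secret and ConfigMap the pod actually references, then confirm each exists. A sketch using standard pod spec fields (not part of the skill script):

```bash
# Secrets/ConfigMaps mounted as volumes
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{range .spec.volumes[*]}{.secret.secretName}{" "}{.configMap.name}{" "}{end}{"\n"}'
# Secrets/ConfigMaps pulled in via envFrom
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{range .spec.containers[*].envFrom[*]}{.secretRef.name}{" "}{.configMapRef.name}{" "}{end}{"\n"}'
```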
### Monitoring stack (checks 37, 38, 39)

```bash
# Prometheus
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
kubectl logs -n monitoring deploy/prometheus-server --tail=100

# Alertmanager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100

# Vault
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.

# ClusterSecretStore
kubectl get clustersecretstore
kubectl describe clustersecretstore vault-kv vault-database
kubectl logs -n external-secrets deploy/external-secrets --tail=100
```

### OOMKilled remediation

The container was killed because it exceeded its memory limit.

1. **Check current limits**:
   ```bash
   kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
   ```

2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:
   ```hcl
   resources {
     limits = {
       memory = "2Gi" # Increase from current value
     }
   }
   ```

3. **Apply the change**:
   ```bash
   cd /workspace/infra
   terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
   ```
   From the dev VM, `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or `terraform apply -target=module.<service>` as appropriate.

### External reachability (checks 40, 41, 42)

```bash
# Cloudflared
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100

# Authentik
kubectl get pods -n authentik -l app=authentik-server
kubectl logs -n authentik -l app=authentik-server --tail=100

# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | \
  python3 -m json.tool | grep -A 5 ExternalAccessDivergence

# Traefik 5xx — find the hot service
kubectl exec -n monitoring deploy/prometheus-server -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
  | python3 -m json.tool
```

### ImagePullBackOff remediation

The container image cannot be pulled.

1. **Check the exact error**:
   ```bash
   kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
   ```

2. **Common causes**:
   - **Wrong image tag**: Verify the tag exists on the source registry (Docker Hub, ghcr.io, etc.)
   - **Private registry without credentials**: Check that imagePullSecrets are configured
   - **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry
     ```bash
     # Check pull-through cache ports:
     # 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
     curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
     ```
   - **Registry rate limit**: Docker Hub's free tier has pull limits; the pull-through cache helps avoid this

3. **Fix**: Update the image tag in the service's Terraform module and re-apply.

### Node NotReady

A node has gone NotReady.

1. **Check node conditions**:
   ```bash
   kubectl describe node <node-name> | grep -A 20 "Conditions"
   ```

2. **SSH to the node and check kubelet**:
   ```bash
   ssh root@<node-ip>
   systemctl status kubelet
   journalctl -u kubelet --since "10 minutes ago" | tail -50
   ```

3. **Check resources**:
   ```bash
   # On the node
   df -h    # Disk space
   free -h  # Memory
   top -bn1 # CPU/processes
   ```

4. **Node IPs** (for SSH):
   - `10.0.20.100` — k8s-master
   - `10.0.20.101` — k8s-node1 (GPU)
   - `10.0.20.102` — k8s-node2
   - `10.0.20.103` — k8s-node3
   - `10.0.20.104` — k8s-node4

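If the node needs a reboot or hands-on maintenance once the cause is found, the usual cordon/drain/uncordon cycle applies. A sketch with the standard kubectl flags (tune them to the workload):

```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
# ... reboot / fix the node ...
kubectl uncordon <node-name>
```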
## Slack Webhook

The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:

- All clear: green checkmark with node/pod count
- Warnings only: warning icon with details
- Issues found: red alert icon with auto-fixes applied and remaining issues

The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.

## Infrastructure

| Component | Path / Location |
|-----------|----------------|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
| Infra repo (in pod) | `/workspace/infra` |
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |

## Notes on the canonical / hardlink setup

The authoritative copy of this SKILL.md lives at `/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md` points to the same inode so infra-rooted sessions also discover the skill.

To verify the hardlink is intact:

```bash
stat -c '%i %n' \
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
  /home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```

Both should print the same inode number. If they diverge (e.g. `git checkout` replaced the file rather than updating it), re-link:

```bash
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
  /home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```

## Auto-File Incidents for SEV1/SEV2

After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:

### Severity Classification

- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
- **SEV3**: Warnings only, resource pressure <90%, cosmetic — do NOT auto-file

### Workflow

1. **Dedup check**: Before filing, query open incidents:
   ```bash
   GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
   curl -s -H "Authorization: token $GITHUB_TOKEN" \
     "https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
   ```
   If an open issue already covers the same service/namespace, **skip filing**.

2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required` (see the sketch after this list):
   - Title: `[AUTO] <Service/Namespace> — <brief symptom>`
   - Body: full diagnostic dump (pod status, events, alerts, node state)
   - The issue-automation GHA workflow will trigger the post-mortem pipeline automatically

3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
   ```bash
   # Comment and close
   curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
     -d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
   curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
     "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
     -d '{"state": "closed"}'
   ```

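A minimal sketch of the filing call in step 2, using the same token and repo as above (title and body are placeholders to be filled from the diagnostic output):

```bash
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues" \
  -d '{
    "title": "[AUTO] <service>/<namespace> — <brief symptom>",
    "body": "<full diagnostic dump>",
    "labels": ["incident", "sev2", "postmortem-required"]
  }'
```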
## Post-Mortem Auto-Suggest

After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:

> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.

This ensures incidents are documented while context is fresh.

## Notes

1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
4. All Terraform changes must go through the `.tf` files — never use `kubectl apply/edit/patch` for persistent changes
@ -155,19 +155,3 @@ Common port is 80. Exceptions:

3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
4. Homepage dashboard widget slug: `cluster-internal`
5. Cloudflare-proxied at `uptime.viktorbarzin.me`

## Terraform-Managed Monitors

There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for declarative monitor management in this stack:

- **External HTTPS monitors** — auto-discovered from ingress annotations by the `external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via `uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
- **Internal monitors (DBs, non-HTTP)** — declared in the `local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf` and synced by the `internal-monitor-sync` CronJob. To add one, append to the list (provide `name`, `type`, `database_connection_string`, `database_password_vault_key`, `interval`, `retry_interval`, `max_retries`) and `scripts/tg apply`. The sync is idempotent — looks up by name, creates if missing, patches if drifted. Existing monitors keep their id and history.
5 .gitignore (vendored)
@ -65,11 +65,6 @@ state/infra/
|
|||
backend.tf
|
||||
providers.tf
|
||||
.terraform.lock.hcl
|
||||
cloudflare_provider.tf
|
||||
tiers.tf
|
||||
stacks/*/cloudflare_provider.tf
|
||||
stacks/*/tiers.tf
|
||||
stacks/*/terragrunt_rendered.json
|
||||
|
||||
# Kubernetes config (sensitive)
|
||||
config
|
||||
|
|
|
|||
|
|
@ -1,85 +1,38 @@
|
|||
# Build the CI tools Docker image used by all infra pipelines.
|
||||
# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
|
||||
# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.
|
||||
# Triggers on changes to ci/Dockerfile only (push to master).
|
||||
|
||||
when:
|
||||
- event: push
|
||||
branch: master
|
||||
path:
|
||||
include:
|
||||
- 'ci/Dockerfile'
|
||||
- event: manual
|
||||
event: push
|
||||
branch: master
|
||||
path:
|
||||
include:
|
||||
- 'ci/Dockerfile'
|
||||
|
||||
steps:
|
||||
- name: build-and-push
|
||||
image: woodpeckerci/plugin-docker-buildx
|
||||
settings:
|
||||
# Phase 4 of forgejo-registry-consolidation 2026-05-07 —
|
||||
# registry.viktorbarzin.me dropped, Forgejo is the only target.
|
||||
repo:
|
||||
- forgejo.viktorbarzin.me/viktor/infra-ci
|
||||
repo: registry.viktorbarzin.me:5050/infra-ci
|
||||
dockerfile: ci/Dockerfile
|
||||
context: ci/
|
||||
tags:
|
||||
- latest
|
||||
- "${CI_COMMIT_SHA:0:8}"
|
||||
platforms: linux/amd64
|
||||
registry: registry.viktorbarzin.me:5050
|
||||
logins:
|
||||
- registry: forgejo.viktorbarzin.me
|
||||
- registry: registry.viktorbarzin.me:5050
|
||||
username:
|
||||
from_secret: forgejo_user
|
||||
from_secret: registry_user
|
||||
password:
|
||||
from_secret: forgejo_push_token
|
||||
|
||||
# Post-push integrity check is now redundant with the every-15min
|
||||
# forgejo-integrity-probe in stacks/monitoring/, which walks
|
||||
# /v2/_catalog + HEADs every blob across the entire Forgejo registry.
|
||||
# If a corruption pattern emerges that the periodic probe misses,
|
||||
# restore a verify step similar to the pre-Phase-4 version (see
|
||||
# commit 49f4956f) but pointed at forgejo.viktorbarzin.me.
|
||||
|
||||
# Break-glass tarball: save the just-pushed infra-ci image to disk on the
|
||||
# registry VM (10.0.20.10) so we can `docker load` it back into a node
|
||||
# when Forgejo is unreachable. Pulls from Forgejo (the only registry now).
|
||||
# Best-effort — failure here doesn't fail the pipeline.
|
||||
# Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md.
|
||||
- name: breakglass-tarball
|
||||
image: alpine:3.20
|
||||
failure: ignore
|
||||
environment:
|
||||
REGISTRY_SSH_KEY:
|
||||
from_secret: registry_ssh_key
|
||||
FORGEJO_USER:
|
||||
from_secret: forgejo_user
|
||||
FORGEJO_PASS:
|
||||
from_secret: forgejo_push_token
|
||||
commands:
|
||||
- apk add --no-cache openssh-client
|
||||
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
|
||||
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
|
||||
- chmod 600 ~/.ssh/id_ed25519
|
||||
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
|
||||
- SHA=${CI_COMMIT_SHA:0:8}
|
||||
- |
|
||||
ssh -n -o BatchMode=yes root@10.0.20.10 "
|
||||
set -e
|
||||
mkdir -p /opt/registry/data/private/_breakglass
|
||||
IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA
|
||||
echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin
|
||||
docker pull \$IMAGE
|
||||
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz
|
||||
ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
|
||||
ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \
|
||||
| grep -v 'latest' | tail -n +6 | xargs -r rm -v
|
||||
ls -lh /opt/registry/data/private/_breakglass/
|
||||
"
|
||||
from_secret: registry_password
|
||||
|
||||
- name: slack
|
||||
image: curlimages/curl
|
||||
commands:
|
||||
- |
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \
|
||||
--data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
|
||||
"$SLACK_WEBHOOK" || true
|
||||
environment:
|
||||
SLACK_WEBHOOK:
|
||||
|
|
|
|||
|
|
@ -23,14 +23,6 @@ steps:
|
|||
username: viktorbarzin
|
||||
password:
|
||||
from_secret: dockerhub-pat
|
||||
# Private registry on :5050 requires htpasswd auth since 2026-03-22.
|
||||
# Without this, buildx pushes the second repo but blob HEAD comes
|
||||
# back 401 → pipeline fails → CI false-negative (see bd code-12b).
|
||||
- registry: registry.viktorbarzin.me:5050
|
||||
username:
|
||||
from_secret: registry_user
|
||||
password:
|
||||
from_secret: registry_password
|
||||
dockerfile: cli/Dockerfile
|
||||
context: cli
|
||||
auto_tag: true
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@ clone:
|
|||
|
||||
steps:
|
||||
- name: apply
|
||||
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
|
||||
image: registry.viktorbarzin.me:5050/infra-ci:latest
|
||||
pull: true
|
||||
backend_options:
|
||||
kubernetes:
|
||||
|
|
@ -37,12 +37,6 @@ steps:
|
|||
environment:
|
||||
SLACK_WEBHOOK:
|
||||
from_secret: slack_webhook
|
||||
# Each `- |` command runs in a fresh shell, so we can't rely on an
|
||||
# `export VAULT_ADDR=...` in the auth command persisting — pin it at
|
||||
# step level. VAULT_TOKEN is still per-command; we persist it to
|
||||
# ~/.vault-token (auto-read by `vault` CLI) so downstream commands
|
||||
# don't need explicit token propagation.
|
||||
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
|
||||
commands:
|
||||
# ── Skip CI commits ──
|
||||
- |
|
||||
|
|
@ -61,17 +55,9 @@ steps:
|
|||
# ── Vault auth ──
|
||||
- |
|
||||
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
|
||||
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
|
||||
export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
|
||||
export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
|
||||
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
|
||||
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
|
||||
exit 1
|
||||
fi
|
||||
# Persist for downstream `- |` blocks (each runs in a fresh shell,
|
||||
# so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
|
||||
# and `scripts/state-sync` all fall through to ~/.vault-token when
|
||||
# the env var is unset.
|
||||
umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
|
||||
|
||||
# ── Detect changed stacks ──
|
||||
- |
|
||||
|
|
@ -128,7 +114,7 @@ steps:
|
|||
# ── Pre-warm provider cache ──
|
||||
- |
|
||||
if [ -s .platform_apply ] || [ -s .app_apply ]; then
|
||||
FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1)
|
||||
FIRST_STACK=$(head -1 .platform_apply .app_apply 2>/dev/null | head -1)
|
||||
if [ -n "$FIRST_STACK" ]; then
|
||||
echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
|
||||
cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
|
||||
|
|
@ -137,7 +123,6 @@ steps:
|
|||
|
||||
# ── Apply platform stacks (serial, with Vault advisory locks) ──
|
||||
- |
|
||||
FAILED_PLATFORM_STACKS=""
|
||||
if [ -s .platform_apply ]; then
|
||||
echo "=== Applying platform stacks (serial, locked) ==="
|
||||
while read -r stack; do
|
||||
|
|
@ -150,9 +135,8 @@ steps:
|
|||
if echo "$OUTPUT" | grep -q "is locked by"; then
|
||||
echo "[$stack] SKIPPED (locked by another session)"
|
||||
else
|
||||
echo "$OUTPUT" | tail -50
|
||||
echo "$OUTPUT" | tail -5
|
||||
echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
|
||||
fi
|
||||
else
|
||||
echo "$OUTPUT" | tail -3
|
||||
|
|
@ -160,12 +144,9 @@ steps:
|
|||
fi
|
||||
done < .platform_apply
|
||||
fi
|
||||
# Deferred until after app stacks so both lists get a chance to run.
|
||||
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
|
||||
|
||||
# ── Apply app stacks (serial, with Vault advisory locks) ──
|
||||
- |
|
||||
FAILED_APP_STACKS=""
|
||||
if [ -s .app_apply ]; then
|
||||
echo "=== Applying app stacks (serial, locked) ==="
|
||||
while read -r stack; do
|
||||
|
|
@ -178,9 +159,8 @@ steps:
|
|||
if echo "$OUTPUT" | grep -q "is locked by"; then
|
||||
echo "[$stack] SKIPPED (locked by another session)"
|
||||
else
|
||||
echo "$OUTPUT" | tail -50
|
||||
echo "$OUTPUT" | tail -5
|
||||
echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
|
||||
fi
|
||||
else
|
||||
echo "$OUTPUT" | tail -3
|
||||
|
|
@ -188,15 +168,6 @@ steps:
|
|||
fi
|
||||
done < .app_apply
|
||||
fi
|
||||
# Fail the step loudly so the pipeline `default` workflow state
|
||||
# reflects reality — the service-upgrade agent and CI alert cascade
|
||||
# both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
|
||||
# counted as failures.
|
||||
FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
|
||||
if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
|
||||
echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# ── Commit and push state changes ──
|
||||
- |
|
||||
|
|
|
|||
|
|
@ -14,7 +14,7 @@ clone:
|
|||
|
||||
steps:
|
||||
- name: detect-drift
|
||||
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
|
||||
image: registry.viktorbarzin.me:5050/infra-ci:latest
|
||||
pull: true
|
||||
backend_options:
|
||||
kubernetes:
|
||||
|
|
@ -42,15 +42,10 @@ steps:
|
|||
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
|
||||
|
||||
# ── Run terraform plan on all stacks ──
|
||||
# Emits two timestamps per drifted stack so the Pushgateway/Prometheus
|
||||
# side can compute drift-age-hours via `time() - drift_stack_first_seen`.
|
||||
- |
|
||||
DRIFTED=""
|
||||
CLEAN=0
|
||||
ERRORS=""
|
||||
NOW=$(date +%s)
|
||||
# Metrics accumulator — written once per stack, then pushed as a batch.
|
||||
METRICS=""
|
||||
|
||||
for stack_dir in stacks/*/; do
|
||||
stack=$(basename "$stack_dir")
|
||||
|
|
@ -61,50 +56,12 @@ steps:
|
|||
EXIT=$?
|
||||
|
||||
case $EXIT in
|
||||
0)
|
||||
echo "OK (no changes)"
|
||||
CLEAN=$((CLEAN + 1))
|
||||
# drift_stack_state=0 means clean; age-hours irrelevant so we
|
||||
# still push 0 so per-stack gauges don't go stale.
|
||||
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 0\n"
|
||||
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} 0\n"
|
||||
;;
|
||||
1)
|
||||
echo "ERROR"
|
||||
ERRORS="$ERRORS $stack"
|
||||
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 2\n"
|
||||
;;
|
||||
2)
|
||||
echo "DRIFT DETECTED"
|
||||
DRIFTED="$DRIFTED $stack"
|
||||
# Fetch first-seen timestamp from Pushgateway (preserve across runs).
|
||||
FIRST_SEEN=$(curl -s "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics" \
|
||||
| awk -v s="$stack" '$1 == "drift_stack_first_seen{stack=\""s"\"}" {print $2; exit}')
|
||||
if [ -z "$FIRST_SEEN" ] || [ "$FIRST_SEEN" = "0" ]; then
|
||||
FIRST_SEEN="$NOW"
|
||||
fi
|
||||
AGE_HOURS=$(( (NOW - FIRST_SEEN) / 3600 ))
|
||||
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 1\n"
|
||||
METRICS="${METRICS}drift_stack_first_seen{stack=\"$stack\"} $FIRST_SEEN\n"
|
||||
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} $AGE_HOURS\n"
|
||||
;;
|
||||
0) echo "OK (no changes)"; CLEAN=$((CLEAN + 1)) ;;
|
||||
1) echo "ERROR"; ERRORS="$ERRORS $stack" ;;
|
||||
2) echo "DRIFT DETECTED"; DRIFTED="$DRIFTED $stack" ;;
|
||||
esac
|
||||
done
|
||||
|
||||
# Summary counters — single gauge per run.
|
||||
DRIFT_COUNT=$(echo "$DRIFTED" | wc -w)
|
||||
ERROR_COUNT=$(echo "$ERRORS" | wc -w)
|
||||
METRICS="${METRICS}drift_stack_count $DRIFT_COUNT\n"
|
||||
METRICS="${METRICS}drift_error_count $ERROR_COUNT\n"
|
||||
METRICS="${METRICS}drift_clean_count $CLEAN\n"
|
||||
METRICS="${METRICS}drift_detection_last_run_timestamp $NOW\n"
|
||||
|
||||
# ── Push to Pushgateway ──
|
||||
# One batched push keeps the run atomic: either all metrics land or none.
|
||||
printf "%b" "$METRICS" | curl -s --data-binary @- \
|
||||
http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drift-detection \
|
||||
|| echo "(pushgateway unavailable, metrics lost for this run)"
|
||||
|
||||
echo ""
|
||||
echo "=== Drift Detection Summary ==="
|
||||
echo "Clean: $CLEAN stacks"
|
||||
|
|
|
|||
|
|
@ -9,70 +9,52 @@ clone:
|
|||
|
||||
steps:
|
||||
- name: run-issue-responder
|
||||
image: alpine:3.20
|
||||
image: python:3.12-alpine
|
||||
commands:
|
||||
- apk add --no-cache curl jq
|
||||
- apk add --no-cache openssh-client curl jq
|
||||
# Authenticate to Vault via K8s SA JWT
|
||||
- |
|
||||
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
|
||||
VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
|
||||
-d "{\"role\":\"ci\",\"jwt\":\"$$SA_TOKEN\"}")
|
||||
VAULT_TOKEN=$(echo "$$VAULT_RESP" | jq -r .auth.client_token)
|
||||
if [ -z "$$VAULT_TOKEN" ] || [ "$$VAULT_TOKEN" = "null" ]; then
|
||||
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
|
||||
VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
|
||||
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Vault authentication failed"
|
||||
exit 1
|
||||
fi
|
||||
echo "Vault authenticated"
|
||||
# Fetch API token for claude-agent-service
|
||||
# Fetch DevVM SSH key
|
||||
- |
|
||||
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $$VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
|
||||
jq -r '.data.data.api_bearer_token')
|
||||
if [ -z "$$AGENT_TOKEN" ] || [ "$$AGENT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Failed to fetch agent API token"
|
||||
curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
|
||||
jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
|
||||
chmod 600 /tmp/devvm-key
|
||||
if [ ! -s /tmp/devvm-key ]; then
|
||||
echo "ERROR: Failed to fetch DevVM SSH key"
|
||||
exit 1
|
||||
fi
|
||||
echo "Agent token fetched"
|
||||
# Submit job to claude-agent-service
|
||||
echo "SSH key fetched"
|
||||
# SSH to DevVM and run issue-responder agent
|
||||
- |
|
||||
ISSUE_NUM="${ISSUE_NUMBER:-}"
|
||||
ISSUE_TITLE="${ISSUE_TITLE:-}"
|
||||
ISSUE_LABELS="${ISSUE_LABELS:-}"
|
||||
ISSUE_URL="${ISSUE_URL:-}"
|
||||
|
||||
if [ -z "$$ISSUE_NUM" ]; then
|
||||
if [ -z "$ISSUE_NUM" ]; then
|
||||
echo "ERROR: No issue number provided"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Processing issue #$$ISSUE_NUM: $$ISSUE_TITLE"
|
||||
echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
|
||||
echo "Labels: $ISSUE_LABELS"
|
||||
|
||||
PAYLOAD=$(jq -n \
|
||||
--arg prompt "Process GitHub Issue #$$ISSUE_NUM: $$ISSUE_TITLE. Labels: $$ISSUE_LABELS. URL: $$ISSUE_URL. Read the issue body via GitHub API, investigate, and take appropriate action." \
|
||||
--arg agent ".claude/agents/issue-responder" \
|
||||
'{prompt: $prompt, agent: $agent, max_budget_usd: 10, timeout_seconds: 1800}')
|
||||
|
||||
RESP=$(curl -sf -X POST \
|
||||
-H "Authorization: Bearer $$AGENT_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$$PAYLOAD" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
|
||||
|
||||
JOB_ID=$(echo "$$RESP" | jq -r '.job_id')
|
||||
echo "Job submitted: $$JOB_ID"
|
||||
# Poll for completion (30min max)
|
||||
- |
|
||||
for i in $(seq 1 120); do
|
||||
sleep 15
|
||||
RESULT=$(curl -sf \
|
||||
-H "Authorization: Bearer $$AGENT_TOKEN" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$$JOB_ID)
|
||||
STATUS=$(echo "$$RESULT" | jq -r '.status')
|
||||
echo "[$$i/120] Status: $$STATUS"
|
||||
if [ "$$STATUS" != "running" ]; then
|
||||
echo "$$RESULT" | jq .
|
||||
if [ "$$STATUS" = "completed" ]; then exit 0; else exit 1; fi
|
||||
fi
|
||||
done
|
||||
echo "ERROR: Job timed out after 30 minutes"
|
||||
exit 1
|
||||
ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
|
||||
"cd ~/code && git -C infra stash && git -C infra pull --rebase && git -C infra stash pop 2>/dev/null; \
|
||||
~/.local/bin/claude -p \
|
||||
--agent infra/.claude/agents/issue-responder \
|
||||
--dangerously-skip-permissions \
|
||||
--max-budget-usd 10 \
|
||||
'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
|
||||
# Cleanup
|
||||
- rm -f /tmp/devvm-key
|
||||
|
|
|
|||
|
|
@ -17,7 +17,7 @@ steps:
|
|||
- name: parse-and-implement
|
||||
image: python:3.12-alpine
|
||||
commands:
|
||||
- apk add --no-cache jq curl git
|
||||
- apk add --no-cache jq curl git openssh-client
|
||||
- sh scripts/postmortem-pipeline.sh
|
||||
|
||||
- name: notify-slack
|
||||
|
|
|
|||
|
|
@ -1,63 +0,0 @@
|
|||
# Sync infra/scripts/pve-nfs-exports → PVE host /etc/exports on change.
|
||||
#
|
||||
# Wave 6b of the state-drift consolidation plan: move the "scp + exportfs -ra"
|
||||
# deploy step out of runbook-human-hands and into CI so the Proxmox NFS export
|
||||
# table tracks git.
|
||||
#
|
||||
# Trigger: push to master that touches `scripts/pve-nfs-exports`. The file
|
||||
# header documents the deploy invocation; this pipeline codifies it.
|
||||
#
|
||||
# Credentials:
|
||||
# - pve_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
|
||||
# 2026-04-18 as `woodpecker-pve-nfs-exports-sync`). Public key lives in
|
||||
# /root/.ssh/authorized_keys on the PVE host. Private key mirrored in
|
||||
# Vault `secret/woodpecker/pve_ssh_key` for recovery.
|
||||
|
||||
when:
|
||||
- event: push
|
||||
branch: master
|
||||
path: scripts/pve-nfs-exports
|
||||
- event: manual
|
||||
|
||||
clone:
|
||||
git:
|
||||
image: woodpeckerci/plugin-git
|
||||
settings:
|
||||
depth: 1
|
||||
attempts: 3
|
||||
|
||||
steps:
|
||||
- name: deploy
|
||||
image: alpine:3.20
|
||||
environment:
|
||||
PVE_SSH_KEY:
|
||||
from_secret: pve_ssh_key
|
||||
SLACK_WEBHOOK:
|
||||
from_secret: slack_webhook
|
||||
commands:
|
||||
- apk add --no-cache openssh-client curl
|
||||
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
|
||||
- printf '%s\n' "$PVE_SSH_KEY" > ~/.ssh/id_ed25519
|
||||
- chmod 600 ~/.ssh/id_ed25519
|
||||
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
|
||||
- ssh-keyscan -t ed25519 192.168.1.127 >> ~/.ssh/known_hosts 2>/dev/null
|
||||
# Diff what we'd ship, so pipeline logs show the intended change.
|
||||
- echo '---diff---' && ssh -o BatchMode=yes root@192.168.1.127 "cat /etc/exports" > /tmp/remote.exports || true
|
||||
- diff -u /tmp/remote.exports scripts/pve-nfs-exports || true
|
||||
- echo '---applying---'
|
||||
- scp -o BatchMode=yes scripts/pve-nfs-exports root@192.168.1.127:/etc/exports
|
||||
- ssh -o BatchMode=yes root@192.168.1.127 "exportfs -ra && exportfs -s | head -5"
|
||||
- echo '---done---'
|
||||
|
||||
- name: slack
|
||||
image: curlimages/curl:8.11.0
|
||||
environment:
|
||||
SLACK_WEBHOOK:
|
||||
from_secret: slack_webhook
|
||||
commands:
|
||||
- |
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
|
||||
"$SLACK_WEBHOOK" || true
|
||||
when:
|
||||
status: [success, failure]
|
||||
|
|
@ -1,156 +0,0 @@
|
|||
# Sync modules/docker-registry/* → /opt/registry/ on docker-registry VM
|
||||
# (10.0.20.10) on change, and bounce containers + nginx when needed.
|
||||
#
|
||||
# Replaces the manual "ssh + scp + docker compose up -d" that was required
|
||||
# after the 2026-04-19 `registry:2 → registry:2.8.3` pin landed. The deploy
|
||||
# flow is now: edit a file in modules/docker-registry/ → git push → this
|
||||
# pipeline runs → registry VM picks up the change.
|
||||
#
|
||||
# Trigger: push to master that touches any managed file (see `when.path`),
|
||||
# or a manual run via Woodpecker UI / API.
|
||||
#
|
||||
# Credentials:
|
||||
# - registry_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
|
||||
# 2026-04-19 as `woodpecker-registry-config-sync`). Public key lives in
|
||||
# /root/.ssh/authorized_keys on 10.0.20.10. Private key mirrored in
|
||||
# Vault `secret/woodpecker/registry_ssh_key` (subkeys private_key /
|
||||
# public_key / known_hosts_entry) for recovery.
|
||||
#
|
||||
# Why bounce nginx every time: nginx caches upstream DNS at startup, so if
|
||||
# any registry-* container gets recreated (new IP on the docker bridge),
|
||||
# nginx keeps forwarding to a stale address. Always restart nginx as the
|
||||
# last step — see docs/runbooks/registry-vm.md § "Bouncing registry
|
||||
# containers — the nginx DNS trap".
|
||||
|
||||
when:
|
||||
- event: push
|
||||
branch: master
|
||||
path:
|
||||
include:
|
||||
- 'modules/docker-registry/docker-compose.yml'
|
||||
- 'modules/docker-registry/fix-broken-blobs.sh'
|
||||
- 'modules/docker-registry/cleanup-tags.sh'
|
||||
- 'modules/docker-registry/nginx_registry.conf'
|
||||
- 'modules/docker-registry/config-private.yml'
|
||||
- event: manual
|
||||
|
||||
clone:
|
||||
git:
|
||||
image: woodpeckerci/plugin-git
|
||||
settings:
|
||||
depth: 1
|
||||
attempts: 3
|
||||
|
||||
steps:
|
||||
- name: deploy
|
||||
image: alpine:3.20
|
||||
environment:
|
||||
REGISTRY_SSH_KEY:
|
||||
from_secret: registry_ssh_key
|
||||
commands:
|
||||
- apk add --no-cache openssh-client rsync
|
||||
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
|
||||
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
|
||||
- chmod 600 ~/.ssh/id_ed25519
|
||||
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
|
||||
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
|
||||
- echo '---detecting changed files---'
|
||||
- |
|
||||
# Mirror the remote state of each file so we can diff and decide what bounces.
|
||||
CHANGED=""
|
||||
for f in docker-compose.yml fix-broken-blobs.sh cleanup-tags.sh nginx_registry.conf config-private.yml; do
|
||||
LOCAL="modules/docker-registry/$f"
|
||||
REMOTE="/opt/registry/$f"
|
||||
if [ ! -f "$LOCAL" ]; then
|
||||
echo "skip $f (not in repo)"
|
||||
continue
|
||||
fi
|
||||
# Pull the remote copy into /tmp for a diff. ssh -n avoids stdin-hogging.
|
||||
REMOTE_CONTENT=$(ssh -n -o BatchMode=yes root@10.0.20.10 "cat $REMOTE 2>/dev/null || true")
|
||||
LOCAL_CONTENT=$(cat "$LOCAL")
|
||||
if [ "$LOCAL_CONTENT" = "$REMOTE_CONTENT" ]; then
|
||||
echo "unchanged: $f"
|
||||
else
|
||||
echo "---diff: $f ---"
|
||||
echo "$REMOTE_CONTENT" > /tmp/remote.txt
|
||||
diff -u /tmp/remote.txt "$LOCAL" | head -40 || true
|
||||
CHANGED="$CHANGED $f"
|
||||
fi
|
||||
done
|
||||
echo "CHANGED_FILES=$CHANGED"
|
||||
printf '%s' "$CHANGED" > /tmp/changed
|
||||
- echo '---applying---'
|
||||
- |
|
||||
CHANGED=$(cat /tmp/changed)
|
||||
if [ -z "$CHANGED" ]; then
|
||||
echo "No files changed — exiting cleanly (manual run with no drift)."
|
||||
exit 0
|
||||
fi
|
||||
# Ship every managed file unconditionally — scp is cheap, idempotency is safe.
|
||||
scp -o BatchMode=yes \
|
||||
modules/docker-registry/docker-compose.yml \
|
||||
modules/docker-registry/fix-broken-blobs.sh \
|
||||
modules/docker-registry/cleanup-tags.sh \
|
||||
modules/docker-registry/nginx_registry.conf \
|
||||
modules/docker-registry/config-private.yml \
|
||||
root@10.0.20.10:/opt/registry/
|
||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
||||
chmod +x /opt/registry/fix-broken-blobs.sh /opt/registry/cleanup-tags.sh
|
||||
'
|
||||
- echo '---bouncing containers + nginx---'
|
||||
- |
|
||||
CHANGED=$(cat /tmp/changed)
|
||||
# Compose-visible files: docker-compose.yml (image tag, mounts) and
|
||||
# config-private.yml (registry config → needs registry-private reload).
|
||||
BOUNCE_COMPOSE=0
|
||||
BOUNCE_NGINX=0
|
||||
echo "$CHANGED" | grep -q "docker-compose.yml" && BOUNCE_COMPOSE=1
|
||||
echo "$CHANGED" | grep -q "config-private.yml" && BOUNCE_COMPOSE=1
|
||||
echo "$CHANGED" | grep -q "nginx_registry.conf" && BOUNCE_NGINX=1
|
||||
|
||||
if [ "$BOUNCE_COMPOSE" = "1" ]; then
|
||||
echo "compose-visible change → pull + up -d"
|
||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
||||
cd /opt/registry
|
||||
docker compose pull 2>&1 | tail -5
|
||||
docker compose up -d 2>&1 | tail -20
|
||||
'
|
||||
# Any compose recreate requires nginx DNS refresh too.
|
||||
BOUNCE_NGINX=1
|
||||
fi
|
||||
|
||||
if [ "$BOUNCE_NGINX" = "1" ]; then
|
||||
echo "bouncing nginx to flush upstream DNS cache"
|
||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
||||
docker restart registry-nginx
|
||||
sleep 3
|
||||
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" | grep -E "registry-"
|
||||
'
|
||||
fi
|
||||
|
||||
if [ "$BOUNCE_COMPOSE" = "0" ] && [ "$BOUNCE_NGINX" = "0" ]; then
|
||||
echo "only script files changed (cron-picks-up semantics) — no bounce needed"
|
||||
fi
|
||||
- echo '---verify---'
|
||||
- |
|
||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
||||
echo "=== catalog ==="
|
||||
# Prove auth + routing survived.
|
||||
curl -sk -o /dev/null -w "catalog (unauth → 401 expected): HTTP %{http_code}\n" \
|
||||
https://127.0.0.1:5050/v2/
|
||||
echo "=== integrity scan (dry-run) ==="
|
||||
python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | tail -5
|
||||
'
|
||||
|
||||
- name: slack
|
||||
image: curlimages/curl:8.11.0
|
||||
environment:
|
||||
SLACK_WEBHOOK:
|
||||
from_secret: slack_webhook
|
||||
commands:
|
||||
- |
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
|
||||
"$SLACK_WEBHOOK" || true
|
||||
when:
|
||||
status: [success, failure]
|
||||
91 AGENTS.md
|
|
@ -15,49 +15,6 @@
|
|||
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
||||
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
|
||||
|
||||
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
|
||||
|
||||
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do **NOT** use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
|
||||
|
||||
**Canonical workflow:**
|
||||
|
||||
1. Write the `resource` block that matches the live object.
|
||||
2. In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:
|
||||
```hcl
|
||||
import {
|
||||
to = helm_release.kured
|
||||
id = "kured/kured" # Helm ID format: <namespace>/<release-name>
|
||||
}
|
||||
|
||||
resource "helm_release" "kured" {
|
||||
name = "kured"
|
||||
namespace = "kured"
|
||||
repository = "https://kubereboot.github.io/charts/"
|
||||
chart = "kured"
|
||||
version = "5.7.0"
|
||||
# ... values matching the live release
|
||||
}
|
||||
```
|
||||
3. `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is **0 changes**.
|
||||
4. `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
|
||||
5. After the apply lands cleanly, **delete the `import {}` block** in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
|
||||
|
||||
**Why `import {}` and not `terraform import`:**
|
||||
|
||||
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
|
||||
- Plan-safe: the `import` plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
|
||||
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
|
||||
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
|
||||
|
||||
**Finding the provider-specific ID:** each provider has its own convention.
|
||||
| Resource | ID format | Example |
|
||||
|---|---|---|
|
||||
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
|
||||
| `kubernetes_manifest` | `{"apiVersion":"...","kind":"...","metadata":{"namespace":"...","name":"..."}}` | (pass as HCL object literal) |
|
||||
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
|
||||
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
|
||||
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
|
||||
|
||||
## Secrets Management (SOPS)
|
||||
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
|
||||
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
|
||||
|
|
@ -99,13 +56,13 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
|||
- `config.tfvars` — non-secret configuration (plaintext)
|
||||
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
|
||||
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
|
||||
- `scripts/cluster_healthcheck.sh` — 42-check cluster health script (nodes, workloads, monitoring, certs, backups, external reachability)
|
||||
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script
|
||||
|
||||
## Storage
|
||||
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
|
||||
- **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
|
||||
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
|
||||
- **NFS server**: Proxmox host at 192.168.1.127 (sole NFS). HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (VM 9000, 10.0.10.15) decommissioned 2026-04-13. Legacy `nfs-truenas` StorageClass name retained (48 PVs bind it; SC names are immutable on PVs) but now points to the Proxmox host, identical to `nfs-proxmox`.
|
||||
- **NFS server**: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
|
||||
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
|
||||
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
|
||||
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
|
||||
|
|
@ -113,47 +70,11 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
|||
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
|
||||
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
|
||||
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
|
||||
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`).
|
||||
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). `truenas/` renamed to `nfs/`, `pve-backup/nfs-mirror/` removed.
|
||||
|
||||
## Shared Variables (never hardcode)
|
||||
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
|
||||
|
||||
## Redis Service Naming (read before wiring a new consumer)
|
||||
|
||||
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
|
||||
|
||||
| Endpoint | Port(s) | Use for | Backed by |
|
||||
|----------|---------|---------|-----------|
|
||||
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
|
||||
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
|
||||
| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
|
||||
|
||||
**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
|
||||
|
||||
**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
|
||||
|
||||
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
|
||||
|
||||
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
|
||||
|
||||
**Rule**: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following `lifecycle` block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
|
||||
|
||||
```hcl
|
||||
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
|
||||
# kubernetes_cron_job_v1 (extra job_template nesting)
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
```
|
||||
|
||||
**Why not a shared module?** Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).
|
||||
|
||||
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
|
||||
|
||||
## Tier System
|
||||
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
|
||||
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
|
||||
|
|
@ -163,10 +84,10 @@ lifecycle {
|
|||
## Infrastructure
|
||||
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
|
||||
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
|
||||
- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
|
||||
- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
|
||||
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
|
||||
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
|
||||
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
|
||||
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
|
||||
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
|
||||
|
||||
## Contributor Onboarding
|
||||
|
|
@ -184,7 +105,7 @@ lifecycle {
|
|||
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
|
||||
|
||||
## Automated Service Upgrades
|
||||
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s) → `claude -p` (upgrade agent)
|
||||
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
|
||||
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
|
||||
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
|
||||
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
|
||||
|
|
|
|||
|
|
@ -5,7 +5,6 @@ ARG TERRAFORM_VERSION=1.5.7
|
|||
ARG TERRAGRUNT_VERSION=0.99.4
|
||||
ARG SOPS_VERSION=3.9.4
|
||||
ARG KUBECTL_VERSION=1.34.0
|
||||
ARG VAULT_VERSION=1.18.1
|
||||
|
||||
# Install system packages (single layer)
|
||||
RUN apk add --no-cache \
|
||||
|
|
@ -35,16 +34,6 @@ RUN curl -fsSL "https://dl.k8s.io/release/v${KUBECTL_VERSION}/bin/linux/amd64/ku
|
|||
-o /usr/local/bin/kubectl \
|
||||
&& chmod +x /usr/local/bin/kubectl
|
||||
|
||||
# Vault CLI — required by scripts/tg for Tier 1 stack PG credential reads
|
||||
# and Tier 0 advisory locks. Pinned to server version (1.18.1). Without this
|
||||
# the CI pipeline surfaces the misleading "Cannot read PG credentials" error
|
||||
# because scripts/tg swallows stderr ("vault: not found").
|
||||
RUN curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
|
||||
-o /tmp/vault.zip \
|
||||
&& unzip /tmp/vault.zip -d /usr/local/bin/ \
|
||||
&& rm /tmp/vault.zip \
|
||||
&& vault version
|
||||
|
||||
# Provider cache directory (shared across stacks)
|
||||
ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache
|
||||
ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
|
||||
|
|
|
|||
BIN config.tfvars
Binary file not shown.
|
|
@ -80,6 +80,8 @@ def sofia():
|
|||
pfsense >> k8s_switch
|
||||
with Cluster('Management Network'):
|
||||
mgt_switch = Switch()
|
||||
# Truenas
|
||||
truenas = Storage("Truenas")
|
||||
# pxe server
|
||||
pxe_server = Rack("PXE Server")
|
||||
# HA
|
||||
|
|
@ -89,6 +91,7 @@ def sofia():
|
|||
devvm_vpn_client = VPN("Tailscale Client")
|
||||
vpn_clients["devvm"] = devvm_vpn_client
|
||||
|
||||
mgt_switch >> truenas
|
||||
mgt_switch >> pxe_server
|
||||
mgt_switch >> home_assistant
|
||||
mgt_switch >> devvm
|
||||
|
|
|
|||
|
|
@ -20,7 +20,7 @@ This repository contains the configuration and documentation for a homelab Kuber
|
|||
| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog |
|
||||
| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules |
|
||||
| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration |
|
||||
| [Storage](architecture/storage.md) | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
|
||||
| [Storage](architecture/storage.md) | TrueNAS NFS, democratic-csi, and persistent volume management |
|
||||
| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration |
|
||||
| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls |
|
||||
| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack |
|
||||
|
|
|
|||
|
|
@ -16,10 +16,10 @@ n8n Webhook (POST /webhook/<uuid>)
|
|||
│ rate limit: max 5 upgrades per 6h window
|
||||
│
|
||||
▼
|
||||
HTTP POST → claude-agent-service (K8s)
|
||||
SSH → Dev VM (10.0.10.10)
|
||||
│
|
||||
▼
|
||||
claude -p "upgrade agent prompt" (in-cluster)
|
||||
claude -p "upgrade agent prompt"
|
||||
│
|
||||
▼
|
||||
Service Upgrade Agent
|
||||
|
|
@ -54,7 +54,7 @@ Service Upgrade Agent
|
|||
- Only `status=update` (skip `new`, `unchanged`)
|
||||
- Skip databases, custom images, infra images, `:latest`
|
||||
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
|
||||
- **Action**: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt
|
||||
- **Action**: SSH to dev VM, runs `claude -p` with the upgrade agent prompt
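
The new HTTP POST action can be reproduced by hand from any pod that has the synced bearer token, which is useful when debugging the 401 failures described below. A hedged sketch — the endpoint and env var name come from this doc; the JSON payload field (`prompt`) is an assumption, so verify it against the claude-agent-service API before relying on it:

```
# export CLAUDE_AGENT_API_TOKEN from the ESO-synced Secret first
curl -sS -X POST "http://claude-agent-service.claude-agent.svc:8080/execute" \
  -H "Authorization: Bearer ${CLAUDE_AGENT_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Upgrade <service> per .claude/agents/service-upgrade.md"}'
```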
|
||||
|
||||
### Upgrade Agent
|
||||
- **Prompt**: `.claude/agents/service-upgrade.md`
|
||||
|
|
@ -173,35 +173,7 @@ Key behaviors observed:
|
|||
| Secret | Vault Path | Purpose |
|
||||
|--------|-----------|---------|
|
||||
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
|
||||
| Agent API bearer token | `secret/claude-agent-service` → `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both `claude-agent` ns (consumer) and `n8n` ns (caller) via ESO. n8n exposes it to the container as `CLAUDE_AGENT_API_TOKEN` env var. |
|
||||
| Claude OAuth (primary) | `secret/claude-agent-service` → `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json` — CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
|
||||
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` → `claude_oauth_token` | Failover tokens. Minted alongside primary (verified Anthropic does NOT revoke earlier sessions on new mint). Swap into primary if revocation or compromise. |
|
||||
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
|
||||
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
|
||||
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |
|
||||
|
||||
## OAuth token lifecycle
|
||||
|
||||
The CLI supports two auth modes. We use the second — long-lived.
|
||||
|
||||
| Mode | How minted | TTL | Needs refresh? | When to use |
|
||||
|------|-----------|-----|----------------|-------------|
|
||||
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes — CLI auto-refreshes on startup if refresh token valid | Human dev machines |
|
||||
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | **1 year** | No — expires hard | **Headless / service accounts (us)** |
|
||||
|
||||
When both are present on disk, `CLAUDE_CODE_OAUTH_TOKEN` env var wins.
|
||||
|
||||
**Harvesting headless**: `setup-token` uses Ink (React for terminals) and needs a real PTY with **≥300-column width**. At 80-col, Ink wraps and DROPS one character at the wrap boundary (107-char invalid instead of 108-char valid). Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.
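
A minimal sketch of that wrapper pattern, assuming the only hard requirement is a real PTY forced to ≥300 columns (the actual harvester also drives the interactive OAuth prompts; this only shows the PTY setup):

```
# Hedged sketch: run `claude setup-token` under a PTY resized to 300 columns
# so Ink's wrapping cannot drop a character at the wrap boundary.
import fcntl, os, pty, struct, subprocess, termios

def run_wide(cmd=("claude", "setup-token"), cols=300, rows=50):
    master, slave = pty.openpty()
    # Set the window size before the CLI starts drawing.
    fcntl.ioctl(slave, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))
    proc = subprocess.Popen(cmd, stdin=slave, stdout=slave, stderr=slave,
                            env={**os.environ, "COLUMNS": str(cols)})
    os.close(slave)
    out = b""
    while True:
        try:
            chunk = os.read(master, 4096)
        except OSError:          # EIO once the child exits and closes the PTY
            break
        if not chunk:
            break
        out += chunk
    proc.wait()
    return out.decode(errors="replace")
```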
|
||||
|
||||
**Monitoring**: CronJob `claude-oauth-expiry-monitor` (claude-agent ns, every 6h) pushes `claude_oauth_token_expiry_timestamp{path="..."}` to Pushgateway. Alerts: `ClaudeOAuthTokenExpiringSoon` (30d, warn), `ClaudeOAuthTokenCritical` (7d, crit), `ClaudeOAuthTokenMonitorStale` (48h no push, warn), `ClaudeOAuthTokenMonitorNeverRun` (metric absent, warn).
|
||||
|
||||
**Rotation**: on alert, harvest a new token, `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`, update the `claude_oauth_token_mint_epochs` local in `stacks/claude-agent-service/main.tf`, `scripts/tg apply` → alert clears on next cron tick.
|
||||
|
||||
## n8n workflow gotchas
|
||||
|
||||
The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **not** Terraform-managed. The JSON at `stacks/n8n/workflows/diun-upgrade.json` is a backup; the live state lives in `workflow_entity.nodes`. Drift between the two is possible.
|
||||
|
||||
- **HTTP Request node header expressions must use template-literal form**: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
|
||||
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
|
||||
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
|
||||
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
|
||||
| Dev VM SSH key | n8n credentials store → `devvm-ssh` | n8n → dev VM SSH |
|
||||
|
|
|
|||
|
|
@ -209,7 +209,7 @@ graph LR
|
|||
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
|
||||
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
|
||||
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
|
||||
| ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED 2026-04-13** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup + inotify change tracking on Proxmox host NFS |
|
||||
| ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup |
|
||||
|
||||
## How It Works
|
||||
|
||||
|
|
@ -217,7 +217,7 @@ graph LR
|
|||
|
||||
Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
|
||||
|
||||
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot.sh`). Deploy: `scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot`
|
||||
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot`)
|
||||
**Schedule**: Daily 03:00 via systemd timer, 7-day retention
|
||||
**Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data`
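
A hedged sketch of the per-LV operation the script performs (the LV name is illustrative; the authoritative logic, retention loop, and metric pushes live in `infra/scripts/lvm-pvc-snapshot.sh`):

```
# Thin CoW snapshot of one PVC LV (thin snapshots need no size argument)
LV_NAME="vm-201-disk-0-pvc-0a1b2c3d"        # hypothetical LV matching vm-*-pvc-*
lvcreate --snapshot --name "${LV_NAME}-snap-$(date +%Y%m%d)" "pve/${LV_NAME}"

# 7-day retention boils down to lvremove of snapshots older than a week, e.g.:
lvremove -y "pve/${LV_NAME}-snap-20260430"
```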
|
||||
|
||||
|
|
@ -226,7 +226,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
|
|||
- They already have app-level dumps (Layer 2)
|
||||
- Including them causes ~36% write amplification; excluding them reduces overhead to ~0%
|
||||
|
||||
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>30h since last run + 30m `for:`), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
|
||||
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
|
||||
|
||||
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
|
||||
|
||||
|
|
@ -234,7 +234,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
|
|||
|
||||
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
|
||||
|
||||
**Script**: `/usr/local/bin/daily-backup` on PVE host (source: `infra/scripts/daily-backup.sh`)
|
||||
**Script**: `/usr/local/bin/daily-backup` on PVE host (source: `infra/scripts/daily-backup`)
|
||||
**Schedule**: Daily 05:00 via systemd timer
|
||||
**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup)
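
A hedged illustration of the hardlink-dedup rotation (directory layout and week arithmetic are assumptions; see `infra/scripts/daily-backup.sh` for the real logic):

```
WEEK=$(( 10#$(date +%U) % 4 ))                 # current slot, weeks 0-3
PREV=$(( (WEEK + 3) % 4 ))                     # previous slot for --link-dest
rsync -a --delete \
  --link-dest="/mnt/backup/nfs/week-${PREV}/" \
  /srv/nfs/ "/mnt/backup/nfs/week-${WEEK}/"
# Unchanged files become hardlinks into the previous week's copy, so four
# weekly versions cost little more than one full copy plus deltas.
```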
|
||||
|
||||
|
|
@ -334,14 +334,14 @@ Two-step offsite sync:
|
|||
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` for cleanup (removes orphaned files on Synology).
|
||||
|
||||
**Destination**:
|
||||
- `Synology/Backup/Viki/nfs/` — mirrors `/srv/nfs`
|
||||
- `Synology/Backup/Viki/nfs/` — mirrors `/srv/nfs` (renamed from `truenas/`)
|
||||
- `Synology/Backup/Viki/nfs-ssd/` — mirrors `/srv/nfs-ssd`
|
||||
|
||||
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
|
||||
|
||||
#### ~~TrueNAS Cloud Sync~~ — DECOMMISSIONED 2026-04-13
|
||||
#### ~~TrueNAS Cloud Sync~~ — DECOMMISSIONED
|
||||
|
||||
> TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04-13). The current offsite path is inotify-change-tracked rsync from the Proxmox host NFS (`/srv/nfs`, `/srv/nfs-ssd`) to Synology.
|
||||
> TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04). The `Synology/Backup/Viki/truenas/` directory was renamed to `nfs/` to reflect the new consolidated layout.
|
||||
|
||||
## Configuration
|
||||
|
||||
|
|
@ -673,7 +673,7 @@ module "nfs_backup" {
|
|||
│ ~~CloudSyncNeverRun~~ REMOVED (TrueNAS decommissioned) │
|
||||
│ ~~CloudSyncFailing~~ REMOVED (TrueNAS decommissioned) │
|
||||
│ VaultwardenIntegrityFail integrity_ok == 0 │
|
||||
│ LVMSnapshotStale > 30h since last snapshot │
|
||||
│ LVMSnapshotStale > 24h since last snapshot │
|
||||
│ LVMSnapshotFailing snapshot creation failed │
|
||||
│ LVMThinPoolLow < 15% free space in thin pool │
|
||||
│ WeeklyBackupStale > 8d since last success │
|
||||
|
|
@ -692,16 +692,6 @@ module "nfs_backup" {
|
|||
- ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
|
||||
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
|
||||
|
||||
**Pushgateway persistence**: The Pushgateway is configured with
|
||||
`--persistence.file=/data/pushgateway.bin --persistence.interval=1m`
|
||||
on a 2Gi `proxmox-lvm-encrypted` PVC (helm values:
|
||||
`prometheus-pushgateway.persistentVolume`). Without this, every pod
|
||||
restart drops in-memory metrics. Once-per-day pushers (offsite-sync,
|
||||
weekly backup) are otherwise invisible for up to 24h if the
|
||||
Pushgateway restarts between pushes — which is exactly what triggered
|
||||
the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
|
||||
11:42 UTC terminated the Pushgateway 8h after the 03:12 UTC push).
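
A hedged sketch of the relevant values (key names follow the upstream `prometheus-pushgateway` chart; the exact nesting inside this repo's helm values may differ):

```
prometheus-pushgateway:
  extraArgs:
    - --persistence.file=/data/pushgateway.bin
    - --persistence.interval=1m
  persistentVolume:
    enabled: true
    size: 2Gi
    storageClass: proxmox-lvm-encrypted
```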
|
||||
|
||||
**Alert routing**:
|
||||
- All backup alerts → Slack `#infra-alerts`
|
||||
- Vaultwarden integrity fail → Slack `#infra-critical` (immediate action required)
|
||||
|
|
|
|||
|
|
@ -1,136 +0,0 @@
|
|||
# chrome-service — In-cluster headed Chromium pool
|
||||
|
||||
## Overview
|
||||
|
||||
`chrome-service` is a single-replica, persistent-profile, bearer-token-gated
|
||||
Playwright **launch-server** that exposes a headed Chromium browser over a
|
||||
WebSocket. Sibling services connect to it instead of running their own
|
||||
in-process Chromium when the upstream's anti-bot tooling
|
||||
(`disable-devtool.js` redirect-to-google trap, console-clear timing tricks,
|
||||
`navigator.webdriver` checks) defeats a headless browser.
|
||||
|
||||
Initial caller: `f1-stream`'s `playback_verifier`. Future callers attach
|
||||
via the WS+token contract documented in `stacks/chrome-service/README.md`.
|
||||
|
||||
## Why a separate stack
|
||||
|
||||
In-process Chromium inside `f1-stream`:
|
||||
|
||||
- Runs **headless** by default (no `Xvfb`/`DISPLAY`).
|
||||
- Has the `HeadlessChromium/...` UA suffix and `navigator.webdriver === true`.
|
||||
- Trips `disable-devtool.js`'s **Performance** detector — Playwright's CDP
|
||||
adds latency to `console.log(largeArray)` vs `console.table(largeArray)`,
|
||||
which the lib reads as "DevTools is open" and redirects to
|
||||
`https://www.google.com/`.
|
||||
|
||||
`chrome-service` solves this by:
|
||||
|
||||
1. Running **headed** under `Xvfb :99` (via `playwright launch-server` with
|
||||
a JSON config that pins `headless: false`).
|
||||
2. Living in a long-lived pod so JIT browser launch latency disappears.
|
||||
3. Allowing a per-context init script
|
||||
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
|
||||
`puppeteer-extra-plugin-stealth`) to spoof `webdriver`, `chrome.runtime`,
|
||||
`plugins`, `languages`, `Permissions.query`, WebGL renderer strings, and
|
||||
to hide the `disable-devtool-auto` script-tag attribute so the lib's
|
||||
IIFE exits early.
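
A hedged sketch of what the server pod effectively runs (the config file name, JSON keys, and exact flags are assumptions; the real command lives in `stacks/chrome-service/main.tf`):

```
Xvfb :99 -screen 0 1920x1080x24 &
cat > /tmp/launch-config.json <<'EOF'
{ "browser": "chromium", "launchOptions": { "headless": false } }
EOF
DISPLAY=:99 npx -y playwright@1.48.0 launch-server \
  --browser chromium --config /tmp/launch-config.json
```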
|
||||
|
||||
## Wire protocol
|
||||
|
||||
```text
|
||||
ws://chrome-service.chrome-service.svc.cluster.local:3000/<TOKEN>
|
||||
│
|
||||
┌───────────────────────────────┼───────────────────────────────┐
|
||||
│ caller pod │ chrome-service pod
|
||||
│ (e.g. f1-stream) │ (single replica)
|
||||
│ │
|
||||
│ CHROME_WS_URL ──────────────┘
|
||||
│ CHROME_WS_TOKEN ─── from `secret/chrome-service.api_bearer_token` (ESO)
|
||||
│
|
||||
│ await chromium.connect(f"{ws}/{token}")
|
||||
│ await ctx.add_init_script(STEALTH_JS)
|
||||
│ page.goto("https://upstream.com/embed/...")
|
||||
│
|
||||
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
|
||||
```
|
||||
|
||||
## Image pin
|
||||
|
||||
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
|
||||
`stacks/chrome-service/main.tf`) and the Python client
|
||||
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
|
||||
minor-versions**. Bump in lockstep — Playwright protocol changes between
|
||||
minors and the client cannot connect to a mismatched server.
|
||||
|
||||
The Microsoft image ships only the browser binaries, not the `playwright`
|
||||
npm SDK; the start command runs `npx -y playwright@1.48.0 launch-server`
|
||||
which downloads the SDK on first start (cached under `$HOME/.npm` via the
|
||||
PVC) and reuses it on subsequent restarts.
|
||||
|
||||
## Storage
|
||||
|
||||
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
|
||||
`proxmox-lvm-encrypted`) — Chromium user-data dir + npm cache.
|
||||
Encrypted because cookies/localStorage may include third-party auth tokens
|
||||
for sites callers drive. `HOME=/profile` so npx caches there.
|
||||
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
|
||||
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
|
||||
retention 30 days.
|
||||
|
||||
## Auth + secrets
|
||||
|
||||
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
|
||||
random, rotated by hand:
|
||||
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
|
||||
- ESO syncs into namespace-local Secret `chrome-service-secrets`
|
||||
(server pod) and `chrome-service-client-secrets` (each caller pod).
|
||||
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
|
||||
to both server and any annotated caller — no manual rollout.
|
||||
|
||||
## Network controls
|
||||
|
||||
- **`kubernetes_network_policy_v1.ws_ingress`** — two separate ingress
|
||||
rules on the same policy:
|
||||
- **TCP/3000** (Playwright WS): only namespaces labelled
|
||||
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
|
||||
fallback for `f1-stream` by `kubernetes.io/metadata.name`).
|
||||
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace, since
|
||||
the public-facing path is `chrome.viktorbarzin.me` ingress →
|
||||
Traefik → sidecar. Authentik forward-auth still gates external
|
||||
access at the Traefik layer.
|
||||
- **WS port 3000** is internal-only (no ingress, no Cloudflare DNS).
|
||||
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
|
||||
exposes a live HTML5 view of the headed Chromium session via
|
||||
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
|
||||
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
|
||||
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
|
||||
Authentik-gated. Both static page and WebSocket upgrade share the
|
||||
same path — Cloudflare proxy, Cloudflared tunnel, Traefik, and
|
||||
Authentik forward-auth all preserve `Upgrade: websocket`.
|
||||
|
||||
## Adding a new caller
|
||||
|
||||
See `stacks/chrome-service/README.md` for the four-step recipe:
|
||||
|
||||
1. Label the caller's namespace.
|
||||
2. Add an `ExternalSecret` pulling `secret/chrome-service`.
|
||||
3. Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN` env vars.
|
||||
4. Vendor `stealth.js` and apply via `await context.add_init_script(...)`
|
||||
after every `new_context()`.
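
A hedged caller sketch tying the four steps together (async Playwright for Python; the target URL is purely illustrative):

```
import asyncio, os
from playwright.async_api import async_playwright

STEALTH_JS = open("stealth.js").read()          # vendored per step 4

async def main():
    ws = os.environ["CHROME_WS_URL"]            # step 3: injected env vars
    token = os.environ["CHROME_WS_TOKEN"]
    async with async_playwright() as p:
        browser = await p.chromium.connect(f"{ws}/{token}")
        ctx = await browser.new_context()
        await ctx.add_init_script(STEALTH_JS)   # step 4: before any page loads
        page = await ctx.new_page()
        await page.goto("https://upstream.example/embed/page")  # hypothetical
        print(await page.title())
        await browser.close()

asyncio.run(main())
```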
|
||||
|
||||
## Limits + risks
|
||||
|
||||
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
|
||||
license check, device-fingerprint mismatch, hotlink protection that
|
||||
whitelists specific parent domains), the verifier returns
|
||||
`is_playable=False` and the extractor moves on. No user-visible
|
||||
breakage, just empty stream lists for that source.
|
||||
- **JWPlayer DRM error 102630** — observed with several hmembeds embeds
|
||||
even from the headed chrome-service. The license check bails because
|
||||
the request origin isn't on the embed's allowlist; this is upstream
|
||||
policy, not an infra defect.
|
||||
- **Single replica + RWO PVC** — the deployment uses `Recreate` strategy.
|
||||
Brief outage on rollout, ~30s for browser warmup.
|
||||
- **No `/metrics` endpoint** — the cluster's generic
|
||||
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
|
||||
exporter is day-2 work.
|
||||
|
|
@ -19,7 +19,7 @@ graph LR
|
|||
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
|
||||
|
||||
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
|
||||
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
|
||||
L[registry.viktorbarzin.me<br/>Private Registry] -.-> J
|
||||
|
||||
style B fill:#2088ff
|
||||
style F fill:#4c9e47
|
||||
|
|
@ -33,7 +33,7 @@ graph LR
|
|||
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
|
||||
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
|
||||
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
|
||||
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
|
||||
| Private Registry | Custom | `registry.viktorbarzin.me` | Private images, htpasswd auth |
|
||||
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
|
||||
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
|
||||
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
|
||||
|
|
@ -102,8 +102,7 @@ Woodpecker API uses numeric IDs (not owner/name):
|
|||
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
|
||||
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
|
||||
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
|
||||
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
|
||||
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
|
||||
4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault
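
A hedged sketch of the containerd redirection from step 1, for one upstream (paths follow containerd's `certs.d` convention; the authoritative templates are rendered by the infra stack):

```
# /etc/containerd/certs.d/docker.io/hosts.toml on each node
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
```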
|
||||
|
||||
### Infra Pipelines (Woodpecker-only)
|
||||
|
||||
|
|
@ -112,14 +111,7 @@ Woodpecker API uses numeric IDs (not owner/name):
|
|||
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
|
||||
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
|
||||
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
|
||||
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
|
||||
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
|
||||
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
|
||||
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
|
||||
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
|
||||
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
|
||||
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
|
||||
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
|
||||
|
||||
## Configuration
|
||||
|
||||
|
|
|
|||
|
|
@ -18,7 +18,7 @@ graph TB
|
|||
subgraph Proxmox["Proxmox VE"]
|
||||
direction TB
|
||||
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
|
||||
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
|
||||
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
|
||||
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
|
||||
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
|
||||
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
|
||||
|
|
@ -72,7 +72,7 @@ graph TB
|
|||
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|
||||
|----|------|-------|-----|---------|------|--------|
|
||||
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
|
||||
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
|
||||
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
|
||||
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
|
|
@ -85,9 +85,9 @@ graph TB
|
|||
|-----------|-------|
|
||||
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
|
||||
| PCIe Address | 0000:06:00.0 |
|
||||
| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
|
||||
| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
|
||||
| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
|
||||
| Assigned VM | VMID 201 (k8s-node1) |
|
||||
| Node Label | `gpu=true` |
|
||||
| Node Taint | `nvidia.com/gpu=true:NoSchedule` |
|
||||
| Driver | NVIDIA GPU Operator |
|
||||
| Resource Name | `nvidia.com/gpu` |
|
||||
|
||||
|
|
@ -273,8 +273,8 @@ resources {
|
|||
### GPU Resource Management
|
||||
|
||||
**Node Selection**: GPU pods must:
|
||||
1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
|
||||
2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
|
||||
1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint
|
||||
2. Select `gpu=true` label
|
||||
3. Request `nvidia.com/gpu: 1` resource
|
||||
|
||||
**Example**:
|
||||
|
|
@ -286,7 +286,7 @@ spec:
|
|||
value: "true"
|
||||
effect: NoSchedule
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.present: "true"
|
||||
gpu: "true"
|
||||
containers:
|
||||
- name: app
|
||||
resources:
|
||||
|
|
@ -294,14 +294,6 @@ spec:
|
|||
nvidia.com/gpu: 1
|
||||
```
|
||||
|
||||
**Portability**: No Terraform code references a specific hostname for
|
||||
GPU scheduling. If the GPU card is physically moved to a different
|
||||
node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
|
||||
label with it, and `null_resource.gpu_node_config` re-applies the
|
||||
`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
|
||||
next apply (discovery keyed on
|
||||
`feature.node.kubernetes.io/pci-10de.present=true`).
|
||||
|
||||
**GPU Workloads**:
|
||||
- Ollama (LLM inference)
|
||||
- ComfyUI (Stable Diffusion workflows)
|
||||
|
|
@ -537,7 +529,7 @@ kubectl describe pod <pod-name> -n <namespace>
|
|||
```
|
||||
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
|
||||
```
|
||||
**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
|
||||
**Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`.
|
||||
|
||||
### Pods OOMKilled repeatedly
|
||||
|
||||
|
|
@ -622,7 +614,7 @@ spec:
|
|||
value: "true"
|
||||
effect: NoSchedule
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.present: "true"
|
||||
gpu: "true"
|
||||
containers:
|
||||
- name: app
|
||||
resources:
|
||||
|
|
|
|||
|
|
@ -120,29 +120,9 @@ graph TB
|
|||
|
||||
### Redis
|
||||
|
||||
Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.
|
||||
|
||||
**Architecture**:
|
||||
|
||||
3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via `INFO replication` tcp-checks (interval tuned to 2s on 2026-04-22; see check smoothing below). Full context behind the April 2026 rework in beads `code-v2b`.
|
||||
|
||||
- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
|
||||
- **Pod anti-affinity is `required` (hard)** — each redis pod must land on a distinct node. Soft anti-affinity previously let the scheduler co-locate 2/3 pods on the same node; when that node (`k8s-node3`) went `NotReady→Ready` at 11:42 UTC on 2026-04-22 it took 2 redis pods with it and the cluster lost quorum. Cluster-wide PV `nodeAffinity` matches one zone (`topology.kubernetes.io/region=pve, zone=pve`), so PVCs rebind freely on reschedule.
|
||||
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
|
||||
- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
|
||||
- **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted.
|
||||
- **Failover timing (tuned 2026-04-22)**: `sentinel down-after-milliseconds=15000` + `sentinel failover-timeout=60000`. Redis liveness probe `timeout_seconds=10, failure_threshold=5`; sentinel liveness probe same. LUKS-encrypted LVM + BGSAVE fork can briefly stall master I/O >5s, which under the old 5s/30s sentinel timings + 3s/3 probes induced spurious `+sdown`→`+odown`→`+switch-master` cycles every 1-2 minutes. The new values absorb normal BGSAVE pauses without triggering failover.
|
||||
- **HAProxy check smoothing (tuned 2026-04-22)**: `check inter 2s fall 3 rise 2` (was `1s / 2 / 2`) + `timeout check 5s` (was `3s`). The aggressive 1s polling used to race sentinel failovers — during a legitimate promote, HAProxy could catch the old master serving `role:slave` in the 1-3s window before re-probing the new master, leaving the backend empty and clients receiving `ReadOnlyError`.
|
||||
- **Headless service `publish_not_ready_addresses=false`** (flipped 2026-04-22). Previously `true` meant HAProxy's DNS resolver saw not-yet-ready pods during rollouts, compounding the check-race above. Sentinel peer discovery is unaffected because sentinels announce to each other explicitly via `sentinel announce-hostnames yes`.
|
||||
- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
|
||||
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway at the 20% TBW budget.
|
||||
- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
|
||||
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics).
|
||||
- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up.
|
||||
|
||||
**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`.
|
||||
|
||||
**Why this design** — four incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave; (d) 2026-04-22 five-factor flap cascade — soft anti-affinity let 2/3 pods co-locate on `k8s-node3`, node bounced NotReady→Ready and took quorum with it; aggressive sentinel/probe timing (5s/30s + 3s/3) amplified disk-I/O stalls under LUKS-encrypted LVM into spurious `+switch-master` loops; HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters; `publish_not_ready_addresses=true` fed not-yet-ready pods into HAProxy DNS; downstream `realestate-crawler-celery` CrashLoopBackOff closed the feedback loop. See beads epic `code-v2b` for the full plan and linked challenger analyses.
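
A hedged `haproxy.cfg` sketch of the master-only routing and smoothed checks described above (server names are illustrative; the live config is rendered by the redis stack):

```
backend redis_master
    mode tcp
    timeout check 5s
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis-v2-0 redis-v2-0.redis-v2.redis.svc:6379 check inter 2s fall 3 rise 2
    server redis-v2-1 redis-v2-1.redis-v2.redis.svc:6379 check inter 2s fall 3 rise 2
    server redis-v2-2 redis-v2-2.redis-v2.redis.svc:6379 check inter 2s fall 3 rise 2
```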
|
||||
- Shared instance at `redis.redis.svc.cluster.local`
|
||||
- Used for caching and session storage
|
||||
- No persistence (ephemeral)
|
||||
|
||||
### SQLite (Per-App)
|
||||
|
||||
|
|
|
|||
|
|
@ -1,10 +1,10 @@
|
|||
# DNS Architecture
|
||||
|
||||
Last updated: 2026-04-19 (WS C — NodeLocal DNSCache deployed; WS D — pfSense Unbound replaces dnsmasq; WS E — Kea multi-IP DHCP option 6 + TSIG-signed DDNS)
|
||||
Last updated: 2026-04-15
|
||||
|
||||
## Overview
|
||||
|
||||
DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes. A **NodeLocal DNSCache** DaemonSet runs on every node and transparently intercepts pod DNS traffic, caching responses locally so pods keep resolving even during CoreDNS, Technitium, or pfSense disruptions.
|
||||
DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
|
|
@ -22,15 +22,14 @@ graph TB
|
|||
end
|
||||
|
||||
subgraph "pfSense (10.0.20.1)"
|
||||
pf_unbound[Unbound<br/>Resolver<br/>auth-zone AXFR]
|
||||
pf_dnsmasq[dnsmasq<br/>Forwarder]
|
||||
pf_kea[Kea DHCP4<br/>3 subnets, 53 reservations]
|
||||
pf_ddns[Kea DHCP-DDNS<br/>RFC 2136]
|
||||
pf_nat[NAT rdr<br/>UDP 53 → Technitium]
|
||||
end
|
||||
|
||||
subgraph "Kubernetes Cluster"
|
||||
NodeLocalDNS[NodeLocal DNSCache<br/>DaemonSet, 5 nodes<br/>169.254.20.10 + 10.96.0.10]
|
||||
CoreDNS[CoreDNS<br/>kube-system<br/>.:53 + viktorbarzin.lan:53]
|
||||
KubeDNSUpstream[kube-dns-upstream<br/>ClusterIP, selects CoreDNS pods]
|
||||
|
||||
subgraph "Technitium HA (namespace: technitium)"
|
||||
Primary[Primary<br/>technitium]
|
||||
|
|
@ -52,17 +51,16 @@ graph TB
|
|||
|
||||
Internet -->|DNS query| CF
|
||||
CF -->|CNAME to tunnel| CFTunnel
|
||||
LAN -->|DNS query UDP 53| pf_unbound
|
||||
LAN -->|DNS query UDP 53| pf_nat
|
||||
pf_nat -->|forward| LB_DNS
|
||||
pf_kea -->|lease event| pf_ddns
|
||||
pf_ddns -->|A + PTR| LB_DNS
|
||||
|
||||
pf_unbound -->|AXFR viktorbarzin.lan| LB_DNS
|
||||
pf_unbound -->|public queries DoT :853| CF
|
||||
pf_dnsmasq -->|.viktorbarzin.lan| LB_DNS
|
||||
pf_dnsmasq -->|public queries| CF
|
||||
|
||||
NodeLocalDNS -->|cache miss| KubeDNSUpstream
|
||||
KubeDNSUpstream --> CoreDNS
|
||||
CoreDNS -->|.viktorbarzin.lan| ClusterIP
|
||||
CoreDNS -->|public queries| pf_unbound
|
||||
CoreDNS -->|public queries| pf_dnsmasq
|
||||
|
||||
LB_DNS --> Primary
|
||||
LB_DNS --> Secondary
|
||||
|
|
@ -82,9 +80,8 @@ graph TB
|
|||
|-----------|----------|---------|---------|
|
||||
| Technitium DNS | K8s namespace `technitium` | 14.3.0 | Primary internal DNS + recursive resolver |
|
||||
| CoreDNS | K8s `kube-system` | Cluster default | K8s service discovery + forwarding to Technitium |
|
||||
| NodeLocal DNSCache | K8s `kube-system` (DaemonSet) | `k8s-dns-node-cache:1.23.1` | Per-node DNS cache, transparent interception on 10.96.0.10 + 169.254.20.10. Insulates pods from CoreDNS/Technitium/pfSense disruption. |
|
||||
| Cloudflare DNS | SaaS | N/A | Public domain management (~50 domains) |
|
||||
| pfSense Unbound | 10.0.20.1 | pfSense 2.7.2 (Unbound 1.19) | DNS resolver on LAN/OPT1/WAN; AXFR-slaves `viktorbarzin.lan` from Technitium; DoT upstream to Cloudflare |
|
||||
| pfSense dnsmasq | 10.0.20.1 | pfSense 2.7.x | DNS forwarder for management VLAN |
|
||||
| Kea DHCP-DDNS | 10.0.20.1 | pfSense 2.7.x | Automatic DNS registration on DHCP lease |
|
||||
| phpIPAM | K8s namespace `phpipam` | v1.7.0 | IPAM ↔ DNS bidirectional sync |
|
||||
|
||||
|
|
@ -93,22 +90,19 @@ graph TB
|
|||
| Stack | Path | DNS Resources |
|
||||
|-------|------|---------------|
|
||||
| Technitium | `stacks/technitium/` | 3 deployments, services, PVCs, 4 CronJobs, CoreDNS ConfigMap |
|
||||
| NodeLocal DNSCache | `stacks/nodelocal-dns/` | DaemonSet (5 pods), ConfigMap, kube-dns-upstream Service, headless metrics Service |
|
||||
| Cloudflared | `stacks/cloudflared/` | Cloudflare DNS records (A, AAAA, CNAME, MX, TXT), tunnel config |
|
||||
| phpIPAM | `stacks/phpipam/` | dns-sync CronJob, pfsense-import CronJob |
|
||||
| pfSense | `stacks/pfsense/` | VM config only (Unbound config is managed out-of-band via pfSense web UI / direct config.xml edits; see `docs/runbooks/pfsense-unbound.md`) |
|
||||
| pfSense | `stacks/pfsense/` | VM config (DNS config is via pfSense web UI) |
|
||||
|
||||
## DNS Resolution Paths
|
||||
|
||||
### K8s Pod → Internal Domain (.viktorbarzin.lan)
|
||||
|
||||
```
|
||||
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
|
||||
→ cache hit: serve locally (TTL 30s / stale up to 86400s via CoreDNS upstream)
|
||||
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
|
||||
→ CoreDNS: template matches 2+ labels before .viktorbarzin.lan → NXDOMAIN
|
||||
→ CoreDNS: forward to Technitium ClusterIP (10.96.0.53)
|
||||
→ Technitium resolves from viktorbarzin.lan zone
|
||||
Pod → CoreDNS (kube-dns:53)
|
||||
→ template: if 2+ labels before .viktorbarzin.lan → NXDOMAIN (ndots:5 junk filter)
|
||||
→ forward to Technitium ClusterIP (10.96.0.53)
|
||||
→ Technitium resolves from viktorbarzin.lan zone
|
||||
```
|
||||
|
||||
The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.viktorbarzin.lan` (caused by K8s search domain expansion) by returning NXDOMAIN for any query with 2+ labels before `.viktorbarzin.lan`. Only single-label queries (e.g., `idrac.viktorbarzin.lan`) reach Technitium.
|
||||
|
|
@ -116,54 +110,41 @@ The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.
|
|||
### K8s Pod → Public Domain
|
||||
|
||||
```
|
||||
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
|
||||
→ cache hit: serve locally
|
||||
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
|
||||
→ CoreDNS: forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → local auth-zone (AXFR-cached from Technitium)
|
||||
- public → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
Pod → CoreDNS (kube-dns:53)
|
||||
→ forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
|
||||
→ pfSense dnsmasq → Cloudflare (1.1.1.1)
|
||||
```
|
||||
|
||||
### LAN Client (192.168.1.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS=192.168.1.2 (pfSense WAN) from DHCP
|
||||
→ pfSense Unbound listens on 192.168.1.2:53 directly (no NAT rdr)
|
||||
- .viktorbarzin.lan → auth-zone (AXFR-cached from Technitium 10.0.20.201)
|
||||
Survives full Technitium/K8s outage — auth-zone keeps serving from
|
||||
/var/unbound/viktorbarzin.lan.zone with `fallback-enabled: yes`.
|
||||
- .viktorbarzin.me (non-proxied) and other public → DoT to Cloudflare
|
||||
(1.1.1.1 / 1.0.0.1 on port 853, SNI cloudflare-dns.com)
|
||||
→ pfSense NAT rdr on WAN interface → Technitium LB (10.0.20.201)
|
||||
→ Technitium resolves:
|
||||
- .viktorbarzin.lan → local zone
|
||||
- .viktorbarzin.me (non-proxied) → recursive, then Split Horizon translates
|
||||
176.12.22.76 → 10.0.20.200 for 192.168.1.0/24 clients
|
||||
- other → recursive to Cloudflare DoH (1.1.1.1)
|
||||
```
|
||||
|
||||
**Trade-off vs. prior NAT rdr**: Split Horizon hairpin translation
|
||||
(`176.12.22.76 → 10.0.20.200` for 192.168.1.x clients) was only applied
|
||||
when queries reached Technitium via the NAT rdr. With Unbound answering
|
||||
on 192.168.1.2:53 directly, non-proxied `*.viktorbarzin.me` queries on the
|
||||
192.168.1.x LAN return the public IP, which the TP-Link AP can't hairpin.
|
||||
If hairpin is broken on LAN for a given non-proxied service, the fix is
|
||||
either (a) switch the service to proxied (via `dns_type = "proxied"`)
|
||||
or (b) add a local-data override on pfSense Unbound. The pre-Unbound
|
||||
state is documented in the `docs/runbooks/pfsense-unbound.md` rollback
|
||||
section.
|
||||
Client source IPs are preserved (no SNAT on 192.168.1.x → 10.0.20.x path) — Technitium logs show real per-device IPs.
|
||||
|
||||
### Management VLAN (10.0.10.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS from Kea DHCP → pfSense (10.0.10.1)
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → auth-zone (local)
|
||||
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
→ pfSense dnsmasq:
|
||||
- .viktorbarzin.lan → forward to Technitium (10.0.20.201)
|
||||
- other → forward to Cloudflare (1.1.1.1)
|
||||
```
|
||||
|
||||
### K8s VLAN (10.0.20.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS from Kea DHCP → pfSense (10.0.20.1)
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → auth-zone (local)
|
||||
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
→ pfSense dnsmasq:
|
||||
- .viktorbarzin.lan → forward to Technitium (10.0.20.201)
|
||||
- other → forward to Cloudflare (1.1.1.1)
|
||||
```
|
||||
|
||||
## Technitium DNS — Internal DNS Server
|
||||
|
|
@ -212,7 +193,7 @@ All three pods share the `dns-server=true` label, so the DNS LoadBalancer (10.0.
|
|||
| `0.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for Valchedrym site |
|
||||
| `emrsn.org` | Primary (stub) | — | Returns NXDOMAIN locally (avoids 27K+ daily corporate query floods) |
|
||||
|
||||
**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) **AND require a valid TSIG signature** on `viktorbarzin.lan`, `10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`. Policy: `updateSecurityPolicies = [{tsigKeyName: "kea-ddns", domain: "*.<zone>", allowedTypes: ["ANY"]}]`. Unsigned updates from the allowlisted pfSense source IPs are refused ("Dynamic Updates Security Policy"). TSIG key `kea-ddns` (HMAC-SHA256) present on primary/secondary/tertiary; secret in Vault `secret/viktor/kea_ddns_tsig_secret`. Applied 2026-04-19 (WS E, bd `code-o6j`).
|
||||
**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) for Kea DDNS RFC 2136 updates.
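
For manual verification, an equivalent TSIG-signed RFC 2136 update can be sent with `nsupdate` (hostname and IP below are illustrative; the key secret lives in Vault as noted above):

```
nsupdate -y "hmac-sha256:kea-ddns:$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)" <<'EOF'
server 10.0.20.201
zone viktorbarzin.lan
update add testhost.viktorbarzin.lan. 300 A 10.0.20.99
send
EOF
```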
|
||||
|
||||
### Resolver Settings
|
||||
|
||||
|
|
@ -271,61 +252,29 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
|
|||
|
||||
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
|
||||
|
||||
## NodeLocal DNSCache
|
||||
|
||||
A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both:
|
||||
|
||||
- **169.254.20.10** — the canonical link-local IP from the upstream docs
|
||||
- **10.96.0.10** — the `kube-dns` ClusterIP, so existing pods (which already use this as their nameserver) hit the on-node cache with no kubelet change
|
||||
|
||||
Cache misses go to a separate `kube-dns-upstream` ClusterIP service (not `kube-dns`, to avoid looping back to ourselves) that selects the CoreDNS pods directly via `k8s-app=kube-dns`.
|
||||
|
||||
Priority class is `system-node-critical`; tolerations are permissive (`operator: Exists`) so the DaemonSet runs on tainted master and other reserved nodes. Kyverno `dns_config` drift is suppressed via `ignore_changes` on the DaemonSet.
|
||||
|
||||
**Caching**: `cluster.local:53` caches 9984 success / 9984 denial entries with 30s/5s TTLs. Other zones cache 30s. If CoreDNS is killed, nodes keep answering cached names — verified on 2026-04-19 by deleting all three CoreDNS pods and running `dig @169.254.20.10 idrac.viktorbarzin.lan` + `dig @169.254.20.10 github.com` from a pod (both returned answers).
|
||||
|
||||
**Kubelet clusterDNS**: **Unchanged** — still `10.96.0.10`. NodeLocal DNSCache co-listens on that IP so traffic interception is transparent; switching kubelet to `169.254.20.10` would require a rolling reconfigure of every node and provides no additional cache benefit over transparent mode.
|
||||
|
||||
**Metrics**: A headless Service `node-local-dns` (ClusterIP `None`) exposes each pod on port `9253` for Prometheus scraping (annotated `prometheus.io/scrape=true`).
|
||||
|
||||
## CoreDNS Configuration
|
||||
|
||||
CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment).
|
||||
CoreDNS is managed via a Terraform `kubernetes_config_map` resource in `stacks/technitium/modules/technitium/main.tf`.
|
||||
|
||||
```
|
||||
.:53 {
|
||||
errors / health / ready
|
||||
kubernetes cluster.local in-addr.arpa ip6.arpa # K8s service discovery
|
||||
prometheus :9153 # Metrics
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
|
||||
policy sequential # try upstreams in order
|
||||
health_check 5s # mark unhealthy in 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
serve_stale 86400s # resilience during upstream outage
|
||||
}
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 # pfSense → Google → Cloudflare
|
||||
cache (success 10000 300, denial 10000 300)
|
||||
loop / reload / loadbalance
|
||||
}
|
||||
|
||||
viktorbarzin.lan:53 {
|
||||
template: .*\..*\.viktorbarzin\.lan\.$ → NXDOMAIN # ndots:5 junk filter
|
||||
forward . 10.96.0.53 { # Technitium ClusterIP
|
||||
health_check 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache (success 10000 300, denial 10000 300, serve_stale 86400s)
|
||||
forward . 10.96.0.53 # Technitium ClusterIP
|
||||
cache (success 10000 300, denial 10000 300)
|
||||
}
|
||||
```
|
||||
|
||||
**Scaling**: 3 replicas, `required` anti-affinity on `kubernetes.io/hostname` (spread across 3 distinct nodes). PodDisruptionBudget `coredns` with `minAvailable=2`.
|
||||
|
||||
**Kyverno ndots injection**: A Kyverno policy injects `ndots:2` on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.
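
The mutation's effect on a pod spec looks like this (hedged sketch; the actual Kyverno policy lives in its own stack):

```
dnsConfig:
  options:
    - name: ndots
      value: "2"
```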
|
||||
|
||||
**Failover behaviour**: With `policy sequential` on the root forward block, CoreDNS tries pfSense first; if `health_check 5s` detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with `serve_stale`, pods keep resolving cached names for up to 24h even with full upstream failure.
|
||||
|
||||
## Cloudflare DNS — External Domains
|
||||
|
||||
All public domains are under the `viktorbarzin.me` zone. DNS records are **auto-created per service** via the `ingress_factory` module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.
|
||||
|
|
@ -375,9 +324,9 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
|
|||
Devices get automatic DNS registration without manual intervention. See [networking.md § IPAM & DNS Auto-Registration](networking.md#ipam--dns-auto-registration) for the full data flow diagram.
|
||||
|
||||
Summary:
|
||||
1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets). DHCP option 6 (DNS servers) is pushed with two IPs per internal subnet: internal resolver + AdGuard public fallback (`94.140.14.14`) — clients survive an internal DNS outage.
|
||||
2. **Kea DDNS** sends **TSIG-signed** RFC 2136 dynamic update to Technitium (A + PTR records) — immediate. Key `kea-ddns` (HMAC-SHA256); Technitium enforces both source-IP ACL and TSIG signature on `viktorbarzin.lan` + reverse zones.
|
||||
3. **phpipam-pfsense-import** CronJob (hourly) pulls Kea leases + ARP table into phpIPAM
|
||||
1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets)
|
||||
2. **Kea DDNS** sends RFC 2136 dynamic update to Technitium (A + PTR records) — immediate
|
||||
3. **phpipam-pfsense-import** CronJob (5min) pulls Kea leases + ARP table into phpIPAM
|
||||
4. **phpipam-dns-sync** CronJob (15min) pushes named phpIPAM hosts → Technitium A + PTR, pulls Technitium PTR → phpIPAM hostnames
|
||||
|
||||
## Automation CronJobs
|
||||
|
|
@ -389,7 +338,7 @@ Summary:
|
|||
| `technitium-split-horizon-sync` | `15 */6 * * *` | technitium | Split Horizon + DNS Rebinding Protection on all 3 instances |
|
||||
| `technitium-dns-optimization` | `30 */6 * * *` | technitium | Min cache TTL 60s, emrsn.org stub zone |
|
||||
| `phpipam-dns-sync` | `*/15 * * * *` | phpipam | Bidirectional phpIPAM ↔ Technitium DNS sync |
|
||||
| `phpipam-pfsense-import` | `0 * * * *` | phpipam | Import Kea DHCP leases + ARP from pfSense |
|
||||
| `phpipam-pfsense-import` | `*/5 * * * *` | phpipam | Import Kea DHCP leases + ARP from pfSense |
|
||||
|
||||
### Password Rotation Flow
|
||||
|
||||
|
|
@ -411,62 +360,28 @@ Vault DB engine rotates password
|
|||
| Metric Source | Dashboard | Alerts |
|
||||
|---------------|-----------|--------|
|
||||
| Technitium query logs (PostgreSQL) | Grafana `technitium-dns.json` | — |
|
||||
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | `CoreDNSErrors`, `CoreDNSForwardFailureRate` |
|
||||
| Technitium zone-sync CronJob (Pushgateway) | — | `TechnitiumZoneSyncFailed`, `TechnitiumZoneSyncStale`, `TechnitiumZoneCountMismatch` |
|
||||
| Technitium DNS pod availability | — | `TechnitiumDNSDown` |
|
||||
| `dns-anomaly-monitor` CronJob (Pushgateway) | — | `DNSQuerySpike`, `DNSQueryRateDropped`, `DNSHighErrorRate` |
|
||||
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | — |
|
||||
| Uptime Kuma | External monitors for all proxied domains | ExternalAccessDivergence (15min) |
|
||||
|
||||
### Metrics pushed by `technitium-zone-sync`
|
||||
|
||||
The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:
|
||||
|
||||
| Metric | Labels | Meaning |
|
||||
|--------|--------|---------|
|
||||
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
|
||||
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
|
||||
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by `TechnitiumZoneSyncStale`) |
|
||||
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives `TechnitiumZoneCountMismatch`) |
|
||||
|
||||
### DNS alert rewrites
|
||||
|
||||
- `DNSQuerySpike` was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)` which compares against the actual 1h Prometheus history.
|
||||
- `DNSQueryRateDropped` (new): fires when query rate drops below 50% of 1h average — upstream clients may be failing to reach Technitium.
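
A hedged sketch of the rewritten rules (multipliers and `for:` windows are illustrative; the authoritative expressions live in the monitoring stack):

```
- alert: DNSQuerySpike
  expr: dns_anomaly_total_queries > 3 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)
  for: 15m
- alert: DNSQueryRateDropped
  expr: dns_anomaly_total_queries < 0.5 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)
  for: 30m
```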
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### DNS Not Resolving Internal Domains
|
||||
|
||||
1. Check NodeLocal DNSCache pods first — pod queries go through these: `kubectl -n kube-system get pod -l k8s-app=node-local-dns -o wide`
|
||||
2. Check Technitium pods: `kubectl get pod -n technitium`
|
||||
3. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
|
||||
4. Test via NodeLocal DNSCache from a pod: `kubectl exec -it <pod> -- dig @169.254.20.10 idrac.viktorbarzin.lan`
|
||||
5. Bypass NodeLocal DNSCache (test CoreDNS directly): `kubectl exec -it <pod> -- dig @<kube-dns-upstream-ClusterIP> idrac.viktorbarzin.lan` (`kubectl get svc -n kube-system kube-dns-upstream`)
|
||||
6. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
|
||||
7. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`
|
||||
1. Check Technitium pods: `kubectl get pod -n technitium`
|
||||
2. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
|
||||
3. Test from a pod: `kubectl exec -it <pod> -- nslookup idrac.viktorbarzin.lan 10.96.0.53`
|
||||
4. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
|
||||
5. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`
|
||||
|
||||
### LAN Clients Can't Resolve
|
||||
|
||||
1. Verify pfSense Unbound is running: `ssh admin@10.0.20.1 "sockstat -l -4 -p 53 | grep unbound"` — expect listeners on `192.168.1.2:53`, `10.0.10.1:53`, `10.0.20.1:53`, `127.0.0.1:53`
|
||||
2. Verify the auth-zone is loaded: `ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"` — expect `viktorbarzin.lan. serial N`
|
||||
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan` (should return with `aa` flag)
|
||||
4. Test public upstream: `dig @192.168.1.2 example.com +dnssec` (should have `ad` flag — DoT via Cloudflare working)
|
||||
5. If auth-zone can't AXFR: check Technitium `viktorbarzin.lan` zone options → `zoneTransferNetworkACL` contains `10.0.20.1, 10.0.10.1, 192.168.1.2`
|
||||
6. See `docs/runbooks/pfsense-unbound.md` for full Unbound runbook and rollback instructions
|
||||
1. Verify pfSense NAT rule redirects UDP 53 on WAN to 10.0.20.201
|
||||
2. Check Technitium LB service: `kubectl get svc -n technitium technitium-dns`
|
||||
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan`
|
||||
4. Check `externalTrafficPolicy: Local` — if no Technitium pod runs on the node receiving traffic, it drops
|
||||
|
||||
### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)
|
||||
|
||||
Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries
|
||||
directly instead of forwarding to Technitium, so the Technitium Split Horizon
|
||||
post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied
|
||||
services break hairpin on LAN clients again. Options:
|
||||
|
||||
1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent.
|
||||
2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
|
||||
3. **Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section.
|
||||
|
||||
K8s-side Split Horizon is still configured and applies whenever `*.viktorbarzin.me` queries do reach Technitium (for example, pod queries that CoreDNS forwards on to Technitium rather than being answered by pfSense Unbound). To verify the Technitium split-horizon app:
|
||||
|
||||
1. Verify Split Horizon app is installed on all instances
|
||||
2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
|
||||
3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
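After the job finishes, a hedged spot-check of the internal answer (assumes a pod with `dig` installed and a non-proxied hostname to test with):

```sh
kubectl exec -it <pod> -- dig +short <service>.viktorbarzin.me
# expect the internal answer 10.0.20.200 (Traefik LB), not the public 176.12.22.76
```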
|
||||
|
|
@ -501,8 +416,6 @@ For external `.viktorbarzin.me` records:
|
|||
## Incident History
|
||||
|
||||
- **2026-04-14 (SEV1)**: NFS `fsid=0` caused Technitium primary data loss on restart. Fixed by migrating all 3 instances to `proxmox-lvm-encrypted`, adding zone-sync CronJob (30min AXFR). See [post-mortem](../post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md).
|
||||
- **2026-04-19 (hardening, not outage)**: Workstream D — pfSense Unbound replaces dnsmasq as the pfSense DNS service. Unbound AXFR-slaves `viktorbarzin.lan` from Technitium so LAN-side resolution survives a full K8s outage. WAN NAT rdr `192.168.1.2:53 → 10.0.20.201` removed (Unbound listens on WAN directly). DoT upstream via Cloudflare. See `docs/runbooks/pfsense-unbound.md` and bd `code-k0d`.
|
||||
- **2026-04-19 (hardening, not outage)**: Workstream E — Kea DHCP now pushes TWO DNS IPs (internal + AdGuard public fallback `94.140.14.14`) via option 6 to the internal subnets (10.0.10/24, 10.0.20/24); 192.168.1/24 was already dual-IP (served by TP-Link). Kea DHCP-DDNS now TSIG-signs its RFC 2136 updates (key `kea-ddns`, HMAC-SHA256) and the Technitium zones require both source-IP ACL AND TSIG signature. See `docs/runbooks/pfsense-unbound.md` § "Kea DHCP-DDNS TSIG" and bd `code-o6j`.
|
||||
|
||||
## Related
@ -178,11 +178,11 @@ flowchart LR
|
|||
subgraph "Kubernetes Cluster"
|
||||
C -->|Yes| D[Woodpecker Pipeline]
|
||||
D --> E[Vault Auth<br/>K8s SA JWT]
|
||||
E --> F[Fetch API Token]
|
||||
E --> F[Fetch SSH Key]
|
||||
end
|
||||
|
||||
subgraph "claude-agent-service (K8s)"
|
||||
F --> G[HTTP POST /execute]
|
||||
subgraph "DevVM (10.0.10.10)"
|
||||
F --> G[SSH + Claude Code]
|
||||
G --> H[issue-responder agent]
|
||||
H --> I[Investigate / Implement]
|
||||
I --> J[Comment on Issue]
@ -1,147 +1,72 @@
|
|||
# Mail Server Architecture
|
||||
|
||||
Last updated: 2026-04-19 (code-yiu Phase 6: MetalLB LB retired; traffic now enters via pfSense HAProxy with PROXY v2)
|
||||
Last updated: 2026-04-12 (Brevo relay migration)
|
||||
|
||||
## Overview
|
||||
|
||||
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Brevo EU (`smtp-relay.brevo.com:587` — migrated from Mailgun on 2026-04-12; SPF record cut over on 2026-04-18). Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs: pfSense HAProxy injects the PROXY v2 header on each backend connection so the mailserver pod sees the true source IP despite kube-proxy SNAT. See [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md) for ops details.
|
||||
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Mailgun EU. Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs via `externalTrafficPolicy: Local` on a dedicated MetalLB IP.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
Two independent paths into the mailserver pod:
|
||||
|
||||
- **External** (MX traffic, webmail clients over WAN): Internet → pfSense → HAProxy → NodePort → **alt container ports** (2525/4465/5587/10993) that **require** PROXY v2 framing.
|
||||
- **Intra-cluster** (Roundcube, E2E probe): same pod, **stock container ports** (25/465/587/993), **no** PROXY framing.
|
||||
|
||||
One Deployment, one pod, two sets of Postfix `master.cf` services + Dovecot `inet_listener` blocks, two Kubernetes Services (`mailserver` ClusterIP + `mailserver-proxy` NodePort).
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
%% External ingress path
|
||||
SENDER[Sending MTA<br/>arbitrary public IP] -->|MX lookup + SMTP<br/>:25| MX[mail.viktorbarzin.me<br/>A 176.12.22.76]
|
||||
MX --> PF[pfSense WAN<br/>vtnet0 192.168.1.2]
|
||||
PF -->|NAT rdr<br/>WAN:25/465/587/993<br/>→ 10.0.20.1:same| HAP
|
||||
HAP[pfSense HAProxy<br/>4 TCP frontends on 10.0.20.1<br/>send-proxy-v2 to backends]
|
||||
HAP -->|round-robin<br/>tcp-check inter 120s| KN{k8s worker<br/>node1..4}
|
||||
KN -->|NodePort 30125-30128<br/>ETP: Cluster → kube-proxy SNAT| PODEXT
|
||||
|
||||
%% Internal ingress path
|
||||
RC[Roundcubemail pod] -->|SMTP :587 + IMAP :993<br/>no PROXY| SVC[Service mailserver<br/>ClusterIP 10.103.108.x<br/>25/465/587/993]
|
||||
PROBE[email-roundtrip-monitor<br/>CronJob every 20m] -->|IMAP :993<br/>no PROXY| SVC
|
||||
SVC -->|kube-proxy routes| PODINT
|
||||
|
||||
%% The pod — two listener sets, one process tree
|
||||
subgraph POD["mailserver pod (docker-mailserver 15.0.0)"]
|
||||
direction LR
|
||||
PODEXT[Alt ports<br/>2525 / 4465 / 5587 / 10993<br/><b>PROXY v2 REQUIRED</b><br/>smtpd_upstream_proxy_protocol=haproxy<br/>haproxy = yes]
|
||||
PODINT[Stock ports<br/>25 / 465 / 587 / 993<br/>PROXY-free]
|
||||
PODEXT --> POSTFIX
|
||||
PODINT --> POSTFIX
|
||||
POSTFIX[Postfix<br/>postscreen + smtpd + cleanup + queue]
|
||||
POSTFIX --> RSPAMD[Rspamd<br/>spam + DKIM + DMARC]
|
||||
RSPAMD --> DOVECOT[Dovecot IMAP<br/>LMTP deliver]
|
||||
DOVECOT --> MAILBOX[(Maildir storage<br/>mailserver-data-encrypted PVC<br/>proxmox-lvm-encrypted LUKS2)]
|
||||
graph TB
|
||||
subgraph "Inbound Mail"
|
||||
SENDER[Sending MTA] -->|MX lookup| MX[mail.viktorbarzin.me:25]
|
||||
MX -->|176.12.22.76:25| PF[pfSense NAT]
|
||||
PF -->|10.0.20.202:25| MLB[MetalLB<br/>ETP: Local]
|
||||
MLB --> POSTFIX[Postfix MTA]
|
||||
end
|
||||
|
||||
%% Outbound
|
||||
POSTFIX -->|queued mail<br/>SASL + TLS| BREVO[Brevo EU Relay<br/>smtp-relay.brevo.com:587<br/>300/day free tier]
|
||||
BREVO --> RECIPIENT[External Recipient]
|
||||
subgraph "Mail Processing"
|
||||
POSTFIX --> RSPAMD[Rspamd<br/>Spam/DKIM/DMARC]
|
||||
RSPAMD --> DOVECOT[Dovecot IMAP]
|
||||
DOVECOT --> MAILBOX[(Mailboxes<br/>proxmox-lvm PVC)]
|
||||
end
|
||||
|
||||
%% Webmail HTTP path
|
||||
USER[User browser] -->|HTTPS| CF[Cloudflare proxy<br/>mail.viktorbarzin.me]
|
||||
CF --> TUNNEL[Cloudflared tunnel<br/>pfSense → Traefik]
|
||||
TUNNEL --> TRAEFIK[Traefik Ingress<br/>Authentik-protected]
|
||||
TRAEFIK --> RC
|
||||
subgraph "Outbound Mail"
|
||||
POSTFIX_OUT[Postfix] -->|SASL + TLS| MAILGUN[Brevo EU Relay<br/>smtp-relay.brevo.com:587]
|
||||
MAILGUN --> RECIPIENT[Recipient]
|
||||
end
|
||||
|
||||
%% Security
|
||||
POSTFIX -.->|log stream<br/>real client IPs from PROXY v2| CSAGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
|
||||
CSAGENT -.-> CSLAPI[CrowdSec LAPI]
|
||||
CSLAPI -.->|bouncer decisions<br/>ban external IPs| PF
|
||||
subgraph "Webmail"
|
||||
USER[User] -->|HTTPS| TRAEFIK[Traefik Ingress]
|
||||
TRAEFIK --> RC[Roundcubemail]
|
||||
RC -->|IMAP 993| DOVECOT
|
||||
RC -->|SMTP 587| POSTFIX_OUT
|
||||
end
|
||||
|
||||
%% Monitoring
|
||||
PROBE -.->|Brevo HTTP API<br/>triggers external delivery| MX
|
||||
PROBE -.->|Push on roundtrip success| PUSH[Pushgateway + Uptime Kuma]
|
||||
subgraph "Security"
|
||||
MLB -->|Real client IPs| CS_AGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
|
||||
CS_AGENT --> CS_LAPI[CrowdSec LAPI]
|
||||
end
|
||||
|
||||
classDef extPath fill:#ffedd5,stroke:#ea580c,stroke-width:2px
|
||||
classDef intPath fill:#dbeafe,stroke:#2563eb,stroke-width:2px
|
||||
classDef pod fill:#dcfce7,stroke:#15803d
|
||||
classDef sec fill:#fee2e2,stroke:#dc2626
|
||||
class SENDER,MX,PF,HAP,KN,PODEXT extPath
|
||||
class RC,PROBE,SVC,PODINT intPath
|
||||
class POSTFIX,RSPAMD,DOVECOT,MAILBOX pod
|
||||
class CSAGENT,CSLAPI sec
|
||||
subgraph "Monitoring"
|
||||
PROBE[E2E Roundtrip Probe<br/>CronJob every 20m] -->|Mailgun API| SENDER
|
||||
PROBE -->|IMAP check| DOVECOT
|
||||
PROBE --> PUSH[Pushgateway + Uptime Kuma]
|
||||
DEXP[Dovecot Exporter<br/>:9166] --> PROM[Prometheus]
|
||||
end
|
||||
```
|
||||
|
||||
### PROXY v2 sequence (external SMTP roundtrip)
|
||||
|
||||
Illustrates the wire-level sequence of a Brevo probe email arriving at our MX. Same sequence applies to any external sender.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
autonumber
|
||||
participant C as External MTA<br/>(e.g. Brevo 77.32.148.26)
|
||||
participant PF as pfSense WAN<br/>192.168.1.2:25
|
||||
participant HAP as pfSense HAProxy<br/>10.0.20.1:25
|
||||
participant N as k8s-node:30125<br/>ETP: Cluster
|
||||
participant P as Postfix postscreen<br/>pod:2525
|
||||
|
||||
C->>PF: TCP SYN dst=192.168.1.2:25
|
||||
PF->>HAP: NAT rdr rewrites dst → 10.0.20.1:25
|
||||
HAP->>N: TCP connect (src=10.0.20.1, dst=k8s-node:30125)
|
||||
Note over HAP,N: HAProxy opens a NEW TCP flow<br/>to the backend k8s node.
|
||||
HAP->>N: PROXY v2 header<br/>(source=77.32.148.26, dest=10.0.20.1)
|
||||
N->>P: kube-proxy SNAT src=k8s-node IP<br/>forwards PROXY header + payload to pod
|
||||
P->>P: Parse PROXY v2 header<br/>smtpd_client_addr := 77.32.148.26<br/>(despite kube-proxy SNAT on the wire)
|
||||
P-->>C: SMTP banner 220 mail.viktorbarzin.me
|
||||
C-->>P: EHLO / MAIL FROM / RCPT TO / DATA
|
||||
Note over P,C: Real client IP logged in maillog,<br/>fed to CrowdSec postfix parser.
|
||||
P->>P: → smtpd → Rspamd → Dovecot → mailbox
|
||||
```
|
||||
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd (single container) |
|
||||
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd |
|
||||
| Roundcubemail | 1.6.13-apache | `mailserver` namespace | Webmail UI (MySQL-backed) |
|
||||
| Dovecot Exporter | latest | Sidecar in mailserver pod | Prometheus metrics (port 9166) |
|
||||
| Rspamd | Built into docker-mailserver | — | Spam filtering, DKIM signing, DMARC verification |
|
||||
| pfSense HAProxy | 2.9-dev6 (`pfSense-pkg-haproxy-devel`) | pfSense VM | TCP reverse proxy injecting PROXY v2 for external mail |
|
||||
| Brevo EU (ex-Sendinblue) | SaaS | — | Outbound SMTP relay (300/day free) |
|
||||
|
||||
Dovecot exporter was retired in code-1ik (2026-04-19) — `viktorbarzin/dovecot_exporter` speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot 2.3.19 (shipped with docker-mailserver 15.0.0) no longer provides.
|
||||
|
||||
## Port mapping
|
||||
|
||||
The mailserver pod exposes **8 TCP listeners**: 4 stock + 4 alt. Two Kubernetes Services front them depending on whether the client can inject PROXY v2.
|
||||
|
||||
| Mail protocol | Service port | K8s Service | Container port | NodePort | PROXY v2? | Who uses this path |
|
||||
|---|---|---|---|---|---|---|
|
||||
| SMTP (plain + STARTTLS) | 25 | `mailserver` ClusterIP | 25 | — | ❌ stock | Intra-cluster only (not used — internal clients send via 587) |
|
||||
| SMTPS (implicit TLS) | 465 | `mailserver` ClusterIP | 465 | — | ❌ stock | Intra-cluster (Roundcube rarely uses this) |
|
||||
| Submission (STARTTLS) | 587 | `mailserver` ClusterIP | 587 | — | ❌ stock | **Roundcube pod** → mailserver.svc:587 |
|
||||
| IMAPS | 993 | `mailserver` ClusterIP | 993 | — | ❌ stock | **Roundcube pod** + E2E probe → mailserver.svc:993 |
|
||||
| SMTP | 25 | `mailserver-proxy` NodePort | 2525 | 30125 | ✅ required | External MX traffic via pfSense HAProxy |
|
||||
| SMTPS | 465 | `mailserver-proxy` NodePort | 4465 | 30126 | ✅ required | External SMTPS submission |
|
||||
| Submission | 587 | `mailserver-proxy` NodePort | 5587 | 30127 | ✅ required | External STARTTLS submission (mail clients over WAN) |
|
||||
| IMAPS | 993 | `mailserver-proxy` NodePort | 10993 | 30128 | ✅ required | External IMAPS (mail clients over WAN) |
|
||||
|
||||
The alt listeners are set up by:
|
||||
- **Postfix**: `user-patches.sh` (shipped via ConfigMap `mailserver-user-patches`) appends 3 entries to `master.cf` with `-o postscreen_upstream_proxy_protocol=haproxy` (for 2525) or `-o smtpd_upstream_proxy_protocol=haproxy` (for 4465/5587).
|
||||
- **Dovecot**: `dovecot.cf` ConfigMap adds a second `inet_listener` inside `service imap-login` with `haproxy = yes`, plus `haproxy_trusted_networks = 10.0.20.0/24` to allow PROXY headers from the k8s node subnet (post kube-proxy SNAT the source IP is always a node IP).
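For orientation, a minimal sketch of what the Postfix side of that amounts to. The ConfigMap is the source of truth and may append to `master.cf` directly; this version expresses the same three entries with `postconf -M`/`-P`, and the 10993 listener is handled by the Dovecot config above, not here.

```sh
#!/bin/bash
# user-patches.sh (sketch): add PROXY-v2-only listeners next to the stock ones.

# 2525: postscreen entry that expects a PROXY v2 header from HAProxy
postconf -M "2525/inet=2525 inet n - n - 1 postscreen"
postconf -P "2525/inet/postscreen_upstream_proxy_protocol=haproxy"

# 4465: implicit-TLS smtpd, PROXY v2 required
postconf -M "4465/inet=4465 inet n - n - - smtpd"
postconf -P "4465/inet/smtpd_upstream_proxy_protocol=haproxy"
postconf -P "4465/inet/smtpd_tls_wrappermode=yes"

# 5587: STARTTLS submission smtpd, PROXY v2 required
postconf -M "5587/inet=5587 inet n - n - - smtpd"
postconf -P "5587/inet/smtpd_upstream_proxy_protocol=haproxy"
```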
|
||||
|
||||
## Mail Flow
|
||||
|
||||
### Inbound
|
||||
```
|
||||
Internet → MX: mail.viktorbarzin.me (priority 1)
|
||||
→ A record: 176.12.22.76 (non-proxied Cloudflare DNS-only)
|
||||
→ pfSense NAT rdr: WAN:{25,465,587,993} → 10.0.20.1:{same}
|
||||
→ pfSense HAProxy (TCP mode, send-proxy-v2 on backend)
|
||||
→ k8s-node:{30125..30128} NodePort (mailserver-proxy, ETP: Cluster)
|
||||
→ kube-proxy → pod alt listener (2525/4465/5587/10993)
|
||||
→ Postfix postscreen / smtpd / Dovecot parses PROXY v2 header
|
||||
→ Rspamd (spam + DKIM + DMARC) → Dovecot → mailbox
|
||||
→ pfSense NAT: port 25 → 10.0.20.202:25
|
||||
→ MetalLB (dedicated IP, ETP: Local — preserves real client IPs)
|
||||
→ Postfix → Rspamd (spam + DKIM + DMARC check) → Dovecot → mailbox
|
||||
```
|
||||
|
||||
No backup MX. If the server is down, sender MTAs queue and retry for 4-5 days per SMTP standards (RFC 5321).
@ -170,13 +95,13 @@ All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.t
|
|||
| MX | `viktorbarzin.me` | `mail.viktorbarzin.me` (pri 1) | Inbound mail routing |
|
||||
| A | `mail.viktorbarzin.me` | `176.12.22.76` (non-proxied) | Mail server IP |
|
||||
| AAAA | `mail.viktorbarzin.me` | `2001:470:6e:43d::2` | IPv6 (HE tunnel) |
|
||||
| TXT (SPF) | `viktorbarzin.me` | `v=spf1 include:spf.brevo.com ~all` | Authorize Brevo for outbound (soft-fail during cutover; was `include:mailgun.org -all` until 2026-04-18 Brevo migration) |
|
||||
| TXT (DKIM) | `s1._domainkey` | RSA 1024-bit key | Mailgun DKIM (roundtrip probe only — inbound testing still uses Mailgun API) |
|
||||
| TXT (SPF) | `viktorbarzin.me` | `v=spf1 include:mailgun.org -all` | Authorize Mailgun for outbound |
|
||||
| TXT (DKIM) | `s1._domainkey` | RSA 1024-bit key | Mailgun DKIM (roundtrip probe) |
|
||||
| TXT (DKIM) | `mail._domainkey` | RSA 2048-bit key | Rspamd self-hosted DKIM signing |
|
||||
| CNAME (DKIM) | `brevo1._domainkey` | b1.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
|
||||
| CNAME (DKIM) | `brevo2._domainkey` | b2.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
|
||||
| TXT | `viktorbarzin.me` | `brevo-code:a6ef1dd9...` | Brevo domain verification |
|
||||
| TXT (DMARC) | `_dmarc` | `p=quarantine; pct=100; rua=mailto:dmarc@viktorbarzin.me` | DMARC enforcement; aggregate reports land in-domain at `dmarc@viktorbarzin.me` (tracked under code-569; current live record still points at `e21c0ff8@dmarc.mailgun.org` pending cutover) |
|
||||
| TXT (DMARC) | `_dmarc` | `p=quarantine; pct=100` | DMARC enforcement, reports to Mailgun + ondmarc |
|
||||
| TXT (MTA-STS) | `_mta-sts` | `v=STSv1; id=20260412` | TLS enforcement for inbound |
|
||||
| TXT (TLSRPT) | `_smtp._tls` | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS failure reporting |
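A quick external spot-check of the live records against this table (run from any machine with `dig`):

```sh
dig +short MX viktorbarzin.me
dig +short TXT viktorbarzin.me | grep -i spf
dig +short TXT mail._domainkey.viktorbarzin.me | cut -c1-60     # Rspamd 2048-bit key
dig +short CNAME brevo1._domainkey.viktorbarzin.me
dig +short TXT _dmarc.viktorbarzin.me
dig +short TXT _mta-sts.viktorbarzin.me
```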
@ -189,13 +114,9 @@ Reverse DNS for `176.12.22.76` returns `176-12-22-76.pon.spectrumnet.bg.` (ISP-a
|
|||
### CrowdSec Integration
|
||||
- **Collections**: `crowdsecurity/postfix` + `crowdsecurity/dovecot` (installed)
|
||||
- **Log acquisition**: CrowdSec agents parse mailserver pod logs for brute-force patterns
|
||||
- **Real client IPs**: pfSense HAProxy injects PROXY v2 header on each backend connection; Postfix (`postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy` on alt ports) + Dovecot (`haproxy = yes` on alt IMAPS listener) parse it to recover the true source IP despite kube-proxy SNAT. Replaces the pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme (see code-yiu)
|
||||
- **Real client IPs**: `externalTrafficPolicy: Local` on dedicated MetalLB IP `10.0.20.202` preserves original client IPs (not SNATed to node IPs)
|
||||
- **Decisions**: CrowdSec bans/challenges attackers via firewall bouncer rules
|
||||
|
||||
### Fail2ban Disabled (CrowdSec is the Policy)
|
||||
|
||||
docker-mailserver ships Fail2ban, but it is explicitly disabled here: `ENABLE_FAIL2BAN = "0"` at [`stacks/mailserver/modules/mailserver/main.tf:68`](../../stacks/mailserver/modules/mailserver/main.tf). CrowdSec is the cluster-wide bouncer for SSH, HTTP, and SMTP/IMAP brute-force defence — it already parses the `postfix` and `dovecot` log streams via the collections listed above and applies decisions at the LB/firewall layer. Enabling Fail2ban in-pod would create a duplicate response path (two systems racing to ban the same IP from different enforcement points), add iptables churn inside the container, and fragment the audit trail across two decision stores. Decision (2026-04-18): keep it disabled; CrowdSec owns this policy.
|
||||
|
||||
### Rspamd
|
||||
- Spam filtering with phishing detection and Oletools
|
||||
- DKIM signing (selector `mail`, 2048-bit RSA)
|
||||
|
|
@ -218,30 +139,28 @@ anvil_rate_time_unit = 60s
|
|||
## Monitoring
|
||||
|
||||
### E2E Roundtrip Probe
|
||||
CronJob `email-roundtrip-monitor` (every 20 min, `*/20 * * * *`):
|
||||
1. Sends test email via **Brevo HTTP API** to `smoke-test@viktorbarzin.me` (Brevo delivers it to our MX over the public internet, exercising the full external-ingress path).
|
||||
2. Email hits WAN → pfSense HAProxy → k8s-node:30125 → pod :2525 postscreen (PROXY v2) → Postfix → catch-all delivers to `spam@` mailbox.
|
||||
3. Verifies delivery via IMAP — connects to `mailserver.mailserver.svc.cluster.local:993` (intra-cluster path, no PROXY), searches by UUID marker.
|
||||
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma.
|
||||
|
||||
Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from ExternalSecret `mailserver-probe-secrets` (synced from Vault `secret/viktor` + `secret/platform.mailserver_accounts`) — see code-39v.
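When debugging the probe it is usually faster to run it on demand than to wait for the next `*/20` tick. A minimal sketch using the names given above:

```sh
kubectl create job --from=cronjob/email-roundtrip-monitor probe-manual -n mailserver
kubectl logs -n mailserver job/probe-manual -f

# If the job fails on missing credentials, check the ExternalSecret sync first:
kubectl get externalsecret -n mailserver mailserver-probe-secrets
kubectl delete job -n mailserver probe-manual   # clean up afterwards
```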
|
||||
CronJob `email-roundtrip-monitor` (every 10 min):
|
||||
1. Sends test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
|
||||
2. Email hits MX → Postfix → catch-all delivers to `spam@` mailbox
|
||||
3. Verifies delivery via IMAP (searches by UUID marker)
|
||||
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma
|
||||
|
||||
### Prometheus Alerts
|
||||
| Alert | Threshold | Severity |
|
||||
|-------|-----------|----------|
|
||||
| MailServerDown | No replicas for 5m | warning |
|
||||
| EmailRoundtripFailing | Probe failing for 30m | warning |
|
||||
| EmailRoundtripStale | No success in >80m (60m threshold + for:20m) | warning |
|
||||
| EmailRoundtripStale | No success in >40m | warning |
|
||||
| EmailRoundtripNeverRun | Metric absent for 40m | warning |
|
||||
|
||||
### Uptime Kuma Monitors
|
||||
- TCP SMTP on `176.12.22.76:25` — full external path (DNS → WAN → pfSense HAProxy → mailserver)
|
||||
- TCP `mailserver.svc:{587,993}` — intra-cluster ClusterIP path
|
||||
- TCP `10.0.20.1:{25,993}` — pfSense HAProxy health (post code-yiu Phase 6)
|
||||
- E2E Push monitor (receives push from `email-roundtrip-monitor` probe)
|
||||
- TCP SMTP on `176.12.22.76:25` (external, 60s interval)
|
||||
- TCP IMAP on `10.0.20.202:993` (internal)
|
||||
- E2E Push monitor (receives push from roundtrip probe)
|
||||
|
||||
### Dovecot exporter — retired
|
||||
`viktorbarzin/dovecot_exporter` was removed in code-1ik (2026-04-19). It spoke the pre-2.3 `old_stats` FIFO protocol; Dovecot 2.3.19 (docker-mailserver 15.0.0) no longer emits that, so the scrape only ever returned `dovecot_up{scope="user"} 0`. If Dovecot metrics become valuable, reach for a 2.3+ compatible exporter (e.g. `jtackaberry/dovecot_exporter`) and re-add the scrape + alerts. The previously-created `mailserver-metrics` ClusterIP Service was also removed.
|
||||
### Dovecot Exporter
|
||||
- Sidecar container in mailserver pod, port 9166
|
||||
- Scraped by Prometheus for IMAP connection metrics
|
||||
|
||||
## Terraform
@ -258,21 +177,17 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
|
|||
| `secret/platform` | `mailserver_accounts` | User credentials (JSON) |
|
||||
| `secret/platform` | `mailserver_aliases` | Postfix virtual aliases |
|
||||
| `secret/platform` | `mailserver_opendkim_key` | DKIM private key |
|
||||
| `secret/platform` | `mailserver_sasl_passwd` | Brevo relay credentials (`[smtp-relay.brevo.com]:587 <login>:<key>`) |
|
||||
| `secret/viktor` | `brevo_api_key` | Brevo API key — used by BOTH outbound SMTP SASL (postfix) AND the E2E roundtrip probe (sends external test mail via Brevo HTTP) |
|
||||
| `secret/viktor` | `mailgun_api_key` | Historical; no longer used by the probe post code-n5l/Phase-5 work. Kept for reference. |
|
||||
| `secret/platform` | `mailserver_sasl_passwd` | Mailgun relay credentials |
|
||||
| `secret/viktor` | `mailgun_api_key` | Mailgun API for E2E probe (inbound testing) |
|
||||
| `secret/viktor` | `brevo_api_key` | Brevo API key (stored for reference) |
|
||||
|
||||
## Storage
|
||||
|
||||
| PVC | Size | Storage Class | Purpose |
|
||||
|-----|------|---------------|---------|
|
||||
| `mailserver-data-encrypted` | 2Gi (auto-resize 5Gi) | `proxmox-lvm-encrypted` (LUKS2) | Maildir + Postfix queue + state + logs |
|
||||
| `roundcubemail-html-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube PHP code + user session data |
|
||||
| `roundcubemail-enigma-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube Enigma (PGP) user keys |
|
||||
| `mailserver-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `mailserver-backup` CronJob destination (`/srv/nfs/mailserver-backup/<YYYY-WW>/`) |
|
||||
| `roundcube-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `roundcube-backup` CronJob destination |
|
||||
|
||||
**Backup**: daily `mailserver-backup` + `roundcube-backup` CronJobs rsync data PVCs to NFS. NFS directory is picked up by the PVE host's inotify-driven `/usr/local/bin/offsite-sync-backup` which pushes to Synology (weekly). See [Storage & Backup Architecture](storage.md) for the 3-2-1 flow.
|
||||
| `mailserver-data-proxmox` | 2Gi (auto-resize 5Gi) | proxmox-lvm | Mail data, state, logs |
|
||||
| `roundcubemail-html-proxmox` | 1Gi | proxmox-lvm | Roundcube web files |
|
||||
| `roundcubemail-enigma-proxmox` | 1Gi | proxmox-lvm | Roundcube encryption |
|
||||
|
||||
## Decisions & Rationale
@ -291,23 +206,19 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
|
|||
- **Decision**: Rspamd replaces both SpamAssassin and OpenDKIM in a single component
|
||||
- **Tradeoff**: Higher memory usage (~150-200MB) but simpler stack
|
||||
|
||||
### Client-IP Preservation (pfSense HAProxy + PROXY v2)
|
||||
- **Current (2026-04-19, bd code-yiu)**: pfSense HAProxy listens on `10.0.20.1:{25,465,587,993}`, forwards to k8s NodePort 30125-30128 with `send-proxy-v2` on each backend connection. The mailserver pod exposes parallel listeners (2525/4465/5587/10993) that REQUIRE the PROXY v2 header, while the stock ports 25/465/587/993 stay PROXY-free for intra-cluster traffic (Roundcube, probe). The mailserver Service is ClusterIP-only; ETP is no longer a concern for external traffic.
|
||||
- **Historical (2026-04-12 → 2026-04-19)**: Dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` — required pod/speaker colocation; kube-proxy preserved client IP only when pod was on the same node as the advertising speaker.
|
||||
- **Why switched**: ETP:Local made the mailserver's single replica drop inbound mail silently during pod reschedule (30-60s GARP flip). HAProxy with `send-proxy-v2` lets the pod reschedule to any node and recover IP-preservation through the header.
|
||||
- **Tradeoff**: pfSense now runs HAProxy (one more service in the firewall's responsibility); alt container ports + extra Service are ~80 lines of Terraform. The win is HA without IP-preservation compromise.
|
||||
- **Runbook**: [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md).
|
||||
### Dedicated MetalLB IP for CrowdSec
|
||||
- **Decision**: Mailserver gets `10.0.20.202` (separate from shared `10.0.20.200`) with `externalTrafficPolicy: Local`
|
||||
- **Why**: Shared IP with ETP: Cluster SNATs away real client IPs, making CrowdSec detections and Postfix rate limiting useless
|
||||
- **Tradeoff**: Uses one extra IP from the MetalLB pool. Requires separate pfSense NAT rule.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Inbound mail not arriving
|
||||
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
|
||||
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
|
||||
3. **pfSense NAT**: verify WAN:{25,465,587,993} rdr to `10.0.20.1` (HAProxy VIP). `ssh admin@10.0.20.1 'pfctl -sn' | grep '10.0.20.1'`
|
||||
4. **HAProxy health**: `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"` — at least one backend in `srv_op_state=2` (UP) per pool
|
||||
5. **Container listener**: `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'` — 8 lines expected
|
||||
6. **Postfix queue + delivery**: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject|smtpd-proxy'`
|
||||
7. **CrowdSec decisions**: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
|
||||
1. Check MX: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
|
||||
2. Check port 25: `nc -zw5 mail.viktorbarzin.me 25`
|
||||
3. Check pfSense NAT rule: port 25 → `10.0.20.202:25`
|
||||
4. Check Postfix logs: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject'`
|
||||
5. Check if CrowdSec is blocking the sender: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
|
||||
|
||||
### Outbound mail failing
|
||||
1. Check Brevo relay: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep relay` — should show `relay=smtp-relay.brevo.com`
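Further hedged checks for the outbound path (queue contents, relay configuration, and a raw STARTTLS handshake; assumes `openssl` is present in the container image):

```sh
# What is stuck in the queue, and the deferral reason per message:
kubectl exec -n mailserver deploy/mailserver -c docker-mailserver -- postqueue -p

# Confirm the relay and SASL map Postfix is actually using:
kubectl exec -n mailserver deploy/mailserver -c docker-mailserver -- \
  postconf relayhost smtp_sasl_password_maps

# Raw handshake against Brevo from inside the pod (banner and TLS only, no auth):
kubectl exec -n mailserver deploy/mailserver -c docker-mailserver -- \
  sh -c 'openssl s_client -connect smtp-relay.brevo.com:587 -starttls smtp -brief </dev/null'
```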
@ -63,7 +63,6 @@ graph TB
|
|||
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
|
||||
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
|
||||
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
|
||||
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
|
||||
|
||||
## How It Works
|
||||
|
||||
|
|
@ -76,9 +75,7 @@ Prometheus scrapes metrics from all cluster components and applications using Se
|
|||
|
||||
### External Monitoring
|
||||
|
||||
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for externally-reachable ingresses. Discovery is **opt-OUT**: the script lists every ingress via the K8s API and creates a monitor for any host ending in `.viktorbarzin.me`, skipping only those annotated `uptime.viktorbarzin.me/external-monitor: "false"`. Both `ingress_factory` and the `reverse-proxy` factory emit that annotation when the caller sets `external_monitor = false`; leaving it null keeps the service monitored by default (important for helm-provisioned ingresses that don't go through our factories). The legacy `cloudflare_proxied_names` ConfigMap is a fallback if the K8s API discovery fails.
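To opt a single ingress out by hand (for example one created by a Helm chart rather than a factory), the annotation can be applied directly; a small sketch, with placeholder names:

```sh
kubectl annotate ingress <name> -n <namespace> \
  'uptime.viktorbarzin.me/external-monitor=false' --overwrite
# The next external-monitor-sync run (within 10 min) then skips this host.
```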
|
||||
|
||||
These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
|
||||
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for every service in `cloudflare_proxied_names`. These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
|
||||
|
||||
Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.
|
||||
|
||||
|
|
@ -158,14 +155,9 @@ spec:
|
|||
|
||||
#### Email Monitoring Alerts
|
||||
- **EmailRoundtripFailing**: E2E email probe returning failure for >30m
|
||||
- **EmailRoundtripStale**: No successful email round-trip in >80m (60m threshold + for:20m)
|
||||
- **EmailRoundtripStale**: No successful email round-trip in >40m
|
||||
- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
|
||||
|
||||
#### Registry Integrity Alerts
|
||||
- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
|
||||
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
|
||||
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
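When `RegistryManifestIntegrityFailure` fires, the probe's walk can be reproduced by hand. A rough sketch using the standard registry-v2 endpoints; authentication flags are omitted and may be required against the Forgejo registry:

```sh
REG=https://forgejo.viktorbarzin.me
for repo in $(curl -fsS "$REG/v2/_catalog" | jq -r '.repositories[]'); do
  for tag in $(curl -fsS "$REG/v2/$repo/tags/list" | jq -r '.tags[]?'); do
    # HEAD every tagged manifest; a 404 here is exactly what the alert reports.
    curl -fsSI -o /dev/null \
      -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
      "$REG/v2/$repo/manifests/$tag" || echo "MISSING: $repo:$tag"
  done
done
```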
|
||||
|
||||
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
|
||||
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
|
||||
2. Email lands in the `spam@` catch-all mailbox via MX delivery
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
# Networking Architecture
|
||||
|
||||
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
|
||||
Last updated: 2026-04-12
|
||||
|
||||
## Overview
|
||||
|
||||
|
|
@ -28,6 +28,7 @@ graph TB
|
|||
|
||||
subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
|
||||
Proxmox[Proxmox Host<br/>10.0.10.1]
|
||||
TrueNAS[TrueNAS<br/>10.0.10.15]
|
||||
DevVM[DevVM<br/>10.0.10.10]
|
||||
Registry[Registry VM<br/>10.0.20.10]
|
||||
end
|
||||
|
|
@ -63,6 +64,7 @@ graph TB
|
|||
vmbr0 -.physical link.- eno1
|
||||
vmbr0 --> vmbr1
|
||||
vmbr1 -.VLAN 10.- Proxmox
|
||||
vmbr1 -.VLAN 10.- TrueNAS
|
||||
vmbr1 -.VLAN 10.- DevVM
|
||||
vmbr1 -.VLAN 20.- pfSense
|
||||
vmbr1 -.VLAN 20.- Tech
|
||||
|
|
@ -104,7 +106,7 @@ flowchart LR
|
|||
end
|
||||
|
||||
subgraph K8s["Kubernetes"]
|
||||
Import[CronJob<br/>pfsense-import<br/>hourly]
|
||||
Import[CronJob<br/>pfsense-import<br/>every 5min]
|
||||
Sync[CronJob<br/>dns-sync<br/>every 15min]
|
||||
IPAM[phpIPAM<br/>Web UI + API]
|
||||
MySQL[(MySQL<br/>InnoDB)]
|
||||
|
|
@ -142,14 +144,14 @@ flowchart LR
|
|||
|
||||
### DHCP Coverage
|
||||
|
||||
| Subnet | DHCP Server | DNS option 6 | Reservations | DDNS | Notes |
|
||||
|--------|------------|--------------|--------------|------|-------|
|
||||
| 10.0.10.0/24 (Mgmt) | Kea on pfSense | `10.0.10.1, 94.140.14.14` | 3 (devvm, pxe, ha) | Yes (TSIG) | VMs with static MACs |
|
||||
| 10.0.20.0/24 (K8s) | Kea on pfSense | `10.0.20.1, 94.140.14.14` | 7 (master, nodes 1-5, registry) | Yes (TSIG) | K8s cluster nodes |
|
||||
| 192.168.1.0/24 (LAN) | **TP-Link AP** | `192.168.1.2, 94.140.14.14` | 42 (all home devices) | Yes | pfSense Kea WAN is disabled |
|
||||
| 10.3.2.0/24 (VPN) | Static | — | — | No | WireGuard peers |
|
||||
| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | — | No | Remote site |
|
||||
| 192.168.8.0/24 (London) | GL-iNet | — | — | No | Remote site |
|
||||
| Subnet | DHCP Server | Reservations | DDNS | Notes |
|
||||
|--------|------------|--------------|------|-------|
|
||||
| 10.0.10.0/24 (Mgmt) | Kea on pfSense | 4 (devvm, truenas, pxe, ha) | Yes | VMs with static MACs |
|
||||
| 10.0.20.0/24 (K8s) | Kea on pfSense | 7 (master, nodes 1-5, registry) | Yes | K8s cluster nodes |
|
||||
| 192.168.1.0/24 (LAN) | Kea on pfSense | 42 (all home devices) | Yes | TP-Link is dumb AP only |
|
||||
| 10.3.2.0/24 (VPN) | Static | — | No | WireGuard peers |
|
||||
| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | No | Remote site |
|
||||
| 192.168.8.0/24 (London) | GL-iNet | — | No | Remote site |
|
||||
|
||||
## How It Works
|
||||
|
||||
|
|
@ -158,7 +160,7 @@ flowchart LR
|
|||
The Proxmox host uses a dual-bridge architecture:
|
||||
- **vmbr0**: Physical bridge on interface `eno1`, connected to upstream LAN (192.168.1.0/24). Proxmox management IP is 192.168.1.127.
|
||||
- **vmbr1**: Internal VLAN-aware bridge, acts as a trunk carrying:
|
||||
- **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, DevVM
|
||||
- **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, TrueNAS, DevVM
|
||||
- **VLAN 20 (Kubernetes)**: 10.0.20.0/24 — All K8s nodes, services, MetalLB IPs
|
||||
|
||||
VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the upstream LAN via NAT.
|
||||
|
|
@ -312,14 +314,9 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
|
|||
|
||||
**pfSense**:
|
||||
- Config: Not Terraform-managed (pfSense web UI / config.xml)
|
||||
- DHCP: Kea DHCP4 on the two internal VLANs (VLAN 10 = 10.0.10.0/24, VLAN 20 = 10.0.20.0/24). WAN/192.168.1.0/24 is served by the TP-Link dumb AP — pfSense's Kea WAN subnet is disabled.
|
||||
- **DNS option 6** (per-subnet, WS E 2026-04-19):
|
||||
- 10.0.10.0/24 → `10.0.10.1, 94.140.14.14` (internal Unbound + AdGuard Home public fallback)
|
||||
- 10.0.20.0/24 → `10.0.20.1, 94.140.14.14`
|
||||
- 192.168.1.0/24 → `192.168.1.2, 94.140.14.14` (served by TP-Link, unchanged by WS E)
|
||||
- Rationale: clients survive an internal resolver outage by falling through to AdGuard (`94.140.14.14`) — confirmed via null-route drill on 2026-04-19.
|
||||
- DHCP: Kea DHCP4 on all 3 subnets (VLAN 10, VLAN 20, WAN/LAN 192.168.1.0/24)
|
||||
- 42 MAC→IP reservations for 192.168.1.0/24 (all known home devices)
|
||||
- DHCP DDNS: Kea DHCP-DDNS sends **TSIG-signed** RFC 2136 updates to Technitium (key `kea-ddns`, HMAC-SHA256; secret in Vault `secret/viktor/kea_ddns_tsig_secret`). Zone `viktorbarzin.lan` + reverse zones require both a pfSense-source IP AND a valid TSIG signature. Config: `/usr/local/etc/kea/kea-dhcp-ddns.conf` (hand-managed on pfSense; pre-WS-E backup at `kea-dhcp-ddns.conf.2026-04-19-pre-tsig`).
|
||||
- DHCP DDNS: Kea DHCP-DDNS sends RFC 2136 updates to Technitium on every lease grant (forward A + reverse PTR)
|
||||
- Firewall rules: Allow K8s egress, block inter-VLAN by default
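To verify the TSIG requirement from the DDNS bullet above end to end, a hand-rolled RFC 2136 update can be sent with the same key Kea uses. A hedged sketch: it assumes the `nsupdate` and `vault` CLIs are available, must run from a pfSense-sourced address to pass the source-IP ACL, and uses the Technitium LB IP `10.0.20.201`; `tsig-probe` is a throwaway record name.

```sh
TSIG_SECRET=$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)

nsupdate -y "hmac-sha256:kea-ddns:${TSIG_SECRET}" <<EOF
server 10.0.20.201
zone viktorbarzin.lan
update add tsig-probe.viktorbarzin.lan 300 A 10.0.10.250
send
EOF

dig +short tsig-probe.viktorbarzin.lan @10.0.20.201   # expect 10.0.10.250
# Remove the test record the same way afterwards (update delete tsig-probe.viktorbarzin.lan A).
```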
|
||||
|
||||
**Technitium**:
|
||||
|
|
@ -338,7 +335,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
|
|||
- Stack: `stacks/phpipam/`
|
||||
- Web UI: `phpipam.viktorbarzin.me` (Authentik-protected)
|
||||
- Database: MySQL InnoDB cluster (`mysql.dbaas.svc.cluster.local`)
|
||||
- Device import: CronJob `phpipam-pfsense-import` hourly — queries Kea DHCP leases + pfSense ARP table via SSH (no active scanning)
|
||||
- Device import: CronJob `phpipam-pfsense-import` every 5min — queries Kea DHCP leases + pfSense ARP table via SSH (no active scanning)
|
||||
- DNS sync: CronJob `phpipam-dns-sync` every 15min — bidirectional sync between phpIPAM and Technitium DNS (push named hosts → A+PTR, pull DNS hostnames → unnamed phpIPAM entries)
|
||||
- Subnets tracked: 10.0.10.0/24, 10.0.20.0/24, 192.168.1.0/24, 10.3.2.0/24, 192.168.8.0/24, 192.168.0.0/24
|
||||
- API: REST API enabled (app `claude`, ssl_token auth), MCP server available for agent access
|
||||
|
|
@ -367,7 +364,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
|
|||
1. **Single flat network**: Simpler, but no isolation between management and workload traffic.
|
||||
2. **Routed network with physical VLANs**: Requires switch with VLAN support.
|
||||
|
||||
**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, DevVM) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage.
|
||||
**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, TrueNAS) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage.
|
||||
|
||||
### Why Cloudflared Tunnel Instead of Port Forwarding?
|
||||
|
||||
|
|
|
|||
|
|
@ -92,7 +92,7 @@ graph TB
|
|||
| 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
|
||||
| 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
|
||||
| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry |
|
||||
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED 2026-04-13** — NFS now served by Proxmox host (192.168.1.127). VM still exists in stopped state on PVE pending user decision on deletion. |
|
||||
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED** — NFS now served by Proxmox host (192.168.1.127) |
|
||||
|
||||
### Kubernetes Cluster
|
||||
|
||||
|
|
@ -139,7 +139,7 @@ The Kubernetes cluster consists of 5 nodes:
|
|||
- **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
|
||||
- **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
|
||||
|
||||
GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
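A quick way to confirm the label and taint landed on the same node after a card move (no hostname assumptions):

```sh
kubectl get nodes -l nvidia.com/gpu.present=true -o name
kubectl describe nodes -l nvidia.com/gpu.present=true | grep -i taints
# expect: nvidia.com/gpu=true:PreferNoSchedule
```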
|
||||
GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there.
|
||||
|
||||
### Service Organization
|
||||
|
||||
|
|
@ -213,7 +213,7 @@ Secrets are stored in HashiCorp Vault under `secret/`:
|
|||
|
||||
**Rationale**:
|
||||
- **Flexibility**: Easy to snapshot, clone, and roll back VMs during upgrades
|
||||
- **Isolation**: Management network (devvm) separated from Kubernetes
|
||||
- **Isolation**: Management network (TrueNAS, devvm) separated from Kubernetes
|
||||
- **GPU passthrough**: Can dedicate GPU to a single node without tainting the entire host
|
||||
- **Multi-purpose**: Same physical host can run non-K8s VMs (pfSense, Home Assistant)
|
||||
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
## Overview
|
||||
|
||||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 5-layer anti-AI scraping defense. All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
|
|
@ -59,7 +59,7 @@ Every incoming request passes through 6 security layers:
|
|||
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
|
||||
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
|
||||
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
|
||||
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
|
||||
4. **Anti-AI Scraping** - 5-layer bot defense (optional per service)
|
||||
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
|
||||
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
|
||||
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
|
||||
|
|
@ -131,12 +131,10 @@ This prevents resource exhaustion and enforces governance without manual quota m
|
|||
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
|
||||
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
|
||||
|
||||
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||||
### Anti-AI Scraping (5-Layer Defense)
|
||||
|
||||
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
|
||||
|
||||
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
|
||||
|
||||
#### Layer 1: Bot Blocking (ForwardAuth)
|
||||
|
||||
- Middleware calls `poison-fountain` service before backend
|
||||
|
|
@ -150,16 +148,25 @@ Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Rob
|
|||
- Instructs compliant bots to skip content
|
||||
- Lightweight, no performance impact
|
||||
|
||||
#### ~~Layer 3: Trap Links~~ (REMOVED)
|
||||
#### Layer 3: Trap Links
|
||||
|
||||
Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap links broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
|
||||
- JavaScript injects invisible links before `</body>`
|
||||
- Links point to honeypot endpoints
|
||||
- Legitimate browsers don't click, bots follow
|
||||
- Triggered bots get added to ban list
|
||||
|
||||
#### Layer 3 (formerly 4): Tarpit / Poison Content
|
||||
#### Layer 4: Tarpit
|
||||
|
||||
- Serves AI bots extremely slowly (~100 bytes/sec)
|
||||
- Wastes bot resources, makes scraping uneconomical
|
||||
- Humans see normal speed (only applies to detected bots)
|
||||
|
||||
#### Layer 5: Poison Content
|
||||
|
||||
- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
|
||||
- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
|
||||
- CronJob every 6 hours generates fake content
|
||||
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
|
||||
- Injects misleading/nonsense data into pages shown to bots
|
||||
- Degrades AI training data quality
|
||||
- **Requires `--http1.1` flag** to work with current HTTP/2 setup
|
||||
|
||||
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
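To exercise the tarpit the way a bot would, a hedged curl sketch (the bot User-Agent and article slug are placeholders; HTTP/1.1 is forced per the note above):

```sh
time curl --http1.1 -A "GPTBot" -s -o /dev/null \
  -w '%{size_download} bytes in %{time_total}s\n' \
  https://poison.viktorbarzin.me/article/some-slug
# At roughly 100 B/s, even a small page should take noticeably long to finish.
```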
|
||||
|
||||
|
|
@ -279,13 +286,13 @@ spec:
|
|||
- **Better observability**: Collect violation metrics before enforcing
|
||||
- **Selective enforcement**: Move to enforce mode per-policy after validation
|
||||
|
||||
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
|
||||
### Why 5-Layer Anti-AI Defense?
|
||||
|
||||
- **Defense in depth**: Each layer catches different bot types
|
||||
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
|
||||
- **Persistent bots**: Tarpit makes scraping uneconomical
|
||||
- **Poison content**: Degrades training data for bots that reach poison-fountain
|
||||
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
|
||||
- **Dumb bots**: Layer 3 (trap links) catches simple scrapers
|
||||
- **Persistent bots**: Layer 4 (tarpit) makes scraping uneconomical
|
||||
- **Sophisticated bots**: Layer 5 (poison content) degrades training data
|
||||
|
||||
### Why Fail-Open Mode?
|
||||
|
||||
|
|
@ -375,16 +382,15 @@ spec:
|
|||
2. Verify backend isn't returning transient errors: Check for 5xx responses
|
||||
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
|
||||
|
||||
### Poison Content Not Serving (Updated 2026-04-17)
|
||||
### Poison Content Not Injecting
|
||||
|
||||
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
|
||||
|
||||
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
|
||||
**Problem**: Bots not receiving poisoned content.
|
||||
|
||||
**Fix**:
|
||||
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
|
||||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
|
||||
3. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
|
||||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-content-injector`
|
||||
3. Ensure `--http1.1` flag set (required for HTTP/2 backends)
|
||||
4. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
|
||||
|
||||
## Related
@ -16,13 +16,13 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
|
|||
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB) — bulk media and backup targets
|
||||
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
|
||||
|
||||
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
|
||||
Both `StorageClass: nfs-truenas` (name kept for compatibility) and `StorageClass: nfs-proxmox` (identical) point to the Proxmox host. Migrated from TrueNAS (10.0.10.15) which has been fully decommissioned.
|
||||
|
||||
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
|
||||
|
||||
**History (2026-04-02)**: iSCSI block volumes migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver removed.
|
||||
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver has been removed.
|
||||
|
||||
**History (2026-04-13)**: TrueNAS (VM 9000, 10.0.10.15) fully decommissioned. NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`. TrueNAS VM still exists in stopped state on PVE pending user decision on deletion.
|
||||
**Migration (2026-04)**: TrueNAS (10.0.10.15) fully decommissioned. All NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
|
|
@ -39,7 +39,7 @@ graph TB
|
|||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster"]
|
||||
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
|
||||
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas / nfs-proxmox<br/>soft,timeo=30,retrans=3"]
|
||||
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
|
||||
|
||||
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
|
||||
|
|
@ -77,10 +77,10 @@ graph TB
|
|||
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
|
||||
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
|
||||
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
|
||||
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
|
||||
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
|
||||
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | NFS storage (name kept for compatibility, points to Proxmox) |
|
||||
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage (identical to nfs-truenas) |
|
||||
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
|
||||
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
|
||||
| ~~TrueNAS VM~~ | **DECOMMISSIONED** | Was VMID 9000 at 10.0.10.15 | Replaced by Proxmox NFS (2026-04) |
|
||||
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
|
||||
| ~~StorageClass `iscsi-truenas`~~ | **REMOVED** | Was cluster-wide | Replaced by `proxmox-lvm` |
|
||||
|
||||
|
|
@ -105,7 +105,7 @@ graph TB
|
|||
|
||||
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
|
||||
|
||||
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
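A quick sanity check that the soft-mount options really come from the StorageClass (a sketch, assuming the options live in `mountOptions`, which is where the nfs-csi chart normally puts them):

```sh
kubectl get sc nfs-proxmox -o jsonpath='{.mountOptions}{"\n"}'
# expect something like ["soft","timeo=30","retrans=3"]
```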
|
||||
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` or `nfs-proxmox` StorageClass via PVCs.
|
||||
|
||||
### Block Storage Flow (Proxmox CSI) — NEW
|
||||
|
||||
|
|
@ -129,9 +129,7 @@ graph TB
|
|||
5. **Passphrase management**: ExternalSecret syncs passphrase from Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`) → K8s Secret. Backup key at `/root/.luks-backup-key` on PVE host.
|
||||
|
||||
**Services on encrypted storage (2026-04-15 migration):**
|
||||
vaultwarden, dbaas (mysql+pg+pgadmin), mailserver, nextcloud, forgejo, matrix, n8n, affine, health, hackmd, redis, headscale, frigate, meshcentral, technitium, actualbudget, grampsweb, owntracks, wealthfolio, monitoring (alertmanager)
|
||||
|
||||
**Services migrated later** (post-audit catch-up): paperless-ngx (2026-04-25 — sensitive document scans had been left on plain `proxmox-lvm` by an abandoned attempt; rsync swap cleaned up the orphan and re-did via Terraform). Vault raft cluster (2026-04-25 — all 3 voters migrated from `nfs-proxmox` to `proxmox-lvm-encrypted` after the 2026-04-22 raft-leader-deadlock post-mortem found NFS fsync semantics incompatible with raft consensus log; rolled non-leader-first with force-finalize on the pvc-protection finalizer to avoid pod-recreating on the old PVCs).
|
||||
vaultwarden, dbaas (mysql+pg+pgadmin), mailserver, nextcloud, forgejo, matrix, n8n, affine, health, hackmd, redis, headscale, frigate, meshcentral, technitium, actualbudget, grampsweb, owntracks, paperless-ngx, wealthfolio, monitoring (alertmanager)
|
||||
|
||||
**CSI node plugin memory**: Requires 1280Mi limit for LUKS2 Argon2id key derivation (~1GiB). Set via `node.plugin.resources` in Helm values (not `node.resources`).
|
||||
|
||||
|
|
@ -166,7 +164,7 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
|
|||
|------|---------|
|
||||
| `/etc/exports` (on Proxmox host) | NFS export configuration for all service shares |
|
||||
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
|
||||
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-proxmox` + legacy `nfs-truenas`) |
|
||||
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-truenas`, `nfs-proxmox`) |
|
||||
| `modules/kubernetes/nfs_volume/` | Reusable module for static NFS PV/PVC creation |
|
||||
| `config.tfvars` | Variable `nfs_server = "192.168.1.127"` shared by all stacks |
|
||||
|
||||
|
|
@ -175,10 +173,8 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
|
|||
| Path | Contents |
|
||||
|------|----------|
|
||||
| `secret/viktor/proxmox_csi_encryption_passphrase` | LUKS2 encryption passphrase for `proxmox-lvm-encrypted` StorageClass |
|
||||
| ~~`secret/viktor/truenas_ssh_key`~~ | **REMOVED** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_root_password`~~ | **REMOVED** — was TrueNAS root password (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_api_key`~~ | **REMOVED** — was TrueNAS API key (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_ssh_private_key`~~ | **REMOVED** — was TrueNAS SSH private key (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_ssh_key`~~ | **LEGACY** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned) |
|
||||
| ~~`secret/viktor/truenas_root_password`~~ | **LEGACY** — was TrueNAS root password (TrueNAS decommissioned) |
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
|
|
|
|||
|
|
@ -63,10 +63,10 @@ sequenceDiagram
|
|||
Cloudflare-->>AdGuard: A record (Cloudflare IP)
|
||||
AdGuard-->>Client: Response
|
||||
|
||||
Note over Client: Query: nextcloud.viktorbarzin.lan
|
||||
Note over Client: Query: truenas.viktorbarzin.lan
|
||||
Client->>AdGuard: DNS query
|
||||
AdGuard->>Technitium: Forward (.lan domain)
|
||||
Technitium-->>AdGuard: A record (10.0.20.200)
|
||||
Technitium-->>AdGuard: A record (10.0.10.15)
|
||||
AdGuard-->>Client: Response
|
||||
|
||||
Note over Client,Technitium: If Cloudflared tunnel is down:
|
||||
|
|
@ -370,14 +370,14 @@ dns_config:
|
|||
|
||||
### Can't Resolve .lan Domains from VPN
|
||||
|
||||
**Symptoms**: `nslookup nextcloud.viktorbarzin.lan` returns `NXDOMAIN`.
|
||||
**Symptoms**: `nslookup truenas.viktorbarzin.lan` returns `NXDOMAIN`.
|
||||
|
||||
**Diagnosis**: Check DNS chain: Client → AdGuard → Technitium.
|
||||
|
||||
**Steps**:
|
||||
1. Verify AdGuard is running: `kubectl get pod -n adguard`
|
||||
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup nextcloud.viktorbarzin.lan <adguard-ip>`
|
||||
3. Check Technitium: `nslookup nextcloud.viktorbarzin.lan 10.0.20.101`
|
||||
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup truenas.viktorbarzin.lan <adguard-ip>`
|
||||
3. Check Technitium: `nslookup truenas.viktorbarzin.lan 10.0.20.101`
|
||||
|
||||
**Common causes**:
|
||||
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.
|
||||
|
|
|
|||
|
|
@ -1,7 +1,5 @@
|
|||
# Anti-AI Scraping System Design
|
||||
|
||||
> **Status (Updated 2026-04-17):** Partially superseded. Layer 3 (trap links via rewrite-body plugin) removed due to Traefik v3.6.12 Yaegi plugin incompatibility. The `strip-accept-encoding` and `anti-ai-trap-links` middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (`infra/stacks/rybbit/worker/`). Active layers: 1 (bot-block), 2 (headers), 4 (tarpit), 5 (poison content).
|
||||
|
||||
## Problem
|
||||
|
||||
AI scrapers crawl public web services to harvest training data. We want to:
|
||||
|
|
@ -11,7 +9,7 @@ AI scrapers crawl public web services to harvest training data. We want to:
|
|||
|
||||
## Architecture
|
||||
|
||||
Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):
|
||||
Five defense layers applied to all public services via Traefik:
|
||||
|
||||
```
|
||||
Internet -> Cloudflare -> Traefik
|
||||
|
|
@ -20,7 +18,7 @@ Internet -> Cloudflare -> Traefik
|
|||
|
|
||||
+-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
|
||||
|
|
||||
+-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026 — Yaegi bugs in Traefik v3.6.12)
|
||||
+-- Layer 3: Rewrite-body -> inject hidden trap links into HTML
|
||||
|
|
||||
+-- Layer 4: Poison service -> serve cached Poison Fountain data
|
||||
|
|
||||
|
|
@ -70,10 +68,13 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
|
|||
- Sets `X-Robots-Tag: noai, noimageai` on all responses
|
||||
- Added to all public services via ingress_factory
|
||||
|
||||
**`anti-ai-trap-links` (rewrite-body plugin)** — REMOVED (Updated 2026-04-17):
|
||||
- Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
|
||||
- The companion `strip-accept-encoding` middleware was also removed (only existed for rewrite-body)
|
||||
- Trap link injection is no longer active; poison-fountain still serves tarpit content standalone
|
||||
**`anti-ai-trap-links` (rewrite-body plugin)**:
|
||||
- Regex: `</body>` -> injects hidden div with trap links + `</body>`
|
||||
- Links point to `https://poison.viktorbarzin.me/article/<slug>`
|
||||
- CSS: invisible to humans (`position:absolute;left:-9999px;height:0;overflow:hidden;aria-hidden=true`)
|
||||
- Only processes `text/html` responses
|
||||
- Requires strip-accept-encoding companion middleware (already exists)
|
||||
- Applied globally via ingress_factory
|
||||
|
||||
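A quick external check of layers 1 and 2 on any protected host (illustrative only; `example.viktorbarzin.me` is a placeholder and GPTBot is assumed to be on the blocked User-Agent list):

```bash
# Layer 2: every response should carry the noai header.
curl -sI https://example.viktorbarzin.me/ | grep -i '^x-robots-tag'

# Layer 1: a blocked AI-bot User-Agent vs a normal browser User-Agent.
curl -s -o /dev/null -w 'GPTBot:  %{http_code}\n' -A 'GPTBot' https://example.viktorbarzin.me/
curl -s -o /dev/null -w 'Firefox: %{http_code}\n' \
  -A 'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0' \
  https://example.viktorbarzin.me/
```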
### 4. Trap subdomain: poison.viktorbarzin.me
|
||||
|
||||
|
|
@ -87,7 +88,7 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
|
|||
|
||||
New variables:
|
||||
- `anti_ai_scraping` (bool, default: true) - enable all anti-AI layers
|
||||
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`
|
||||
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`, `strip-accept-encoding`, `anti-ai-trap-links`
|
||||
- Services can opt out with `anti_ai_scraping = false`
|
||||
|
||||
## Human User Protection
|
||||
|
|
@ -96,7 +97,7 @@ New variables:
|
|||
|---------|-----------|
|
||||
| Hidden links visible | CSS `position:absolute;left:-9999px;height:0;overflow:hidden` + `aria-hidden="true"` |
|
||||
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
|
||||
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
|
||||
| Performance overhead | ForwardAuth is a string match (<1ms). Rewrite-body already proven with Rybbit. |
|
||||
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
|
||||
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |
|
||||
|
||||
|
|
|
|||
|
|
@ -1,142 +0,0 @@
|
|||
# NFS-Hostile Workload Migration — Design
|
||||
|
||||
**Date**: 2026-04-25
|
||||
**Author**: Viktor (with Claude)
|
||||
**Status**: Phase 1 done, Phase 2 in progress
|
||||
**Beads**: code-gy7h (Vault), code-ahr7 (Immich PG)
|
||||
|
||||
## Problem
|
||||
|
||||
The 2026-04-22 Vault Raft leader deadlock (post-mortem
|
||||
`2026-04-22-vault-raft-leader-deadlock.md`) was traced to NFS client
|
||||
writeback stalls poisoning kernel state. Recovery took 2h43m and
|
||||
required hard-resetting 3 of 4 cluster VMs. Two workload classes on
|
||||
NFS are NFS-hostile per the criteria in
|
||||
`infra/.claude/CLAUDE.md` ("Critical services MUST NOT use NFS"):
|
||||
|
||||
1. **Postgres with WAL fsync per commit** — Immich primary
|
||||
2. **Vault Raft consensus log** — fsync per append-entry, 3 replicas
|
||||
|
||||
Everything else on NFS (47 PVCs, ~455 GiB) is correctly placed:
|
||||
RWX media libraries, append-only backups, ML caches.
|
||||
|
||||
## Decision
|
||||
|
||||
Migrate exactly those two workload classes to
|
||||
`proxmox-lvm-encrypted` (LUKS2 LVM-thin via Proxmox CSI). No iSCSI,
|
||||
no RWX media migration, no backup-target migration.
|
||||
|
||||
## Rationale
|
||||
|
||||
- Block storage decouples PG / Raft fsync from NFS client kernel
|
||||
state. Failure mode that triggered the post-mortem cannot recur for
|
||||
these workloads.
|
||||
- `proxmox-lvm-encrypted` is the documented default for sensitive data
|
||||
(`infra/.claude/CLAUDE.md` storage decision rule). It already backs
|
||||
~28 PVCs across the cluster — pattern is proven.
|
||||
- Existing nightly `lvm-pvc-snapshot` PVE host script (03:00, 7-day
|
||||
retention) auto-picks-up new PVCs via thin snapshots — no extra
|
||||
backup wiring needed for the live data side.
|
||||
- LUKS2 satisfies "encrypted at rest for sensitive data" requirement.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- iSCSI evaluation (already retired 2026-04-13).
|
||||
- RWX media (Immich library, music, ebooks) — correct placement.
|
||||
- Backup target PVCs (`*-backup` on NFS) — append-only, NFS-tolerant.
|
||||
- Prometheus 200 GiB — already on `proxmox-lvm`.
|
||||
|
||||
## Pattern per workload
|
||||
|
||||
### Immich PG (single replica, Deployment, Recreate strategy)
|
||||
|
||||
- Add new RWO PVC on `proxmox-lvm-encrypted`.
|
||||
- Quiesce app pods (server + ML + frame).
|
||||
- `pg_dumpall` from running NFS pod → local file.
|
||||
- Swap deployment `claim_name` → encrypted PVC.
|
||||
- PG bootstraps fresh on empty PVC; restore dump.
|
||||
- REINDEX vector indexes (`clip_index`, `face_index`).
|
||||
- Backup CronJob keeps writing to NFS module (correct: append-only).
|
||||
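A minimal sketch of the dump/restore half of that pattern (workload, container and database names are illustrative; the authoritative step list is in the plan doc):

```bash
# Quiesce writers before dumping.
kubectl -n immich scale deploy immich-server immich-machine-learning immich-frame --replicas=0

# Logical dump from the still-running Postgres pod.
kubectl -n immich exec deploy/immich-postgresql -- pg_dumpall -U postgres \
  > /tmp/immich-pre-migration.sql

# ...swap claim_name to the encrypted PVC and apply, let PG initdb fresh, then restore:
kubectl -n immich exec -i deploy/immich-postgresql -- psql -U postgres \
  < /tmp/immich-pre-migration.sql

# Rebuild the vector indexes named in the design.
kubectl -n immich exec deploy/immich-postgresql -- \
  psql -U postgres -d immich -c 'REINDEX INDEX clip_index; REINDEX INDEX face_index;'
```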
|
||||
### Vault Raft (3 replicas, StatefulSet, helm-managed)
|
||||
|
||||
- Change `dataStorage.storageClass` and `auditStorage.storageClass`
|
||||
from `nfs-proxmox` → `proxmox-lvm-encrypted`.
|
||||
- StatefulSet `volumeClaimTemplates` is immutable → use
|
||||
`kubectl delete sts vault --cascade=orphan` then re-apply (memory
|
||||
pattern for VCT swaps).
|
||||
- Per-pod rolling: delete pod + PVCs, controller recreates with new
|
||||
template. Auto-unseal sidecar handles unseal; raft `retry_join`
|
||||
rejoins cluster.
|
||||
- 24h validation window between pods. Migrate non-leader pods first;
|
||||
step-down current leader before migrating it last.
|
||||
- Backup target (`vault-backup-host` on NFS) stays on NFS.
|
||||
|
||||
## Risks and rollbacks
|
||||
|
||||
### Immich PG
|
||||
|
||||
- pg_dumpall captures schema + data, not file-level state. Vector
|
||||
index versions matter (vchord 0.3.0 unchanged; vector 0.8.0 →
|
||||
0.8.1 is a minor automatic bump on `CREATE EXTENSION` — confirmed
|
||||
benign). Rollback: revert `claim_name`, scale apps; old NFS PVC
|
||||
retained for 7 days post-migration.
|
||||
|
||||
### Vault Raft
|
||||
|
||||
- Cluster keeps quorum from 2 standby replicas while one pod is
|
||||
swapped. Migrating the leader last avoids quorum churn.
|
||||
- Recovery anchor: pre-migration `vault operator raft snapshot save`
|
||||
+ nightly `vault-raft-backup` CronJob. RTO < 1h via snapshot
|
||||
restore.
|
||||
|
||||
## Helm `securityContext.pod` replace-not-merge (Vault, discovered during execution)
|
||||
|
||||
The Vault helm chart sets pod-level securityContext defaults
|
||||
(`fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true`)
|
||||
from chart templates, not from values.yaml. When `main.tf` provided
|
||||
its own `server.statefulSet.securityContext.pod = {fsGroupChangePolicy
|
||||
= "OnRootMismatch"}` the helm rendering REPLACED the chart defaults
|
||||
rather than merging into them. On NFS this was harmless (`async,
|
||||
insecure` exports made the volume world-writable enough for any UID),
|
||||
but on a fresh ext4 LV via Proxmox CSI the volume root is `root:root`
|
||||
and vault user (UID 100) cannot open `/vault/data/vault.db`.
|
||||
|
||||
vault-1 and vault-2 happened to be Running with the correct
|
||||
securityContext because their pod specs were written into etcd
|
||||
**before** the customization landed; helm chart upgrades don't
|
||||
restart pods, so the broken values lay dormant until vault-0 was
|
||||
recreated by the orphan-deleted STS during this migration.
|
||||
|
||||
Resolution: provide all five fields (`fsGroup`, `fsGroupChangePolicy`,
|
||||
`runAsGroup`, `runAsUser`, `runAsNonRoot`) explicitly in main.tf so
|
||||
`runAsGroup=1000` etc. survive future chart bumps. Idempotent on
|
||||
both fresh PVCs and existing pods.
|
||||
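A quick way to confirm the rendered pods really carry the full block after any future chart bump (the label selector is the Vault helm chart's standard one; adjust if it differs):

```bash
# Print the effective pod-level securityContext for every vault pod.
kubectl -n vault get pods -l app.kubernetes.io/name=vault \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext}{"\n"}{end}'
# Expect all five fields on every pod:
#   fsGroup:1000 fsGroupChangePolicy:OnRootMismatch runAsGroup:1000 runAsUser:100 runAsNonRoot:true
```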
|
||||
## Init container chicken-and-egg (Immich PG, discovered during execution)
|
||||
|
||||
The pre-existing `write-pg-override-conf` init container on the
|
||||
Immich PG deployment writes `postgresql.override.conf` directly to
|
||||
`PGDATA`. On a populated NFS PVC this was a no-op (init was already
|
||||
run). On the fresh encrypted PVC, the file made `initdb` refuse the
|
||||
non-empty directory and the pod CrashLoopBackOff'd.
|
||||
|
||||
Resolution: gate the init container on `PG_VERSION` presence — first
|
||||
boot skips the override write, PG `initdb`s cleanly; force a pod
|
||||
restart and the second boot writes the override and PG loads
|
||||
`vchord` / `vectors` / `pg_prewarm` before the dump restore. Change
|
||||
is permanent and idempotent (correct on both fresh and initialised
|
||||
PVCs). One restart pre-migration only.
|
||||
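The gate is a one-line guard at the top of the init container's script; a sketch, assuming `PGDATA` is the mounted data dir and the override file arrives via a ConfigMap mount (paths are illustrative):

```bash
# write-pg-override-conf, gated on an initialised cluster.
# Fresh PVC: PG_VERSION absent -> skip, so initdb sees an empty directory.
# Next restart: PG_VERSION present -> write the override as before.
if [ -f "$PGDATA/PG_VERSION" ]; then
  cp /config/postgresql.override.conf "$PGDATA/postgresql.override.conf"
else
  echo "PGDATA not initialised yet; skipping override write"
fi
```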
|
||||
## Verification
|
||||
|
||||
End-to-end DONE when:
|
||||
|
||||
- `kubectl get pvc -A | grep nfs-proxmox` returns only the
|
||||
`vault-backup-host` PVC (or zero, if backup PVC moves elsewhere).
|
||||
- `vault operator raft list-peers` shows 3 voters on
|
||||
`proxmox-lvm-encrypted`, leader elected.
|
||||
- Immich PG `\dx` matches pre-migration extensions (vector minor
|
||||
drift OK).
|
||||
- `lvm-pvc-snapshot` captures new LVs in next 03:00 run.
|
||||
- 7 consecutive days of clean backup CronJob runs and no new alerts.
|
||||
|
|
@ -1,169 +0,0 @@
|
|||
# NFS-Hostile Workload Migration — Plan
|
||||
|
||||
**Date**: 2026-04-25
|
||||
**Design**: `2026-04-25-nfs-hostile-migration-design.md`
|
||||
**Beads**: code-gy7h (Vault, epic), code-ahr7 (Immich PG)
|
||||
|
||||
## Phase 1 — Immich PG (DONE 2026-04-25)
|
||||
|
||||
| Step | Done |
|
||||
|---|---|
|
||||
| Snapshot extensions + row counts to `/tmp/immich-pre-migration-*` | ✓ |
|
||||
| Quiesce `immich-server` + `immich-machine-learning` + `immich-frame` | ✓ |
|
||||
| `pg_dumpall` → `/tmp/immich-pre-migration-<ts>.sql` (1.9 GB) | ✓ |
|
||||
| Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap) | ✓ |
|
||||
| Swap `claim_name` at `infra/stacks/immich/main.tf` deployment | ✓ |
|
||||
| Patch init container to gate on `PG_VERSION` (chicken-and-egg fix) | ✓ |
|
||||
| Force pod restart so override.conf gets written | ✓ |
|
||||
| Restore dump | ✓ |
|
||||
| `REINDEX clip_index`, `REINDEX face_index` | ✓ |
|
||||
| Scale apps back up | ✓ |
|
||||
| Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external | ✓ |
|
||||
| LV present on PVE host (`vm-9999-pvc-...`) | ✓ |
|
||||
|
||||
### Phase 1 follow-ups (not blocking)
|
||||
|
||||
- Old NFS PVC `immich-postgresql-data-host` retained 7 days for
|
||||
rollback. After 2026-05-02: remove `module.nfs_postgresql_host`
|
||||
from `infra/stacks/immich/main.tf` and the CronJob's reference.
|
||||
- Backup CronJob (`postgresql-backup`) still writes to the NFS
|
||||
module. After cleanup, point it at a dedicated backup PVC or to
|
||||
the existing `immich-backups` NFS share.
|
||||
|
||||
## Phase 2 — Vault Raft (DONE 2026-04-25)
|
||||
|
||||
**Phase 2 complete 2026-04-25; all 3 voters on `proxmox-lvm-encrypted`.**
|
||||
|
||||
### Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC
|
||||
|
||||
- [x] Verify all 3 vault pods sealed=false, raft healthy.
|
||||
- [x] Take fresh `vault operator raft snapshot save` (anchor saved at
|
||||
`/tmp/vault-pre-migration-20260425-155029.snap`, 1.5 MB).
|
||||
- [ ] Optional: scale ESO to 0 — skipped (auto-unseal sidecar is
|
||||
independent; ESO refresh churn is non-disruptive for one swap).
|
||||
- [x] Confirmed leader is **vault-2** → migrate vault-0 first
|
||||
(non-leader), vault-1 next, vault-2 last (with step-down).
|
||||
Plan originally assumed vault-0 was leader; same intent
|
||||
(non-leader first).
|
||||
- [x] Thin pool headroom: 54.63% used, plenty for 6 × 2 GiB LVs.
|
||||
|
||||
### Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC
|
||||
|
||||
- [x] Edit `infra/stacks/vault/main.tf`: change
|
||||
`dataStorage.storageClass` and `auditStorage.storageClass`
|
||||
from `nfs-proxmox` → `proxmox-lvm-encrypted`.
|
||||
- [x] `kubectl -n vault delete sts vault --cascade=orphan` (StatefulSet
|
||||
`volumeClaimTemplates` is immutable; orphan keeps pods+PVCs
|
||||
alive while we recreate the controller with the new template).
|
||||
- [x] `tg apply -target=helm_release.vault` → recreates STS with new
|
||||
VCT (full-stack `tg plan` blocks on unrelated for_each-with-
|
||||
apply-time-keys errors at lines 848/865/909/917; targeted
|
||||
apply on the helm release alone is the right scope here).
|
||||
Existing pods still on old NFS PVCs.
|
||||
|
||||
### Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC
|
||||
|
||||
- [x] `kubectl -n vault delete pod vault-0 --grace-period=30`
|
||||
- [x] `kubectl -n vault delete pvc data-vault-0 audit-vault-0`
|
||||
- [x] STS controller recreated pod; new PVCs auto-provisioned on
|
||||
`proxmox-lvm-encrypted` (LVs `vm-9999-pvc-fb732fd7-...` data
|
||||
4.12%, `vm-9999-pvc-36451f42-...` audit 3.99%).
|
||||
- [x] **Hit and fixed**: vault-0 CrashLoopBackOff'd with
|
||||
`permission denied` on `/vault/data/vault.db`. The helm chart's
|
||||
`statefulSet.securityContext.pod` block in main.tf only set
|
||||
`fsGroupChangePolicy`, replacing (not merging) the chart's
|
||||
defaults `fsGroup=1000, runAsGroup=1000, runAsUser=100,
|
||||
runAsNonRoot=true`. NFS exports made the missing fsGroup a
|
||||
no-op; ext4 LV needs it to chown the volume root for the
|
||||
vault user. Old vault-1/vault-2 pods were created before that
|
||||
block was added so they still had the chart-default
|
||||
securityContext from their original spec. Fix: provide all
|
||||
five fields explicitly in main.tf and re-apply. Same root
|
||||
cause will affect vault-1 and vault-2 swaps unless this stays
|
||||
in place.
|
||||
- [x] Wait Ready; auto-unseal sidecar unsealed; `retry_join` rejoined
|
||||
raft cluster.
|
||||
- [x] Verify: `vault operator raft list-peers` shows 3 voters,
|
||||
vault-0 follower, leader=vault-2. External HTTPS 200.
|
||||
|
||||
### Step 2 — 24h soak (SKIPPED per user direction 2026-04-25)
|
||||
|
||||
User instructed "continue with all the remaining actions" — soak
|
||||
gates compressed to per-pod settle windows + raft-state verification
|
||||
between rollings. No Raft alarms, no Vault errors observed at each
|
||||
verification gate.
|
||||
|
||||
### Step 3 — Roll vault-1 — DONE 2026-04-25
|
||||
|
||||
- [x] Force-finalize PVCs to break re-mount race:
|
||||
`kubectl -n vault patch pvc data-vault-1 audit-vault-1 -p '{"metadata":{"finalizers":null}}' --type=merge`.
|
||||
(Initial pod-then-PVC delete recreated pod on the OLD NFS PVCs
|
||||
because pvc-protection finalizer hadn't cleared. Lesson learned
|
||||
and applied to vault-2 below.)
|
||||
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
|
||||
|
||||
### Step 4 — Settle window — DONE 2026-04-25
|
||||
|
||||
3-check verification over 90s; raft index advancing (2730010→2730012),
|
||||
all 3 voters healthy.
|
||||
|
||||
### Step 5 — Roll vault-2 (leader) — DONE 2026-04-25
|
||||
|
||||
- [x] `vault operator step-down` on vault-2; vault-0 took leadership.
|
||||
Confirmed vault-0 active, vault-1+vault-2 standby before delete.
|
||||
- [x] Snapshot anchor at `/tmp/vault-pre-vault2.snap` (1.5 MB) from new
|
||||
leader vault-0.
|
||||
- [x] Force-finalize + delete PVCs + delete pod (lesson from vault-1).
|
||||
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
|
||||
- [x] `vault operator raft list-peers` shows 3 voters all healthy on
|
||||
encrypted storage; leader vault-0.
|
||||
|
||||
### Step 6 — Cleanup — DONE 2026-04-25
|
||||
|
||||
- [x] `kubectl get pvc -A` cross-cluster shows zero PVCs on
|
||||
`nfs-proxmox` SC (only Released PVs remain → Phase 3).
|
||||
- [x] Removed inline `kubernetes_storage_class.nfs_proxmox` from
|
||||
`infra/stacks/vault/main.tf` (was lines 29–42).
|
||||
- [x] All 3 PVC pairs on `proxmox-lvm-encrypted`.
|
||||
- [x] `vault operator raft autopilot state` healthy=true.
|
||||
- [x] External `https://vault.viktorbarzin.me/v1/sys/health` = 200.
|
||||
|
||||
## Phase 3 — Released-PV cleanup (FOLLOW-UP)
|
||||
|
||||
### Step 3.1 — vault Released PVs — DONE 2026-04-25
|
||||
|
||||
6 vault NFS PVs (Released, `nfs-proxmox` SC, Retain policy) deleted
|
||||
along with their NFS subdirectories on PVE host (~1.5 GB reclaimed):
|
||||
|
||||
| PV | Claim | Size on disk |
|
||||
|---|---|---|
|
||||
| pvc-004a5d3b-… | data-vault-2 | 45M |
|
||||
| pvc-808a78ec-… | audit-vault-1 | 1.4M |
|
||||
| pvc-918ee7c1-… | audit-vault-0 | 3.2M |
|
||||
| pvc-9d2ddcb4-… | data-vault-0 | 46M |
|
||||
| pvc-a659711d-… | data-vault-1 | 46M |
|
||||
| pvc-d2e65109-… | audit-vault-2 | 1.4G |
|
||||
|
||||
Procedure: `kubectl delete pv <name>` (cluster object only — Retain
|
||||
policy means CSI never touches NFS) then `rm -rf /srv/nfs/<dir>` on
|
||||
192.168.1.127.
|
||||
|
||||
### Step 3.2 — Cluster-wide Released PV sweep (DEFERRED)
|
||||
|
||||
~50 other Released PVs persist across the cluster (~200 GiB on
|
||||
`proxmox-lvm` and `proxmox-lvm-encrypted`). Out of scope for the
|
||||
2026-04-25 NFS-hostile session per user direction. To reclaim:
|
||||
|
||||
1. List Released PVs, confirm LV exists on PVE.
|
||||
2. `kubectl delete pv <name>` (removes the cluster object; with the `Retain`
|
||||
reclaim policy the CSI driver may leave the underlying LV behind, hence step 3).
|
||||
3. If LV survives: manual `lvremove pve/vm-9999-pvc-<uuid>`.
|
||||
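The sweep above might be scripted roughly like this (listing is read-only; the destructive commands stay commented out on purpose):

```bash
# Released PVs with storage class, size and original claim.
kubectl get pv -o json | jq -r '
  .items[]
  | select(.status.phase == "Released")
  | [.metadata.name, .spec.storageClassName, .spec.capacity.storage, .spec.claimRef.name]
  | @tsv'

# Per PV, after confirming the LV on the PVE host:
#   kubectl delete pv <name>
#   lvremove -y pve/vm-9999-pvc-<uuid>    # on the PVE host, only if the LV is still there
```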
|
||||
## Rollback
|
||||
|
||||
| Phase | Trigger | Action |
|
||||
|---|---|---|
|
||||
| 1 | Immich UI broken / data loss | Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC |
|
||||
| 2 (mid-rolling) | Single pod broken | Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods |
|
||||
| 2 (post-rolling, raft corrupt) | Cluster-wide failure | `vault operator raft snapshot restore <pre-migration.snap>` |
|
||||
| Catastrophic | All Vault data lost | Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output |
|
||||
|
|
@ -1,195 +0,0 @@
|
|||
# Forgejo Registry Consolidation — Design
|
||||
|
||||
**Date**: 2026-05-07
|
||||
**Status**: Approved
|
||||
|
||||
## Problem
|
||||
|
||||
`registry-private` (the `registry:2` container on the docker-registry
|
||||
VM at `10.0.20.10`) has hit `distribution#3324` corruption three
|
||||
times in three weeks (2026-04-13, 2026-04-19, 2026-05-04). Each
|
||||
incident required manual blob recovery and another round of
|
||||
hardening to `cleanup-tags.sh` and the GC procedure. The integrity
|
||||
probe catches it within 15 minutes now, but every hit still costs
|
||||
~1h of cleanup, and we keep tightening the same loose screw.
|
||||
|
||||
Root cause is a known race in `distribution`: tag deletes that race
|
||||
with concurrent garbage collection produce orphan OCI-index children.
|
||||
Upstream has not patched it; our mitigations (probe, blob
|
||||
fix-up script, idempotent cleanup) reduce blast radius but don't
|
||||
remove the failure mode.
|
||||
|
||||
Forgejo (deployed for OAuth and personal repos at
|
||||
`forgejo.viktorbarzin.me`) ships a built-in OCI registry as part of
|
||||
the Packages feature, default-on in v11. Using it removes
|
||||
`distribution`-the-engine from the path entirely, replaces it with
|
||||
Forgejo's own implementation backed by Forgejo's DB+blob store, and
|
||||
gets us source hosting + image hosting in one resource.
|
||||
|
||||
The PVE host RAM upgrade from 142GB to 272GB (memory id=569) means
|
||||
the cluster can absorb the resource bump Forgejo needs for the
|
||||
registry workload (384Mi → 1Gi).
|
||||
|
||||
## Decision
|
||||
|
||||
Move every image currently on `registry.viktorbarzin.me:5050` to
|
||||
Forgejo's OCI registry at `forgejo.viktorbarzin.me`. Decommission
|
||||
`registry-private` after a 14-day dual-push bake.
|
||||
|
||||
Pull-through caches for upstream registries (DockerHub, GHCR, Quay,
|
||||
k8s.gcr, Kyverno) stay on the registry VM permanently — Forgejo
|
||||
won't serve as a pull-through, so the chicken-and-egg of "Forgejo
|
||||
pulling its own image through itself" never arises.
|
||||
|
||||
## Design
|
||||
|
||||
### Registry hostname
|
||||
|
||||
Image references become `forgejo.viktorbarzin.me/viktor/<image>:<tag>`.
|
||||
The `viktor/` prefix is the Forgejo owner namespace; all current
|
||||
private images ship under that single owner.
|
||||
|
||||
### Auth
|
||||
|
||||
Two service-account users:
|
||||
|
||||
| User | Scope | Vault key | Used by |
|
||||
|---|---|---|---|
|
||||
| `cluster-puller` | `read:package` | `secret/viktor/forgejo_pull_token` | cluster-wide `registry-credentials` Secret, monitoring probe |
|
||||
| `ci-pusher` | `write:package` | `secret/ci/global/forgejo_push_token` | Woodpecker pipelines (synced via `vault-woodpecker-sync` CronJob) |
|
||||
|
||||
A third PAT (`secret/viktor/forgejo_cleanup_token`, also belongs to
|
||||
`ci-pusher`) drives the retention CronJob — kept separate from the
|
||||
push PAT so a leaked CI token doesn't immediately enable mass deletes.
|
||||
|
||||
PATs have no expiry. Rotation policy: regenerate via Forgejo Web UI
|
||||
and `vault kv patch` if a leak is suspected; ESO/sync downstream is
|
||||
automatic.
|
||||
|
||||
### Cluster pull path
|
||||
|
||||
`registry-credentials` is a single Secret in `kyverno` ns, cloned
|
||||
into every namespace by the existing
|
||||
`sync-registry-credentials` ClusterPolicy. We extend its
|
||||
`dockerconfigjson` `auths` map with a fourth entry for
|
||||
`forgejo.viktorbarzin.me`. **No new Secret, no new ClusterPolicy,
|
||||
no `imagePullSecrets =` line edits across stacks.**
|
||||
|
||||
Containerd `hosts.toml` redirects `forgejo.viktorbarzin.me` → in-cluster
|
||||
Traefik LB at `10.0.20.200`, the same pattern used for
|
||||
`registry.viktorbarzin.me` → `10.0.20.10:5050`. Avoids hairpin NAT
|
||||
through the WAN gateway for in-cluster pulls.
|
||||
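The redirect itself is a small containerd drop-in; a sketch of what the per-node script might write, following containerd's `certs.d` layout (the committed `setup-forgejo-containerd-mirror.sh` is authoritative and may differ):

```bash
# Run on each node: send pulls for forgejo.viktorbarzin.me to the in-cluster LB.
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
cat > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'EOF'
server = "https://forgejo.viktorbarzin.me"

[host."https://10.0.20.200"]
  capabilities = ["pull", "resolve"]
EOF
```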
|
||||
### Push path
|
||||
|
||||
Woodpecker pipelines push to BOTH targets during the bake:
|
||||
|
||||
```yaml
|
||||
- name: build-and-push
|
||||
image: woodpeckerci/plugin-docker-buildx
|
||||
settings:
|
||||
repo:
|
||||
- registry.viktorbarzin.me/<name>
|
||||
- forgejo.viktorbarzin.me/viktor/<name>
|
||||
logins:
|
||||
- registry: registry.viktorbarzin.me
|
||||
username:
|
||||
from_secret: registry_user
|
||||
password:
|
||||
from_secret: registry_password
|
||||
- registry: forgejo.viktorbarzin.me
|
||||
username:
|
||||
from_secret: forgejo_user
|
||||
password:
|
||||
from_secret: forgejo_push_token
|
||||
```
|
||||
|
||||
The `vault-woodpecker-sync` CronJob (every 6h) propagates
|
||||
`secret/ci/global` keys to every Woodpecker repo as global secrets.
|
||||
|
||||
### Retention
|
||||
|
||||
Forgejo's per-package "Cleanup Rules" UI is per-user runtime DB
|
||||
state, not Terraform-driven. Retention runs as a CronJob in the
|
||||
`forgejo` namespace, schedule `0 4 * * *`, that:
|
||||
|
||||
1. Lists all container packages under the `viktor` owner.
|
||||
2. Groups by package name.
|
||||
3. Keeps newest 10 versions + always keeps `latest`.
|
||||
4. DELETEs the rest via `/api/v1/packages/{owner}/{type}/{name}/{version}`.
|
||||
|
||||
First 7 days run with `DRY_RUN=true` — script logs what it would
|
||||
delete but issues no DELETE calls. After log review, flip the
|
||||
`forgejo_cleanup_dry_run` local in `cleanup.tf` to false.
|
||||
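In outline the CronJob script is a list, group, keep-10, DELETE loop over the Packages API; a hedged sketch (the committed `cleanup.sh` is authoritative; pagination, error handling and the token variable name will differ):

```bash
FORGEJO=https://forgejo.viktorbarzin.me
OWNER=viktor
KEEP=10
AUTH="Authorization: token ${CLEANUP_TOKEN:?}"   # the PAT from secret/viktor/forgejo_cleanup_token

curl -sf -H "$AUTH" "$FORGEJO/api/v1/packages/$OWNER?type=container&limit=200" |
  jq -r '.[] | [.name, .version, .created_at] | @tsv' |
  sort -k1,1 -k3,3r |                              # newest first within each package
  awk -v keep="$KEEP" '$2 == "latest" {next} {n[$1]++; if (n[$1] > keep) print $1, $2}' |
  while read -r name version; do
    if [ "${DRY_RUN:-true}" = "true" ]; then
      echo "would delete $name:$version"
    else
      curl -sf -X DELETE -H "$AUTH" "$FORGEJO/api/v1/packages/$OWNER/container/$name/$version"
    fi
  done
```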
|
||||
### Integrity monitoring
|
||||
|
||||
Mirror the existing `registry-integrity-probe` CronJob: walk
|
||||
`/v2/_catalog`, walk every tag, HEAD every manifest + index child,
|
||||
push `registry_manifest_integrity_*` metrics. Existing
|
||||
Prometheus alerts fire on the `instance` label, so they cover both
|
||||
probes automatically once the alert annotations are made
|
||||
instance-aware (done in this change).
|
||||
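Stripped to its core the probe is a catalog walk with authenticated manifest HEADs; a compact sketch (index-child recursion, blob HEADs and the metric push are omitted; `PULL_TOKEN` is a placeholder for the cluster-puller PAT):

```bash
REG=https://forgejo.viktorbarzin.me
CREDS="cluster-puller:${PULL_TOKEN:?}"
ACCEPT='application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.v2+json'

curl -sfu "$CREDS" "$REG/v2/_catalog" | jq -r '.repositories[]' |
while read -r repo; do
  curl -sfu "$CREDS" "$REG/v2/$repo/tags/list" | jq -r '.tags[]?' |
  while read -r tag; do
    # A 404 here is exactly the orphan-manifest symptom the probe exists to catch.
    curl -sfu "$CREDS" -o /dev/null -I -H "Accept: $ACCEPT" \
      "$REG/v2/$repo/manifests/$tag" || echo "MISSING $repo:$tag"
  done
done
```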
|
||||
### Source migration
|
||||
|
||||
Projects currently living as plain dirs in the local-only monorepo
|
||||
become standalone Forgejo repos. Two GitHub-hosted private repos
|
||||
(`beadboard`, `claude-memory-mcp`) move to Forgejo and are archived
|
||||
on GitHub.
|
||||
|
||||
CI standardises on Woodpecker for everything in scope. The two
|
||||
projects that used GHA (build + Woodpecker-deploy via GHA-hosted
|
||||
DockerHub push) keep DockerHub for legacy compatibility but their
|
||||
canonical image source becomes Forgejo.
|
||||
|
||||
### Break-glass for infra-ci
|
||||
|
||||
`infra-ci` is the Docker image used by all infra Woodpecker
|
||||
pipelines, including `default.yml` (terragrunt apply). If Forgejo is
|
||||
unreachable at the moment we need to apply, `infra-ci` is
|
||||
unreachable, and we can't apply our way out.
|
||||
|
||||
Mitigation: dual-push step also `docker save | gzip` the built
|
||||
infra-ci image to:
|
||||
|
||||
- `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` on
|
||||
the registry VM disk (Copy 1)
|
||||
- `/srv/nfs/forgejo-breakglass/` on the NAS (Copy 2)
|
||||
|
||||
A `latest` symlink in each location points at the most recent.
|
||||
Recovery procedure (`docs/runbooks/forgejo-registry-breakglass.md`):
|
||||
scp tarball → `docker load` → `ctr -n k8s.io images import` → fix
|
||||
Forgejo via that node.
|
||||
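The recovery path is short; roughly (the tarball name follows the `latest` symlink convention described above and is illustrative):

```bash
# From a working machine: fetch the newest break-glass tarball off the registry VM.
scp 10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz /tmp/

# On a k8s node: load it into containerd's image store so infra-ci can run
# without any registry being reachable.
gunzip -f /tmp/infra-ci-latest.tar.gz
ctr -n k8s.io images import /tmp/infra-ci-latest.tar

# On a Docker host (e.g. a Woodpecker agent) the equivalent is:
# docker load < /tmp/infra-ci-latest.tar
```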
|
||||
### Cutover style
|
||||
|
||||
**Dual-push bake**: pipelines push to both registries for ≥14 days.
|
||||
Pods continue pulling from `registry.viktorbarzin.me`. After bake:
|
||||
|
||||
1. Per-project PR: flip `image=` lines in Terraform stacks. Pod
|
||||
re-pull naturally on next rollout.
|
||||
2. Phase 4: stop `registry-private` container, remove its
|
||||
`auths` entry from the cluster Secret, drop containerd hosts.toml
|
||||
entry.
|
||||
|
||||
## Why not alternatives
|
||||
|
||||
| Option | Rejected because |
|
||||
|---|---|
|
||||
| Stay on `registry-private` | Three corruption incidents in three weeks; mitigation cost rising |
|
||||
| Run a fresh registry container alongside (no Forgejo) | Same upstream, same `distribution#3324` failure mode |
|
||||
| GHCR / DockerHub for all private images | Public-by-default model + push rate limits; loses owner-owned blob storage |
|
||||
| Harbor | Heavier than Forgejo registry, would need its own DB + ingress, no source-hosting integration |
|
||||
|
||||
## Risks
|
||||
|
||||
See plan doc § "Risk register" for the full table. Top three:
|
||||
|
||||
1. **Forgejo registry hits the same corruption pattern.** Mitigated
|
||||
by 14-day bake + integrity probe within 15 min.
|
||||
2. **Forgejo down → infra-ci unreachable → can't apply.** Mitigated
|
||||
by tarball break-glass on VM + NAS.
|
||||
3. **Pod re-pulls fail after `image=` flip due to containerd cache
|
||||
poisoning.** Mitigated by hosts.toml deployment + per-project
|
||||
`kubectl rollout restart` in Phase 3.
|
||||
|
|
@ -1,152 +0,0 @@
|
|||
# Forgejo Registry Consolidation — Plan
|
||||
|
||||
**Date**: 2026-05-07
|
||||
**Status**: Approved — execution in progress (Phase 0)
|
||||
**Design**: `2026-05-07-forgejo-registry-consolidation-design.md`
|
||||
|
||||
This is the implementation roadmap for migrating off `registry-private`
|
||||
onto Forgejo's OCI registry. See the design doc for problem
|
||||
statement and rationale. Execution spans 5 phases over ≥3 weeks.
|
||||
|
||||
## Phase 0 — Prepare Forgejo (1 PR, no cutover risk)
|
||||
|
||||
| Task | File / artifact |
|
||||
|---|---|
|
||||
| Bump Forgejo memory request+limit 384Mi → 1Gi | `infra/stacks/forgejo/main.tf` |
|
||||
| Add `FORGEJO__packages__ENABLED=true` and `FORGEJO__packages__CHUNKED_UPLOAD_PATH=/data/tmp/package-upload` env vars (defensive — already default in v11) | `infra/stacks/forgejo/main.tf` |
|
||||
| Bump Forgejo PVC 5Gi → 15Gi, auto-resize cap 20Gi → 50Gi | `infra/stacks/forgejo/main.tf` |
|
||||
| Bump ingress `max_body_size = "5g"` (wired into ingress_factory as a Buffering middleware) | `infra/stacks/forgejo/main.tf`, `infra/modules/kubernetes/ingress_factory/main.tf` |
|
||||
| Create `cluster-puller` (read:package), `ci-pusher` (write:package), and a third `cleanup` PAT on `ci-pusher`; store PATs in Vault | runbook: `docs/runbooks/forgejo-registry-setup.md` |
|
||||
| Extend `registry-credentials` Secret with 4th `auths` entry for `forgejo.viktorbarzin.me` | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
|
||||
| Add containerd `hosts.toml` entry redirecting `forgejo.viktorbarzin.me` → in-cluster Traefik LB `10.0.20.200` | `infra/stacks/infra/main.tf` cloud-init + new `infra/scripts/setup-forgejo-containerd-mirror.sh` for existing nodes |
|
||||
| Forgejo retention CronJob (`0 4 * * *`, dry-run for first 7 days) | new `infra/stacks/forgejo/cleanup.tf` + `infra/stacks/forgejo/files/cleanup.sh` |
|
||||
| Forgejo integrity probe CronJob (`*/15 * * * *`) | `infra/stacks/monitoring/modules/monitoring/main.tf` |
|
||||
| Make existing alerts instance-aware so they cover both registries | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
|
||||
|
||||
**Smoke test (must pass before declaring Phase 0 done):**
|
||||
|
||||
- `docker login forgejo.viktorbarzin.me` succeeds.
|
||||
- Push a hello-world image to `forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds.
|
||||
- `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` from a k8s
|
||||
node succeeds, using the auto-synced `registry-credentials` Secret.
|
||||
- A fresh namespace gets the cloned Secret with 4 `auths` entries.
|
||||
- Delete the smoketest package via API.
|
||||
- Forgejo integrity probe completes once and pushes metrics.
|
||||
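Two of those checks are easy to script (the namespace name is throwaway; run the `crictl` line on a k8s node):

```bash
# Kyverno should clone registry-credentials into a brand-new namespace,
# and the dockerconfigjson should now list 4 registries.
kubectl create ns pullsecret-smoke
sleep 15   # give the ClusterPolicy a moment to sync
kubectl -n pullsecret-smoke get secret registry-credentials \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq -r '.auths | keys[]'
kubectl delete ns pullsecret-smoke

# On any node: pull the smoketest image through containerd.
crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1
```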
|
||||
## Phase 1 — Source migration (parallel-safe, no production impact)
|
||||
|
||||
For each project the recipe is identical:
|
||||
|
||||
1. `git init` + push to `forgejo.viktorbarzin.me/viktor/<name>` —
|
||||
register in Woodpecker via OAuth.
|
||||
2. Add `.woodpecker.yml` based on `payslip-ingest/.woodpecker.yml`.
|
||||
Push step uses `woodpeckerci/plugin-docker-buildx` with TWO
|
||||
`repo:` entries (dual-push).
|
||||
3. Confirm first build pushes to BOTH registries.
|
||||
|
||||
Projects (bake clock starts at "all dual-push"):
|
||||
|
||||
| Project | Action |
|
||||
|---|---|
|
||||
| `claude-agent-service` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `fire-planner` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `wealthfolio-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `hmrc-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `freedify` | Push from monorepo to Forgejo. New `.woodpecker.yml`. (Upstream is gone.) |
|
||||
| `payslip-ingest` | Already on Forgejo. Add second `repo:` entry to `.woodpecker.yml`. |
|
||||
| `job-hunter` | Already on Forgejo. Add second `repo:` entry. |
|
||||
| `beadboard` | Push to Forgejo. New `.woodpecker.yml`. Disable GHA workflow. **Don't archive GitHub yet** (deferred to Phase 3). |
|
||||
| `claude-memory-mcp` | Push to Forgejo. New `.woodpecker.yml`. |
|
||||
| `infra-ci` | Edit `.woodpecker/build-ci-image.yml` to dual-push. ALSO `docker save \| gzip` to `/opt/registry/data/private/_breakglass/` on VM AND `/srv/nfs/forgejo-breakglass/` on NAS. Pin a `latest` symlink. |
|
||||
|
||||
Break-glass runbook (`docs/runbooks/forgejo-registry-breakglass.md`)
|
||||
documents the recovery path.
|
||||
|
||||
## Phase 2 — Bake (≥14 days)
|
||||
|
||||
- No `image=` lines change. Pods still pull from
|
||||
`registry.viktorbarzin.me`.
|
||||
- **Daily smoke check**: pull a recent image from Forgejo as
|
||||
`cluster-puller`, verify integrity (HEAD on manifest + each blob); a sketch follows this list.
|
||||
- **Bake exit criteria**:
|
||||
- Zero `RegistryManifestIntegrityFailure` alerts on Forgejo.
|
||||
- Zero `ContainerNearOOM` for the forgejo pod.
|
||||
- Retention CronJob has run ≥14 times successfully.
|
||||
- At least one full Sunday GC cycle has elapsed.
|
||||
- Switch retention CronJob to `DRY_RUN=false` on day 7, observe
|
||||
until day 14.
|
||||
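The daily smoke check above might look like the following (image name and token variable are placeholders; the manifest fetch plus per-blob HEADs mirror what the integrity probe does):

```bash
REG=https://forgejo.viktorbarzin.me
IMG=viktor/infra-ci                     # any recently dual-pushed image
CREDS="cluster-puller:${PULL_TOKEN:?}"

# Fetch the manifest, then HEAD the config blob and every layer it references.
manifest=$(curl -sfu "$CREDS" \
  -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
  "$REG/v2/$IMG/manifests/latest")

echo "$manifest" | jq -r '.config.digest, .layers[].digest' |
while read -r digest; do
  curl -sfu "$CREDS" -o /dev/null -I "$REG/v2/$IMG/blobs/$digest" \
    && echo "OK      $digest" || echo "MISSING $digest"
done
```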
|
||||
## Phase 3 — Cutover (one PR per project, single session)
|
||||
|
||||
Order = lowest blast radius first. Each step:
|
||||
`image=` flip → `kubectl rollout restart` → verify pull from Forgejo.
|
||||
|
||||
1. `payslip-ingest` (`infra/stacks/payslip-ingest/main.tf`)
|
||||
2. `job-hunter` (`infra/stacks/job-hunter/main.tf`)
|
||||
3. `claude-agent-service` (`infra/stacks/claude-agent-service/main.tf`)
|
||||
4. `fire-planner` (`infra/stacks/fire-planner/main.tf`)
|
||||
5. `wealthfolio-sync` (`infra/stacks/wealthfolio/main.tf`)
|
||||
6. `freedify` (`infra/stacks/freedify/factory/main.tf`)
|
||||
7. `chrome-service` (`infra/stacks/chrome-service/main.tf`)
|
||||
8. `beads-server` / `beadboard` (`infra/stacks/beads-server/main.tf`).
|
||||
Then `gh repo archive ViktorBarzin/beadboard`.
|
||||
9. `infra-ci` — flip `image:` references in 4 `.woodpecker/*.yml`
|
||||
files in the infra repo. Verify next push to master applies cleanly.
|
||||
10. `claude-memory-mcp` — update `CLAUDE.md` install instruction from
|
||||
`claude plugins install github:ViktorBarzin/claude-memory-mcp` to
|
||||
`claude plugins install https://forgejo.viktorbarzin.me/viktor/claude-memory-mcp.git`.
|
||||
`gh repo archive ViktorBarzin/claude-memory-mcp`.
|
||||
|
||||
## Phase 4 — Decommission
|
||||
|
||||
| Step | File / location |
|
||||
|---|---|
|
||||
| Stop `registry-private` container on VM (10.0.20.10): edit `/opt/registry/docker-compose.yml`, comment out service, `docker compose up -d --remove-orphans`. (Manual SSH — cloud-init won't redeploy on TF apply per memory id=1078.) | live VM |
|
||||
| Update cloud-init template to match the new compose file | `infra/stacks/infra/main.tf:288` |
|
||||
| Delete `auths` entries for `registry.viktorbarzin.me` / `:5050` / `10.0.20.10:5050` from the dockerconfigjson | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
|
||||
| Drop `registry.viktorbarzin.me` and `10.0.20.10:5050` `hosts.toml` entries on each node + cloud-init template | `infra/stacks/infra/main.tf` cloud-init + ad-hoc script |
|
||||
| After 1 week of no incidents, delete `/opt/registry/data/private/` blob storage on the VM (~2.6GB freed) | manual SSH |
|
||||
|
||||
## Phase 5 — Docs
|
||||
|
||||
In the same commit as the Phase 4 closing:
|
||||
|
||||
| Doc | Update |
|
||||
|---|---|
|
||||
| `docs/runbooks/registry-vm.md` | Note `registry-private` is gone; pull-through caches and break-glass tarballs only |
|
||||
| `docs/runbooks/registry-rebuild-image.md` | Replaced by NEW `forgejo-registry-rebuild-image.md` |
|
||||
| `docs/runbooks/forgejo-registry-rebuild-image.md` (NEW) | Forgejo PVC restore procedure |
|
||||
| `docs/runbooks/forgejo-registry-breakglass.md` (NEW) | infra-ci tarball recovery |
|
||||
| `docs/architecture/ci-cd.md` | Image registry section flips to Forgejo |
|
||||
| `docs/architecture/monitoring.md` | Integrity probe target updated |
|
||||
| `infra/.claude/CLAUDE.md` | Registry references updated |
|
||||
| `CLAUDE.md` (monorepo root) | claude-memory-mcp install URL updated |
|
||||
| `infra/.claude/reference/service-catalog.md` | Cross-reference checked |
|
||||
|
||||
## Critical files modified
|
||||
|
||||
| File | Phase | What |
|
||||
|---|---|---|
|
||||
| `infra/stacks/forgejo/main.tf` | 0 | Memory bump, packages env vars, PVC bump, ingress max_body_size |
|
||||
| `infra/stacks/forgejo/cleanup.tf` (NEW) | 0 | Retention CronJob |
|
||||
| `infra/stacks/forgejo/files/cleanup.sh` (NEW) | 0 | Retention script (mounted via ConfigMap) |
|
||||
| `infra/modules/kubernetes/ingress_factory/main.tf` | 0 | Wire `max_body_size` into a Traefik Buffering middleware |
|
||||
| `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` | 0 | Add 4th `auths` entry |
|
||||
| `infra/stacks/infra/main.tf` | 0 + 4 | Containerd hosts.toml block (add Forgejo, later remove registry-private); compose template update |
|
||||
| `infra/scripts/setup-forgejo-containerd-mirror.sh` (NEW) | 0 | One-shot rollout for existing nodes |
|
||||
| `infra/stacks/monitoring/modules/monitoring/main.tf` | 0 | Forgejo integrity probe CronJob |
|
||||
| `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` | 0 | Make alerts instance-aware |
|
||||
| `infra/stacks/monitoring/main.tf` | 0 | Plumb `forgejo_pull_token` into module |
|
||||
| `infra/.woodpecker/build-ci-image.yml` | 1 | Dual-push to add Forgejo target + tarball break-glass |
|
||||
| `<each-project>/.woodpecker.yml` | 1 | Dual-push (NEW for fire-planner, wealthfolio-sync, hmrc-sync, freedify, beadboard, claude-memory-mcp; EDIT for payslip-ingest, job-hunter, claude-agent-service) |
|
||||
| `infra/.woodpecker/{default,drift-detection,build-cli}.yml` | 3 | Flip `image:` to Forgejo for infra-ci |
|
||||
| `infra/stacks/{beads-server,chrome-service,claude-agent-service,fire-planner,freedify/factory,job-hunter,payslip-ingest,wealthfolio}/main.tf` | 3 | Flip `image =` to Forgejo |
|
||||
|
||||
## Verification
|
||||
|
||||
- **Push** (Phase 0/1): `docker push forgejo.viktorbarzin.me/viktor/<name>` visible in Forgejo Web UI under viktor/.
|
||||
- **Pull** (Phase 0): `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds with auto-synced Secret.
|
||||
- **Dual-push** (Phase 1): every Woodpecker pipeline run pushes to BOTH endpoints — confirmed via HEAD checks on `<reg>:<sha>` for both.
|
||||
- **Bake** (Phase 2): existing daily Forgejo `/api/healthz` external monitor stays green; integrity probe stays green; no `ContainerNearOOM` for forgejo pod.
|
||||
- **Cutover** (Phase 3): `kubectl rollout status deploy/<svc> -n <ns>` succeeds. `kubectl describe pod` shows the image was pulled from `forgejo.viktorbarzin.me`.
|
||||
- **Decommission** (Phase 4): `docker ps` on registry VM no longer shows `registry-private`. Brand-new namespace gets the Secret with only the Forgejo `auths` entry. Pull still works.
|
||||
|
|
@ -1,150 +0,0 @@
|
|||
# Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Date** | 2026-04-18 |
|
||||
| **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
|
||||
| **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
|
||||
| **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
|
||||
| **Status** | Root cause fixed; permanent mitigation applied; alerting still TODO |
|
||||
|
||||
## Summary
|
||||
|
||||
The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
|
||||
|
||||
Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
|
||||
- **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
|
||||
- **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
|
||||
- **Data loss**: None. Session state in tmpfs is ephemeral by design.
|
||||
- **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time | Event |
|
||||
|------|-------|
|
||||
| **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
|
||||
| **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
|
||||
| **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
|
||||
| **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
|
||||
| **Apr 18 09:00–12:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
|
||||
| **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
|
||||
| **Apr 18 12:34** | Viktor reports he can no longer log in either. |
|
||||
| **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
|
||||
| **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
|
||||
| **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
|
||||
| **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
|
||||
| **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
|
||||
| **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
|
||||
|
||||
## Root Cause Chain
|
||||
|
||||
```
|
||||
[1] goauthentik/proxy outpost uses gorilla/sessions FileStore
|
||||
└─> each forward-auth request that has no valid session cookie writes
|
||||
/dev/shm/session_<random> (~1500 bytes/file)
|
||||
│
|
||||
├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
|
||||
│ └─> each file's MaxAge = 7 days
|
||||
│ └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
|
||||
│ delete files whose MaxAge has EXPIRED, not whose age exceeds any
|
||||
│ shorter threshold
|
||||
│
|
||||
├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
|
||||
│ real user traffic)
|
||||
│ └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
|
||||
│
|
||||
└─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
|
||||
└─> 64 MB / 1500 bytes ≈ 44,000 files maximum
|
||||
└─> Full in approx 44,000 / 18 ≈ 2,440 min ≈ 41 hours
|
||||
└─> Actual observed time: pod started Apr 15 ~09:21,
|
||||
first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
|
||||
(some excess from Uptime-Kuma bursts)
|
||||
|
||||
[ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
|
||||
-> Traefik forwards the 400 to the browser
|
||||
-> user sees "400 Bad Request" on every protected site
|
||||
```
|
||||
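For reference, the state described in [4] is visible from one exec into the outpost pod (the label selector is the one used later in this doc; pod names change per rollout):

```bash
POD=$(kubectl -n authentik get pod \
  -l goauthentik.io/outpost-name=authentik-embedded-outpost \
  -o jsonpath='{.items[0].metadata.name}')

# tmpfs fill level and session-file count.
kubectl -n authentik exec "$POD" -- df -h /dev/shm
kubectl -n authentik exec "$POD" -- sh -c 'ls /dev/shm | wc -l'

# The smoking-gun log line from the incident.
kubectl -n authentik logs "$POD" --since=1h | grep -i 'no space left on device'
```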
|
||||
## Why Diagnosis Took So Long
|
||||
|
||||
The initial report was framed as "Emil can't access terminal" — a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, all of which turned out to be falsified.
|
||||
|
||||
Contributing distractions:
|
||||
1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
|
||||
2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
|
||||
3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
|
||||
4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
|
||||
2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
|
||||
3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
|
||||
4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
|
||||
5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
|
||||
6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
|
||||
|
||||
## Detection Gaps
|
||||
|
||||
| Gap | Impact | Fix |
|
||||
|-----|--------|-----|
|
||||
| No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
|
||||
| No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
|
||||
| Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
|
||||
| Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
|
||||
|
||||
## Prevention Plan
|
||||
|
||||
### P0 — Prevent this exact failure
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
|
||||
|
||||
### P1 — Detect this next time
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P1 | Prometheus alerts on outpost `/dev/shm` fill (two thresholds) | Alert | Group `Authentik Outpost` added in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`. `AuthentikOutpostMemoryHigh` (warning, working set > 1.5 GiB for 15m) + `AuthentikOutpostMemoryCritical` (critical, > 1.8 GiB for 5m) + `AuthentikOutpostRestarts` (warning, > 2 restarts in 30m). Applied 2026-04-18 13:16 UTC; loaded in Prometheus, state=inactive. | **DONE** |
|
||||
| P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
|
||||
| P1 | Bump tmpfs `sizeLimit` from 512Mi → 2Gi + set explicit container memory limit 2560Mi | Config | Patched outpost `kubernetes_json_patches` via Authentik API. 2026-04-18 13:06 UTC (sizeLimit), 13:22 UTC (container limit). **Gotcha**: `sizeLimit` alone is insufficient — writes to tmpfs count against container cgroup memory, and Kyverno's `tier-defaults` LimitRange sets a default `limits.memory: 256Mi` which OOM-kills the container before tmpfs fills. Fix is to also set `containers[0].resources.limits.memory` ≥ `sizeLimit + working_set_headroom`. Verified 1.5 GB file write succeeds on the configured pod; df reports 2.0 GB tmpfs. Gives ~8× growth headroom at current probe rate. | **DONE** |
|
||||
|
||||
### P2 — Codify the fix so it survives drift
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
|
||||
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
|
||||
|
||||
### P3 — Upstream + architectural
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
|
||||
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
|
||||
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
|
||||
|
||||
2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
|
||||
|
||||
3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context** — `curl` with no cookie jar is the fastest way.
|
||||
|
||||
4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
|
||||
|
||||
5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
|
||||
|
||||
6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
|
||||
|
||||
## References
|
||||
|
||||
- Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
|
||||
- Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
|
||||
- Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
|
||||
- Beads task: `code-zru` (P1 bug).
|
||||
|
|
@ -1,246 +0,0 @@
|
|||
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the `clone` step with "image can't be pulled". The image in question — the CI toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved to an OCI image index whose `linux/amd64` platform manifest (`sha256:98f718c8…`) and its in-toto attestation (`sha256:27d5ab83…`) returned **HTTP 404** from the private registry. The index record itself still existed — it's the children that had been garbage-collected out from under it.

This is the **second identical incident**: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.
## Impact

- **User-facing**: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix commit `c113be4d` — roughly 40 min. Pipelines that had already been triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child manifests are reproducible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)

| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain

```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
     │
     ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
     │       BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
     │       per-repo revision-link files under
     │       `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
     │       every child referenced by that tag's index. The raw blob data
     │       under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
     │       is NOT touched — GC owns that, and GC only runs Sunday.
     │
     ├─> [3] If ANOTHER tag's index still references one of those same
     │       children (common — successive rebuilds share layers), the child
     │       blob survives. But the revision-link is gone, so the registry
     │       API can no longer map `<repo>/manifests/sha256:<child>` back
     │       to the blob. HEAD → 404, even though the bytes are on disk.
     │       distribution/distribution#3324 is the upstream class of this bug.
     │
     └─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
             intact on disk, its children's blob data files are intact on
             disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
             returns 404. The registry has the bytes, but cannot find them
             through the API because the per-repo link bridge is gone.

[pull] containerd resolves `infra-ci:latest`
 │
 ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
 │
 └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
      └─> containerd fails the pull with "manifest unknown"
           └─> woodpecker exit 126
```

> **Detection-gotcha** uncovered 2026-04-19 while implementing `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for presence is NOT equivalent to "can the registry serve this child?" The authoritative check is whether `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script was rewritten to check the per-repo link file after the HTTP probe caught 38 real orphans the filesystem scan had reported clean.
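A minimal sketch of that distinction on the registry VM — paths follow the standard `registry:2` filesystem layout described above; repo and digest are placeholders, not the real script:

```bash
#!/usr/bin/env bash
# Sketch: decide whether the registry can still *serve* a child manifest,
# as opposed to merely still holding its bytes on disk.
ROOT=/var/lib/registry/docker/registry/v2
REPO=infra-ci
DIGEST=98f718c8...   # child digest without the sha256: prefix (placeholder)

# 1) Blob bytes present? Necessary, but NOT sufficient.
ls "$ROOT/blobs/sha256/${DIGEST:0:2}/$DIGEST/data"

# 2) Per-repo revision link present? This is what the API actually resolves;
#    if it is missing, HEAD /v2/<repo>/manifests/sha256:<digest> returns 404.
ls "$ROOT/repositories/$REPO/_manifests/revisions/sha256/$DIGEST/link"
```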
## Why Existing Remediation Missed It

1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron walks `_layers/sha256/` and removes link files whose blob `data` is missing. It does NOT inspect `_manifests/revisions/sha256/` to see whether an image-index's referenced children still exist. That's exactly the class of orphan this incident represents.
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned only to `registry:2`. Whatever Docker Inc. last rebuilt as "v2-current" was running, with no version pin. Any regression in the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate and registry-down, but nothing probes "are the manifests the registry advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns success as soon as it uploads. If a child blob upload 0-byted or the client disconnected mid-push (distinct from the GC mode but the same on-disk symptom), nothing would notice until the next pull.
## Permanent Fix — Three Phases

### Phase 1 — Detection (ship today)

1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`. After `build-and-push`, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers. (A sketch of such a walk follows this phase.)
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new CronJob (`registry-integrity-probe`, every 15m, in the `monitoring` namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes `registry_manifest_integrity_failures` to Pushgateway. Accompanying alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from `.claude/reference/service-catalog.md` via the new runbook.
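A minimal sketch of what such a post-push walk can look like — hedged, the real pipeline step may differ; `REG`, `IMAGE`, `TAG` are placeholders and auth flags are omitted:

```bash
#!/usr/bin/env bash
# Sketch: fetch the just-pushed index, then HEAD every child manifest it
# references. Exit non-zero on the first non-200 so the pipeline fails fast.
set -euo pipefail
REG=registry.viktorbarzin.me:5050   # placeholder
IMAGE=infra-ci                      # placeholder
TAG=latest                          # placeholder

accept='application/vnd.oci.image.index.v1+json,application/vnd.docker.distribution.manifest.list.v2+json'
index=$(curl -fsS -H "Accept: $accept" "https://$REG/v2/$IMAGE/manifests/$TAG")

for d in $(echo "$index" | jq -r '.manifests[]?.digest'); do
  code=$(curl -sS -o /dev/null -w '%{http_code}' -I \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d -> $code"
  [ "$code" = "200" ] || exit 1
done
```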
### Phase 2 — Prevention

4. **Pin `registry:2` → `registry:2.8.3`** in `modules/docker-registry/docker-compose.yml` (all six registry services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every `_manifests/revisions/sha256/<digest>` that is an image index and flag children whose blob `data` file is missing. The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.

### Phase 3 — Recovery tooling

6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer need a cosmetic Dockerfile edit — POST to the Woodpecker API or click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact command sequence for the next time this happens, plus fallback paths.
## Out of Scope

- **Pull-through caches.** The DockerHub / GHCR mirrors on `:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".

## Lessons

- **Repeat incidents deserve root-cause work, not a third hot-fix.** The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature.** The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push` "success" exit code is not a guarantee of pulled-back integrity — as this incident proves. A 30-second post-push HEAD walk is cheap insurance.
## Related

- **Prior incident (same failure mode, different image)**: memory `709` / `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)

Same failure class, broader scope. The `registry-integrity-probe` surfaced 38 broken manifest references persisting after the 04-19 infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck `ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34 affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child manifests were absent from blob storage (same orphan pattern).
**Action taken**:

1. Bumped the `beads-server/main.tf` var default `claude_agent_service_image_tag` from `0c24c9b6` → `2fd7670d` (the canonical tag in `claude-agent-service/main.tf`) — same image, already healthy on the registry. `scripts/tg apply` on `beads-server`. Deleted the stuck Jobs so new CronJob ticks could fire.
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP probe using the `registry-probe-credentials` K8s Secret. Deleted each via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 — claude-agent-service:latest pointed at an already-deleted digest). A sketch of that delete loop follows this list.
3. Ran `docker exec registry-private /bin/registry garbage-collect /etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob storage.
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed at missing children, so no cached copies would survive pod reschedule):
   - `freedify:latest` / `freedify:c803de02` — built on the registry VM directly (no CI pipeline exists for this image; python FastAPI).
   - `beadboard:17a38e43` / `beadboard:latest` — GHA `workflow_dispatch` failed at registry login (missing `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on the registry VM directly as the fallback. The GitHub secret gap is a follow-up — beads `code-8hk` notes it.
   - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0` — Woodpecker pipeline #8 on repo 81. The pipeline `kubectl set image`'d the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but that drift is pre-existing, not introduced by this cleanup).
   - `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next run 2026-05-01), no source tree or CI pipeline available in the monorepo; deferred for separate follow-up.
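A minimal sketch of step 2's delete loop — hedged, the credentials and the triple list are placeholders, not the exact commands used:

```bash
#!/usr/bin/env bash
# Sketch: delete a list of broken (repo, digest) manifest references via the
# registry API, then let the Sunday GC (or a manual run) reclaim the blobs.
REG=registry.viktorbarzin.me:5050   # placeholder
AUTH="user:pass"                    # from registry-probe-credentials; placeholder

while read -r repo digest; do
  code=$(curl -sS -u "$AUTH" -o /dev/null -w '%{http_code}' \
    -X DELETE "https://$REG/v2/$repo/manifests/$digest")
  echo "$repo $digest -> $code"   # expect 202; 404 means it was already gone
done < broken-manifests.txt       # lines of: <repo> <sha256:digest>
```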
**Post-cleanup state**:

- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for 5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were **dangling tags not referenced by any workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe deletes, no rebuilds needed.

**Permanent fix still in flight**: Phase 2/3 of this post-mortem (post-push verification in CI, atomic `cleanup-tags.sh`) — not addressed by this cleanup. The probe continues to be the authoritative detector.

@@ -1,155 +0,0 @@
# Post-Mortem: Vault Raft Leader Deadlock + NFS Kernel Client Corruption Cascade

> **Resolution status (2026-04-25):** Resolved structurally by the code-gy7h migration. All 3 vault voters are now on `proxmox-lvm-encrypted` block storage; the NFS fsync incompatibility that triggered the original raft hang is no longer reachable. See `docs/plans/2026-04-25-nfs-hostile-migration-plan.md` Phase 2.

| Field | Value |
|-------|-------|
| **Date** | 2026-04-22 |
| **Duration** | External endpoint 503 from ~09:00 UTC to ~11:43 UTC (~2h 43m). vault-2 became active leader 11:43:28 UTC. |
| **Severity** | SEV1 (Vault — single source of secrets for 40+ services) |
| **Affected Services** | All ESO-backed services (password rotation paused). CronJobs that read plan-time secrets (14 stacks). Woodpecker CI (blocked pipeline `d39770b3`). Everything with `ExternalSecret` refresh interval ≤ 2h. |
| **Status** | Vault HA operational with vault-0 + vault-2 quorum. vault-1 still stuck ContainerCreating on node2 (third node2 reboot pending; workload can accept 2/3 quorum). Terraform fix committed as `2f1f9107`; apply pending. |
## Summary

A Vault raft leader (`vault-2`) entered a stuck goroutine state where its cluster port (8201) accepted TCP but never completed msgpack RPC. Standbys could not detect leader death because the TCP layer looked healthy, so no re-election fired. The only recovery was to kill the leader. During recovery, abrupt `kubectl delete --force` of the stuck Vault pods left kernel-side NFS client state on k8s-node1/node3/node4 corrupted — **all new NFS mounts from those nodes timed out at 110s**, while existing mounts kept working. This created a cascade: the stuck leader blocked quorum, killing the leader broke NFS on the destination node for the recreated pod, force-killing the stuck pods left zombie `containerd-shim` processes kubelet couldn't clean up, and the resulting volume-manager loops pegged kubelet into 2-minute timeouts. Recovery required VM hard-resets for node2, node3 (twice), and node4 (kubelet was zombie on node2 and node3). vault-1 remains down pending another node2 reboot.
## Impact

- **User-facing**: `vault.viktorbarzin.me` returned HTTP 503 for ~2h. Any service that needed a Vault token during that window was degraded; the Woodpecker CI pipeline blocked.
- **Blast radius**: 3/3 Vault pods affected (raft deadlock blocked re-election even with standbys up). Three k8s nodes degraded simultaneously with kernel NFS client stuck state (node1, node3, node4). Two nodes required VM hard-reset to recover kubelet (node2, node3).
- **Duration**: Degraded ~2h; resolution required sequential hard reboots.
- **Data loss**: None. Raft data integrity preserved on NFS. vault-1 came up with index 2475732 and caught up to 2476009+ once a leader was elected.
- **Observability gap**: No alert fired for the stuck raft leader. Standbys report `HA Mode: standby, Active Node Address: <leader IP>` as if healthy even when the leader is hung.
## Timeline (UTC)

| Time | Event |
|------|-------|
| **~09:00** | `vault-2` (original raft leader) enters a hung state — port 8201 open but msgpack RPCs hang. Its own logs go silent. Standbys continue heartbeat/appendEntries with `msgpack decode error [pos 0]: i/o timeout`. Neither standby triggers re-election because the raft transport does not distinguish "TCP open + silent" from "TCP open + healthy". |
| **~09:15** | External endpoint starts serving 503. Woodpecker CI pipeline `d39770b3` blocks waiting for Vault. |
| **09:59** | Operator force-deletes the `vault-2` pod — the replacement comes up on node3 and enters a candidate loop (term=32), cannot get quorum because DNS for `vault-0` is NXDOMAIN (ContainerCreating) and vault-1 does not respond (its raft goroutine also hung). |
| **10:07** | Operator force-deletes `vault-1` — the new `vault-1` gets scheduled to node2. Its raft would be fine, but kubelet on node2 hangs in the pod cleanup path for the old pod's NFS mount. Concurrently, a new `vault-0` pod is attempted on node4, but **NFS mount from node4 times out at 110s** — the host kernel NFS client is in a degraded state that blocks all new mounts (including to completely different NFS paths like `/srv/nfs/ytdlp`). |
| **10:09** | Diagnostic test: from node1 and node4 CSI pods, `mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs/ytdlp /tmp/test` times out. From node2 and node3 the same mount succeeds. The NFS server is healthy (`showmount -e` works; `rpcinfo` shows all programs registered). The common factor on the broken nodes: they had a force-terminated Vault pod earlier in the session, leaving stuck `mount.nfs` processes in D-state. |
| **10:18** | Manual unmount of the stale NFS mount from the force-deleted old vault-0 pod on node4. New mount attempts from CSI still time out — clearing the old mount did not recover kernel NFS client state. |
| **10:22** | Workaround discovered: mounting with `nfsvers=4.0` or `nfsvers=4.1` (instead of the default `nfsvers=4`, which negotiates to 4.2) succeeds on broken nodes. Confirms the stuck state is version-specific (NFSv4.2 session state), not a general NFS issue. Decision: rather than change CSI mount options cluster-wide (risk of remounting existing 48+ PVs), fix the nodes directly. |
| **10:31** | Investigated node2 kubelet state: the old `vault-1` container shows the `vault` process in **Z (zombie)** state with its `sh` wrapper stuck in `do_wait` in the kernel (`zap_pid_ns_processes`). Containerd-shim PID killed manually — `sh` and the zombie reparented to init but remained stuck (uninterruptible kernel wait tied to NFS). |
| **10:34** | Attempted `systemctl restart kubelet` on node2 — kubelet itself went into Z (zombie) with 2 tasks still attached. Classic NFS-related kernel deadlock. |
| **10:42** | **Decision: hard-reset node2 VM** (`qm reset 202`). Disruption: 22 pods evicted. |
| **10:43** | node2 back up (Ready). CSI registered. New `vault-1` scheduled to node2. NFS mount succeeded (fresh kernel state). Kubelet began chowning the volume — **extremely slow, ~3 files per minute over NFS**. |
| **10:48** | `vault-1` (2/2 Running) unsealed. **Raft leader elected: `vault-2` wins term 32, election tally=2** (vault-1 voted yes once it came up, vault-0 unreachable). However vault-2's vault layer (HA active/standby) never transitioned to active — a raft leader with `active_time: 0001-01-01T00:00:00Z` and `/sys/ha-status` returning 500. |
| **10:50** | Restarted the `vault-2` pod to force a clean leader transition. New `vault-2` stuck in a chown loop on node3 (same pattern as node2 earlier). |
| **10:54** | Patched the Vault `StatefulSet` with `fsGroupChangePolicy: OnRootMismatch` so subsequent recreations skip the recursive chown. |
| **10:57** | Force-deleted `vault-2` and the `06fa940b` pod directory on node3. New pod spawned but kubelet again stuck on phantom state from the old pod. |
| **11:01** | **Hard-reset node3 VM** (`qm reset 203`). |
| **11:03** | First 200 response: vault-1 elected leader, vault-2 standby. Premature celebration — vault-1's audit log on node2 NFS starts timing out; `/sys/ha-status` returns 500 even though raft thinks vault-1 is active. |
| **~11:18** | Service regresses. `vault-1` audit writes hanging (`event not processed by enough 'sink' nodes, context deadline exceeded`). Readiness probe fails; pod goes 1/2; the `vault-active` endpoint stays pointed at vault-1's IP but the backend is unresponsive → 503. |
| **11:22** | Force-restart `vault-1` to trigger re-election with a new pod. Delete + containerd-shim cleanup leaves yet another zombie on node2. Same pattern: force-delete → zombie. |
| **11:29** | **Hard-reset node4 VM** (`qm reset 204`). Rationale: vault-0 was still blocked there; 74 pods on node4 contribute to NFS server load (load avg 16 on PVE). After reboot, vault-0 mounts its PVCs on fresh kernel state and comes up 2/2 Running at 11:31. |
| **11:31** | Increased PVE NFS threads from 16 to 64 (`echo 64 > /proc/fs/nfsd/threads`). Did not help the immediate mount failures — the stuck state is per-client kernel, not server capacity. |
| **11:38** | Discover a DNS resolution issue: vault-2's Go resolver returns NXDOMAIN for short names like `vault-0.vault-internal` even though the glibc resolver works. A CoreDNS restart issued earlier didn't fix it. Restart the vault-2 pod to force fresh resolver state. |
| **11:42** | **Second hard-reset of node3 VM** (`qm reset 203`). Kubelet+CSI re-register; vault-2 scheduled, NFS mounts finally succeed on fresh kernel state. |
| **11:43:28** | **vault-2 becomes active leader.** External endpoint returns 200 and stays there. vault-0 follower, catches up to index 2477632+. vault-1 still stuck on node2; left for later recovery. |
## Root Cause Chain

```
[1] Vault-2 raft goroutine hang (root cause — upstream Vault bug or infra-induced)
 └─> Cluster port 8201 accepts TCP but never responds to msgpack RPCs
      └─> Standbys' appendEntries calls return `msgpack decode error [pos 0]: i/o timeout`
           └─> Raft protocol: no re-election because leader is heartbeating at the TCP level
                └─> External endpoint returns 503 because HA layer has no active leader

[2] Recovery complication — abrupt pod termination
 └─> `kubectl delete --force --grace-period=0` on vault-0/1/2
      └─> containerd-shim fails to kill container cleanly (NFS I/O in D-state)
           └─> vault process ends as zombie; sh wrapper stuck in do_wait
                └─> Kubelet retries forever, cannot tear down old pod volumes
                     └─> NFS-CSI unmount requests succeed at the NFS layer but kubelet's
                         volume state-machine never marks the volume as unmounted
                         (stale 0000-mode mount directory blocks teardown completion)

[3] Kernel NFS client corruption on node1/node4
 └─> Force-terminated Vault pod left stuck `mount.nfs` processes in D-state
      └─> Kernel NFS4.2 client session state corrupted (held open mount slot)
           └─> All subsequent mount syscalls for nfsvers=4 block 110s+ waiting for a
               session slot that will never be freed
                └─> Manual workaround: nfsvers=4.1 bypasses the corrupted session state

[4] Kubelet starvation
 └─> Combination of (2) and (3) means kubelet is stuck in a 2-minute volume-setup
     context deadline loop — each iteration times out, a new iteration restarts,
     infinite loop
      └─> Hard VM reset is the only exit
           └─> After reset, kubelet starts clean, CSI re-registers, mounts succeed

[5] Slow recursive chown amplifies impact
 └─> Default fsGroupChangePolicy: Always (Vault Helm chart 0.29.1 default)
      └─> Kubelet walks every file on NFS setting gid=1000
           └─> Over a 1GB audit log and a 47MB raft.db on NFS with timeo=30,retrans=3,
               each chown syscall takes seconds; kubelet's 2-minute deadline runs out
               before the walk finishes
                └─> Loop never exits even when ownership is already correct
```
## Why This Failed

1. **Raft transport does not detect stuck leaders.** If TCP is open and the process is alive enough to hold the port, standbys assume the leader is healthy. A stuck goroutine that never responds to RPCs appears to raft as "leader with high RTT" and does not trigger re-election. This is an upstream Vault bug (or at least a missing liveness check).

2. **Abrupt pod termination + NFS = kernel-level zombie.** When a Vault pod holding an NFS mount is force-killed before it cleanly closes file handles, the kernel's NFS4.2 client session state enters a corrupted state. This blocks all new mounts from that node — not just to the same NFS path, but to ANY NFS path on the same server. The fix is a kernel reboot; there is no userspace recovery.

3. **Vault data on NFS violates the documented rule.** `infra/.claude/CLAUDE.md` explicitly states: *"Critical services MUST NOT use NFS storage — circular dependency risk."* Vault currently uses `nfs-proxmox` for both `dataStorage` and `auditStorage`. If Vault had been on `proxmox-lvm-encrypted`, none of the NFS corruption cascade would have happened.

4. **fsGroupChangePolicy: Always is the Helm default.** Every pod restart walks every file over NFS. On a 1GB audit log with degraded NFS RTT, this takes longer than kubelet's internal 2-minute deadline, causing infinite restart loops. `OnRootMismatch` makes chown a no-op when the root is already correct (which it always is after first setup).

5. **No alert for this failure mode.** Prometheus alerts exist for `VaultSealed`, `VaultDown` (`up` metric), and backup staleness, but none for "raft leader has been running without advancing commit index" or "standby reports leader but leader's `/sys/ha-status` returns 500".
## Remediation (Applied)

- [x] Hard-reset node2 and node3 VMs to clear kernel NFS state and kubelet zombies.
- [x] Manually patched the live `StatefulSet vault/vault` with `fsGroupChangePolicy: OnRootMismatch` to stop the chown loop (a sketch of the patch follows this list).
- [x] Lazy-unmounted stale NFS mounts from force-deleted pod directories on node2 and node3.
- [x] Removed stale kubelet pod directories (`/var/lib/kubelet/pods/<UID>`) that had 0000-mode mount subdirectories blocking teardown.
- [x] Updated `stacks/vault/main.tf` with the `fsGroupChangePolicy` setting so the next `scripts/tg apply vault` makes it durable.
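A minimal sketch of that live patch — hedged; it assumes the chart renders `fsGroup` in the pod-level `securityContext`, so verify against the rendered StatefulSet before applying:

```bash
# Stop the recursive chown on every restart: only re-chown when the volume
# root's ownership doesn't already match fsGroup.
kubectl -n vault patch statefulset vault --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'

# Confirm it took effect:
kubectl -n vault get statefulset vault \
  -o jsonpath='{.spec.template.spec.securityContext.fsGroupChangePolicy}'
```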
## Remediation (Pending)

- [ ] **Hard-reset node4** to recover vault-0 (same NFS kernel corruption pattern).
- [ ] **Run `scripts/tg apply` on the vault stack** to persist the fsGroupChangePolicy change.
- [ ] **Add Prometheus alert `VaultRaftLeaderStuck`** — fire when `vault_raft_last_index_gauge` (or a derivation from `vault_runtime_total_gc_runs`) stops advancing for >2 minutes while `vault_core_active` is 1.
- [ ] **Add Prometheus alert `VaultHAStatusUnavailable`** — fire when `vault_core_active{}` reports 0 across all pods but `up{job="vault"}` reports 1 (HA layer broken but pods alive).
- [ ] **Migrate Vault to `proxmox-lvm-encrypted` block storage** — eliminates the entire NFS failure class. This follows the rule already documented in `infra/.claude/CLAUDE.md`. Tracked as a beads task (open after Dolt is back up; currently down on node4).
- [ ] **Consider raising the kubelet volume-manager deadline** for large-volume chown scenarios, or document the `fsGroupChangePolicy: OnRootMismatch` requirement for all NFS-backed StatefulSets.
- [ ] **Runbook**: `docs/runbooks/vault-raft-leader-deadlock.md` — how to detect a stuck leader, a safe force-restart procedure that avoids zombie pods, NFS kernel state recovery.
## Contributing Factors

1. **NFS mount options use bare `nfsvers=4`**. This negotiates to the highest version the server supports (NFSv4.2). When 4.2 session state corrupts, mounts fail; 4.1 works. Pinning to `nfsvers=4.1` in the `nfs-proxmox` StorageClass would make the failure mode recoverable without a node reboot, but would also require recreating 48+ existing PVs (volumeAttributes are immutable). Deferred.

2. **`kubectl delete --force` is the default reflex for stuck pods**. Operators reach for force-delete when a pod won't terminate, but this leaves containerd in an inconsistent state when the underlying storage is hung. Better approach: identify the stuck process (typically `mount.nfs` or a kernel NFS callback) and fix the root cause before force-deleting (see the sketch after this list).

3. **The Beads / Dolt server was on node4**, so beads task tracking went offline during this incident and couldn't be used to log progress cross-session.

4. **node1 was cordoned mid-incident** to prevent rescheduling to a node with confirmed NFS issues, but this reduced the scheduling surface for anti-affinity-sensitive StatefulSets.
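A minimal sketch of that pre-flight check on the affected node — plain procps usage, nothing cluster-specific:

```bash
# On the node, before any force-delete: look for uninterruptible (D-state)
# processes — typically mount.nfs or the workload itself stuck on NFS I/O.
ps -eo pid,stat,wchan:32,comm,args | awk '$2 ~ /D/'

# Any hit that points at an NFS path means a force-delete will likely leave
# a zombie shim behind; plan for a node reboot (or fix the mount) instead.
```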
## Learnings

1. **NFS for stateful critical services is structurally unsafe.** When NFS breaks, the recovery involves killing pods → which can break NFS further → until a reboot. The rule exists for a reason; Vault should never have been on NFS.

2. **Raft liveness needs application-layer probing, not TCP.** Every time we've seen a "stuck leader" issue in the homelab, TCP was fine and the app was unresponsive. A lightweight RPC probe with a short timeout and a Prometheus alert would catch this in minutes instead of hours.

3. **kubelet's volume-manager is fragile against stuck NFS.** Once kubelet enters a chown loop with a context deadline shorter than the chown duration, it cannot make progress — even when the filesystem is otherwise healthy. `OnRootMismatch` is effectively mandatory for any pod with `fsGroup` and a volume >100MB.

4. **VM hard-reset is cheap but disruptive.** The two reboots took ~60 seconds each but evicted 22+44 = 66 pods. Doing this twice in one session is a lot of churn. A post-mortem-driven improvement: pre-prepare "hot-standby" capacity so we can cordon+drain instead of hard-reset when kubelet zombies appear.

5. **Enforcement of a rule is worth more than the rule itself.** The CLAUDE.md already says "critical services must not use NFS". The vault stack violates it. A rule without enforcement (validation, linting, CI) is ignored during the rush to ship.

## References

- Related: `docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md` — previous Vault+NFS incident (different root cause, similar blast pattern).
- Vault Helm chart 0.29.1 leaves `fsGroupChangePolicy` unset (which behaves as `Always`).
- Upstream Vault HA layer: the raft leader → vault-active transition is in `vault/external_tests/raft`. The stuck-goroutine pattern is not documented as a known issue.

@@ -1,185 +0,0 @@
# Beads Auto-Dispatch Runbook

Users can hand work to the headless `beads-task-runner` agent by assigning a bead to the sentinel user `agent`. Two CronJobs in the `beads-server` namespace drive the pipeline:

- **`beads-dispatcher`** — every 2 min: picks up the highest-priority `assignee=agent`/`status=open` bead with non-empty acceptance criteria, claims it by flipping it to `in_progress`, and POSTs it to BeadBoard's `/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with the existing bearer-token flow.
- **`beads-reaper`** — every 10 min: flips any `assignee=agent` + `status=in_progress` bead whose `updated_at` is older than 30 min to `status=blocked` with an explanatory note. Catches pod crashes mid-run.

The manual BeadBoard Dispatch button continues to work in parallel.
## Flow diagram

```
user: bd assign <id> agent
        │
        ▼
Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
        │                                                 │
        ▼                                                 │
CronJob: beads-dispatcher                                 │
  1. GET beadboard/api/agent-status (busy?)               │
  2. bd query 'assignee=agent AND status=open'            │
  3. bd update -s in_progress (claim)                     │
  4. POST beadboard/api/agent-dispatch                    │
  5. bd note "dispatched: job=…"                          │
        │                                                 │
        ▼                                                 │
claude-agent-service /execute                             │
  beads-task-runner agent runs; notes/closes bead         │
        │                                                 │
        ▼                                                 │
      done ──► next tick picks up the next bead ──────────┘


CronJob: beads-reaper (every 10 min)
  for bead (assignee=agent, status=in_progress, updated_at > 30 min):
    bd note "reaper: no progress for Nm — blocking"
    bd update -s blocked
```
## Usage

### Hand a bead to the agent

```
bd create "Title" \
  -d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
  --acceptance "Concrete, verifiable criteria" \
  -p 2
bd assign <new-id> agent
```

**Acceptance criteria are required.** Beads without them are skipped by the dispatcher and stay in `open` forever. This is intentional — the `beads-task-runner` agent expects clear done conditions.

### Take a bead back (unassign)

```
bd assign <id> ""
```

If the bead is already `in_progress`, also reset it:

```
bd update <id> -s open
```

### Pause auto-dispatch

```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
```

This sets `spec.suspend: true` on both CronJobs. Existing running jobs continue; no new ticks fire. Re-enable by re-applying with `beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch remains available while paused.
### Read the logs

```
# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>

# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f

# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>
```

### Inspect a specific bead's dispatch history

```
bd show <id> --json | jq '{status, assignee, notes, updated_at}'
```

Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail lives on the bead itself.
## Reaper semantics — when a bead becomes `blocked`

The reaper flips a bead to `blocked` if:

- `assignee = agent`, AND
- `status = in_progress`, AND
- `updated_at` is more than **30 minutes** in the past.
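A minimal sketch of one reaper tick under those conditions — hedged, this is not the actual CronJob script; the `bd` flags and the `--json` output shape (`.id`, `.updated_at`) are assumptions consistent with how they're described elsewhere in this runbook:

```bash
#!/usr/bin/env bash
# Sketch: block agent-assigned in_progress beads with no update in 30 minutes.
set -euo pipefail
cutoff=$(date -u -d '30 minutes ago' +%s)

bd query 'assignee=agent AND status=in_progress' --json |
  jq -c '.[]' | while read -r bead; do
    id=$(echo "$bead" | jq -r '.id')
    updated=$(date -u -d "$(echo "$bead" | jq -r '.updated_at')" +%s)
    if [ "$updated" -lt "$cutoff" ]; then
      bd note "$id" "reaper: no progress for 30m+ — blocking"
      bd update "$id" -s blocked
    fi
  done
```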
Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` agent never trips the reaper — it notes progress as it works. A `blocked` bead is a signal that:

- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test),
- the job hit its 15-minute budget timeout inside `claude-agent-service` without notes (rare — the agent usually notes failure before exiting),
- `claude-agent-service` was restarted during the run (in-memory job state is lost; see [known risks](#known-risks)).

Recovery: read the reaper note, reopen manually if appropriate:

```
bd update <id> -s open
bd assign <id> agent   # re-arm for next dispatcher tick
```
## Design choices

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches `claude-agent-service`'s single-slot `asyncio.Lock`. With a 2-min poll cadence and a ~5-min average run, throughput is ~12 beads/hour. Parallelism is a separate plan.
- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's manual Dispatch button. Broader-privilege agents stay manual.
- **CronJob (not in-service polling, not n8n)** — matches the existing infra pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed, easy to pause.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/` because `bd` may touch the parent directory and ConfigMap mounts are read-only.
## Known risks

- **In-memory job state in `claude-agent-service`** — if the pod restarts mid-run, the job record is lost. The reaper catches this after 30 min. Persistent job store is deferred.
- **Prompt injection via bead fields** — a malicious bead description could try to steer the agent. The `beads-task-runner` rails + token budget + timeout are the defense. Identical exposure as the manual Dispatch button.
- **Image tag drift** — `claude_agent_service_image_tag` in `stacks/beads-server/main.tf` mirrors `local.image_tag` in `stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds, or the dispatcher/reaper will run on an older layer. (They only need `bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.)
- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and `.updated_at`. If a future `bd` upgrade renames these, the reaper breaks silently (no reaping, no alert). `BD_VERSION` is pinned in the image Dockerfile.
## Verification after change

```
# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob

# End-to-end smoke test
bd create "auto-dispatch smoke test" \
  -d "Read /etc/hostname inside the agent sandbox and close." \
  --acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'
```

@@ -1,126 +0,0 @@
# Runbook: Forgejo registry break-glass — recovering infra-ci

Last updated: 2026-05-07

## When to use this runbook

When **all** of the following are true:

1. Forgejo (`forgejo.viktorbarzin.me`) is unreachable.
2. `registry-private` is also gone (post-Phase 4 of the consolidation), so you can't fall back to `registry.viktorbarzin.me:5050/infra-ci`.
3. You need to run an infra Woodpecker pipeline (apply, build-cli, drift-detection, etc.) — but those pipelines pull `infra-ci` and crash because the registry is down.

If only Forgejo is down but `registry-private` is still alive, the pipelines work — `image:` references in `infra/.woodpecker/*.yml` still hit `registry.viktorbarzin.me:5050/infra-ci` until Phase 3 flips them. Skip this runbook entirely.
## What's available

The `build-ci-image.yml` Woodpecker pipeline saves a tarball after each successful push:

| Location | Path |
|---|---|
| Registry VM disk (10.0.20.10) | `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` |
| Registry VM disk (latest symlink) | `/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz` |
| Synology NAS (offsite copy via daily-backup sync) | `/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/` |

The registry VM keeps the last 5 tarballs. Synology mirrors them through the existing offsite-sync-backup job (`/usr/local/bin/offsite-sync-backup`).
## Recovery procedure

The goal is to get a working `infra-ci` image onto a k8s node so Woodpecker pods can run it. Then run a Woodpecker pipeline that restores Forgejo from PVC backup or rebuilds it.

### Step 1 — copy the tarball to a node

From your workstation (the registry VM is reachable but Forgejo is not — the rest of the cluster might be in a similar partial state):

```bash
ssh wizard@10.0.20.103   # any responsive k8s node
sudo mkdir -p /var/breakglass
sudo scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz \
  /var/breakglass/
```

If the registry VM is also down, fall back to Synology:

```bash
sudo scp 192.168.1.13:/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/infra-ci-latest.tar.gz \
  /var/breakglass/
```
### Step 2 — load into containerd

`docker load` won't help on a k8s node — it loads into the docker daemon, which kubelet/containerd doesn't see. Use `ctr`:

```bash
sudo ctr -n k8s.io images import /var/breakglass/infra-ci-latest.tar.gz
sudo ctr -n k8s.io images list | grep infra-ci
```

Confirm the image is tagged with the original repository name (`registry.viktorbarzin.me:5050/infra-ci:<sha>` — the tarball was saved with that tag, NOT the Forgejo name).
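If the pipeline you need references a different tag than the one in the tarball (for example `:latest` vs a specific SHA), a retag sketch — both tag values are placeholders:

```bash
# Re-tag the imported image so it matches what the pipeline's `image:` line expects.
sudo ctr -n k8s.io images tag \
  registry.viktorbarzin.me:5050/infra-ci:<sha> \
  registry.viktorbarzin.me:5050/infra-ci:latest
```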
### Step 3 — pin pods to this node

Add a node selector or taint-toleration to whatever pipeline you need to run. Simplest: cordon the other nodes briefly so Woodpecker schedules onto this one.

```bash
for n in $(kubectl get nodes -o name | grep -v $(hostname)); do
  kubectl cordon ${n#node/}
done
```

Run the pipeline. After it completes:

```bash
for n in $(kubectl get nodes -o name); do
  kubectl uncordon ${n#node/}
done
```
### Step 4 — fix the underlying problem

The pipeline you just ran was meant to restore Forgejo. Common options:

- **Forgejo PVC corrupt** — `docs/runbooks/forgejo-registry-rebuild-image.md` walks through PVC restore from LVM snapshot or PVE backup.
- **Forgejo OOM-loop** — bump memory request+limit in `infra/stacks/forgejo/main.tf` and apply.
- **Forgejo unreachable due to network** — check Traefik, MetalLB, pfSense.

Once Forgejo is back, run `build-ci-image.yml` manually so the tarball regenerates with the latest commit.

## Why this exists

The 2026-04-19 post-mortem on the registry-orphan-index incident showed that a single registry going corrupt could block ALL infra pipelines (because every pipeline pulls `infra-ci` from that registry). The dual-push to Forgejo + registry-private removes that single point of failure during the bake. After Phase 4 decommissions registry-private, the tarball is the last line of defense.

## Why on the registry VM and not in-cluster

The Forgejo pod and the registry-private pod both depend on cluster networking + storage. The registry VM is an independent, non-clustered VM with local storage. If the cluster is in a bad state, the VM's disk is still readable from any other host on the LAN.

@@ -1,128 +0,0 @@
# Runbook: Rebuild an Image on the Forgejo OCI Registry

Last updated: 2026-05-07

## When to use this

Pipelines pulling from `forgejo.viktorbarzin.me/viktor/<image>` fail with:

- `failed to resolve reference … : not found`
- `manifest unknown`
- HEAD on a manifest/blob digest returns 404
- `forgejo-integrity-probe` CronJob in `monitoring` reports `registry_manifest_integrity_failures > 0` for `instance="forgejo.viktorbarzin.me"`

This is the Forgejo equivalent of the registry-private orphan-index failure mode (`docs/post-mortems/2026-04-19-registry-orphan-index.md`). The cause is usually a package-version delete racing an in-flight pull, or PVC corruption. The fix is to rebuild the image from source and re-push, so Forgejo receives a complete, fresh upload.

If the symptom is different (Forgejo unreachable, PVC OOM, authentication failure), use:

- `docs/runbooks/forgejo-registry-setup.md` for auth + token issues
- `docs/runbooks/forgejo-registry-breakglass.md` if Forgejo + the cluster are both unreachable
- `docs/runbooks/restore-pvc-from-backup.md` for PVC corruption
## Phase 1 — Confirm the diagnosis

From any host:

```sh
REG=forgejo.viktorbarzin.me
USER=cluster-puller
PASS="$(vault kv get -field=forgejo_pull_token secret/viktor)"
IMAGE=viktor/payslip-ingest
TAG=latest

# 1. Confirm the manifest exists at all.
curl -sk -u "$USER:$PASS" \
  -H 'Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json' \
  "https://$REG/v2/$IMAGE/manifests/$TAG" \
  | jq '.mediaType, (.manifests[]?.digest // empty), (.config.digest // empty)'

# 2. HEAD each child / config / layer digest. Any non-200 = confirmed.
for d in $(curl -sk -u "$USER:$PASS" -H 'Accept: application/vnd.oci.image.index.v1+json' \
    "https://$REG/v2/$IMAGE/manifests/$TAG" | jq -r '.manifests[]?.digest // empty'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d → $code"
done
```
The probe's last log run is also a fast way to see what's affected:

```sh
kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name -o name \
    | grep forgejo-integrity-probe | head -1)
```
## Phase 2 — Rebuild and re-push

Forgejo lets you delete a specific package version through the API. Doing this **before** the rebuild ensures the new push doesn't collide with the half-broken existing entry.

```sh
# Delete the broken version (replace TAG with the actual tag).
curl -X DELETE -H "Authorization: token $(vault kv get -field=forgejo_cleanup_token secret/viktor)" \
  "https://$REG/api/v1/packages/viktor/container/$(basename $IMAGE)/$TAG"
```

Rebuild via Woodpecker (manual run if the pipeline isn't triggered by a code change):

1. Open `https://ci.viktorbarzin.me/repos/<repo>/manual` for the project.
2. Click **Run pipeline** with `branch=master`.
3. Wait for the build-and-push step to complete.
4. Confirm the new version is visible in the Forgejo Web UI under `viktor/<image>` → Packages → Container.
## Phase 3 — Restart consumers

Pods that already cached the broken digest may continue using it. Force a fresh pull:

```sh
kubectl rollout restart deploy/<service> -n <ns>
```

If the pod still fails, the new manifest digest may not have propagated through containerd's cache. Drain + restart containerd on the affected node:

```sh
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
ssh wizard@<node> sudo systemctl restart containerd
kubectl uncordon <node>
```
## Phase 4 — Verify integrity recovery

The next probe run (every 15 min) will report:

```
registry_manifest_integrity_failures{instance="forgejo.viktorbarzin.me"} 0
```

The `RegistryManifestIntegrityFailure` alert resolves automatically 30 minutes after the metric goes back to 0.
## Why this happens

Forgejo's OCI registry stores blobs in its own DB+filesystem. Unlike `registry:2` + `distribution`, it doesn't have the [`distribution#3324`](https://github.com/distribution/distribution/issues/3324) GC-vs-tag-delete race. But it can still reach a broken state if:

- The retention CronJob deletes a version while a pull is in flight on the same digest.
- The PVC fills up mid-push (`docs/runbooks/restore-pvc-from-backup.md`).
- A Forgejo upgrade migrates the package schema and a row is dropped.

In all cases the recovery procedure is identical: delete the broken version through the API, rebuild from source, force consumers to re-pull.

@@ -1,163 +0,0 @@
# Runbook: Forgejo OCI registry — initial setup

Last updated: 2026-05-07

This runbook covers the **one-time** bootstrap of Forgejo's container registry, executed during Phase 0 of the registry consolidation plan (`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md`).

After this runbook is complete, the Forgejo OCI registry at `forgejo.viktorbarzin.me` accepts pushes from CI and pulls from the cluster, with retention and integrity monitoring in place.

## Order of operations

The Terraform stacks reference Vault keys that don't exist on a fresh cluster. Create the keys **before** running `scripts/tg apply`.

1. Apply the resource bumps (memory, PVC, ingress body size, packages env vars) — these don't depend on the new Vault keys.
2. Create the service-account users + PATs in Forgejo.
3. Push the PATs to Vault.
4. Apply the rest of Phase 0 (registry-credentials extension, monitoring probe, retention CronJob).
### Step 1 — apply Forgejo deployment bumps

```bash
cd infra/stacks/forgejo
scripts/tg apply
```

Wait for the new pod to come up with the bumped 1Gi memory request and the resized 15Gi PVC. Verify packages are enabled:

```bash
kubectl exec -n forgejo deploy/forgejo -- forgejo manager flush-queues
kubectl exec -n forgejo deploy/forgejo -- env | grep PACKAGES
```
### Step 2 — create service-account users

`forgejo admin user create` must be run with `--must-change-password=false` for service accounts. It is not idempotent — re-running it on an existing user errors out; that's fine, skip it on rerun.

```bash
# cluster-puller — read:package PAT for in-cluster pulls.
kubectl exec -n forgejo deploy/forgejo -- \
  forgejo admin user create \
    --username cluster-puller \
    --email cluster-puller@viktorbarzin.me \
    --password "$(openssl rand -base64 24)" \
    --must-change-password=false

# ci-pusher — write:package PAT for CI dual-push, also reused as the
# cleanup CronJob credential (write:package includes delete).
kubectl exec -n forgejo deploy/forgejo -- \
  forgejo admin user create \
    --username ci-pusher \
    --email ci-pusher@viktorbarzin.me \
    --password "$(openssl rand -base64 24)" \
    --must-change-password=false
```

The user passwords are throwaway — we only ever auth via PAT. A Forgejo admin can reset them at any time from the Web UI.
### Step 3 — generate the PATs

PATs **must** be generated through the Web UI logged in as the respective user (the CLI doesn't expose token creation). To log in without OAuth (registration is disabled for everyone except `viktor`, the admin), use the per-user temporary password from step 2.

For each of `cluster-puller` and `ci-pusher`:

1. Sign out of `viktor`.
2. Go to `https://forgejo.viktorbarzin.me/user/login` and sign in with the throwaway password.
3. Settings → Applications → Generate new token.
4. Name: `cluster-pull` / `ci-push`. **Expiration: never.**
5. Scopes:
   - `cluster-puller`: `read:package`
   - `ci-pusher`: `write:package` (covers read+write+delete)
6. Save the token shown on the next page — it is **not** displayed again.

For the cleanup CronJob, generate a third PAT on `ci-pusher`:

7. Repeat steps 4-6 with name `cleanup`, scope `write:package`.
### Step 4 — push PATs to Vault

```bash
vault login -method=oidc

# Read-only, used by the cluster-wide registry-credentials Secret and
# by the Forgejo integrity probe.
vault kv patch secret/viktor \
  forgejo_pull_token=<paste cluster-puller PAT>

# Write+delete, used by the retention CronJob inside Forgejo's
# namespace.
vault kv patch secret/viktor \
  forgejo_cleanup_token=<paste ci-pusher cleanup PAT>

# Write, propagated by vault-woodpecker-sync to all Woodpecker repos.
vault kv patch secret/ci/global \
  forgejo_user=ci-pusher \
  forgejo_push_token=<paste ci-pusher push PAT>
```
### Step 5 — apply the rest of Phase 0

```bash
# Registry credential Secret (now reads forgejo_pull_token).
cd infra/stacks/kyverno && scripts/tg apply

# Monitoring probe + retention CronJob.
cd infra/stacks/monitoring && scripts/tg apply
cd infra/stacks/forgejo && scripts/tg apply

# Containerd hosts.toml on each existing k8s node — VM cloud-init
# only fires on first boot.
infra/scripts/setup-forgejo-containerd-mirror.sh
```
## Verification

```bash
# Login from a workstation with docker.
echo "<ci-pusher PAT>" | docker login forgejo.viktorbarzin.me -u ci-pusher --password-stdin

# Push a smoketest image.
docker pull alpine:3.20
docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1
docker push forgejo.viktorbarzin.me/viktor/smoketest:1

# Pull from a k8s node.
ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1

# Confirm the cluster-wide Secret was synced into a fresh namespace.
kubectl create namespace forgejo-smoketest
kubectl get secret -n forgejo-smoketest registry-credentials \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
# Expect: ["10.0.20.10:5050", "forgejo.viktorbarzin.me",
#          "registry.viktorbarzin.me", "registry.viktorbarzin.me:5050"]
kubectl delete namespace forgejo-smoketest

# Delete the smoketest package via API.
curl -X DELETE -H "Authorization: token <ci-pusher cleanup PAT>" \
  https://forgejo.viktorbarzin.me/api/v1/packages/viktor/container/smoketest/1
```
## When to revisit

- **PAT rotation**: PATs created here have no expiry by design. If a PAT leaks, regenerate it via the Web UI and `vault kv patch` the new value into the same key — the next `terragrunt apply` syncs it to all consumers within minutes (the Kyverno ClusterPolicy clones the Secret; vault-woodpecker-sync runs every 6h).
- **New service account**: if a future workload needs different scopes, add a parallel user/PAT here rather than expanding an existing PAT's scope. Principle of least privilege.

@ -1,222 +0,0 @@
# pfSense HAProxy for Mailserver — Runbook

Last updated: 2026-04-19 (Phase 6 complete)

## What & why

External mail traffic (SMTP/IMAP) requires **real client IP visibility** for CrowdSec + Postfix rate-limiting. MetalLB cannot inject PROXY-protocol headers (see [`mailserver-proxy-protocol.md`](./mailserver-proxy-protocol.md)), so pfSense runs a small HAProxy that:

1. Listens on the pfSense VLAN20 IP (`10.0.20.1`) on all 4 mail ports,
2. Forwards each connection to a k8s node's NodePort with `send-proxy-v2`,
3. Injects PROXY v2 framing so Postfix/Dovecot see the original client IP,
4. TCP-checks every k8s worker via dedicated **non-PROXY healthcheck NodePorts** (30145/30146/30147 → pod stock 25/465/587 listeners, no PROXY required).

This split path avoids the `smtpd_peer_hostaddr_to_sockaddr` fatal that used to fire on every PROXY-aware health probe and throttled real client connections.

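For orientation, the effective HAProxy config for the :25 path looks roughly like the sketch below. The node IPs and pool name here are illustrative; the real config is generated from `pfsense-haproxy-bootstrap.php` (see Operations), so do not copy this verbatim.

```
frontend mail-smtp
    mode tcp
    bind 10.0.20.1:25
    default_backend mail-smtp-pool

backend mail-smtp-pool
    mode tcp
    # Data path uses PROXY v2 towards NodePort 30125; health checks hit the
    # non-PROXY NodePort 30145 instead.
    server k8s-node1 10.0.20.11:30125 send-proxy-v2 check port 30145 inter 5000
    server k8s-node2 10.0.20.12:30125 send-proxy-v2 check port 30145 inter 5000
    server k8s-node3 10.0.20.13:30125 send-proxy-v2 check port 30145 inter 5000
    server k8s-node4 10.0.20.14:30125 send-proxy-v2 check port 30145 inter 5000
```
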
Corresponding k8s-side setup (`stacks/mailserver/modules/mailserver/`):

- ConfigMap `mailserver-user-patches` → `user-patches.sh` appends 3 alt `master.cf` services to Postfix (sketched below):
  - `:2525` postscreen (alt :25) with `postscreen_upstream_proxy_protocol=haproxy`
  - `:4465` smtpd (alt :465 SMTPS) with `smtpd_upstream_proxy_protocol=haproxy`
  - `:5587` smtpd (alt :587 submission) with `smtpd_upstream_proxy_protocol=haproxy`
- ConfigMap `mailserver.config` adds Dovecot `inet_listener imaps_proxy` on port 10993 with `haproxy = yes` and `haproxy_trusted_networks = 10.0.20.0/24`.
- Service `mailserver-proxy` (NodePort, ETP:Cluster) — 4 PROXY data ports + 3 non-PROXY healthcheck ports:
  - Data (PROXY v2):
    - `port 25 → targetPort 2525 → nodePort 30125`
    - `port 465 → targetPort 4465 → nodePort 30126`
    - `port 587 → targetPort 5587 → nodePort 30127`
    - `port 993 → targetPort 10993 → nodePort 30128`
  - Healthcheck (no PROXY, stock SMTP/SMTPS/Submission listeners):
    - `port 2500 → targetPort 25 → nodePort 30145` (smtp-check)
    - `port 4650 → targetPort 465 → nodePort 30146` (smtps-check)
    - `port 5870 → targetPort 587 → nodePort 30147` (sub-check)
- Service `mailserver` (ClusterIP) — unchanged stock ports 25/465/587/993 for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor` CronJob, book-search). These listeners are PROXY-free.

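A rough sketch of what that `user-patches.sh` does for the :2525 listener, illustrative only: the exact `postconf` invocations live in the ConfigMap, and the `postconf -Me`/`-P` forms shown here are assumptions about the docker-mailserver patching style.

```bash
#!/bin/bash
# user-patches.sh (sketch): add a PROXY-aware postscreen listener on :2525
# and enable the PROXY v2 parser on it. The :4465/:5587 smtpd services
# follow the same pattern with smtpd_upstream_proxy_protocol=haproxy.
postconf -Me "2525/inet=2525 inet n - n - 1 postscreen"
postconf -P "2525/inet/postscreen_upstream_proxy_protocol=haproxy"
```
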
bd: `code-yiu`.
## Steady-state architecture
```
External mail (WAN) path — PROXY v2
┌─────────────────────────────────────────────────────────────────────┐
│ Client (real IP) │
│ │ SMTP/SMTPS/Sub/IMAPS │
│ ▼ │
│ pfSense WAN:{25,465,587,993} │
│ │ NAT rdr → 10.0.20.1:{same} │
│ ▼ │
│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools) │
│ │ data: send-proxy-v2 → :{30125..30128} (PROXY-aware pod) │
│ │ health: TCP-check → :{30145..30147} (no-PROXY pod) │
│ │ inter 5000 │
│ ▼ │
│ k8s-node<1-4>:{30125..30128} ← any node (ETP:Cluster) │
│ │ kube-proxy SNAT (source IP lost on the wire) │
│ ▼ │
│ mailserver pod :{2525,4465,5587,10993} │
│ │ postscreen / smtpd / Dovecot parse PROXY v2 header │
│ │ → real client IP recovered despite kube-proxy SNAT │
│ ▼ │
│ CrowdSec + Postfix / Dovecot see the true source IP ✓ │
└─────────────────────────────────────────────────────────────────────┘
Intra-cluster path — no PROXY
┌─────────────────────────────────────────────────────────────────────┐
│ Roundcube pod / email-roundtrip-monitor CronJob │
│ │ SMTP/IMAP │
│ ▼ │
│ mailserver.mailserver.svc.cluster.local:{25,465,587,993} │
│ │ ClusterIP — bypasses LoadBalancer/NodePort layer entirely │
│ ▼ │
│ mailserver pod stock :{25,465,587,993} (PROXY-free) │
└─────────────────────────────────────────────────────────────────────┘
```
## Validation

```sh
# All HAProxy frontends listening
ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
# Expect: *:25, *:465, *:587, *:993, *:2525 (test port)

# All backend pools healthy
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
  | awk 'NR>1 {print $3, $4, $6}'
# srv_op_state 2 = UP, 0 = DOWN

# Container listens on all 8 ports
kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
  ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'

# pf rdr points at pfSense (10.0.20.1), not <mailserver> alias
ssh admin@10.0.20.1 'pfctl -sn' | grep -E 'port = (25|submission|imaps|smtps)'

# E2E probe — Brevo → external MX :25 → IMAP fetch
kubectl create job --from=cronjob/email-roundtrip-monitor probe-test -n mailserver
kubectl wait --for=condition=complete --timeout=90s job/probe-test -n mailserver
kubectl logs job/probe-test -n mailserver | grep SUCCESS
kubectl delete job probe-test -n mailserver

# Real client IP in maillog post-delivery
kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
  | grep 'smtpd-proxy25.*CONNECT from' | tail -5
# Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
```

## Bootstrap / restore from scratch

pfSense HAProxy config lives in `/cf/conf/config.xml` under `<installedpackages><haproxy>`. That file is scp'd nightly to `/mnt/backup/pfsense/config-YYYYMMDD.xml` by `scripts/daily-backup.sh`, then synced to Synology. To rebuild from source of truth (git):

```sh
scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
```

The script is idempotent — re-runs reset the mailserver frontends + backends to the declared state.

Expected output:

```
haproxy_check_and_run rc=OK
```

## Operations

### Change backend k8s node IPs / NodePorts

Edit `infra/scripts/pfsense-haproxy-bootstrap.php` — `$NODES` array + the `build_pool()` port arguments. Re-run the bootstrap command above. Don't hand-edit `/var/etc/haproxy/haproxy.cfg` — it is regenerated from XML on every apply.

### Check health of backends

```sh
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"
```

`srv_op_state=2` means UP, `0` means DOWN.

### View live HAProxy stats (WebUI)

`https://pfsense.viktorbarzin.me` → Services → HAProxy → Stats.

### Reload after config.xml edit

```sh
ssh admin@10.0.20.1 'pfSsh.php playback svc restart haproxy'
```

### Rollback (flip NAT back to MetalLB, post-Phase-6 only partial)

There is no Phase-6 rollback one-liner. Phase 6 removed the MetalLB LoadBalancer 10.0.20.202 entirely, so un-flipping NAT now would send traffic to a dead alias. To regress:

1. Re-add `metallb.io/loadBalancerIPs = "10.0.20.202"` + `type = "LoadBalancer"` + `external_traffic_policy = "Local"` to `kubernetes_service.mailserver`, apply (see the sketch below).
2. Re-add the `mailserver` host alias in pfSense pointing at 10.0.20.202 (Firewall → Aliases → Hosts).
3. Run `infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php` on pfSense.

For rollback of just the NAT (Phase 4) without touching the Service, only the third step is needed — but only meaningful BEFORE Phase 6.

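A minimal sketch of what step 1 adds back to the Terraform Service resource; attribute placement is illustrative (the real resource lives in the mailserver module with its full selector and port list):

```hcl
# Sketch only: re-promote the mailserver Service to a MetalLB LoadBalancer.
resource "kubernetes_service" "mailserver" {
  metadata {
    name      = "mailserver"
    namespace = "mailserver"
    annotations = {
      "metallb.io/loadBalancerIPs" = "10.0.20.202"
    }
  }
  spec {
    type                    = "LoadBalancer"
    external_traffic_policy = "Local"
    # ... existing selector + ports stay unchanged ...
  }
}
```
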
### Restore from backup

pfSense config backup is a plain XML file:

```
/mnt/backup/pfsense/config-YYYYMMDD.xml          # sda host copy (1.1TB RAID1)
/volume1/Backup/Viki/pve-backup/pfsense/...      # Synology offsite
```

Full restore: pfSense WebUI → Diagnostics → Backup & Restore → Upload that `config.xml`. The `<installedpackages><haproxy>` section is included.

## Phase history (bd code-yiu)

| Phase | Status | Description |
|---|---|---|
| 1a | ✅ commit `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed (`pfSense-pkg-haproxy-devel-0.63_2`, HAProxy 2.9-dev6) |
| 3 | ✅ commit `ba697b02` | HAProxy config persisted in pfSense XML (bootstrap script + this runbook) |
| 4+5 | ✅ commit `9806d515` | 4-port alt listeners + HAProxy frontends for 25/465/587/993 + NAT flip |
| 6 | ✅ this commit | Mailserver Service downgraded LoadBalancer → ClusterIP; `10.0.20.202` released back to MetalLB pool; orphan `mailserver` pfSense alias removed; monitors retargeted |

## Known warts

- ~~HAProxy TCP health-check with `send-proxy-v2` generates `getpeername: Transport endpoint not connected` warnings on postscreen every check cycle.~~ **Resolved 2026-05-05**: dedicated non-PROXY healthcheck NodePorts (30145/30146/30147 → stock pod 25/465/587) added; HAProxy now checks those, eliminating both the `getpeername` postscreen warnings and the `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported` fatals that were throttling smtpd respawns and causing ~50% client timeouts on the public 587 path. `inter` dropped 120000 → 5000 (fast failover, no log-spam concern). `option smtpchk` was tried but flapped against postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP). Plain TCP check on the no-PROXY ports is sufficient.
- Frontend binds on all pfSense interfaces (`bind :25` instead of `10.0.20.1:25`). `<extaddr>` is set in XML but pfSense templates it port-only. Low concern in practice because WAN firewall rules plus the NAT rdr gate external access; internal VLAN clients can simply reach HAProxy on any pfSense-local IP.
- k8s-node5 doesn't exist — the cluster has master + 4 workers. Backend pool capped at 4 servers.
- Postscreen still logs `improper command pipelining` for legitimate clients that send `EHLO\r\nQUIT\r\n` as a single TCP write. This is unchanged pre/post-migration — postscreen's anti-bot heuristic.

@ -1,181 +0,0 @@
# Mailserver PROXY protocol — research & decision

Last updated: 2026-04-18 (original research). **Outcome implemented 2026-04-19 — see [UPDATE](#update-2026-04-19) below.**

> ## UPDATE (2026-04-19)
>
> This doc describes the research that led to the Phase-6 rollout. **Option C (pfSense HAProxy + PROXY v2)** was chosen and is now live. Operational state, cutover history, bootstrap, and rollback procedures live in [`mailserver-pfsense-haproxy.md`](mailserver-pfsense-haproxy.md).
>
> This file is retained as a decision record — it explains *why* Option A (pod-pinning via nodeSelector) was rejected mid-session in favour of Option C, and documents the MetalLB upstream limitation (PROXY injection is explicitly won't-implement). Future debates of "why don't we just pin the pod?" should land here first.

## TL;DR

**MetalLB does not and will not inject PROXY protocol headers.** The original plan (`/home/wizard/.claude/plans/let-s-work-on-linking-temporal-valiant.md`, task `code-rtb`) assumed MetalLB could be configured to emit PROXY v1/v2 on behalf of the `mailserver` LoadBalancer Service. That assumption is wrong at the product level. MetalLB is a control-plane-only announcer (ARP/NDP for L2 mode, BGP for L3 mode); it never touches the L4 payload.

As a result, there is no single Terraform change that can flip `externalTrafficPolicy: Local` → `Cluster` on the `mailserver` Service while preserving the real client IP for Postfix/postscreen and Dovecot. Three alternative paths exist (see below); none is trivial.

## Environment (verified 2026-04-18)

- **MetalLB version**: `quay.io/metallb/controller:v0.15.3` / `quay.io/metallb/speaker:v0.15.3` (5 speakers).
- **Advertisement type**: L2Advertisement `default` bound to IPAddressPool `default` (10.0.20.200–10.0.20.220). No BGPAdvertisements.
- **Service**: `mailserver/mailserver` — type `LoadBalancer`, `loadBalancerIPs: 10.0.20.202`, `externalTrafficPolicy: Local`, `healthCheckNodePort: 30234`, 5 ports (25, 465, 587, 993, 9166/dovecot-metrics).
- **Pod**: single replica today; RWO PVCs prevent horizontal scale without further work (`mailserver-data-encrypted`, `mailserver-letsencrypt-encrypted`).

## Why the original plan fails

### MetalLB never touches packets

> *"MetalLB is controlplane only, making it part of the dataplane means we would be responsible for the performance of the system, so more bugs to fight, I personally don't see that happening."*
> — MetalLB maintainer `champtar`, 2021-01-06
> (issue [#797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797))

Issue #797 is closed as "won't implement". Repeat asks in 2022–2023 got the same answer. The v0.15.3 API surface confirms this: no `proxyProtocol`/`haproxy`/`protocol: proxy` field exists on `IPAddressPool`, `L2Advertisement`, `BGPAdvertisement`, or as a Service annotation.

Only managed-cloud LBs (AWS NLB, Azure LB, OCI, DO, OVH, Scaleway, etc.) offer PROXY protocol as a tick-box. MetalLB's equivalents are:

| MetalLB feature | Does it preserve client IP? | Comment |
|---|---|---|
| `externalTrafficPolicy: Local` (current) | Yes, via iptables DNAT on the speaker node | Forces pod↔speaker colocation on L2 mode. This is the pain we wanted to avoid. |
| `externalTrafficPolicy: Cluster` | No — kube-proxy SNATs to the node IP | The problem we would re-introduce if we flipped without PROXY injection. |
| PROXY protocol injection | N/A — not implemented | Dead end. |

### The `Local` trap is real, but narrower than it seems

Today's `Local` policy means the ARP announcer node must also host the mailserver pod. MetalLB always picks a single speaker to advertise the VIP (leader election per IP), so in practice exactly one node matters at any moment. A pod rescheduled to a different node silently drops inbound SMTP/IMAP until a GARP flip or node cordon.

The only pods on our cluster that see this same class of risk are Traefik (3 replicas + PDB `minAvailable=2`, so 2 of 3 nodes always have a pod) and mailserver (1 replica). Traefik survives because the pods outnumber the nodes that could be the speaker at once; the mailserver cannot.

## Alternative paths (ranked by effort)

### Option A — Pin the mailserver pod to a specific node (SIMPLEST)

Add `nodeSelector` on the mailserver Deployment pointing at a label that's also stamped on the MetalLB speaker we want to advertise the VIP from, and use MetalLB's [node selector](https://metallb.io/configuration/_advanced_l2_configuration/#specify-network-interfaces-that-lb-ip-can-be-announced-from) on `L2Advertisement.spec.nodeSelectors` to pin the announcer to the same node.

Trade-offs:

- Zero changes to Postfix/Dovecot configs.
- Keeps `externalTrafficPolicy: Local` — real client IP keeps arriving.
- Loses HA (the whole point of the MetalLB layer) but reflects reality — one replica, one PVC, no HA today anyway.
- Drain of that node requires a planned cutover, but that's no worse than today's silent failure mode.

Implementation (~10 lines of Terraform):

```hcl
# In stacks/mailserver/modules/mailserver/main.tf, on the Deployment:
node_selector = { "viktorbarzin.me/mailserver-anchor" = "true" }

# In stacks/platform (or wherever the MetalLB CRs live):
resource "kubernetes_manifest" "mailserver_l2ad" {
  manifest = {
    apiVersion = "metallb.io/v1beta1"
    kind       = "L2Advertisement"
    metadata   = { name = "mailserver", namespace = "metallb-system" }
    spec = {
      ipAddressPools = ["default"]
      nodeSelectors  = [{ matchLabels = { "viktorbarzin.me/mailserver-anchor" = "true" } }]
    }
  }
}
```

Plus a node label via `kubectl label node k8s-node3 viktorbarzin.me/mailserver-anchor=true`.

**Recommendation: this is the shortest path to eliminating the silent-drop failure mode** without taking on a new proxy tier.

### Option B — Put a HAProxy sidecar in front of Postfix/Dovecot

Stand up an in-cluster HAProxy with PROXY v2 enabled on the frontend and `send-proxy-v2` on the backend to `mailserver:25/465/587/993`. Expose HAProxy via a new MetalLB Service with `externalTrafficPolicy: Cluster` + kube-proxy DSR workaround (still loses client IP at that layer), or run HAProxy on the host-network of the same node (back to Option A's colocation).

Trade-offs:

- Introduces one more network hop and TLS-termination decision for every SMTP connect.
- HAProxy needs its own cert rotation (or `tls-passthrough`) — adds moving parts to an already crowded mailserver module.
- Doesn't actually solve the colocation problem on its own — HAProxy itself needs to receive the client IP, so we are back to externalTrafficPolicy constraints for HAProxy.

**Recommendation: avoid unless we also get HA for mailserver itself, which needs RWX storage + DB split-brain work — out of scope.**

### Option C — Replace MetalLB with a different LB for this Service

Candidates: [kube-vip](https://kube-vip.io/) (supports eBPF-based DSR but not PROXY injection either), [Cilium LB](https://docs.cilium.io/en/stable/network/lb-ipam/) (preserves client IP via DSR in hybrid mode), or a dedicated HAProxy running on pfSense and NAT-forwarding 25/465/587/993 with PROXY headers to a ClusterIP-exposed mailserver. Cilium requires a CNI migration (we run Calico today); pfSense HAProxy is genuinely feasible but belongs in a different bd task.

**Recommendation: track as P3 follow-up under a new bd task if Option A proves insufficient.**

## Decision

Do nothing in this session beyond this runbook + the bd note. The `code-rtb` task as written is not executable — MetalLB cannot inject PROXY headers, so the Postfix/Dovecot config changes the plan proposed would never receive the header they expect; they would hang waiting for it and then time out (5s per connection).

Follow-up work filed as bd child tasks (if user wants to pursue):

- **Option A — pin mailserver + L2Advertisement nodeSelectors** (new bd task)
- **Option C — HAProxy on pfSense with PROXY v2 to a ClusterIP** (new bd task)

## References

- [MetalLB issue #797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797) (closed, won't implement)
- [MetalLB issue #796 — Source IP Preservation discussion](https://github.com/metallb/metallb/issues/796)
- Postfix [postscreen_upstream_proxy_protocol](https://www.postfix.org/postconf.5.html#postscreen_upstream_proxy_protocol) — expects the PROXY header *on every incoming connection*; if absent, postscreen drops the connection after `postscreen_upstream_proxy_timeout`.
- Dovecot [haproxy_trusted_networks](https://doc.dovecot.org/settings/core/#core_setting-haproxy_trusted_networks) — treats the header as mandatory for listed source networks.
- Cluster state verified against: `kubectl -n metallb-system get pods`, `kubectl get ipaddresspools.metallb.io -A`, `kubectl get l2advertisements.metallb.io -A`, `kubectl get bgpadvertisements.metallb.io -A`, `kubectl -n mailserver get svc mailserver -o yaml`.

@ -1,66 +0,0 @@
# NFS Prerequisites for `modules/kubernetes/nfs_volume`

The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the underlying directory on the server.

If the path does not exist, the first pod that tries to mount the resulting PVC gets stuck in `ContainerCreating` with the kubelet event:

```
MountVolume.SetUp failed for volume "<name>" : mount failed: exit status 32
mount.nfs: mounting 192.168.1.127:/srv/nfs/<path> failed, reason given by
server: No such file or directory
```

## Bootstrap before first apply

Before adding a new `nfs_volume` consumer (backup CronJob, data PV, etc.), create the export root on the PVE host:

```sh
# Replace <app> with the backup stack name, e.g. mailserver-backup,
# roundcube-backup, immich-backup, etc.
ssh root@192.168.1.127 'mkdir -p /srv/nfs/<app> && chmod 755 /srv/nfs/<app>'

# Confirm exports are live (no change to /etc/exports needed — `/srv/nfs`
# is already exported via the root entry in pve-nfs-exports).
ssh root@192.168.1.127 exportfs -v | grep '/srv/nfs\b'
```

`/srv/nfs` is exported with the root entry. Subdirectories inherit the export automatically; they just have to exist on disk.

## Known consumers

| Consumer | NFS path | Owning stack |
|---|---|---|
| `mailserver-backup` | `/srv/nfs/mailserver-backup` | `stacks/mailserver/` |
| `roundcube-backup` | `/srv/nfs/roundcube-backup` | `stacks/mailserver/` |
| `mysql-backup` | `/srv/nfs/mysql-backup` | `stacks/dbaas/` |
| `postgresql-backup` | `/srv/nfs/postgresql-backup` | `stacks/dbaas/` |
| `vaultwarden-backup` | `/srv/nfs/vaultwarden-backup` | `stacks/vaultwarden/` |

Use `grep -rn 'nfs_volume' infra/stacks/` to find all active consumers.

## Why not auto-create?

Two options were considered for automating this:

1. `null_resource` + `local-exec` SSH `mkdir` in the `nfs_volume` module (sketched below) — works but adds an SSH dependency to every Terraform run, makes the module non-hermetic, and fails if the operator does not have SSH to the PVE host.
2. `nfs-subdir-external-provisioner` — handles subdirs automatically but changes the PV/PVC shape and would require migrating all existing consumers.

Neither is worth the churn for a one-time operation per new backup stack. Document + checklist is the current call; re-evaluate if we start adding one NFS consumer per week.

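For the record, option 1 would look roughly like this inside the module. A sketch only (the `var.server_path` variable name is illustrative, not the module's real input), kept here so the trade-off stays concrete:

```hcl
# Rejected approach: create the export directory as a Terraform side effect.
# Ties every plan/apply to SSH reachability of the PVE host.
resource "null_resource" "nfs_dir" {
  triggers = { path = var.server_path }

  provisioner "local-exec" {
    command = "ssh root@192.168.1.127 'mkdir -p ${var.server_path} && chmod 755 ${var.server_path}'"
  }
}
```
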
## Related tasks

- `code-yo4` — this runbook
- `code-z26` — mailserver backup CronJob (first-time setup hit this)
- `code-1f6` — Roundcube backup CronJob (also hit this)

@ -1,281 +0,0 @@
# pfSense Unbound DNS Resolver

Last updated: 2026-04-19

## Overview

pfSense runs **Unbound** (DNS Resolver) as its sole DNS service, replacing dnsmasq (DNS Forwarder) as of 2026-04-19 (DNS hardening Workstream D, bd `code-k0d`).

Unbound AXFR-slaves the `viktorbarzin.lan` zone from the Technitium primary via the `10.0.20.201` LoadBalancer, so LAN-side `.lan` resolution survives a full Kubernetes outage. Public queries go to Cloudflare via DNS-over-TLS (`1.1.1.1` + `1.0.0.1` on port 853, SNI `cloudflare-dns.com`).

## Listeners

Unbound binds on:

| Interface | IP | Purpose |
|-----------|-----|---------|
| WAN | `192.168.1.2:53` | LAN (192.168.1.0/24) clients querying via pfSense WAN |
| LAN | `10.0.10.1:53` | Management VLAN clients |
| OPT1 | `10.0.20.1:53` | K8s VLAN clients (CoreDNS upstream) |
| lo0 | `127.0.0.1:53` | pfSense itself |

The prior WAN NAT `rdr` rule (`192.168.1.2:53 → 10.0.20.201`) was removed in the same change — Unbound now answers directly on WAN.

## Config Summary

Relevant `<unbound>` keys in `/cf/conf/config.xml`:

| Key | Value | Meaning |
|-----|-------|---------|
| `enable` | flag | Enable Unbound |
| `dnssec` | flag | DNSSEC validation on |
| `forwarding` | flag | Forwarding mode (send recursive queries to upstream) |
| `forward_tls_upstream` | flag | Use DoT for upstream forwarders |
| `prefetch` | flag | Prefetch records near expiry |
| `prefetchkey` | flag | Prefetch DNSKEY records |
| `dnsrecordcache` | flag | `serve-expired: yes` |
| `active_interface` | `lan,opt1,wan,lo0` | Listen interfaces |
| `msgcachesize` | `256` (MB) | Message cache (rrset-cache auto-doubles to 512MB) |
| `cache_max_ttl` | `604800` | 7 days |
| `cache_min_ttl` | `60` | 60 seconds |
| `custom_options` | base64 | Contains `serve-expired-ttl: 259200` + `auth-zone:` block |

Upstream DoT forwarders live in `<system>`:

- `dnsserver[0] = 1.1.1.1`
- `dnsserver[1] = 1.0.0.1`
- `dns1host = cloudflare-dns.com`
- `dns2host = cloudflare-dns.com`

## Auth-Zone for viktorbarzin.lan

The custom_options block declares:

```
server:
  serve-expired-ttl: 259200

auth-zone:
  name: "viktorbarzin.lan"
  master: 10.0.20.201
  fallback-enabled: yes
  for-downstream: yes
  for-upstream: yes
  zonefile: "viktorbarzin.lan.zone"
  allow-notify: 10.0.20.201
```

- `master: 10.0.20.201` — AXFR source (Technitium LoadBalancer)
- `fallback-enabled: yes` — if the zone can't refresh from master, fall back to normal recursion for this name (prevents hard-fail if AXFR breaks)
- `for-downstream: yes` — answer queries for this zone with the AA flag
- `for-upstream: yes` — Unbound's internal iterator also uses this zone
- `zonefile` is relative to the chroot (`/var/unbound/viktorbarzin.lan.zone`)
- `allow-notify: 10.0.20.201` — accept NOTIFY from Technitium

## Technitium-side ACL

Zone `viktorbarzin.lan` on Technitium has `zoneTransfer = UseSpecifiedNetworkACL` with ACL entries:

- `10.0.20.1` (pfSense OPT1)
- `10.0.10.1` (pfSense LAN)
- `192.168.1.2` (pfSense WAN)

Verify via the Technitium API:

```
curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer
```

## Operational Checks

```bash
# Is Unbound listening?
ssh admin@10.0.20.1 "sockstat -l -4 -p 53"

# Auth-zone loaded?
ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"
# Expected: viktorbarzin.lan. serial NNNNN

# LAN record via auth-zone? (aa flag = authoritative / from auth-zone)
dig @192.168.1.2 idrac.viktorbarzin.lan +norec

# Public record via DoT? (ad flag = DNSSEC validated, via 1.1.1.1/1.0.0.1)
dig @192.168.1.2 example.com +dnssec

# Zonefile has all records?
ssh admin@10.0.20.1 "wc -l /var/unbound/viktorbarzin.lan.zone"
```

## K8s Outage Drill

Tests that `.lan` resolution survives a full Technitium outage:

```bash
# Scale Technitium primary to 0
kubectl -n technitium scale deploy/technitium --replicas=0

# Wait ~5 seconds, then test from a LAN client
ssh devvm.viktorbarzin.lan "dig @192.168.1.2 idrac.viktorbarzin.lan +short"
# Expected: 192.168.1.4 (served from Unbound's cached auth-zone)

# Restore immediately
kubectl -n technitium scale deploy/technitium --replicas=1
```

Completed successfully on the 2026-04-19 initial deployment.

Note: secondary/tertiary Technitium pods remain up and continue to serve queries via the `10.0.20.201` LoadBalancer even when the primary is down — so the strongest proof that Unbound's auth-zone serves locally is to also scale those down (optional, not part of the routine drill).

## Backup & Rollback

### Backups

- **On-box**: `/cf/conf/config.xml.2026-04-19-pre-unbound` (created before this workstream ran — keep for 30 days, then delete)
- **Daily**: the PVE `daily-backup` script copies `/cf/conf/config.xml` and a full pfSense config tar to `/mnt/backup/pfsense/` on the Proxmox host at 05:00
- **Offsite**: Synology `pve-backup/pfsense/` (synced daily by `offsite-sync-backup`)

### Rollback to dnsmasq

If Unbound misbehaves, revert to dnsmasq + NAT rdr:

```bash
# On pfSense
cp /cf/conf/config.xml.2026-04-19-pre-unbound /cf/conf/config.xml

# Tell pfSense to re-read config and reload services
php -r 'require_once("config.inc"); require_once("config.lib.inc"); disable_path_cache();'
/etc/rc.restart_webgui     # reloads PHP config caches
# Restart services
php -r 'require_once("config.inc"); require_once("services.inc"); services_dnsmasq_configure(); services_unbound_configure(); filter_configure();'
/etc/rc.filter_configure   # re-applies NAT rules (brings back rdr)
```

Verify:

```bash
sockstat -l -4 -p 53 | grep dnsmasq   # expect dnsmasq on 10.0.10.1 and 10.0.20.1
pfctl -sn | grep '53'                 # expect rdr on wan UDP 53 → 10.0.20.201
```

### Rollback without wiping new changes

If you only want to stop Unbound without restoring the whole config, edit config.xml and remove `<enable/>` from `<unbound>` + add it back to `<dnsmasq>`, then re-run `services_unbound_configure()` + `services_dnsmasq_configure()`. You also need to re-add the WAN NAT rdr in `<nat><rule>` (see the backup XML for the exact shape — tracker `1775670025`).

## Known Gotchas

1. **pfSense regenerates `/var/unbound/unbound.conf`** on every service reload from `<unbound>` in `config.xml`. Edits to unbound.conf are NOT durable.
2. **`unbound-control`'s default config path is wrong**. Always use `unbound-control -c /var/unbound/unbound.conf <cmd>`.
3. **`custom_options` is base64-encoded** in config.xml. Use `base64 -d` to decode in a shell (see the sketch below), or `base64_decode()` in PHP.
4. **`interface-automatic: yes` is NOT used** when `active_interface` is explicitly set to a list — pfSense emits explicit `interface: <ip>` lines.
5. **`auth-zone`'s `zonefile` path is relative to the Unbound chroot** (`/var/unbound`), NOT absolute. Using an absolute path silently fails.
6. **DoT forwarders need the `forward_tls_upstream` flag** AND `dns1host` / `dns2host` set in `<system>` for SNI — without the hostname, pfSense emits `forward-addr: 1.1.1.1@853` (no `#`) which Cloudflare rejects with a certificate hostname mismatch.

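A quick way to eyeball the decoded `custom_options` from a copy of `config.xml` on a workstation. `xmllint` availability is an assumption (any XML-aware tool works); don't count on it being installed on pfSense itself:

```sh
# Pull the base64 blob out of a saved config.xml and decode it.
xmllint --xpath 'string(//unbound/custom_options)' config.xml | base64 -d
```
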
## Kea DHCP-DDNS TSIG (WS E, 2026-04-19)

Kea DHCP-DDNS on pfSense signs its RFC 2136 dynamic updates with an HMAC-SHA256 TSIG key (`kea-ddns`). Technitium's `viktorbarzin.lan` zone and reverse zones (`10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`) require both a pfSense source IP (10.0.20.1 / 10.0.10.1 / 192.168.1.2) AND a valid TSIG signature.

### Config locations

| Side | File | Notes |
|------|------|-------|
| pfSense | `/usr/local/etc/kea/kea-dhcp-ddns.conf` | Hand-managed. Pre-WS-E backup: `.2026-04-19-pre-tsig`. Daemon: `kea-dhcp-ddns` (`pkill -x kea-dhcp-ddns && /usr/local/sbin/kea-dhcp-ddns -c /usr/local/etc/kea/kea-dhcp-ddns.conf -d &`) |
| Technitium | Zone options API: `POST /api/zones/options/set?zone=<z>&updateSecurityPolicies=kea-ddns\|*.<z>\|ANY&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&update=UseSpecifiedNetworkACL` | Set on primary; replicates to secondary/tertiary via AXFR |
| Technitium settings | TSIG keys array: `POST /api/settings/set` with `tsigKeys: [{keyName: "kea-ddns", sharedSecret: <b64>, algorithmName: "hmac-sha256"}]` | Must be set on all 3 Technitium instances (primary, secondary, tertiary) |
| Vault | `secret/viktor/kea_ddns_tsig_secret` | Authoritative copy of the base64 secret |

### Rotating the TSIG key

1. Generate a new base64 32-byte secret: `openssl rand -base64 32` (any base64-encoded blob of reasonable length works; HMAC-SHA256 truncates/pads internally).
2. Write it to Vault: `vault kv patch secret/viktor kea_ddns_tsig_secret=<new-secret>`.
3. Add the new key under a **new name** (e.g., `kea-ddns-v2`) via the Technitium settings API on all 3 instances. Do NOT overwrite `kea-ddns` while Kea still uses it — you'd orphan in-flight updates.
4. Update `/usr/local/etc/kea/kea-dhcp-ddns.conf` on pfSense to reference both keys in `tsig-keys`, set `key-name: kea-ddns-v2` on each `forward-ddns` / `reverse-ddns` domain (see the sketch below), and restart `kea-dhcp-ddns`.
5. Update each affected zone's `updateSecurityPolicies` to use the new key name.
6. After a lease-renewal cycle (default Kea lease = 7200s / 2h), verify with `kubectl -n technitium exec <primary-pod> -- grep "TSIG KeyName: kea-ddns-v2" /etc/dns/logs/<today>.log`.
7. Remove the old `kea-ddns` key from Technitium settings + Kea config.

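The fragment of `kea-dhcp-ddns.conf` touched in step 4 looks roughly like this. It is a sketch of the stock Kea D2 JSON shape, not a copy of the live file, so check it against the real config before editing:

```json
{
  "DhcpDdns": {
    "tsig-keys": [
      { "name": "kea-ddns",    "algorithm": "HMAC-SHA256", "secret": "<old base64 secret>" },
      { "name": "kea-ddns-v2", "algorithm": "HMAC-SHA256", "secret": "<new base64 secret>" }
    ],
    "forward-ddns": {
      "ddns-domains": [
        {
          "name": "viktorbarzin.lan.",
          "key-name": "kea-ddns-v2",
          "dns-servers": [ { "ip-address": "10.0.20.201", "port": 53 } ]
        }
      ]
    }
  }
}
```
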
### Emergency TSIG bypass (if rotation breaks DDNS)

If DDNS updates are failing and you cannot quickly fix the key, temporarily downgrade the zone policy to IP-ACL only (pfSense source IPs) without TSIG:

```bash
kubectl -n technitium port-forward pod/<primary-pod> 5380:5380 &
TOKEN=$(curl -s -X POST http://127.0.0.1:5380/api/user/login \
  -d "user=admin&pass=$(vault kv get -field=technitium_password secret/platform)&includeInfo=false" | jq -r .token)

for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
  curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies="
done
```

This clears `updateSecurityPolicies` while keeping the IP ACL. Updates now flow unsigned from pfSense IPs — **weaker** than TSIG but restores service. Re-enable TSIG as soon as the key issue is resolved.

### Verify TSIG is enforced

```bash
# Unsigned update should fail
nsupdate <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
# Expected: "update failed: REFUSED"

# Signed update should succeed
cat > /tmp/kea-ddns.key <<EOF
key "kea-ddns" {
  algorithm hmac-sha256;
  secret "$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)";
};
EOF
nsupdate -k /tmp/kea-ddns.key <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
dig @10.0.20.201 +short tsig-test.viktorbarzin.lan
# Expected: 10.99.99.99
rm -f /tmp/kea-ddns.key
```

## Related Docs

- `docs/architecture/dns.md` — overall DNS architecture (K8s side, Technitium, CoreDNS)
- `docs/architecture/networking.md` — VLAN layout, pfSense interface mapping
- `.claude/skills/pfsense/skill.md` — SSH / CLI patterns for pfSense management

@ -1,103 +0,0 @@
# Runbook: Proxmox host (pve, 192.168.1.127)

Last updated: 2026-04-19

The Proxmox host is a baremetal hypervisor on the storage LAN (192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every Kubernetes node VM and the NFS exports that back PVCs. It does **not** receive DHCP — its network config is static in `/etc/network/interfaces` (ifupdown). Because of that, DNS must be configured manually and stays outside the scope of Kea/DHCP-DDNS.

## DNS configuration

The host uses a plain `/etc/resolv.conf` with two nameservers. No `systemd-resolved`, no `resolvconf`, no NetworkManager — nothing manages `/etc/resolv.conf`; it is a regular file owned by root.

### Why plain `/etc/resolv.conf` and not systemd-resolved

1. Installing `systemd-resolved` on an active Proxmox node during business hours is the kind of change that risks breaking the NFS server or VM networking. PVE's Debian base does not ship `systemd-resolved` by default.
2. The ifupdown `/etc/network/interfaces` file does not manage `/etc/resolv.conf` here — ifupdown's resolvconf integration is only active if the `resolvconf` package is installed, which it is not (`dpkg -l resolvconf` returns `un`).
3. A plain file is the simplest mental model and avoids a second layer of "which tool is running now" confusion during an incident.

If you ever want to migrate to `systemd-resolved`, install the package, enable the service, symlink `/etc/resolv.conf` to `/run/systemd/resolve/stub-resolv.conf`, and drop the config in `/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this during a maintenance window, not reactively.

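If that migration ever happens, the sequence would look roughly like the following. This is a sketch only, mirroring the drop-in used on the registry VM; it assumes the `systemd-resolved` package is available for this PVE release and has not been tested on this host:

```sh
# Maintenance-window sketch: move the PVE host to systemd-resolved.
apt-get install -y systemd-resolved
mkdir -p /etc/systemd/resolved.conf.d
cat > /etc/systemd/resolved.conf.d/10-internal-dns.conf <<'EOF'
[Resolve]
DNS=192.168.1.2
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
EOF
systemctl enable --now systemd-resolved
ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
resolvectl status | head -15
```
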
### Current state

```
# /etc/resolv.conf
search viktorbarzin.lan
nameserver 192.168.1.2
nameserver 94.140.14.14
options timeout:2 attempts:2
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — recursive only, used if the pfSense LAN IP is unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | — | Cap the glibc resolver at 2s per server, 2 tries — reasonable fallback latency |

### Verification commands

```sh
ssh root@192.168.1.127 '
  cat /etc/resolv.conf                # should show the two nameservers
  dig +short idrac.viktorbarzin.lan   # expect an A record (192.168.1.4)
  dig +short github.com               # expect an A record
'
```

Simulated failover — force the primary unreachable and verify the fallback answers:

```sh
ssh root@192.168.1.127 '
  ip route add blackhole 192.168.1.2
  dig +short +time=3 github.com        # glibc times out on primary, tries 94.140.14.14 → A record returned
  ip route del blackhole 192.168.1.2   # cleanup
'
```

Expected behaviour: the first `dig` prints a warning about the UDP setup failing for 192.168.1.2 and then prints the GitHub A record (answered by 94.140.14.14).

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`, and `/etc/network/interfaces.d/` lives at `/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the host. To roll back:

```sh
ssh root@192.168.1.127 '
  # pick the backup you want (there may be multiple if this runbook has been applied more than once)
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  cat /etc/resolv.conf
'
```

No service restart is needed — glibc re-reads `/etc/resolv.conf` per lookup.

## Related docs

- `docs/architecture/dns.md` — where each resolver IP lives and which subnet it serves.
- `docs/runbooks/nfs-prerequisites.md` — other operations on this host; read before adding new NFS exports.

@ -146,8 +146,8 @@ qm shutdown 220; sleep 10
for VMID in 102 300 103; do qm shutdown $VMID; done
sleep 20

# TrueNAS (decommissioned 2026-04-13 — VM 9000 should already be stopped; skip if absent)
qm shutdown 9000 2>/dev/null || true
# TrueNAS (wait for ZFS flush)
qm shutdown 9000; sleep 60

# pfSense (last — network gateway)
qm shutdown 101; sleep 15

@ -1,170 +0,0 @@
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident

Last updated: 2026-04-19

## When to use this

Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with messages like:

- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404

…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>` returns an OCI image index whose `manifests[].digest` references are 404 on the registry.

This is the **orphan OCI-index** failure mode documented in `docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to rebuild the affected image from source so the registry receives a fresh, complete push.

If the symptom is different (e.g., registry container down, TLS expiry, auth failure), use `docs/runbooks/registry-vm.md` instead.

## Phase 1 — Confirm the diagnosis

From any host with `skopeo`:

```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest

# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'

# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d → $code"
done
```

If every child is 200, the problem is elsewhere — stop here and check the registry VM, TLS, or auth.

The `registry-integrity-probe` CronJob in the `monitoring` namespace runs this same check every 15 minutes across every tag in the catalog; its last run is also a fast way to see which image(s) are affected:

```sh
kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name -o name \
    | grep registry-integrity-probe | head -1)
```

## Phase 2 — Rebuild

### Option A (preferred): rebuild via CI

Find the `build-*.yml` pipeline that produces the image:

| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |

Trigger a manual build. The Woodpecker API expects a numeric repo ID (paths with `owner/name` return HTML):

```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)

# Kick off a manual build against master.
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number

# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```

The pipeline's `verify-integrity` step walks every blob the push references. If it passes, the registry now has a clean index; pull consumers will recover on next attempt.

### Option B (fallback): build on the registry VM

Only use this if Woodpecker itself is broken (its own pipeline runs from the same `infra-ci` image, so a corrupted `infra-ci:latest` can prevent Option A from recovering).

```sh
ssh root@10.0.20.10 '
  cd /tmp
  git clone --depth 1 https://github.com/ViktorBarzin/infra
  cd infra/ci
  docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
  docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
  docker push registry.viktorbarzin.me:5050/infra-ci:manual
  docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```

Then re-run any pipelines that failed — Woodpecker UI → Restart, or:

```sh
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```

## Phase 3 — Verify

```sh
# 1. Pull the image fresh (bypassing containerd cache) and check its index.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/infra-ci:latest" \
  | jq '.manifests[] | {digest, platform}'

# 2. HEAD every child digest — all should be 200.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || echo "STILL BROKEN: $d → $code"
done
echo "verified"

# 3. Kick off the next scheduled probe for good measure.
# Capture the job name once so the create and logs commands target the same job.
JOB=registry-integrity-probe-verify-$(date +%s)
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l job-name="$JOB"
```

The `RegistryManifestIntegrityFailure` alert clears automatically when the probe's next run returns zero failures.

## Phase 4 — Investigate orphans

Once the immediate fix is in, check whether any OTHER images on the registry have orphan children:

```sh
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```

Each hit is a separate image that will eventually fail to pull. Rebuild them in the same way (Option A preferred). If the list is long, open a beads task — do NOT batch-delete the indexes; that's a destructive registry operation outside this runbook's scope.

## Related

- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this happens.
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS, `docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron itself, runs nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf` — the `registry_integrity_probe` CronJob definition.

@ -1,227 +0,0 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-05-07

The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan config (no DHCP). Because it sits on a subnet that only has pfSense as its gateway, its DNS must be statically configured.

**As of Phase 4 of forgejo-registry-consolidation 2026-05-07** the VM no longer hosts the private R/W registry. It hosts pull-through caches only:

| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub) — auth via dockerhub_registry_password |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |

The decommissioned private registry (port 5050) is now hosted on Forgejo at `forgejo.viktorbarzin.me/viktor/<image>`. See `docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the migration. Break-glass tarballs of `infra-ci` are still produced on each build to `/opt/registry/data/private/_breakglass/` — see `docs/runbooks/forgejo-registry-breakglass.md`.

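K8s nodes consume these caches through containerd registry mirrors; per upstream, the `hosts.toml` has roughly the following shape. The exact file is laid down by the node bootstrap, so treat the path and scheme here as assumptions rather than the live config:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml (sketch)
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
```
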
## DNS configuration

Ubuntu ships `systemd-resolved` and uses netplan to declare per-link `nameservers`. Netplan writes systemd-networkd or NetworkManager configs that resolved reads at runtime. There is **no automatic merging** of netplan DNS with the `[Resolve]` section of `/etc/systemd/resolved.conf` — per-link settings override the global ones. So both layers must be in sync:

| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |

### Current state

`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:

```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```

`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):

```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```

`resolvectl status` output after the change:

```
Global
     resolv.conf mode: stub
   Current DNS Server: 10.0.20.1
          DNS Servers: 10.0.20.1
 Fallback DNS Servers: 94.140.14.14
           DNS Domain: viktorbarzin.lan

Link 2 (eth0)
       Current Scopes: DNS
   Current DNS Server: 10.0.20.1
          DNS Servers: 10.0.20.1 94.140.14.14
           DNS Domain: viktorbarzin.lan
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — used if pfSense is unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |

### Why this matters for the registry

Container builds on this VM reference `.lan` hostnames (Technitium, NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:

1. Internal hostname lookups silently failed (slow timeout) — the VM could not resolve `idrac.viktorbarzin.lan` or any internal helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS entirely.

With the new config the VM can resolve both zones and keeps working if the primary DNS server is unreachable.

## Apply / re-apply

```sh
ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'
```

`netplan apply` is not disruptive when only `nameservers` change — it does not bounce the link.

## Verification

```sh
ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan     # 192.168.1.4
  dig +short github.com                 # GitHub A record
  dig +short registry.viktorbarzin.me   # 10.0.20.10 + external A
'
```

Fallback test — blackhole the primary and confirm external lookups still succeed through 94.140.14.14:

```sh
ssh root@10.0.20.10 '
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
  ip route del blackhole 10.0.20.1
'
```

Internal lookups do fail during the blackhole (the fallback is a public resolver and does not know about the internal zone), which is expected — the fallback buys availability for external pulls, not internal hostnames.

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`, and `/etc/netplan/` lives at `/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the VM. To roll back:

```sh
ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
```

## Auto-sync pipeline

Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh, cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy automatically via `.woodpecker/registry-config-sync.yml`:

- Fires on `push` to master touching any of those paths, or via `manual` event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves them alone when only scripts changed (cron picks up automatically).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry is still coherent.

SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519, provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on `10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key` in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).

Manual override if you need to sync right now:

```sh
curl -sf -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number
```

## Bouncing registry containers — the nginx DNS trap

`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates `registry-*` containers when their image tag changes, which assigns them new IPs on the `registry` bridge network. **`registry-nginx` resolves its upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at startup and caches the results** — it does not re-resolve after a recreate.

Symptom if you forget: `/v2/_catalog` on `:5050` returns `{"repositories": []}`, `/v2/` returns 200 without auth, pulls return the wrong image. nginx is forwarding to a stale IP that now belongs to a different registry-* backend (commonly the pull-through ghcr or dockerhub cache, which have empty catalogs from the htpasswd-auth user's perspective).

**Always follow a registry-* bounce with `docker restart registry-nginx`.** Or prevent the problem by setting a `resolver` directive in `nginx_registry.conf` so upstream names are re-resolved per request.

```sh
|
||||
ssh root@10.0.20.10 '
|
||||
cd /opt/registry && docker compose up -d
|
||||
docker restart registry-nginx
|
||||
sleep 3
|
||||
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
|
||||
| grep -E "registry-"
|
||||
'
|
||||
```
|
||||
|
||||
## Related docs
|
||||
|
||||
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
|
||||
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
|
||||
and `containerd` `hosts.toml` redirects.
|
||||
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
|
||||
orphan OCI-index incident (different class of problem than DNS).
|
||||
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
|
||||
+ detection gaps behind the recurring missing-blob incidents.
|
||||
|
|
@ -7,7 +7,7 @@
|
|||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db`
|
||||
- Replicated to Synology NAS (192.168.1.13) via Proxmox host offsite-sync-backup (inotify-driven rsync)
|
||||
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
|
||||
- Retention: 30 days
|
||||
- Schedule: Daily at 00:00
|
||||
|
||||
|
|
|
|||
|
|
@ -8,8 +8,8 @@ Last updated: 2026-04-06
|
|||
- Proxmox host failure requiring fresh VM provisioning
|
||||
|
||||
## Prerequisites
|
||||
- Proxmox host (192.168.1.127) accessible, with NFS exports on `/srv/nfs` and `/srv/nfs-ssd`
|
||||
- Synology NAS (192.168.1.13) accessible for offsite backup restore if the PVE host backup disk is also lost
|
||||
- Proxmox host (192.168.1.127) accessible
|
||||
- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
|
||||
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
|
||||
- Git repo with infra code
|
||||
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
|
||||
|
|
|
|||
|
|
@ -130,7 +130,7 @@ kubectl rollout restart deployment -n <namespace>
|
|||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
|
|
@ -148,17 +148,17 @@ kubectl run mysql-restore --rm -it --image=mysql \
|
|||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If the PVE host itself is unavailable:
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/nfs/mysql-backup/
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
|
||||
|
||||
# 3. Copy dump to a temporary location accessible from cluster
|
||||
# (e.g., via rsync to a surviving node, or restore PVE host first)
|
||||
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
|
|
|
|||
|
|
@ -123,7 +123,7 @@ kubectl rollout restart deployment -n <namespace>
|
|||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
|
|
@ -142,17 +142,17 @@ kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
|
|||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If the PVE host itself is unavailable:
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/nfs/postgresql-backup/
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
|
||||
|
||||
# 3. Copy dump to a temporary location accessible from cluster
|
||||
# (e.g., via rsync to a surviving node, or restore PVE host first)
|
||||
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
|
|
|
|||
|
|
@ -93,7 +93,7 @@ kubectl get externalsecrets -A | grep -v "SecretSynced"
|
|||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
|
|
@ -115,17 +115,17 @@ vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
|
|||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If the PVE host itself is unavailable:
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/nfs/vault-backup/
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
|
||||
|
||||
# 3. Copy snapshot to local workstation
|
||||
scp Administrator@192.168.1.13:/volume1/Backup/Viki/nfs/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
|
||||
# 4. Restore via port-forward (same as above)
|
||||
```
|
||||
|
|
|
|||
|
|
@ -104,9 +104,9 @@ lvchange -an pve/$LV_NAME
|
|||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda Backup Mirror
|
||||
## Alternative: Restore from sda NFS Mirror
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
If TrueNAS NFS is unavailable but PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
|
|
|
|||
|
|
@ -1,51 +0,0 @@
|
|||
# Runbook: Applying the Technitium Terraform stack
|
||||
|
||||
Last updated: 2026-04-19
|
||||
|
||||
The `stacks/technitium/` apply has a **post-apply readiness gate** that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what it checks, how to interpret failures, and how to override it for emergency maintenance.
|
||||
|
||||
## What the gate checks
|
||||
|
||||
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs:
|
||||
|
||||
1. **Rollout status** — `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
|
||||
2. **Per-pod API health** — for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
|
||||
3. **Zone-count parity** — queries `technitium-web`, `technitium-secondary-web`, `technitium-tertiary-web` and counts the zones returned. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
|
||||
|
||||
The gate is re-run whenever any of the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see `triggers` in `readiness.tf`).
|
||||
|
||||
## Emergency override
|
||||
|
||||
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
|
||||
|
||||
```bash
|
||||
cd infra/stacks/technitium
|
||||
scripts/tg apply -var skip_readiness=true
|
||||
```
|
||||
|
||||
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
|
||||
|
||||
You can also target around the gate during emergency work:
|
||||
|
||||
```bash
|
||||
scripts/tg apply -target=kubernetes_config_map.coredns
|
||||
```
|
||||
|
||||
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
|
||||
|
||||
## Failure modes and responses
|
||||
|
||||
| Symptom | Likely cause | Fix |
|
||||
|---------|--------------|-----|
|
||||
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other dns-server pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
|
||||
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on :53) | `kubectl delete pod <name>` — deployment will recreate it. |
|
||||
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>`. Check `TechnitiumZoneSyncFailed` alert. |
|
||||
| Gate passes but external clients still cannot resolve | Gate only checks in-pod API and intra-cluster zone parity — external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in `docs/architecture/dns.md` troubleshooting section. |
|
||||
|
||||
## What the gate does NOT check
|
||||
|
||||
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe).
|
||||
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments — alerts `CoreDNSErrors` / `CoreDNSForwardFailureRate` catch regressions post-apply).
|
||||
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
|
||||
|
||||
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.
|
||||
|
|
@ -1,217 +0,0 @@
|
|||
# Runbook: Vault Raft Leader Deadlock + Safe Pod Restart
|
||||
|
||||
Captures the 2026-04-22 incident pattern. When a Vault raft leader enters a
|
||||
stuck goroutine state (port 8201 accepts TCP but RPCs never return), the
|
||||
recovery is *not* `kubectl delete --force`. Force-deleting a Vault pod that
|
||||
holds a stuck NFS mount leaves kernel NFS client state corrupted, which
|
||||
blocks all subsequent NFS mounts from the node and usually requires a VM
|
||||
hard-reset to clear.
|
||||
|
||||
**Related**: [post-mortems/2026-04-22-vault-raft-leader-deadlock.md](../post-mortems/2026-04-22-vault-raft-leader-deadlock.md).
|
||||
|
||||
## Symptoms
|
||||
|
||||
- `https://vault.viktorbarzin.me/v1/sys/health` returns HTTP 503.
|
||||
- Standbys log `msgpack decode error [pos 0]: i/o timeout` every 2s.
|
||||
- `kubectl exec` into a standby shows raft thinks the leader is alive
|
||||
(peers list all `Voter`, leader address populated) but `vault operator
|
||||
raft autopilot state` stalls or errors.
|
||||
- The "leader" pod's logs go silent — no heartbeats, no audit writes,
|
||||
nothing. TCP on 8201 still accepts connections.
|
||||
- ESO-backed secrets stop refreshing (ExternalSecret `SecretSyncedError`).
|
||||
- Woodpecker CI pipelines that read from Vault at plan time hang.
|
||||
|
||||
## 0. Confirm the diagnosis (before touching anything)
|
||||
|
||||
Don't jump to force-delete. Verify the leader is actually stuck, not just
|
||||
slow:
|
||||
|
||||
```sh
|
||||
# 1. Who does raft think the leader is?
|
||||
kubectl exec -n vault vault-0 -c vault -- vault status 2>&1 | \
|
||||
grep -E 'HA Mode|Active Node|Leader|Raft'
|
||||
|
||||
# 2. Is the leader's port open but unresponsive?
|
||||
LEADER_POD=vault-2 # or whichever vault status reports
|
||||
kubectl exec -n vault $LEADER_POD -c vault -- sh -c \
|
||||
'timeout 3 nc -zv 127.0.0.1 8200 2>&1; echo; timeout 3 vault status'
|
||||
|
||||
# 3. Is the active vault service pointing at a real pod?
|
||||
kubectl get endpoints -n vault vault-active -o yaml | \
|
||||
grep -E 'addresses|notReadyAddresses' -A2
|
||||
|
||||
# 4. What do standby logs say?
|
||||
kubectl logs -n vault vault-0 -c vault --tail=40 | grep -iE 'msgpack|decode|rpc'
|
||||
```
|
||||
|
||||
If (2) hangs and (4) shows repeated msgpack errors → stuck leader.
|
||||
|
||||
## 1. Identify the stuck pod precisely
|
||||
|
||||
```sh
|
||||
# Find the pod whose vault_core_active would be 1 if it were scraping
|
||||
# (currently no telemetry — use logs as proxy until telemetry is enabled).
|
||||
for p in vault-0 vault-1 vault-2; do
|
||||
echo "=== $p ==="
|
||||
kubectl logs -n vault $p -c vault --tail=5 2>&1 | head -5
|
||||
done | grep -B1 'no recent output'
|
||||
```
|
||||
|
||||
The pod whose logs have been silent for minutes while the others are
|
||||
actively erroring is the stuck leader.
|
||||
|
||||
## 2. The safe restart sequence (avoids zombie containers)
|
||||
|
||||
**DO NOT** `kubectl delete pod --force --grace-period=0` as the first
|
||||
step. On NFS-backed Vault that's the exact move that leaves the kernel
|
||||
NFS client corrupted on the node where the stuck pod ran.
|
||||
|
||||
Instead:
|
||||
|
||||
### 2a. Graceful delete first (30s grace)
|
||||
|
||||
```sh
|
||||
kubectl delete pod -n vault vault-2
|
||||
```
|
||||
|
||||
Wait 30 seconds. Most of the time the TERM → SIGKILL path works and the
|
||||
new pod schedules cleanly. The remaining leaders re-elect and the external
|
||||
endpoint recovers.
|
||||
|
||||
### 2b. If the pod is Terminating after 60s, find the stuck process
|
||||
|
||||
```sh
|
||||
NODE=$(kubectl get pod -n vault vault-2-<suffix> -o jsonpath='{.spec.nodeName}')
|
||||
POD_UID=$(kubectl get pod -n vault vault-2-<suffix> -o jsonpath='{.metadata.uid}')
|
||||
|
||||
ssh $NODE "sudo ps auxf | grep -A2 $POD_UID | head -20"
|
||||
# Look for: mount.nfs (D-state), vault (Z-state), or the sh wrapper in do_wait
|
||||
```
|
||||
|
||||
### 2c. Unmount stale NFS before force-deleting
|
||||
|
||||
If the old pod's NFS mount is still present, lazy-unmount it FIRST so
|
||||
the kernel can release NFS session state cleanly:
|
||||
|
||||
```sh
|
||||
ssh $NODE "sudo mount | grep $POD_UID | awk '{print \$3}' | xargs -I{} sudo umount -l {}"
|
||||
```
|
||||
|
||||
Verify no mount.nfs processes are in D-state on the node:
|
||||
|
||||
```sh
|
||||
ssh $NODE "ps -eo state,pid,comm | grep '^D' | head -5"
|
||||
```
|
||||
|
||||
### 2d. Only NOW force-delete if needed
|
||||
|
||||
```sh
|
||||
kubectl delete pod -n vault vault-2-<suffix> --force --grace-period=0
|
||||
```
|
||||
|
||||
## 3. Recovery when the node is already stuck
|
||||
|
||||
If you force-deleted before reading this runbook and NFS is now broken
|
||||
on the node:
|
||||
|
||||
**Diagnostic — confirm NFS client state is corrupted:**
|
||||
|
||||
```sh
|
||||
NODE=k8s-node2 # node where the force-delete happened
|
||||
ssh $NODE "sudo mkdir -p /tmp/nfstest && sudo timeout 30 \
|
||||
mount -t nfs 192.168.1.127:/srv/nfs /tmp/nfstest && echo MOUNT_OK"
|
||||
```
|
||||
|
||||
If the mount times out at 30-110s, kernel NFS client state is stuck.
|
||||
No userspace recovery exists — only a VM reboot clears it.
|
||||
|
||||
**Workaround before rebooting**: mounting with `nfsvers=4.1` succeeds
|
||||
on broken nodes (the corruption is NFSv4.2 session-state specific).
|
||||
This is useful for diagnostic mounts, but does NOT fix CSI pods —
|
||||
their mount options come from the `nfs-proxmox` StorageClass and can't
|
||||
be overridden per-pod.
|
||||
|
||||
**Reboot the affected node VM:**
|
||||
|
||||
```sh
|
||||
# Find PVE VM ID — nodes numbered 201-204 for k8s-node1..4
|
||||
ssh root@192.168.1.127 "qm reset 20<N>"
|
||||
|
||||
# If qm reset leaves the VM PID unchanged (it didn't actually reboot),
|
||||
# use qm stop/start:
|
||||
ssh root@192.168.1.127 "qm stop 20<N> && qm start 20<N>"
|
||||
```
|
||||
|
||||
Wait for the node to become Ready (`kubectl get node k8s-node<N> -w`)
|
||||
and CSI driver to register (`kubectl get pods -n nfs-csi -o wide`).
|
||||
|
||||
**Gotcha — `qm reset` can be a no-op.** On the 2026-04-22 incident,
|
||||
`qm reset 201` returned exit 0 but did NOT restart the VM (same QEMU PID
|
||||
before and after). `qm status` reported "running" throughout. Always
|
||||
verify by checking the QEMU PID or VM uptime post-reset. If uptime is
|
||||
unchanged, escalate to `qm stop && qm start`.
|
||||
|
||||
**Gotcha — check boot order before stop/start.** Long-running VMs
|
||||
(630+ day uptime) may have stale `bootdisk:` config that's been hidden
|
||||
by never rebooting. On 2026-04-22, k8s-node1's config had `bootdisk:
|
||||
scsi0` but the actual OS disk was on `scsi1`, so the first boot after
|
||||
stop attempted iPXE and failed. Before stopping, verify:
|
||||
|
||||
```sh
|
||||
ssh root@192.168.1.127 "grep -E 'boot|scsi[0-9]+:' /etc/pve/qemu-server/20<N>.conf"
|
||||
```
|
||||
|
||||
If `bootdisk` references a disk ID that doesn't exist, fix it first
|
||||
with `qm set 20<N> --boot "order=scsi<ID>"` (use the ID of the main
|
||||
OS disk).
|
||||
|
||||
## 4. Prevent re-infection — the chown loop
|
||||
|
||||
After the node comes back, the vault pod's PV chown walk can still
|
||||
peg kubelet. The durable fix is in `stacks/vault/main.tf`:
|
||||
|
||||
```hcl
|
||||
statefulSet = {
|
||||
securityContext = {
|
||||
pod = {
|
||||
fsGroupChangePolicy = "OnRootMismatch"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This was applied in commit `2f1f9107` (2026-04-22). If you find
|
||||
yourself editing this in a kubectl patch for live recovery, follow
|
||||
up with a Terraform apply the same session — leaving the cluster
|
||||
ahead of Terraform state is technical debt that re-triggers on the
|
||||
next apply.
|
||||
|
||||
## 5. Verify end-to-end
|
||||
|
||||
```sh
|
||||
# External endpoint — the user-facing health check
|
||||
curl -sk -o /dev/null -w "%{http_code}\n" https://vault.viktorbarzin.me/v1/sys/health
|
||||
# expect: 200
|
||||
|
||||
# Raft peers (needs VAULT_TOKEN with operator capability)
|
||||
kubectl exec -n vault vault-0 -c vault -- vault operator raft list-peers
|
||||
|
||||
# All pods 2/2
|
||||
kubectl get pods -n vault -l app.kubernetes.io/name=vault -o wide
|
||||
|
||||
# No alerts fired (once VaultRaftLeaderStuck + VaultHAStatusUnavailable are live)
|
||||
curl -s https://alertmanager.viktorbarzin.me/api/v2/alerts | \
|
||||
jq '.[] | select(.labels.alertname | test("Vault"))'
|
||||
```
|
||||
|
||||
## Known limitations
|
||||
|
||||
- **No alert for stuck leaders yet.** `VaultRaftLeaderStuck` and
|
||||
`VaultHAStatusUnavailable` require Vault telemetry enabled
|
||||
(`telemetry { unauthenticated_metrics_access = true }`) and a
|
||||
scrape job. Alerts are defined in `prometheus_chart_values.tpl`
|
||||
but stay silent until telemetry lands — tracked as a beads task.
|
||||
- **Vault on NFS violates the documented rule.** `infra/.claude/CLAUDE.md`
|
||||
says critical services must use `proxmox-lvm-encrypted`. The
|
||||
`dataStorage`/`auditStorage` still use `nfs-proxmox`. Migration
|
||||
tracked as an epic-level beads task.
|
||||
|
|
@ -1,73 +0,0 @@
|
|||
# Runbook: Onboarding a new Forgejo repo to Woodpecker
|
||||
|
||||
Last updated: 2026-05-07
|
||||
|
||||
When you create a new repo on `forgejo.viktorbarzin.me`, Woodpecker
|
||||
does NOT auto-discover it via the cluster's existing OAuth session.
|
||||
The `forgejo` user inside Woodpecker (Forgejo-OAuth'd) needs to:
|
||||
|
||||
1. Open `https://ci.viktorbarzin.me/` in a browser.
|
||||
2. Log in via Forgejo OAuth (the "Sign in with Forgejo" button).
|
||||
3. Click "Add Repository" — your new repo should appear.
|
||||
4. Click the toggle to activate it. Woodpecker will:
|
||||
- Add a webhook on the Forgejo repo (push, PR, release events).
|
||||
- Register the repo's `forge_remote_id` in its DB so subsequent
|
||||
hooks deserialize correctly.
|
||||
5. Push a commit (or hit "Run pipeline" in Woodpecker UI) — first
|
||||
build fires.
|
||||
|
||||
## Why API-only doesn't work
|
||||
|
||||
The webhook URL contains a JWT signed with a per-server key that's
|
||||
stored in the DB and only accessible at OAuth-flow time. POST'ing
|
||||
`/api/repos` as the admin (`ViktorBarzin` GitHub user) returns 500
|
||||
because the lookup queries forge-side OAuth state for THAT user,
|
||||
which doesn't exist for the Forgejo `viktor` user. We confirmed:
|
||||
|
||||
- Direct `POST /api/repos?forge_remote_id=N` → HTTP 500 server-side.
|
||||
- Generating a JWT with the agent secret → "token is unverifiable"
|
||||
on hook delivery (the signing key is repo-specific, not the
|
||||
global agent secret).
|
||||
|
||||
There's no admin endpoint that side-steps the OAuth flow.
|
||||
|
||||
## Bootstrap when UI access isn't available
|
||||
|
||||
If you absolutely need to bootstrap a new image without UI access
|
||||
(e.g., during an outage), the workaround is:
|
||||
|
||||
1. Build locally:
|
||||
```bash
|
||||
docker build -t forgejo.viktorbarzin.me/viktor/<name>:<tag> /path/to/source
|
||||
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
|
||||
```
|
||||
2. Or pull from another already-built source and retag:
|
||||
```bash
|
||||
docker pull viktorbarzin/<name>:<tag> # DockerHub
|
||||
docker tag viktorbarzin/<name>:<tag> forgejo.viktorbarzin.me/viktor/<name>:<tag>
|
||||
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
|
||||
```
|
||||
3. Flip the cluster `image=` reference and restart deployments.
|
||||
|
||||
Document the bootstrap in the relevant stack so future maintainers
|
||||
know the image was put there by hand. After Woodpecker UI onboarding,
|
||||
the next pipeline run replaces the bootstrap image with a CI-built one.
|
||||
|
||||
## Repos onboarded in flight 2026-05-07
|
||||
|
||||
These were created during the forgejo-registry-consolidation but the
|
||||
UI step above hasn't been done yet — their `.woodpecker.yml` /
|
||||
`.woodpecker/build.yml` exists on Forgejo but no pipeline fires:
|
||||
|
||||
- `viktor/broker-sync` — image bootstrapped via DockerHub (see
|
||||
`infra/stacks/wealthfolio/main.tf` comment).
|
||||
- `viktor/fire-planner` — image bootstrapped via local docker build.
|
||||
- `viktor/hmrc-sync`
|
||||
- `viktor/freedify`
|
||||
- `viktor/claude-agent-service`
|
||||
- `viktor/beadboard` — image bootstrapped via local docker build.
|
||||
- `viktor/claude-memory-mcp`
|
||||
|
||||
Walk through each in the Woodpecker UI to enable. Pipelines for
|
||||
already-onboarded repos (payslip-ingest, job-hunter, infra) fired
|
||||
correctly after the v3.13 → v3.14 upgrade.
|
||||
|
|
@ -3,12 +3,8 @@ networks:
|
|||
driver: bridge
|
||||
|
||||
services:
|
||||
# registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
|
||||
# Floating tags were swapping to regressed versions between GC runs. Upgrade
|
||||
# path: bump all six registry-* services in lockstep and bounce via
|
||||
# `systemctl restart docker-compose-registry.service`.
|
||||
registry-dockerhub:
|
||||
image: registry:2.8.3
|
||||
image: registry:2
|
||||
container_name: registry-dockerhub
|
||||
restart: always
|
||||
volumes:
|
||||
|
|
@ -26,7 +22,7 @@ services:
|
|||
start_period: 10s
|
||||
|
||||
registry-ghcr:
|
||||
image: registry:2.8.3
|
||||
image: registry:2
|
||||
container_name: registry-ghcr
|
||||
restart: always
|
||||
volumes:
|
||||
|
|
@ -42,7 +38,7 @@ services:
|
|||
start_period: 10s
|
||||
|
||||
registry-quay:
|
||||
image: registry:2.8.3
|
||||
image: registry:2
|
||||
container_name: registry-quay
|
||||
restart: always
|
||||
volumes:
|
||||
|
|
@ -58,7 +54,7 @@ services:
|
|||
start_period: 10s
|
||||
|
||||
registry-k8s:
|
||||
image: registry:2.8.3
|
||||
image: registry:2
|
||||
container_name: registry-k8s
|
||||
restart: always
|
||||
volumes:
|
||||
|
|
@ -74,7 +70,7 @@ services:
|
|||
start_period: 10s
|
||||
|
||||
registry-kyverno:
|
||||
image: registry:2.8.3
|
||||
image: registry:2
|
||||
container_name: registry-kyverno
|
||||
restart: always
|
||||
volumes:
|
||||
|
|
@ -89,26 +85,35 @@ services:
|
|||
retries: 3
|
||||
start_period: 10s
|
||||
|
||||
# registry-private decommissioned in Phase 4 of
|
||||
# forgejo-registry-consolidation 2026-05-07 — image migration completed,
|
||||
# cluster flipped to forgejo.viktorbarzin.me/viktor/<image>. The remaining
|
||||
# five services on this VM are pull-through caches for upstream registries.
|
||||
# After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the
|
||||
# VM frees ~2.6 GB. The tarball break-glass under
|
||||
# /opt/registry/data/private/_breakglass/ stays — it's how we recover
|
||||
# infra-ci if Forgejo ever goes fully down.
|
||||
registry-private:
|
||||
image: registry:2
|
||||
container_name: registry-private
|
||||
restart: always
|
||||
volumes:
|
||||
- /opt/registry/data/private:/var/lib/registry
|
||||
- /opt/registry/config-private.yml:/etc/docker/registry/config.yml:ro
|
||||
- /opt/registry/htpasswd:/auth/htpasswd:ro
|
||||
networks:
|
||||
- registry
|
||||
healthcheck:
|
||||
# 401 is expected (auth required) — any HTTP response means the registry is healthy
|
||||
test: ["CMD", "sh", "-c", "wget -qS -O /dev/null http://127.0.0.1:5000/v2/ 2>&1 | grep -q 'HTTP/'"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 10s
|
||||
|
||||
nginx:
|
||||
image: nginx:alpine
|
||||
container_name: registry-nginx
|
||||
restart: always
|
||||
# 5050 dropped Phase 4 of forgejo-registry-consolidation 2026-05-07.
|
||||
ports:
|
||||
- "5000:5000"
|
||||
- "5010:5010"
|
||||
- "5020:5020"
|
||||
- "5030:5030"
|
||||
- "5040:5040"
|
||||
- "5050:5050"
|
||||
volumes:
|
||||
- /opt/registry/nginx.conf:/etc/nginx/nginx.conf:ro
|
||||
- /opt/registry/tls:/etc/nginx/tls:ro
|
||||
|
|
@ -126,6 +131,8 @@ services:
|
|||
condition: service_healthy
|
||||
registry-kyverno:
|
||||
condition: service_healthy
|
||||
registry-private:
|
||||
condition: service_healthy
|
||||
healthcheck:
|
||||
test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
|
||||
interval: 30s
|
||||
|
|
|
|||
|
|
@ -1,33 +1,25 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Registry integrity scanner — two classes of brokenness.
|
||||
"""Finds and removes layer links that point to non-existent blobs.
|
||||
|
||||
1. Orphaned layer links: the cleanup-tags.sh + garbage-collect cycle can delete
|
||||
blob data while leaving _layers/ link files intact. The registry then returns
|
||||
HTTP 200 with 0 bytes for those layers (it finds the link, trusts the blob
|
||||
exists, but the data is gone). Containerd sees "unexpected EOF".
|
||||
Action: delete the orphan link so the next pull re-fetches cleanly.
|
||||
When the cleanup-tags.sh + garbage-collect cycle runs, it can delete blob data
|
||||
while leaving _layers/ link files intact. The registry then returns HTTP 200
|
||||
with 0 bytes for those layers (it finds the link, trusts the blob exists, but
|
||||
the data is gone). This causes containerd to fail with "unexpected EOF".
|
||||
|
||||
2. Orphaned OCI-index children: an image index (multi-platform manifest list)
|
||||
references child manifests by digest. If a child's blob has been deleted —
|
||||
by a cleanup-tags.sh tag rmtree followed by garbage-collect walking the
|
||||
children wrong (distribution/distribution#3324 class), or by an incomplete
|
||||
`buildx --push` whose partial blob was later purged by `uploadpurging` —
|
||||
the index survives but pulls fail with `manifest unknown`.
|
||||
Action: log loudly. Deleting an index is a conscious decision (the image
|
||||
was published; removing it breaks downstream consumers), so we surface
|
||||
the problem and leave repair to a human or to the rebuild runbook.
|
||||
This script walks all repositories, checks each layer link against the actual
|
||||
blobs directory, and removes any orphaned links. On next pull, the registry
|
||||
will re-fetch the missing blobs from the upstream registry.
|
||||
|
||||
Run after garbage-collect (Sunday 03:30) and daily (Mon-Sat 02:30).
|
||||
Run after garbage-collect (e.g., 3:15 AM Sunday) or daily.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.stdout.reconfigure(line_buffering=True)
|
||||
|
||||
parser = argparse.ArgumentParser(description="Scan registry for orphaned blobs and indexes")
|
||||
parser = argparse.ArgumentParser(description="Remove orphaned registry layer links")
|
||||
parser.add_argument("base", nargs="?", default="/opt/registry/data", help="Registry data directory")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Report but don't delete")
|
||||
args = parser.parse_args()
|
||||
|
|
@ -35,124 +27,39 @@ args = parser.parse_args()
|
|||
BASE = args.base
|
||||
DRY_RUN = args.dry_run
|
||||
|
||||
INDEX_MEDIA_TYPES = (
|
||||
"application/vnd.oci.image.index.v1+json",
|
||||
"application/vnd.docker.distribution.manifest.list.v2+json",
|
||||
)
|
||||
|
||||
# Only the private R/W registry is authoritative for every child of every
|
||||
# index it stores — we pushed those indexes ourselves, so a missing child is
|
||||
# always a bug (the 2026-04-13 + 2026-04-19 failure mode).
|
||||
#
|
||||
# Pull-through caches (dockerhub, ghcr, quay, k8s, kyverno) are ALLOWED to
|
||||
# have missing children: they only fetch what someone actually pulls.
|
||||
# Uncached arm64 / arm / attestation variants of a multi-platform index are
|
||||
# normal partial state, not orphans. Scanning them generates hundreds of
|
||||
# false-positive warnings — noise that would mask the real signal from the
|
||||
# private registry. Scan 2 is therefore private-only.
|
||||
INDEX_SCAN_REGISTRIES = ("private",)
|
||||
|
||||
total_layer_removed = 0
|
||||
total_layer_checked = 0
|
||||
total_index_scanned = 0
|
||||
total_index_orphans = 0
|
||||
|
||||
|
||||
def load_manifest_blob(blobs_root, digest_hex):
|
||||
blob_path = os.path.join(blobs_root, digest_hex[:2], digest_hex, "data")
|
||||
if not os.path.isfile(blob_path):
|
||||
return None
|
||||
try:
|
||||
with open(blob_path, "rb") as f:
|
||||
raw = f.read(1024 * 1024)
|
||||
except OSError:
|
||||
return None
|
||||
try:
|
||||
return json.loads(raw)
|
||||
except (json.JSONDecodeError, UnicodeDecodeError):
|
||||
return None
|
||||
|
||||
total_removed = 0
|
||||
total_checked = 0
|
||||
|
||||
for registry_name in sorted(os.listdir(BASE)):
|
||||
repos_dir = os.path.join(BASE, registry_name, "docker/registry/v2/repositories")
|
||||
blobs_root = os.path.join(BASE, registry_name, "docker/registry/v2/blobs/sha256")
|
||||
blobs_dir = os.path.join(BASE, registry_name, "docker/registry/v2/blobs")
|
||||
|
||||
if not os.path.isdir(repos_dir):
|
||||
continue
|
||||
|
||||
for root, _, _ in os.walk(repos_dir):
|
||||
# --- Scan 1: orphan layer links ----------------------------------------
|
||||
if root.endswith("/_layers/sha256"):
|
||||
repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
|
||||
for root, dirs, files in os.walk(repos_dir):
|
||||
if not root.endswith("/_layers/sha256"):
|
||||
continue
|
||||
|
||||
for digest_dir in os.listdir(root):
|
||||
link_file = os.path.join(root, digest_dir, "link")
|
||||
if not os.path.isfile(link_file):
|
||||
continue
|
||||
repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
|
||||
|
||||
total_layer_checked += 1
|
||||
blob_data = os.path.join(blobs_root, digest_dir[:2], digest_dir, "data")
|
||||
if os.path.isfile(blob_data):
|
||||
continue
|
||||
for digest_dir in os.listdir(root):
|
||||
link_file = os.path.join(root, digest_dir, "link")
|
||||
if not os.path.isfile(link_file):
|
||||
continue
|
||||
|
||||
total_checked += 1
|
||||
|
||||
# Check if the actual blob data exists
|
||||
blob_data = os.path.join(blobs_dir, "sha256", digest_dir[:2], digest_dir, "data")
|
||||
if not os.path.isfile(blob_data):
|
||||
prefix = "[DRY RUN] " if DRY_RUN else ""
|
||||
print(f"{prefix}[{registry_name}/{repo}] removing orphaned layer link: {digest_dir[:12]}...")
|
||||
if not DRY_RUN:
|
||||
# Remove the entire digest directory (contains the link file)
|
||||
import shutil
|
||||
shutil.rmtree(os.path.join(root, digest_dir))
|
||||
total_layer_removed += 1
|
||||
|
||||
# --- Scan 2: orphan OCI-index children (private registry only) --------
|
||||
elif root.endswith("/_manifests/revisions/sha256") and registry_name in INDEX_SCAN_REGISTRIES:
|
||||
repo = root.replace(repos_dir + "/", "").replace("/_manifests/revisions/sha256", "")
|
||||
|
||||
for digest_dir in os.listdir(root):
|
||||
# Manifest revision entry. Load the blob it points to.
|
||||
manifest = load_manifest_blob(blobs_root, digest_dir)
|
||||
if manifest is None:
|
||||
continue
|
||||
|
||||
media_type = manifest.get("mediaType", "")
|
||||
if media_type not in INDEX_MEDIA_TYPES:
|
||||
continue
|
||||
|
||||
total_index_scanned += 1
|
||||
|
||||
# Per-repo revision links — serving a child manifest via the API
|
||||
# requires <repo>/_manifests/revisions/sha256/<child-digest>/link
|
||||
# to exist. The blob data alone is not enough: cleanup-tags.sh
|
||||
# rmtrees tag dirs (which on 2.8.x also orphans the per-repo
|
||||
# revision links for index children), while the upstream blob
|
||||
# data survives in /blobs/. That's exactly the 2026-04-19
|
||||
# failure mode — the probe sees 404 even though the blob file
|
||||
# is still on disk.
|
||||
revisions_root = os.path.dirname(root) # …/_manifests/revisions
|
||||
for child in manifest.get("manifests", []):
|
||||
child_digest = child.get("digest", "")
|
||||
if not child_digest.startswith("sha256:"):
|
||||
continue
|
||||
child_hex = child_digest[len("sha256:"):]
|
||||
child_link = os.path.join(revisions_root, "sha256", child_hex, "link")
|
||||
if os.path.isfile(child_link):
|
||||
continue
|
||||
|
||||
platform = child.get("platform", {})
|
||||
arch = platform.get("architecture", "?")
|
||||
os_ = platform.get("os", "?")
|
||||
child_blob = os.path.join(blobs_root, child_hex[:2], child_hex, "data")
|
||||
blob_state = "blob-data-present" if os.path.isfile(child_blob) else "blob-data-gone"
|
||||
print(
|
||||
f"WARNING [{registry_name}/{repo}] ORPHAN INDEX: "
|
||||
f"{digest_dir[:12]} references missing child {child_hex[:12]} "
|
||||
f"({arch}/{os_}, {blob_state}) — registry returns 404, rebuild required"
|
||||
)
|
||||
total_index_orphans += 1
|
||||
|
||||
total_removed += 1
|
||||
|
||||
mode = "DRY RUN — " if DRY_RUN else ""
|
||||
print(f"\n{mode}Layer scan: checked {total_layer_checked} links, removed {total_layer_removed} orphaned.")
|
||||
print(f"{mode}Index scan: inspected {total_index_scanned} image indexes, found {total_index_orphans} orphaned children.")
|
||||
if total_index_orphans > 0:
|
||||
print(f"\nACTION REQUIRED: {total_index_orphans} orphan index child(ren) detected. "
|
||||
"See docs/runbooks/registry-rebuild-image.md — the affected image must be rebuilt "
|
||||
"(a registry DELETE on an index is a conscious decision, not an automated repair).")
|
||||
print(f"\n{mode}Checked {total_checked} layer links, removed {total_removed} orphaned.")
|
||||
|
|
|
|||
|
|
@ -33,9 +33,10 @@ http {
|
|||
keepalive 32;
|
||||
}
|
||||
|
||||
# `upstream private` removed in Phase 4 of forgejo-registry-consolidation
|
||||
# 2026-05-07. The /v2/ private registry is now Forgejo at
|
||||
# forgejo.viktorbarzin.me/viktor/.
|
||||
upstream private {
|
||||
server registry-private:5000;
|
||||
keepalive 32;
|
||||
}
|
||||
|
||||
# --- Docker Hub (port 5000) ---
|
||||
|
||||
|
|
@ -167,8 +168,37 @@ http {
|
|||
}
|
||||
}
|
||||
|
||||
# --- Private R/W Registry (port 5050) decommissioned Phase 4 2026-05-07 ---
|
||||
# The TLS port 5050 server block previously fronted `registry-private`.
|
||||
# Migrated to Forgejo at forgejo.viktorbarzin.me/viktor/. Both
|
||||
# docker-compose.yml and this nginx config no longer reference port 5050.
|
||||
# --- Private R/W Registry (port 5050, TLS) ---
|
||||
|
||||
server {
|
||||
listen 5050 ssl;
|
||||
server_name registry.viktorbarzin.me;
|
||||
|
||||
ssl_certificate /etc/nginx/tls/fullchain.pem;
|
||||
ssl_certificate_key /etc/nginx/tls/privkey.pem;
|
||||
ssl_protocols TLSv1.2 TLSv1.3;
|
||||
|
||||
client_max_body_size 0;
|
||||
proxy_request_buffering off;
|
||||
proxy_buffering off;
|
||||
chunked_transfer_encoding on;
|
||||
|
||||
location /v2/ {
|
||||
proxy_pass http://private;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header Connection "";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_read_timeout 900;
|
||||
proxy_send_timeout 900;
|
||||
}
|
||||
|
||||
location / {
|
||||
return 200 'ok';
|
||||
add_header Content-Type text/plain;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -40,9 +40,8 @@ variable "ingress_path" {
|
|||
default = ["/"]
|
||||
}
|
||||
variable "max_body_size" {
|
||||
type = string
|
||||
default = null
|
||||
description = "Maximum request body size, e.g. '5g'. null = no limit (Traefik default). When set, a per-ingress Buffering middleware is created and attached."
|
||||
type = string
|
||||
default = "50m"
|
||||
}
|
||||
variable "extra_annotations" {
|
||||
default = {}
|
||||
|
|
@ -149,19 +148,10 @@ locals {
|
|||
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
|
||||
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
|
||||
|
||||
# Emit the annotation when effective is true (positive signal), or when the
|
||||
# caller explicitly set external_monitor=false (opt-out). When the caller
|
||||
# leaves it null AND dns_type="none", emit nothing — the sync script's
|
||||
# default opt-in (any *.viktorbarzin.me ingress) keeps monitoring services
|
||||
# that are publicly reachable via routes we don't manage here (e.g.
|
||||
# helm-provisioned ingresses, services behind cloudflared tunnel with DNS
|
||||
# set elsewhere).
|
||||
external_monitor_annotations = local.effective_external_monitor ? merge(
|
||||
{ "uptime.viktorbarzin.me/external-monitor" = "true" },
|
||||
var.external_monitor_name != null ? { "uptime.viktorbarzin.me/external-monitor-name" = var.external_monitor_name } : {},
|
||||
) : (var.external_monitor == false ?
|
||||
{ "uptime.viktorbarzin.me/external-monitor" = "false" } : {}
|
||||
)
|
||||
) : {}
|
||||
|
||||
ns_to_group = {
|
||||
monitoring = "Infrastructure"
|
||||
|
|
@ -204,17 +194,6 @@ locals {
|
|||
"gethomepage.dev/href" = "https://${local.effective_host}"
|
||||
"gethomepage.dev/icon" = "${replace(var.name, "-", "")}.png"
|
||||
} : {}
|
||||
|
||||
# Parse "5g"/"50m"/"1024k"/"42" into bytes. Traefik's Buffering middleware
|
||||
# takes maxRequestBodyBytes as an integer. Empty unit = bytes.
|
||||
body_size_match = var.max_body_size == null ? null : regex("^([0-9]+)([kmgKMG]?)$", var.max_body_size)
|
||||
body_size_unit_multiplier = var.max_body_size == null ? 0 : (
|
||||
lower(local.body_size_match[1]) == "g" ? 1073741824 :
|
||||
lower(local.body_size_match[1]) == "m" ? 1048576 :
|
||||
lower(local.body_size_match[1]) == "k" ? 1024 :
|
||||
1
|
||||
)
|
||||
max_body_size_bytes = var.max_body_size == null ? 0 : tonumber(local.body_size_match[0]) * local.body_size_unit_multiplier
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -257,7 +236,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
|
|||
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
|
||||
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
|
||||
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
|
||||
var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,
|
||||
], var.extra_middlewares)))
|
||||
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
|
||||
}, local.homepage_defaults, var.extra_annotations,
|
||||
|
|
@ -315,27 +293,6 @@ resource "kubernetes_manifest" "custom_csp" {
|
|||
}
|
||||
}
|
||||
|
||||
# Buffering middleware - created per service when max_body_size is set.
|
||||
# Traefik default is unlimited; setting maxRequestBodyBytes enforces a limit
|
||||
# (e.g. Forgejo container pushes can ship multi-GB layer blobs).
|
||||
resource "kubernetes_manifest" "buffering" {
|
||||
count = var.max_body_size != null ? 1 : 0
|
||||
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "buffering-${var.name}"
|
||||
namespace = var.namespace
|
||||
}
|
||||
spec = {
|
||||
buffering = {
|
||||
maxRequestBodyBytes = local.max_body_size_bytes
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Cloudflare DNS records — created automatically when dns_type is set.
|
||||
# Proxied: CNAME to Cloudflare tunnel. Non-proxied: A + AAAA to public IP.
|
||||
resource "cloudflare_record" "proxied" {
|
||||
|
|
|
|||
|
|
@ -18,8 +18,4 @@ resource "kubernetes_secret" "tls_secret" {
|
|||
"tls.key" = var.tls_key == "" ? file("${path.root}/secrets/privkey.pem") : var.tls_key
|
||||
}
|
||||
type = "kubernetes.io/tls"
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: the sync-tls-secret policy stamps generate.kyverno.io/* + app.kubernetes.io/managed-by labels on this generated Secret
|
||||
ignore_changes = [metadata[0].labels]
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
#!/usr/bin/env bash
|
||||
|
||||
# Cluster health check script.
|
||||
# Runs 42 diagnostic checks against the Kubernetes cluster and prints
|
||||
# Runs 24 diagnostic checks against the Kubernetes cluster and prints
|
||||
# a colour-coded report with PASS / WARN / FAIL for each section.
|
||||
#
|
||||
# Usage: ./scripts/cluster_healthcheck.sh [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]
|
||||
|
|
@ -26,7 +26,7 @@ JSON=false
|
|||
KUBECONFIG_PATH="$(pwd)/config"
|
||||
KUBECTL=""
|
||||
JSON_RESULTS=()
|
||||
TOTAL_CHECKS=42
|
||||
TOTAL_CHECKS=30
|
||||
|
||||
# --- Helpers ---
|
||||
info() { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
|
||||
|
|
@ -71,16 +71,14 @@ parse_args() {
|
|||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fix) FIX=true; shift ;;
|
||||
--no-fix) FIX=false; shift ;;
|
||||
--quiet|-q) QUIET=true; shift ;;
|
||||
--json) JSON=true; shift ;;
|
||||
--kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
|
||||
-h|--help)
|
||||
echo "Usage: $0 [--fix|--no-fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
|
||||
echo "Usage: $0 [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
|
||||
echo ""
|
||||
echo "Flags:"
|
||||
echo " --fix Auto-remediate safe issues (delete evicted pods)"
|
||||
echo " --no-fix Disable auto-remediation (default)"
|
||||
echo " --quiet, -q Only show WARN and FAIL sections"
|
||||
echo " --json Machine-readable JSON output"
|
||||
echo " --kubeconfig PATH Override kubeconfig (default: \$(pwd)/config)"
|
||||
|
|
@ -1242,17 +1240,9 @@ check_overcommit() {
|
|||
HA_CACHE_DIR=""
|
||||
|
||||
ha_sofia_available() {
|
||||
if [[ -z "${HOME_ASSISTANT_SOFIA_URL:-}" ]]; then
|
||||
export HOME_ASSISTANT_SOFIA_URL="https://ha-sofia.viktorbarzin.me"
|
||||
if [[ -z "${HOME_ASSISTANT_SOFIA_URL:-}" ]] || [[ -z "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]]; then
|
||||
return 1
|
||||
fi
|
||||
if [[ -z "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]]; then
|
||||
if command -v vault >/dev/null 2>&1 && [[ -n "${VAULT_TOKEN:-}${HOME:-}" ]]; then
|
||||
local t
|
||||
t=$(vault kv get -field=haos_api_token secret/viktor 2>/dev/null || true)
|
||||
[[ -n "$t" ]] && export HOME_ASSISTANT_SOFIA_TOKEN="$t"
|
||||
fi
|
||||
fi
|
||||
[[ -n "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]] || return 1
|
||||
return 0
|
||||
}
|
||||
|
||||
|
|
@ -1760,616 +1750,6 @@ else:
|
|||
json_add "hardware_exporters" "$status" "${detail:-All healthy}"
|
||||
}
|
||||
|
||||
# Returns 0 if cert-manager CRDs are installed, 1 otherwise.
|
||||
cert_manager_installed() {
|
||||
$KUBECTL get crd certificates.cert-manager.io -o name >/dev/null 2>&1
|
||||
}
|
||||
|
||||
# --- 31. cert-manager: Certificate Readiness ---
|
||||
check_cert_manager_certificates() {
|
||||
section 31 "cert-manager — Certificate Readiness"
|
||||
local certs not_ready detail="" status="PASS"
|
||||
|
||||
if ! cert_manager_installed; then
|
||||
pass "cert-manager not installed — N/A"
|
||||
json_add "certmanager_certificates" "PASS" "N/A (cert-manager not installed)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
|
||||
warn "cert-manager CRDs installed but API query failed"
|
||||
json_add "certmanager_certificates" "WARN" "API query failed"
|
||||
return 0
|
||||
}
|
||||
|
||||
not_ready=$(echo "$certs" | python3 -c '
|
||||
import json, sys
|
||||
data = json.load(sys.stdin)
|
||||
for item in data.get("items", []):
|
||||
ns = item["metadata"]["namespace"]
|
||||
name = item["metadata"]["name"]
|
||||
conds = item.get("status", {}).get("conditions", [])
|
||||
ready = next((c for c in conds if c.get("type") == "Ready"), None)
|
||||
if not ready or ready.get("status") != "True":
|
||||
reason = ready.get("reason", "NoCondition") if ready else "NoCondition"
|
||||
print(f"{ns}/{name}:{reason}")
|
||||
' 2>/dev/null) || true
|
||||
|
||||
if [[ -z "$not_ready" ]]; then
|
||||
pass "All Certificate CRs Ready"
|
||||
json_add "certmanager_certificates" "PASS" "All Ready"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 31 "cert-manager — Certificate Readiness"
|
||||
local count
|
||||
count=$(count_lines "$not_ready")
|
||||
while IFS= read -r line; do
|
||||
fail "Certificate not Ready: $line"
|
||||
detail+="$line; "
|
||||
done <<< "$not_ready"
|
||||
status="FAIL"
|
||||
json_add "certmanager_certificates" "$status" "$count not Ready: $detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- 32. cert-manager: Certificate Expiry (<14d) ---
|
||||
check_cert_manager_expiry() {
|
||||
section 32 "cert-manager — Certificate Expiry (<14d)"
|
||||
local certs expiring detail="" status="PASS"
|
||||
|
||||
if ! cert_manager_installed; then
|
||||
pass "cert-manager not installed — N/A"
|
||||
json_add "certmanager_expiry" "PASS" "N/A (cert-manager not installed)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
|
||||
warn "cert-manager CRDs installed but API query failed"
|
||||
json_add "certmanager_expiry" "WARN" "API query failed"
|
||||
return 0
|
||||
}
|
||||
|
||||
expiring=$(echo "$certs" | python3 -c '
|
||||
import json, sys
|
||||
from datetime import datetime, timezone, timedelta
|
||||
data = json.load(sys.stdin)
|
||||
cutoff = datetime.now(timezone.utc) + timedelta(days=14)
|
||||
for item in data.get("items", []):
|
||||
ns = item["metadata"]["namespace"]
|
||||
name = item["metadata"]["name"]
|
||||
not_after = item.get("status", {}).get("notAfter")
|
||||
if not not_after:
|
||||
continue
|
||||
try:
|
||||
expiry = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
|
||||
if expiry < cutoff:
|
||||
days = (expiry - datetime.now(timezone.utc)).days
|
||||
level = "FAIL" if days <= 3 else "WARN"
|
||||
print(f"{level}:{ns}/{name}:{days}")
|
||||
except ValueError:
|
||||
pass
|
||||
' 2>/dev/null) || true
|
||||
|
||||
if [[ -z "$expiring" ]]; then
|
||||
pass "No Certificate CRs expiring within 14 days"
|
||||
json_add "certmanager_expiry" "PASS" "None expiring <14d"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 32 "cert-manager — Certificate Expiry (<14d)"
|
||||
while IFS= read -r line; do
|
||||
local level cert_name days
|
||||
level=$(echo "$line" | cut -d: -f1)
|
||||
cert_name=$(echo "$line" | cut -d: -f2)
|
||||
days=$(echo "$line" | cut -d: -f3)
|
||||
if [[ "$level" == "FAIL" ]]; then
|
||||
fail "Certificate $cert_name expires in ${days}d"
|
||||
status="FAIL"
|
||||
else
|
||||
warn "Certificate $cert_name expires in ${days}d"
|
||||
[[ "$status" != "FAIL" ]] && status="WARN"
|
||||
fi
|
||||
detail+="$cert_name=${days}d; "
|
||||
done <<< "$expiring"
|
||||
json_add "certmanager_expiry" "$status" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- 33. cert-manager: Failed CertificateRequests ---
|
||||
check_cert_manager_requests() {
|
||||
section 33 "cert-manager — Failed CertificateRequests"
|
||||
local requests failed detail="" status="PASS"
|
||||
|
||||
if ! cert_manager_installed; then
|
||||
pass "cert-manager not installed — N/A"
|
||||
json_add "certmanager_requests" "PASS" "N/A (cert-manager not installed)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
requests=$($KUBECTL get certificaterequests.cert-manager.io -A -o json 2>/dev/null) || {
|
||||
warn "cert-manager CRDs installed but API query failed"
|
||||
json_add "certmanager_requests" "WARN" "API query failed"
|
||||
return 0
|
||||
}
|
||||
|
||||
failed=$(echo "$requests" | python3 -c '
|
||||
import json, sys
|
||||
data = json.load(sys.stdin)
|
||||
for item in data.get("items", []):
|
||||
ns = item["metadata"]["namespace"]
|
||||
name = item["metadata"]["name"]
|
||||
conds = item.get("status", {}).get("conditions", [])
|
||||
for c in conds:
|
||||
if c.get("type") == "Ready" and c.get("status") == "False" and c.get("reason") == "Failed":
|
||||
print(f"{ns}/{name}:{c.get(\"message\", \"\")[:80]}")
|
||||
break
|
||||
' 2>/dev/null) || true
|
||||
|
||||
if [[ -z "$failed" ]]; then
|
||||
pass "No failed CertificateRequests"
|
||||
json_add "certmanager_requests" "PASS" "None failed"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 33 "cert-manager — Failed CertificateRequests"
|
||||
local count
|
||||
count=$(count_lines "$failed")
|
||||
while IFS= read -r line; do
|
||||
fail "CertificateRequest failed: $line"
|
||||
detail+="$line; "
|
||||
done <<< "$failed"
|
||||
status="FAIL"
|
||||
json_add "certmanager_requests" "$status" "$count failed: $detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- 34. Backup Freshness: Per-DB Dumps ---
|
||||
check_backup_per_db() {
|
||||
section 34 "Backup Freshness — Per-DB Dumps"
|
||||
local detail="" had_issue=false status="PASS"
|
||||
|
||||
# Freshness threshold: 25 hours
|
||||
local now_epoch max_age_sec
|
||||
now_epoch=$(date -u +%s)
|
||||
max_age_sec=$((25 * 3600))
|
||||
|
||||
_check_cronjob_fresh() {
|
||||
local ns="$1" cj="$2" label="$3"
|
||||
local ts age_sec
|
||||
ts=$($KUBECTL get cronjob -n "$ns" "$cj" -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null || true)
|
||||
if [[ -z "$ts" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
|
||||
fail "$label: CronJob $ns/$cj has no lastSuccessfulTime"
|
||||
detail+="${label}=no-success; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
return 0
|
||||
fi
|
||||
local ts_epoch
|
||||
ts_epoch=$(date -u -d "$ts" +%s 2>/dev/null || echo 0)
|
||||
age_sec=$((now_epoch - ts_epoch))
|
||||
if [[ "$age_sec" -gt "$max_age_sec" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
|
||||
local age_h=$((age_sec / 3600))
|
||||
fail "$label: last success ${age_h}h ago (>25h)"
|
||||
detail+="${label}=${age_h}h; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
else
|
||||
local age_h=$((age_sec / 3600))
|
||||
detail+="${label}=${age_h}h; "
|
||||
fi
|
||||
}
|
||||
|
||||
_check_cronjob_fresh dbaas mysql-backup-per-db mysql
|
||||
_check_cronjob_fresh dbaas postgresql-backup-per-db pg
|
||||
|
||||
[[ "$had_issue" == false ]] && pass "Per-DB dumps fresh — $detail"
|
||||
json_add "backup_per_db" "$status" "$detail"
|
||||
}
|
||||
|
||||
# --- 35. Backup Freshness: Offsite Sync ---
|
||||
check_backup_offsite_sync() {
|
||||
section 35 "Backup Freshness — Offsite Sync"
|
||||
local metrics detail="" status="PASS"
|
||||
|
||||
metrics=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true)
|
||||
|
||||
if [[ -z "$metrics" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
|
||||
warn "Cannot query Pushgateway"
|
||||
json_add "backup_offsite_sync" "WARN" "Pushgateway unreachable"
|
||||
return 0
|
||||
fi
|
||||
|
||||
local age_hours
|
||||
age_hours=$(echo "$metrics" | python3 -c '
|
||||
import sys, re, time
|
||||
ts = None
|
||||
for line in sys.stdin:
|
||||
if line.startswith("#"):
|
||||
continue
|
||||
if "backup_last_success_timestamp" in line and "offsite-backup-sync" in line:
|
||||
m = re.search(r"\s([0-9.eE+]+)\s*$", line.strip())
|
||||
if m:
|
||||
try:
|
||||
ts = float(m.group(1))
|
||||
break
|
||||
except ValueError:
|
||||
pass
|
||||
if ts is None:
|
||||
print("missing")
|
||||
else:
|
||||
age = (time.time() - ts) / 3600
|
||||
print(f"{age:.1f}")
|
||||
' 2>/dev/null) || age_hours="error"
|
||||
|
||||
if [[ "$age_hours" == "missing" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
|
||||
fail "backup_last_success_timestamp metric missing for offsite-backup-sync"
|
||||
json_add "backup_offsite_sync" "FAIL" "Metric missing"
|
||||
elif [[ "$age_hours" == "error" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
|
||||
warn "Failed to parse Pushgateway metric"
|
||||
json_add "backup_offsite_sync" "WARN" "Parse error"
|
||||
else
|
||||
local age_int
|
||||
age_int=$(printf '%.0f' "$age_hours")
|
||||
if [[ "$age_int" -gt 27 ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
|
||||
fail "Offsite sync last success ${age_hours}h ago (>27h)"
|
||||
status="FAIL"
|
||||
else
|
||||
pass "Offsite sync last success ${age_hours}h ago"
|
||||
fi
|
||||
detail="age=${age_hours}h"
|
||||
json_add "backup_offsite_sync" "$status" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- 36. Backup Freshness: LVM PVC Snapshots ---
|
||||
check_backup_lvm_snapshots() {
|
||||
section 36 "Backup Freshness — LVM PVC Snapshots"
|
||||
local snap_output detail="" status="PASS"
|
||||
|
||||
snap_output=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
|
||||
root@192.168.1.127 "lvs -o lv_name,lv_time --noheadings 2>/dev/null | grep _snap" 2>/dev/null || true)
|
||||
|
||||
if [[ -z "$snap_output" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
|
||||
warn "No LVM PVC snapshots found or SSH to 192.168.1.127 failed (BatchMode)"
|
||||
json_add "backup_lvm_snapshots" "WARN" "SSH failed or no snapshots"
|
||||
return 0
|
||||
fi
|
||||
|
||||
local newest_age_hours
|
||||
newest_age_hours=$(echo "$snap_output" | python3 -c '
|
||||
import sys, re, time
|
||||
from datetime import datetime
|
||||
newest = None
|
||||
for line in sys.stdin:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
parts = line.split(None, 1)
|
||||
if len(parts) < 2:
|
||||
continue
|
||||
date_str = parts[1].strip()
|
||||
# lv_time format: "2026-04-19 03:00:01 +0000" or similar
|
||||
for fmt in ("%Y-%m-%d %H:%M:%S %z", "%Y-%m-%d %H:%M:%S"):
|
||||
try:
|
||||
dt = datetime.strptime(date_str, fmt)
|
||||
ts = dt.timestamp()
|
||||
if newest is None or ts > newest:
|
||||
newest = ts
|
||||
break
|
||||
except ValueError:
|
||||
continue
|
||||
if newest is None:
|
||||
print("parse_error")
|
||||
else:
|
||||
age = (time.time() - newest) / 3600
|
||||
print(f"{age:.1f}")
|
||||
' 2>/dev/null) || newest_age_hours="error"
|
||||
|
||||
if [[ "$newest_age_hours" == "parse_error" || "$newest_age_hours" == "error" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
|
||||
warn "Could not parse LVM snapshot timestamps"
|
||||
json_add "backup_lvm_snapshots" "WARN" "Parse error"
|
||||
else
|
||||
local count age_int
|
||||
count=$(count_lines "$snap_output")
|
||||
age_int=$(printf '%.0f' "$newest_age_hours")
|
||||
if [[ "$age_int" -gt 25 ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
|
||||
fail "Newest LVM snapshot ${newest_age_hours}h old (>25h); $count total"
|
||||
status="FAIL"
|
||||
else
|
||||
pass "LVM snapshots fresh — $count total, newest ${newest_age_hours}h old"
|
||||
fi
|
||||
detail="count=$count newest=${newest_age_hours}h"
|
||||
json_add "backup_lvm_snapshots" "$status" "$detail"
|
||||
fi
|
||||
}
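# Illustrative only: the `lvs -o lv_name,lv_time --noheadings` lines parsed above look
# roughly like this (LV name and timestamp are examples):
#   vm-9999-pvc-0a1b2c3d_snap_20260419_0300  2026-04-19 03:00:01 +0000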
|
||||
|
||||
# --- 37. Monitoring: Prometheus + Alertmanager ---
|
||||
check_monitoring_prom_am() {
|
||||
section 37 "Monitoring — Prometheus + Alertmanager"
|
||||
local detail="" had_issue=false status="PASS"
|
||||
|
||||
# Prometheus /-/ready
|
||||
local prom_ready
|
||||
prom_ready=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- "http://localhost:9090/-/ready" 2>/dev/null || true)
|
||||
if echo "$prom_ready" | grep -qi "ready"; then
|
||||
detail+="prometheus=ready; "
|
||||
else
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
|
||||
fail "Prometheus /-/ready returned no Ready response"
|
||||
detail+="prometheus=not-ready; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
fi
|
||||
|
||||
# Alertmanager running pod count
|
||||
local am_running
|
||||
am_running=$($KUBECTL get pods -n monitoring --no-headers 2>/dev/null | \
|
||||
grep alertmanager | awk '$3 == "Running"' | wc -l | tr -d ' ')
|
||||
if [[ "$am_running" -gt 0 ]]; then
|
||||
detail+="alertmanager=${am_running} running; "
|
||||
else
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
|
||||
fail "Alertmanager: 0 Running pods"
|
||||
detail+="alertmanager=none-running; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
fi
|
||||
|
||||
[[ "$had_issue" == false ]] && pass "Prometheus Ready, $am_running Alertmanager pod(s) Running"
|
||||
json_add "monitoring_prom_am" "$status" "$detail"
|
||||
}
|
||||
|
||||
# --- 38. Monitoring: Vault Sealed Status ---
|
||||
check_monitoring_vault() {
|
||||
section 38 "Monitoring — Vault Sealed Status"
|
||||
local output detail="" status="PASS"
|
||||
|
||||
output=$($KUBECTL exec -n vault vault-0 -- \
|
||||
sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status' 2>&1 || true)
|
||||
|
||||
if [[ -z "$output" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
|
||||
fail "Cannot exec vault status on vault-0"
|
||||
json_add "monitoring_vault" "FAIL" "Exec failed"
|
||||
return 0
|
||||
fi
|
||||
|
||||
if echo "$output" | grep -qi "^Sealed[[:space:]]*false"; then
|
||||
pass "Vault unsealed"
|
||||
detail="sealed=false"
|
||||
json_add "monitoring_vault" "PASS" "$detail"
|
||||
elif echo "$output" | grep -qi "^Sealed[[:space:]]*true"; then
|
||||
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
|
||||
fail "Vault is SEALED — secrets unavailable"
|
||||
detail="sealed=true"
|
||||
status="FAIL"
|
||||
json_add "monitoring_vault" "$status" "$detail"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
|
||||
warn "Cannot parse vault status output"
|
||||
json_add "monitoring_vault" "WARN" "Parse error"
|
||||
fi
|
||||
}
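# Illustrative only: the `vault status` line the greps above key on (value is an example):
#   Sealed          false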
|
||||
|
||||
# --- 39. Monitoring: ClusterSecretStore Ready ---
|
||||
check_monitoring_css() {
|
||||
section 39 "Monitoring — ClusterSecretStore Ready"
|
||||
local css not_ready detail="" status="PASS"
|
||||
|
||||
css=$($KUBECTL get clustersecretstore -o json 2>/dev/null) || {
|
||||
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
|
||||
warn "ClusterSecretStore CRD not installed"
|
||||
json_add "monitoring_css" "WARN" "CRD missing"
|
||||
return 0
|
||||
}
|
||||
|
||||
not_ready=$(echo "$css" | python3 -c '
|
||||
import json, sys
|
||||
data = json.load(sys.stdin)
|
||||
for item in data.get("items", []):
|
||||
name = item["metadata"]["name"]
|
||||
conds = item.get("status", {}).get("conditions", [])
|
||||
ready = next((c for c in conds if c.get("type") == "Ready"), None)
|
||||
if not ready or ready.get("status") != "True":
|
||||
print(f"{name}:{ready.get(\"reason\", \"NoCondition\") if ready else \"NoCondition\"}")
|
||||
' 2>/dev/null) || true
|
||||
|
||||
if [[ -z "$not_ready" ]]; then
|
||||
local total
|
||||
total=$(echo "$css" | python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("items",[])))' 2>/dev/null || echo "?")
|
||||
pass "All $total ClusterSecretStores Ready"
|
||||
json_add "monitoring_css" "PASS" "$total Ready"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
|
||||
while IFS= read -r line; do
|
||||
fail "ClusterSecretStore not Ready: $line"
|
||||
detail+="$line; "
|
||||
done <<< "$not_ready"
|
||||
status="FAIL"
|
||||
json_add "monitoring_css" "$status" "$detail"
|
||||
fi
|
||||
}
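# Illustrative only: the Ready condition the parser above inspects has this shape
# (reason/message are examples):
#   {"type": "Ready", "status": "True", "reason": "Valid", "message": "store validated"}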
|
||||
|
||||
# --- 40. External Reachability: Cloudflared + Authentik Replicas ---
|
||||
check_external_replicas() {
|
||||
section 40 "External — Cloudflared + Authentik Replicas"
|
||||
local detail="" had_issue=false status="PASS"
|
||||
|
||||
# Cloudflared
|
||||
local cf_json cf_ready cf_desired
|
||||
cf_json=$($KUBECTL get deployment cloudflared -n cloudflared -o json 2>/dev/null || true)
|
||||
if [[ -z "$cf_json" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
|
||||
fail "Cloudflared deployment not found"
|
||||
detail+="cloudflared=missing; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
else
|
||||
cf_ready=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
|
||||
cf_desired=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
|
||||
if [[ "$cf_ready" != "$cf_desired" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
|
||||
fail "Cloudflared: $cf_ready/$cf_desired ready (external access degraded)"
|
||||
detail+="cloudflared=${cf_ready}/${cf_desired}; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
else
|
||||
detail+="cloudflared=${cf_ready}/${cf_desired}; "
|
||||
fi
|
||||
fi
|
||||
|
||||
# Authentik server (Helm chart names the deployment goauthentik-server)
|
||||
local auth_json auth_ready auth_desired
|
||||
auth_json=$($KUBECTL get deployment goauthentik-server -n authentik -o json 2>/dev/null || true)
|
||||
if [[ -z "$auth_json" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
|
||||
warn "goauthentik-server deployment not found in authentik namespace"
|
||||
detail+="authentik=missing; "
|
||||
had_issue=true
|
||||
[[ "$status" != "FAIL" ]] && status="WARN"
|
||||
else
|
||||
auth_ready=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
|
||||
auth_desired=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
|
||||
if [[ "$auth_ready" != "$auth_desired" ]]; then
|
||||
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
|
||||
fail "goauthentik-server: $auth_ready/$auth_desired ready (auth degraded)"
|
||||
detail+="authentik=${auth_ready}/${auth_desired}; "
|
||||
had_issue=true
|
||||
status="FAIL"
|
||||
else
|
||||
detail+="authentik=${auth_ready}/${auth_desired}; "
|
||||
fi
|
||||
fi
|
||||
|
||||
[[ "$had_issue" == false ]] && pass "Cloudflared + authentik-server at full replicas ($detail)"
|
||||
json_add "external_replicas" "$status" "$detail"
|
||||
}
|
||||
|
||||
# --- 41. External Reachability: ExternalAccessDivergence Alert ---
|
||||
check_external_divergence() {
|
||||
section 41 "External — ExternalAccessDivergence Alert"
|
||||
local alerts result detail="" status="PASS"
|
||||
|
||||
alerts=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- "http://localhost:9090/api/v1/alerts" 2>/dev/null || true)
|
||||
|
||||
if [[ -z "$alerts" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
|
||||
warn "Cannot query Prometheus alerts"
|
||||
json_add "external_divergence" "WARN" "Cannot query"
|
||||
return 0
|
||||
fi
|
||||
|
||||
result=$(echo "$alerts" | python3 -c '
|
||||
import json, sys
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
alerts = data.get("data", {}).get("alerts", []) if isinstance(data, dict) else data
|
||||
firing = [a for a in alerts
|
||||
if a.get("labels", {}).get("alertname") == "ExternalAccessDivergence"
|
||||
and a.get("state") == "firing"]
|
||||
if firing:
|
||||
hosts = [a.get("labels", {}).get("host") or a.get("labels", {}).get("service") or "?" for a in firing]
|
||||
print(f"{len(firing)}:" + ",".join(hosts))
|
||||
else:
|
||||
print("0:")
|
||||
except Exception as e:
|
||||
print(f"error:{e}")
|
||||
' 2>/dev/null) || result="error:parse"
|
||||
|
||||
if [[ "$result" == error:* ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
|
||||
warn "Failed to parse alerts JSON: ${result#error:}"
|
||||
json_add "external_divergence" "WARN" "Parse error"
|
||||
return 0
|
||||
fi
|
||||
|
||||
local count names
|
||||
count=$(echo "$result" | cut -d: -f1)
|
||||
names=$(echo "$result" | cut -d: -f2-)
|
||||
|
||||
if [[ "$count" -eq 0 ]]; then
|
||||
pass "ExternalAccessDivergence not firing"
|
||||
json_add "external_divergence" "PASS" "Not firing"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
|
||||
fail "ExternalAccessDivergence firing for $count target(s): $names"
|
||||
status="FAIL"
|
||||
detail="$count firing: $names"
|
||||
json_add "external_divergence" "$status" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- 42. External Reachability: Traefik 5xx Rate ---
|
||||
check_external_traefik_5xx() {
|
||||
section 42 "External — Traefik 5xx Rate (15m)"
|
||||
local query_result detail="" status="PASS"
|
||||
|
||||
query_result=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' 2>/dev/null || true)
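# For readability: the URL-encoded query above decodes to
#   topk(10, rate(traefik_service_requests_total{code=~"5.."}[15m]))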
|
||||
|
||||
if [[ -z "$query_result" ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
|
||||
warn "Cannot query Prometheus for traefik 5xx rate"
|
||||
json_add "external_traefik_5xx" "WARN" "Query failed"
|
||||
return 0
|
||||
fi
|
||||
|
||||
local parsed
|
||||
parsed=$(echo "$query_result" | python3 -c '
|
||||
import json, sys
|
||||
try:
|
||||
data = json.load(sys.stdin)
|
||||
results = data.get("data", {}).get("result", [])
|
||||
hot = [(r.get("metric", {}).get("service", "?"), float(r.get("value", [0, "0"])[1])) for r in results]
|
||||
hot = [(s, v) for s, v in hot if v > 0.01]  # 0.01 req/s threshold
|
||||
hot.sort(key=lambda x: -x[1])
|
||||
if not hot:
|
||||
print("0:")
|
||||
else:
|
||||
top = [f"{s}={v:.2f}/s" for s, v in hot[:5]]
|
||||
print(f"{len(hot)}:" + "; ".join(top))
|
||||
except Exception as e:
|
||||
print(f"error:{e}")
|
||||
' 2>/dev/null) || parsed="error:parse"
|
||||
|
||||
if [[ "$parsed" == error:* ]]; then
|
||||
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
|
||||
warn "Parse failed: ${parsed#error:}"
|
||||
json_add "external_traefik_5xx" "WARN" "Parse error"
|
||||
return 0
|
||||
fi
|
||||
|
||||
local count top
|
||||
count=$(echo "$parsed" | cut -d: -f1)
|
||||
top=$(echo "$parsed" | cut -d: -f2-)
|
||||
|
||||
if [[ "$count" -eq 0 ]]; then
|
||||
pass "No Traefik services with 5xx rate >0.01 req/s (last 15m)"
|
||||
json_add "external_traefik_5xx" "PASS" "None above threshold"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
|
||||
# WARN at any 5xx; FAIL if top service >1 req/s
|
||||
local top_rate
|
||||
top_rate=$(echo "$top" | grep -oE '[0-9.]+/s' | head -1 | tr -d '/s')
|
||||
if awk "BEGIN{exit !($top_rate > 1.0)}" 2>/dev/null; then
|
||||
fail "$count Traefik service(s) with elevated 5xx: $top"
|
||||
status="FAIL"
|
||||
else
|
||||
warn "$count Traefik service(s) emitting 5xx: $top"
|
||||
status="WARN"
|
||||
fi
|
||||
detail="$count services: $top"
|
||||
json_add "external_traefik_5xx" "$status" "$detail"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- Summary ---
|
||||
print_summary() {
|
||||
if [[ "$JSON" == true ]]; then
|
||||
|
|
@ -2452,18 +1832,6 @@ main() {
|
|||
check_ha_automations
|
||||
check_ha_system
|
||||
check_hardware_exporters
|
||||
check_cert_manager_certificates
|
||||
check_cert_manager_expiry
|
||||
check_cert_manager_requests
|
||||
check_backup_per_db
|
||||
check_backup_offsite_sync
|
||||
check_backup_lvm_snapshots
|
||||
check_monitoring_prom_am
|
||||
check_monitoring_vault
|
||||
check_monitoring_css
|
||||
check_external_replicas
|
||||
check_external_divergence
|
||||
check_external_traefik_5xx
|
||||
print_summary
|
||||
|
||||
# Exit code: 2 for failures, 1 for warnings, 0 for clean
|
||||
|
|
|
|||
|
|
@ -1,76 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# One-shot migration of every private image on registry.viktorbarzin.me to
|
||||
# Forgejo. Used as a stop-gap when the dual-push CI pipelines aren't
|
||||
# producing Forgejo images on their own (Forgejo-Woodpecker forge driver
|
||||
# context-deadline-exceeded issue, see bd code-d3y / 2026-05-07).
|
||||
#
|
||||
# Pulls each image from registry.viktorbarzin.me, retags, pushes to
|
||||
# forgejo.viktorbarzin.me/viktor/<name>:<tag> — preserving the blob bytes
|
||||
# verbatim so the cluster can flip image= without a rebuild.
|
||||
#
|
||||
# Run from any host with docker + network reach to BOTH registries. Auth
|
||||
# from `docker login` (~/.docker/config.json) — make sure both registries
|
||||
# are logged in:
|
||||
# docker login registry.viktorbarzin.me -u viktorbarzin
|
||||
# docker login forgejo.viktorbarzin.me -u viktor # use viktor PAT, not ci-pusher
|
||||
#
|
||||
# (ci-pusher CANNOT push to viktor/<image> — Forgejo container packages
|
||||
# are scoped to the pushing user. Only viktor's PAT can write to viktor/*.)
|
||||
#
|
||||
# After the script, the new image lives at
|
||||
# forgejo.viktorbarzin.me/viktor/<name>:<tag>
|
||||
# Phase 3 of the consolidation flips infra/stacks/<svc>/main.tf image=
|
||||
# to that path.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
OLD_REG=registry.viktorbarzin.me
|
||||
NEW_REG=forgejo.viktorbarzin.me/viktor
|
||||
|
||||
# Image list: <name>:<tag>. Generated 2026-05-07 from `grep -rEn 'image\s*=\s*
|
||||
# "registry\.viktorbarzin\.me'` across infra/stacks/.
|
||||
#
|
||||
# Excluded:
|
||||
# - wealthfolio-sync: registry repo exists but has 0 tags (CronJob has been
|
||||
# broken for 36+ days, separate decision needed). User to triage before
|
||||
# migration.
|
||||
# - fire-planner: registry repo exists but has 0 tags. Dockerfile + CI added
|
||||
# in this session (commit 8b53d99e); rebuild via Woodpecker before flipping.
|
||||
IMAGES=(
|
||||
"chrome-service-novnc:v4"
|
||||
"chrome-service-novnc:latest"
|
||||
"payslip-ingest:latest"
|
||||
"job-hunter:latest"
|
||||
"claude-agent-service:latest"
|
||||
"freedify:latest"
|
||||
"beadboard:latest"
|
||||
"infra-ci:latest"
|
||||
)
|
||||
|
||||
for img in "${IMAGES[@]}"; do
|
||||
echo "=== $img ==="
|
||||
src="$OLD_REG/$img"
|
||||
dst="$NEW_REG/$img"
|
||||
|
||||
if ! docker pull "$src" 2>&1 | tee /tmp/pull-$$ | grep -q 'Status: '; then
|
||||
if grep -q 'not found' /tmp/pull-$$; then
|
||||
echo " SKIP — image not present in source registry"
|
||||
rm -f /tmp/pull-$$
|
||||
continue
|
||||
fi
|
||||
fi
|
||||
rm -f /tmp/pull-$$
|
||||
|
||||
echo " tag → $dst"
|
||||
docker tag "$src" "$dst"
|
||||
|
||||
echo " push $dst"
|
||||
docker push "$dst" 2>&1 | tail -2
|
||||
|
||||
echo " cleanup local copy"
|
||||
docker rmi "$src" "$dst" 2>&1 | tail -1 || true
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "Done. Verify in Forgejo Web UI: https://forgejo.viktorbarzin.me/viktor/-/packages?type=container"
|
||||
echo "Phase 3 of the plan flips infra/stacks/{wealthfolio,fire-planner}/main.tf image= references."
|
||||
|
|
@ -1,469 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# lvm-pvc-snapshot — LVM thin snapshot management for Proxmox CSI PVCs
|
||||
# Deploy to PVE host at /usr/local/bin/lvm-pvc-snapshot
|
||||
set -euo pipefail
|
||||
|
||||
# --- Configuration ---
|
||||
VG="pve"
|
||||
THINPOOL="data"
|
||||
SNAP_SUFFIX_FORMAT="%Y%m%d_%H%M"
|
||||
RETENTION_DAYS=7
|
||||
MIN_FREE_PCT=10
|
||||
PUSHGATEWAY="${LVM_SNAP_PUSHGATEWAY:-http://10.0.20.100:30091}"
|
||||
PUSHGATEWAY_JOB="lvm-pvc-snapshot"
|
||||
LOCKFILE="/run/lvm-pvc-snapshot.lock"
|
||||
KUBECONFIG="${KUBECONFIG:-/root/.kube/config}"
|
||||
export KUBECONFIG
|
||||
|
||||
# Namespaces to exclude from snapshots (high-churn, have app-level dumps)
|
||||
# These PVCs cause significant CoW write amplification (~36% overhead)
|
||||
EXCLUDE_NAMESPACES="${LVM_SNAP_EXCLUDE_NS:-dbaas,monitoring}"
|
||||
|
||||
# --- Logging ---
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
|
||||
warn() { log "WARN: $*" >&2; }
|
||||
die() { log "FATAL: $*" >&2; exit 1; }
|
||||
|
||||
# --- Helpers ---
|
||||
|
||||
get_thinpool_free_pct() {
|
||||
local data_pct
|
||||
data_pct=$(lvs --noheadings --nosuffix -o data_percent "${VG}/${THINPOOL}" 2>/dev/null | tr -d ' ')
|
||||
echo "scale=2; 100 - ${data_pct}" | bc
|
||||
}
|
||||
|
||||
build_exclude_lv_list() {
|
||||
# Query K8s for PVs in excluded namespaces, extract their LV names
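# Illustrative only: the jq below keeps the last '/'-separated component of each
# volumeHandle as the LV name, e.g. (hypothetical handle)
#   ".../local-lvm/vm-9999-pvc-0a1b2c3d" -> "vm-9999-pvc-0a1b2c3d"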
|
||||
if [[ -z "${EXCLUDE_NAMESPACES}" ]] || ! command -v kubectl &>/dev/null; then
|
||||
return
|
||||
fi
|
||||
kubectl get pv -o json 2>/dev/null | jq -r --arg ns "${EXCLUDE_NAMESPACES}" '
|
||||
($ns | split(",")) as $excl |
|
||||
.items[] |
|
||||
select(.spec.csi.driver == "csi.proxmox.sinextra.dev") |
|
||||
select(.spec.claimRef.namespace as $n | $excl | index($n)) |
|
||||
.spec.csi.volumeHandle | split("/") | last
|
||||
' 2>/dev/null || true
|
||||
}
|
||||
|
||||
discover_pvc_lvs() {
|
||||
# List thin LVs matching PVC pattern, excluding snapshots, pre-restore backups,
|
||||
# and LVs belonging to excluded namespaces (high-churn databases/metrics)
|
||||
local all_lvs exclude_lvs
|
||||
all_lvs=$(lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
|
||||
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
|
||||
| grep -E '^vm-[0-9]+-pvc-' \
|
||||
| grep -v '_snap_' \
|
||||
| grep -v '_pre_restore_')
|
||||
|
||||
exclude_lvs=$(build_exclude_lv_list)
|
||||
|
||||
if [[ -n "${exclude_lvs}" ]]; then
|
||||
# Filter out excluded LVs
|
||||
local exclude_pattern
|
||||
exclude_pattern=$(echo "${exclude_lvs}" | paste -sd'|' -)
|
||||
echo "${all_lvs}" | grep -vE "(${exclude_pattern})" || true
|
||||
else
|
||||
echo "${all_lvs}"
|
||||
fi
|
||||
}
|
||||
|
||||
list_snapshots() {
|
||||
lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
|
||||
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
|
||||
| grep '_snap_' || true
|
||||
}
|
||||
|
||||
parse_snap_timestamp() {
|
||||
# Extract YYYYMMDD_HHMM from snapshot name, convert to epoch
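# e.g. (illustrative name) "vm-9999-pvc-0a1b2c3d_snap_20260419_0300" -> epoch of 2026-04-19 03:00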
|
||||
local snap_name="$1"
|
||||
local ts_str
|
||||
ts_str=$(echo "${snap_name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
|
||||
if [[ -z "${ts_str}" ]]; then
|
||||
echo "0"
|
||||
return
|
||||
fi
|
||||
local ymd="${ts_str:0:8}"
|
||||
local hm="${ts_str:9:4}"
|
||||
date -d "${ymd:0:4}-${ymd:4:2}-${ymd:6:2} ${hm:0:2}:${hm:2:2}" +%s 2>/dev/null || echo "0"
|
||||
}
|
||||
|
||||
get_original_lv_from_snap() {
|
||||
# vm-200-pvc-abc_snap_20260403_1200 -> vm-200-pvc-abc
|
||||
echo "$1" | sed 's/_snap_[0-9]\{8\}_[0-9]\{4\}$//'
|
||||
}
|
||||
|
||||
push_metrics() {
|
||||
local status="$1" created="$2" failed="$3" pruned="$4"
|
||||
local free_pct
|
||||
free_pct=$(get_thinpool_free_pct)
|
||||
|
||||
cat <<METRICS | curl -sf --connect-timeout 5 --max-time 10 --data-binary @- \
|
||||
"${PUSHGATEWAY}/metrics/job/${PUSHGATEWAY_JOB}" 2>/dev/null || warn "Failed to push metrics to Pushgateway"
|
||||
# HELP lvm_snapshot_last_run_timestamp Unix timestamp of last snapshot run
|
||||
# TYPE lvm_snapshot_last_run_timestamp gauge
|
||||
lvm_snapshot_last_run_timestamp $(date +%s)
|
||||
# HELP lvm_snapshot_last_status Exit status (0=success, 1=partial failure, 2=aborted)
|
||||
# TYPE lvm_snapshot_last_status gauge
|
||||
lvm_snapshot_last_status ${status}
|
||||
# HELP lvm_snapshot_created_total Number of snapshots created in last run
|
||||
# TYPE lvm_snapshot_created_total gauge
|
||||
lvm_snapshot_created_total ${created}
|
||||
# HELP lvm_snapshot_failed_total Number of snapshot failures in last run
|
||||
# TYPE lvm_snapshot_failed_total gauge
|
||||
lvm_snapshot_failed_total ${failed}
|
||||
# HELP lvm_snapshot_pruned_total Number of snapshots pruned in last run
|
||||
# TYPE lvm_snapshot_pruned_total gauge
|
||||
lvm_snapshot_pruned_total ${pruned}
|
||||
# HELP lvm_snapshot_thinpool_free_pct Thin pool free percentage
|
||||
# TYPE lvm_snapshot_thinpool_free_pct gauge
|
||||
lvm_snapshot_thinpool_free_pct ${free_pct}
|
||||
METRICS
|
||||
}
|
||||
|
||||
# --- Subcommands ---
|
||||
|
||||
cmd_snapshot() {
|
||||
log "Starting PVC LVM thin snapshot run"
|
||||
|
||||
# Check thin pool free space
|
||||
local free_pct
|
||||
free_pct=$(get_thinpool_free_pct)
|
||||
log "Thin pool free space: ${free_pct}%"
|
||||
if (( $(echo "${free_pct} < ${MIN_FREE_PCT}" | bc -l) )); then
|
||||
warn "Thin pool has only ${free_pct}% free (minimum: ${MIN_FREE_PCT}%). Aborting."
|
||||
push_metrics 2 0 0 0
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Discover PVC LVs
|
||||
local lvs_list
|
||||
lvs_list=$(discover_pvc_lvs)
|
||||
if [[ -z "${lvs_list}" ]]; then
|
||||
warn "No PVC LVs found matching pattern"
|
||||
push_metrics 2 0 0 0
|
||||
exit 1
|
||||
fi
|
||||
|
||||
local count=0 failed=0 total
|
||||
total=$(echo "${lvs_list}" | wc -l | tr -d ' ')
|
||||
local snap_ts
|
||||
snap_ts=$(date +"${SNAP_SUFFIX_FORMAT}")
|
||||
|
||||
log "Found ${total} PVC LVs to snapshot"
|
||||
|
||||
while IFS= read -r lv; do
|
||||
local snap_name="${lv}_snap_${snap_ts}"
|
||||
if lvcreate -s -kn -n "${snap_name}" "${VG}/${lv}" >/dev/null 2>&1; then
|
||||
log " Created: ${snap_name}"
|
||||
count=$((count + 1))
|
||||
else
|
||||
warn " Failed to create snapshot for ${lv}"
|
||||
failed=$((failed + 1))
|
||||
fi
|
||||
done <<< "${lvs_list}"
|
||||
|
||||
log "Snapshot run complete: ${count} created, ${failed} failed out of ${total}"
|
||||
|
||||
# Auto-prune
|
||||
log "Running auto-prune..."
|
||||
local pruned
|
||||
pruned=$(cmd_prune_count)
|
||||
|
||||
# Determine status
|
||||
local status=0
|
||||
if (( failed > 0 && count > 0 )); then
|
||||
status=1 # partial
|
||||
elif (( failed > 0 && count == 0 )); then
|
||||
status=2 # all failed
|
||||
fi
|
||||
|
||||
push_metrics "${status}" "${count}" "${failed}" "${pruned}"
|
||||
log "Done"
|
||||
}
|
||||
|
||||
cmd_list() {
|
||||
printf "%-45s %-50s %8s %8s\n" "ORIGINAL LV" "SNAPSHOT" "AGE" "DATA%"
|
||||
printf "%-45s %-50s %8s %8s\n" "-----------" "--------" "---" "-----"
|
||||
|
||||
local now
|
||||
now=$(date +%s)
|
||||
|
||||
local snap_lines
|
||||
snap_lines=$(lvs --noheadings --nosuffix -o lv_name,lv_size,data_percent "${VG}" 2>/dev/null \
|
||||
| grep -E '_snap_|_pre_restore_' || true)
|
||||
|
||||
if [[ -z "${snap_lines}" ]]; then
|
||||
echo "(no snapshots found)"
|
||||
return
|
||||
fi
|
||||
|
||||
echo "${snap_lines}" | while read -r name size data_pct; do
|
||||
local original age_str ts epoch
|
||||
if [[ "${name}" == *"_pre_restore_"* ]]; then
|
||||
original=$(echo "${name}" | sed 's/_pre_restore_[0-9]\{8\}_[0-9]\{4\}$//')
|
||||
ts=$(echo "${name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
|
||||
else
|
||||
original=$(get_original_lv_from_snap "${name}")
|
||||
ts=$(echo "${name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
|
||||
fi
|
||||
epoch=$(parse_snap_timestamp "${name}")
|
||||
if (( epoch > 0 )); then
|
||||
local age_s=$(( now - epoch ))
|
||||
local days=$(( age_s / 86400 ))
|
||||
local hours=$(( (age_s % 86400) / 3600 ))
|
||||
age_str="${days}d${hours}h"
|
||||
else
|
||||
age_str="unknown"
|
||||
fi
|
||||
printf "%-45s %-50s %8s %7s%%\n" "${original}" "${name}" "${age_str}" "${data_pct}"
|
||||
done
|
||||
}
|
||||
|
||||
cmd_prune() {
|
||||
local pruned
|
||||
pruned=$(cmd_prune_count)
|
||||
log "Pruned ${pruned} expired snapshots"
|
||||
}
|
||||
|
||||
cmd_prune_count() {
|
||||
# NOTE: stdout of this function is captured by callers (`pruned=$(cmd_prune_count)`),
|
||||
# so all log/warn output must go to stderr — the only thing on stdout is the count.
|
||||
local now cutoff pruned=0
|
||||
now=$(date +%s)
|
||||
cutoff=$(( now - RETENTION_DAYS * 86400 ))
|
||||
|
||||
local snaps
|
||||
snaps=$(lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
|
||||
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
|
||||
| grep -E '_snap_|_pre_restore_' || true)
|
||||
|
||||
if [[ -z "${snaps}" ]]; then
|
||||
echo "0"
|
||||
return
|
||||
fi
|
||||
|
||||
while IFS= read -r snap; do
|
||||
local epoch
|
||||
epoch=$(parse_snap_timestamp "${snap}")
|
||||
if (( epoch > 0 && epoch < cutoff )); then
|
||||
if lvremove -f "${VG}/${snap}" >/dev/null 2>&1; then
|
||||
log " Pruned: ${snap}" >&2
|
||||
pruned=$((pruned + 1))
|
||||
else
|
||||
warn " Failed to prune: ${snap}"
|
||||
fi
|
||||
fi
|
||||
done <<< "${snaps}"
|
||||
|
||||
echo "${pruned}"
|
||||
}
|
||||
|
||||
cmd_restore() {
|
||||
local pvc_lv="${1:-}" snapshot_lv="${2:-}"
|
||||
|
||||
if [[ -z "${pvc_lv}" || -z "${snapshot_lv}" ]]; then
|
||||
die "Usage: $0 restore <pvc-lv-name> <snapshot-lv-name>"
|
||||
fi
|
||||
|
||||
# Validate LVs exist
|
||||
if ! lvs "${VG}/${pvc_lv}" >/dev/null 2>&1; then
|
||||
die "PVC LV '${pvc_lv}' not found in VG '${VG}'"
|
||||
fi
|
||||
if ! lvs "${VG}/${snapshot_lv}" >/dev/null 2>&1; then
|
||||
die "Snapshot LV '${snapshot_lv}' not found in VG '${VG}'"
|
||||
fi
|
||||
|
||||
# Discover K8s context
|
||||
log "Discovering Kubernetes context for LV '${pvc_lv}'..."
|
||||
|
||||
local volume_handle="local-lvm:${pvc_lv}"
|
||||
local pv_info
|
||||
pv_info=$(kubectl get pv -o json 2>/dev/null | jq -r \
|
||||
--arg vh "${volume_handle}" \
|
||||
'.items[] | select(.spec.csi.volumeHandle == $vh) | "\(.metadata.name) \(.spec.claimRef.namespace) \(.spec.claimRef.name)"' \
|
||||
) || die "Failed to query PVs (is kubectl configured?)"
|
||||
|
||||
if [[ -z "${pv_info}" ]]; then
|
||||
die "No PV found with volumeHandle '${volume_handle}'"
|
||||
fi
|
||||
|
||||
local pv_name pvc_ns pvc_name
|
||||
read -r pv_name pvc_ns pvc_name <<< "${pv_info}"
|
||||
log "Found: PV=${pv_name}, PVC=${pvc_ns}/${pvc_name}"
|
||||
|
||||
# Find the workload (Deployment or StatefulSet) that uses this PVC
|
||||
local workload_type="" workload_name="" original_replicas=""
|
||||
|
||||
# Check StatefulSets first (databases use these)
|
||||
local sts_info
|
||||
sts_info=$(kubectl get statefulset -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
|
||||
--arg pvc "${pvc_name}" \
|
||||
'.items[] | select(
|
||||
(.spec.template.spec.volumes // [] | .[].persistentVolumeClaim.claimName == $pvc) or
|
||||
(.spec.volumeClaimTemplates // [] | .[].metadata.name as $vct |
|
||||
.spec.replicas as $r | range($r) | "\($vct)-\(.metadata.name)-\(.)" ) == $pvc
|
||||
) | "\(.metadata.name) \(.spec.replicas)"' 2>/dev/null \
|
||||
) || true
|
||||
|
||||
# If not found via simple volume check, try matching VCT naming pattern
|
||||
if [[ -z "${sts_info}" ]]; then
|
||||
sts_info=$(kubectl get statefulset -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
|
||||
--arg pvc "${pvc_name}" \
|
||||
'.items[] | .metadata.name as $sts | .spec.replicas as $r |
|
||||
select(.spec.volumeClaimTemplates != null) |
|
||||
.spec.volumeClaimTemplates[].metadata.name as $vct |
|
||||
[range($r)] | map("\($vct)-\($sts)-\(.)") |
|
||||
if any(. == $pvc) then "\($sts) \($r)" else empty end' 2>/dev/null \
|
||||
) || true
|
||||
fi
|
||||
|
||||
if [[ -n "${sts_info}" ]]; then
|
||||
read -r workload_name original_replicas <<< "${sts_info}"
|
||||
workload_type="statefulset"
|
||||
else
|
||||
# Check Deployments
|
||||
local deploy_info
|
||||
deploy_info=$(kubectl get deployment -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
|
||||
--arg pvc "${pvc_name}" \
|
||||
'.items[] | select(
|
||||
.spec.template.spec.volumes // [] | .[].persistentVolumeClaim.claimName == $pvc
|
||||
) | "\(.metadata.name) \(.spec.replicas)"' 2>/dev/null \
|
||||
) || true
|
||||
|
||||
if [[ -n "${deploy_info}" ]]; then
|
||||
read -r workload_name original_replicas <<< "${deploy_info}"
|
||||
workload_type="deployment"
|
||||
fi
|
||||
fi
|
||||
|
||||
if [[ -z "${workload_type}" ]]; then
|
||||
warn "Could not auto-discover workload for PVC '${pvc_name}' in namespace '${pvc_ns}'."
|
||||
warn "You may need to scale down the pod manually."
|
||||
echo ""
|
||||
read -rp "Continue with LV swap anyway? (yes/no): " confirm
|
||||
[[ "${confirm}" == "yes" ]] || die "Aborted by user"
|
||||
workload_type="manual"
|
||||
fi
|
||||
|
||||
# Dry-run output
|
||||
local backup_name="${pvc_lv}_pre_restore_$(date +"${SNAP_SUFFIX_FORMAT}")"
|
||||
echo ""
|
||||
echo "╔══════════════════════════════════════════════════════════════╗"
|
||||
echo "║ RESTORE DRY-RUN ║"
|
||||
echo "╠══════════════════════════════════════════════════════════════╣"
|
||||
echo "║ PVC: ${pvc_ns}/${pvc_name}"
|
||||
echo "║ PV: ${pv_name}"
|
||||
if [[ "${workload_type}" != "manual" ]]; then
|
||||
echo "║ Workload: ${workload_type}/${workload_name} (replicas: ${original_replicas}→0→${original_replicas})"
|
||||
fi
|
||||
echo "║"
|
||||
echo "║ Actions:"
|
||||
if [[ "${workload_type}" != "manual" ]]; then
|
||||
echo "║ 1. Scale ${workload_type}/${workload_name} to 0 replicas"
|
||||
echo "║ 2. Wait for pod termination"
|
||||
fi
|
||||
echo "║ 3. Rename ${pvc_lv} → ${backup_name}"
|
||||
echo "║ 4. Rename ${snapshot_lv} → ${pvc_lv}"
|
||||
if [[ "${workload_type}" != "manual" ]]; then
|
||||
echo "║ 5. Scale ${workload_type}/${workload_name} back to ${original_replicas} replicas"
|
||||
fi
|
||||
echo "╚══════════════════════════════════════════════════════════════╝"
|
||||
echo ""
|
||||
|
||||
# Interactive confirmation
|
||||
read -rp "Type 'yes' to proceed with restore: " confirm
|
||||
if [[ "${confirm}" != "yes" ]]; then
|
||||
die "Aborted by user"
|
||||
fi
|
||||
|
||||
# Scale down
|
||||
if [[ "${workload_type}" != "manual" ]]; then
|
||||
log "Scaling ${workload_type}/${workload_name} to 0 replicas..."
|
||||
kubectl scale "${workload_type}/${workload_name}" -n "${pvc_ns}" --replicas=0
|
||||
|
||||
log "Waiting for pod termination (timeout: 120s)..."
|
||||
kubectl wait --for=delete pod -l "app.kubernetes.io/name=${workload_name}" -n "${pvc_ns}" --timeout=120s 2>/dev/null || \
|
||||
kubectl wait --for=delete pod -l "app=${workload_name}" -n "${pvc_ns}" --timeout=120s 2>/dev/null || \
|
||||
warn "Timeout waiting for pods — continuing anyway (LV may still be in use)"
|
||||
sleep 5 # extra grace period for device detach
|
||||
fi
|
||||
|
||||
# Verify LV is not active
|
||||
local lv_active
|
||||
lv_active=$(lvs --noheadings -o lv_active "${VG}/${pvc_lv}" 2>/dev/null | tr -d ' ')
|
||||
if [[ "${lv_active}" == "active" ]]; then
|
||||
warn "LV ${pvc_lv} is still active. Attempting to deactivate..."
|
||||
# Close any LUKS mapper on the LV before deactivation
|
||||
if dmsetup ls 2>/dev/null | grep -q "${pvc_lv}"; then
|
||||
log "Closing LUKS mapper for ${pvc_lv}..."
|
||||
cryptsetup luksClose "${pvc_lv}" 2>/dev/null || true
|
||||
fi
|
||||
lvchange -an "${VG}/${pvc_lv}" 2>/dev/null || warn "Could not deactivate — proceeding with caution"
|
||||
fi
|
||||
|
||||
# LV swap
|
||||
log "Renaming ${pvc_lv} → ${backup_name}"
|
||||
lvrename "${VG}" "${pvc_lv}" "${backup_name}" || die "Failed to rename original LV"
|
||||
|
||||
log "Renaming ${snapshot_lv} → ${pvc_lv}"
|
||||
lvrename "${VG}" "${snapshot_lv}" "${pvc_lv}" || die "Failed to rename snapshot LV"
|
||||
|
||||
# Scale back up
|
||||
if [[ "${workload_type}" != "manual" ]]; then
|
||||
log "Scaling ${workload_type}/${workload_name} back to ${original_replicas} replicas..."
|
||||
kubectl scale "${workload_type}/${workload_name}" -n "${pvc_ns}" --replicas="${original_replicas}"
|
||||
|
||||
log "Waiting for pod to become Ready (timeout: 300s)..."
|
||||
kubectl wait --for=condition=Ready pod -l "app.kubernetes.io/name=${workload_name}" -n "${pvc_ns}" --timeout=300s 2>/dev/null || \
|
||||
kubectl wait --for=condition=Ready pod -l "app=${workload_name}" -n "${pvc_ns}" --timeout=300s 2>/dev/null || \
|
||||
warn "Timeout waiting for pod Ready — check manually"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
log "Restore complete!"
|
||||
log "Old data preserved as: ${backup_name}"
|
||||
log "To delete old data after verification: lvremove -f ${VG}/${backup_name}"
|
||||
}
|
||||
|
||||
# --- Main ---
|
||||
|
||||
usage() {
|
||||
cat <<EOF
|
||||
Usage: $(basename "$0") <command> [args]
|
||||
|
||||
Commands:
|
||||
snapshot Create thin snapshots of all PVC LVs
|
||||
list List existing snapshots with age and data%
|
||||
prune Remove snapshots older than ${RETENTION_DAYS} days
|
||||
restore <lv> <snap> Restore a PVC from a snapshot (interactive)
|
||||
|
||||
Environment:
|
||||
LVM_SNAP_PUSHGATEWAY Pushgateway URL (default: ${PUSHGATEWAY})
|
||||
KUBECONFIG Kubeconfig path (default: /root/.kube/config)
|
||||
EOF
|
||||
}
|
||||
|
||||
main() {
|
||||
local cmd="${1:-}"
|
||||
shift || true
|
||||
|
||||
# Acquire lock (except for list which is read-only)
|
||||
if [[ "${cmd}" != "list" && "${cmd}" != "" && "${cmd}" != "help" && "${cmd}" != "--help" && "${cmd}" != "-h" ]]; then
|
||||
exec 200>"${LOCKFILE}"
|
||||
if ! flock -n 200; then
|
||||
die "Another instance is already running (lockfile: ${LOCKFILE})"
|
||||
fi
|
||||
fi
|
||||
|
||||
case "${cmd}" in
|
||||
snapshot) cmd_snapshot ;;
|
||||
list) cmd_list ;;
|
||||
prune) cmd_prune ;;
|
||||
restore) cmd_restore "$@" ;;
|
||||
help|--help|-h|"") usage ;;
|
||||
*) die "Unknown command: ${cmd}. Run '$0 help' for usage." ;;
|
||||
esac
|
||||
}
|
||||
|
||||
main "$@"
|
||||
|
|
@ -1,236 +0,0 @@
|
|||
<?php
|
||||
// pfSense HAProxy bootstrap — configures the mailserver PROXY-v2 path
|
||||
// (bd code-yiu, Phases 2/3 + 5).
|
||||
//
|
||||
// WHY THIS EXISTS
|
||||
// pfSense HAProxy config is stored XML-in-`/cf/conf/config.xml` under
|
||||
// `<installedpackages><haproxy>`. That file IS picked up by the nightly
|
||||
// `daily-backup` on the PVE host (see `scripts/daily-backup.sh` → `scp
|
||||
// root@10.0.20.1:/cf/conf/config.xml`) and synced to Synology. This script
|
||||
// is the canonical reproducer: run it to rebuild the pfSense HAProxy config
|
||||
// from scratch (DR restore, fresh pfSense install, etc.).
|
||||
//
|
||||
// WHAT IT BUILDS
|
||||
// 4 backend pools — one per mail port:
|
||||
// mailserver_nodes_smtp → k8s-node1..4:30125 (container :2525 postscreen)
|
||||
// mailserver_nodes_smtps → k8s-node1..4:30126 (container :4465 smtps)
|
||||
// mailserver_nodes_sub → k8s-node1..4:30127 (container :5587 submission)
|
||||
// mailserver_nodes_imaps → k8s-node1..4:30128 (container :10993 IMAPS)
|
||||
// Each server uses `send-proxy-v2` and TCP health-check every 120s.
|
||||
// 4 frontends on pfSense 10.0.20.1:{25,465,587,993} TCP mode.
|
||||
// + 1 legacy test frontend on :2525 (kept for validation; safe to remove later).
|
||||
//
|
||||
// USAGE (on pfSense host, via SSH as admin)
|
||||
// scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
|
||||
// ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
|
||||
//
|
||||
// IDEMPOTENCY
|
||||
// Removes any existing entries named mailserver_* before re-adding, so
|
||||
// repeat runs are safe and behave as reset-to-declared.
|
||||
|
||||
require_once('/etc/inc/config.inc');
|
||||
require_once('/usr/local/pkg/haproxy/haproxy.inc');
|
||||
require_once('/usr/local/pkg/haproxy/haproxy_utils.inc');
|
||||
|
||||
global $config;
|
||||
parse_config(true);
|
||||
|
||||
if (!is_array($config['installedpackages']['haproxy'])) {
|
||||
$config['installedpackages']['haproxy'] = [];
|
||||
}
|
||||
$h = &$config['installedpackages']['haproxy'];
|
||||
|
||||
$h['enable'] = 'yes';
|
||||
$h['maxconn'] = '1000';
|
||||
|
||||
// Our declared object names (anything starting with mailserver_ is ours)
|
||||
$POOL_NAMES = [
|
||||
'mailserver_nodes', // legacy (Phase 2/3 test)
|
||||
'mailserver_nodes_smtp',
|
||||
'mailserver_nodes_smtps',
|
||||
'mailserver_nodes_sub',
|
||||
'mailserver_nodes_imaps',
|
||||
];
|
||||
$FRONTEND_NAMES = [
|
||||
'mailserver_proxy_test', // legacy (Phase 2/3 test, :2525)
|
||||
'mailserver_proxy_25',
|
||||
'mailserver_proxy_465',
|
||||
'mailserver_proxy_587',
|
||||
'mailserver_proxy_993',
|
||||
];
|
||||
|
||||
// k8s workers. Not in the cluster: master (control-plane) and node5
|
||||
// (doesn't exist in this topology).
|
||||
$NODES = [
|
||||
['k8s-node1', '10.0.20.101'],
|
||||
['k8s-node2', '10.0.20.102'],
|
||||
['k8s-node3', '10.0.20.103'],
|
||||
['k8s-node4', '10.0.20.104'],
|
||||
];
|
||||
|
||||
// Build a pool with optional split healthcheck path.
|
||||
//
|
||||
// $check_port: if non-null, HAProxy sends health probes to that NodePort
|
||||
// (which Service `mailserver-proxy` maps to the pod's stock no-PROXY
|
||||
// listener — see infra/stacks/mailserver/.../mailserver_proxy ports
|
||||
// 30145/30146/30147). Real client traffic still goes to $nodeport with
|
||||
// PROXY v2 framing.
|
||||
// $check_type: 'TCP' for plain accept-on-port checks, 'ESMTP' for
|
||||
// `option smtpchk EHLO <monitor_domain>` (real SMTP banner+EHLO+250).
|
||||
//
|
||||
// Why split: smtpd-proxy587/4465 fatal on every PROXY-v2-aware health
|
||||
// probe with `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported`
|
||||
// — the resulting respawns get throttled by the Postfix master, and real clients
|
||||
// land mid-respawn → 6s TCP timeout. Routing health probes to the stock
|
||||
// no-PROXY port sidesteps the bug entirely while data path still gets
|
||||
// PROXY v2 for CrowdSec/Postfix client-IP visibility. The HAProxy package
|
||||
// has no `checkport` field, so `port N` is appended via the server's
|
||||
// `advanced` string (HAProxy parses server keywords in any order).
|
||||
function build_pool(
|
||||
string $name,
|
||||
string $nodeport,
|
||||
array $nodes,
|
||||
string $check_type = 'TCP',
|
||||
?string $check_port = null,
|
||||
string $monitor_domain = ''
|
||||
): array {
|
||||
$advanced_check = $check_port !== null
|
||||
? "send-proxy-v2 port {$check_port}"
|
||||
: 'send-proxy-v2';
|
||||
$servers = [];
|
||||
foreach ($nodes as $n) {
|
||||
$servers[] = [
|
||||
'name' => $n[0],
|
||||
'address' => $n[1],
|
||||
'port' => $nodeport,
|
||||
'weight' => '10',
|
||||
'ssl' => '',
|
||||
// 5s = sub-block-window failover when a NodePort goes sour.
|
||||
// Safe to be aggressive once health probes don't fatal smtpd.
|
||||
'checkinter' => '5000',
|
||||
'advanced' => $advanced_check,
|
||||
'status' => 'active',
|
||||
];
|
||||
}
|
||||
return [
|
||||
'name' => $name,
|
||||
'balance' => 'roundrobin',
|
||||
'check_type' => $check_type,
|
||||
'monitor_domain' => $monitor_domain,
|
||||
'checkinter' => '5000',
|
||||
'retries' => '3',
|
||||
'ha_servers' => ['item' => $servers],
|
||||
'advanced_bind' => '',
|
||||
'persist_cookie_enabled' => '',
|
||||
'transparent_clientip' => '',
|
||||
'advanced' => '',
|
||||
];
|
||||
}
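// Illustrative only: the server line the HAProxy package renders from the array above
// ends up roughly like (node/ports taken from the SMTP pool as an example):
//   server k8s-node1 10.0.20.101:30125 check inter 5000 weight 10 send-proxy-v2 port 30145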
|
||||
|
||||
function build_frontend(string $name, string $descr, string $extaddr, string $port, string $pool): array {
|
||||
return [
|
||||
'name' => $name,
|
||||
'descr' => $descr,
|
||||
'status' => 'active',
|
||||
'secondary' => '',
|
||||
'type' => 'tcp',
|
||||
'a_extaddr' => ['item' => [[
|
||||
'extaddr' => $extaddr,
|
||||
'extaddr_port' => $port,
|
||||
'extaddr_ssl' => '',
|
||||
'extaddr_advanced' => '',
|
||||
]]],
|
||||
'backend_serverpool' => $pool,
|
||||
'ha_acls' => '',
|
||||
'dontlognull'=> '',
|
||||
'httpclose' => '',
|
||||
'forwardfor' => '',
|
||||
'advanced' => '',
|
||||
];
|
||||
}
|
||||
|
||||
// ── Backend pools ───────────────────────────────────────────────────────
|
||||
if (!is_array($h['ha_pools'])) $h['ha_pools'] = ['item' => []];
|
||||
if (!is_array($h['ha_pools']['item'])) $h['ha_pools']['item'] = [];
|
||||
$h['ha_pools']['item'] = array_values(array_filter(
|
||||
$h['ha_pools']['item'],
|
||||
fn($p) => !in_array($p['name'] ?? '', $POOL_NAMES, true)
|
||||
));
|
||||
|
||||
// Legacy test pool (still used by the :2525 test frontend for manual SMTP roundtrip).
|
||||
$h['ha_pools']['item'][] = build_pool('mailserver_nodes', '30125', $NODES);
|
||||
|
||||
// Production pools — one per mail port.
|
||||
//
|
||||
// All SMTP/SMTPS/Submission backends use plain TCP checks against
|
||||
// dedicated non-PROXY healthcheck NodePorts (30145/30146/30147 → pod
|
||||
// stock 25/465/587) so probes hit the no-PROXY listeners and avoid
|
||||
// the smtpd_peer_hostaddr_to_sockaddr fatal that fires on PROXY-v2
|
||||
// LOCAL frames. Real client traffic still goes to 30125-30128 with
|
||||
// PROXY v2 for client-IP visibility.
|
||||
//
|
||||
// We tried `option smtpchk EHLO` initially — it works on the plain
|
||||
// `submission` daemon (587) but flaps the `postscreen` listener on
|
||||
// port 25 (multi-line greet + DNSBL silence + anti-pre-greet
|
||||
// detection makes HAProxy's simple smtpchk parser hit L7RSP). A
|
||||
// plain TCP accept-on-port check is enough for both: HAProxy still
|
||||
// gets fast failover when the listener actually goes away, and we
|
||||
// stop triggering the Postfix fatal entirely.
|
||||
//
|
||||
// IMAPS stays on its existing TCP-check-with-PROXY-frame for now —
|
||||
// Dovecot's PROXY parser doesn't show the same fatal pattern; adding
|
||||
// a separate IMAP healthcheck path would require another svc port.
|
||||
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtp', '30125', $NODES, 'TCP', '30145');
|
||||
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtps', '30126', $NODES, 'TCP', '30146');
|
||||
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_sub', '30127', $NODES, 'TCP', '30147');
|
||||
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_imaps', '30128', $NODES);
|
||||
|
||||
// ── Frontends ───────────────────────────────────────────────────────────
|
||||
if (!is_array($h['ha_backends'])) $h['ha_backends'] = ['item' => []];
|
||||
if (!is_array($h['ha_backends']['item'])) $h['ha_backends']['item'] = [];
|
||||
$h['ha_backends']['item'] = array_values(array_filter(
|
||||
$h['ha_backends']['item'],
|
||||
fn($f) => !in_array($f['name'] ?? '', $FRONTEND_NAMES, true)
|
||||
));
|
||||
|
||||
// Legacy test frontend — :2525 — retained so SMTP roundtrip tests keep working
|
||||
// without touching the real :25. Safe to remove once fully validated.
|
||||
$h['ha_backends']['item'][] = build_frontend(
|
||||
'mailserver_proxy_test',
|
||||
'code-yiu Phase 2/3 test — PROXY v2 to k8s mailserver NodePort 30125 (alt port :2525)',
|
||||
'10.0.20.1', '2525',
|
||||
'mailserver_nodes'
|
||||
);
|
||||
|
||||
// Production frontends — 4 ports listening on pfSense VLAN20 IP 10.0.20.1.
|
||||
$h['ha_backends']['item'][] = build_frontend(
|
||||
'mailserver_proxy_25',
|
||||
'code-yiu Phase 4/5 — external SMTP (:25) via PROXY v2 → pod :2525 postscreen',
|
||||
'10.0.20.1', '25',
|
||||
'mailserver_nodes_smtp'
|
||||
);
|
||||
$h['ha_backends']['item'][] = build_frontend(
|
||||
'mailserver_proxy_465',
|
||||
'code-yiu Phase 4/5 — external SMTPS (:465) via PROXY v2 → pod :4465 smtpd',
|
||||
'10.0.20.1', '465',
|
||||
'mailserver_nodes_smtps'
|
||||
);
|
||||
$h['ha_backends']['item'][] = build_frontend(
|
||||
'mailserver_proxy_587',
|
||||
'code-yiu Phase 4/5 — external submission (:587) via PROXY v2 → pod :5587 smtpd',
|
||||
'10.0.20.1', '587',
|
||||
'mailserver_nodes_sub'
|
||||
);
|
||||
$h['ha_backends']['item'][] = build_frontend(
|
||||
'mailserver_proxy_993',
|
||||
'code-yiu Phase 4/5 — external IMAPS (:993) via PROXY v2 → pod :10993 Dovecot',
|
||||
'10.0.20.1', '993',
|
||||
'mailserver_nodes_imaps'
|
||||
);
|
||||
|
||||
write_config('code-yiu: mailserver HAProxy — 4 production frontends + legacy :2525 test');
|
||||
|
||||
$messages = '';
|
||||
$rc = haproxy_check_and_run($messages, true);
|
||||
echo 'haproxy_check_and_run rc=' . ($rc ? 'OK' : 'FAIL') . "\n";
|
||||
echo "messages: $messages\n";
|
||||
|
|
@ -1,68 +0,0 @@
|
|||
<?php
|
||||
// pfSense NAT redirect flip — mail ports 25/465/587/993 from
|
||||
// <mailserver> alias (10.0.20.202 MetalLB LB) to pfSense's own HAProxy
|
||||
// listener (10.0.20.1). bd code-yiu.
|
||||
//
|
||||
// THIS IS THE CUTOVER. After this script:
|
||||
// Internet → pfSense WAN:{25,465,587,993} → rdr → 10.0.20.1:{...}
|
||||
// (pfSense HAProxy) → send-proxy-v2 → k8s-node:{30125..30128} NodePort
|
||||
// → kube-proxy → mailserver pod alt listeners (2525/4465/5587/10993)
|
||||
// → Postfix/Dovecot parse PROXY v2 → real client IP recovered.
|
||||
//
|
||||
// Internal clients (Roundcube, email-roundtrip-monitor CronJob) continue
|
||||
// using the existing mailserver ClusterIP Service on the stock ports
|
||||
// (25/465/587/993) which hit container stock listeners WITHOUT PROXY.
|
||||
// No change to internal traffic paths.
|
||||
//
|
||||
// USAGE
|
||||
// scp infra/scripts/pfsense-nat-mailserver-haproxy-flip.php admin@10.0.20.1:/tmp/
|
||||
// ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-flip.php'
|
||||
//
|
||||
// REVERT — run pfsense-nat-mailserver-haproxy-unflip.php (companion script).
|
||||
//
|
||||
// IDEMPOTENT — re-runs converge. Flips nothing if already pointed at 10.0.20.1.
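// VERIFY (illustrative, run on pfSense after apply; the grep is an example):
//   pfctl -s nat | grep -E 'port (25|465|587|993)'
// should show the four mail-port rdr rules targeting 10.0.20.1.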
|
||||
|
||||
require_once('/etc/inc/config.inc');
|
||||
require_once('/etc/inc/filter.inc');
|
||||
|
||||
global $config;
|
||||
parse_config(true);
|
||||
|
||||
$PORTS_TO_FLIP = ['25', '465', '587', '993'];
|
||||
$OLD_TARGET = 'mailserver';
|
||||
$NEW_TARGET = '10.0.20.1';
|
||||
|
||||
$changed = 0;
|
||||
foreach ($config['nat']['rule'] as $i => &$r) {
|
||||
$iface = $r['interface'] ?? '';
|
||||
$lport = $r['local-port'] ?? '';
|
||||
$tgt = $r['target'] ?? '';
|
||||
|
||||
if ($iface !== 'wan') continue;
|
||||
if (!in_array($lport, $PORTS_TO_FLIP, true)) continue;
|
||||
if ($tgt !== $OLD_TARGET) {
|
||||
printf("rule %d (dport=%s) target=%s — not flipping (already %s or unexpected)\n",
|
||||
$i, $lport, $tgt, $NEW_TARGET);
|
||||
continue;
|
||||
}
|
||||
|
||||
$r['target'] = $NEW_TARGET;
|
||||
// The linked filter rule (associated-rule-id) is left intact: pfSense
// regenerates the associated rule from the NAT rule on apply, so there is
// nothing to change on the filter side here.
|
||||
$changed++;
|
||||
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
|
||||
}
|
||||
unset($r);
|
||||
|
||||
if ($changed === 0) {
|
||||
echo "No changes. (Already flipped? Run unflip script to revert.)\n";
|
||||
exit(0);
|
||||
}
|
||||
|
||||
write_config("code-yiu: NAT rdr — mail ports {$changed} flipped to HAProxy (10.0.20.1)");
|
||||
|
||||
// Rebuild pf rules & reload.
|
||||
$rc = filter_configure();
|
||||
printf("filter_configure rc=%s\n", var_export($rc, true));
|
||||
echo "done.\n";
|
||||
|
|
@ -1,48 +0,0 @@
|
|||
<?php
|
||||
// REVERT of pfsense-nat-mailserver-haproxy-flip.php.
|
||||
// Moves mail-port NAT rdr target from 10.0.20.1 (pfSense HAProxy) back to
|
||||
// <mailserver> alias (10.0.20.202 MetalLB LB IP). bd code-yiu rollback.
|
||||
//
|
||||
// USE THIS IF: external mail breaks after the flip, any postscreen
|
||||
// PROXY timeouts show up in logs, or you need to back out before Phase 6.
|
||||
|
||||
require_once('/etc/inc/config.inc');
|
||||
require_once('/etc/inc/filter.inc');
|
||||
|
||||
global $config;
|
||||
parse_config(true);
|
||||
|
||||
$PORTS_TO_REVERT = ['25', '465', '587', '993'];
|
||||
$OLD_TARGET = '10.0.20.1';
|
||||
$NEW_TARGET = 'mailserver';
|
||||
|
||||
$changed = 0;
|
||||
foreach ($config['nat']['rule'] as $i => &$r) {
|
||||
$iface = $r['interface'] ?? '';
|
||||
$lport = $r['local-port'] ?? '';
|
||||
$tgt = $r['target'] ?? '';
|
||||
|
||||
if ($iface !== 'wan') continue;
|
||||
if (!in_array($lport, $PORTS_TO_REVERT, true)) continue;
|
||||
if ($tgt !== $OLD_TARGET) {
|
||||
printf("rule %d (dport=%s) target=%s — not reverting (already %s or unexpected)\n",
|
||||
$i, $lport, $tgt, $NEW_TARGET);
|
||||
continue;
|
||||
}
|
||||
|
||||
$r['target'] = $NEW_TARGET;
|
||||
$changed++;
|
||||
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
|
||||
}
|
||||
unset($r);
|
||||
|
||||
if ($changed === 0) {
|
||||
echo "No changes. (Already reverted.)\n";
|
||||
exit(0);
|
||||
}
|
||||
|
||||
write_config("code-yiu: NAT rdr — mail ports {$changed} reverted to <mailserver> alias");
|
||||
|
||||
$rc = filter_configure();
|
||||
printf("filter_configure rc=%s\n", var_export($rc, true));
|
||||
echo "done.\n";
|
||||
|
|
@ -39,43 +39,26 @@ if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
|
|||
fi
|
||||
echo "Vault authenticated"
|
||||
|
||||
# 5. Fetch API token for claude-agent-service
|
||||
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
|
||||
jq -r '.data.data.api_bearer_token')
|
||||
if [ -z "$AGENT_TOKEN" ] || [ "$AGENT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Failed to fetch agent API token"
|
||||
# 5. Fetch DevVM SSH key from Vault
|
||||
curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
|
||||
jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
|
||||
chmod 600 /tmp/devvm-key
|
||||
if [ ! -s /tmp/devvm-key ]; then
|
||||
echo "ERROR: Failed to fetch DevVM SSH key"
|
||||
exit 1
|
||||
fi
|
||||
echo "Agent token fetched"
|
||||
echo "SSH key fetched"
|
||||
|
||||
# 6. Submit to claude-agent-service
|
||||
# 6. SSH to DevVM and run Claude Code headless
|
||||
TODOS=$(cat /tmp/todos.json)
|
||||
PAYLOAD=$(jq -n \
|
||||
--arg prompt "Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS" \
|
||||
--arg agent ".claude/agents/postmortem-todo-resolver" \
|
||||
'{prompt: $prompt, agent: $agent, max_budget_usd: 5, timeout_seconds: 900}')
|
||||
ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
|
||||
"cd ~/code && git -C infra stash && git -C infra pull && git -C infra stash pop 2>/dev/null; ~/.local/bin/claude -p \
|
||||
--agent infra/.claude/agents/postmortem-todo-resolver \
|
||||
--dangerously-skip-permissions \
|
||||
--max-budget-usd 5 \
|
||||
'Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS'"
|
||||
|
||||
RESP=$(curl -sf -X POST \
|
||||
-H "Authorization: Bearer $AGENT_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$PAYLOAD" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
|
||||
JOB_ID=$(echo "$RESP" | jq -r '.job_id')
|
||||
echo "Job submitted: $JOB_ID"
|
||||
|
||||
# 7. Poll for completion (15min max)
|
||||
for i in $(seq 1 60); do
|
||||
sleep 15
|
||||
RESULT=$(curl -sf \
|
||||
-H "Authorization: Bearer $AGENT_TOKEN" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID)
|
||||
STATUS=$(echo "$RESULT" | jq -r '.status')
|
||||
echo "[$i/60] Status: $STATUS"
|
||||
if [ "$STATUS" != "running" ]; then
|
||||
echo "$RESULT" | jq .
|
||||
if [ "$STATUS" = "completed" ]; then exit 0; else exit 1; fi
|
||||
fi
|
||||
done
|
||||
echo "ERROR: Job timed out after 15 minutes"
|
||||
exit 1
|
||||
# 7. Cleanup
|
||||
rm -f /tmp/devvm-key
|
||||
echo "Pipeline complete"
|
||||
|
|
|
|||
|
|
@ -1,59 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# One-shot deployment of the forgejo.viktorbarzin.me containerd hosts.toml
|
||||
# entry across every k8s node. Cloud-init only fires on VM provision, so
|
||||
# existing nodes need this manual rollout.
|
||||
#
|
||||
# What it does, per node:
|
||||
# 1. drain (ignore-daemonsets, delete-emptydir-data)
|
||||
# 2. ssh in: mkdir + write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
|
||||
# 3. systemctl restart containerd
|
||||
# 4. uncordon
|
||||
#
|
||||
# hosts.toml is documented as hot-reloaded but the post-2026-04-19
|
||||
# containerd corruption playbook calls for an explicit restart so the
|
||||
# config is unambiguously in effect. Running drain/uncordon around it
|
||||
# avoids pulling against an in-flight containerd restart.
|
||||
#
|
||||
# Re-run is safe: writes are idempotent.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
CERTS_DIR=/etc/containerd/certs.d/forgejo.viktorbarzin.me
|
||||
HOSTS_TOML='server = "https://forgejo.viktorbarzin.me"
|
||||
|
||||
[host."https://10.0.20.200"]
|
||||
capabilities = ["pull", "resolve"]
|
||||
'
|
||||
|
||||
NODES=$(kubectl get nodes -o name | sed 's|^node/||')
|
||||
if [[ -z "$NODES" ]]; then
|
||||
echo "ERROR: no nodes returned from kubectl get nodes" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
for n in $NODES; do
|
||||
echo "=== $n ==="
|
||||
kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data --force --grace-period=60
|
||||
|
||||
ssh -o StrictHostKeyChecking=accept-new "wizard@$n" sudo bash <<EOF
|
||||
set -euo pipefail
|
||||
mkdir -p "$CERTS_DIR"
|
||||
cat > "$CERTS_DIR/hosts.toml" <<'TOML'
|
||||
$HOSTS_TOML
|
||||
TOML
|
||||
systemctl restart containerd
|
||||
EOF
|
||||
|
||||
kubectl uncordon "$n"
|
||||
|
||||
# Wait for the node to report Ready before moving to the next one.
|
||||
for i in {1..30}; do
|
||||
if kubectl get node "$n" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q True; then
|
||||
echo " node Ready"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
done
|
||||
done
|
||||
|
||||
echo "All nodes updated."
|
||||
19
scripts/tg
|
|
@ -72,23 +72,12 @@ if [ -n "$STACK_NAME" ]; then
|
|||
else
|
||||
# Tier 1: PG backend — fetch credentials from Vault
|
||||
if [ -z "${PG_CONN_STR:-}" ]; then
|
||||
# Pre-flight: vault CLI must be available. Previously CI failed with a
|
||||
# misleading "Cannot read PG credentials" message because the Alpine CI
|
||||
# image lacked the vault binary — the 2>/dev/null below swallowed the
|
||||
# real "vault: not found" error. Fail fast with a clear message instead.
|
||||
if ! command -v vault >/dev/null 2>&1; then
|
||||
echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
|
||||
exit 1
|
||||
fi
|
||||
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
|
||||
echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
|
||||
echo "$VAULT_OUT" >&2
|
||||
echo "" >&2
|
||||
echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
|
||||
PG_CREDS=$(vault read -format=json database/static-creds/pg-terraform-state 2>/dev/null) || {
|
||||
echo "ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc" >&2
|
||||
exit 1
|
||||
}
|
||||
PG_USER=$(echo "$VAULT_OUT" | jq -r .data.username)
|
||||
PG_PASS=$(echo "$VAULT_OUT" | jq -r .data.password)
|
||||
PG_USER=$(echo "$PG_CREDS" | jq -r .data.username)
|
||||
PG_PASS=$(echo "$PG_CREDS" | jq -r .data.password)
|
||||
export PG_CONN_STR="postgres://${PG_USER}:${PG_PASS}@10.0.20.200:5432/terraform_state?sslmode=disable"
|
||||
fi
|
||||
fi
|
||||
|
|
|
|||
Binary file not shown.
Binary file not shown.
Binary file not shown.
29
setup-monitoring.sh
Executable file
|
|
@ -0,0 +1,29 @@
|
|||
#!/bin/bash
|
||||
# Setup script for automated monitoring environment
|
||||
# Ensures health check scripts have access to kubeconfig
|
||||
|
||||
echo "=== Setting up automated monitoring environment ==="
|
||||
|
||||
# Copy kubeconfig to location expected by health check scripts
|
||||
if [ -f /home/node/.openclaw/kubeconfig ]; then
|
||||
cp /home/node/.openclaw/kubeconfig /workspace/infra/config
|
||||
echo "✅ Kubeconfig copied to /workspace/infra/config"
|
||||
else
|
||||
echo "❌ Source kubeconfig not found at /home/node/.openclaw/kubeconfig"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Test health check access
|
||||
echo ""
|
||||
echo "Testing health check script access..."
|
||||
cd /workspace/infra
|
||||
if KUBECONFIG="" timeout 30 bash .claude/cluster-health.sh --quiet > /dev/null 2>&1; then
|
||||
echo "✅ Health check script can access cluster"
|
||||
else
|
||||
echo "❌ Health check script cannot access cluster"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "✅ Automated monitoring environment setup complete"
|
||||
echo "📊 Cron health checks will now work properly"
|
||||
|
|
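A hypothetical crontab entry matching the paths the script prepares; the 15-minute interval and log destination are illustrative, not taken from the repo:

*/15 * * * * cd /workspace/infra && KUBECONFIG="" bash .claude/cluster-health.sh --quiet >> /tmp/cluster-health.log 2>&1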
@@ -63,7 +63,7 @@ resource "kubernetes_deployment" "app" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
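A hedged way to see the admission-time mutation this ignore_changes guards against (namespace and label selector are illustrative):

kubectl -n actualbudget get pod -l app=actualbudget -o jsonpath='{.items[0].spec.dnsConfig}'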
@@ -116,10 +116,6 @@ resource "kubernetes_deployment" "actualbudget" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

resource "kubernetes_service" "actualbudget" {

@@ -218,10 +214,6 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

resource "kubernetes_service" "actualbudget-http-api" {

@@ -312,8 +304,4 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@@ -46,7 +46,7 @@ locals {
# To create a new deployment:
/**
1. Create a subdirectory for {name} under /srv/nfs on the Proxmox host (192.168.1.127)
1. Export a new nfs share with {name} in truenas
2. Add {name} as proxied cloudflare route (tfvars)
3. Add module here
*/

@@ -59,10 +59,6 @@ resource "kubernetes_namespace" "actualbudget" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

module "tls_secret" {

@@ -76,14 +72,13 @@ module "tls_secret" {
module "viktor" {
source = "./factory"
name = "viktor"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]
tier = local.tiers.edge
enable_http_api = true
enable_bank_sync = true
storage_size = "4Gi"
budget_encryption_password = lookup(local.credentials["viktor"], "password", null)
sync_id = lookup(local.credentials["viktor"], "sync_id", null)
homepage_annotations = {

@@ -100,7 +95,7 @@ module "viktor" {
module "anca" {
source = "./factory"
name = "anca"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]

@@ -123,7 +118,7 @@ module "anca" {
module "emo" {
source = "./factory"
name = "emo"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]

@@ -90,10 +90,6 @@ resource "kubernetes_namespace" "affine" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

module "tls_secret" {

@@ -201,7 +197,7 @@ resource "kubernetes_deployment" "affine" {
annotations = {
"diun.enable" = "true"
"diun.include_tags" = "^\\d+\\.\\d+\\.\\d+$"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
}
}
spec {

@@ -323,10 +319,6 @@ resource "kubernetes_deployment" "affine" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

resource "kubernetes_service" "affine" {
@@ -1,81 +0,0 @@
# goauthentik/authentik Terraform provider.
#
# Adopted 2026-04-18 (Wave 6a of the state-drift consolidation plan) to bring
# the catch-all Proxy Provider — previously managed only via the Authentik UI
# — under Terraform management. API token lives in Vault
# `secret/authentik/tf_api_token` (token identifier `terraform-infra-stack`,
# intent API, user akadmin, no expiry). Required-providers declaration sits
# in the central terragrunt.hcl so every stack has it available; only this
# stack configures a provider block.

data "vault_kv_secret_v2" "authentik_tf" {
mount = "secret"
name = "authentik"
}

provider "authentik" {
url = "https://authentik.viktorbarzin.me"
token = data.vault_kv_secret_v2.authentik_tf.data["tf_api_token"]
}

data "authentik_flow" "default_authorization_implicit_consent" {
slug = "default-provider-authorization-implicit-consent"
}

data "authentik_flow" "default_provider_invalidation" {
slug = "default-provider-invalidation-flow"
}

# -----------------------------------------------------------------------------
# Catch-all Proxy Provider + Application.
#
# Created via the Authentik UI ~a year ago; adopted into Terraform 2026-04-18
# (Wave 6a). The proxy provider is consumed by the embedded outpost
# (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) via an outpost-level binding
# that stays in the UI — it's a single toggle with no drift risk.
# -----------------------------------------------------------------------------

resource "authentik_application" "catchall" {
name = "Domain wide catch all"
slug = "domain-wide-catch-all"
protocol_provider = authentik_provider_proxy.catchall.id
lifecycle {
ignore_changes = [meta_description, meta_launch_url, meta_icon, group, backchannel_providers, policy_engine_mode, open_in_new_tab]
}
}

resource "authentik_provider_proxy" "catchall" {
name = "Provider for Domain wide catch all"
mode = "forward_domain"
external_host = "https://authentik.viktorbarzin.me"
cookie_domain = "viktorbarzin.me"
# Flow UUIDs resolved dynamically so a flow re-creation (keeping the slug)
# doesn't require an HCL edit.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth, access_token_validity]
}
}

# -----------------------------------------------------------------------------
# Default User Login stage — bound to default-authentication-flow.
# Adopted into Terraform 2026-05-01 to set session_duration=weeks=4 so users
# stay logged in across browser restarts. There is no Brand.session_duration
# in authentik 2026.2.x — UserLoginStage is the correct knob.
# -----------------------------------------------------------------------------

resource "authentik_stage_user_login" "default_login" {
name = "default-authentication-login"
session_duration = "weeks=4"
lifecycle {
# Pin only session_duration; everything else stays UI-managed so the
# plan doesn't churn unrelated knobs (e.g. remember_me_offset toggles).
ignore_changes = [
remember_me_offset,
terminate_other_sessions,
geoip_binding,
network_binding,
]
}
}

@@ -31,10 +31,6 @@ resource "kubernetes_namespace" "authentik" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

resource "kubernetes_resource_quota" "authentik" {

@@ -74,36 +74,6 @@ resource "kubernetes_deployment" "pgbouncer" {
container_port = 6432
}

resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "512Mi"
}
}

readiness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 5
period_seconds = 10
timeout_seconds = 3
failure_threshold = 3
}

liveness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 30
period_seconds = 30
timeout_seconds = 5
failure_threshold = 3
}

volume_mount {
name = "config"
mount_path = "/etc/pgbouncer/pgbouncer.ini"

@@ -145,29 +115,6 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
depends_on = [kubernetes_secret.pgbouncer_auth]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

# --- 3b️⃣ PodDisruptionBudget ---
# Protects auth against simultaneous node drains. With 3 replicas and
# minAvailable=2, a single drain rolls cleanly; a simultaneous two-node
# outage is correctly blocked.
resource "kubernetes_pod_disruption_budget_v1" "pgbouncer" {
metadata {
name = "pgbouncer"
namespace = "authentik"
}
spec {
min_available = 2
selector {
match_labels = {
app = "pgbouncer"
}
}
}
}
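A quick, hedged check that the budget behaves as the comment describes during a drain (names come from the resource above):

kubectl -n authentik get pdb pgbouncer
kubectl -n authentik get pods -l app=pgbouncer -o wide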
# --- 4️⃣ Service ---

@@ -14,38 +14,9 @@ authentik:
port: 6432
user: authentik
password: ""
# Persistent client-side connections (safe with PgBouncer session mode;
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
# setup overhead off the ~70 sequential ORM ops per flow stage.
conn_max_age: 60
conn_health_checks: true
cache:
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
# moved cache storage from Redis to Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
timeout_flows: 1800
timeout_policies: 900
web:
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
workers: 3
threads: 4
worker:
# Celery-equivalent worker threads per pod (default 2, renamed from
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
threads: 4

server:
replicas: 3
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
env:
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:

@@ -56,7 +27,7 @@ server:
cpu: 100m
memory: 1.5Gi
limits:
memory: 2Gi
memory: 1.5Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname

@@ -73,17 +44,12 @@ server:
diun.include_tags: "^202[0-9].[0-9]+.*$" # no need to annotate the worker as it uses the same image
pdb:
enabled: true
minAvailable: 2
minAvailable: 1
global:
addPrometheusAnnotations: true

worker:
replicas: 3
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:

@@ -94,7 +60,7 @@ worker:
cpu: 100m
memory: 1.5Gi
limits:
memory: 2Gi
memory: 1.5Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
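A hedged verification that the env override actually lands in the running pods; the Deployment name assumes the chart's default authentik-server naming:

kubectl -n authentik exec deploy/authentik-server -- env | grep AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE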
@@ -3,27 +3,6 @@ variable "tls_secret_name" {
sensitive = true
}

variable "beadboard_image_tag" {
type = string
default = "17a38e43"
}

# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in
# sync when the claude-agent-service image is rebuilt. Reused here because the
# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image
# already ships.
variable "claude_agent_service_image_tag" {
type = string
default = "2fd7670d"
}

# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The
# manual BeadBoard Dispatch button keeps working either way.
variable "beads_dispatcher_enabled" {
type = bool
default = true
}

resource "kubernetes_namespace" "beads" {
metadata {
name = "beads-server"

@@ -31,10 +10,6 @@ resource "kubernetes_namespace" "beads" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

resource "kubernetes_persistent_volume_claim" "dolt_data" {

@@ -170,7 +145,7 @@ resource "kubernetes_deployment" "dolt" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}

@@ -374,7 +349,7 @@ resource "kubernetes_deployment" "workbench" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}

@@ -411,13 +386,13 @@ module "tls_secret" {
}

module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
exclude_crowdsec = true
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Dolt Workbench"

@@ -488,38 +463,6 @@ resource "kubernetes_config_map" "beadboard_config" {
}
}

# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
# dispatch agent jobs via the in-cluster HTTP API.
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "beadboard-agent-service"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "beadboard-agent-service"
}
data = [
{
secretKey = "api_bearer_token"
remoteRef = {
key = "claude-agent-service"
property = "api_bearer_token"
}
},
]
}
}
}
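A hedged check that ESO has materialised the token before the Deployment below references it (names taken from the manifest above):

kubectl -n beads-server get externalsecret beadboard-agent-service
kubectl -n beads-server get secret beadboard-agent-service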
resource "kubernetes_deployment" "beadboard" {
metadata {
name = "beadboard"

@@ -528,9 +471,6 @@ resource "kubernetes_deployment" "beadboard" {
app = "beadboard"
tier = local.tiers.aux
}
annotations = {
"reloader.stakater.com/auto" = "true"
}
}
spec {
replicas = 1

@@ -567,29 +507,13 @@ resource "kubernetes_deployment" "beadboard" {

container {
name = "beadboard"
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
image = "registry.viktorbarzin.me:5050/beadboard:latest"

port {
name = "http"
container_port = 3000
}

env {
name = "CLAUDE_AGENT_SERVICE_URL"
value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
}

env {
name = "CLAUDE_AGENT_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}

volume_mount {
name = "beads-writable"
mount_path = "/app/.beads"

@@ -646,7 +570,7 @@ resource "kubernetes_deployment" "beadboard" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}

@@ -672,13 +596,13 @@ resource "kubernetes_service" "beadboard" {
}

module "beadboard_ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
protected = true
exclude_crowdsec = true
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
protected = true
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "BeadBoard"

@@ -688,275 +612,3 @@ module "beadboard_ingress" {
"gethomepage.dev/pod-selector" = ""
}
}

# ── Beads auto-dispatch (dispatcher + reaper CronJobs) ──
#
# Flow:
# user: bd assign <id> agent
# └──> CronJob: beads-dispatcher (every 2 min)
# 1. GET BeadBoard /api/agent-status — skip if claude-agent-service busy
# 2. bd query 'assignee=agent AND status=open' — pick highest priority
# 3. bd update -s in_progress (claim; next tick won't re-pick)
# 4. POST BeadBoard /api/agent-dispatch — reuses prompt-build + bearer flow
# 5. bd note "dispatched: job=<id>" (or rollback + note on failure)
#
# CronJob: beads-reaper (every 10 min)
# └── for bead (assignee=agent, status=in_progress, updated_at > 30m):
# bd update -s blocked + bd note (recover from pod crashes mid-run)
#
# The claude-agent-service image ships bd + jq + curl — no separate image built.
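A hedged example of the manual kick-off the flow above describes (the bead id is illustrative):

bd assign code-123 agent
bd query 'assignee=agent AND status=open' --json | jq -r '.[].id'   # what the next dispatcher tick will see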
resource "kubernetes_config_map" "beads_metadata" {
metadata {
name = "beads-metadata"
namespace = kubernetes_namespace.beads.metadata[0].name
}
data = {
"metadata.json" = jsonencode({
database = "dolt"
backend = "dolt"
dolt_mode = "server"
dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
dolt_server_port = 3306
dolt_server_user = "beads"
dolt_database = "code"
project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da"
})
}
}

locals {
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
claude_agent_service_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"

beads_script_prelude = <<-EOT
set -euo pipefail
# bd with Dolt server mode needs metadata.json in a directory it can walk.
# ConfigMap mounts are read-only — copy to a writable location before use.
mkdir -p /tmp/.beads
cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json
EOT
}

resource "kubernetes_cron_job_v1" "beads_dispatcher" {
metadata {
name = "beads-dispatcher"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/2 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-dispatcher"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "dispatcher"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}

BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false')
if [ "$BUSY" != "false" ]; then
echo "claude-agent-service is busy — skipping tick"
exit 0
fi

BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \
| jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)]
| sort_by(.priority, .updated_at)[0].id // empty')

if [ -z "$BEAD" ]; then
echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)"
exit 0
fi

echo "picked bead: $BEAD"

bd --db /tmp/.beads update "$BEAD" -s in_progress
bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"

RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-H 'Content-Type: application/json' \
-d "{\"taskId\":\"$BEAD\"}" \
"$${BEADBOARD_URL}/api/agent-dispatch")
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')

if [ "$CODE" = "200" ]; then
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"')
bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID"
echo "dispatched $BEAD as job $JOB_ID"
else
# Roll the claim back so the next tick can retry.
bd --db /tmp/.beads update "$BEAD" -s open
bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY"
echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2
exit 1
fi
EOT
]
env {
name = "BEADBOARD_URL"
value = local.beadboard_internal_url
}
env {
name = "API_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}
env {
name = "BEADS_ACTOR"
value = "beads-dispatcher"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

resource "kubernetes_cron_job_v1" "beads_reaper" {
metadata {
name = "beads-reaper"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-reaper"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "reaper"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}

THRESHOLD_MIN=30
NOW=$(date -u +%s)

bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \
| jq -c '.[]' \
| while read -r BEAD_JSON; do
ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id')
LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at')
# Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3.
LAST_TS=$(python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('$LAST_UPDATE'.replace('Z','+00:00')).timestamp()))")
AGE_MIN=$(( (NOW - LAST_TS) / 60 ))
if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then
bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking"
bd --db /tmp/.beads update "$ID" -s blocked
echo "REAPED $ID (stale $${AGE_MIN}m)"
else
echo "keeping $ID (age $${AGE_MIN}m < $${THRESHOLD_MIN}m)"
fi
done
EOT
]
env {
name = "BEADS_ACTOR"
value = "beads-reaper"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
@@ -12,10 +12,6 @@ resource "kubernetes_namespace" "website" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

module "tls_secret" {

@@ -75,10 +71,6 @@ resource "kubernetes_deployment" "blog" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

resource "kubernetes_service" "blog" {
@@ -14,10 +14,6 @@ resource "kubernetes_namespace" "broker_sync" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}

# Secrets for all providers. Seeded in Vault at `secret/broker-sync`:

@@ -105,7 +101,7 @@ resource "kubernetes_cron_job_v1" "version_probe" {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
ttl_seconds_after_finished = 300
template {
metadata {
labels = { app = "broker-sync", component = "version-probe" }

@@ -126,10 +122,6 @@ resource "kubernetes_cron_job_v1" "version_probe" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# Trading212 steady-state daily sync. Phase 1 deliverable.

@@ -226,10 +218,6 @@ resource "kubernetes_cron_job_v1" "trading212" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# IMAP ingest — InvestEngine + Schwab email parsers, one combined pod.

@@ -246,12 +234,7 @@ resource "kubernetes_cron_job_v1" "imap" {
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Unsuspended 2026-04-19 for RSU vest ground-truth ingestion — the parser
# now detects Schwab Release Confirmations and scaffolds VestEvents; the
# postgres sink that persists them into payslip_ingest.rsu_vest_events is
# pending a real-email fixture and cross-service DB grant (see
# follow-up beads task filed under the RSU tax spike fix epic).
suspend = false
suspend = true # enable in Phase 2
job_template {
metadata {}
spec {

@@ -360,10 +343,6 @@ resource "kubernetes_cron_job_v1" "imap" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# CSV drop-folder processor — Scottish Widows, Fidelity quarterly, Freetrade, etc.

@@ -452,10 +431,6 @@ resource "kubernetes_cron_job_v1" "csv_drop" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# Monthly HMRC FX reconciliation — rewrites last-month activities with official

@@ -544,10 +519,6 @@ resource "kubernetes_cron_job_v1" "fx_reconcile" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# Backup: snapshot sync.db / fx.db / csv-archive into NFS daily, keep 30 days.

@@ -625,170 +596,4 @@ resource "kubernetes_cron_job_v1" "backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

# -----------------------------------------------------------------------------
# Fidelity UK PlanViewer — monthly pension contribution sync
#
# Architecture notes:
# - The CLI (`broker-sync fidelity-ingest`) loads storage_state.json, boots
# headless Chromium, scrapes the transaction history + valuation JSON, and
# posts DEPOSIT activities to Wealthfolio. See
# broker-sync/docs/providers/fidelity-planviewer.md for the seed workflow.
# - Storage_state is staged to Vault (`secret/broker-sync` →
# `fidelity_storage_state`). ESO projects all broker-sync keys into the
# shared `broker-sync-secrets` K8s Secret; an init container writes the
# JSON blob to the PVC so the main container can load it.
# - Image needs Chromium baked in — add the `fidelity-capable: "true"` label
# so the Dockerfile/CI treats this CronJob's pod spec as the Playwright
# variant. Until the Playwright image ships, keep `suspend = true`.
# - Schedule: 05:00 UK on the 20th of each month — well after Viktor's mid-
# month payroll contribution has settled (finance history shows credits
# landing 13th-18th).
resource "kubernetes_cron_job_v1" "fidelity" {
metadata {
name = "broker-sync-fidelity"
namespace = kubernetes_namespace.broker_sync.metadata[0].name
labels = { app = "broker-sync", component = "fidelity" }
}
spec {
schedule = "0 5 20 * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Suspended until the broker-sync image ships with Playwright + Chromium.
suspend = true
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata {
labels = { app = "broker-sync", component = "fidelity" }
}
spec {
restart_policy = "OnFailure"
# Materialise the JSON storage_state from the projected Secret
# onto the PVC where Playwright expects to read it. Init container
# runs as root; the main broker-sync container runs as uid 10001,
# so we chown+chmod 600 to grant read access to the broker user.
init_container {
name = "stage-storage-state"
image = "busybox:1.36"
command = ["/bin/sh", "-c", <<-EOT
set -eu
mkdir -p /data
cp /secrets/fidelity_storage_state /data/fidelity_storage_state.json
chown 10001:10001 /data/fidelity_storage_state.json
chmod 600 /data/fidelity_storage_state.json
EOT
]
volume_mount {
name = "secrets"
mount_path = "/secrets"
read_only = true
}
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
requests = { cpu = "5m", memory = "8Mi" }
limits = { memory = "32Mi" }
}
}
container {
name = "broker-sync"
image = local.broker_sync_image
command = ["broker-sync", "fidelity-ingest"]

env {
name = "BROKER_SYNC_DATA_DIR"
value = "/data"
}
env {
name = "WF_SESSION_PATH"
value = "/data/wealthfolio_session.json"
}
env {
name = "FIDELITY_STORAGE_STATE_PATH"
value = "/data/fidelity_storage_state.json"
}
env {
name = "FIDELITY_PLAN_ID"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "fidelity_plan_id"
}
}
}
env {
name = "WF_BASE_URL"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_base_url"
}
}
}
env {
name = "WF_USERNAME"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_username"
}
}
}
env {
name = "WF_PASSWORD"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_password"
}
}
}
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
# Chromium is hungry — headless shell + page rendering
# comfortably under 1Gi, spike up to 1.2Gi during full-page
# screenshots.
requests = { cpu = "50m", memory = "512Mi" }
limits = { memory = "1280Mi" }
}
}
volume {
name = "secrets"
secret {
secret_name = "broker-sync-secrets"
items {
key = "fidelity_storage_state"
path = "fidelity_storage_state"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_encrypted.metadata[0].name
}
}
}
}
}
}
lifecycle {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
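The Fidelity notes above describe staging the session blob in Vault before ESO can project it. A hedged seed command (KV mount and key names come from the comments; vault kv patch updates only the given key, so the other broker-sync secrets are left alone):

vault kv patch secret/broker-sync fidelity_storage_state=@storage_state.json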
@@ -1,67 +0,0 @@
# Calico CNI
#
# Calico has underpinned this cluster's pod networking since 2024-07-30, installed
# as raw kubectl manifests (tigera-operator Deployment + CRDs + Installation CR).
# Bringing the full stack under Terraform is high-blast — the operator and its
# Deployment must never flap during node pressure or during any apply, because
# new pod scheduling breaks within ~seconds of a CNI outage.
#
# This stack (created 2026-04-18 Wave 5b) adopts the three namespaces only:
# calico-system, calico-apiserver, tigera-operator. The `tigera-operator`
# Deployment, the 20+ CRDs it manages, and the `Installation` CR itself are
# intentionally *not* adopted yet — they require a low-traffic window and a
# careful ignore_changes set to cover operator-generated defaults on the
# Installation CR. Follow-up tracked in beads code-3ad.
#
# The namespaces are safe to adopt (no networking impact — they're just label
# containers) and give TF an audit trail entry for the labels/tier Kyverno
# cares about.

resource "kubernetes_namespace" "calico_system" {
metadata {
name = "calico-system"
labels = {
name = "calico-system"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode label on every namespace.
# pod-security.kubernetes.io/* labels are applied by the tigera-operator
# reconciler on calico-system + calico-apiserver for PSA 'privileged'.
ignore_changes = [
metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
metadata[0].labels["pod-security.kubernetes.io/enforce"],
metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
]
}
}

resource "kubernetes_namespace" "calico_apiserver" {
metadata {
name = "calico-apiserver"
labels = {
name = "calico-apiserver"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1 + PSA labels applied by tigera-operator (see calico_system).
ignore_changes = [
metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
metadata[0].labels["pod-security.kubernetes.io/enforce"],
metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
]
}
}

resource "kubernetes_namespace" "tigera_operator" {
metadata {
name = "tigera-operator"
labels = {
name = "tigera-operator"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
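A hedged sketch of the namespace-only adoption the comments describe; kubernetes_namespace imports by name, so no networking objects are touched (the exact wrapper invocation in this repo may differ):

terraform import kubernetes_namespace.calico_system calico-system
terraform import kubernetes_namespace.calico_apiserver calico-apiserver
terraform import kubernetes_namespace.tigera_operator tigera-operator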
@@ -1,6 +0,0 @@
include "root" {
path = find_in_parent_folders()
}

# No platform dependency — Calico provides the cluster network the rest
# of the platform runs on. This stack must not introduce a dep cycle.
Some files were not shown because too many files have changed in this diff.