From 7e34b67f2449a6336f66fc03961743ae276c8ef9 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 17:51:26 +0000
Subject: [PATCH 1/3] [docs] Architecture docs: registry integrity probe, pin, new CI pipelines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bring the architecture set in line with what's actually deployed after
today's registry reliability work (commits 7cb44d72 → 42961a5f):

- docs/architecture/ci-cd.md: expand Infra Pipelines table with
  build-ci-image (+ verify-integrity step), registry-config-sync,
  pve-nfs-exports-sync, postmortem-todos, drift-detection,
  issue-automation, provision-user. Note registry:2.8.3 pin + integrity
  probe in the image-registry flow section.
- docs/architecture/monitoring.md: add Registry Integrity Probe to
  components table; add 3-alert section (Manifest Integrity Failure /
  Probe Stale / Catalog Inaccessible).
- .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the
  revision-link-not-blob rule so the next agent knows the right check.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/CLAUDE.md               |  2 +-
 docs/architecture/ci-cd.md      | 10 +++++++++-
 docs/architecture/monitoring.md |  6 ++++++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 34845017..55cbc0d1 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **New service**: Use `setup-project` skill for full workflow
 - **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
 - **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
-- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
+- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected). Engine pinned to `registry:2.8.3` (see post-mortem 2026-04-19); on-VM configs deploy via `.woodpecker/registry-config-sync.yml`; integrity probed every 15m by `registry-integrity-probe` CronJob in `monitoring` ns — the HTTP API is the authoritative integrity check, NOT `/blobs/*/data` presence (revision-link absence is the real failure mode).
 - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
 - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
 - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md
index cc51cdde..6a7fd2f9 100644
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -102,7 +102,8 @@ Woodpecker API uses numeric IDs (not owner/name):
 1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
 2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
 3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
-4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault
+4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault. Runs `registry:2.8.3` (pinned — floating `registry:2` was the root cause of the 2026-04-13 + 2026-04-19 orphan-index incidents; see `docs/post-mortems/2026-04-19-registry-orphan-index.md`).
+5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
 
 ### Infra Pipelines (Woodpecker-only)
 
@@ -111,7 +112,14 @@ Woodpecker API uses numeric IDs (not owner/name):
 | default | `.woodpecker/default.yml` | Terragrunt apply on push |
 | renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
 | build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
+| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
 | k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
+| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
+| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
+| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
+| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
+| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 
 ## Configuration
 
diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index 774a3356..0de2a219 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -63,6 +63,7 @@ graph TB
 | External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
 | dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
 | Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
+| Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `registry.viktorbarzin.me:5050`, HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Catches orphan OCI-index state that filesystem scans miss. |
 
 ## How It Works
 
@@ -160,6 +161,11 @@ spec:
 - **EmailRoundtripStale**: No successful email round-trip in >40m
 - **EmailRoundtripNeverRun**: Email probe has never reported (40m)
 
+#### Registry Integrity Alerts
+- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
+- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
+- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
+
 The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
 1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
 2. Email lands in the `spam@` catch-all mailbox via MX delivery

From fca3dd4976d267817dc776e4e8c00a9c147d2189 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 18:04:05 +0000
Subject: [PATCH 2/3] [monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase A of RSU tax spike fix. Two changes:

1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax
   despite the title. Switch to COALESCE(cash_income_tax, income_tax) so
   the chart is honest once the Phase B back-fill populates
   cash_income_tax on variant-A slips. For slips where cash_income_tax is
   already populated (variant B, 2024+) the spike is removed immediately.

2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is
   NULL on vest months (rsu_vest > 0). New status value NULL_CASH_TAX
   (orange) highlights the back-fill remaining population — expected to
   drop to 0 after Phase B lands.

Part of: code-860
---
 .../modules/monitoring/dashboards/uk-payslip.json | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json b/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json
index f0ddb18f..9b0c2644 100644
--- a/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json
+++ b/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json
@@ -422,7 +422,7 @@
 "type": "grafana-postgresql-datasource",
 "uid": "payslips-pg"
 },
-"rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
+"rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, COALESCE(cash_income_tax, income_tax) AS income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
 "format": "time_series",
 "refId": "A",
 "rawQuery": true,
@@ -1144,9 +1144,14 @@
 "index": 2,
 "text": "RSU_NO_SALARY"
 },
+"NULL_CASH_TAX": {
+"color": "orange",
+"index": 3,
+"text": "NULL_CASH_TAX"
+},
 "ok": {
 "color": "green",
-"index": 3,
+"index": 4,
 "text": "ok"
 }
 }
@@ -1217,7 +1222,7 @@
 "rawQuery": true,
 "editorMode": "code",
 "format": "table",
-"rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC"
+"rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, cash_income_tax, income_tax, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' WHEN a.rsu_vest > 0 AND a.cash_income_tax IS NULL THEN 'NULL_CASH_TAX' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.cash_income_tax, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC"
 }
 ]
 },

From ef53053ae6cc5212fb25b8f7367afe7f91513c74 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 18:13:26 +0000
Subject: [PATCH 3/3] =?UTF-8?q?[job-hunter]=20Bump=20image=20to=2048f8615d?=
 =?UTF-8?q?=20=E2=80=94=20London=20filter=20+=20AI=20CLI?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New image adds Alembic 0002 (primary_location column), London-default
query/bands/report commands, and FX-priming on refresh so USD/EUR
salaries convert correctly. Applied live; 5826 rows backfilled.

Refs: code-snp
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/job-hunter/terragrunt.hcl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/stacks/job-hunter/terragrunt.hcl b/stacks/job-hunter/terragrunt.hcl
index 8104129e..93df44f1 100644
--- a/stacks/job-hunter/terragrunt.hcl
+++ b/stacks/job-hunter/terragrunt.hcl
@@ -21,5 +21,5 @@ inputs = {
 # 8-char SHA from the Forgejo commit viktor/job-hunter@9c42eac9
 # (first image built locally + pushed 2026-04-19 due to a Woodpecker
 # v3.13 Forgejo webhook bug; bump on every deploy once CI recovers).
-image_tag = "9c42eac9"
+image_tag = "48f8615d"
 }
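
Review note on PATCH 2/3: the two SQL changes can be read as the following minimal Python sketch. It assumes a hypothetical `Payslip` record that reuses the column names from `payslip_ingest.payslip`, and it assumes every column except `cash_income_tax` is non-NULL. The names `plotted_income_tax` and `integrity_status` are illustrative only and do not exist in the dashboard or the ingest code. The sketch mirrors `COALESCE(cash_income_tax, income_tax)` from Panel 2 and the top-to-bottom `CASE` precedence from Panel 4.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Payslip:
    """Hypothetical stand-in for one row of payslip_ingest.payslip."""
    salary: Decimal
    gross_pay: Decimal
    rsu_vest: Decimal
    income_tax: Decimal
    cash_income_tax: Optional[Decimal] = None  # None models SQL NULL


def plotted_income_tax(slip: Payslip) -> Decimal:
    """Panel 2: COALESCE(cash_income_tax, income_tax)."""
    if slip.cash_income_tax is not None:
        return slip.cash_income_tax
    return slip.income_tax


def integrity_status(slip: Optional[Payslip]) -> str:
    """Panel 4: branches are tried in order, like the SQL CASE.

    `slip is None` models a month with no row after the LEFT JOIN.
    """
    if slip is None:
        return "MISSING"
    if slip.salary == 0 and slip.gross_pay > 5000:
        return "ZERO_SALARY"
    if slip.rsu_vest > 0 and slip.salary == 0:
        return "RSU_NO_SALARY"
    if slip.rsu_vest > 0 and slip.cash_income_tax is None:
        return "NULL_CASH_TAX"
    return "ok"
```

Because the new branch sits after `ZERO_SALARY` and `RSU_NO_SALARY`, a vest-month row with `salary = 0` keeps its earlier label even when `cash_income_tax` is NULL, so the count of `NULL_CASH_TAX` rows may understate the back-fill population if such rows exist.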