Compare commits

3 commits

Author SHA1 Message Date
Viktor Barzin
ef53053ae6 [job-hunter] Bump image to 48f8615d — London filter + AI CLI
New image adds Alembic 0002 (primary_location column), London-default
query/bands/report commands, and FX-priming on refresh so USD/EUR
salaries convert correctly. Applied live; 5826 rows backfilled.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:13:26 +00:00
Viktor Barzin
fca3dd4976 [monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL
Phase A of RSU tax spike fix. Two changes:

1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite
   the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart
   is honest once the Phase B back-fill populates cash_income_tax on
   variant-A slips. For slips where cash_income_tax is already populated
   (variant B, 2024+) the spike is removed immediately.

2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL
   on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange)
   highlights the rows still awaiting back-fill; the count is expected to
   drop to 0 after Phase B lands.

Part of: code-860
2026-04-19 18:04:05 +00:00
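
The Panel 2 change above relies on SQL `COALESCE` returning its first non-NULL argument. A minimal Python sketch of that fallback follows; the helper names and the sample figures are hypothetical, and only the column names `cash_income_tax` and `income_tax` come from the commit message.

```python
from typing import Optional


def coalesce(*values):
    """Return the first argument that is not None, mirroring SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def plotted_income_tax(cash_income_tax: Optional[float], income_tax: float) -> float:
    """Value Panel 2 would plot: the cash-only tax when present, else the raw tax."""
    return coalesce(cash_income_tax, income_tax)


# Hypothetical slips (figures invented for illustration only).
back_filled_slip = {"cash_income_tax": 1200.0, "income_tax": 9800.0}
pending_slip = {"cash_income_tax": None, "income_tax": 9800.0}

for slip in (back_filled_slip, pending_slip):
    print(plotted_income_tax(slip["cash_income_tax"], slip["income_tax"]))
```

A slip whose `cash_income_tax` is still NULL keeps plotting the raw figure, which is why the spike only disappears for those rows once the back-fill lands. Note that a stored value of 0 is not NULL, so it is plotted as 0 rather than falling back.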
Viktor Barzin
7e34b67f24 [docs] Architecture docs: registry integrity probe, pin, new CI pipelines
Bring the architecture set in line with what's actually deployed after
today's registry reliability work (commits 7cb44d7242961a5f):

- docs/architecture/ci-cd.md: expand Infra Pipelines table with
  build-ci-image (+ verify-integrity step), registry-config-sync,
  pve-nfs-exports-sync, postmortem-todos, drift-detection,
  issue-automation, provision-user. Note registry:2.8.3 pin +
  integrity probe in the image-registry flow section.
- docs/architecture/monitoring.md: add Registry Integrity Probe to
  components table; add 3-alert section (Manifest Integrity Failure /
  Probe Stale / Catalog Inaccessible).
- .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the
  revision-link-not-blob rule so the next agent knows the right check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:51:26 +00:00
5 changed files with 25 additions and 6 deletions

View file

@@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected). Engine pinned to `registry:2.8.3` (see post-mortem 2026-04-19); on-VM configs deploy via `.woodpecker/registry-config-sync.yml`; integrity probed every 15m by `registry-integrity-probe` CronJob in `monitoring` ns — the HTTP API is the authoritative integrity check, NOT `/blobs/*/data` presence (revision-link absence is the real failure mode).
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
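
As a worked example of why the reservations have to track VM memory: Kubernetes derives a node's allocatable memory as capacity minus `kubeReserved`, `systemReserved`, and the hard eviction threshold. The sketch below plugs in the current values listed above; the 16 GiB node size and the function name are assumptions for illustration, not values taken from this repo.

```python
def allocatable_mi(
    capacity_mi: int,
    system_reserved_mi: int = 512,  # systemReserved=512Mi (from the note above)
    kube_reserved_mi: int = 512,    # kubeReserved=512Mi (from the note above)
    eviction_hard_mi: int = 500,    # evictionHard=500Mi (from the note above)
) -> int:
    """Memory (MiB) left for pods after kubelet reservations and hard eviction."""
    return capacity_mi - system_reserved_mi - kube_reserved_mi - eviction_hard_mi


# Hypothetical 16 GiB node: 16384 - 512 - 512 - 500
print(allocatable_mi(16 * 1024))  # 14860
```

The reserved total (1524 MiB here) is a fixed cost per node, so it takes a much larger share of a small VM than of a large one; that is the reason to revisit the values whenever node memory changes.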

View file

@@ -102,7 +102,8 @@ Woodpecker API uses numeric IDs (not owner/name):
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault
4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault. Runs `registry:2.8.3` (pinned — floating `registry:2` was the root cause of the 2026-04-13 + 2026-04-19 orphan-index incidents; see `docs/post-mortems/2026-04-19-registry-orphan-index.md`).
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
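
The walk in step 5 can be sketched roughly as below. This is an illustration only: the deployed probe is described elsewhere in this change as Alpine with curl/jq, so the Python function, its two injected callables (`get_json`, `head_status`), and the in-memory fake registry are all hypothetical. The `/v2/...` paths are the standard registry HTTP API endpoints.

```python
from typing import Callable, Dict, List, Tuple

# Media types that mark a manifest as a multi-arch index / manifest list.
INDEX_MEDIA_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def find_broken_manifests(
    get_json: Callable[[str], Dict],
    head_status: Callable[[str], int],
) -> List[Tuple[str, str]]:
    """Walk catalog -> tags -> indexes -> children; return (repo, reference) pairs that fail HEAD."""
    failures: List[Tuple[str, str]] = []
    for repo in get_json("/v2/_catalog").get("repositories", []):
        # A repository with no tags may report "tags": null.
        tags = get_json(f"/v2/{repo}/tags/list").get("tags") or []
        for tag in tags:
            path = f"/v2/{repo}/manifests/{tag}"
            if head_status(path) != 200:
                failures.append((repo, tag))
                continue
            manifest = get_json(path)
            if manifest.get("mediaType") not in INDEX_MEDIA_TYPES:
                continue  # single-platform manifest: no children to check
            for child in manifest.get("manifests", []):
                digest = child["digest"]
                if head_status(f"/v2/{repo}/manifests/{digest}") != 200:
                    failures.append((repo, digest))  # orphan index child
    return failures


# Hypothetical in-memory registry: one index whose second child is missing.
FAKE_JSON = {
    "/v2/_catalog": {"repositories": ["app"]},
    "/v2/app/tags/list": {"name": "app", "tags": ["abc12345"]},
    "/v2/app/manifests/abc12345": {
        "mediaType": "application/vnd.oci.image.index.v1+json",
        "manifests": [{"digest": "sha256:aaa"}, {"digest": "sha256:bbb"}],
    },
}
FAKE_PRESENT = {"/v2/app/manifests/abc12345", "/v2/app/manifests/sha256:aaa"}

broken = find_broken_manifests(
    get_json=lambda path: FAKE_JSON[path],
    head_status=lambda path: 200 if path in FAKE_PRESENT else 404,
)
print(broken)
```

Going through the HTTP API rather than the filesystem is the point of the design: the index itself resolves, and only the request for its child digest reveals the orphan.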
### Infra Pipelines (Woodpecker-only)
@@ -111,7 +112,14 @@ Woodpecker API uses numeric IDs (not owner/name):
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
## Configuration

View file

@@ -63,6 +63,7 @@ graph TB
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
| Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `registry.viktorbarzin.me:5050`, HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Catches orphan OCI-index state that filesystem scans miss. |
## How It Works
@@ -160,6 +161,11 @@ spec:
- **EmailRoundtripStale**: No successful email round-trip in >40m
- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
#### Registry Integrity Alerts
- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
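
The timing conditions of the first two registry alerts can be sketched as follows. This is a simplified model, not the deployed alert rules: it assumes "fires after 30m" means the failure count has stayed above zero for an unbroken 30 minutes, and the function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def integrity_failure_firing(
    samples: List[Tuple[datetime, int]],
    now: datetime,
    hold: timedelta = timedelta(minutes=30),
) -> bool:
    """True when failures > 0 in every sample of the latest run and that run is at least `hold` old.

    `samples` is a time-ordered list of (timestamp, registry_manifest_integrity_failures).
    """
    run_start = None
    for timestamp, failures in samples:
        if failures > 0:
            if run_start is None:
                run_start = timestamp  # start of a new failing run
        else:
            run_start = None  # a clean sample resets the pending state
    return run_start is not None and now - run_start >= hold


def probe_stale(
    last_report: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> bool:
    """True when the probe's last report is more than `max_age` old."""
    return now - last_report > max_age


# Hypothetical probe runs every 15 minutes.
t0 = datetime(2026, 4, 19, 12, 0)
runs = [
    (t0, 0),
    (t0 + timedelta(minutes=15), 2),
    (t0 + timedelta(minutes=30), 2),
    (t0 + timedelta(minutes=45), 1),
]
print(integrity_failure_firing(runs, now=t0 + timedelta(minutes=45)))
print(probe_stale(t0, now=t0 + timedelta(minutes=61)))
```

Under this model a single clean probe run resets the 30-minute clock, so a transient failure that clears before the hold period elapses never pages.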
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email lands in the `spam@` catch-all mailbox via MX delivery

View file

@@ -21,5 +21,5 @@ inputs = {
# 8-char SHA from the Forgejo commit viktor/job-hunter@9c42eac9
# (first image built locally + pushed 2026-04-19 due to a Woodpecker
# v3.13 Forgejo webhook bug; bump on every deploy once CI recovers).
image_tag = "9c42eac9"
image_tag = "48f8615d"
}
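
A small guard for the 8-char SHA tag convention used by `image_tag` could look like the sketch below. The helper names are hypothetical, and pairing the `job-hunter` image with `registry.viktorbarzin.me` is an assumption made only to show the `<name>:<tag>` reference format; the diff itself does not say which registry hosts this image.

```python
import re

# Exactly eight lowercase hex characters, e.g. "48f8615d".
SHA_TAG = re.compile(r"[0-9a-f]{8}")


def is_sha_tag(tag: str) -> bool:
    """True when the tag looks like an 8-char abbreviated git SHA."""
    return SHA_TAG.fullmatch(tag) is not None


def image_ref(name: str, tag: str, registry: str = "registry.viktorbarzin.me") -> str:
    """Build `<registry>/<name>:<tag>`, rejecting floating tags such as `latest`."""
    if not is_sha_tag(tag):
        raise ValueError(f"expected an 8-char git SHA tag, got {tag!r}")
    return f"{registry}/{name}:{tag}"


print(image_ref("job-hunter", "48f8615d"))
```

Rejecting anything that is not a SHA-shaped tag keeps `:latest` out of manifests, which matters here because a floating tag can be served stale from the pull-through cache.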

View file

@@ -422,7 +422,7 @@
"type": "grafana-postgresql-datasource",
"uid": "payslips-pg"
},
"rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
"rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, COALESCE(cash_income_tax, income_tax) AS income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
"format": "time_series",
"refId": "A",
"rawQuery": true,
@@ -1144,9 +1144,14 @@
"index": 2,
"text": "RSU_NO_SALARY"
},
"NULL_CASH_TAX": {
"color": "orange",
"index": 3,
"text": "NULL_CASH_TAX"
},
"ok": {
"color": "green",
"index": 3,
"index": 4,
"text": "ok"
}
}
@@ -1217,7 +1222,7 @@
"rawQuery": true,
"editorMode": "code",
"format": "table",
"rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC"
"rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, cash_income_tax, income_tax, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' WHEN a.rsu_vest > 0 AND a.cash_income_tax IS NULL THEN 'NULL_CASH_TAX' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.cash_income_tax, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC"
}
]
},
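
The order of the `CASE` branches in the Panel 4 query decides which status a row receives. The Python port below is a sketch of that ordering, not code from the repo: the function name is hypothetical, and it assumes the numeric columns are non-NULL whenever a payslip exists for the month (SQL would treat a NULL comparison as unknown and fall through, which this sketch does not model).

```python
from datetime import date
from typing import Optional


def payslip_status(
    pay_date: Optional[date],
    salary: float,
    gross_pay: float,
    rsu_vest: float,
    cash_income_tax: Optional[float],
) -> str:
    """Mirror the CASE expression: the first matching branch wins."""
    if pay_date is None:
        return "MISSING"  # no payslip joined for the expected month
    if salary == 0 and gross_pay > 5000:
        return "ZERO_SALARY"
    if rsu_vest > 0 and salary == 0:
        return "RSU_NO_SALARY"
    if rsu_vest > 0 and cash_income_tax is None:
        return "NULL_CASH_TAX"  # vest month still awaiting back-fill
    return "ok"


# Hypothetical rows (figures invented for illustration only).
print(payslip_status(date(2023, 5, 31), 6000.0, 20000.0, 14000.0, None))
print(payslip_status(date(2024, 5, 31), 6000.0, 20000.0, 14000.0, 1500.0))
```

Because `NULL_CASH_TAX` is the last branch before `ok`, a vest-month slip that also has a zero salary is reported under the earlier salary statuses and only appears as `NULL_CASH_TAX` once those are resolved.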