diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 34845017..55cbc0d1 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r - **New service**: Use `setup-project` skill for full workflow - **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. - **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache. -- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected). +- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected). Engine pinned to `registry:2.8.3` (see post-mortem 2026-04-19); on-VM configs deploy via `.woodpecker/registry-config-sync.yml`; integrity probed every 15m by `registry-integrity-probe` CronJob in `monitoring` ns — the HTTP API is the authoritative integrity check, NOT `/blobs/*/data` presence (revision-link absence is the real failure mode). - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts. - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi. - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`. diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index cc51cdde..6a7fd2f9 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -102,7 +102,8 @@ Woodpecker API uses numeric IDs (not owner/name): 1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10` 2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss 3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access -4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault +4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault. Runs `registry:2.8.3` (pinned — floating `registry:2` was the root cause of the 2026-04-13 + 2026-04-19 orphan-index incidents; see `docs/post-mortems/2026-04-19-registry-orphan-index.md`). +5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem). ### Infra Pipelines (Woodpecker-only) @@ -111,7 +112,14 @@ Woodpecker API uses numeric IDs (not owner/name): | default | `.woodpecker/default.yml` | Terragrunt apply on push | | renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron | | build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries | +| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes | | k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory | +| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` | +| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host | +| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection | +| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec | ## Configuration diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index 774a3356..0de2a219 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -63,6 +63,7 @@ graph TB | External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` | | dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection | | Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP | +| Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `registry.viktorbarzin.me:5050`, HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Catches orphan OCI-index state that filesystem scans miss. | ## How It Works @@ -160,6 +161,11 @@ spec: - **EmailRoundtripStale**: No successful email round-trip in >40m - **EmailRoundtripNeverRun**: Email probe has never reported (40m) +#### Registry Integrity Alerts +- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`. +- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken) +- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down) + The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that: 1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me` 2. Email lands in the `spam@` catch-all mailbox via MX delivery diff --git a/stacks/job-hunter/terragrunt.hcl b/stacks/job-hunter/terragrunt.hcl index 8104129e..93df44f1 100644 --- a/stacks/job-hunter/terragrunt.hcl +++ b/stacks/job-hunter/terragrunt.hcl @@ -21,5 +21,5 @@ inputs = { # 8-char SHA from the Forgejo commit viktor/job-hunter@9c42eac9 # (first image built locally + pushed 2026-04-19 due to a Woodpecker # v3.13 Forgejo webhook bug; bump on every deploy once CI recovers). - image_tag = "9c42eac9" + image_tag = "48f8615d" } diff --git a/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json b/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json index f0ddb18f..9b0c2644 100644 --- a/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json +++ b/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json @@ -422,7 +422,7 @@ "type": "grafana-postgresql-datasource", "uid": "payslips-pg" }, - "rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date", + "rawSql": "SELECT pay_date AS \"time\", (gross_pay - rsu_vest) AS cash_gross, net_pay, COALESCE(cash_income_tax, income_tax) AS income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date", "format": "time_series", "refId": "A", "rawQuery": true, @@ -1144,9 +1144,14 @@ "index": 2, "text": "RSU_NO_SALARY" }, + "NULL_CASH_TAX": { + "color": "orange", + "index": 3, + "text": "NULL_CASH_TAX" + }, "ok": { "color": "green", - "index": 3, + "index": 4, "text": "ok" } } @@ -1217,7 +1222,7 @@ "rawQuery": true, "editorMode": "code", "format": "table", - "rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC" + "rawSql": "WITH expected AS (SELECT generate_series(DATE '2019-07-01', DATE_TRUNC('month', CURRENT_DATE), '1 month'::interval)::date AS month_start), actual AS (SELECT DATE_TRUNC('month', pay_date)::date AS month_start, pay_date, salary, gross_pay, rsu_vest, cash_income_tax, income_tax, paperless_doc_id FROM payslip_ingest.payslip) SELECT e.month_start, CASE WHEN a.pay_date IS NULL THEN 'MISSING' WHEN a.salary = 0 AND a.gross_pay > 5000 THEN 'ZERO_SALARY' WHEN a.rsu_vest > 0 AND a.salary = 0 THEN 'RSU_NO_SALARY' WHEN a.rsu_vest > 0 AND a.cash_income_tax IS NULL THEN 'NULL_CASH_TAX' ELSE 'ok' END AS status, a.pay_date, a.salary, a.gross_pay, a.rsu_vest, a.cash_income_tax, a.paperless_doc_id FROM expected e LEFT JOIN actual a ON a.month_start = e.month_start WHERE e.month_start >= DATE '2019-07-01' ORDER BY e.month_start DESC" } ] },