[docs] Architecture docs: registry integrity probe, pin, new CI pipelines

Bring the architecture set in line with what's actually deployed after
today's registry reliability work (commits 7cb44d7242961a5f):

- docs/architecture/ci-cd.md: expand Infra Pipelines table with
  build-ci-image (+ verify-integrity step), registry-config-sync,
  pve-nfs-exports-sync, postmortem-todos, drift-detection,
  issue-automation, provision-user. Note registry:2.8.3 pin +
  integrity probe in the image-registry flow section.
- docs/architecture/monitoring.md: add Registry Integrity Probe to
  components table; add 3-alert section (Manifest Integrity Failure /
  Probe Stale / Catalog Inaccessible).
- .claude/CLAUDE.md: one-line note on the pin, the auto-sync pipeline,
  and the revision-link-not-blob rule so the next agent knows the right
  check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 17:51:26 +00:00
parent fec0bbb7dd
commit 7e34b67f24
3 changed files with 16 additions and 2 deletions

@ -63,6 +63,7 @@ graph TB
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
| Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `registry.viktorbarzin.me:5050`, HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Catches orphan OCI-index state that filesystem scans miss. |
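
The walk described in the Registry Integrity Probe row can be sketched roughly as follows. This is an illustrative Python paraphrase, not the deployed probe (which the table lists as Alpine + curl/jq). The function name and the two injected callables `get_json` and `head_ok` are hypothetical. The endpoint paths follow the registry v2 HTTP API, and catalog pagination is ignored for brevity.

```python
from typing import Callable, Optional

# Media types that mark a manifest as an index (multi-arch list) with children.
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def count_integrity_failures(
    get_json: Callable[[str], Optional[dict]],
    head_ok: Callable[[str], bool],
) -> int:
    """Walk catalog -> tags -> manifests and count advertised manifests that are missing.

    get_json(path) returns the decoded JSON body, or None if the request failed.
    head_ok(path) returns True if a HEAD request on the path succeeds.
    Both are hypothetical helpers supplied by the caller.
    """
    catalog = get_json("/v2/_catalog")
    if catalog is None:
        # Corresponds to the "catalog inaccessible" condition.
        raise RuntimeError("cannot fetch /v2/_catalog")

    failures = 0
    for repo in catalog.get("repositories") or []:
        tag_list = get_json(f"/v2/{repo}/tags/list") or {}
        for tag in tag_list.get("tags") or []:  # "tags" may be null for empty repos
            path = f"/v2/{repo}/manifests/{tag}"
            if not head_ok(path):
                failures += 1
                continue
            manifest = get_json(path) or {}
            if manifest.get("mediaType") in INDEX_TYPES:
                # An index can be present while a child it references is gone:
                # this is the orphan OCI-index state the probe is meant to catch.
                for child in manifest.get("manifests", []):
                    if not head_ok(f"/v2/{repo}/manifests/{child['digest']}"):
                        failures += 1
    return failures
```

The returned count would correspond to a value such as `registry_manifest_integrity_failures` pushed to Pushgateway. Passing the HTTP helpers in as arguments keeps the walk testable against a fake registry.
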
## How It Works
@ -160,6 +161,11 @@ spec:
- **EmailRoundtripStale**: No successful email round-trip in >40m
- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
#### Registry Integrity Alerts
- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
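
As a rough model of the three conditions above, the following Python sketch evaluates which alerts would fire for a given probe state. It is not the actual alert rule definitions. The `ProbeState` fields and the `firing_alerts` function are hypothetical, and only the 30m and 1h thresholds and the alert names come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

FAILURE_FOR_SECONDS = 30 * 60   # failures must persist for 30m before firing
STALE_AFTER_SECONDS = 60 * 60   # probe counts as stale after more than 1h of silence


@dataclass
class ProbeState:
    # Unix time when the failure count first rose above 0; None if it is currently 0.
    failing_since: Optional[float]
    # Unix time of the probe's last report to Pushgateway.
    last_report: float
    # Whether the last run could fetch /v2/_catalog.
    catalog_ok: bool


def firing_alerts(state: ProbeState, now: float) -> List[str]:
    """Return the names of the alerts whose conditions hold at time `now`."""
    alerts = []
    if (state.failing_since is not None
            and now - state.failing_since >= FAILURE_FOR_SECONDS):
        alerts.append("RegistryManifestIntegrityFailure")
    if now - state.last_report > STALE_AFTER_SECONDS:
        alerts.append("RegistryIntegrityProbeStale")
    if not state.catalog_ok:
        alerts.append("RegistryCatalogInaccessible")
    return alerts
```

The 30m hold on the integrity alert means a single failing run does not page on its own, while the staleness check covers the case where the CronJob stops reporting entirely.
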
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email lands in the `spam@` catch-all mailbox via MX delivery