[forgejo] Phases 3+4+5: cutover, decommission, docs sweep
Final commit of forgejo-registry-consolidation. Phases 0 and 1 have
already landed (Forgejo ready, dual-push CI, integrity probe, retention
CronJob, images migrated via forgejo-migrate-orphan-images.sh), so this
commit moves everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml still dual-pushes until the next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
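One way to confirm that no stack still references the legacy registry after these flips is a recursive grep. This is a minimal sketch and not part of the commit: the function name `find_legacy_refs` is hypothetical, and the search root is whichever checkout path you pass in.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): list files under a path
# that still mention the legacy registry hostname or its VM address.
find_legacy_refs() {
  dir="$1"
  # -r: recurse, -l: print matching file names only, -E: extended regex.
  # Dots are escaped so they match literally.
  # Exit status is 1 when nothing matches, which is the desired end state.
  grep -rlE 'registry\.viktorbarzin\.me|10\.0\.20\.10:5050' "$dir"
}
```

For example, `find_legacy_refs infra/stacks` should print nothing once every `image=` has been flipped. Docs and alert annotations that mention the retired registry as history will still match, so the sketch is best pointed at the stack directories rather than the whole repository.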
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually during the `setup-forgejo-containerd-mirror.sh` rollout; the
cloud-init template only fires when a new VM is provisioned.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-through
caches for upstream registries (5000/5010/5020/5030/5040) stay on the
VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf:
registry_integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10, edit /opt/registry/docker-compose.yml to
match the new template, then run `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list. At that
point the post-push integrity check at lines 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15-min Forgejo probe).
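Follow-up step 1 can be verified after the fact with a small check. The sketch below is an illustration only: `service_is_gone` is a hypothetical name, and it uses a substring match on the assumption that Compose derives container names from the service name (an explicit `container_name:` in the compose file would change that).

```shell
#!/bin/sh
# Hypothetical post-cutover check for follow-up step 1 (not part of this commit).
# Reads container names on stdin, one per line, and succeeds only when none
# of them contains the given service name.
service_is_gone() {
  service="$1"
  # grep -q exits 0 on a match, and a match means the container is still up.
  # -F treats the service name as a fixed string rather than a regex.
  if grep -qF -- "$service"; then
    return 1
  fi
  return 0
}
```

On the registry VM this could be fed from `docker ps --format '{{.Names}}' | service_is_gone registry-private`. A non-zero exit would mean the container survived and `--remove-orphans` has not taken effect yet.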
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent e86efd107a
commit 4ec40ea804
20 changed files with 89 additions and 294 deletions
@@ -243,193 +243,15 @@ resource "kubernetes_cron_job_v1" "dns_anomaly_monitor" {
 }

 # -----------------------------------------------------------------------------
-# Registry manifest-integrity probe — HEADs every tag in the private R/W
-# registry's catalog, walks multi-platform image indexes, and reports blob
-# availability. Catches the orphan-index failure mode seen 2026-04-13 and
-# 2026-04-19 before downstream pipelines hit it.
+# Phase 4 of forgejo-registry-consolidation 2026-05-07: registry-private
+# decommissioned. The integrity probe below caught the orphan-index failure
+# mode in `registry:2.8.3` (post-mortem 2026-04-19). With that engine
+# retired, the probe is replaced by `forgejo_integrity_probe` below.
+#
+# Resource definitions stripped wholesale — terragrunt apply destroys the
+# in-cluster CronJob + Secret on the next run.
 # See: docs/post-mortems/2026-04-19-registry-orphan-index.md
 # -----------------------------------------------------------------------------
-resource "kubernetes_secret" "registry_probe_credentials" {
-  metadata {
-    name = "registry-probe-credentials"
-    namespace = kubernetes_namespace.monitoring.metadata[0].name
-  }
-  type = "Opaque"
-  data = {
-    REG_USER = var.registry_user
-    REG_PASS = var.registry_password
-  }
-}
-
-resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
-  metadata {
-    name = "registry-integrity-probe"
-    namespace = kubernetes_namespace.monitoring.metadata[0].name
-  }
-  spec {
-    concurrency_policy = "Forbid"
-    failed_jobs_history_limit = 3
-    successful_jobs_history_limit = 3
-    schedule = "*/15 * * * *"
-    job_template {
-      metadata {}
-      spec {
-        backoff_limit = 1
-        ttl_seconds_after_finished = 600
-        template {
-          metadata {}
-          spec {
-            container {
-              name = "registry-integrity-probe"
-              image = "docker.io/library/alpine:3.20"
-              env {
-                name = "REG_USER"
-                value_from {
-                  secret_key_ref {
-                    name = kubernetes_secret.registry_probe_credentials.metadata[0].name
-                    key = "REG_USER"
-                  }
-                }
-              }
-              env {
-                name = "REG_PASS"
-                value_from {
-                  secret_key_ref {
-                    name = kubernetes_secret.registry_probe_credentials.metadata[0].name
-                    key = "REG_PASS"
-                  }
-                }
-              }
-              env {
-                name = "REGISTRY_HOST"
-                value = "10.0.20.10:5050"
-              }
-              env {
-                name = "REGISTRY_INSTANCE"
-                value = "registry.viktorbarzin.me:5050"
-              }
-              env {
-                name = "PUSHGATEWAY"
-                value = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/registry-integrity-probe"
-              }
-              env {
-                name = "TAGS_PER_REPO"
-                value = "5"
-              }
-              command = ["/bin/sh", "-c", <<-EOT
-                set -eu
-                apk add --no-cache curl jq >/dev/null
-
-                REG="$REGISTRY_HOST"
-                INSTANCE="$REGISTRY_INSTANCE"
-                AUTH="$REG_USER:$REG_PASS"
-                ACCEPT='application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
-
-                push() {
-                  # Prometheus pushgateway — body ends with blank line. Ignore push errors.
-                  curl -sf --max-time 10 --data-binary @- "$PUSHGATEWAY" >/dev/null 2>&1 || true
-                }
-
-                CATALOG=$(curl -sk -u "$AUTH" --max-time 30 "https://$REG/v2/_catalog?n=1000" || echo "")
-                REPOS=$(echo "$CATALOG" | jq -r '.repositories[]?' 2>/dev/null || echo "")
-
-                if [ -z "$REPOS" ]; then
-                  echo "ERROR: empty catalog or auth failure — cannot probe"
-                  NOW=$(date +%s)
-                  push <<METRICS
-                # TYPE registry_manifest_integrity_catalog_accessible gauge
-                registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 0
-                # TYPE registry_manifest_integrity_last_run_timestamp gauge
-                registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
-                METRICS
-                  exit 1
-                fi
-
-                FAIL=0
-                REPOS_N=0
-                TAGS_N=0
-                INDEXES_N=0
-
-                printf '%s\n' $REPOS > /tmp/repos.txt
-                while IFS= read -r repo; do
-                  [ -z "$repo" ] && continue
-                  REPOS_N=$((REPOS_N + 1))
-
-                  TAGS_JSON=$(curl -sk -u "$AUTH" --max-time 15 "https://$REG/v2/$repo/tags/list" || echo "")
-                  echo "$TAGS_JSON" | jq -r '.tags[]?' 2>/dev/null | tail -n "$TAGS_PER_REPO" > /tmp/tags.txt || true
-
-                  while IFS= read -r tag; do
-                    [ -z "$tag" ] && continue
-                    TAGS_N=$((TAGS_N + 1))
-
-                    HTTP=$(curl -sk -u "$AUTH" -o /tmp/m.json -w '%%{http_code}' \
-                      -H "Accept: $ACCEPT" --max-time 15 \
-                      "https://$REG/v2/$repo/manifests/$tag")
-                    if [ "$HTTP" != "200" ]; then
-                      echo "FAIL: $repo:$tag manifest HTTP $HTTP"
-                      FAIL=$((FAIL + 1))
-                      continue
-                    fi
-
-                    MT=$(jq -r '.mediaType // empty' /tmp/m.json 2>/dev/null || echo "")
-                    if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
-                      INDEXES_N=$((INDEXES_N + 1))
-                      jq -r '.manifests[].digest' /tmp/m.json > /tmp/children.txt 2>/dev/null || true
-                      while IFS= read -r d; do
-                        [ -z "$d" ] && continue
-                        CH=$(curl -sk -u "$AUTH" -o /dev/null -w '%%{http_code}' \
-                          -H "Accept: $ACCEPT" --max-time 10 -I \
-                          "https://$REG/v2/$repo/manifests/$d")
-                        if [ "$CH" != "200" ]; then
-                          echo "FAIL: $repo:$tag index child $d HTTP $CH"
-                          FAIL=$((FAIL + 1))
-                        fi
-                      done < /tmp/children.txt
-                    fi
-                  done < /tmp/tags.txt
-                done < /tmp/repos.txt
-
-                NOW=$(date +%s)
-                push <<METRICS
-                # TYPE registry_manifest_integrity_failures gauge
-                registry_manifest_integrity_failures{instance="$INSTANCE"} $FAIL
-                # TYPE registry_manifest_integrity_catalog_accessible gauge
-                registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 1
-                # TYPE registry_manifest_integrity_repos_checked gauge
-                registry_manifest_integrity_repos_checked{instance="$INSTANCE"} $REPOS_N
-                # TYPE registry_manifest_integrity_tags_checked gauge
-                registry_manifest_integrity_tags_checked{instance="$INSTANCE"} $TAGS_N
-                # TYPE registry_manifest_integrity_indexes_checked gauge
-                registry_manifest_integrity_indexes_checked{instance="$INSTANCE"} $INDEXES_N
-                # TYPE registry_manifest_integrity_last_run_timestamp gauge
-                registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
-                METRICS
-
-                echo "Probe complete: $FAIL failures across $REPOS_N repos / $TAGS_N tags / $INDEXES_N indexes"
-                if [ "$FAIL" -gt 0 ]; then exit 1; fi
-              EOT
-              ]
-              resources {
-                requests = {
-                  cpu = "10m"
-                  memory = "48Mi"
-                }
-                limits = {
-                  memory = "96Mi"
-                }
-              }
-            }
-            restart_policy = "OnFailure"
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
-  }
-}

 # -----------------------------------------------------------------------------
 # Forgejo registry integrity probe — same algorithm as registry-integrity-probe
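The removed probe makes two decisions per tag: whether a manifest is a multi-platform index whose children must also be checked, and whether each response counts as a failure. The sketch below isolates both with no network access. `is_index_media_type` reuses the probe's own regex, while `count_failures` and its "reference, then HTTP code" input format are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Sketch of the probe's two core decisions, with no network access.

# Succeeds when the media type names a multi-platform index
# (OCI image index or Docker manifest list), using the probe's own regex.
is_index_media_type() {
  printf '%s\n' "$1" | grep -Eq 'manifest\.list|image\.index'
}

# Hypothetical helper: reads "<reference> <http_code>" lines on stdin and
# prints how many are not 200, mirroring the probe's FAIL counter.
count_failures() {
  fail=0
  while read -r ref code; do
    [ -z "$ref" ] && continue
    if [ "$code" != "200" ]; then
      fail=$((fail + 1))
    fi
  done
  echo "$fail"
}
```

Any non-200 response counts, so a 401 from expired credentials is tallied the same way as a 404 for an orphaned index child. That matches the probe's behavior, where the alert annotation speaks only of references that "return non-200".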
@@ -1657,7 +1657,7 @@ serverFiles:
       severity: critical
     annotations:
       summary: "{{ $labels.instance }}: {{ $value }} broken manifest reference(s) — orphan index or missing blob"
-      description: "The integrity probe CronJob found {{ $value }} manifest/blob references that return non-200 on {{ $labels.instance }}. For registry.viktorbarzin.me see docs/runbooks/registry-rebuild-image.md (orphan OCI-index child from cleanup-tags.sh+GC race). For forgejo.viktorbarzin.me see docs/runbooks/forgejo-registry-rebuild-image.md."
+      description: "The forgejo-integrity-probe CronJob found {{ $value }} manifest/blob references that return non-200 on {{ $labels.instance }}. Rebuild the affected image per docs/runbooks/forgejo-registry-rebuild-image.md. (registry.viktorbarzin.me retired Phase 4 of forgejo-registry-consolidation 2026-05-07 — only forgejo.viktorbarzin.me remains.)"
   - alert: RegistryIntegrityProbeStale
     expr: time() - registry_manifest_integrity_last_run_timestamp > 3600
     for: 15m
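The `RegistryIntegrityProbeStale` expression is plain timestamp arithmetic, sketched below in shell. `is_stale` is a hypothetical name and the optional third argument is added for illustration; only the 3600-second default comes from the rule itself.

```shell
#!/bin/sh
# Sketch of the RegistryIntegrityProbeStale expression:
#   time() - registry_manifest_integrity_last_run_timestamp > 3600
# Usage: is_stale NOW LAST_RUN [THRESHOLD]
# Succeeds when the gap between the two timestamps exceeds the threshold.
is_stale() {
  now="$1"
  last_run="$2"
  threshold="${3:-3600}"   # seconds; 3600 is the value in the alert rule
  [ $((now - last_run)) -gt "$threshold" ]
}
```

With a probe that runs every 15 minutes, a gap above 3600 seconds means roughly four scheduled runs in a row pushed no metrics. The `for: 15m` clause then requires the condition to hold for another 15 minutes before the alert fires.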