[forgejo] Phases 3+4+5: cutover, decommission, docs sweep

Final stage of forgejo-registry-consolidation. With Phase 0/1 already
landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips every consumer off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml still dual-pushes until the next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
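
The registry-private flips listed above all follow one mapping: drop the legacy host (with or without the :5050 port) and put the repository under forgejo.viktorbarzin.me/viktor/. A minimal shell sketch of that rewrite follows. The helper name `to_forgejo_ref` is hypothetical and not part of the repo, and the DockerHub move of claude-memory-mcp is a separate case it does not handle.

```shell
# Hypothetical helper (not in the repo): map a legacy registry-private
# image reference to its Forgejo equivalent.
to_forgejo_ref() {
  ref=$1
  case $ref in
    registry.viktorbarzin.me:5050/*) name=${ref#registry.viktorbarzin.me:5050/} ;;
    registry.viktorbarzin.me/*)      name=${ref#registry.viktorbarzin.me/} ;;
    *) printf 'not a registry-private reference: %s\n' "$ref" >&2; return 1 ;;
  esac
  # Repository name and tag are kept as-is under the viktor/ owner.
  printf 'forgejo.viktorbarzin.me/viktor/%s\n' "$name"
}

to_forgejo_ref 'registry.viktorbarzin.me/chrome-service-novnc:v4'
# prints forgejo.viktorbarzin.me/viktor/chrome-service-novnc:v4
```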

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the files removed
  manually during the `setup-forgejo-containerd-mirror.sh` rollout;
  the cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.
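
Since Phase 4 removes the old auths entries and hosts.toml files, any remaining reference to the retired endpoints will fail at pull time. One way to catch stragglers before that is a plain text search. The sketch below assumes a hypothetical helper name `find_legacy_refs` and builds a throwaway directory for the demo; in practice it would be pointed at a checkout such as infra/stacks.

```shell
# Hypothetical helper: list lines under a directory that still mention the
# retired registry endpoints. Returns 0 when a reference is found and 1 when
# the tree is clean (grep's own exit status).
find_legacy_refs() {
  grep -rnE 'registry\.viktorbarzin\.me|10\.0\.20\.10:5050' "$1"
}

# Demo on a temporary directory with one legacy and one migrated reference.
demo=$(mktemp -d)
printf 'image = "registry.viktorbarzin.me/freedify:1"\n' > "$demo/old.tf"
printf 'image = "forgejo.viktorbarzin.me/viktor/freedify:1"\n' > "$demo/new.tf"
find_legacy_refs "$demo"   # reports only the line in old.tf
rm -rf "$demo"
```

The pattern only matches port 5050 on the VM address, so the pull-through caches on 5000/5010/5020/5030/5040 are not flagged. Historical comments that name the old hostname will match and need a manual look.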

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10: edit /opt/registry/docker-compose.yml to
   match the new template, then run
   `docker compose up -d --remove-orphans` to actually stop the
   registry-private container. Memory id=1078 confirms cloud-init
   won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list. At that
   point the post-push integrity check at lines 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).
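
Step 1 only takes effect once the container is actually gone, so a quick check after `docker compose up -d --remove-orphans` may be worthwhile. A minimal sketch, assuming the container name contains the substring `registry-private` (the real name depends on the compose project name); `registry_private_gone` is a hypothetical helper and the sample names in the demo are made up.

```shell
# Hypothetical check: reads container names on stdin (one per line) and
# succeeds only if none of them contains "registry-private".
registry_private_gone() {
  ! grep -q 'registry-private'
}

# Intended use on the VM after step 1 (not executed here):
#   docker ps --format '{{.Names}}' | registry_private_gone && echo "stopped"

# Demo with made-up container names standing in for `docker ps` output.
printf 'registry-nginx-1\nregistry-ghcr-1\n' | registry_private_gone \
  && echo "registry-private is not running"
```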

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-07 18:30:02 +00:00
parent e86efd107a
commit 4ec40ea804
20 changed files with 89 additions and 294 deletions

@ -567,7 +567,8 @@ resource "kubernetes_deployment" "beadboard" {
container {
name = "beadboard"
image = "registry.viktorbarzin.me:5050/beadboard:${var.beadboard_image_tag}"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
port {
name = "http"
@ -725,7 +726,8 @@ resource "kubernetes_config_map" "beads_metadata" {
}
locals {
claude_agent_service_image = "registry.viktorbarzin.me/claude-agent-service:${var.claude_agent_service_image_tag}"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
claude_agent_service_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
beads_script_prelude = <<-EOT

@ -269,8 +269,9 @@ resource "kubernetes_deployment" "chrome_service" {
# ingress at chrome.viktorbarzin.me. WS port 3000 (the Playwright
# endpoint) stays internal-only.
container {
name = "novnc"
image = "registry.viktorbarzin.me/chrome-service-novnc:v4"
name = "novnc"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/chrome-service-novnc:v4"
image_pull_policy = "IfNotPresent"
port {
name = "http"

@ -10,7 +10,8 @@ data "vault_kv_secret_v2" "viktor_secrets" {
locals {
namespace = "claude-agent"
image = "registry.viktorbarzin.me/claude-agent-service"
# Phase 3 cutover 2026-05-07; see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/claude-agent-service"
image_tag = "2fd7670d"
labels = {
app = "claude-agent-service"

@ -175,8 +175,10 @@ resource "kubernetes_deployment" "claude-memory" {
}
}
container {
name = "claude-memory"
image = "viktorbarzin/claude-memory-mcp:17"
name = "claude-memory"
# Phase 3 cutover 2026-05-07: moved off DockerHub to Forgejo as
# part of the registry consolidation. Old: viktorbarzin/claude-memory-mcp:17
image = "forgejo.viktorbarzin.me/viktor/claude-memory-mcp:17"
port {
container_port = 8000
@ -282,7 +284,4 @@ module "ingress" {
"gethomepage.dev/name" = "Claude Memory"
"gethomepage.dev/description" = "Shared persistent memory for Claude sessions"
"gethomepage.dev/icon" = "claude-ai.png"
"gethomepage.dev/group" = "Core Platform"
"gethomepage.dev/pod-selector" = ""
}
}
"gethomepage.dev/group" = "Cor

@ -8,7 +8,11 @@ variable "postgresql_host" { type = string }
locals {
namespace = "fire-planner"
image = "registry.viktorbarzin.me/fire-planner:${var.image_tag}"
# Phase 3 cutover 2026-05-07. NOTE: the registry-private repo for
# fire-planner has 0 tags, so the first build via Woodpecker on the new Forgejo
# repo (viktor/fire-planner, Dockerfile + .woodpecker.yml added 2026-05-07)
# must succeed BEFORE the next pod restart, otherwise pulls will 404.
image = "forgejo.viktorbarzin.me/viktor/fire-planner:${var.image_tag}"
labels = {
app = "fire-planner"
}

@ -105,7 +105,8 @@ resource "kubernetes_deployment" "freedify" {
name = "registry-credentials"
}
container {
image = "registry.viktorbarzin.me/freedify:${var.tag}"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/freedify:${var.tag}"
name = "freedify"
port {

@ -75,17 +75,11 @@ module "k8s-node-template" {
mkdir -p /etc/containerd/certs.d/ghcr.io
printf 'server = "https://ghcr.io"\n\n[host."http://10.0.20.10:5010"]\n capabilities = ["pull", "resolve"]\n\n[host."https://ghcr.io"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/ghcr.io/hosts.toml
# Create hosts.toml for private registry: both IP and hostname entries
# IP-based (10.0.20.10:5050): direct access, skip TLS verify (wildcard cert, no IP SAN)
mkdir -p /etc/containerd/certs.d/10.0.20.10:5050
printf 'server = "https://10.0.20.10:5050"\n\n[host."https://10.0.20.10:5050"]\n capabilities = ["pull", "resolve", "push"]\n skip_verify = true\n' > /etc/containerd/certs.d/10.0.20.10:5050/hosts.toml
# Hostname-based (registry.viktorbarzin.me): redirects to LAN IP to avoid Traefik round-trip
mkdir -p /etc/containerd/certs.d/registry.viktorbarzin.me
printf 'server = "https://registry.viktorbarzin.me"\n\n[host."https://10.0.20.10:5050"]\n capabilities = ["pull", "resolve", "push"]\n skip_verify = true\n' > /etc/containerd/certs.d/registry.viktorbarzin.me/hosts.toml
# Forgejo OCI registry: redirect to in-cluster Traefik LB (10.0.20.200) so
# pulls don't hairpin out through the WAN gateway. Traefik serves the
# *.viktorbarzin.me wildcard so SNI verification still passes.
# registry.viktorbarzin.me / 10.0.20.10:5050 entries removed in Phase 4 of
# the forgejo-registry-consolidation 2026-05-07; registry-private is gone.
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
printf 'server = "https://forgejo.viktorbarzin.me"\n\n[host."https://10.0.20.200"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
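
The hosts.toml written above keeps forgejo.viktorbarzin.me as the registry name while sending pulls to the Traefik LB at 10.0.20.200. In containerd's registry host configuration, `[host."..."]` entries are tried first and `server` acts as the fallback. The sketch below shows the same file shape as a small function; `render_hosts_toml` is a hypothetical helper and is not what cloud-init runs.

```shell
# Hypothetical helper: print a containerd hosts.toml that keeps the public
# registry name as `server` but lists a different endpoint as the pull host.
render_hosts_toml() {
  server=$1
  mirror=$2
  printf 'server = "https://%s"\n\n[host."https://%s"]\n  capabilities = ["pull", "resolve"]\n' \
    "$server" "$mirror"
}

# Same values as the cloud-init line; on a node the output would go to
# /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml.
render_hosts_toml forgejo.viktorbarzin.me 10.0.20.200
```

The capabilities list omits "push", matching the Forgejo entry in the template: nodes only pull, and pushes come from CI.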

@ -8,7 +8,8 @@ variable "postgresql_host" { type = string }
locals {
namespace = "job-hunter"
image = "registry.viktorbarzin.me/job-hunter:${var.image_tag}"
# Phase 3 cutover 2026-05-07; see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/job-hunter:${var.image_tag}"
labels = {
app = "job-hunter"
}

@ -20,21 +20,12 @@ resource "kubernetes_secret" "registry_credentials" {
data = {
".dockerconfigjson" = jsonencode({
auths = {
"registry.viktorbarzin.me" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
}
"registry.viktorbarzin.me:5050" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
}
"10.0.20.10:5050" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
}
# Forgejo OCI registry: read-only PAT for the cluster-puller service
# account user. Pushes go through ci-pusher (separate PAT in Vault
# secret/ci/global, surfaced to Woodpecker).
# try() lets the apply succeed before the Vault key is populated
# during Phase 0 bootstrap (see docs/runbooks/forgejo-registry-setup.md).
# The cluster has no consumers yet, so broken creds are visible but harmless.
# Phase 4 of forgejo-registry-consolidation 2026-05-07: registry-private
# decommissioned. Old auths entries (registry.viktorbarzin.me,
# registry.viktorbarzin.me:5050, 10.0.20.10:5050) removed to prevent
# silent fallback. If a pod somehow references the old hostname now,
# it will visibly fail with auth missing rather than silently pulling
# potentially-stale blobs.
"forgejo.viktorbarzin.me" = {
auth = base64encode("cluster-puller:${try(data.vault_kv_secret_v2.viktor.data["forgejo_pull_token"], "")}")
}
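
Each `auths` entry in a `.dockerconfigjson` Secret carries `auth = base64("user:token")`, which is what the `base64encode("cluster-puller:...")` expression above produces. A minimal shell sketch of the same encoding, useful for comparing against the deployed Secret by hand; `dockercfg_auth` is a hypothetical helper and `example-token` is a placeholder, not a real credential.

```shell
# Hypothetical helper: build the `auth` value of a .dockerconfigjson entry,
# i.e. base64 of "user:token" with no trailing newline or line wrapping.
dockercfg_auth() {
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Placeholder token for illustration only.
dockercfg_auth cluster-puller example-token
echo
```

With the `try(..., "")` fallback in the Terraform expression, an unpopulated Vault key would encode `cluster-puller:` with an empty token, so the Secret still renders but pulls would be rejected.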

@ -243,193 +243,15 @@ resource "kubernetes_cron_job_v1" "dns_anomaly_monitor" {
}
# -----------------------------------------------------------------------------
# Registry manifest-integrity probe: HEADs every tag in the private R/W
# registry's catalog, walks multi-platform image indexes, and reports blob
# availability. Catches the orphan-index failure mode seen 2026-04-13 and
# 2026-04-19 before downstream pipelines hit it.
# Phase 4 of forgejo-registry-consolidation 2026-05-07: registry-private
# decommissioned. The integrity probe formerly defined here caught the
# orphan-index failure mode in `registry:2.8.3` (post-mortem 2026-04-19).
# With that engine retired, the probe is replaced by
# `forgejo_integrity_probe` below.
#
# Resource definitions stripped wholesale; terragrunt apply destroys the
# in-cluster CronJob + Secret on the next run.
# See: docs/post-mortems/2026-04-19-registry-orphan-index.md
# -----------------------------------------------------------------------------
resource "kubernetes_secret" "registry_probe_credentials" {
metadata {
name = "registry-probe-credentials"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
type = "Opaque"
data = {
REG_USER = var.registry_user
REG_PASS = var.registry_password
}
}
resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
metadata {
name = "registry-integrity-probe"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
schedule = "*/15 * * * *"
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 600
template {
metadata {}
spec {
container {
name = "registry-integrity-probe"
image = "docker.io/library/alpine:3.20"
env {
name = "REG_USER"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
key = "REG_USER"
}
}
}
env {
name = "REG_PASS"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
key = "REG_PASS"
}
}
}
env {
name = "REGISTRY_HOST"
value = "10.0.20.10:5050"
}
env {
name = "REGISTRY_INSTANCE"
value = "registry.viktorbarzin.me:5050"
}
env {
name = "PUSHGATEWAY"
value = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/registry-integrity-probe"
}
env {
name = "TAGS_PER_REPO"
value = "5"
}
command = ["/bin/sh", "-c", <<-EOT
set -eu
apk add --no-cache curl jq >/dev/null
REG="$REGISTRY_HOST"
INSTANCE="$REGISTRY_INSTANCE"
AUTH="$REG_USER:$REG_PASS"
ACCEPT='application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
push() {
# Prometheus pushgateway body ends with blank line. Ignore push errors.
curl -sf --max-time 10 --data-binary @- "$PUSHGATEWAY" >/dev/null 2>&1 || true
}
CATALOG=$(curl -sk -u "$AUTH" --max-time 30 "https://$REG/v2/_catalog?n=1000" || echo "")
REPOS=$(echo "$CATALOG" | jq -r '.repositories[]?' 2>/dev/null || echo "")
if [ -z "$REPOS" ]; then
echo "ERROR: empty catalog or auth failure — cannot probe"
NOW=$(date +%s)
push <<METRICS
# TYPE registry_manifest_integrity_catalog_accessible gauge
registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 0
# TYPE registry_manifest_integrity_last_run_timestamp gauge
registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
METRICS
exit 1
fi
FAIL=0
REPOS_N=0
TAGS_N=0
INDEXES_N=0
printf '%s\n' $REPOS > /tmp/repos.txt
while IFS= read -r repo; do
[ -z "$repo" ] && continue
REPOS_N=$((REPOS_N + 1))
TAGS_JSON=$(curl -sk -u "$AUTH" --max-time 15 "https://$REG/v2/$repo/tags/list" || echo "")
echo "$TAGS_JSON" | jq -r '.tags[]?' 2>/dev/null | tail -n "$TAGS_PER_REPO" > /tmp/tags.txt || true
while IFS= read -r tag; do
[ -z "$tag" ] && continue
TAGS_N=$((TAGS_N + 1))
HTTP=$(curl -sk -u "$AUTH" -o /tmp/m.json -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 15 \
"https://$REG/v2/$repo/manifests/$tag")
if [ "$HTTP" != "200" ]; then
echo "FAIL: $repo:$tag manifest HTTP $HTTP"
FAIL=$((FAIL + 1))
continue
fi
MT=$(jq -r '.mediaType // empty' /tmp/m.json 2>/dev/null || echo "")
if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
INDEXES_N=$((INDEXES_N + 1))
jq -r '.manifests[].digest' /tmp/m.json > /tmp/children.txt 2>/dev/null || true
while IFS= read -r d; do
[ -z "$d" ] && continue
CH=$(curl -sk -u "$AUTH" -o /dev/null -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 10 -I \
"https://$REG/v2/$repo/manifests/$d")
if [ "$CH" != "200" ]; then
echo "FAIL: $repo:$tag index child $d HTTP $CH"
FAIL=$((FAIL + 1))
fi
done < /tmp/children.txt
fi
done < /tmp/tags.txt
done < /tmp/repos.txt
NOW=$(date +%s)
push <<METRICS
# TYPE registry_manifest_integrity_failures gauge
registry_manifest_integrity_failures{instance="$INSTANCE"} $FAIL
# TYPE registry_manifest_integrity_catalog_accessible gauge
registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 1
# TYPE registry_manifest_integrity_repos_checked gauge
registry_manifest_integrity_repos_checked{instance="$INSTANCE"} $REPOS_N
# TYPE registry_manifest_integrity_tags_checked gauge
registry_manifest_integrity_tags_checked{instance="$INSTANCE"} $TAGS_N
# TYPE registry_manifest_integrity_indexes_checked gauge
registry_manifest_integrity_indexes_checked{instance="$INSTANCE"} $INDEXES_N
# TYPE registry_manifest_integrity_last_run_timestamp gauge
registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
METRICS
echo "Probe complete: $FAIL failures across $REPOS_N repos / $TAGS_N tags / $INDEXES_N indexes"
if [ "$FAIL" -gt 0 ]; then exit 1; fi
EOT
]
resources {
requests = {
cpu = "10m"
memory = "48Mi"
}
limits = {
memory = "96Mi"
}
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------
# Forgejo registry integrity probe: same algorithm as registry-integrity-probe

@ -1657,7 +1657,7 @@ serverFiles:
severity: critical
annotations:
summary: "{{ $labels.instance }}: {{ $value }} broken manifest reference(s) — orphan index or missing blob"
description: "The integrity probe CronJob found {{ $value }} manifest/blob references that return non-200 on {{ $labels.instance }}. For registry.viktorbarzin.me see docs/runbooks/registry-rebuild-image.md (orphan OCI-index child from cleanup-tags.sh+GC race). For forgejo.viktorbarzin.me see docs/runbooks/forgejo-registry-rebuild-image.md."
description: "The forgejo-integrity-probe CronJob found {{ $value }} manifest/blob references that return non-200 on {{ $labels.instance }}. Rebuild the affected image per docs/runbooks/forgejo-registry-rebuild-image.md. (registry.viktorbarzin.me retired Phase 4 of forgejo-registry-consolidation 2026-05-07 — only forgejo.viktorbarzin.me remains.)"
- alert: RegistryIntegrityProbeStale
expr: time() - registry_manifest_integrity_last_run_timestamp > 3600
for: 15m

@ -8,7 +8,10 @@ variable "postgresql_host" { type = string }
locals {
namespace = "payslip-ingest"
image = "registry.viktorbarzin.me/payslip-ingest:${var.image_tag}"
# Phase 3 of forgejo-registry-consolidation: image= flipped to Forgejo
# 2026-05-07. registry-private kept image at the same path, so the new
# Forgejo URL is `viktor/<name>` under forgejo.viktorbarzin.me.
image = "forgejo.viktorbarzin.me/viktor/payslip-ingest:${var.image_tag}"
labels = {
app = "payslip-ingest"
}