infra

Author	SHA1	Message	Date
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	5030b44535	[forgejo] Phase 4 final decommission: drop registry-private container + port 5050 Image migration completed (forgejo-migrate-orphan-images.sh ran + all in-scope images now under forgejo.viktorbarzin.me/viktor/) and the cluster cutover landed in commit `3148d15d`. registry-private is no longer needed. * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. * infra/modules/docker-registry/nginx_registry.conf: upstream private block + port 5050 server block removed. * infra/.woodpecker/build-ci-image.yml: drop the dual-push to registry.viktorbarzin.me:5050; only push to Forgejo. Verify-integrity step removed (the every-15min forgejo-integrity-probe in monitoring covers it). Break-glass tarball step still runs but pulls from Forgejo (the only registry left). The registry-config-sync.yml pipeline will pick this commit up and sync the new compose+nginx to the VM. Manual final step on the VM: ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans' to actually destroy the registry-private container; compose does NOT do orphan removal on a normal up -d. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 19:08:17 +00:00
Viktor Barzin	40a8e66c58	[ci] Phase 1: infra-ci dual-push + break-glass tarball Adds Forgejo as a second push target on the build-ci-image pipeline and saves the just-pushed image as a gzipped tarball on the registry VM disk (/opt/registry/data/private/_breakglass/) so we can recover infra-ci with `ctr images import` if both registries are down. * Dual-push: registry.viktorbarzin.me:5050/infra-ci AND forgejo.viktorbarzin.me/viktor/infra-ci, in the same woodpeckerci/plugin-docker-buildx step. Same image bytes; the Forgejo integrity probe (every 15min) catches any divergence. * Break-glass step: SSHes to 10.0.20.10, docker pulls + saves + gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant so a transient registry blip doesn't fail the build pipeline. * Runbook docs/runbooks/forgejo-registry-breakglass.md documents the recovery flow (when to use, scp+ctr import, node cordon, underlying-issue fix). Tarball mirrors to Synology automatically through the existing daily offsite-sync-backup job — no new sync wiring needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 16:01:20 +00:00
Viktor Barzin	7cb44d7264	[registry] Stop recurring orphan OCI-index incidents: detection + prevention + recovery Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (`a05d63e` / `6371e75` / `c113be4`) restored green CI but left the underlying bug unaddressed. Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes. Phase 1 (Detection): - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source. - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts (RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible), closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours. - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix. Phase 2 (Prevention): - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun. - modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto-delete; deleting a published image is a conscious decision. Layer-link scan preserved. Phase 3 (Recovery): - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml). - docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down. - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook. Out of scope (verified healthy or intentionally deferred): - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s). - Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day). - Diun exclude for registry:2: not applicable, since Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose. Verified locally: - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children. - terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings). - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean. Closes: code-4b8 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:08:28 +00:00
Viktor Barzin	704fa09185	fix: remove manual event from build-ci-image to fix issue automation build-ci-image.yml had event:[push,manual] which caused it to run on every manual pipeline trigger. Its registry_user/registry_password secrets don't have the manual event, causing all manual pipelines to error. Removed manual from its event list since it only needs push. Reverted evaluate conditions (Woodpecker evaluates secrets before conditions, so evaluate can't prevent missing-secret errors). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:31:25 +00:00
Viktor Barzin	a583b11484	fix: guard manual Woodpecker pipelines with evaluate conditions When GHA triggers a manual pipeline for issue automation, ALL pipelines with event:manual fire. Added evaluate conditions: - issue-automation.yml: only runs when ISSUE_NUMBER is set - provision-user.yml: only runs when ISSUE_NUMBER is NOT set - build-ci-image.yml: only runs when ISSUE_NUMBER is NOT set This prevents build-ci-image from failing on missing registry_password secret when issue automation triggers. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:29:35 +00:00
Viktor Barzin	36454b87d1	feat: CI/CD performance overhaul - New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4, git-crypt, sops, kubectl pre-installed. Pushed to private registry. Eliminates 17 apk add calls + binary downloads per pipeline run. - Unified CI pipeline: merge default.yml + app-stacks.yml into one. Changed-stacks-only detection (git diff, with global-file fallback). Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4). Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR). - Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale lock detection. Blocks concurrent applies to same stack. - TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev. - Daily drift detection pipeline (.woodpecker/drift-detection.yml). Runs terraform plan on all stacks, Slack alert on drift. - CI image build pipeline (.woodpecker/build-ci-image.yml). Expected speedup: ~5-10 min per pipeline run → ~2-4 min. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:22:26 +00:00
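
The per-stack advisory lock described in commit 36454b87d1 (30min TTL, stale lock detection, concurrent applies to the same stack blocked) might behave roughly like the sketch below. This is an illustration only, not the code in scripts/tg: the real locks live in Vault, whereas this sketch assumes a plain in-memory dict, and `try_acquire`, `release`, and the owner strings are hypothetical names. A shared store would also need an atomic check-and-set, which the dict version does not provide.

```python
import time
from typing import Dict, Optional, Tuple

LOCK_TTL_SECONDS = 30 * 60  # 30min TTL, as stated in the commit message

# Hypothetical in-memory stand-in for the lock store: stack -> (owner, acquired_at)
LockStore = Dict[str, Tuple[str, float]]


def try_acquire(locks: LockStore, stack: str, owner: str,
                now: Optional[float] = None,
                ttl: float = LOCK_TTL_SECONDS) -> bool:
    """Take the lock for `stack` unless another owner holds a fresh one."""
    now = time.time() if now is None else now
    held = locks.get(stack)
    if held is not None:
        holder, acquired_at = held
        # A lock older than the TTL is treated as stale and may be taken over.
        if holder != owner and now - acquired_at < ttl:
            return False
    locks[stack] = (owner, now)
    return True


def release(locks: LockStore, stack: str, owner: str) -> None:
    """Drop the lock only if `owner` still holds it."""
    held = locks.get(stack)
    if held is not None and held[0] == owner:
        del locks[stack]
```

Locks are keyed by stack, so applies to different stacks never block each other; only a second apply to the same stack inside the TTL window is refused.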

8 commits
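
The verify-integrity walk described in commit 7cb44d7264 (index + children + config + every layer blob, failing on any non-200) could look something like the following sketch. It is not the pipeline's actual step: `find_broken`, `get_manifest`, and `head_status` are hypothetical names, and the two callbacks are assumed to wrap the registry's `/v2/<name>/manifests/<ref>` and `/v2/<name>/blobs/<digest>` endpoints so that the walk itself stays free of network code.

```python
from typing import Callable, Dict, List, Tuple

# Media types that mark a manifest as a multi-arch index / manifest list.
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}

GetManifest = Callable[[str], Dict]     # ref or digest -> parsed manifest JSON
HeadStatus = Callable[[str, str], int]  # ("manifests" | "blobs", ref) -> HTTP status


def find_broken(ref: str, get_manifest: GetManifest,
                head_status: HeadStatus) -> List[Tuple[str, str]]:
    """Return every (kind, ref) under `ref` whose HEAD is not 200."""
    if head_status("manifests", ref) != 200:
        # An unreachable manifest cannot be walked further; report it and stop.
        return [("manifests", ref)]
    manifest = get_manifest(ref)
    broken: List[Tuple[str, str]] = []
    if manifest.get("mediaType") in INDEX_TYPES:
        # An index only points at child manifests; recurse into each one.
        for child in manifest.get("manifests", []):
            broken.extend(find_broken(child["digest"], get_manifest, head_status))
        return broken
    # A plain image manifest references one config blob plus its layer blobs.
    digests = [manifest["config"]["digest"]]
    digests += [layer["digest"] for layer in manifest.get("layers", [])]
    broken.extend(("blobs", d) for d in digests if head_status("blobs", d) != 200)
    return broken
```

A pipeline step built on this would exit non-zero whenever the returned list is non-empty, which is the "fails the pipeline on any non-200" behaviour the commit describes.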