Image migration is complete (forgejo-migrate-orphan-images.sh ran and
all in-scope images are now under forgejo.viktorbarzin.me/viktor/), and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.
* infra/modules/docker-registry/docker-compose.yml — registry-private
service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
integrity step removed (the every-15min forgejo-integrity-probe
in monitoring covers it). Break-glass tarball step still runs but
pulls from Forgejo (the only registry left).
The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End of forgejo-registry-consolidation. With Phases 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml still dual-pushes until the next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed by the
`setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf:
registry_integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at lines 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds Forgejo as a second push target on the build-ci-image pipeline
and saves the just-pushed image as a gzipped tarball on the registry
VM disk (/opt/registry/data/private/_breakglass/) so we can recover
infra-ci with `ctr images import` if both registries are down.
* Dual-push: registry.viktorbarzin.me:5050/infra-ci AND
forgejo.viktorbarzin.me/viktor/infra-ci, in the same
woodpeckerci/plugin-docker-buildx step. Same image bytes; the
Forgejo integrity probe (every 15min) catches any divergence.
* Break-glass step: SSHes to 10.0.20.10, docker pulls + saves +
gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant
so a transient registry blip doesn't fail the build pipeline (see the
sketch after this list).
* Runbook docs/runbooks/forgejo-registry-breakglass.md documents
the recovery flow (when to use, scp+ctr import, node cordon,
underlying-issue fix).
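A hedged sketch of the break-glass step (host, directory, retention count,
and the latest symlink come from the bullets above; the pull source, image
tag, and SSH wiring are assumptions):
```
#!/bin/sh
# Break-glass save: pull the just-pushed image on the registry VM, store it
# as a gzipped tarball, keep only the newest 5, maintain a "latest" symlink.
set -eu
IMAGE="registry.viktorbarzin.me:5050/infra-ci:latest"   # assumed pull source
DIR="/opt/registry/data/private/_breakglass"
STAMP="$(date +%Y%m%d-%H%M%S)"

ssh root@10.0.20.10 "
  set -eu
  mkdir -p '$DIR'
  docker pull '$IMAGE'
  docker save '$IMAGE' | gzip > '$DIR/infra-ci-$STAMP.tar.gz'
  ln -sf 'infra-ci-$STAMP.tar.gz' '$DIR/latest.tar.gz'
  # keep the 5 newest tarballs, drop the rest
  ls -1t '$DIR'/infra-ci-*.tar.gz | tail -n +6 | xargs -r rm -f
" || echo 'break-glass save failed (tolerated so the build does not fail)'
```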
Tarball mirrors to Synology automatically through the existing
daily offsite-sync-backup job — no new sync wiring needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default workflow truncated the failed-stack output at `tail -5`,
which only captured the trailing source-line indicator (`│ 45: resource
…`) and dropped the actual `Error: …` line above it. Bump to `tail -50`
so the real error is visible without re-running locally to reproduce.
Also fix the pre-warm step's FIRST_STACK detection — `head -1 file1
file2 | head -1` returns the file header (`==> .platform_apply <==`),
not the first stack name, so the cd then fails with "no such file or
directory". Use `cat | head -1` instead.
Pure logging-and-pre-warm change; no stacks touched, so this commit is
a no-op for the apply step.
Replaces the manual scp+bounce sequence that landed registry:2.8.3 on
10.0.20.10 today (see commit 7cb44d72 + nginx-DNS-trap in runbook).
Addresses the "no repeat manual fixes" preference — future changes to
docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf /
config-private.yml / cleanup-tags.sh now deploy through CI.
Pipeline (.woodpecker/registry-config-sync.yml) mirrors
pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set,
bounce compose only when compose-visible files changed, always restart
nginx after a compose bounce (critical — nginx caches upstream DNS), end
with a dry-run fix-broken-blobs.sh to catch regressions.
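A condensed sketch of that flow (host, file set, and ordering from the
paragraph above; this is not the literal pipeline YAML, and local paths are
assumptions):
```
#!/bin/sh
set -eu
HOST=root@10.0.20.10
DEST=/opt/registry
cd infra/modules/docker-registry    # assumed location of the managed files

# pin the host key instead of trust-on-first-use
ssh-keyscan -H 10.0.20.10 >> ~/.ssh/known_hosts

# ship the whole managed set
scp docker-compose.yml nginx_registry.conf config-private.yml \
    fix-broken-blobs.sh cleanup-tags.sh "$HOST:$DEST/"

# bounce compose only when compose-visible files changed, and always
# restart nginx afterwards (nginx caches upstream DNS at startup)
if ! git diff --quiet HEAD~1 HEAD -- docker-compose.yml config-private.yml; then
  ssh "$HOST" "cd $DEST && docker compose up -d && docker compose restart nginx"
fi

# regression check, non-destructive
ssh "$HOST" "$DEST/fix-broken-blobs.sh --dry-run"
```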
Credentials:
- Woodpecker repo-secret `registry_ssh_key` (events: push, manual)
- Mirror at Vault `secret/woodpecker/registry_ssh_key`
(private_key / public_key / known_hosts_entry)
- Public key on /root/.ssh/authorized_keys on 10.0.20.10
- Key label: woodpecker-registry-config-sync
Runbook updated with "Auto-sync pipeline" section pointing at the new
flow + manual override command.
Closes: code-3vl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rm -rf's tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source (see the sketch after this list).
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
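A hedged sketch of the verify-integrity walk (registry URL, credential
variable, and jq paths are assumptions; the real step lives in
build-ci-image.yml):
```
#!/bin/sh
# Walk the just-pushed tag: index -> child manifests -> config + layer blobs,
# HEAD each object and fail on any non-200.
set -eu
REG="https://registry.viktorbarzin.me:5050"   # assumption
AUTH="-u viktor:$REGISTRY_PASSWORD"           # assumption
REPO=infra-ci TAG=latest

head_ok() {
  code=$(curl -sS -o /dev/null -w '%{http_code}' $AUTH -H 'Accept: */*' -I "$1")
  [ "$code" = 200 ] || { echo "FAIL $code $1"; exit 1; }
}

index=$(curl -sS $AUTH \
  -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json' \
  "$REG/v2/$REPO/manifests/$TAG")

for d in $(echo "$index" | jq -r '.manifests[]?.digest'); do
  head_ok "$REG/v2/$REPO/manifests/$d"
  child=$(curl -sS $AUTH \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
    "$REG/v2/$REPO/manifests/$d")
  head_ok "$REG/v2/$REPO/blobs/$(echo "$child" | jq -r '.config.digest')"
  for layer in $(echo "$child" | jq -r '.layers[].digest'); do
    head_ok "$REG/v2/$REPO/blobs/$layer"
  done
done
echo "verify-integrity: all referenced objects reachable"
```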
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
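A rough illustration of the orphan-index scan (data root and on-disk layout
assume the stock registry:2 filesystem driver; this is not the actual
script):
```
#!/bin/sh
set -eu
ROOT=/opt/registry/data/private/docker/registry/v2   # assumption

blob_path() { echo "$ROOT/blobs/sha256/$(echo "$1" | cut -c1-2)/$1/data"; }

find "$ROOT/repositories" -path '*/_manifests/revisions/sha256/*/link' |
while read -r link; do
  digest=$(sed 's/^sha256://' "$link")
  data=$(blob_path "$digest")
  [ -f "$data" ] || continue
  # only image indexes / manifest lists carry a .manifests array
  for child in $(jq -r '.manifests[]?.digest' "$data" | sed 's/^sha256://'); do
    [ -f "$(blob_path "$child")" ] ||
      echo "WARNING: index sha256:$digest references missing child sha256:$child ($link)"
  done
done
```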
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P366-P374 default workflow failed with exit 126 "image can't be pulled" — containerd
hosts.toml has a mirror entry for `registry.viktorbarzin.me` but NOT for
`registry.viktorbarzin.me:5050`, so pulls fell through to direct HTTPS on :5050
(which isn't exposed externally). Convention per infra/.claude/CLAUDE.md is the
no-port form; :5050 was an anomaly introduced by the 2026-04-15 CI perf overhaul.
build-cli/build-ci-image push paths still use :5050 and work fine — they go through
the buildx plugin (pod DNS, not node containerd). Only `image:` fields on a step
hit the broken path. Normalizing push URLs left for a follow-up.
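For reference, the missing certs.d entry would have looked roughly like this
(mirror target is an assumption; the chosen fix is the no-port convention
above, not adding this file):
```
# Illustration only — shape of a containerd hosts.toml mirror entry for the
# :5050 host form. The actual remedy was to keep `image:` fields port-less.
mkdir -p '/etc/containerd/certs.d/registry.viktorbarzin.me:5050'
cat > '/etc/containerd/certs.d/registry.viktorbarzin.me:5050/hosts.toml' <<'EOF'
server = "https://registry.viktorbarzin.me:5050"

[host."http://10.0.20.10:5050"]   # assumed mirror target
  capabilities = ["pull", "resolve"]
EOF
```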
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The infra CLI image (`viktorbarzin/infra` + `registry.viktorbarzin.me:5050/infra`)
is built by `.woodpecker/build-cli.yml` via plugin-docker-buildx and
pushed to two repos. The private-registry htpasswd auth that went in
on 2026-03-22 (memory 437) was never wired into this pipeline, so the
second push has been failing with `401 Unauthorized` on every blob
HEAD for ~4 weeks. That in turn kept every infra pipeline's overall
status at `failure`, which fooled the service-upgrade agent into
spurious rollbacks before the per-workflow check in bd code-3o3.
Now that the agent ignores overall status, this is purely cosmetic —
but worth fixing so the pipeline list goes green and the private-
registry mirror of the infra CLI image stays fresh.
## This change
Extend the plugin's `logins:` array with an entry for
`registry.viktorbarzin.me:5050`, pulling credentials from two
Woodpecker global secrets `registry_user` / `registry_password`.
Secrets plumbing (no CI config changes needed long-term — already
`vault-woodpecker-sync` compatible):
- Vault `secret/ci/global` now carries `registry_user` +
`registry_password`, copied from `secret/viktor` via
`vault kv patch`.
- `vault-woodpecker-sync` CronJob picks them up on next run and
POSTs them to Woodpecker via the API. Also triggered manually
as `manual-sync-1776613321` → "Synced 8 global secrets from
Vault to Woodpecker".
- `curl -H "Authorization: Bearer <wp-api-token>" .../api/secrets`
now lists both `registry_user` and `registry_password`.
## What is NOT in this change
- A follow-on cleanup of the `docker_username`/`docker_password`
globals (which are actually DockerHub creds mis-named). They still
work — renaming would cascade across several older pipelines.
- Restoring inline BuildKit cache — commit 0c123903 disabled
`cache_from/cache_to` due to registry cache corruption; leaving
that alone here.
## Test Plan
### Automated
Will be validated by the CI run of this very commit:
- `build-cli` workflow should log `#14 [auth] viktor/registry.viktorbarzin.me:5050` as successful
- blob HEAD returns 200/404 instead of 401
- step `build-image` exits 0
- overall pipeline status: success (FINALLY)
### Manual Verification
```
$ curl -sS -H "Authorization: Bearer $(vault kv get -field=woodpecker_api_token secret/ci/global)" \
https://ci.viktorbarzin.me/api/secrets | jq '.[] | .name' | grep registry
"registry_password"
"registry_user"
$ curl -sSI -u viktor:$PASS https://registry.viktorbarzin.me:5050/v2/infra/manifests/<8-char-sha>
HTTP/2 200
```
Closes: code-12b
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Follow-up to commit 2eca011c (bd code-e1x). That commit attached the
`terraform-state` policy to the `ci` Vault role and propagated apply-
loop failures so the pipeline actually fails when a stack fails. On
the very first push to exercise it (pipeline 361), the platform apply
step died with:
[vault] Starting apply...
state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt
[vault] FAILED (exit 1)
Root cause: in Woodpecker's `commands:` list, each `- |` item runs in
a fresh shell. The dedicated "Vault auth" command was doing
`export VAULT_TOKEN=...`, but that export was lost by the time the
apply command ran. Tier-0 stacks depended on Vault Transit (via
`scripts/state-sync`), and Tier-1 stacks depended on
`vault read database/static-creds/pg-terraform-state` via `scripts/tg`
— both silently fell through to their "no Vault" error path.
This bug was latent before 2eca011c because the old apply loop
swallowed per-stack exit codes. Now that we surface them, the pipeline
fails honestly — but fails on every run. Fixing the missing token
propagation is the last mile.
## This change
- Pin `VAULT_ADDR` at the step's `environment:` level so every command
inherits it without an explicit export.
- In the Vault auth command, assert the auth succeeded (non-empty,
non-"null" token) then write the token to `~/.vault-token` with
`umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all
fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset.
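A minimal sketch of the auth command after this change (the in-cluster Vault
address and `ci` role mirror other snippets in this log; the exact YAML
wiring differs):
```
# VAULT_ADDR is pinned at the step's environment: level; set here only to
# keep the sketch self-contained.
VAULT_ADDR=http://vault-active.vault.svc:8200
SA_JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
  -d "{\"role\":\"ci\",\"jwt\":\"$SA_JWT\"}" | jq -r .auth.client_token)

# assert auth actually succeeded before trusting the token
[ -n "$TOKEN" ] && [ "$TOKEN" != "null" ] || { echo "Vault auth failed"; exit 1; }

# later commands run in fresh shells, so persist the token where vault,
# scripts/tg and scripts/state-sync all look when VAULT_TOKEN is unset
umask 077
printf '%s' "$TOKEN" > ~/.vault-token
```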
## What is NOT in this change
- A broader refactor to fold the multi-step chain into a single
`- |` script — preserving the existing granular structure keeps
individual step logs grep-friendly and failures localised.
- Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token
is written, and would need duplicating into each command anyway.
## Test Plan
### Automated
N/A (pure YAML change). Will be verified by the very next CI run —
the push creating this commit.
### Manual Verification
Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose
commit matches this one. Expected:
- `default` workflow exercises the auth + apply steps.
- Platform apply for `vault` stack runs state-sync decrypt → detects
no drift (I applied locally already) → OK.
- Tier-1 stacks (if any in the diff):
`vault read database/static-creds/pg-terraform-state` returns creds → apply runs.
- No "state-sync: ERROR" or "Cannot read PG credentials" errors.
- `default` workflow state: success.
- Overall pipeline status: still failure because `build-cli` is
independently broken (bd code-12b); that's cosmetic.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow success — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.
Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
[servarr] Starting apply...
ERROR: Cannot read PG credentials from Vault.
Run: vault login -method=oidc
[servarr] FAILED (exit 1)
Two root causes, two fixes here.
### 1. Vault `ci` role lacks Tier-1 PG backend creds
The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.
**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.
### 2. Apply-loop swallows stack failures
`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.
**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
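A condensed sketch of the new failure accounting (FAILED_PLATFORM_STACKS and
.platform_failed are named in this message; the stack-list file name and the
lock-skip match are assumptions):
```
FAILED_PLATFORM_STACKS=""
while read -r stack; do
  set +e
  OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
  EXIT=$?
  set -e
  echo "$OUTPUT"
  # lock-skipped stacks stay non-fatal; anything else gets recorded
  if [ "$EXIT" -ne 0 ] && ! echo "$OUTPUT" | grep -qi 'locked by'; then
    echo "[$stack] FAILED (exit $EXIT)"
    FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
  fi
done < .platform_apply

# survives the step boundary; the app-stack step reads this back and
# exits 1 if either this list or its own FAILED_APP_STACKS is non-empty
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
```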
Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.
## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
TF file is already at 5.1.4 in git; once CI picks up this commit
it'll apply on its own, or Viktor can run `tg apply` locally now
that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
per-stack continuation so a single bad stack doesn't hide the
others' plans from the log. Just making the final status honest.
## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
# vault_kubernetes_auth_backend_role.ci will be updated in-place
~ token_policies = [
+ "terraform-state",
# (1 unchanged element hidden)
]
# vault_jwt_auth_backend.oidc will be updated in-place
~ tune = [...] # cosmetic provider-schema drift, pre-existing
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.
### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
curl -s -H "X-Vault-Token: $TOK" \
http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}
# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```
Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 6b of the state-drift consolidation plan. `scripts/pve-nfs-exports` is
the git-managed source of truth for the Proxmox host's NFS export table
(file header documents this since the 2026-04-14 NFS outage post-mortem).
Deploying it was runbook-only — `scp` then `ssh ... exportfs -ra` — which
meant a change could sit unpushed-to-PVE indefinitely, and nothing alerted
on divergence between git and host.
Wave 6b closes that loop: a Woodpecker pipeline watches
`scripts/pve-nfs-exports` on the `master` branch, diffs against the
current host file, and scp's the new content followed by `exportfs -ra`.
The same 2-shell-command runbook, now a CI step that runs on every push
and is manually triggerable.
## This change
- New pipeline `.woodpecker/pve-nfs-exports-sync.yml` — path-filtered push
trigger + manual.
- SSH credentials provisioned 2026-04-18:
- ed25519 keypair `woodpecker-pve-nfs-exports-sync`
- Public key in `root@192.168.1.127:~/.ssh/authorized_keys`
- Private key in Vault `secret/woodpecker/pve_ssh_key` (plus known_hosts
entry for deterministic host-key pinning from Vault)
- Woodpecker repo-level secret `pve_ssh_key` (id 139) bound to the infra
repo's `push`/`manual`/`cron` events
- Pipeline steps: install openssh + curl (alpine image) → stage private
key from secret → ssh-keyscan the PVE host into known_hosts → diff
current vs. proposed exports (shown in pipeline log) → scp → exportfs
-ra → Slack notify status.
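As shell, those steps amount to roughly (host and file are from above;
package names, key staging, and the secret's env var name are assumptions):
```
#!/bin/sh
set -eu
HOST=root@192.168.1.127
SRC=scripts/pve-nfs-exports

apk add --no-cache openssh curl

# stage the private key from the Woodpecker secret, pin the host key
umask 077; mkdir -p ~/.ssh
printf '%s\n' "$PVE_SSH_KEY" > ~/.ssh/id_ed25519
ssh-keyscan -H 192.168.1.127 >> ~/.ssh/known_hosts

# show the proposed change in the pipeline log, then deploy and re-export
ssh "$HOST" 'cat /etc/exports' | diff -u - "$SRC" || true
scp "$SRC" "$HOST:/etc/exports"
ssh "$HOST" 'exportfs -ra'
```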
## What is NOT in this change
- Drift detection (git-truth vs. host-truth) via cron: this pipeline only
fires on *push*, so a host-side edit wouldn't be caught. Could add a
daily cron that just runs the diff step and alerts if non-empty. Left
as a refinement if drift becomes an issue.
- Pulling known_hosts from Vault rather than ssh-keyscan on each run: the
keyscan is simpler and works against key rotation without needing a
Vault round-trip. Pulling from Vault is the right answer the moment we
add MITM risk, which the internal network doesn't have today.
## Reproduce locally
Edit `scripts/pve-nfs-exports`, push to master. Watch the pipeline in
Woodpecker. Verify on PVE: `ssh root@192.168.1.127 "md5sum /etc/exports"`
matches `md5sum scripts/pve-nfs-exports` in the repo.
Closes: code-dne
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.
## This change
### `.woodpecker/drift-detection.yml`
Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:
| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |
First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).
Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway doesn't support atomic multi-metric updates
natively, but batching at the pipeline layer prevents half-updated
state if the curl is interrupted mid-run (the push call would simply
fail the entire run and alert via `DriftDetectionStale`).
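A sketch of the first-seen reuse and the single batched POST (Pushgateway URL
from above; the grouping label, helper variables, and the omission of the
error/clean counters are simplifications):
```
#!/bin/sh
set -eu
PGW=http://prometheus-prometheus-pushgateway.monitoring:9091
NOW=$(date +%s)

{
  for stack in $DRIFTED_STACKS; do   # assembled earlier in the run
    # reuse a previously pushed first_seen when present, otherwise stamp NOW
    prev=$(curl -s "$PGW/metrics" |
      awk -v s="stack=\"$stack\"" '/^drift_stack_first_seen/ && $0 ~ s {print $2}')
    first=${prev:-$NOW}
    age=$(awk -v n="$NOW" -v f="$first" 'BEGIN { printf "%d", (n - f) / 3600 }')
    printf 'drift_stack_state{stack="%s"} 1\n'       "$stack"
    printf 'drift_stack_first_seen{stack="%s"} %s\n' "$stack" "$first"
    printf 'drift_stack_age_hours{stack="%s"} %s\n'  "$stack" "$age"
  done
  printf 'drift_stack_count %s\n' "${DRIFT_COUNT:-0}"
  printf 'drift_detection_last_run_timestamp %s\n' "$NOW"
} > /tmp/drift.prom

# a single POST for the whole run: an interrupted push fails the run as a unit
curl -sf --data-binary @/tmp/drift.prom "$PGW/metrics/job/drift-detection"
```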
### `stacks/monitoring/.../prometheus_chart_values.tpl`
New `Infrastructure Drift` alert group with three rules:
- **DriftDetectionStale** (warning, 30m): fires if
`drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
grace window on top of the 24h cron so transient Pushgateway or
cluster unavailability doesn't false-alarm. Guards against the
pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
`drift_stack_age_hours > 72` — three days of unacknowledged drift.
Three days is long enough to absorb weekends + typical review cycles
but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
in a single run. Sudden wide drift usually signals systemic causes
(new admission webhook, provider version bump, cluster-wide CRD
upgrade) rather than individual configuration errors, and the alert
body nudges toward that diagnosis.
Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.
## What is NOT in this change
- The Wave 7 **GitHub issue auto-filer** — the full plan included
filing a `drift-detected` issue per drifted stack. Deferred because
it requires wiring the `file-issue` skill's convention + a gh token
exposed to Woodpecker, both of which need separate setup. The Slack
alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
historical view but adds a new DB schema dependency for a CI
pipeline. Pushgateway + Prometheus handle the 72h window we care
about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
baseline has been stable for a few cycles.
Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.
## Verification
```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
| grep -c '^drift_'
# expect a positive number
```
## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.
Refs: Wave 7 umbrella (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
(sketch after this list)
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
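A sketch of the POST-and-poll pattern shared by these migrations (only the
/execute and /jobs/{id} endpoints, jq payload construction, and the token env
var are from the text; the service URL, field names, and status values are
assumptions):
```
#!/bin/sh
set -eu
BASE=http://claude-agent-service.claude-agent-service.svc   # assumed in-cluster URL
AUTH="Authorization: Bearer $CLAUDE_AGENT_API_TOKEN"

# jq -n builds the JSON payload safely (no shell text spliced into JSON)
PAYLOAD=$(jq -n --arg prompt "$PROMPT" '{prompt: $prompt}')

JOB_ID=$(curl -sf -X POST "$BASE/execute" \
  -H "$AUTH" -H 'Content-Type: application/json' \
  -d "$PAYLOAD" | jq -r '.id')

# poll until the job leaves its non-terminal states
while :; do
  STATUS=$(curl -sf -H "$AUTH" "$BASE/jobs/$JOB_ID" | jq -r '.status')
  case "$STATUS" in
    pending|running) sleep 10 ;;
    *) break ;;
  esac
done
[ "$STATUS" = "completed" ]   # terminal status name is an assumption
```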
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
CI now uses scripts/tg instead of raw terragrunt apply, acquiring the
same per-stack Vault KV lock that user sessions use. This prevents CI
from overwriting in-flight user applies.
Changes:
- Switch from xargs -P 4 (parallel) to serial while-read loop
- CI skips stacks locked by users instead of racing them
- Git rebase failures now exit 1 instead of silently continuing
- Updated header comments to reflect new locking behavior
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DevVM may have unstaged changes from active sessions. Use git stash
before pull to avoid 'cannot pull with rebase: unstaged changes' errors.
Stash pop after to restore working state.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Woodpecker injects manual pipeline variables as direct env vars
(e.g., $ISSUE_NUMBER), not as CI_PIPELINE_VARIABLE_* prefixed vars.
The provision-user pipeline already uses this pattern correctly.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build-ci-image.yml had event:[push,manual] which caused it to run
on every manual pipeline trigger. Its registry_user/registry_password
secrets don't have the manual event, causing all manual pipelines to
error. Removed manual from its event list since it only needs push.
Reverted evaluate conditions (Woodpecker evaluates secrets before
conditions, so evaluate can't prevent missing-secret errors).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When GHA triggers a manual pipeline for issue automation, ALL pipelines
with event:manual fire. Added evaluate conditions:
- issue-automation.yml: only runs when ISSUE_NUMBER is set
- provision-user.yml: only runs when ISSUE_NUMBER is NOT set
- build-ci-image.yml: only runs when ISSUE_NUMBER is NOT set
This prevents build-ci-image from failing on missing registry_password
secret when issue automation triggers.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both issue-automation and postmortem pipelines were cd'ing into
~/code/infra before running Claude, missing the root CLAUDE.md
with beads config and project-wide instructions. Now cd to ~/code
and use relative agent paths from there.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default (see the sketch below)
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
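A sketch of that fallback (behaviour per the bullet above; exact paths and
the branch variable are assumptions):
```
# changed-stack detection with a fallback when the shallow clone has no parent
if git rev-parse --verify --quiet HEAD~1 >/dev/null; then
  CHANGED=$(git diff --name-only HEAD~1 HEAD)
elif git fetch --deepen=1 origin "$CI_COMMIT_BRANCH" &&
     git rev-parse --verify --quiet HEAD~1 >/dev/null; then
  CHANGED=$(git diff --name-only HEAD~1 HEAD)
else
  # still no parent: treat every platform stack as changed (safe default)
  CHANGED=$(ls -d stacks/*/)
fi
```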
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
git-crypt, sops, kubectl pre-installed. Pushed to private registry.
Eliminates 17 apk add calls + binary downloads per pipeline run.
- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
Changed-stacks-only detection (git diff, with global-file fallback).
Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).
- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
lock detection. Blocks concurrent applies to same stack.
- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.
- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
Runs terraform plan on all stacks, Slack alert on drift.
- CI image build pipeline (.woodpecker/build-ci-image.yml).
Expected speedup: ~5-10 min per pipeline run → ~2-4 min.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key
from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kyverno generate+synchronize only manages secrets it created itself.
Existing Terraform-managed secrets in ~70 namespaces weren't updated.
Now loops through all namespaces and runs kubectl apply with the new cert.
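Roughly, the loop amounts to (secret name and source namespace match the
ClusterPolicy described below; the jq metadata cleanup is illustrative):
```
# copy the renewed cert from the kyverno source namespace into every namespace
SRC=$(kubectl -n kyverno get secret tls-secret -o json |
  jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion,
          .metadata.creationTimestamp, .metadata.managedFields)')
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  echo "$SRC" | kubectl -n "$ns" apply -f -
done
```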
bitnami/kubectl runs as non-root UID 1001 and cannot read git-crypt-
decrypted secrets owned by root. Switch to alpine (runs as root)
with kubectl downloaded directly.
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
- Add input validation: username regex + email format check in pipeline
- Quote variables in .provision-env to prevent shell injection
- Remove dead source command (each Woodpecker command is separate shell)
- Use jq to build JSON payloads (prevents injection via group names)
- Clean up git-crypt key on failure (use ; instead of &&)
- Add Kyverno ndots lifecycle ignore to webhook-handler deployment
Vault stack can't be applied in CI (git-crypt TLS certs + sensitive
for_each on k8s_users). Pipeline now automates Vault KV update +
Authentik group creation, then notifies admin to apply stacks manually.
This matches the existing pattern — vault is not in default.yml either.
Woodpecker performs compile-time substitution on ${...} patterns,
replacing pipeline variables with empty strings. Using $VAR without
braces lets the shell evaluate them at runtime.
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
LimitRange defaults containers to 192Mi which is insufficient for
terragrunt apply on the platform stack (48 vault refs, many modules).
Set explicit 1Gi request / 2Gi limit via backend_options.
- build-cli.yml: comment out cache_from/cache_to to avoid BuildKit
"short read" errors from corrupted registry cache
- default.yml: add git pull --rebase before push in cleanup-and-push
to handle remote having newer commits
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.
Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.
Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.
Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.
K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.
New: stacks/_template/ with copyable stack template for namespace-owners.
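An illustrative shape of one such k8s_users entry (KV path, field names, and
values are guesses inferred from the feature list above, not the real
schema):
```
# hypothetical example — adjust the path and fields to the real k8s_users schema
vault kv patch secret/<k8s_users-path> k8s_users='{
  "alice": {
    "email": "alice@example.com",
    "groups": ["namespace-owner"],
    "domains": ["alice.viktorbarzin.me"]
  }
}'
```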
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.