Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).
Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).
Closes: code-da4h
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.
Added entries:
- amruthpillai/* (resume — reactive-resume)
- athomasson2/* (ebook2audiobook)
- netboxcommunity/* (netbox)
- nousresearch/* (hermes-agent)
- opentripplanner/* (osm-routing)
- rhasspy/* (whisper, piper)
- registry.viktorbarzin.me/* (legacy private registry — council-complaints
still references; should migrate to forgejo)
The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.
## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now PASS:
- netboxcommunity/netbox:v4.5.0-beta1 ✓
- rhasspy/wyoming-whisper:3.1.0 ✓
- registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode
## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.
Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
viktorbarzin/trading-bot-service:latest, command
python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
(poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices
vault: re-enable pg-trading static role
- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
(was disabled 2026-04-06 with the rest of trading-bot stack)
kyverno: add postgres* to trusted-registries allowlist
- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
alpine*, nginx*, python* which were already there)
Final workers Pod containers (in order):
[0] signal-generator
[1] learning-engine
[2] market-data
[3] meet-kevin-watcher (NEW)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.
## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
because the kubectl provider couldn't patch it cleanly (resourceVersion=0
invalid for update — gotcha when adopting a resource previously
kubernetes_manifest-owned).
## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.
## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)
## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.
## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes: code-e2dp
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
vault audit log entry shows Traefik's pod IP instead of the real client IP —
the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
for_each unknown issues unrelated to this change; -target documented in error
message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
within ~10s each. Verified XFF config visible in pod's
/vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
/vault/audit/vault-audit.log) — no change needed.
## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
from memory id=1970 + `frigate, kured, default, changedetection` discovered
during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
(current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.
**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.
## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per user decision, removed authentik, kyverno, metallb-system,
external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets,
infra-maintenance from the policy-level exclude list, and added
keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite
being earlier flagged as scaled-to-0) and woodpecker.
Net cluster coverage: 197/227 workloads on safe-force (86%), up from
170/227 (74%). All 197 are paired with match-tag=true (digest-only).
Remaining 7 namespaces in Kyverno exclude list (irreducible):
- keel (self-update)
- calico-system + tigera-operator (operator-managed Installation CR)
- cnpg-system + dbaas (state-coupled)
- nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships
ubuntu26.04 driver images)
- kube-system (k8s built-ins)
Files:
- stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list
trimmed from 16 → 7
- stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets,
servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker —
added keel.sh/enrolled=true label on kubernetes_namespace resource
- infra-maintenance was in the policy exclude but the namespace doesn't
actually exist in the cluster; the removal is a no-op there
Applied via kubectl patch on the live ClusterPolicy + kubectl label on
namespaces because the kubernetes provider v3.1.0 panics on Kyverno
ClusterPolicy refresh — TF source has the desired state for next clean
apply on a fixed provider.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.
Two-part change:
stacks/kyverno/modules/kyverno/keel-annotations.tf
Trim the policy-level namespace exclude list from 31 → 16. The 16
remaining exclusions are the irreducible cluster-operator + state-
coupled set: keel itself, calico-system + tigera-operator (operator
loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
dbaas (state-coupled), kyverno, metallb-system, external-secrets,
proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
kube-system, vpa, sealed-secrets, infra-maintenance.
stacks/<each-of-15>/.../main.tf
Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
resource so the Kyverno mutate policy can target the workloads via
its namespaceSelector matchLabels.
Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.
Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
- wireguard: sclevine/wg:latest (image fixed today via iptables-nft
postStart shim)
- xray: teddysun/xray
- crowdsec-web: viktorbarzin/crowdsec_web
- monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
- traefik: nginx:1-alpine, openresty/openresty:alpine,
ghcr.io/tarampampam/error-pages:3
- redis: haproxy:3.1-alpine, redis:8-alpine
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.
With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.
Phase 1 rollout state (2026-05-17):
- 10 workloads on force+match-tag in 10 namespaces (Phase 1)
enrolled via keel.sh/enrolled=true namespace label:
linkwarden, excalidraw, diun, echo, foolery, city-guesser,
jsoncrack, privatebin, ntfy, speedtest
- 216 workloads on policy=never (out-of-band kubectl annotate)
- 31 critical namespaces excluded at policy level
Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.
Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.
Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.
Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.
This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User: 'i'm happy with occasional breakages. we have alerts.'
Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).
Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).
Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi (the 384Mi
floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady
state).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.
Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.
With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.
The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.
Caveats:
* Patch causes Terraform image drift for semver-pinned services —
drift-detection pipeline will surface it; lifecycle ignore_changes
on container[].image can be added per stack later if drift is
noisy.
* Tags that aren't parseable as semver (:latest, :11, :nightly,
SHA tags) are ignored by patch — those workloads stay on their
current image until promoted to `force` policy individually.
Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
recruiter-responder, claude-agent-service, claude-memory,
chrome-service, fire-planner, job-hunter, payslip-ingest
Live state already updated via kubectl apply + per-workload patches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.
Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.
To opt in a workload now:
1. Verify the Deployment image is on a mutable tag (:latest,
:<major>, or a vendor "stable" tag) — change in Terraform first
if needed.
2. Add to the Deployment's metadata.annotations:
"keel.sh/policy" = "force" (digest tracking)
OR
"keel.sh/policy" = "patch" (semver patch bumps — also
requires ignore_changes on the image)
Live policy already updated via kubectl apply + per-workload
override (force → never).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap the three new Vault key reads in try(...) so the first apply
succeeds even when forgejo_pull_token / forgejo_cleanup_token /
secret/ci/global haven't been populated yet. Without this, CI
auto-apply blocks on the very push that introduces the references —
chicken-and-egg with the runbook order (which is: apply Forgejo bumps,
then create users + PATs, then apply the rest).
Empty tokens are intentionally visible-broken (auth fails, probe
reports auth failure, cleanup CronJob errors) — that's the signal
to run the bootstrap runbook. Subsequent apply picks up the real
values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces `redis.redis:6379` with `redis-master.redis:6379` in all 11
dependency.kyverno.io/wait-for annotations across 8 stacks, plus one
docs comment in the Kyverno module.
These annotations drive DNS-only `nc -z` init-container readiness
checks — zero RW risk. Both hostnames resolve, so there is no wait-for
failure window during the rolling re-apply.
Closes: code-otr
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context
-------
The cluster policy is "no CPU limits anywhere" — CFS throttling causes
more harm than good for bursty single-threaded workloads (Node.js,
Python). LimitRanges are already correct (defaultRequest.cpu only, no
default.cpu), but 22 pods still carried CPU limits injected by upstream
Helm chart defaults — CrowdSec (lapi + agents), descheduler,
kubernetes-dashboard (×4), nvidia gpu-operator.
Previous attempts were ad-hoc: patch each values.yaml, occasionally
missing things on chart upgrade. This replaces that with a declarative
Kyverno mutation at admission time.
This change
-----------
Adds a new ClusterPolicy `strip-cpu-limits` with two foreach rules:
strip-container-cpu-limit → containers[]
strip-initcontainer-cpu-limit → initContainers[]
Each rule uses `patchesJson6902` with an `op: remove` on
`resources/limits/cpu`. JSON6902 `remove` fails on missing paths, so
per-element preconditions gate the mutation — pods without CPU limits
pass through untouched. A top-level rule precondition short-circuits
using JMESPath filter (`[?resources.limits.cpu != null] | length(@) > 0`)
so the mutation is a no-op for the overwhelming majority of pods.
Admission-time only. No `mutateExistingOnPolicyUpdate`, no `background`.
Existing pods keep their CPU limits until they're restarted naturally
(Helm upgrade, node drain, rollout). We rely on churn, not forced
restarts, to avoid unnecessary thrash.
Memory limits are preserved — they prevent OOM, still useful.
Flow
----
admission request → match Pod + CREATE
→ top-level precondition: any container has limits.cpu?
no → skip (fast path)
yes → foreach container:
element.limits.cpu present?
no → skip element
yes → remove /spec/containers/N/resources/limits/cpu
→ same again for initContainers
→ mutated pod proceeds to API server
Verification
------------
kubectl run test-strip-cpu --overrides='{limits:{cpu:500m,memory:64Mi}}'
→ admitted pod.resources = {limits:{memory:64Mi}, requests:{cpu:50m,memory:32Mi}}
→ CPU limit stripped, memory preserved, requests untouched
kubectl rollout restart deploy/kubernetes-dashboard-metrics-scraper
→ new pod.resources = {limits:{memory:400Mi}, requests:{cpu:100m,memory:200Mi}}
→ cluster-wide count of pods with CPU limits: 22 → 21
Rollout
-------
Remaining 21 pods will drop their CPU limits on natural churn. No manual
restarts in this change — user may want to time a mass restart with a
maintenance window.
Closes: code-eaf
Closes: code-4bz
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods pull from registry.viktorbarzin.me:5050 but the
registry-credentials secret only had auth for registry.viktorbarzin.me
(without port). Containerd requires exact hostname:port match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
- Grant kyverno-admission-controller and kyverno-background-controller
permissions to manage Secrets (required for generate clone rules)
- Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true
(wildcard cert doesn't cover IP SANs) — applied to all nodes + template
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.