infra

Author	SHA1	Message	Date
Viktor Barzin	6bf216751b	Merge forgejo/master (tts stack) into wizard/android-emulator # Conflicts: # stacks/tripit/main.tf	2026-06-11 19:53:07 +00:00
Viktor Barzin	8b7c77c794	android-emulator: new stack — shared in-cluster Android 16 testing instance Viktor is setting up an Android app development pipeline (tripit is the first app) and wants agents to natively test changes on Android before shipping. This adds the testing environment: an API-36 Google emulator under KVM as a privileged pod (namespace joins the Kyverno exclude list), SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP 10.0.20.200:5555 (LAN only), noVNC screen view at android-emulator.viktorbarzin.lan. Image is built manually from the stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo rejected).	2026-06-11 19:51:57 +00:00
Viktor Barzin	798b025580	tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector) The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD; on a merge commit that is the first-parent diff, which contained only the concurrently-landed files — stacks/tts never got applied (namespace still absent) and the kyverno re-trigger push got no pipeline at all. Single non-merge commit touching both stacks so the detector sees them; the sorted loop applies kyverno before tts, the order tripit#26 requires. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:08:23 +00:00
Viktor Barzin	7a1cc64898	kyverno: re-trigger apply of tts GPU-priority exclusion (`87702bdc` was [ci skip]'d) Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit `87702bdc` carried [ci skip], so CI never applied the kyverno change that keeps the tts namespace out of low-GPU-priority injection. This comment-only commit makes CI apply the already-committed change — step 1 of the kyverno -> tts -> tripit apply order. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:29 +00:00
Viktor Barzin	87702bdce8	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	a42f4f7b26	trek: trial-deploy TREK group-trip planner behind Authentik (solo eval) Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment trial to evaluate the self-hosted group-trip use case before building a custom app. Solo, single shared instance, Authentik forward-auth. - stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel), service 80->3000, ingress_factory auth=required + proxied DNS at trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data + uploads) -- encrypted per the sensitive-data rule and to avoid the SQLite-over-NFS locking hazard. - Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC, bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO). - kyverno: add mauriceboe/* to require-trusted-registries allowlist (the policy is Enforce since 2026-05-19 -- also fixed the stale "stays in Audit" header comment that said otherwise and misled the deploy). - Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll companion deferred per solo-trial scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:30:07 +00:00
Viktor Barzin	50d0f1affa	kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix) The 2026-05-26 migration flipped the keel default force->patch and dropped match-tag from the inject-keel-annotations patch, but Kyverno's add-only mutate can't remove an annotation that's no longer listed -- 194 workloads kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in multi-image pods: the blog's nginx<->nginx-exporter images were swapped and the site was down 2026-05-26 -> 06-01 (nginx received the exporter's -nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently swapped (app lost its /datastore PVC + env, ran ephemeral for days). - policy now sets keel.sh/match-tag=null (strips on admission, never re-added) - swept the annotation off all 194 existing workloads (kubectl, no pod restart) - AGENTS.md: documents the strip; post-mortem added blog + changedetection un-swapped via kubectl set image (TF-ignored images); both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG state authoritative). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	0f26bf030b	kyverno: exclude postiz namespace from Keel auto-update injection Postiz was generating hourly Slack spam and a wedged rollout, both Keel-driven: - Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless deny/retry/Slack-ping loop. - Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout for 3 days. postiz Terraform state is heavily drifted (~2/30 resources tracked), so per-workload opt-out can't be applied from the postiz stack. Durable guard is here (clean kyverno state). Operational steps applied live via kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from the namespace, set keel.sh/policy=never (annotation+label) on all 4 workloads, rolled postiz back to the running v2.21.7. Keel restarted (scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it no longer tracks postiz. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 19:16:58 +00:00
Viktor Barzin	f325b949be	keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged Re-enables Keel after the 2026-05-26 emergency stop, with a safer default. Switch Kyverno-injected default from `force + match-tag=true` (proven unreliable — it rewrote tag strings cluster-wide despite the design intent) to `patch`, which is semver-parser-bounded: - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never 1.3.x or 2.x — the parser does the math, not string compare). - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are IGNORED entirely. No tag rewriting under any code path. - 151 stale `force` annotations migrated to `patch` cluster-wide during this apply (anchor `+()` dropped, then re-added). Live state after this commit: 0 workloads on `force`, 209 on `patch`, 22 on `never`. Keel deployment back to 1/1 on `:0.21.1`. Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation mutated to `patch` during the migration despite Kyverno's matchLabels-based exclude rule — appears to be a quirk of `mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno reconciles preserve them. Also fixes CI build-cli workflow which was blocked by `deny-privileged-containers` since wave 1 enforce flip on 2026-05-18: woodpecker namespace added to the shared security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use). The `default` workflow (terragrunt apply) was already passing — only the parallel `build-cli` workflow (which builds the infra-cli docker image) was failing, but it took the overall pipeline status down with it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:06:51 +00:00
Viktor Barzin	bb9d8f1b38	kyverno: GPU priority mutate uses add (was replace) — fixes silent skip The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902 op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate) have no priorityClassName field at all — replace requires the path to exist, so the patch fails with "doc is missing key: /spec/priorityClassName" and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier) gets a chance to add the field. Result: GPU pods never got priorityClassName set, sat at priority=0, and could not preempt lower-tier pods on the GPU node. Observed today on frigate post-node4-recovery — pod stayed Pending with "Preemption is not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000) occupied node1's memory budget. Fix: op=add for all three paths. add works whether or not the key is present, so the policy is robust to the upstream pod shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:04:51 +00:00
Viktor Barzin	68a503e29f	kyverno: allowlist woodpeckerci/* for CI step pods Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker pipelines have been failing at the git step since the Audit→Enforce flip on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes not building). Added between viren070/* and zelest/* in the same DockerHub-user-repos block as the 2026-05-22 batch (commit `2d35d72a`). Closes: code-da4h	2026-05-23 08:52:48 +00:00
Viktor Barzin	2d35d72a53	kyverno(wave1): add 7 missing registries to trusted-registries allowlist Discovered via W1.5 enforcement when querying live cluster state: PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook, hermes-agent, netbox, whisper/piper) trying to admit images from registries not in the original enumeration. Added entries: - amruthpillai/* (resume — reactive-resume) - athomasson2/* (ebook2audiobook) - netboxcommunity/* (netbox) - nousresearch/* (hermes-agent) - opentripplanner/* (osm-routing) - rhasspy/* (whisper, piper) - registry.viktorbarzin.me/* (legacy private registry — council-complaints still references; should migrate to forgejo) The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07 per CLAUDE.md but council-complaints still uses it — separate cleanup task. ## Verification - kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha, same as 2026-05-18 inject-keel-annotations) - Dry-run admission of previously-blocked images now PASS: - netboxcommunity/netbox:v4.5.0-beta1 ✓ - rhasspy/wyoming-whisper:3.1.0 ✓ - registry.viktorbarzin.me/council-complaints:1c56f8f ✓ - Policy still in Enforce mode ## Observation status (W1.6) - Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected - Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour - No errors from observation infrastructure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 15:17:16 +00:00
Viktor Barzin	66ca8b9e9c	trading-bot: revive K8s stack + add meet-kevin-watcher Uncomment the trading-bot stack (disabled 2026-04-06 due to resource consumption) and add the new meet_kevin_watcher service container. Changes: - Uncomment the /* ... */ block enclosing the entire stack - Fix db_init job: add -d postgres to psql commands (root user has no root-named database — matches pattern used in claude-memory + others) - Remove 3 disabled containers from trading-bot-workers Pod spec: news-fetcher, sentiment-analyzer, trade-executor - Add new meet-kevin-watcher container (image viktorbarzin/trading-bot-service:latest, command python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi) - Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault secret/trading-bot) - Add 4 common_env entries for the Meet Kevin pipeline (poll interval, daily cost cap, model slug, prompt version) - Update lifecycle.ignore_changes to 4 image indices vault: re-enable pg-trading static role - Add pg-trading to vault_database_secret_backend_connection allowed_roles - Uncomment vault_database_secret_backend_static_role.pg_trading (was disabled 2026-04-06 with the rest of trading-bot stack) kyverno: add postgres to trusted-registries allowlist - trading-bot db_init uses postgres:16-alpine (Docker Hub library image) - postgres* was not in the DockerHub bare-name allowlist (unlike mysql, alpine, nginx, python which were already there) Final workers Pod containers (in order): [0] signal-generator [1] learning-engine [2] market-data [3] meet-kevin-watcher (NEW) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 11:23:30 +00:00
Viktor Barzin	669ba97078	security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `*/*` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos. - Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 06:37:54 +00:00
Viktor Barzin	90e074a4a2	kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce ## Resolves code-e2dp (Kyverno TF apply blocked) Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which treats manifests as opaque YAML strings. ## Migration mechanics - Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all stacks get it generated in providers.tf. - stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source declaration (required for kubectl_manifest in a child module). - Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest with yaml_body = yamlencode({...}). depends_on chains preserved. - terraform state rm for all 17 old kubernetes_manifest entries. - stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID. - One resource (policy_inject_keel_annotations) needed kubectl delete + recreate because the kubectl provider couldn't patch it cleanly (resourceVersion=0 invalid for update — gotcha when adopting a resource previously kubernetes_manifest-owned). ## W1.4 — security policies Audit → Enforce (LIVE) Three policies flipped: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved. ## Shared exclude list (35 namespaces) local.security_policy_exclude_namespaces in security-policies.tf. - 31 critical from memory id=1970 (Keel rollout list) - + frigate (camera HW transcoding needs host access) - + kured (privileged DaemonSet for node reboots) - + default (etcd backup/defrag CronJobs use hostNetwork) - + changedetection (uses SYS_ADMIN for chromium sandbox) ## W1.5 — require-trusted-registries stays Audit Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply chain. Tracked under beads code-8ywc as follow-up. ## TF import-blocks The imports.tf file should be removed in a follow-up cleanup commit once verified — TF doesn't auto-clean these. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-e2dp	2026-05-18 20:10:27 +00:00
Viktor Barzin	f30c141270	security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash) ## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED) - Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config. Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every vault audit log entry shows Traefik's pod IP instead of the real client IP — the V7 alert rule (Viktor identity from non-allowlist source IP) needs the real client IP to be meaningful. - Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing for_each unknown issues unrelated to this change; -target documented in error message itself as the workaround). - Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed within ~10s each. Verified XFF config visible in pod's /vault/config/extraconfig-from-values.hcl. - The `vault_audit "file"` resource was already in TF at line 287 (writing to /vault/audit/vault-audit.log) — no change needed. ## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED) - Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces from memory id=1970 + `frigate, kured, default, changedetection` discovered during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN pods that would be blocked by Enforce). - Flipped 3 security policies Audit → Enforce: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at chart level. - `require-trusted-registries` STAYS in Audit mode pending allowlist tightening (current pattern includes `*/*` which matches anything-with-a-slash, so Enforce would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5. Apply blocker: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0` crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not caused by these changes. Resolving it requires either upgrading the provider or finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`. ## Wave 1 status after this commit - W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place) - W1.4 + W1.5: code ready, apply blocked on provider crash - W1.1, W1.3, W1.6, W1.7: not started in this session Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 19:26:39 +00:00
Viktor Barzin	bcf22640b2	keel: enroll 11 more namespaces (operators + critical infra) Per user decision, removed authentik, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets, infra-maintenance from the policy-level exclude list, and added keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite being earlier flagged as scaled-to-0) and woodpecker. Net cluster coverage: 197/227 workloads on safe-force (86%), up from 170/227 (74%). All 197 are paired with match-tag=true (digest-only). Remaining 7 namespaces in Kyverno exclude list (irreducible): - keel (self-update) - calico-system + tigera-operator (operator-managed Installation CR) - cnpg-system + dbaas (state-coupled) - nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships ubuntu26.04 driver images) - kube-system (k8s built-ins) Files: - stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list trimmed from 16 → 7 - stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets, servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker — added keel.sh/enrolled=true label on kubernetes_namespace resource - infra-maintenance was in the policy exclude but the namespace doesn't actually exist in the cluster; the removal is a no-op there Applied via kubectl patch on the live ClusterPolicy + kubectl label on namespaces because the kubernetes provider v3.1.0 panics on Kyverno ClusterPolicy refresh — TF source has the desired state for next clean apply on a fixed provider. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 20:59:14 +00:00
Viktor Barzin	1b340ef531	keel: enroll 15 critical-path namespaces for digest-only auto-update Per user decision today: monitoring, mailserver, vault, descheduler, metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy, reloader, headscale, wireguard, xray, cloudflared now participate in the same `force + match-tag` regime as the rest of the cluster — Keel watches the deployment's CURRENT tag for digest changes only and rolls on push, never rewriting tag strings. Two-part change: stacks/kyverno/modules/kyverno/keel-annotations.tf Trim the policy-level namespace exclude list from 31 → 16. The 16 remaining exclusions are the irreducible cluster-operator + state- coupled set: keel itself, calico-system + tigera-operator (operator loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system + dbaas (state-coupled), kyverno, metallb-system, external-secrets, proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned), kube-system, vpa, sealed-secrets, infra-maintenance. stacks/<each-of-15>/.../main.tf Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace` resource so the Kyverno mutate policy can target the workloads via its namespaceSelector matchLabels. Note on the apply path: the live ClusterPolicy was patched via `kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics during state refresh on Kyverno ClusterPolicy schemas with deeply nested optional `context.celPreconditions` / `imageRegistry` fields (see crash dump). The TF source above has the desired state, so any clean future apply on a fixed provider version will be a no-op against the live cluster. Floating-tag workloads in the newly-enrolled set (will roll on every upstream digest update — acceptable risk per user): - wireguard: sclevine/wg:latest (image fixed today via iptables-nft postStart shim) - xray: teddysun/xray - crowdsec-web: viktorbarzin/crowdsec_web - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter - traefik: nginx:1-alpine, openresty/openresty:alpine, ghcr.io/tarampampam/error-pages:3 - redis: haproxy:3.1-alpine, redis:8-alpine Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 12:13:22 +00:00
Viktor Barzin	32db6760cc	keel: use +() anchors on policy/match-tag so per-workload overrides stick Without the anchor, each policy update fires mutateExistingOnPolicyUpdate, which OVERWRITES existing keel.sh/policy annotations back to 'force'. That broke the phased rollout — bulk-setting workloads to 'never' didn't stick because the next policy update reset them. With +() anchors, the mutate only adds the annotation if missing. New workloads (in enrolled namespaces) get force+match-tag; existing workloads with explicit policy=never (out-of-band, for phased rollout) stay never. Phase 1 rollout state (2026-05-17): - 10 workloads on force+match-tag in 10 namespaces (Phase 1) enrolled via keel.sh/enrolled=true namespace label: linkwarden, excalidraw, diun, echo, foolery, city-guesser, jsoncrack, privatebin, ntfy, speedtest - 216 workloads on policy=never (out-of-band kubectl annotate) - 31 critical namespaces excluded at policy level Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true` and clearing the `never` annotation off their workloads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 00:32:19 +00:00
Viktor Barzin	0365ed83ca	keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc. 2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36. The force+match-tag pairing should have constrained Keel to digest-only on the current tag (not switch to a new tag), but a race between Kyverno's mutate (injecting match-tag) and Keel's hourly poll caused the workload to still have the old `force`-only annotation when Keel acted. Result: tag rewrite, pods cycled, pgbouncer connection failures, login broken. Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back to 2026.2.2. Auth restored within ~5 min. Going forward, critical-namespace workloads are excluded at the policy level so this race can't recur. They get upgraded via TF (Helm chart version bumps) on a deliberate cadence, never by Keel. Live state: 36 workloads on policy=never (35 critical + chrome-service pin + 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for opt-out-pure auto-update on the remaining stateless apps. This matches user direction (2026-05-17): "upgrading is fine as long as we upgrade correctly and the latest version is healthy" + "keel responsible for the latest version, phased rollout, graceful". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 00:07:32 +00:00
Viktor Barzin	cdd6781bb9	keel: bump default policy patch → major (user wants latest version) User: 'i'm happy with occasional breakages. we have alerts.' Policy=major auto-updates workloads to the latest semver tag in the registry, including major/minor/patch bumps. Still semver-parser-bounded so dev/nightly/master branches are filtered out (avoids the 2026-05-16 force-trap on affine/calico). Live: 217 patch-annotated workloads re-annotated to major. Next Keel poll (~1h) will pick up any pending major/minor releases. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:53:59 +00:00
Viktor Barzin	055030fa24	kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs) The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced 176 UpdateRequests for the initial bulk scan across enrolled namespaces. At the existing 384Mi limit, kyverno-background-controller OOMKilled while processing them — no annotations got injected on existing workloads (count stuck at 30). Live state already bumped via kubectl set resources; this commit makes it durable through Terraform. Also lowered the request to 256Mi (the 384Mi floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady state). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:36:16 +00:00
Viktor Barzin	de0e7b9dd6	kyverno: codify aggregated ClusterRole for keel mutate-existing The previous commit (`bc714755`) added mutateExistingOnPolicyUpdate=true to the inject-keel-annotations ClusterPolicy but Kyverno's validate webhook rejected it: the background-controller SA needs update/patch on apps/v1 Deployment/StatefulSet/DaemonSet. Created live via kubectl + now in TF so the next apply is idempotent. The ClusterRole aggregates into kyverno:background-controller via the rbac.kyverno.io/aggregate-to-background-controller label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:30:07 +00:00
Viktor Barzin	bc714755ea	kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated Before this, the inject-keel-annotations policy only fired on admission events. Workloads that existed BEFORE their namespace got labeled keel.sh/enrolled=true never received the annotation, so Keel didn't watch them. Live state was 30 of 226 workloads auto-updating. With mutateExistingOnPolicyUpdate=true and the required mutate.targets block, Kyverno's BackgroundScan controller applies the mutate to existing matching Deployments/StatefulSets/DaemonSets on policy update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:27:27 +00:00
Viktor Barzin	629fe24305	kyverno: exclude calico-system from inject-keel-annotations Stop the hourly Keel-vs-tigera-operator fight loop on calico-node DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system workloads with keel.sh/policy=never; TF: added calico-system to the namespaces exclude list so any future mutate run won't re-inject. The previous calico unenrollment (label removal from namespace) wasn't enough — once Kyverno had stamped the policy=patch annotation on the Deployments/DaemonSets, removing the namespace label didn't strip the annotation, so Keel kept watching them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 22:58:20 +00:00
Viktor Barzin	2b236a1629	keel: default policy → patch (semver-bounded opt-out auto-update) Move from `never` (no auto-update) to `patch` for the cluster-wide default. Keel only auto-updates PATCH versions within the current major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked. Tag-rewrites that broke calico (v3.26.1 → :master) and affine (0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch. Caveats: * Patch causes Terraform image drift for semver-pinned services — drift-detection pipeline will surface it; lifecycle ignore_changes on container[].image can be added per stack later if drift is noisy. * Tags that aren't parseable as semver (:latest, :11, :nightly, SHA tags) are ignored by patch — those workloads stay on their current image until promoted to `force` policy individually. Self-hosted CI-driven services + chrome-service kept on `never` (deliberate pins / CI controls the tag): recruiter-responder, claude-agent-service, claude-memory, chrome-service, fire-planner, job-hunter, payslip-ingest Live state already updated via kubectl apply + per-workload patches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 13:17:33 +00:00
Viktor Barzin	d656e38c9d	keel: default policy → never (post-incident safe default) 2026-05-16 incident: Keel's `force` policy switched semver-pinned images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master) instead of digest-tracking. Force is documented as "always update to the newest tag in the registry" — only safe on already-mutable tags like :latest. Changing the cluster-wide default in inject-keel-annotations to `never`. The namespace enrollment label + V2 lifecycle suppression stay in place so opt-in is one annotation per Deployment, but no service auto-updates until explicitly approved. To opt in a workload now: 1. Verify the Deployment image is on a mutable tag (:latest, :<major>, or a vendor "stable" tag) — change in Terraform first if needed. 2. Add to the Deployment's metadata.annotations: "keel.sh/policy" = "force" (digest tracking) OR "keel.sh/policy" = "patch" (semver patch bumps — also requires ignore_changes on the image) Live policy already updated via kubectl apply + per-workload override (force → never). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 13:13:16 +00:00
Viktor Barzin	910167105e	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:19:34 +00:00
Viktor Barzin	3148d15d5a	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 18:30:02 +00:00
Viktor Barzin	6d5204db10	[forgejo] Tolerate missing Vault keys during Phase 0 bootstrap Wrap the three new Vault key reads in try(...) so the first apply succeeds even when forgejo_pull_token / forgejo_cleanup_token / secret/ci/global haven't been populated yet. Without this, CI auto-apply blocks on the very push that introduces the references — chicken-and-egg with the runbook order (which is: apply Forgejo bumps, then create users + PATs, then apply the rest). Empty tokens are intentionally visible-broken (auth fails, probe reports auth failure, cleanup CronJob errors) — that's the signal to run the bootstrap runbook. Subsequent apply picks up the real values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 15:53:08 +00:00
Viktor Barzin	5d22b449f9	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (*/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. 
Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 15:51:34 +00:00
Viktor Barzin	4ca793380b	[multi] Sweep Kyverno wait-for redis annotations to redis-master Replaces `redis.redis:6379` with `redis-master.redis:6379` in all 11 dependency.kyverno.io/wait-for annotations across 8 stacks, plus one docs comment in the Kyverno module. These annotations drive DNS-only `nc -z` init-container readiness checks — zero RW risk. Both hostnames resolve, so there is no wait-for failure window during the rolling re-apply. Closes: code-otr	2026-04-19 12:44:46 +00:00
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. 
Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate.* labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	1de2ee307f	kyverno: strip resources.limits.cpu cluster-wide via ClusterPolicy Context ------- The cluster policy is "no CPU limits anywhere" — CFS throttling causes more harm than good for bursty single-threaded workloads (Node.js, Python). LimitRanges are already correct (defaultRequest.cpu only, no default.cpu), but 22 pods still carried CPU limits injected by upstream Helm chart defaults — CrowdSec (lapi + agents), descheduler, kubernetes-dashboard (×4), nvidia gpu-operator. Previous attempts were ad-hoc: patch each values.yaml, occasionally missing things on chart upgrade. This replaces that with a declarative Kyverno mutation at admission time. This change ----------- Adds a new ClusterPolicy `strip-cpu-limits` with two foreach rules: strip-container-cpu-limit → containers[] strip-initcontainer-cpu-limit → initContainers[] Each rule uses `patchesJson6902` with an `op: remove` on `resources/limits/cpu`. JSON6902 `remove` fails on missing paths, so per-element preconditions gate the mutation — pods without CPU limits pass through untouched. A top-level rule precondition short-circuits using JMESPath filter (`[?resources.limits.cpu != null] | length(@) > 0`) so the mutation is a no-op for the overwhelming majority of pods. Admission-time only. No `mutateExistingOnPolicyUpdate`, no `background`. Existing pods keep their CPU limits until they're restarted naturally (Helm upgrade, node drain, rollout). We rely on churn, not forced restarts, to avoid unnecessary thrash. Memory limits are preserved — they prevent OOM, still useful. Flow ---- admission request → match Pod + CREATE → top-level precondition: any container has limits.cpu? no → skip (fast path) yes → foreach container: element.limits.cpu present? 
no → skip element yes → remove /spec/containers/N/resources/limits/cpu → same again for initContainers → mutated pod proceeds to API server Verification ------------ kubectl run test-strip-cpu --overrides='{limits:{cpu:500m,memory:64Mi}}' → admitted pod.resources = {limits:{memory:64Mi}, requests:{cpu:50m,memory:32Mi}} → CPU limit stripped, memory preserved, requests untouched kubectl rollout restart deploy/kubernetes-dashboard-metrics-scraper → new pod.resources = {limits:{memory:400Mi}, requests:{cpu:100m,memory:200Mi}} → cluster-wide count of pods with CPU limits: 22 → 21 Rollout ------- Remaining 21 pods will drop their CPU limits on natural churn. No manual restarts in this change — user may want to time a mass restart with a maintenance window. Closes: code-eaf Closes: code-4bz Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 11:34:39 +00:00
Viktor Barzin	3e273399c1	fix(ci): add registry.viktorbarzin.me:5050 to imagePullSecrets Pipeline pods pull from registry.viktorbarzin.me:5050 but the registry-credentials secret only had auth for registry.viktorbarzin.me (without port). Containerd requires exact hostname:port match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:50:51 +00:00
Viktor Barzin	cf578516e9	feat: auto-cleanup failed/evicted pods via Kyverno ClusterCleanupPolicy Add cleanup-failed-pods policy that runs hourly (at :15) to delete all pods in Failed phase cluster-wide. Prevents stale evicted and failed CronJob pods from accumulating and creating healthcheck noise. Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup controller permission to delete Pods (not included by default). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:37:49 +00:00
Viktor Barzin	de42acd68e	fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump - daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS noload mounts — in-flight writes have corrupt metadata from skipped journal replay, but core data is intact - daily-backup: clean up stale LUKS dm mappings from previous crashed runs before attempting to open - daily-backup: capture rsync exit code safely with set -e (|| pattern) - kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%) - actualbudget: patched custom quota 5Gi→6Gi (was at 82%) Verified: backup now completes status=0 (96 PVCs OK, 0 failed) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:21:51 +00:00
Viktor Barzin	ae0585048a	fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit. Also added custom LimitRange in dbaas stack (for when Terraform manages it directly).	2026-04-05 23:31:23 +03:00
Viktor Barzin	4da8f0242f	fix: right-size service memory after PVE RAM upgrade (142→272GB) - MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit) - Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled) - Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled) - Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled - Navidrome: 128Mi/128Mi → 256Mi/384Mi - Matrix: add explicit 256Mi/512Mi resources - Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled - Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi - Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi - Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections	2026-04-05 23:02:50 +03:00
Viktor Barzin	16cde1eab5	add Kyverno TLS secret sync + enhance renewal pipeline Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all namespaces with synchronize=true. Renewal pipeline now updates the source secret via kubectl, verifies cert validity, and sends Slack notification.	2026-03-23 22:19:34 +02:00
Viktor Barzin	877cd15b45	fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert - Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich ML pods scheduling headroom (was at 96% utilization) - Add critical NvidiaExporterDown Prometheus alert that fires when GPU metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)	2026-03-23 03:04:33 +02:00
Viktor Barzin	ab7e18c07c	fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify - Grant kyverno-admission-controller and kyverno-background-controller permissions to manage Secrets (required for generate clone rules) - Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true (wildcard cert doesn't cover IP SANs) — applied to all nodes + template	2026-03-22 23:47:29 +02:00
Viktor Barzin	36171bcda4	add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me - Add auth.htpasswd section to config-private.yml - Mount htpasswd file in registry-private container, fix healthcheck for 401 - Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me - Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body) - Add docker to cloudflare_proxied_names (registry stays non-proxied) - Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces - Update infra provisioning to install apache2-utils and generate htpasswd from Vault	2026-03-22 22:10:10 +02:00
Viktor Barzin	ae36dc253b	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-17 21:34:11 +00:00

46 commits