infra

Author	SHA1	Message	Date
Viktor Barzin	efe8c9625b	dbaas: pin MySQL to 8.4.8, recover from broken 8.4.9 DD upgrade The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose data-dictionary upgrade got stuck mid-flight on every attempt (no progress, no CPU, never completing). Pinning to 8.4.8 + restoring from the 2026-05-18 00:30 UTC mysqldump puts us back on a known-good binary. Closes: code-eme8 Closes: code-k40p Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	1082cba0fb	kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce ## Resolves code-e2dp (Kyverno TF apply blocked) Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which treats manifests as opaque YAML strings. ## Migration mechanics - Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all stacks get it generated in providers.tf. - stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source declaration (required for kubectl_manifest in a child module). - Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest with yaml_body = yamlencode({...}). depends_on chains preserved. - terraform state rm for all 17 old kubernetes_manifest entries. - stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID. - One resource (policy_inject_keel_annotations) needed kubectl delete + recreate because the kubectl provider couldn't patch it cleanly (resourceVersion=0 invalid for update — gotcha when adopting a resource previously kubernetes_manifest-owned). ## W1.4 — security policies Audit → Enforce (LIVE) Three policies flipped: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved. ## Shared exclude list (35 namespaces) local.security_policy_exclude_namespaces in security-policies.tf. 
- 31 critical from memory id=1970 (Keel rollout list) - + frigate (camera HW transcoding needs host access) - + kured (privileged DaemonSet for node reboots) - + default (etcd backup/defrag CronJobs use hostNetwork) - + changedetection (uses SYS_ADMIN for chromium sandbox) ## W1.5 — require-trusted-registries stays Audit Pattern / allows anything-with-a-slash; Enforce would be a no-op for supply chain. Tracked under beads code-8ywc as follow-up. ## TF import-blocks The imports.tf file should be removed in a follow-up cleanup commit once verified — TF doesn't auto-clean these. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-e2dp	2026-05-22 14:16:58 +00:00
Viktor Barzin	83079758bb	monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane ## Loki + Alloy re-enabled (code-146x) - Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify, kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource - Reverses the documented "operational overhead vs benefit after node2 incident" decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc) needs Loki + ruler + alert routing. - SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled pointed at prometheus-alertmanager.monitoring.svc:9093 - Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes to Loki - Loki canaries running (4) - Vault audit-tail sidecar logs now flowing to Loki: queried {namespace="vault",container="audit-tail"} returns live audit JSON ## Wave 1 alert rules deployed (W1.3 partial) Added "Security Wave 1" rule group to loki_alert_rules configmap: - V1: VaultRootTokenCreated — auth/token/create with policies=[root] - V2: VaultAuditDeviceModified — sys/audit/* create/delete/update - V3: VaultSealChanged — sys/seal update - V4: VaultPolicyModified — sys/policies/acl/* create/update/delete - V5: VaultAuthFailureSpike — >10 permission denied/min - V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) - S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined, fires once promtail/Alloy ships sshd journal with job=sshd-pve) Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1 which needs the audit policy + apiserver log shipping codified. 
## #security Slack lane (Alertmanager) - New `slack-security` receiver in prometheus_chart_values.tpl, channel #security - Higher-priority route at top of routes list: matchers `lane = security` → slack-security, continue: false (so wave 1 alerts never fall through to #alerts) - Slack message format includes summary + description + runbook link annotation - All wave 1 rules set `lane = "security"` label ## Resource summary - 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource, kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify, + 1 other - 5 changed: helm_release.prometheus (alertmanager config — new receiver + route), 4 deployments (image tag drift from Keel-managed images, unrelated) - 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"] (timestamp-triggered always recreates — not destructive) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-146x	2026-05-22 14:16:58 +00:00
Viktor Barzin	c9289192c7	security(wave1): Vault audit-tail sidecar (live) + doc reality-check ## Vault audit-tail sidecar (APPLIED + VERIFIED) - Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume from the chart's auditStorage), emits JSON audit events to stdout. kubelet captures the stdout; once Loki+Alloy are deployed (blocked on code-146x), these logs flow automatically to Loki with `container="audit-tail"`. - Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly. - Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s). - Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON audit lines from ESO token issuance, KV reads, etc. ## Doc reality-check While verifying logs reached Loki, discovered Loki is NOT actually deployed. `stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but has a self-referencing `depends_on = [helm_release.loki]` that prevented apply. No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The monitoring.md "Loki: deployed" claim was aspirational. - security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on code-146x) - security.md W1.3 row: gated on code-146x added - monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x ## New beads task - code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug, investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0). 
## Wave 1 status update - W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x - W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR) - W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	ae0c1701ec	security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash) ## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED) - Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config. Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every vault audit log entry shows Traefik's pod IP instead of the real client IP — the V7 alert rule (Viktor identity from non-allowlist source IP) needs the real client IP to be meaningful. - Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing for_each unknown issues unrelated to this change; -target documented in error message itself as the workaround). - Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed within ~10s each. Verified XFF config visible in pod's /vault/config/extraconfig-from-values.hcl. - The `vault_audit "file"` resource was already in TF at line 287 (writing to /vault/audit/vault-audit.log) — no change needed. ## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED) - Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces from memory id=1970 + `frigate, kured, default, changedetection` discovered during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN pods that would be blocked by Enforce). - Flipped 3 security policies Audit → Enforce: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at chart level. - `require-trusted-registries` STAYS in Audit mode pending allowlist tightening (current pattern includes `/` which matches anything-with-a-slash, so Enforce would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5. 
Apply blocker: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0` crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not caused by these changes. Resolving it requires either upgrading the provider or finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`. ## Wave 1 status after this commit - W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place) - W1.4 + W1.5: code ready, apply blocked on provider crash - W1.1, W1.3, W1.6, W1.7: not started in this session Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	87961e9ef8	monitoring(wealth): drop 6y timeFrom override on META vest cadence	2026-05-22 14:16:57 +00:00
Viktor Barzin	dd24ace480	realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api Companion to the GHA migration in immovika/realestate-crawler@c2acbf5. Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler is private, the Deployments had no imagePullSecrets, and Keel's poll-secret discovery list came up empty. Pods kept running only because the image landed in containerd cache months ago. Adds: - ExternalSecret `dockerhub-pull-secret` synced from Vault secret/viktor.dockerhub_registry_password. ESO template renders the dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in cleartext in any K8s manifest. - image_pull_secrets { name = "dockerhub-pull-secret" } on all 4 Deployments (ui, api, celery, celery-beat). - Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts :latest. CI no longer patches the image to a numeric tag — Keel now drives rollouts from digest changes on :latest. Live state after apply: all 4 Deployments on :latest with imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True. Once a GHA build pushes a new digest, Keel will roll all four within ~1h. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	99127939a8	monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side - Removed panel 27 (META RSU vest value over time) — superseded by vest-cadence chart which carries the same value signal plus the share-count overlay. - Removed panel 28 (per-vest value at vest vs today) — duplicative with panel 31's FIFO realized PNL. - Removed panel 29 (per-sell realized PNL) — same data as panel 31, just rolled up by sell date instead of vest date. - Resized panel 26 (Positions) to w=12 and moved panel 30 (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side next to the Positions table. - Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU chart used to live.	2026-05-22 14:16:57 +00:00
Viktor Barzin	b879481d71	monitoring(wealth): per-vest realized PNL via FIFO sell-match New table panel below the per-sell breakdown. For each vest, FIFO-match its shares against the subsequent sells (shares from earlier vests get sold first), and aggregate the matched portions: realized_pnl = SUM(matched_qty * (sell_price - vest_price)) pnl_pct = realized_pnl / SUM(matched_qty * vest_price) * 100 days_held = AVG(sell_date - vest_date) per matched portion Footer reducer sums shares, vest value, sell value, and realized PNL so the bottom row is the full-portfolio realized take.	2026-05-22 14:16:57 +00:00
Viktor Barzin	8b60e6bb6d	monitoring(wealth): META vest cadence chart — value vs shares (dual axis) Per-vest event line chart. Left Y axis (blue): vest value at the time = SUM(quantity * unit_price), in USD. Right Y axis (orange): number of shares vested. One point per vest date (aggregated when multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares). Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 -> 60s) and how the per-vest USD value tracked META's price ride across 2020-2026. timeFrom='6y' override pins the panel to the full vesting window.	2026-05-22 14:16:57 +00:00
Viktor Barzin	af077112cb	monitoring(wealth): META vest + sell PNL tables with FIFO cost basis Two new bottom-of-dashboard tables: Panel 28 'META vests — value at vest vs today': one row per BUY activity. Shows vest-day price * shares + what those same shares would be worth at today's META quote, plus the hypo P&L if Viktor had held everything (color-text on the gain columns). Panel 29 'META sells — realized PNL vs if held until today': one row per SELL with FIFO-matched cost basis (LEAST/GREATEST overlap in cumulative-share space). Shows realized P&L, the counterfactual P&L had he held until today, and the 'missed by' delta = (today_price - sell_price) * shares. Both pull today_price dynamically from quote_latest via a CTE so they self-update as Yahoo updates the META quote. Schwab account is empty so no live activity is expected.	2026-05-22 14:16:57 +00:00
Viktor Barzin	20c5965f95	monitoring(wealth): pin META RSU panel to 6y window Dashboard default time range is now-180d, but the META vesting + sell arc spans 2020-11 → 2026-02. With the default window the panel just showed a flat line at $64 (the empty post-sell residual). timeFrom='6y' override makes panel 27 always render the full vesting curve regardless of the dashboard-level time selector.	2026-05-22 14:16:57 +00:00
Viktor Barzin	3d43d96a5e	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	018ef3790f	monitoring(wealth): META RSU vest value panel (Schwab account) Daily total_value timeseries for the Schwab workplace account (account_id 72d34e09-...). Single-asset account holding META RSUs that vested 2020-11 → 2026-02 and were sold opportunistically over the same window. Currency USD (account_currency). Yahoo quote on META powers WF's daily mark; the historical DAV mirrored into wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.	2026-05-22 14:16:57 +00:00
Viktor Barzin	b107c0be8c	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	309f83ec8c	beads-server: codify Keel annotations on Dolt deployment (drift cleanup) Task 1's recovery from the broken `:latest` image rollout left keel.sh/policy=never set imperatively via `kubectl annotate` — out of TF, which violates the "all infra via TF" rule. Now codified alongside match-tag, trigger, pollSchedule. Removed those three keys from ignore_changes (was the original "Keel manages these" pattern, no longer correct for this deployment). Also added KYVERNO_LIFECYCLE_V1 ignore_changes on the presence_schema migration Job so future applies don't try to replace it over the Kyverno-injected ndots dns_config. Verified: 0 added, 3 changed (unrelated pre-existing drift on beadboard/workbench/service), 0 destroyed. Dolt pod uninterrupted (revision 13 preserved).	2026-05-22 14:16:57 +00:00
Viktor Barzin	5482f46125	RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	d1dcc5d12d	beads-server: add presence_claims table for agent coordination Adds the schema for the new agent presence board. Live Dolt is updated via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC init. Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server currently resolves to 0.50.10, whose docker-entrypoint.sh references an undefined docker_process_sql function and crash-loops on every init script in /docker-entrypoint-initdb.d. Keel can still bump this tag in-cluster (image is in lifecycle.ignore_changes).	2026-05-22 14:16:57 +00:00
Viktor Barzin	e4e2babd6a	k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst Two latent bugs in the K8s-version-upgrade pipeline surfaced when a real detection run ran post-26.04 upgrade today: 1. DNS: pod's CoreDNS search path is `<ns>.svc.cluster.local svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation). Unqualified `k8s-master` falls through all of those and then queries upstream Technitium for the bare name → NXDOMAIN. The FQDN `k8s-master.viktorbarzin.lan` is what Technitium actually serves. Suffix every node SSH target with `$NODE_DOMAIN`. 2. envsubst missing: claude-agent-service image doesn't ship `gettext-base`. Replace `envsubst <template | apply` with `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(sys.stdin.read()))' <template | apply`. Same semantics, image already has python3. Multi-line $SCHEDULING_BLOCK is preserved correctly through expandvars. Verified by manually triggering `k8s-version-check` post-fix: detection now reads `Latest patch: v1.34.8` (currently running 1.34.7) and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and started; killed before it touched the cluster (will land on Sunday 2026-05-24 12:00 UTC like the schedule says). Root cause of why these bugs lay dormant: yesterday's first manual-test detection found "no upgrade needed" so neither code path exercised SSH or envsubst. Today's apt-source restore (do-release-upgrade had mangled them) unmasked the v1.34.8 candidate, which made detection finally proceed past the SSH step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	23d8aa89c4	keel: enroll 11 more namespaces (operators + critical infra) Per user decision, removed authentik, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets, infra-maintenance from the policy-level exclude list, and added keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite being earlier flagged as scaled-to-0) and woodpecker. Net cluster coverage: 197/227 workloads on safe-force (86%), up from 170/227 (74%). All 197 are paired with match-tag=true (digest-only). Remaining 7 namespaces in Kyverno exclude list (irreducible): - keel (self-update) - calico-system + tigera-operator (operator-managed Installation CR) - cnpg-system + dbaas (state-coupled) - nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships ubuntu26.04 driver images) - kube-system (k8s built-ins) Files: - stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list trimmed from 16 → 7 - stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets, servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker — added keel.sh/enrolled=true label on kubernetes_namespace resource - infra-maintenance was in the policy exclude but the namespace doesn't actually exist in the cluster; the removal is a no-op there Applied via kubectl patch on the live ClusterPolicy + kubectl label on namespaces because the kubernetes provider v3.1.0 panics on Kyverno ClusterPolicy refresh — TF source has the desired state for next clean apply on a fixed provider. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	3bdba9f388	keel: enroll 15 critical-path namespaces for digest-only auto-update Per user decision today: monitoring, mailserver, vault, descheduler, metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy, reloader, headscale, wireguard, xray, cloudflared now participate in the same `force + match-tag` regime as the rest of the cluster — Keel watches the deployment's CURRENT tag for digest changes only and rolls on push, never rewriting tag strings. Two-part change: stacks/kyverno/modules/kyverno/keel-annotations.tf Trim the policy-level namespace exclude list from 31 → 16. The 16 remaining exclusions are the irreducible cluster-operator + state-coupled set: keel itself, calico-system + tigera-operator (operator loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system + dbaas (state-coupled), kyverno, metallb-system, external-secrets, proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned), kube-system, vpa, sealed-secrets, infra-maintenance. stacks/<each-of-15>/.../main.tf Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace` resource so the Kyverno mutate policy can target the workloads via its namespaceSelector matchLabels. Note on the apply path: the live ClusterPolicy was patched via `kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics during state refresh on Kyverno ClusterPolicy schemas with deeply nested optional `context.celPreconditions` / `imageRegistry` fields (see crash dump). The TF source above has the desired state, so any clean future apply on a fixed provider version will be a no-op against the live cluster. 
Floating-tag workloads in the newly-enrolled set (will roll on every upstream digest update — acceptable risk per user): - wireguard: sclevine/wg:latest (image fixed today via iptables-nft postStart shim) - xray: teddysun/xray - crowdsec-web: viktorbarzin/crowdsec_web - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter - traefik: nginx:1-alpine, openresty/openresty:alpine, ghcr.io/tarampampam/error-pages:3 - redis: haproxy:3.1-alpine, redis:8-alpine Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	f5cf6ec051	nvidia: bump driver container memory limit 128Mi → 2Gi After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing /etc/os-release to 24.04 so the operator picked the matching ubuntu24.04 driver image (everything per the workaround documented in docs/known-issues.md), the driver container still went into a restart loop. Container status: lastState.terminated: { reason: "OOMKilled", exitCode: 137 } The driver-installer was hitting the namespace LimitRange default of 128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the last log line on every restart was "Installing Linux kernel headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step enough headroom; peak observed during a successful compile in a test container was ~1.4Gi. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	22c18dc061	paperless-mcp: deploy MCP for AI document search - New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET, HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed. - In-cluster only egress to paperless-ngx svc; no Cloudflare hop on MCP-internal traffic. - Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser) in new `claude-mcp-readers` group with view-only Django perms; existing 279 docs bulk-granted view perm via /api/documents/bulk_edit/; workflow #2 auto-grants the group on new docs (Consumption Added). - Gateway-level bearer auth via new Traefik plugin Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth` pulls token list from Vault `secret/paperless-mcp/bearer_tokens`. - Vault `secret/paperless-mcp` holds: paperless_api_token (synced to K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens (JSON array, read at plan time), bearer_token_viktor_laptop (mirror for laptop wiring), paperless_user_password (paperless UI fallback). - Image auto-update via Keel (semver minor policy, hourly poll). - Ingress dns_type=proxied → Uptime Kuma external monitor auto-created by external-monitor-sync CronJob. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	d1e7121115	recruiter-responder: bump image to 05b95943 (split callback routes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	c72b839a2f	nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some point. NVIDIA has NOT published ubuntu26.04 driver images yet (skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04 tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04). Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 + driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart applied cleanly but the v26.3.1 operator auto-detects host OS via NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404 on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so future `terraform apply` doesn't surface the same trap again. Note: chart rollback alone does NOT restore GPU functionality on k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the ubuntu26.04 suffix (the NFD label is sticky once detected). The actual recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver images, or (b) rolling the host kernel back to 6.8.0-117-generic (still installed in /boot, headers in /usr/src) + `apt-mark hold` to prevent re-upgrade. That step needs explicit user authorization for a node reboot — left as the next action item on code-8vr0. Files: - stacks/nvidia/modules/nvidia/main.tf — explicit version pin, explanatory comment - stacks/nvidia/modules/nvidia/values.yaml — comment block documenting the situation; driver pinned at 570.195.03 - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md — full timeline, root causes, recovery procedure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	62efded1b6	wireguard: switch to iptables-nft so PostUp MASQUERADE works Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?) sclevine/wg's default `iptables` symlink points to iptables-legacy, which talks to the kernel's xt-tables. K8s nodes nowadays initialize their nat table via nftables (calico-node sets it up), so iptables-legacy in the container sees "no nat table" and bails. Reproduced by ephemerally debugging the live pod's namespaces (kubectl debug --copy-to + same mounts as the real pod) — wg-quick output matched verbatim. Fix: postStart now calls update-alternatives to point iptables and ip6tables at iptables-nft/ip6tables-nft (already present in the image) before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes to the nftables-backed nat table calico already populated. Verified: new pod went 2/2 Running with 0 restarts after apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	45c8e88e89	terminal: probe + alerts after Traefik replica routing-table skew User reported "site loads but failed to connect on the tmux session". Root cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only the IngressRoute CRDs registered. About 1/3 of /token preflight requests landed on that replica and got 404 with router="-", and WS upgrades intermittently failed the same way, so the lobby iframe stayed stuck on "Failed to connect. Retrying...". `kubectl delete pod` on the bad replica restored the missing router and unblocked the user. This commit adds the long-term mitigation: stacks/terminal/main.tf - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes 4 gauges to Pushgateway (token_status, ws_status, ttyd_status, last_success_timestamp). Verified the probe end-to-end: token=302 ws=302 ttyd=200 ok=1 stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl - Webterminal group: WebterminalTokenDegraded (warning, 10m), WebterminalWebsocketDegraded (critical, 10m), WebterminalTtydUnreachable (critical, 10m), WebterminalProbeStale (warning, 15m). - Traefik Router Parity group: TraefikRouterCountSkew fires when any Traefik replica's router count diverges from siblings for >10m — catches the same class of issue cluster-wide, not just for terminal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	d828b51670	recruiter-responder: bump image_tag to 50f43004 (backtest --persist) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	0480477f44	nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	e398e717f1	broker-sync(fidelity): un-suspend monthly CronJob The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729) which is the simple accumulate-gains approach Viktor signed off on: each monthly scrape captures (current_pot, real_contribs), and we emit a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape. dav_corrected handles the dashboard math. Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via 'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.	2026-05-22 14:16:56 +00:00
Viktor Barzin	195b5e4061	keel: use +() anchors on policy/match-tag so per-workload overrides stick Without the anchor, each policy update fires mutateExistingOnPolicyUpdate, which OVERWRITES existing keel.sh/policy annotations back to 'force'. That broke the phased rollout — bulk-setting workloads to 'never' didn't stick because the next policy update reset them. With +() anchors, the mutate only adds the annotation if missing. New workloads (in enrolled namespaces) get force+match-tag; existing workloads with explicit policy=never (out-of-band, for phased rollout) stay never. Phase 1 rollout state (2026-05-17): - 10 workloads on force+match-tag in 10 namespaces (Phase 1) enrolled via keel.sh/enrolled=true namespace label: linkwarden, excalidraw, diun, echo, foolery, city-guesser, jsoncrack, privatebin, ntfy, speedtest - 216 workloads on policy=never (out-of-band kubectl annotate) - 31 critical namespaces excluded at policy level Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true` and clearing the `never` annotation off their workloads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	25fcf80651	keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc. 2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36. The force+match-tag pairing should have constrained Keel to digest-only on the current tag (not switch to a new tag), but a race between Kyverno's mutate (injecting match-tag) and Keel's hourly poll caused the workload to still have the old `force`-only annotation when Keel acted. Result: tag rewrite, pods cycled, pgbouncer connection failures, login broken. Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back to 2026.2.2. Auth restored within ~5 min. Going forward, critical-namespace workloads are excluded at the policy level so this race can't recur. They get upgraded via TF (Helm chart version bumps) on a deliberate cadence, never by Keel. Live state: 36 workloads on policy=never (35 critical + chrome-service pin + 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for opt-out-pure auto-update on the remaining stateless apps. This matches user direction (2026-05-17): "upgrading is fine as long as we upgrade correctly and the latest version is healthy" + "keel responsible for the latest version, phased rollout, graceful". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	74ecda6cc3	keel: bump default policy patch → major (user wants latest version) User: 'i'm happy with occasional breakages. we have alerts.' Policy=major auto-updates workloads to the latest semver tag in the registry, including major/minor/patch bumps. Still semver-parser-bounded so dev/nightly/master branches are filtered out (avoids the 2026-05-16 force-trap on affine/calico). Live: 217 patch-annotated workloads re-annotated to major. Next Keel poll (~1h) will pick up any pending major/minor releases. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	6c8546bb84	recruiter-responder: bump image_tag to 94b37a9c (follow-up detection) Replies from recruiters to our sent decline / engage / ignored threads are now attached to the existing thread, surface with a 🔁 follow-up marker in Telegram ("you previously sent"), and re-open thread status to pending so they show up in recruiter_list status=pending. Smoke-tested live: Rachel-style follow-up referencing our outbound msgid + the original recruiter msgid in References → correctly attached to thread #87, status flipped sent→pending, 3 messages persisted (in/out/in). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e540292ad	kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs) The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced 176 UpdateRequests for the initial bulk scan across enrolled namespaces. At the existing 384Mi limit, kyverno-background-controller OOMKilled while processing them — no annotations got injected on existing workloads (count stuck at 30). Live state already bumped via kubectl set resources; this commit makes it durable through Terraform. Also lowered the request to 256Mi (the 384Mi floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady state). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	257679166b	recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e1ecaf74c	kyverno: codify aggregated ClusterRole for keel mutate-existing The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true to the inject-keel-annotations ClusterPolicy but Kyverno's validate webhook rejected it: the background-controller SA needs update/patch on apps/v1 Deployment/StatefulSet/DaemonSet. Created live via kubectl + now in TF so the next apply is idempotent. The ClusterRole aggregates into kyverno:background-controller via the rbac.kyverno.io/aggregate-to-background-controller label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	bede247e98	recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured) Reply-To header now extracted on inbound and used for outbound replies. Verified with a synthetic email From: noreply-careers@megacorp.example Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and threaded under the original (Re: subject + In-Reply-To + References). Alembic 0003 added messages.reply_to_addr column. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	a2e8afc3ed	kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated Before this, the inject-keel-annotations policy only fired on admission events. Workloads that existed BEFORE their namespace got labeled keel.sh/enrolled=true never received the annotation, so Keel didn't watch them. Live state was 30 of 226 workloads auto-updating. With mutateExistingOnPolicyUpdate=true and the required mutate.targets block, Kyverno's BackgroundScan controller applies the mutate to existing matching Deployments/StatefulSets/DaemonSets on policy update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	cdeb89d5f1	final wave: enroll immich + status-page, retrigger 17 pending Bucket A * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1 skipped — has non-standard lifecycle from earlier work). * status-page: enrolled (was missing from original sweep). * v6 retrigger marker on 17 stacks that never reached terragrunt apply (#704 exit-1 halted mid-loop). After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks. The remaining ~22 are operator/Helm-managed and intentionally excluded (same fight-loop risk as Calico — bump via Helm chart version, not Keel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
root	caba0e811f	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:55 +00:00
Viktor Barzin	4944e508aa	Bucket C: enroll 5 raw-deploy stacks in Keel auto-update * beads-server: 3 Deployments — extended V1 lifecycle blocks to V2 + KEEL_IGNORE_IMAGE; namespace label. * llama-cpp: 1 Deployment — extended V1→V2; namespace label. * novelapp: namespace label only (Deployment has non-standard lifecycle without V1 dns_config — drift expected, accept for now). * plotting-book: namespace label only (same as novelapp). * trading-bot: namespace label only (same as novelapp). immich deferred — the bulk-add script's brace-counter got confused by a HEREDOC in the file, inserting a lifecycle block in the wrong position. Needs manual per-Deployment editing. The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see their Deployments mutated by Kyverno but their TF lifecycle doesn't yet ignore the keel annotations. Expected behavior: drift visible in terragrunt plan, applied-state oscillates with Kyverno re-injecting. Acceptable starting point; per-Deployment lifecycle work to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	b57596d930	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	ec60af5fd4	kyverno: exclude calico-system from inject-keel-annotations Stop the hourly Keel-vs-tigera-operator fight loop on calico-node DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system workloads with keel.sh/policy=never; TF: added calico-system to the namespaces exclude list so any future mutate run won't re-inject. The previous calico unenrollment (label removal from namespace) wasn't enough — once Kyverno had stamped the policy=patch annotation on the Deployments/DaemonSets, removing the namespace label didn't strip the annotation, so Keel kept watching them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:54 +00:00
Viktor Barzin	b48ddc09d6	ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)	2026-05-22 14:16:54 +00:00
Viktor Barzin	978237441e	ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks	2026-05-22 14:16:54 +00:00
Viktor Barzin	1dd8f4e2bf	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude- memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired — its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-22 14:16:53 +00:00
Viktor Barzin	0c73974362	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698 )	2026-05-22 14:16:53 +00:00
root	87f7b25a13	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:53 +00:00
Viktor Barzin	126cfb7022	wealth: dav_corrected view fixes pension gains-offset miscategorisation The broker-sync Fidelity provider emits 'unrealised-gains-offset' DEPOSIT activities to reconcile Wealthfolio's total with the PlanViewer reported pot, because Wealthfolio doesn't track pension fund units directly. Wealthfolio's data model treats that DEPOSIT as a cash contribution, which double-inflates net_contribution and zeroes out the implied growth. Add a Postgres view 'dav_corrected' in wealthfolio_sync that subtracts the cumulative gains-offset from net_contribution per account per date (re-exporting as 'net_contribution' so it's a drop-in replacement). All 17 wealth dashboard panels that compute contribution/growth/ROI now read from the view. Total impact: portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly the £35,721.20 Fidelity offset that was previously miscategorised).	2026-05-22 14:16:52 +00:00

985 commits