infra

Author	SHA1	Message	Date
Viktor Barzin	a3bcb5e12f	fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard Operational layer for the new col_snapshot cache shipped in fire-planner@e72fd22: stacks/fire-planner: - fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows age toward the 1-year TTL boundary (within 7 days). Calls python -m fire_planner col-refresh-stale, upserts via cache.upsert. monitoring/dashboards/cost-of-living.json (Finance folder): - Two template variables: $city (single-select from col_snapshot), $baseline_city (for COL ratio computation, defaults London). - Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded). - All-cities ranked table with gradient-gauged total + colored ratio. - Cache-freshness table flags rows approaching TTL expiry. Initial population needs a one-shot: post-Keel-rollout, kubectl -n fire-planner exec deploy/fire-planner -- \ python -m fire_planner col-seed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	d4c76a07a2	openclaw: revert model swap + document codex re-auth path The previous commit promoted modelrelay/auto-fastest to primary as a workaround for the expired openai-codex OAuth token. But modelrelay routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash) that hallucinate answers instead of using ssh / curl / etc. — exactly what the v4 learning loop is supposed to leverage. Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the only mini variant the Codex backend accepts for ChatGPT Plus tier), and inline the re-auth command in the model-block comment so future sessions know exactly what to do when the OAuth token expires: kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \ -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \ -c openclaw -- node /app/openclaw.mjs models auth login \ --provider openai-codex modelrelay/auto-fastest stays in the fallback chain so the agent remains partially usable while the token is expired. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	dbb3dc04d3	openclaw: engrain the learning loop at the identity level User feedback: "this should work for any task, not just calendar. this learning flow must be strongly engrained to ensure openclaw gets better over time." The v3 rules were buried at the bottom of TOOLS.md and only stated in workflow language. Three changes to make the rule unavoidable: 1. SOUL.md — new marker-delimited section "Learning is your identity" inserted before ## Boundaries. AGENTS.md tells the agent to read SOUL.md first every session, so this is now the FIRST thing the agent loads about itself. Frames learning as character, not procedure. 2. TOOLS.md v4 — section moved from the END of the file to right after the `# TOOLS.md` title (first substantive content on file load). Title strengthened: "THE FLOW — run this on EVERY task. Not just hard ones." Concrete examples explicitly call out diverse domains (calendar, frigate restart, disk usage, inbox summary, deploys) so the universality is unmistakable. 3. learn-from-tasks skill — opens with "This is universal. EVERY task runs through this flow — not just hard ones, not just unfamiliar ones. The save at the end is mandatory." The actual flow (know → ask devvm → save) is unchanged. What changed is salience: the rule is now the first thing the agent encounters in three independent surfaces, with stronger framing that makes "skipping the save" feel like a violation of identity rather than a missed optimisation. Marker bumped v3 → v4. Stripper handles v1-v9 idempotently. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	854817e2e3	trading-bot: revive K8s stack + add meet-kevin-watcher Uncomment the trading-bot stack (disabled 2026-04-06 due to resource consumption) and add the new meet_kevin_watcher service container. Changes: - Uncomment the /* ... */ block enclosing the entire stack - Fix db_init job: add -d postgres to psql commands (root user has no root-named database — matches pattern used in claude-memory + others) - Remove 3 disabled containers from trading-bot-workers Pod spec: news-fetcher, sentiment-analyzer, trade-executor - Add new meet-kevin-watcher container (image viktorbarzin/trading-bot-service:latest, command python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi) - Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault secret/trading-bot) - Add 4 common_env entries for the Meet Kevin pipeline (poll interval, daily cost cap, model slug, prompt version) - Update lifecycle.ignore_changes to 4 image indices vault: re-enable pg-trading static role - Add pg-trading to vault_database_secret_backend_connection allowed_roles - Uncomment vault_database_secret_backend_static_role.pg_trading (was disabled 2026-04-06 with the rest of trading-bot stack) kyverno: add postgres to trusted-registries allowlist - trading-bot db_init uses postgres:16-alpine (Docker Hub library image) - postgres* was not in the DockerHub bare-name allowlist (unlike mysql, alpine, nginx, python which were already there) Final workers Pod containers (in order): [0] signal-generator [1] learning-engine [2] market-data [3] meet-kevin-watcher (NEW) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	d0a4876825	openclaw: v3 flow — know → ask devvm → (rarely) try yourself Refines the devvm-fallback into an explicit triage flow that the agent runs on every task. The default path is to ASK devvm-claude when uncertain — don't brute-force. Most tasks are solvable there. ## The flow 1. Do I KNOW how? Check `memory_recall` and INDEX.md. 2. If not, SSH devvm and ask claude — and crucially, ask it to share the steps + credentials needed so I can do it on my own next time. Save the answer in openclaw memory. 3. (RARE) If devvm-claude says no, try in-pod. Most likely fail — that's OK. ## Storage moved to memory-indexed location Learnings now live under `/workspace/memory/projects/openclaw-learned/` (was `/workspace/learned/`) so memory-core indexes them and `memory_recall` surfaces them. Layout: - `scripts/<task>.md` runnable recipes - `knowledge/<topic>.md` decisions, paths, gotchas - `credentials/<name>.md` POINTERS to Vault, never values ## Credentials = Vault pointers only Previous v2 design saved cred values to plaintext NFS files. v3 flips to pointer-only: cred file documents the Vault path + fetch command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the consumer, and rotation expectations. The secret stays in Vault. ## Init container also migrates Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3, moves any files from the legacy `/workspace/learned/` tree into the new location, removes the empty legacy dir. User edits outside the markers always survive. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	ef67a53676	openclaw: explicit "use devvm + learn" default behaviour Refine the init container's devvm-fallback seeding so the OpenClaw agent treats devvm as its DEFAULT teacher and saves recipes locally to become independent over time: 1. TOOLS.md v2 section now has two emphatic CRITICAL rules: - "TRY DEVVM before giving up" — when stuck, ssh devvm before telling the user "I can't do that". - "After every task, introspect → save a faster way" — for any non-trivial task (especially recurring ones), save the recipe to /workspace/learned/ and update INDEX.md. 2. New cc-skill `learn-from-tasks` at /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises both triggers: (A) you're stuck → check INDEX → ask devvm → save; (B) you just finished → introspect → save if recurring. 3. /workspace/learned/ scaffold: INDEX.md table-of-contents + scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks INDEX.md BEFORE reaching for devvm, so saved recipes are findable on the next run. 4. Marker migration: strips both v1 and v2 markers before re-inserting so user edits outside the markers always survive future restarts. Security caveat documented inline: credentials in /workspace/learned/credentials/ are NFS plaintext — acceptable for home-lab personal scope, NOT for anything more sensitive than what `ssh devvm` already gives the pod (wizard's access). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	43802d2452	openclaw: also write devvm section to /workspace/TOOLS.md The OpenClaw agent reads TOOLS.md on every session per AGENTS.md ("environment-specific notes"), but it does NOT auto-search the memory-core index for "devvm" before answering. Result: the agent said "I don't have access to the devvm" even though ssh + the openclaw-task wrapper were fully wired up (verified e2e in 9ad52dfd). Updated init 6 (seed-devvm-memory-note) to ALSO append a marker-delimited section to /workspace/TOOLS.md describing the devvm SSH capability + openclaw-task usage. Idempotent: strips any prior v1 section before re-inserting, so user edits outside the markers survive future pod restarts. The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md memory note stays in place — it's still indexed by memory-core and surfaces for memory_recall queries. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	7e558de8f0	openclaw: SSH + tmux task fallback to devvm Give the OpenClaw pod two new capabilities: 1. Host-tools bundle. New init container `install-host-tools` extracts openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq + friends into /tools/host-tools/, with the bookworm-slim libs the binaries need. PATH + LD_LIBRARY_PATH on the main container point ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1 marker; smoke test (ldd-based) fails the init at deploy time if any binary has unresolved deps. Bundle is ~558 MB on the existing /srv/nfs/openclaw/tools NFS. 2. devvm SSH + async task pattern. New init `setup-ssh-config` writes id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main container startup symlinks /home/node/.ssh → there. New /usr/local/bin/openclaw-task wrapper on devvm manages long-running work as tmux sessions on devvm (sessions and logs survive pod restarts — they live on devvm, not in the pod). New init container `seed-devvm-memory-note` drops a markdown note teaching the pattern; main container startup now runs `openclaw memory index --force` so the note is searchable on first boot. Design + verified E2E flow in docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test green: spawned a 50s task from pod A, deleted pod A, new pod B saw the task finish and read its full log. Pre-existing keel.sh annotation drift on openclaw/{openlobster, task_webhook} cleaned up in the same apply. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	61f7539de2	postiz: disable unused providers + pin temporal vs Keel force-policy Two changes in one commit because they are coupled — the DISABLED_PROVIDERS addition cannot land safely without the Keel exclusion on temporal: 1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed only 'instagram-standalone' connected; all other Postiz providers were idle-polling Temporal task queues. List excludes x, linkedin, reddit, threads, youtube, tiktok, pinterest, dribbble, slack, discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram, wordpress, nostr, farcaster. Keeps facebook + instagram + the standalone variant active. 2. temporal deployment needs keel.sh/policy=never (set live via kubectl annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0 on every helm reconcile because :0.20.0 is published in the same registry path but is a DIFFERENT (legacy Cassandra-based) image stream. Memory id 1933 trap; new variant captured in id 2315-2319. The annotation is set live (not in TF) because the existing TF block has lifecycle.ignore_changes = [keel.sh/policy] so the chart reconcile won't reset it. Long-term fix: add temporal to the Kyverno keel-mutate-existing exclude list so it survives a namespace re-label.	2026-05-22 14:17:00 +00:00
Viktor Barzin	6dd1f15881	k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window Three changes from today's autonomous-pipeline validation session: 1. Kill-switch ConfigMap — chain checks for `k8s-upgrade-killswitch` ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the start of version-check. Existence halts the chain (exit 0) with a Slack message. Single-command emergency stop: kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \ --from-literal=reason="storm response" Resume: kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch Role rule for `configmaps` get/list/watch added (resourceName-scoped). 2. Ignore RecentNodeReboot in halt_on_alert_query everywhere — the chain itself causes reboots. The pre-drain master check, post-upgrade worker check, postflight check, and preflight halt-on-alert all now pass `RecentNodeReboot` as the extra-ignore. Previously only worker phase's post-upgrade gate did this. Master Failed silently this morning on the pre-drain check after my own master reboot. 3. Preflight quiet-baseline 3600s → 600s — the 1h cooldown after any Ready transition meant the chain refused to run for an hour after every kured reboot. 10 min is enough for kubelet/control-plane to settle; the 24h-between-cluster-reboots invariant lives in kured-sentinel-gate, not here. Validated by running the chain end-to-end: preflight passed in 5s, master phase now in drain. Today's storm post-mortem (snapshot CoW amplification + tigera-operator crashloop feedback loop) drove the kill-switch design. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	899c7adaa0	authentik: worker replicas 3 -> 2 Workers handle background tasks only (LDAP sync, email, certificate renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load- bearing. Reduces sustained CPU by ~100m. Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing). PgBouncer pool unchanged at 3 (DB connection pooling).	2026-05-22 14:17:00 +00:00
Viktor Barzin	701b73bf53	forgejo: disable source archive ZIP/TAR downloads Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the dot_files repo (vim-plugin source trees) — each request synthesised a fresh ZIP from git history, taking 9.9s and returning 500 under sustained load. Cost: ~440m sustained forgejo CPU. Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true. /archive/* URLs now 404; git clone / OCI registry / API unaffected. Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop). (Pod rollout took ~7min on the new RS due to kubelet's recursive chown of the 2700+ files in the data PVC — fsGroupChangePolicy is unset and defaults to Always; could be set to OnRootMismatch later.)	2026-05-22 14:17:00 +00:00
Viktor Barzin	b92e1166a8	monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter, service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was applied live but not codified — this restores the Helm template. Companion changes to keep alerting fidelity: - evaluation_interval kept at 1m (alerts evaluate every minute) - snmp-ups job pinned to scrape_interval=30s so PowerOutage / LowUPSBattery detect within ~30s instead of 2m - 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery, PowerOutage) for stability above the new 2m global cadence Other jobs that already had per-job overrides (snmp-idrac 1m, redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected. Expected: 50-150m sustained CPU saving on Prometheus + apiserver. Verification ongoing — apiserver settles ~minutes after Prometheus config reload due to initial-target-scrape burst.	2026-05-22 14:17:00 +00:00
Viktor Barzin	5bc98851b9	alloy: switch pod log shipping from apiserver to file-tail Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS config. discovery.relabel.pod_logs already sets __path__ to the kubelet log path (/var/log/pods/<uid>/<container>/*.log) and varlog host-mount was already present, so this is a one-line swap. Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams (13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod logs through the apiserver instead of tailing kubelet's log files was the dominant residual cost after the recent Loki/Alloy onboarding. Measured before/after: - Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m) - kube-apiserver: peak 1959m midnight burst, settled 632m (Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete during rollout — FailedKillPod 'unable to signal init: permission denied' on runc, transient runtime issue, unrelated to this change.)	2026-05-22 14:17:00 +00:00
Viktor Barzin	48e7c309fc	vault: add pg-matrix + pg-technitium static roles to allowed_roles Both static-roles existed in Vault state (created out-of-band) but were missing from the postgresql connection's allowed_roles list. Vault was logging 'is not an allowed role' rotation errors every 10s for both, sustained CPU waste ~40-70m. Adopted both via 'import {}' (import blocks removed after first apply per the canonical adoption pattern). - pg-matrix: username=matrix, rotation_period=86400 (1d) - pg-technitium: username=technitium, rotation_period=604800 (7d) Verified: 'is not an allowed role' errors stopped in vault-0 logs immediately after apply.	2026-05-22 14:17:00 +00:00
Viktor Barzin	94ca849379	k8s-version-upgrade: grant get/list on apps resources for drain kubectl drain --ignore-daemonsets needs to GET each pod's owner reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify which pods can be drained vs ignored. Without these RBAC verbs, drain bails with 'cannot delete daemonsets ... is forbidden' for every daemonset-managed pod on the node.	2026-05-22 14:17:00 +00:00
Viktor Barzin	a90ce27923	infra: add kubectl + authentik providers across 6 stacks Provider declarations were applied across freshrss, linkwarden, navidrome, openclaw, tandoor, vault in prior sessions; lock files regenerated for the 4 stacks where init had run. Commits the WIP so downstream Terraform plans can proceed. - kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic workaround for Kyverno CRDs (beads code-e2dp) - authentik (goauthentik/authentik ~> 2024.10): used where stacks manage their own Authentik objects	2026-05-22 14:17:00 +00:00
Viktor Barzin	fa2b57f177	openclaw: enable recruiter-api plugin (allowlist + manifest contracts) Plugin needs three things to load under OpenClaw 2026.5.x: 1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin' in the startup command after doctor runs). 2. 'openclaw plugins enable recruiter-api' to flip its registry entry. 3. manifest declares contracts.tools (added in recruiter-responder commit 83ffd9fa). Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the plugin's polling loop knows which Telegram chat to deliver into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	4bc0c5f27e	recruiter-responder: deploy d7892396 — OpenClaw-driven flow Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	6417c770c1	recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL (NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID env consumed by the recruiter-api plugin's announcement loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	8aff0ba1a2	k8s-version-upgrade: fix two more grep-pipefail bugs Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2, in two more callsites the previous fix didn't cover: Line 354 (phase_master): control-plane Running check — `grep -v Running | wc -l` returns 1 when all pods are Running (the happy path), aborting the chain right after master upgrades. Line 419 (phase_postflight): on-target node check — `grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes are on the target version (the happy path, exactly when postflight should succeed). Aborts at the moment of victory. Forensics on yesterday's master Job failure (see commit message of 10b261d2 for context): the master Job spawned 16s after the previous fix's TF apply, before configmap propagation completed on the kubelet. With those two latent bugs also looming, the chain would have died post-master-upgrade and again at postflight even if propagation had been timely. Wrapping each grep in `{ ... || true; }` so a no-matches result returns success. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	83fc15c22b	k8s-version-upgrade: fix pipefail abort when no alerts are firing halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When zero alerts are firing (the desired healthy state), grep matches nothing and exits 1. Under `set -o pipefail`, the whole pipeline returns 1; under `set -e`, the caller's `alerts=$(...)` assignment fails and aborts the script in ~1s with no diagnostic output. The chain effectively required at least one non-meta alert to be firing to make any forward progress. Today (2026-05-19) the cluster is fully clean post-MySQL recovery, the daily 12:00 UTC detection spawned the preflight Job, and it died instantly — blocking the 1.34.7 → 1.34.8 patch chain. Fix: wrap the grep in `{ ... || true; }` so a no-matches result returns success. Preflight verified end-to-end after the fix — the chain is now in flight (preflight ✓, master phase running). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	612a83f8ce	security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) ## Change - Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with kubectl_manifest.wave1_egress_observe_tier34 - namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'` to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux) - Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted (apply_only=true means TF rename does NOT destroy the live old resource; cleanup done manually) - Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan (cluster infra + GPU workloads, deferred) ## Verification (live cluster, 2026-05-19) - 82 namespaces match `tier in (3-edge,4-aux)` - Felix translated the new policy into iptables LOG rule in cali-po-* chain - LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata from multiple namespaces with distinct destinations: - east-west pod-to-pod (10.10.108.48, 10.10.122.131) - in-cluster service VIP (10.96.0.10 — kube-dns) - external (149.154.166.110 — Telegram API from recruiter-responder) ## W1.7 next step (calendar-bound, ~1 week) - Let observation run for ~1 week - Aggregate distinct destinations per namespace via LogQL - Build per-namespace egress allowlist module `tier3_egress_baseline` - Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]` - Phased per-namespace as originally planned Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	2f9ac0110a	security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`. ## Mechanics (verified end-to-end on 2026-05-19) 1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`. 2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5. 3. Linux kernel emits LOG entries to ring buffer with transport=kernel. 4. systemd-journald captures kernel transport entries. 5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`. 6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection. ## Verified output sample `calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...` The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply — verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed). ## Wave 1 status W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`. Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	a5772060f8	dbaas: opt MySQL out of Keel + add do-not-bump warning Two changes to make the 8.4.8 pin durable: 1. Add `keel.sh/policy: never` annotation on the mysql-standalone StatefulSet. The dbaas namespace was already excluded from the Kyverno mutate, but the StatefulSet carried orphan Keel annotations (force/poll/match-tag) from an earlier policy version that lacked the exclusion list. Keel kept watching :8.4.8 for digest changes. Now explicitly opted out; Keel logged "image no longer tracked". 2. Expand the inline comment to a banner pointing at the upgrade plan docs and the gating beads task. Anyone touching this line sees the warning + the path to do it right. Closes the loop on the 2026-05-18 outage. Real upgrade tracked in code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	a048b37f60	security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `*/*` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos. 
- Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	51365937b1	recruiter-responder: bump image to 444fa58c (header CRLF fix) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	efe8c9625b	dbaas: pin MySQL to 8.4.8, recover from broken 8.4.9 DD upgrade The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose data-dictionary upgrade got stuck mid-flight on every attempt (no progress, no CPU, never completing). Pinning to 8.4.8 + restoring from the 2026-05-18 00:30 UTC mysqldump puts us back on a known-good binary. Closes: code-eme8 Closes: code-k40p Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	1082cba0fb	kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce ## Resolves code-e2dp (Kyverno TF apply blocked) Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which treats manifests as opaque YAML strings. ## Migration mechanics - Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all stacks get it generated in providers.tf. - stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source declaration (required for kubectl_manifest in a child module). - Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest with yaml_body = yamlencode({...}). depends_on chains preserved. - terraform state rm for all 17 old kubernetes_manifest entries. - stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID. - One resource (policy_inject_keel_annotations) needed kubectl delete + recreate because the kubectl provider couldn't patch it cleanly (resourceVersion=0 invalid for update — gotcha when adopting a resource previously kubernetes_manifest-owned). ## W1.4 — security policies Audit → Enforce (LIVE) Three policies flipped: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved. ## Shared exclude list (35 namespaces) local.security_policy_exclude_namespaces in security-policies.tf. 
- 31 critical from memory id=1970 (Keel rollout list) - + frigate (camera HW transcoding needs host access) - + kured (privileged DaemonSet for node reboots) - + default (etcd backup/defrag CronJobs use hostNetwork) - + changedetection (uses SYS_ADMIN for chromium sandbox) ## W1.5 — require-trusted-registries stays Audit Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply chain. Tracked under beads code-8ywc as follow-up. ## TF import-blocks The imports.tf file should be removed in a follow-up cleanup commit once verified — TF doesn't auto-clean these. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-e2dp	2026-05-22 14:16:58 +00:00
Viktor Barzin	83079758bb	monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane ## Loki + Alloy re-enabled (code-146x) - Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify, kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource - Reverses the documented "operational overhead vs benefit after node2 incident" decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc) needs Loki + ruler + alert routing. - SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled pointed at prometheus-alertmanager.monitoring.svc:9093 - Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes to Loki - Loki canaries running (4) - Vault audit-tail sidecar logs now flowing to Loki: queried {namespace="vault",container="audit-tail"} returns live audit JSON ## Wave 1 alert rules deployed (W1.3 partial) Added "Security Wave 1" rule group to loki_alert_rules configmap: - V1: VaultRootTokenCreated — auth/token/create with policies=[root] - V2: VaultAuditDeviceModified — sys/audit/* create/delete/update - V3: VaultSealChanged — sys/seal update - V4: VaultPolicyModified — sys/policies/acl/* create/update/delete - V5: VaultAuthFailureSpike — >10 permission denied/min - V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) - S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined, fires once promtail/Alloy ships sshd journal with job=sshd-pve) Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1 which needs the audit policy + apiserver log shipping codified. 
## #security Slack lane (Alertmanager) - New `slack-security` receiver in prometheus_chart_values.tpl, channel #security - Higher-priority route at top of routes list: matchers `lane = security` → slack-security, continue: false (so wave 1 alerts never fall through to #alerts) - Slack message format includes summary + description + runbook link annotation - All wave 1 rules set `lane = "security"` label ## Resource summary - 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource, kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify, + 1 other - 5 changed: helm_release.prometheus (alertmanager config — new receiver + route), 4 deployments (image tag drift from Keel-managed images, unrelated) - 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"] (timestamp-triggered always recreates — not destructive) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-146x	2026-05-22 14:16:58 +00:00
Viktor Barzin	c9289192c7	security(wave1): Vault audit-tail sidecar (live) + doc reality-check ## Vault audit-tail sidecar (APPLIED + VERIFIED) - Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume from the chart's auditStorage), emits JSON audit events to stdout. kubelet captures the stdout; once Loki+Alloy are deployed (blocked on code-146x), these logs flow automatically to Loki with `container="audit-tail"`. - Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly. - Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s). - Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON audit lines from ESO token issuance, KV reads, etc. ## Doc reality-check While verifying logs reached Loki, discovered Loki is NOT actually deployed. `stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but has a self-referencing `depends_on = [helm_release.loki]` that prevented apply. No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The monitoring.md "Loki: deployed" claim was aspirational. - security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on code-146x) - security.md W1.3 row: gated on code-146x added - monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x ## New beads task - code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug, investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0). 
## Wave 1 status update - W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x - W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR) - W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	ae0c1701ec	security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash) ## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED) - Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config. Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every vault audit log entry shows Traefik's pod IP instead of the real client IP — the V7 alert rule (Viktor identity from non-allowlist source IP) needs the real client IP to be meaningful. - Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing for_each unknown issues unrelated to this change; -target documented in error message itself as the workaround). - Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed within ~10s each. Verified XFF config visible in pod's /vault/config/extraconfig-from-values.hcl. - The `vault_audit "file"` resource was already in TF at line 287 (writing to /vault/audit/vault-audit.log) — no change needed. ## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED) - Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces from memory id=1970 + `frigate, kured, default, changedetection` discovered during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN pods that would be blocked by Enforce). - Flipped 3 security policies Audit → Enforce: deny-privileged-containers, deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at chart level. - `require-trusted-registries` STAYS in Audit mode pending allowlist tightening (current pattern includes `/` which matches anything-with-a-slash, so Enforce would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5. 
Apply blocker: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0` crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not caused by these changes. Resolving it requires either upgrading the provider or finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`. ## Wave 1 status after this commit - W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place) - W1.4 + W1.5: code ready, apply blocked on provider crash - W1.1, W1.3, W1.6, W1.7: not started in this session Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	87961e9ef8	monitoring(wealth): drop 6y timeFrom override on META vest cadence	2026-05-22 14:16:57 +00:00
Viktor Barzin	dd24ace480	realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api Companion to the GHA migration in immovika/realestate-crawler@c2acbf5. Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler is private, the Deployments had no imagePullSecrets, and Keel's poll-secret discovery list came up empty. Pods kept running only because the image landed in containerd cache months ago. Adds: - ExternalSecret `dockerhub-pull-secret` synced from Vault secret/viktor.dockerhub_registry_password. ESO template renders the dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in cleartext in any K8s manifest. - image_pull_secrets { name = "dockerhub-pull-secret" } on all 4 Deployments (ui, api, celery, celery-beat). - Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts :latest. CI no longer patches the image to a numeric tag — Keel now drives rollouts from digest changes on :latest. Live state after apply: all 4 Deployments on :latest with imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True. Once a GHA build pushes a new digest, Keel will roll all four within ~1h. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	99127939a8	monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side - Removed panel 27 (META RSU vest value over time) — superseded by vest-cadence chart which carries the same value signal plus the share-count overlay. - Removed panel 28 (per-vest value at vest vs today) — duplicative with panel 31's FIFO realized PNL. - Removed panel 29 (per-sell realized PNL) — same data as panel 31, just rolled up by sell date instead of vest date. - Resized panel 26 (Positions) to w=12 and moved panel 30 (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side next to the Positions table. - Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU chart used to live.	2026-05-22 14:16:57 +00:00
Viktor Barzin	b879481d71	monitoring(wealth): per-vest realized PNL via FIFO sell-match New table panel below the per-sell breakdown. For each vest, FIFO-match its shares against the subsequent sells (shares from earlier vests get sold first), and aggregate the matched portions: realized_pnl = SUM(matched_qty * (sell_price - vest_price)) pnl_pct = realized_pnl / SUM(matched_qty * vest_price) * 100 days_held = AVG(sell_date - vest_date) per matched portion Footer reducer sums shares, vest value, sell value, and realized PNL so the bottom row is the full-portfolio realized take.	2026-05-22 14:16:57 +00:00
Viktor Barzin	8b60e6bb6d	monitoring(wealth): META vest cadence chart — value vs shares (dual axis) Per-vest event line chart. Left Y axis (blue): vest value at the time = SUM(quantity * unit_price), in USD. Right Y axis (orange): number of shares vested. One point per vest date (aggregated when multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares). Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 -> 60s) and how the per-vest USD value tracked META's price ride across 2020-2026. timeFrom='6y' override pins the panel to the full vesting window.	2026-05-22 14:16:57 +00:00
Viktor Barzin	af077112cb	monitoring(wealth): META vest + sell PNL tables with FIFO cost basis Two new bottom-of-dashboard tables: Panel 28 'META vests — value at vest vs today': one row per BUY activity. Shows vest-day price * shares + what those same shares would be worth at today's META quote, plus the hypo P&L if Viktor had held everything (color-text on the gain columns). Panel 29 'META sells — realized PNL vs if held until today': one row per SELL with FIFO-matched cost basis (LEAST/GREATEST overlap in cumulative-share space). Shows realized P&L, the counterfactual P&L had he held until today, and the 'missed by' delta = (today_price - sell_price) * shares. Both pull today_price dynamically from quote_latest via a CTE so they self-update as Yahoo updates the META quote. Schwab account is empty so no live activity is expected.	2026-05-22 14:16:57 +00:00
Viktor Barzin	20c5965f95	monitoring(wealth): pin META RSU panel to 6y window Dashboard default time range is now-180d, but the META vesting + sell arc spans 2020-11 → 2026-02. With the default window the panel just showed a flat line at $64 (the empty post-sell residual). timeFrom='6y' override makes panel 27 always render the full vesting curve regardless of the dashboard-level time selector.	2026-05-22 14:16:57 +00:00
Viktor Barzin	3d43d96a5e	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	018ef3790f	monitoring(wealth): META RSU vest value panel (Schwab account) Daily total_value timeseries for the Schwab workplace account (account_id 72d34e09-...). Single-asset account holding META RSUs that vested 2020-11 → 2026-02 and were sold opportunistically over the same window. Currency USD (account_currency). Yahoo quote on META powers WF's daily mark; the historical DAV mirrored into wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.	2026-05-22 14:16:57 +00:00
Viktor Barzin	b107c0be8c	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	309f83ec8c	beads-server: codify Keel annotations on Dolt deployment (drift cleanup) Task 1's recovery from the broken `:latest` image rollout left keel.sh/policy=never set imperatively via `kubectl annotate` — out of TF, which violates the "all infra via TF" rule. Now codified alongside match-tag, trigger, pollSchedule. Removed those three keys from ignore_changes (was the original "Keel manages these" pattern, no longer correct for this deployment). Also added KYVERNO_LIFECYCLE_V1 ignore_changes on the presence_schema migration Job so future applies don't try to replace it over the Kyverno-injected ndots dns_config. Verified: 0 added, 3 changed (unrelated pre-existing drift on beadboard/workbench/service), 0 destroyed. Dolt pod uninterrupted (revision 13 preserved).	2026-05-22 14:16:57 +00:00
Viktor Barzin	5482f46125	RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	d1dcc5d12d	beads-server: add presence_claims table for agent coordination Adds the schema for the new agent presence board. Live Dolt is updated via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC init. Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server currently resolves to 0.50.10, whose docker-entrypoint.sh references an undefined docker_process_sql function and crash-loops on every init script in /docker-entrypoint-initdb.d. Keel can still bump this tag in-cluster (image is in lifecycle.ignore_changes).	2026-05-22 14:16:57 +00:00
Viktor Barzin	e4e2babd6a	k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst Two latent bugs in the K8s-version-upgrade pipeline surfaced when a real detection run ran post-26.04 upgrade today: 1. DNS: pod's CoreDNS search path is `<ns>.svc.cluster.local svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation). Unqualified `k8s-master` falls through all of those and then queries upstream Technitium for the bare name → NXDOMAIN. The FQDN `k8s-master.viktorbarzin.lan` is what Technitium actually serves. Suffix every node SSH target with `$NODE_DOMAIN`. 2. envsubst missing: claude-agent-service image doesn't ship `gettext-base`. Replace `envsubst <template | apply` with `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(sys.stdin.read()))' <template | apply`. Same semantics, image already has python3. Multi-line $SCHEDULING_BLOCK is preserved correctly through expandvars. Verified by manually triggering `k8s-version-check` post-fix: detection now reads `Latest patch: v1.34.8` (currently running 1.34.7) and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and started; killed before it touched the cluster (will land on Sunday 2026-05-24 12:00 UTC like the schedule says). Root cause of why these bugs lay dormant: yesterday's first manual-test detection found "no upgrade needed" so neither code path exercised SSH or envsubst. Today's apt-source restore (do-release-upgrade had mangled them) unmasked the v1.34.8 candidate, which made detection finally proceed past the SSH step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	23d8aa89c4	keel: enroll 11 more namespaces (operators + critical infra) Per user decision, removed authentik, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets, infra-maintenance from the policy-level exclude list, and added keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite being earlier flagged as scaled-to-0) and woodpecker. Net cluster coverage: 197/227 workloads on safe-force (86%), up from 170/227 (74%). All 197 are paired with match-tag=true (digest-only). Remaining 7 namespaces in Kyverno exclude list (irreducible): - keel (self-update) - calico-system + tigera-operator (operator-managed Installation CR) - cnpg-system + dbaas (state-coupled) - nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships ubuntu26.04 driver images) - kube-system (k8s built-ins) Files: - stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list trimmed from 16 → 7 - stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets, servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker — added keel.sh/enrolled=true label on kubernetes_namespace resource - infra-maintenance was in the policy exclude but the namespace doesn't actually exist in the cluster; the removal is a no-op there Applied via kubectl patch on the live ClusterPolicy + kubectl label on namespaces because the kubernetes provider v3.1.0 panics on Kyverno ClusterPolicy refresh — TF source has the desired state for next clean apply on a fixed provider. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	3bdba9f388	keel: enroll 15 critical-path namespaces for digest-only auto-update Per user decision today: monitoring, mailserver, vault, descheduler, metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy, reloader, headscale, wireguard, xray, cloudflared now participate in the same `force + match-tag` regime as the rest of the cluster — Keel watches the deployment's CURRENT tag for digest changes only and rolls on push, never rewriting tag strings. Two-part change: stacks/kyverno/modules/kyverno/keel-annotations.tf Trim the policy-level namespace exclude list from 31 → 16. The 16 remaining exclusions are the irreducible cluster-operator + state-coupled set: keel itself, calico-system + tigera-operator (operator loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system + dbaas (state-coupled), kyverno, metallb-system, external-secrets, proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned), kube-system, vpa, sealed-secrets, infra-maintenance. stacks/<each-of-15>/.../main.tf Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace` resource so the Kyverno mutate policy can target the workloads via its namespaceSelector matchLabels. Note on the apply path: the live ClusterPolicy was patched via `kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics during state refresh on Kyverno ClusterPolicy schemas with deeply nested optional `context.celPreconditions` / `imageRegistry` fields (see crash dump). The TF source above has the desired state, so any clean future apply on a fixed provider version will be a no-op against the live cluster. 
Floating-tag workloads in the newly-enrolled set (will roll on every upstream digest update — acceptable risk per user): - wireguard: sclevine/wg:latest (image fixed today via iptables-nft postStart shim) - xray: teddysun/xray - crowdsec-web: viktorbarzin/crowdsec_web - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter - traefik: nginx:1-alpine, openresty/openresty:alpine, ghcr.io/tarampampam/error-pages:3 - redis: haproxy:3.1-alpine, redis:8-alpine Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	f5cf6ec051	nvidia: bump driver container memory limit 128Mi → 2Gi After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing /etc/os-release to 24.04 so the operator picked the matching ubuntu24.04 driver image (everything per the workaround documented in docs/known-issues.md), the driver container still went into a restart loop. Container status: lastState.terminated: { reason: "OOMKilled", exitCode: 137 } The driver-installer was hitting the namespace LimitRange default of 128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the last log line on every restart was "Installing Linux kernel headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step enough headroom; peak observed during a successful compile in a test container was ~1.4Gi. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	22c18dc061	paperless-mcp: deploy MCP for AI document search - New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET, HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed. - In-cluster only egress to paperless-ngx svc; no Cloudflare hop on MCP-internal traffic. - Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser) in new `claude-mcp-readers` group with view-only Django perms; existing 279 docs bulk-granted view perm via /api/documents/bulk_edit/; workflow #2 auto-grants the group on new docs (Consumption Added). - Gateway-level bearer auth via new Traefik plugin Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth` pulls token list from Vault `secret/paperless-mcp/bearer_tokens`. - Vault `secret/paperless-mcp` holds: paperless_api_token (synced to K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens (JSON array, read at plan time), bearer_token_viktor_laptop (mirror for laptop wiring), paperless_user_password (paperless UI fallback). - Image auto-update via Keel (semver minor policy, hourly poll). - Ingress dns_type=proxied → Uptime Kuma external monitor auto-created by external-monitor-sync CronJob. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
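
Illustration for commit b879481d71 (per-vest realized PNL via FIFO sell-match): a minimal Python sketch of the matching rule that message describes, where shares from earlier vests are sold first and `realized_pnl` and `pnl_pct` are aggregated per vest. The dashboard panel itself is a SQL query that is not shown in this log, so the `Lot` type, the `fifo_realized_pnl` function, and the sample numbers are all hypothetical. The sketch assumes vests and sells are already sorted by date and that every sell comes after the vests it consumes; `days_held` is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Lot:
    """One vest (BUY) or one sell; a hypothetical shape, not the real schema."""
    date: str
    qty: float
    price: float


def fifo_realized_pnl(vests, sells):
    """Match sells against vests oldest-first and aggregate per vest.

    Returns one dict per vest with matched_qty, cost, realized_pnl, pnl_pct.
    """
    # Each queue entry is [vest index, shares still unsold from that vest].
    open_lots = deque([i, lot.qty] for i, lot in enumerate(vests))
    stats = [{"matched_qty": 0.0, "cost": 0.0, "realized_pnl": 0.0} for _ in vests]

    for sell in sells:
        remaining = sell.qty
        while remaining > 0 and open_lots:
            idx, left = open_lots[0]
            take = min(left, remaining)
            vest = vests[idx]
            row = stats[idx]
            row["matched_qty"] += take
            row["cost"] += take * vest.price
            # realized_pnl = SUM(matched_qty * (sell_price - vest_price))
            row["realized_pnl"] += take * (sell.price - vest.price)
            remaining -= take
            open_lots[0][1] -= take
            if open_lots[0][1] <= 0:
                open_lots.popleft()

    for row in stats:
        # pnl_pct = realized_pnl / SUM(matched_qty * vest_price) * 100
        row["pnl_pct"] = row["realized_pnl"] / row["cost"] * 100 if row["cost"] else None
    return stats


if __name__ == "__main__":
    vests = [Lot("2021-01-01", 10, 100.0), Lot("2021-04-01", 10, 200.0)]
    sells = [Lot("2022-01-01", 15, 300.0)]
    for vest, row in zip(vests, fifo_realized_pnl(vests, sells)):
        print(vest.date, row)
```

In the sample, the single 15-share sell consumes all 10 shares of the first vest and 5 of the second, so the second vest shows only its matched portion as realized.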

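
Illustration for commit e4e2babd6a (python3 in place of envsubst): a small sketch of how `os.path.expandvars` behaves when used for template substitution, including a multi-line value like the `$SCHEDULING_BLOCK` that message mentions. The `render_template` helper, the template text, and the variable values are hypothetical. One difference worth knowing: on POSIX, `expandvars` leaves an undefined `$NAME` in place, whereas `envsubst` substitutes an empty string, so the two agree only when every referenced variable is set.

```python
import os


def render_template(template: str, variables: dict) -> str:
    """Expand $VAR and ${VAR} in template using variables on top of os.environ.

    The environment is restored afterwards so the helper has no side effects.
    """
    saved = dict(os.environ)
    os.environ.update(variables)
    try:
        return os.path.expandvars(template)
    finally:
        os.environ.clear()
        os.environ.update(saved)


if __name__ == "__main__":
    # Hypothetical template and values, for illustration only.
    template = "host: ${NODE}.$NODE_DOMAIN\n$SCHEDULING_BLOCK\n"
    rendered = render_template(
        template,
        {
            "NODE": "k8s-master",
            "NODE_DOMAIN": "viktorbarzin.lan",
            "SCHEDULING_BLOCK": "nodeSelector:\n  role: worker",
        },
    )
    print(rendered, end="")
```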
1012 commits