Commit graph

3733 commits

Author SHA1 Message Date
Viktor Barzin
94dfbb9a9c state(vault): update encrypted state 2026-05-22 14:16:41 +00:00
Viktor Barzin
fbf97dfc5c state(dbaas): update encrypted state 2026-05-22 14:16:41 +00:00
Viktor Barzin
1fcf911269 authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live)
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.

Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
2026-05-22 14:16:41 +00:00
Viktor Barzin
24795ec203 authentik: codify proxy provider TTL + adopt embedded outpost
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.
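
The 7d to 28d figures can be checked with a short sketch. It uses plain Python `timedelta`, not Authentik's own duration parser, and assumes the cookie Max-Age equals the token validity in seconds:

```python
from datetime import timedelta

old_ttl = timedelta(hours=168)  # previous access_token_validity
new_ttl = timedelta(weeks=4)    # value codified in this commit

old_days = old_ttl.days  # 7
new_days = new_ttl.days  # 28

# Assumption: Max-Age is simply the validity expressed in seconds.
max_age_seconds = int(new_ttl.total_seconds())  # 2419200
```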

Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
  - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
    requests/limits, the app.kubernetes.io/component=server pod label
    (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
    on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
    the shared `goauthentik` Secret so the postgres session backend can
    connect to the dbaas cluster.
  - kubernetes_json_patches.service replaces the controller-set selector
    (which targets app.kubernetes.io/name=authentik / the goauthentik-server
    pods) with the outpost's own labels — without this, endpoints are
    empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".

The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
2026-05-22 14:16:41 +00:00
Viktor Barzin
63fc1e00de infra/compute: bump k8s-node1 RAM 32 -> 48 GiB
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.

Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.

Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
6e7fe96a40 infra/llama-cpp: benchmark report + -fa flag fix
Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
3da01e6e1e anubis: only challenge GET requests; allow everything else
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.

Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:

  * AI scrapers consume GET response bodies — they don't POST.
  * State-mutating XHRs and OPTIONS preflight need to bypass the
    challenge or the app breaks.
  * CrowdSec + per-route rate-limit + app-level auth already cover
    abuse on mutating methods, so this gives up nothing.
  * Hard-deny rules for known-bad bots run first, so a declared bad
    bot can't sneak through by sending a POST.
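
The rule ordering above can be sketched as a first-match-wins decision function. This is illustrative Python, not Anubis policy syntax; `BAD_BOT_MARKERS` and `decide` are hypothetical names, and the marker list is a stand-in for the imported AI/UA-block rules:

```python
# Hypothetical stand-in for the imported AI/UA-block lists.
BAD_BOT_MARKERS = ("GPTBot", "CCBot")


def decide(method: str, user_agent: str) -> str:
    """Return DENY, ALLOW or CHALLENGE for one request; first match wins."""
    # 1. Hard-deny rules run first, so a declared bad bot cannot bypass by POSTing.
    if any(marker in user_agent for marker in BAD_BOT_MARKERS):
        return "DENY"
    # 2. The new rule: anything that is not a GET skips the challenge.
    if method != "GET":
        return "ALLOW"
    # 3. Catch-all: remaining GETs get the challenge page.
    return "CHALLENGE"
```

Under these assumptions an XHR `POST /` from a browser is allowed, while the same POST from a listed bot is still denied.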

Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.
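
A rough Python equivalent of the checksum expression, to show why a policy edit changes the pod template. `policy_checksum` and the placeholder default policy are hypothetical; the real value is computed in HCL:

```python
import hashlib
from typing import Optional

# Placeholder only; the real default policy lives in the Terraform module.
DEFAULT_POLICY_YAML = "bots: []\n"


def policy_checksum(policy_yaml: Optional[str],
                    default_policy_yaml: str = DEFAULT_POLICY_YAML) -> str:
    """Mimic sha256(coalesce(var.policy_yaml, default_policy_yaml))."""
    # Terraform's coalesce() skips both null and empty strings.
    effective = policy_yaml if policy_yaml else default_policy_yaml
    return hashlib.sha256(effective.encode("utf-8")).hexdigest()
```

Any change to the effective policy text yields a different annotation value, which changes the pod template and triggers a rollout.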

f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.

Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
2026-05-22 14:16:40 +00:00
root
ff3d64159a Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:40 +00:00
Viktor Barzin
1f0bd11d3f privatebin: drop Anubis — broke XHR paste creation
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.

Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.

This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.

Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
2026-05-22 14:16:40 +00:00
Viktor Barzin
9c617e6d38 infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.

Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.

GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.

Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:40 +00:00
Viktor Barzin
0752bd49c8 kms: document native DNS auto-discovery (no client config needed)
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.

DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
  (was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202   (added)
- kms.viktorbarzin.lan A 10.0.20.200      (unchanged — still the
  Traefik LB for the user-facing website at kms.viktorbarzin.lan/)

vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.

Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
2026-05-22 14:16:40 +00:00
Viktor Barzin
d85b54d89d kms: per-connection state in notifier (vlmcsd is multi-threaded)
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).

Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.
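
A minimal sketch of the per-connection bookkeeping described above. It is not the code in slack-notifier.py: the class and method names are hypothetical, log-line parsing is left out, and the `__current` pointer is modelled as a plain attribute rather than a dict key:

```python
from typing import Dict, List, Optional


class ConnectionTracker:
    """Track in-flight connections keyed by "ip:port"."""

    def __init__(self) -> None:
        self.conns: Dict[str, Dict[str, str]] = {}
        self.current: Optional[str] = None  # most recent OPEN
        self.events: List[Dict[str, str]] = []

    def on_open(self, key: str) -> None:
        # Each OPEN gets its own sub-dict instead of resetting shared state.
        self.conns[key] = {}
        self.current = key

    def on_detail(self, field: str, value: str) -> None:
        # Unkeyed detail lines are attributed to the most recent OPEN.
        if self.current in self.conns:
            self.conns[self.current][field] = value

    def on_close(self, key: str) -> None:
        details = self.conns.pop(key, None)
        if details is None:
            return  # orphan CLOSE: notifier started mid-connection, drop it
        if self.current == key:
            self.current = None
        kind = "activation" if details else "probe"
        self.events.append({"conn": key, "kind": kind, **details})
```

With this shape a probe that opens and closes during a long activation produces its own `probe` event and leaves the activation's details intact.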

Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
  a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.

Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product="Windows 10 Professional",
status="Licensed"} 1. The real WAN client IP is preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
2026-05-22 14:16:40 +00:00
Viktor Barzin
4a3ca572e8 fire-planner: imagePullPolicy=Always on alembic-migrate init container
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.

Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
2026-05-22 14:16:40 +00:00
Viktor Barzin
67b11a964a kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise
Two coupled fixes for the hourly Slack noise + missing client IPs:

1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
   10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
   WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
   kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
   Sharing 10.0.20.200 is blocked because all 10 services there are
   ETP=Cluster and MetalLB requires consistent ETP per shared IP.

2. Slack notifier now suppresses Slack posts for bare TCP open/close
   pairs (no Application/Activation block) — these are Uptime Kuma's
   port monitor and the new kubelet readiness/liveness probes. Probe
   counts go to a new metric kms_connection_probes_total{source} where
   source classifies the IP as internal_pod / cluster_node / external.
   Real activations are unaffected.
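
A sketch of how the probe filter in item 2 might classify connections. The function names match the tests mentioned below, but the signatures are assumptions, and both CIDR ranges are hypothetical placeholders because the real ones are not stated in this commit:

```python
import ipaddress

# Hypothetical ranges for illustration only.
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")
NODE_CIDR = ipaddress.ip_network("10.0.20.0/24")


def classify_source(ip: str) -> str:
    """Map a client IP to the `source` label of kms_connection_probes_total."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_CIDR:
        return "internal_pod"
    if addr in NODE_CIDR:
        return "cluster_node"
    return "external"


def is_probe(details: dict) -> bool:
    """A bare TCP open/close pair carries no Application/Activation fields."""
    return not details
```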

Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.

pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
  smtps, etc.) untouched

Runbook updated. Tests added for classify_source / is_probe / process_line.
2026-05-22 14:16:40 +00:00
28db8fc9d4 fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:

stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
  age toward the 1-year TTL boundary (within 7 days). Calls
  python -m fire_planner col-refresh-stale, upserts via cache.upsert.
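
The "no-op until within 7 days of the TTL" rule could look roughly like this. `needs_refresh` is a hypothetical helper, not the fire_planner implementation, and the 1-year TTL is assumed to mean 365 days:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=365)           # assumption: 1 year == 365 days
REFRESH_WINDOW = timedelta(days=7)  # refresh once within 7 days of expiry


def needs_refresh(fetched_at: datetime, now: datetime) -> bool:
    """True once a row is inside the final 7 days before its TTL expires."""
    return now >= fetched_at + TTL - REFRESH_WINDOW
```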

monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
  $baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.

Initial population needs a one-shot: post-Keel-rollout,
  kubectl -n fire-planner exec deploy/fire-planner -- \
    python -m fire_planner col-seed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:15:38 +00:00
Viktor Barzin
c1cb22896a openclaw: revert model swap + document codex re-auth path
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of calling ssh / curl / etc., the very
tools the v4 learning loop is supposed to leverage.

Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:

  kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
    -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
    -c openclaw -- node /app/openclaw.mjs models auth login \
    --provider openai-codex

modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:12:30 +00:00
Viktor Barzin
247afdb220 cluster-health skill: document tightened #43 thermal threshold (65 C) 2026-05-22 14:09:12 +00:00
Viktor Barzin
4830230984 cluster-health #43: tighten PVE thermal threshold to 65 C
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.

Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.

Updated:
  PASS  < 65 C   (within 55-65 baseline)
  WARN  65-82 C  (elevated; check top kvm processes for the culprit)
  FAIL  >= 83 C  (at/above TjMax — throttling imminent)

Verified live: 67 C now WARN (was PASS under the 75 C threshold).
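
The updated thresholds as a small sketch. `classify_package_temp` is a hypothetical name, and fractional readings between 82 and 83 C are assumed to stay in WARN:

```python
def classify_package_temp(temp_c: float) -> str:
    """Map a package temperature in C to PASS / WARN / FAIL."""
    if temp_c >= 83:
        return "FAIL"  # at/above TjMax
    if temp_c >= 65:
        return "WARN"  # elevated above the 55-65 baseline
    return "PASS"
```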
2026-05-22 14:09:08 +00:00
Viktor Barzin
282d7f6182 openclaw: engrain the learning loop at the identity level
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."

The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:

1. **SOUL.md** — new marker-delimited section "Learning is your
   identity" inserted before ## Boundaries. AGENTS.md tells the
   agent to read SOUL.md first every session, so this is now the
   FIRST thing the agent loads about itself. Frames learning as
   character, not procedure.

2. **TOOLS.md v4** — section moved from the END of the file to
   right after the `# TOOLS.md` title (first substantive content
   on file load). Title strengthened: "THE FLOW — run this on
   EVERY task. Not just hard ones." Concrete examples explicitly
   call out diverse domains (calendar, frigate restart, disk
   usage, inbox summary, deploys) so the universality is
   unmistakable.

3. **learn-from-tasks skill** — opens with "This is universal.
   EVERY task runs through this flow — not just hard ones, not
   just unfamiliar ones. The save at the end is mandatory."

The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.

Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 13:18:52 +00:00
66ca8b9e9c trading-bot: revive K8s stack + add meet-kevin-watcher
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.

Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
  root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
  news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
  viktorbarzin/trading-bot-service:latest, command
  python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
  TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
  secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
  (poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices

vault: re-enable pg-trading static role

- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
  (was disabled 2026-04-06 with the rest of trading-bot stack)

kyverno: add postgres* to trusted-registries allowlist

- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
  alpine*, nginx*, python* which were already there)

Final workers Pod containers (in order):
  [0] signal-generator
  [1] learning-engine
  [2] market-data
  [3] meet-kevin-watcher (NEW)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:23:30 +00:00
Viktor Barzin
60d8d54d6e openclaw: v3 flow — know → ask devvm → (rarely) try yourself
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.

## The flow

1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
   share the steps + credentials needed so I can do it on my own
   next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. It will most likely
   fail, and that's OK.

## Storage moved to memory-indexed location

Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:

- `scripts/<task>.md`       runnable recipes
- `knowledge/<topic>.md`    decisions, paths, gotchas
- `credentials/<name>.md`   **POINTERS to Vault, never values**

## Credentials = Vault pointers only

Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.

## Init container also migrates

Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.
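
A sketch of the strip-then-reinsert step. The marker format and `upsert_section` are hypothetical (the actual markers are not shown in this commit); the point is that running it twice gives the same file and that text outside the markers is untouched:

```python
import re

# Hypothetical marker format; matches any versioned begin/end pair.
MARKER_RE = re.compile(
    r"<!-- devvm-section v\d+ begin -->.*?<!-- devvm-section v\d+ end -->\n?",
    re.DOTALL,
)


def upsert_section(text: str, body: str, version: int = 3) -> str:
    """Remove any prior versioned section, then append the current one."""
    stripped = MARKER_RE.sub("", text)
    block = (
        f"<!-- devvm-section v{version} begin -->\n"
        f"{body}\n"
        f"<!-- devvm-section v{version} end -->\n"
    )
    return stripped.rstrip("\n") + "\n\n" + block
```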

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:20:54 +00:00
Viktor Barzin
8e5d682707 openclaw: explicit "use devvm + learn" default behaviour
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:

1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
   - "TRY DEVVM before giving up" — when stuck, ssh devvm before
     telling the user "I can't do that".
   - "After every task, introspect → save a faster way" — for any
     non-trivial task (especially recurring ones), save the recipe
     to /workspace/learned/ and update INDEX.md.

2. New cc-skill `learn-from-tasks` at
   /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
   both triggers: (A) you're stuck → check INDEX → ask devvm → save;
   (B) you just finished → introspect → save if recurring.

3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
   scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
   INDEX.md BEFORE reaching for devvm, so saved recipes are
   findable on the next run.

4. Marker migration: strips both v1 and v2 markers before re-inserting
   so user edits outside the markers always survive future restarts.

Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:12:33 +00:00
Viktor Barzin
ccf1ccdd1d openclaw: also write devvm section to /workspace/TOOLS.md
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).

Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.

The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:50:42 +00:00
Viktor Barzin
9ad52dfd61 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:20:00 +00:00
Viktor Barzin
c7b0ebf6a5 state(vault): update encrypted state 2026-05-22 10:04:55 +00:00
Viktor Barzin
8228171104 cluster-health: add checks 43 + 44 (PVE host thermals + load)
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.

#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
  Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
  Thresholds tuned to the live TjMax=83 / Tcrit=93:
    PASS  < 75 °C package
    WARN  75-82 °C  (approaching max, action time)
    FAIL  >= 83 °C  (at/above TjMax, throttling imminent)
  Reports hottest core label too so a single hot core doesn't hide in
  the package average.

#44 PVE Host Load — load avg vs 44-thread capacity
  Reads /proc/loadavg, compares 5-min to thread count (44):
    PASS  load_5 < 30   (< 70% threads busy)
    WARN  30-37         (oversubscribed but not saturating)
    FAIL  >= 38         (~85%+ threads busy — scheduler saturation)
  Uses 5-min so brief work spikes don't false-fail.
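
  A sketch of the #44 logic in Python. The function names are hypothetical and the real check runs over SSH from the healthcheck script. The second field of /proc/loadavg is the 5-minute average:

```python
def parse_load5(loadavg_text: str) -> float:
    """Extract the 5-minute load average from /proc/loadavg content."""
    return float(loadavg_text.split()[1])


def classify_load(load_5: float) -> str:
    """Map the 5-minute load on a 44-thread host to PASS / WARN / FAIL."""
    if load_5 >= 38:
        return "FAIL"
    if load_5 >= 30:
        return "WARN"
    return "PASS"
```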

Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
2026-05-22 09:55:11 +00:00
Viktor Barzin
1b21d4819e postiz: disable unused providers + pin temporal vs Keel force-policy
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:

1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
   only 'instagram-standalone' connected; all other Postiz providers
   were idle-polling Temporal task queues. List excludes x, linkedin,
   reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
   discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
   wordpress, nostr, farcaster. Keeps facebook + instagram + the
   standalone variant active.

2. temporal deployment needs keel.sh/policy=never (set live via kubectl
   annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
   on every helm reconcile because :0.20.0 is published in the same
   registry path but is a DIFFERENT (legacy Cassandra-based) image
   stream. Memory id 1933 trap; new variant captured in id 2315-2319.

   The annotation is set live (not in TF) because the existing TF block
   has lifecycle.ignore_changes = [keel.sh/policy] so the chart
   reconcile won't reset it. Long-term fix: add temporal to the
   Kyverno keel-mutate-existing exclude list so it survives a
   namespace re-label.
2026-05-21 10:04:22 +00:00
Viktor Barzin
533a89a010 docs: HA control plane design (3 masters)
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.

Tracked in beads code-n0ow. Plan doc to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:41:20 +00:00
Viktor Barzin
2dc7e001bd k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.

The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.

Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.
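
The retry wrapper, sketched in Python for illustration (the pipeline's own wrapper is presumably shell). `run_with_retries` is a hypothetical helper; the runner and sleep function are injectable so the loop can be exercised without kubeadm:

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(cmd: Sequence[str],
                     attempts: int = 3,
                     delay_s: float = 30.0,
                     run: Callable = subprocess.run,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run cmd up to `attempts` times, pausing `delay_s` between failures."""
    for attempt in range(1, attempts + 1):
        if run(list(cmd)).returncode == 0:
            return True
        if attempt < attempts:
            sleep(delay_s)  # give the kubelet status sync time to catch up
    return False
```

It would be called with the `kubeadm upgrade apply` command line for the target version; a second attempt that finds etcd already on target then proceeds as described above.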

Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:32:29 +00:00
Viktor Barzin
fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window
Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
   start of version-check. Existence halts the chain (exit 0) with a Slack
   message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:  kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only worker
   phase's post-upgrade gate did this. Master Failed silently this morning
   on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.
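
The kill-switch check from item 1 could be sketched like this. The function names are hypothetical and the chain's real check is not shown here. Treating any non-zero `kubectl get` exit as "absent" is a simplification; a stricter gate might halt on API errors too:

```python
import subprocess
from typing import Callable


def killswitch_present(run: Callable = subprocess.run) -> bool:
    """True if the k8s-upgrade-killswitch ConfigMap exists."""
    result = run(
        ["kubectl", "-n", "k8s-upgrade", "get", "configmap",
         "k8s-upgrade-killswitch"],
        capture_output=True,
    )
    return result.returncode == 0


def phase_gate(run: Callable = subprocess.run) -> str:
    """Return "halt" when the kill-switch exists, else "continue"."""
    return "halt" if killswitch_present(run) else "continue"
```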

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.
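
A minimal sketch of the kill-switch check from item 1, assuming the phase scripts are bash. Only the ConfigMap name and namespace come from this message; `KUBECTL`, `killswitch_active`, and `phase_guard` are hypothetical names, and the Slack notification is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# KUBECTL is overridable so the check can be exercised without a cluster
# (an assumption of this sketch, not something the chain is known to do).
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds when the kill-switch ConfigMap exists in the k8s-upgrade namespace.
killswitch_active() {
  "$KUBECTL" -n k8s-upgrade get configmap k8s-upgrade-killswitch >/dev/null 2>&1
}

# Hypothetical phase guard: returns 1 (after logging) when the chain should halt.
phase_guard() {
  if killswitch_active; then
    echo "kill-switch ConfigMap present; halting chain"
    return 1
  fi
  return 0
}

# At the top of a phase script one might write:
#   phase_guard || exit 0
```

Note that this sketch fails open: any `kubectl get` error, not just NotFound, is treated as "no kill-switch". A stricter variant could distinguish the two cases.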

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:23:41 +00:00
Viktor Barzin
944cf51f6b authentik: worker replicas 3 -> 2
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.

Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
2026-05-21 09:14:35 +00:00
Viktor Barzin
8c87b77f1b forgejo: disable source archive ZIP/TAR downloads
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.

Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.

Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).

(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
2026-05-21 09:12:20 +00:00
Viktor Barzin
af6aa18b25 monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.

Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
  LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
  PowerOutage) for stability above the new 2m global cadence

Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.

Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing: apiserver load takes a few minutes to settle
after a Prometheus config reload because of the initial-target-scrape
burst.
2026-05-21 08:32:57 +00:00
Viktor Barzin
aba061cf2e alloy: switch pod log shipping from apiserver to file-tail
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.

Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.

Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m

(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
2026-05-21 08:27:34 +00:00
Viktor Barzin
b6724a5d48 vault: add pg-matrix + pg-technitium static roles to allowed_roles
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.

Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).

- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)

Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
2026-05-21 08:11:11 +00:00
Viktor Barzin
9247a68514 state(vault): update encrypted state 2026-05-21 08:09:11 +00:00
Viktor Barzin
926d507313 k8s-version-upgrade: grant get/list on apps resources for drain
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
2026-05-21 08:07:29 +00:00
Viktor Barzin
1617285d23 infra: add kubectl + authentik providers across 6 stacks
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.

- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
  workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
  manage their own Authentik objects
2026-05-21 08:07:22 +00:00
c09230815c openclaw: enable recruiter-api plugin (allowlist + manifest contracts)
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
   ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
   in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
   83ffd9fa).

Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:56:11 +00:00
57ab903a0c recruiter-responder: deploy d7892396 — OpenClaw-driven flow
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:14:11 +00:00
18928eb8ac recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:10:56 +00:00
Viktor Barzin
0c8b46df55 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` exits with status 1 when all pods are Running
    (the happy path), aborting the chain right after master upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` exits with status 1 when all nodes
    are on the target version (the happy path, exactly when postflight
    should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.
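
A small standalone reproduction of the counting case, as a sketch. The `pods` text is a hypothetical stand-in for the real pod listing; only the `grep -v ... | wc -l` shape and the `{ ... || true; }` guard come from this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the pod-status listing: every pod is Running.
pods=$'etcd Running\nkube-apiserver Running'

# Unguarded: grep -v selects no lines and exits 1; pipefail propagates that
# status even though wc -l itself succeeds.
if count=$(printf '%s\n' "$pods" | grep -v Running | wc -l); then
  unguarded_status=0
else
  unguarded_status=$?
fi

# Guarded: the brace group swallows grep's exit 1, so the pipeline succeeds.
count=$(printf '%s\n' "$pods" | { grep -v Running || true; } | wc -l)

echo "unguarded exit=$unguarded_status, guarded count=$((count))"
```

The unguarded pipeline reports a non-zero status on the happy path, while the guarded one yields a count of zero and a successful exit.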

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:59:10 +00:00
Viktor Barzin
10b261d2db k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).
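
To illustrate the silent abort, here is a hedged reproduction run in a child shell. The function body and the ignore regex are illustrative only, not the chain's actual code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical reproduction script; the alert names and regex are made up.
repro='
set -euo pipefail
halt_on_alert_query() {
  # Every alert is filtered out, so grep exits 1 and pipefail fails the pipeline.
  printf "%s\n" "Watchdog" | grep -vE "^(Watchdog|InfoInhibitor)$" | sort -u
}
alerts=$(halt_on_alert_query)
echo "reached"
'

# Run it in a child shell so its abort does not kill this script.
if output=$(bash -c "$repro" 2>&1); then
  status=0
else
  status=$?
fi

echo "child exit=$status, output='${output}'"
```

The child exits non-zero at the `alerts=$(...)` assignment and never prints `reached`, which matches the "no diagnostic output" symptom described above.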

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:19:06 +00:00
f5917f0eb3 security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces)
## Change
- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with
  kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'`
  to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted
  (apply_only=true means TF rename does NOT destroy the live old resource;
  cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan
  (cluster infra + GPU workloads, deferred)

## Verification (live cluster, 2026-05-19)
- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata
  from multiple namespaces with distinct destinations:
  - east-west pod-to-pod (10.10.108.48, 10.10.122.131)
  - in-cluster service VIP (10.96.0.10 — kube-dns)
  - external (149.154.166.110 — Telegram API from recruiter-responder)

## W1.7 next step (calendar-bound, ~1 week)
- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
e9054e6b1b security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
   with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
   `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
   `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
   SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
Viktor Barzin
7a1751a668 upgrade-state: filter transient registry digest-check errors
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.

Extend benign_re to cover:
  - failed to check digest + (i/o timeout | connection refused
    | connection reset | context deadline exceeded | TLS handshake
    timeout | no such host | EOF)
  - failed to check digest + non-successful response (status=5xx)

Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.

Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".
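
A rough sketch of what such a filter could look like. The actual `benign_re` is not reproduced in this message, so the pattern and the sample log lines below are hypothetical and only mirror the categories listed above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical pattern: transient network errors or 5xx responses on digest checks.
benign_re='failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|EOF|non-successful response \(status=5[0-9][0-9]\))'

# Succeeds when the given log line matches the benign pattern.
is_benign() {
  grep -Eq "$benign_re" <<<"$1"
}

# Made-up sample lines, one per outcome.
for line in \
  'failed to check digest: dial tcp: i/o timeout' \
  'failed to check digest: non-successful response (status=503)' \
  'failed to check digest: non-successful response (status=401)'; do
  if is_benign "$line"; then
    echo "benign:     $line"
  else
    echo "actionable: $line"
  fi
done
```

Anchoring every alternative to the `failed to check digest` prefix keeps unrelated timeouts elsewhere in the logs from being suppressed.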

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:06:21 +00:00
Viktor Barzin
359b0277f8 dbaas: opt MySQL out of Keel + add do-not-bump warning
Two changes to make the 8.4.8 pin durable:

1. Add `keel.sh/policy: never` annotation on the mysql-standalone
   StatefulSet. The dbaas namespace was already excluded from the
   Kyverno mutate, but the StatefulSet carried orphan Keel annotations
   (force/poll/match-tag) from an earlier policy version that lacked
   the exclusion list. Keel kept watching :8.4.8 for digest changes.
   Now explicitly opted out; Keel logged "image no longer tracked".

2. Expand the inline comment to a banner pointing at the upgrade plan
   docs and the gating beads task. Anyone touching this line sees the
   warning + the path to do it right.

Closes the loop on the 2026-05-18 outage. Real upgrade tracked in
code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:21:03 +00:00
Viktor Barzin
0fab599dbc state(dbaas): update encrypted state 2026-05-19 13:20:39 +00:00
Viktor Barzin
9fd54143c2 docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade
Captures the wipe+reinit strategy (sidestepping the broken MySQL data
dictionary upgrade path), the IO config bump (innodb_io_capacity
100→2000), root-cause analysis with explicit uncertainty, verification
gates, and rollback.

Not scheduled yet. Tracked in beads code-963q.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:10:00 +00:00
669ba97078 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64.0.0/10 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)
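
A hedged illustration of why the `*/*` catch-all had to go, using bash globs as a stand-in for the policy's wildcard matching (the real matching is done by Kyverno and may differ in detail). `image_allowed` is a hypothetical helper, and only two of the allowlist patterns are shown.

```shell
#!/usr/bin/env bash
set -euo pipefail

# image_allowed IMAGE PATTERN...: succeeds if IMAGE matches any glob pattern.
image_allowed() {
  local image=$1 pattern
  shift
  for pattern in "$@"; do
    # An unquoted right-hand side of == inside [[ ]] is matched as a glob.
    if [[ $image == $pattern ]]; then
      return 0
    fi
  done
  return 1
}

# Two patterns quoted in the verification notes; the real allowlist is longer.
allowlist=('alpine*' 'docker.io/*')

for image in 'evilcorp.example/malware:v1' 'alpine:3.20' 'docker.io/library/alpine:3.20'; do
  if image_allowed "$image" "${allowlist[@]}"; then
    echo "ALLOWED $image"
  else
    echo "BLOCKED $image"
  fi
done

# With the old catch-all present, the untrusted image would pass as well.
if image_allowed 'evilcorp.example/malware:v1' "${allowlist[@]}" '*/*'; then
  echo "ALLOWED by */* catch-all: evilcorp.example/malware:v1"
fi
```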

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables LOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00