Commit graph

1024 commits

Author SHA1 Message Date
Viktor Barzin
503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.

- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
  drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
  speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
  with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
  routinely shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
  cause brief 5xx spikes that clear in 1-2m).

Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:28:53 +00:00
Viktor Barzin
ad9f6c8f41 k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).

Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:14:39 +00:00
Viktor Barzin
68a503e29f kyverno: allowlist woodpeckerci/* for CI step pods
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).

Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).

Closes: code-da4h
2026-05-23 08:52:48 +00:00
000d306542 technitium: add viktorbarzin.me apex DNS drift probe + alerts
Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:

- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
  resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
  and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
  Pushgateway. Python+dnspython, ~30 LOC.

- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
    other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
    succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
    reported.

- Added the new alert names to the Slack receiver matcher in both routes
  alongside EmailRoundtrip*.

Verified: rules loaded as inactive (apex is correct), metric flowing, manual
probe job pass observed.
2026-05-23 08:41:14 +00:00
Viktor Barzin
4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore
Three changes unblocking the autonomous chain for k8s patch upgrades:

1. **phase_master quiesces tigera-operator before drain, restores after.**
   Tigera crashes immediately if apiserver is unreachable (no retry logic)
   and crashlooping it during master static-pod swaps generates ~500MB/s
   disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
   limit. Quiesce removes the storm contributor; calico data plane keeps
   running unchanged (data plane is the DaemonSet+Typha, operator is just
   the reconciler).

2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
   For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
   writes an identical manifest, hash doesn't update, watch times out and
   rolls back forever. The skip-etcd retry sidesteps it for the legitimate
   no-change case while still doing a full etcd upgrade on the first
   attempt (correct for minor-version bumps).

3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
   Both are symptoms-not-causes: ingress latency spikes briefly during any
   pod-restart wave; high IOwait is exactly what upgrade activity causes
   (chicken-and-egg). The inline quiet-baseline check (Ready transition
   <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00
6f4a569d1c traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile
Browsers accumulate one authentik_proxy_<random> cookie per Authentik
Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the
combined Cookie header exceeds nginx's default 4 x 8k
large_client_header_buffers and trips '431 Request Header Fields Too
Large' at the forward-auth nginx (traefik/auth-proxy).

Bumped to:
  client_header_buffer_size 8k
  large_client_header_buffers 8 64k

Matches the pattern used on the London Flint 2 router nginx
(memory id=647).
2026-05-23 08:34:33 +00:00
70a334e431 trading-bot: pin Meet Kevin LLM model to claude-haiku-4-5
Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth
bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour
quota. Haiku-4-5 has much higher RPM and processes the 16-video
backfill cleanly (~30s/video with inter-call throttle).

Comment above the env line documents the rationale for future re-evaluation.
2026-05-22 20:43:05 +00:00
5258f09230 mailserver: decommission SendGrid
Remove leftover SendGrid references after the Brevo migration was completed:

- Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`,
  SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge
  once the CNAME is removed).
- Clean up commented-out `smtp.sendgrid.net` relayhost references and the
  `# For sendgrid` comment on `sasl_passwd` in the mailserver module.

DNS records deleted out-of-band (not TF-managed):
- CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries)
- Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`,
  `s2._domainkey` CNAMEs → sendgrid.net

Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive,
roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey`
local + `brevo1/brevo2._domainkey` Brevo) untouched.
2026-05-22 20:08:38 +00:00
Viktor Barzin
b233aba710 openclaw: switch primary to nim/meta/llama-3.1-70b-instruct
Auth audit on 2026-05-22 — all the broken paths and the one that works:

- openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com)
- secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota
- openrouter_api_key: "Key limit exceeded (total limit)"
- llama_api_key: region-blocked
- anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real
  x-api-key — won't auth via x-api-key header)
- nvidia_api_key (NIM): WORKS. The key was already baked into the
  openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key.

Two NIM models verified end-to-end (call from inside openclaw pod
with tool-call schema, both returned proper {tool_calls:[…]} JSON):
- meta/llama-3.1-70b-instruct      — 0.58s, primary
- meta/llama-4-maverick-17b-128e   — 16s, smarter, fallback

Fallback chain: maverick → openai-codex (auto-promotes once re-authed)
→ modelrelay/auto-fastest (last resort, hallucinates instead of
tool-calling, but at least responds).

Models registered in both `agents.defaults.models` (allowlist) and
`models.providers.nim.models` (capability declarations) so the agent
sees them as available tools. Startup `models set` updated to pin
the new primary across `doctor --fix` runs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:23:17 +00:00
2d35d72a53 kyverno(wave1): add 7 missing registries to trusted-registries allowlist
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.

Added entries:
- amruthpillai/*       (resume — reactive-resume)
- athomasson2/*        (ebook2audiobook)
- netboxcommunity/*    (netbox)
- nousresearch/*       (hermes-agent)
- opentripplanner/*    (osm-routing)
- rhasspy/*            (whisper, piper)
- registry.viktorbarzin.me/*  (legacy private registry — council-complaints
                                still references; should migrate to forgejo)

The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.

## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
  same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now PASS:
  - netboxcommunity/netbox:v4.5.0-beta1 ✓
  - rhasspy/wyoming-whisper:3.1.0 ✓
  - registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode

## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:17:16 +00:00
Viktor Barzin
c11ac7d486 cnpg: bump webhook-cert renewal threshold 7d -> 30d
Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn:

CNPG default 'expiringCheckThreshold = 7' means the operator only
regenerates the self-signed webhook cert when remaining lifetime drops
BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result:
~23 days of WARN before CNPG would even attempt rotation.

Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the
operator now regenerates with 30d buffer, aligning with our monitoring
threshold. Cert lifetime stays at chart default 90d.

Verified after apply: operator runtime config shows
'expiringCheckThreshold:30'. Companion in-session action: deleted the
existing soon-to-expire secret and bounced the operator to force an
immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).
2026-05-22 15:00:41 +00:00
6367b783c7 broker-sync(imap): fix command name + add fsGroup for sync.db writes
Two latent issues found while diagnosing why the May 2026 META vest
didn't land:

1. broker-sync-imap CronJob's command was 'broker-sync imap', but the
   actual CLI subcommand is 'imap-ingest'. Every scheduled run had
   been failing with 'No such command imap' since day-one.

2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775
   group=10001. Without fsGroup in the pod's securityContext the
   pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't
   create journal/WAL files next to sync.db -- hits
   'attempt to write a readonly database'. fsGroup=10001 adds the
   matching gid to the pod's supplemental groups so writes work.

Schwab email-sender regex fix is in broker-sync@d860aef.
2026-05-22 14:41:54 +00:00
28db8fc9d4 fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:

stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
  age toward the 1-year TTL boundary (within 7 days). Calls
  python -m fire_planner col-refresh-stale, upserts via cache.upsert.

monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
  $baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.

Initial population needs a one-shot: post-Keel-rollout,
  kubectl -n fire-planner exec deploy/fire-planner -- \\
    python -m fire_planner col-seed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:15:38 +00:00
Viktor Barzin
c1cb22896a openclaw: revert model swap + document codex re-auth path
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.

Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:

  kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
    -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
    -c openclaw -- node /app/openclaw.mjs models auth login \
    --provider openai-codex

modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:12:30 +00:00
Viktor Barzin
282d7f6182 openclaw: engrain the learning loop at the identity level
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."

The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:

1. **SOUL.md** — new marker-delimited section "Learning is your
   identity" inserted before ## Boundaries. AGENTS.md tells the
   agent to read SOUL.md first every session, so this is now the
   FIRST thing the agent loads about itself. Frames learning as
   character, not procedure.

2. **TOOLS.md v4** — section moved from the END of the file to
   right after the `# TOOLS.md` title (first substantive content
   on file load). Title strengthened: "THE FLOW — run this on
   EVERY task. Not just hard ones." Concrete examples explicitly
   call out diverse domains (calendar, frigate restart, disk
   usage, inbox summary, deploys) so the universality is
   unmistakable.

3. **learn-from-tasks skill** — opens with "This is universal.
   EVERY task runs through this flow — not just hard ones, not
   just unfamiliar ones. The save at the end is mandatory."

The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.

Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 13:18:52 +00:00
66ca8b9e9c trading-bot: revive K8s stack + add meet-kevin-watcher
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.

Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
  root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
  news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
  viktorbarzin/trading-bot-service:latest, command
  python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
  TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
  secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
  (poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices

vault: re-enable pg-trading static role

- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
  (was disabled 2026-04-06 with the rest of trading-bot stack)

kyverno: add postgres* to trusted-registries allowlist

- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
  alpine*, nginx*, python* which were already there)

Final workers Pod containers (in order):
  [0] signal-generator
  [1] learning-engine
  [2] market-data
  [3] meet-kevin-watcher (NEW)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:23:30 +00:00
Viktor Barzin
60d8d54d6e openclaw: v3 flow — know → ask devvm → (rarely) try yourself
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.

## The flow

1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
   share the steps + credentials needed so I can do it on my own
   next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
   that's OK.

## Storage moved to memory-indexed location

Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:

- `scripts/<task>.md`       runnable recipes
- `knowledge/<topic>.md`    decisions, paths, gotchas
- `credentials/<name>.md`   **POINTERS to Vault, never values**

## Credentials = Vault pointers only

Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.

## Init container also migrates

Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:20:54 +00:00
Viktor Barzin
8e5d682707 openclaw: explicit "use devvm + learn" default behaviour
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:

1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
   - "TRY DEVVM before giving up" — when stuck, ssh devvm before
     telling the user "I can't do that".
   - "After every task, introspect → save a faster way" — for any
     non-trivial task (especially recurring ones), save the recipe
     to /workspace/learned/ and update INDEX.md.

2. New cc-skill `learn-from-tasks` at
   /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
   both triggers: (A) you're stuck → check INDEX → ask devvm → save;
   (B) you just finished → introspect → save if recurring.

3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
   scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
   INDEX.md BEFORE reaching for devvm, so saved recipes are
   findable on the next run.

4. Marker migration: strips both v1 and v2 markers before re-inserting
   so user edits outside the markers always survive future restarts.

Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:12:33 +00:00
Viktor Barzin
ccf1ccdd1d openclaw: also write devvm section to /workspace/TOOLS.md
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).

Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.

The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:50:42 +00:00
Viktor Barzin
9ad52dfd61 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:20:00 +00:00
Viktor Barzin
1b21d4819e postiz: disable unused providers + pin temporal vs Keel force-policy
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:

1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
   only 'instagram-standalone' connected; all other Postiz providers
   were idle-polling Temporal task queues. List excludes x, linkedin,
   reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
   discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
   wordpress, nostr, farcaster. Keeps facebook + instagram + the
   standalone variant active.

2. temporal deployment needs keel.sh/policy=never (set live via kubectl
   annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
   on every helm reconcile because :0.20.0 is published in the same
   registry path but is a DIFFERENT (legacy Cassandra-based) image
   stream. Memory id 1933 trap; new variant captured in id 2315-2319.

   The annotation is set live (not in TF) because the existing TF block
   has lifecycle.ignore_changes = [keel.sh/policy] so the chart
   reconcile won't reset it. Long-term fix: add temporal to the
   Kyverno keel-mutate-existing exclude list so it survives a
   namespace re-label.
2026-05-21 10:04:22 +00:00
Viktor Barzin
fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window
Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
   start of version-check. Existence halts the chain (exit 0) with a Slack
   message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:  kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only worker
   phase's post-upgrade gate did this. Master Failed silently this morning
   on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:23:41 +00:00
Viktor Barzin
944cf51f6b authentik: worker replicas 3 -> 2
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.

Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
2026-05-21 09:14:35 +00:00
Viktor Barzin
8c87b77f1b forgejo: disable source archive ZIP/TAR downloads
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.

Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.

Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).

(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
2026-05-21 09:12:20 +00:00
Viktor Barzin
af6aa18b25 monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.

Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
  LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
  PowerOutage) for stability above the new 2m global cadence

Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.

Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing — apiserver settles ~minutes after Prometheus
config reload due to initial-target-scrape burst.
2026-05-21 08:32:57 +00:00
Viktor Barzin
aba061cf2e alloy: switch pod log shipping from apiserver to file-tail
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.

Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.

Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m

(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
2026-05-21 08:27:34 +00:00
Viktor Barzin
b6724a5d48 vault: add pg-matrix + pg-technitium static roles to allowed_roles
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.

Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).

- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)

Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
2026-05-21 08:11:11 +00:00
Viktor Barzin
926d507313 k8s-version-upgrade: grant get/list on apps resources for drain
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
2026-05-21 08:07:29 +00:00
Viktor Barzin
1617285d23 infra: add kubectl + authentik providers across 6 stacks
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.

- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
  workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
  manage their own Authentik objects
2026-05-21 08:07:22 +00:00
c09230815c openclaw: enable recruiter-api plugin (allowlist + manifest contracts)
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
   ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
   in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
   83ffd9fa).

Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:56:11 +00:00
57ab903a0c recruiter-responder: deploy d7892396 — OpenClaw-driven flow
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:14:11 +00:00
18928eb8ac recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:10:56 +00:00
Viktor Barzin
0c8b46df55 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` returns 1 when all pods are Running
    (the happy path), aborting the chain right after master upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes
    are on the target version (the happy path, exactly when postflight
    should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:59:10 +00:00
Viktor Barzin
10b261d2db k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:19:06 +00:00
f5917f0eb3 security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces)
## Change
- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with
  kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'`
  to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted
  (apply_only=true means TF rename does NOT destroy the live old resource;
  cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan
  (cluster infra + GPU workloads, deferred)

## Verification (live cluster, 2026-05-19)
- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata
  from multiple namespaces with distinct destinations:
  - east-west pod-to-pod (10.10.108.48, 10.10.122.131)
  - in-cluster service VIP (10.96.0.10 — kube-dns)
  - external (149.154.166.110 — Telegram API from recruiter-responder)

## W1.7 next step (calendar-bound, ~1 week)
- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
e9054e6b1b security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
   with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
   `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
   `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
   SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
Viktor Barzin
359b0277f8 dbaas: opt MySQL out of Keel + add do-not-bump warning
Two changes to make the 8.4.8 pin durable:

1. Add `keel.sh/policy: never` annotation on the mysql-standalone
   StatefulSet. The dbaas namespace was already excluded from the
   Kyverno mutate, but the StatefulSet carried orphan Keel annotations
   (force/poll/match-tag) from an earlier policy version that lacked
   the exclusion list. Keel kept watching :8.4.8 for digest changes.
   Now explicitly opted out; Keel logged "image no longer tracked".

2. Expand the inline comment to a banner pointing at the upgrade plan
   docs and the gating beads task. Anyone touching this line sees the
   warning + the path to do it right.

Closes the loop on the 2026-05-18 outage. Real upgrade tracked in
code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:21:03 +00:00
669ba97078 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
8ef4f06ac0 recruiter-responder: bump image to 444fa58c (header CRLF fix)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 22:55:09 +00:00
Viktor Barzin
ea475c3d86 dbaas: pin MySQL to 8.4.8, recover from broken 8.4.9 DD upgrade
The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose
data-dictionary upgrade got stuck mid-flight on every attempt
(no progress, no CPU, never completing). Pinning to 8.4.8 +
restoring from the 2026-05-18 00:30 UTC mysqldump puts us back
on a known-good binary.

Closes: code-eme8
Closes: code-k40p

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 22:46:54 +00:00
90e074a4a2 kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.

## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
  stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
  declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
  with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
  kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
  because the kubectl provider couldn't patch it cleanly (resourceVersion=0
  invalid for update — gotcha when adopting a resource previously
  kubernetes_manifest-owned).

## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.

## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)

## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.

## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-e2dp
2026-05-18 20:10:27 +00:00
0560d81f3a monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane
## Loki + Alloy re-enabled (code-146x)
- Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify,
  kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource
- Reverses the documented "operational overhead vs benefit after node2 incident"
  decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc)
  needs Loki + ruler + alert routing.
- SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled
  pointed at prometheus-alertmanager.monitoring.svc:9093
- Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes
  to Loki
- Loki canaries running (4)
- Vault audit-tail sidecar logs now flowing to Loki: queried
  {namespace="vault",container="audit-tail"} returns live audit JSON

## Wave 1 alert rules deployed (W1.3 partial)
Added "Security Wave 1" rule group to loki_alert_rules configmap:
- V1: VaultRootTokenCreated — auth/token/create with policies=[root]
- V2: VaultAuditDeviceModified — sys/audit/* create/delete/update
- V3: VaultSealChanged — sys/seal update
- V4: VaultPolicyModified — sys/policies/acl/* create/update/delete
- V5: VaultAuthFailureSpike — >10 permission denied/min
- V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP
  (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet)
- S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined,
  fires once promtail/Alloy ships sshd journal with job=sshd-pve)

Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1
which needs the audit policy + apiserver log shipping codified.

## #security Slack lane (Alertmanager)
- New `slack-security` receiver in prometheus_chart_values.tpl, channel #security
- Higher-priority route at top of routes list: matchers `lane = security` →
  slack-security, continue: false (so wave 1 alerts never fall through to #alerts)
- Slack message format includes summary + description + runbook link annotation
- All wave 1 rules set `lane = "security"` label

## Resource summary
- 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource,
  kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify,
  + 1 other
- 5 changed: helm_release.prometheus (alertmanager config — new receiver + route),
  4 deployments (image tag drift from Keel-managed images, unrelated)
- 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"]
  (timestamp-triggered always recreates — not destructive)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-146x
2026-05-18 19:51:57 +00:00
82fedf1336 security(wave1): Vault audit-tail sidecar (live) + doc reality-check
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
  `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
  from the chart's auditStorage), emits JSON audit events to stdout. kubelet
  captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
  these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
  cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
  audit lines from ESO token issuance, KV reads, etc.

## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.

- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
  code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x

## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
  investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).

## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:37:36 +00:00
f30c141270 security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash)
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
  Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
  vault audit log entry shows Traefik's pod IP instead of the real client IP —
  the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
  real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
  for_each unknown issues unrelated to this change; -target documented in error
  message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
  update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
  within ~10s each. Verified XFF config visible in pod's
  /vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
  /vault/audit/vault-audit.log) — no change needed.

## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
  from memory id=1970 + `frigate, kured, default, changedetection` discovered
  during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
  pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
  deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
  chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
  (current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
  would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.

**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.

## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:26:39 +00:00
b85a5ecebd monitoring(wealth): drop 6y timeFrom override on META vest cadence 2026-05-18 19:25:29 +00:00
Viktor Barzin
4f64f51bba realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api
Companion to the GHA migration in immovika/realestate-crawler@c2acbf5.

Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four
Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler
is private, the Deployments had no imagePullSecrets, and Keel's poll-secret
discovery list came up empty. Pods kept running only because the image
landed in containerd cache months ago.

Adds:
- ExternalSecret `dockerhub-pull-secret` synced from Vault
  secret/viktor.dockerhub_registry_password. ESO template renders the
  dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in
  cleartext in any K8s manifest.
- image_pull_secrets { name = "dockerhub-pull-secret" } on all 4
  Deployments (ui, api, celery, celery-beat).
- Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts
  :latest. CI no longer patches the image to a numeric tag — Keel now
  drives rollouts from digest changes on :latest.

Live state after apply: all 4 Deployments on :latest with
imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True.
Once a GHA build pushes a new digest, Keel will roll all four within ~1h.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:11:43 +00:00
0a4ce162f8 monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side
- Removed panel 27 (META RSU vest value over time) — superseded by
  vest-cadence chart which carries the same value signal plus the
  share-count overlay.
- Removed panel 28 (per-vest value at vest vs today) — duplicative with
  panel 31's FIFO realized PNL.
- Removed panel 29 (per-sell realized PNL) — same data as panel 31,
  just rolled up by sell date instead of vest date.
- Resized panel 26 (Positions) to w=12 and moved panel 30
  (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side
  next to the Positions table.
- Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU
  chart used to live.
2026-05-18 19:11:13 +00:00
20018cd9b4 monitoring(wealth): per-vest realized PNL via FIFO sell-match
New table panel below the per-sell breakdown. For each vest, FIFO-match
its shares against the subsequent sells (shares from earlier vests get
sold first), and aggregate the matched portions:

  realized_pnl = SUM(matched_qty * (sell_price - vest_price))
  pnl_pct      = realized_pnl / SUM(matched_qty * vest_price) * 100
  days_held    = AVG(sell_date - vest_date) per matched portion

Footer reducer sums shares, vest value, sell value, and realized PNL
so the bottom row is the full-portfolio realized take.
2026-05-18 18:43:46 +00:00
14befe3998 monitoring(wealth): META vest cadence chart — value vs shares (dual axis)
Per-vest event line chart. Left Y axis (blue): vest value at the
time = SUM(quantity * unit_price), in USD. Right Y axis (orange):
number of shares vested. One point per vest date (aggregated when
multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares).

Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 ->
60s) and how the per-vest USD value tracked META's price ride
across 2020-2026. timeFrom='6y' override pins the panel to the full
vesting window.
2026-05-18 18:38:36 +00:00
3d5841a776 monitoring(wealth): META vest + sell PNL tables with FIFO cost basis
Two new bottom-of-dashboard tables:

Panel 28 'META vests — value at vest vs today': one row per BUY
activity. Shows vest-day price * shares + what those same shares
would be worth at today's META quote, plus the hypo P&L if Viktor
had held everything (color-text on the gain columns).

Panel 29 'META sells — realized PNL vs if held until today':
one row per SELL with FIFO-matched cost basis (LEAST/GREATEST
overlap in cumulative-share space). Shows realized P&L, the
counterfactual P&L had he held until today, and the
'missed by' delta = (today_price - sell_price) * shares.

Both pull today_price dynamically from quote_latest via a CTE so
they self-update as Yahoo updates the META quote. Schwab account
is empty so no live activity is expected.
2026-05-18 18:36:00 +00:00