Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.
Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts are deferred,
and a service mesh / mTLS / SPIFFE is rejected: the trust model needs
attribution-grade forensics on a trusted, etcd-constrained cluster, not
cryptographic non-repudiation. This is the service-mesh evaluation the
2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.
Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.
Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.
Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Record the new reconcile step alongside install_memory/install_playwright: it
installs vendored, per-user copies of the 16-skill set for the SKILL_USERS
allowlist (emo). Also record why the set is vendored rather than fetched via npx
(upstream drift), and that the if-absent check keys on the user's own copy, so
it heals a stale or cross-user ~/.claude/skills symlink (emo's grill-me pointed
into the admin's home).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.
Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
so the cap applies deterministically on rollout even though the image is
:latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).
Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.
(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)
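For illustration only, the same cap expressed in Python (the actual fix is the
shell `ulimit -n 65536` above). The helper names `capped_nofile` and `apply_cap`
are hypothetical; 65536 is the value from this commit.

```python
import resource

NOFILE_CAP = 65536  # same value as the `ulimit -n 65536` in this commit


def capped_nofile(soft: int, hard: int, cap: int = NOFILE_CAP) -> tuple[int, int]:
    """Return (soft, hard) lowered to at most `cap`.

    RLIM_INFINITY is treated as "above any cap".
    """
    def clamp(value: int) -> int:
        if value == resource.RLIM_INFINITY or value > cap:
            return cap
        return value

    return clamp(soft), clamp(hard)


def apply_cap() -> tuple[int, int]:
    """Lower this process's RLIMIT_NOFILE; children started later inherit it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_limits = capped_nofile(soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, new_limits)
    return new_limits
```

A limit that is already below the cap is left alone, so the sketch only ever
lowers the limit, which an unprivileged process is allowed to do.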
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits on the first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).
- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
the :9222 NetworkPolicy), assert non-headless via /json/version,
connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
outside the cluster" section.
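A minimal sketch of the "assert non-headless via /json/version" step, written in
Python for illustration (the CLI itself is Go). It assumes a headless browser
reveals itself through the `HeadlessChrome` product token in the `Browser` or
`User-Agent` field of the DevTools `/json/version` response; the function name
and the sample payloads are hypothetical, and the real check may differ.

```python
import json


def is_headless(version_payload: str) -> bool:
    """Return True if a /json/version body looks like a headless browser.

    Assumption: headless builds report the "HeadlessChrome" product token in
    the Browser and/or User-Agent fields.
    """
    info = json.loads(version_payload)
    fields = (info.get("Browser", ""), info.get("User-Agent", ""))
    return any("HeadlessChrome" in field for field in fields)


# Illustrative payloads only, not captured from the cluster.
HEADFUL_SAMPLE = json.dumps({
    "Browser": "Chrome/130.0.0.0",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36",
})
HEADLESS_SAMPLE = json.dumps({
    "Browser": "HeadlessChrome/130.0.0.0",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/130.0.0.0 Safari/537.36",
})
```

A caller would refuse to proceed when `is_headless(...)` is true, since the
point of the verb is to reach the headful browser.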
Closes: code-nepg
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.
Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.
The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.
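The threshold rule can be sketched as follows. This is an illustration of the
decision described above, not earlyoom's code: the function names are
hypothetical, the 5% / 3% values come from this commit, and swap is left out
because `-s 100` makes the swap condition always true.

```python
from typing import Optional


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value} (values in kiB)."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def decide(meminfo: dict[str, int],
           term_pct: float = 5.0,
           kill_pct: float = 3.0) -> Optional[str]:
    """Pick the signal for the current MemAvailable percentage, if any."""
    avail_pct = 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
    if avail_pct <= kill_pct:
        return "SIGKILL"
    if avail_pct <= term_pct:
        return "SIGTERM"
    return None
```

Because the input is MemAvailable rather than reclaim activity, the rule still
triggers with swap disabled, which is the property oomd lacked here.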
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.
Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.
This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.
Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).
Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).
Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector
Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").
Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.
What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
running version, detector freshness, detected target, outcome (no-op /
blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
`k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
(or any future helper) can't false-trip the chain-wedged alarm.
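The effect of the narrower regex can be checked with a small Python sketch.
Prometheus regex matchers are fully anchored, so `fullmatch` is used; the job
name suffixes below are made up for the example.

```python
import re

OLD = re.compile(r"k8s-upgrade-.*")
NEW = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def matches(pattern: re.Pattern, job_name: str) -> bool:
    """Mimic a Prometheus =~ matcher, which is anchored at both ends."""
    return pattern.fullmatch(job_name) is not None


# Hypothetical job names: one chain job, one job spawned by the report CronJob.
CHAIN_JOB = "k8s-upgrade-preflight-abc12"
REPORT_JOB = "k8s-upgrade-nightly-report-29000000"
```

Under the old pattern both names match; under the new one only the chain job
does, so a failed report job no longer looks like a wedged chain.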
Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.
Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
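The `unless on()` gate behaves like the following Python sketch: with an empty
`on()` label list, any sample on the right-hand side drops every series on the
left. The function name and arguments are hypothetical; the real rule is the
PromQL expression above.

```python
from typing import Optional


def chain_job_failed_alerts(failed_jobs: list[str],
                            blocked_gauge: Optional[float]) -> list[str]:
    """Return the failed jobs that should still alert.

    blocked_gauge is the current k8s_upgrade_blocked value, or None when the
    gauge has no sample.
    """
    if blocked_gauge == 1:
        # Deliberate block: the preflight failure is the intended outcome.
        return []
    return list(failed_jobs)
```

A genuine wedge exits 1 without setting the gauge, so it falls through to the
second branch and still alerts.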
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.
This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
allowlist comment claimed council-complaints as the last referencer;
rewrite it (no live workload pulls from that registry now; only stale
completed Job records still carry the ref). The allowlist line itself
is kept (registry-scoped, not app-specific).
Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.
This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.
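A minimal sketch of the idle gate described above, assuming the caller has
already read `active_turn_id` from state.sqlite and knows the time of the last
activity. The function name and the 15-minute buffer are hypothetical; the
design doc does not fix a value here.

```python
from datetime import datetime, timedelta
from typing import Optional

QUIET_BUFFER = timedelta(minutes=15)  # illustrative value, not from the design


def safe_to_restart(active_turn_id: Optional[str],
                    last_activity: datetime,
                    now: datetime,
                    quiet_buffer: timedelta = QUIET_BUFFER) -> bool:
    """Allow the deferred restart only in a real quiet gap."""
    if active_turn_id is not None:
        return False  # a turn is in flight; never interrupt it
    return now - last_activity >= quiet_buffer
```

Both conditions must hold: an idle instance that was active a moment ago is
still skipped until the buffer has elapsed.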
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).
Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.
- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.
Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
non-interactive ssh to the HA host using the invoking user's key.
Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
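The failure mode above (acting on `users[0]` of a fuzzy search) can be avoided
with an exact-match guard. The sketch below is illustrative and is not part of
this commit: the helper name and sample usernames are hypothetical, and it
assumes each search result is a dict with a `username` field.

```python
def pick_exact_user(results: list[dict], username: str) -> dict:
    """Return the single result whose username matches exactly.

    Raises LookupError instead of guessing when there are zero or several
    exact matches, so a cleanup step can never fall through to users[0].
    """
    exact = [user for user in results if user.get("username") == username]
    if len(exact) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(exact)}"
        )
    return exact[0]


# Hypothetical search results: the fuzzy match puts a real account first.
SEARCH_RESULTS = [
    {"pk": 1, "username": "real-account"},
    {"pk": 2, "username": "demo-account"},
]
```

With this guard a test that asks for a user that does not exist fails loudly
before any DELETE is issued.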
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey +
Authentik OIDC), so the Authentik-side self-enrollment machinery is dead.
Removes the tripit-enrollment + tripit-recovery flows and all their
stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and
the 'TripIt External' group; reverts the admin-services-restriction fence branch
that contained those users (its sole member, the leftover tripit-demo@ test
account, was deleted first, so the revert affects zero live principals). Real
external collaborators (type=external) are untouched.
tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*).
Closes tripit#97; moots the B2 per-app OIDC fences.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
running minor are never addon-blocked (target_minor <= running_minor returns
early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
a 1.34.x patch.
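The minor-jump scoping can be sketched like this. It is an illustration of the
rule stated above, not the contents of compat-gate.py: the function names and
the ceiling table are hypothetical (the ceilings reuse the ESO and kyverno
figures quoted elsewhere in this log).

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn "v1.35.6" or "1.35" into a tuple of ints, tolerating a v prefix."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def minor_of(version: str) -> tuple[int, ...]:
    """Return the (major, minor) pair of a version string."""
    return parse_version(version)[:2]


def addon_blockers(running: str, target: str,
                   addon_max_k8s: dict[str, str]) -> list[str]:
    """List addons whose supported k8s ceiling is below the target minor."""
    if minor_of(target) <= minor_of(running):
        return []  # patch within the running minor: never addon-blocked
    return sorted(
        name for name, ceiling in addon_max_k8s.items()
        if minor_of(ceiling) < minor_of(target)
    )


# Illustrative ceilings only.
CEILINGS = {"external-secrets": "1.31", "kyverno": "1.34"}
```

The early return is what keeps a conservative ceiling from blocking a patch
upgrade on the minor the cluster already runs.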
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).
- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
"exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
auth. ADR docs/adr/0011.
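A sketch of what one emitted event and the ranking query could look like, in
Python for illustration (the CLI is Go). The payload follows Loki's JSON push
format; the function names are hypothetical, and the job label is taken from
the query above.

```python
import time
from typing import Optional


def build_push_payload(user: str, verb: str, exit_code: int, version: str,
                       ts_ns: Optional[int] = None) -> dict:
    """Build a Loki push body carrying only verb path, exit code and version."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [{
            "stream": {"job": "homelab-usage", "user": user, "verb": verb},
            # Loki expects [<unix nanoseconds as a string>, <log line>] pairs.
            "values": [[str(ts_ns), f"exit={exit_code} ver={version}"]],
        }]
    }


def usage_top_query(since: str = "30d") -> str:
    """Return the LogQL that ranks verbs by invocation count."""
    return f'sum by (verb)(count_over_time({{job="homelab-usage"}}[{since}]))'
```

Nothing in the payload is derived from command arguments, which mirrors the
privacy rule stated above.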
Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.
Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).
- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
the proxied-app coverage caveat.
Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.
- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
--provider github` (name "github", matching the callback registered on
the GitHub OAuth App). Like the existing Authentik source, it lives in
Forgejo's DB rather than Terraform — there's no clean TF resource for
login sources. Client id/secret mirrored to Vault secret/viktor
(forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
[oauth2_client], so a first GitHub sign-in creates the account directly
("sign up with GitHub") instead of a link-to-existing detour. The
GitHub identity is the trust gate for this path; Turnstile + email
confirmation still gate the native form.
Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.
Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.
Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:
- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
/captcha via the chart `volumes` value — the pulled Yaegi plugin does not
expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
placeholder reCAPTCHA keys; referenced by zero .tf).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:
- Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
is managed in Terraform (turnstile.tf) via the CF Global API key, so
the sitekey/secret are IaC, not a dashboard artifact.
- Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
(mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
credential Authentik uses (email-secret.tf ESO -> secret/authentik
smtp_password).
Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.
Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.
Runbook: docs/runbooks/forgejo-open-signups.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the remaining verbs that pass the "saves reasoning, not just typing" test
the user posed mid-session: each encodes the non-obvious
which-endpoint-is-reached-how resolution that is otherwise re-derived every
time. (The same test deprioritized node-ssh and secret-get aliasing, which are
thin wrappers over commands already known.)
- net check <host> [path]: two-legged reachability — external (public DNS→CF)
vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
(prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.
All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.
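
A minimal Python analogue of the `curl --resolve` pattern described above,
included only to illustrate the idea (the CLI itself does this in Go). It dials
the LB IP directly while presenting the ingress hostname as SNI, so certificate
verification still runs against the real name. The function names
`build_request` and `fetch_via_lb` are hypothetical, and the sketch assumes the
ingress answers plain HTTP/1.1 over TLS on port 443.

```python
import socket
import ssl

LB_IP = "10.0.20.203"  # internal Traefik LB address quoted in this commit


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET whose Host header names the ingress."""
    if not path.startswith("/"):
        path = "/" + path
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_via_lb(host: str, path: str = "/", lb_ip: str = LB_IP,
                 timeout: float = 5.0) -> bytes:
    """Connect to the LB IP, but use `host` for SNI and cert verification."""
    ctx = ssl.create_default_context()
    with socket.create_connection((lb_ip, 443), timeout=timeout) as raw:
        # server_hostname sets the SNI value and is also the name the
        # server certificate is validated against.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(host, path))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)
```

Splitting out `build_request` keeps the part that needs no network separate from
the part that dials the LB, so the request shape can be checked offline.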
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
Slack for why" signal. (Until monitoring is applied, a block still surfaces via
the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
matrix maintenance, and the detector minor-probe fix.
After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions: watching a build/deploy to completion (proven in the session that
built it: hours spent hand-rolling Woodpecker polling + DB-schema spelunking
for one CI incident).
- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
so the cert verifies — the Go form of the house `curl --resolve` pattern),
token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
retries that ride out Woodpecker's intermittent empty responses. watch matches
the HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
reliable; status/watch use the working list endpoint).
Verified ci status/watch against the live API.
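
The "image-sha match THEN rollout status" ordering in deploy wait can be
sketched as below. This is an illustration under assumptions, not the CLI's
code: it works on a dict shaped like `kubectl get deployment -o json` output,
the helper names are hypothetical, and it assumes the expected sha appears
somewhere in the container image reference (tag or digest).

```python
def image_matches(deployment: dict, expected_sha: str) -> bool:
    """True if any container in the pod template references expected_sha."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return any(expected_sha in c["image"] for c in containers)


def rollout_complete(deployment: dict) -> bool:
    """Roughly the condition `kubectl rollout status` waits for."""
    want = deployment["spec"].get("replicas", 1)
    status = deployment.get("status", {})
    return (
        status.get("observedGeneration", 0) >= deployment["metadata"]["generation"]
        and status.get("updatedReplicas", 0) == want
        and status.get("replicas", 0) == want
        and status.get("availableReplicas", 0) == want
    )


def deploy_ready(deployment: dict, expected_sha: str) -> bool:
    """Check the image first; only then trust the rollout status."""
    if not image_matches(deployment, expected_sha):
        # The old ReplicaSet can look fully rolled out before the new
        # image has been applied, so rollout status alone is not enough.
        return False
    return rollout_complete(deployment)
```

A Deployment that is healthy on the previous image passes `rollout_complete`
but fails `deploy_ready`, which is the false success the ordering avoids.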
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.
Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.
Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).
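
Why the edge trigger behaves this way can be seen with a simplified model of
increase(). The sketch below is an assumption-labelled approximation: it sums
sample-to-sample deltas and treats any drop as a counter reset, but it omits the
window extrapolation that real PromQL applies. The function names are
hypothetical.

```python
def increase(samples: list) -> float:
    """Simplified increase(): sum of deltas, with any drop treated as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: the series restarted from 0, so only the new
            # value counts as growth since the reset.
            total += cur
    return total


def should_fire(window_samples: list) -> bool:
    """Edge trigger: fire only if the sticky bit newly latched in the window."""
    return increase(window_samples) > 0


# New latch inside the window (0 -> 1): fires.
new_latch = [0, 0, 1, 1, 1]
# Bit already latched for the whole window: stays quiet.
already_latched = [1, 1, 1, 1]
# Clean reboot clears the bit (1 -> 0): treated as a reset, no alert.
clean_reboot = [1, 1, 0, 0]
```

Under this model the old expression (`== 1`) would be true for every sample of
`already_latched`, while the edge trigger is true only for `new_latch`.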
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applies the rbac stack. That manual step
had to run after every control-plane upgrade, and it was the one thing keeping
autonomous patch upgrades from being truly hands-off (it bit us this cycle: an
earlier master bump left SSO broken until we noticed).
Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.
The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.
Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.
The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes v0.1: documentation, build/install path, and version stamping.
- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
work/tf behaviour (native worktree entry, verification-gated auto-land,
presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
(t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.
Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.
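
The gate logic after the fix, sketched in Python for illustration (the real
check lives in the bash upgrade script). The names `parse_plan_target` and
`plan_gate_ok` are hypothetical, and the sketch assumes the only line parsed
from the plan output is the "kubeadm upgrade apply vX.Y.Z" hint described above.

```python
import re
from typing import Optional

APPLY_LINE = re.compile(r"kubeadm upgrade apply v(\d+\.\d+\.\d+)")


def parse_plan_target(plan_output: str) -> Optional[str]:
    """Return the version kubeadm proposes, or None if it proposes nothing."""
    match = APPLY_LINE.search(plan_output)
    return match.group(1) if match else None


def plan_gate_ok(master_version: str, target_version: str,
                 plan_output: str) -> bool:
    """Gate 4: the plan must offer the target, unless master is already there."""
    master = master_version.lstrip("v")
    target = target_version.lstrip("v")
    if master == target:
        # Master already at target: kubeadm prints no apply line, so an
        # empty parse here is expected and the gate is skipped.
        return True
    return parse_plan_target(plan_output) == target
```

With master on 1.34.9 and an empty parse, the pre-fix comparison
(`parsed != requested`) aborted the chain; the early return lets the worker
phases proceed.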
Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)
- stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 * * * → 0 23 * * *.
- scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
(was next_daily_noon_utc).
- docs (runbook, architecture) + upgrade-state SKILL: schedule references
updated to 23:00 UTC nightly.
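
A Python sketch of the slot computation, as an analogue of what
next_scheduled_run_utc does in scripts/upgrade_state.sh (which is bash; this is
not that code). It assumes a run exactly at 23:00:00 counts as already started,
so the "next" slot is the following day. `in_kured_window` is a hypothetical
helper that only checks the detection start time against the 01:00-05:00 UTC
window quoted above.

```python
from datetime import datetime, timedelta, timezone

RUN_HOUR_UTC = 23            # new schedule: 0 23 * * *
KURED_WINDOW_UTC = (1, 5)    # kured OS-reboot window, hours in UTC


def next_scheduled_run_utc(now: datetime) -> datetime:
    """Return the next 23:00 UTC slot strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    slot = now.replace(hour=RUN_HOUR_UTC, minute=0, second=0, microsecond=0)
    if slot <= now:
        slot += timedelta(days=1)
    return slot


def in_kured_window(ts: datetime) -> bool:
    """True if `ts` falls inside the kured reboot window."""
    start, end = KURED_WINDOW_UTC
    return start <= ts.astimezone(timezone.utc).hour < end
```

For example, calling `next_scheduled_run_utc` at noon UTC returns 23:00 the same
day, and that start time falls outside the kured window.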
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>