claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests.
Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Halving to 3
days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.
Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).
Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
a2c8f906 added checkpoint_timeout=15min + max/min_wal_size to the CNPG
Cluster YAML, but the cluster is applied via null_resource.pg_cluster +
local-exec kubectl apply, which only re-runs when its `triggers` change.
The YAML edit didn't bump a trigger, so the change was inert and never
applied (incl. via CI). Bump the pg_params trigger so the kubectl apply
re-runs and CNPG hot-reloads the new params (reloadable, no restart).
Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid
3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone
volumeClaimTemplate annotation diff the apiserver rejects as immutable,
which is what fails broad dbaas applies (and silently blocked a2c8f906).
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.
Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.
Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.
Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.
Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).
Refs: code-oflt (etcd/sdc IO isolation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.
- var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h
after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
valid AND still flags a missed weekly run; no STALE_SECONDS change needed)
The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd
writes if etcd moves there. Investigation found the avoidable load is kyverno
reporting: the 2026-06-12 etcd-load-reduction disabled the report *features*
but left the reports-controller running (default --enableReporting +
--validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade
left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in
etcd) that nothing reaps (aggregation off). Listing that range starves etcd's
fdatasync enough to flap the apiserver (observed live 2026-06-28).
Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.
Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.
- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
cloudflared replica metric is blind to tunnel-connection loss. Threshold
calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.
All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.
Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
rotation so the provisioned datasource never goes stale
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).
- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
(noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
(dashboard-sa.tf pattern), a chrome-service-portforward Role granting
pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
kubeconfig for any user with a <user>-browser SA — SA token as the default
context (non-interactive, works headless), personal OIDC retained as the
oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
access model, the session-exposure trade-off, and how to grant/revoke a user.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.
The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).
Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.
Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
into state (adopted, not recreated — no duplicate AK objects); the
obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
deliberately NOT included here — deferred, postiz uses a static
secret/postiz database_url.
State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.
Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):
1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
a non-Safari UA family, so the Safari-only check missed them and they still got
the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.
2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
/source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
architecturally can't render Identification-stage sources (authentik docs), and
emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
SFE's username/password form was a dead end. The links are plain redirects that
work on any browser. Slugs are static; re-verify on source changes.
Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.
Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.
- wealth.json: new country / with_home / savings_per_year template vars + a
per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
£1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
UTC) runs `recompute-fire-targets --countries all`.
Targets come from the solver shipped in fire-planner edb4d11.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).
Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.
Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):
1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
push. The two applies race each other for the per-stack PG state lock →
"Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
whole pipeline with no retry.
Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
(they live on repo 1), so we de-dup the apply without deactivating the
registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).
Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.
Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.
New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
no-echo prompt, never argv); creates the path or merges (never
clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).
Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.
Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.
Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.
Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.
Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.
Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.
CLI v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the classification change in the operational runbook: the gate's three
refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now
Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate
no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the
nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.
compat-gate.py now classifies each blocker:
- ACTIONABLE: a newer addon version in addon-compat.json supports the target
-> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
nightly report).
- WAITING-on-upstream: no released version supports the target yet -> held.
- PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.
nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).
Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.
Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
never emitted — the seed is more powerful than a one-time code, so the
only seed-derived path stays the specially-audited `vault code`. Tests
assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
a bulk dump is distinguishable from a single-field read.
`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked
every night for the 1.36 target, even though the block is unactionable: no
kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned
(NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell
'we can fix this' apart from 'nothing to do but wait', and stop the nightly
Failed-Job + alert noise for the latter.
This documents the design: classify each blocker as actionable / waiting-
upstream / pinned, keep the alert only for actionable, quiet the held case to
the nightly report, and make deliberate gate decisions Complete cleanly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:
homelab edges --ns <ns> edges touching <ns> (either direction)
homelab edges --src/--dst <ns> directional egress / ingress peers
homelab edges --peers-of <ns> distinct peer namespaces of <ns>
homelab edges --new-since 24h first seen since a duration or date (YYYY-MM-DD)
homelab edges --denied only action='deny' (blocked / lateral movement)
homelab edges --json --limit N machine-readable / row cap (default 200)
Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.
TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.
Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It is Cyrillic-specific
(ASCII has no multibyte chars to split), so it bit emo (Bulgarian prompts) every
turn while English users almost never saw it.
Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
honor the script's own "best-effort, exit 0" contract — so a truncated or
otherwise odd payload can never again surface as a hook error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.
Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.
Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)
Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
blank screen.
Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).
- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
mirroring the existing health/tripit carve-outs). The authentik / and /static
ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
`traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
Narrowed it to keep the counter while still dropping the high-cardinality
`*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The onboarding runbook's "rebuild the binary" command stamped the version
from `git describe --tags --always`, but setup-devvm.sh stamps it from
`cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the
describe form silently produced a bare commit sha — diverging from what a
provisioner reconcile stamps. Match the canonical source.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". Two latent
CLI bugs, both of which any non-admin AFK invocation can hit:
1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
value. It IS in /etc/environment (login shells), but emo runs his
agents from long-lived tmux / non-login shells that never sourced it,
so every `vault` child hit the 127.0.0.1:8200 default -> connection
refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
now does the same.
2. Token precedence was env > ~/.vault-token > scoped. A power-user who
ran `vault login -method=oidc` carries a read-only ~/.vault-token
(policy `default`, capability `deny` on their workstation path), which
shadowed the purpose-built scoped token -> 403 permission denied on
the user's OWN path. This tool only ever touches
secret/workstation/claude-users/<user>, which the scoped token covers
exactly, so precedence is now env > scoped > ~/.vault-token. Verified
the scoped tokens for both emo and wizard hold create/read/update on
their own paths, so admins are unaffected.
Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.
Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)
Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
(calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
and does not restart.
Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>