infra

Author	SHA1	Message	Date
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
Viktor Barzin	279b88d2bc	docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk) Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status CR (immutable status.node) flapped the PG load-balancer VIP and silently broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error "Cannot read PG creds" masked the real cause for ~25 days). Written when the incident closed (beads code-aoxk, 2026-05-26) but never committed; landing it so the RCA + stuck-CR cleanup procedure live in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:25:10 +00:00
Viktor Barzin	2e50c1235c	chrome-service: grant emo shared browser access (noVNC + homelab browser CLI) Viktor asked to give emo access to the cluster's headed Chrome so he can fill in forms and get past anti-bot / captcha pages. emo was deliberately locked out of chrome-service (noVNC Authentik allowlist was Viktor-only + his power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE his existing browser rather than stand up an isolated per-user instance, accepting that emo can therefore reach Viktor's warmed logged-in sessions (CDP has no per-context auth, so the single shared persistent profile is reachable by anyone who can drive the browser). emo's CLI use is hands-off (his agent can run it unattended). - authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED so the admin-services-restriction policy admits him to chrome.viktorbarzin.me (noVNC). Reverses the prior Viktor-only lock; comment updated to record why. - chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token (dashboard-sa.tf pattern), a chrome-service-portforward Role granting pods/portforward, and a cluster read-only binding (oidc-power-user-readonly) so the SA can resolve the Service and emo's normal read access doesn't regress. - t3-provision-users.sh: install_browser_kubeconfig installs a dual-context kubeconfig for any user with a <user>-browser SA — SA token as the default context (non-interactive, works headless), personal OIDC retained as the oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the headless agent session that homelab browser needs. - docs/architecture/chrome-service.md: document the shared-browser multi-user access model, the session-exposure trade-off, and how to grant/revoke a user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:20:07 +00:00
Viktor Barzin	250d0fc334	docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social) Old-browser users on the SFE who have a password but no MFA device hit the default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error). emo (Google-only, iPadOS 15) hit this on the password path. Document the two no-MFA-downgrade fixes: (1) social login, whose source flow (default-source-authentication) has no MFA stage, so the SFE's social button always completes; (2) enrolling TOTP, which the SFE can validate (unlike WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was enrolled for emo and stored in his Vaultwarden authentik item; verified end-to-end (a Bitwarden-generated code is accepted by authentik). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:24:40 +00:00
Viktor Barzin	e518ada3d4	authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets the SFE too, and the SFE login shows social-login buttons (emo is Google-only with no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md + authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:26 +00:00
Viktor Barzin	6ba60cbb2d	authentik: repoint to overlay patch2 (SFE for old Safari) + docs global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:39:29 +00:00
Viktor Barzin	ec681ba6e1	ci(infra): stop double-apply + stop counting PG lock-waits as failures The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going red ~20% of the time. Root causes (verified from the failure logs, not guessed): 1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82) AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every push. The two applies race each other for the per-stack PG state lock → "Error acquiring the state lock" failures + push-supersede "killed" runs. 2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state lock") fell through and was counted as a hard FAILURE. 3. Transient provider-registry download timeouts (and Vault 5xx) failed the whole pipeline with no retry. Fixes (all in default.yml): - Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons (they live on repo 1), so we de-dup the apply without deactivating the registration. Fail-open on unknown forge. - Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED. - Bounded retry (3x) ONLY on transient signatures (provider download timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast. Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock failures; reproduced `terraform validate` passing the exact stacks that fail at apply) and lock-reaping/force-unlock (PG advisory locks are session-scoped + auto-release; force-unlock can't free them and would corrupt a live concurrent apply). Shell logic + the classification regexes were unit-tested locally against the real decoded error strings (#359 PG lock, #353 provider timeout, #360 missing-arg, helm atomic timeout); `bash -n` clean; YAML parses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:37:18 +00:00
Viktor Barzin	e03e4719ad	vault: distinguish Vaultwarden vs HashiCorp Vault, add `vault kv` `homelab vault` only spoke to Vaultwarden (the password manager), but the name reads as HashiCorp Vault (the infra secrets store — actually OpenBao here). Make the two unmistakable and support both. Distinction (no breakage — the existing Vaultwarden verbs are unchanged): - bare `homelab vault` help now LEADS with the two-stores split; - every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`; - HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group. New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store): - `kv get <path> [--field K]` — read; --field → one value (TTY-aware clipboard/stdout), no field → full secret JSON (refuses a bare TTY). - `kv list <path>` — list sub-paths (no values). - `kv put <path> <key>` — write one key; value via stdin (piped or no-echo prompt, never argv); creates the path or merges (never clobbers siblings; uses kv patch -method=rw so no `patch` cap needed). Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token / $VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR but never inject the scoped token. Access is whatever the policy grants. Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg builders/kvGet/List/Put; file header documents the credential split). CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2 envelope strip, help names both systems. Verified e2e against live Vault (read key-names-only + a scratch put/merge/cleanup). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:09:33 +00:00
Viktor Barzin	3d948c7033	Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held	2026-06-28 10:09:42 +00:00
Viktor Barzin	2880fe1c29	docs: update k8s-version-upgrade runbook for actionable-vs-held gate Reflect the classification change in the operational runbook: the gate's three refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:09:34 +00:00
Viktor Barzin	eebb6c8594	k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case The nightly upgrade chain detected 1.36, the preflight compat-gate refused it, and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY night — even though the block is unactionable (no kyverno/ESO release supports 1.36 yet, and gpu-operator is pinned to its current version because bumping it needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. compat-gate.py now classifies each blocker: - ACTIONABLE: a newer addon version in addon-compat.json supports the target -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the nightly report). - WAITING-on-upstream: no released version supports the target yet -> held. - PINNED: addon marked pinned in the matrix (gpu-operator) -> held. Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert. Tidy the block path (Viktor's scope choice): deliberate gate decisions now make the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge is pushed definitively once per run (no 1->0->1 flap that re-notifies). The detector re-spawns a refused-but-Complete preflight nightly (silently) so a standing hold still re-evaluates, and only announces genuine new/Failed spawns. nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class. gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason). Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned; Calico the lone actionable piece) — no nightly Failed Job, no alert, just the morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:08:20 +00:00
Viktor Barzin	ccee443790	vault: add `get --all` to browse every field of an item `homelab vault get` could only fetch one of five allow-listed fields and had no way to see what fields an item even has — in particular it could not reach arbitrary user-defined custom fields. Add a `--all` flag that dumps the whole item as a normalized JSON object (`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a Claude session can discover and read every field, custom ones included, in a single call. Security model preserved: - Like `get --json`, the dump is all secret values, so it refuses a bare TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout. - The TOTP seed is reduced to a presence flag (`"totp": true`) and never emitted — the seed is more powerful than a one-time code, so the only seed-derived path stays the specially-audited `vault code`. Tests assert the seed and password-history never appear in the dump. - Op-log uses a distinct `get-all` verb (item name still never logged) so a bulk dump is distinguishable from a single-field read. `normalizeItem` is a pure, unit-tested core; `getItem` is the session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog, onboarding runbook, design spec §16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:49 +00:00
Viktor Barzin	afcd463f39	k8s-upgrade: design doc for actionable-vs-held compat-gate classification The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked every night for the 1.36 target, even though the block is unactionable: no kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned (NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. This documents the design: classify each blocker as actionable / waiting- upstream / pinned, keep the alert only for actionable, quiet the held case to the nightly report, and make deliberate gate decisions Complete cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:36 +00:00
Viktor Barzin	b3c419e108	Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix	2026-06-28 09:55:25 +00:00
Viktor Barzin	9a1ab6247b	cli: add `homelab edges` — who-talks-to-whom investigation helper (v0.9.0) Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident investigations without remembering the DB/creds/SQL. New top-level verb: homelab edges --ns <ns> edges touching <ns> (either direction) homelab edges --src/--dst <ns> directional egress / ingress peers homelab edges --peers-of <ns> distinct peer namespaces of <ns> homelab edges --new-since 24h first seen since a duration or date (YYYY-MM-DD) homelab edges --denied only action='deny' (blocked / lateral movement) homelab edges --json --limit N machine-readable / row cap (default 200) Filters render to a single read-only SELECT against the `edge` table, run via the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are validated to the k8s name charset (injection guard) before they reach SQL. TDD: edges_test.go covers flag parsing, query building (each filter, AND combination, peers-of shape, JSON wrapper), the new-since duration/date parser, and namespace-validation / injection rejection. Smoke-tested live: --peers-of, --new-since 24h, --denied, and --json all return correct rows. Docs: runbook query section now leads with the CLI; cli/README gains a v0.9 section. VERSION v0.8.2 -> v0.9.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:51:41 +00:00
Viktor Barzin	a3eb309e26	calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog added in `8d1d2fb9` was treating a symptom). The tigera operator's own `whisker` NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the kube-dns pods (podSelector k8s-app=kube-dns). But whisker-backend resolves goldmane.calico-system.svc via the kube-dns ClusterIP (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule. Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves fine; a test pod with the operator's podSelector-only egress rule reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to 100% ok. whisker-backend resolves goldmane once in the brief startup window before the policy programs, holds its long-lived gRPC stream, and only re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable aggregator (separate pod, unrestricted namespace) was never affected. Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip (whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop (repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace list. Docs (runbook + CLAUDE.md) updated to the real root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:32:28 +00:00
Viktor Barzin	385dfff0e7	authentik: fix episodic blank-screen + 30s-hang login (reliability R2) The login screen would sometimes hang/blank for everyone for ~30s at a time. Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3 goauthentik-server pods dropped out of the Service at once, so Traefik had no healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` — so live ran the chart-default 25%/25% and dropped a pod out of rotation on every roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on PostgreSQL and request-serving is coupled to PG — verified there is no external-cache option to put back, so a SHORT transient is now survived but a total CNPG outage still takes authentik down.) Reliability package (R2, approved): - readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover reconnect without dropping the whole fleet from the Service. - rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key) and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready. - gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9 workers' recycles don't cluster on a DB blip. - / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000) from the previous commit (skip_default_rate_limit) — fixes the cold-load 429 blank screen. Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200, so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md (also corrected a stale "60s persistent DB connections" note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:17:05 +00:00
Viktor Barzin	b84b0021c2	authentik: dedicated rate-limit carve-out + per-router 5xx observability Unauthenticated users were getting a blank login screen (and the screen would sometimes just hang). Root-caused via a read-only fan-out + adversarial verify: the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was the only first-party SPA still on the default limiter (8 siblings already have a carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket). - traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000, mirroring the existing health/tripit carve-outs). The authentik / and /static ingresses switch to it in the authentik-stack commit. - monitoring: the `traefik` scrape job's drop-regex was a blanket `traefik_router_.`, which also dropped `traefik_router_requests_total` — so per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable. Narrowed it to keep the counter while still dropping the high-cardinality `_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh` for the episodic all-3-server-pods-NotReady 502/503/504 cascade. Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:10:34 +00:00
Viktor Barzin	65a09dcbc4	docs(homelab-vault): rebuild snippet uses cli/VERSION, not git describe The onboarding runbook's "rebuild the binary" command stamped the version from `git describe --tags --always`, but setup-devvm.sh stamps it from `cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the describe form silently produced a bare commit sha — diverging from what a provisioner reconcile stamps. Match the canonical source. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:05:49 +00:00
Viktor Barzin	c53e7839e1	Merge remote-tracking branch 'origin/master' into wizard/vault-addr-default	2026-06-28 09:04:43 +00:00
Viktor Barzin	0525f0b12d	homelab vault: self-default VAULT_ADDR + prefer scoped token over ~/.vault-token Setting up emo's Bitwarden access via `homelab vault`, his one-time `homelab vault setup` failed with an opaque "exit status 2". Two latent CLI bugs, both of which any non-admin AFK invocation can hit: 1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient value. It IS in /etc/environment (login shells), but emo runs his agents from long-lived tmux / non-login shells that never sourced it, so every `vault` child hit the 127.0.0.1:8200 default -> connection refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI now does the same. 2. Token precedence was env > ~/.vault-token > scoped. A power-user who ran `vault login -method=oidc` carries a read-only ~/.vault-token (policy `default`, capability `deny` on their workstation path), which shadowed the purpose-built scoped token -> 403 permission denied on the user's OWN path. This tool only ever touches secret/workstation/claude-users/<user>, which the scoped token covers exactly, so precedence is now env > scoped > ~/.vault-token. Verified the scoped tokens for both emo and wizard hold create/read/update on their own paths, so admins are unaffected. Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry the real message (connection refused / permission denied) instead of a bare "exit status N" — without that, (1) and (2) were indistinguishable. Verified end-to-end as emo (VAULT_ADDR unset + his read-only ~/.vault-token present): writeCreds now succeeds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:04:28 +00:00
Viktor Barzin	8d1d2fb999	calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ... i/o timeout" forever, never reconnecting. The operator ships whisker-backend with NO liveness probe, so nothing restarted it; the live UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is a separate pod and was unaffected — only Whisker's ~60-min live view went dark.) Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe. Instead add a watchdog so this never needs a manual restart again: - whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding (calico-system only: pods get/list/delete, pods/log get). - It restarts the whisker pod only when whisker-backend logs >=10 goldmane- connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage). - Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors" and does not restart. Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal note; the stale 2026-06-25 "digest never posted" known-state block is updated to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md flow-trail bullet gains the whisker-wedge gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 08:59:31 +00:00
Viktor Barzin	c70810a51b	workstation: per-user long-lived Claude token to end concurrent-refresh logout A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance, all sharing one ~/.claude/.credentials.json. When the shared access token expires the processes refresh simultaneously; OAuth refresh-token rotation makes the losing writer persist an EMPTY refresh token, logging the user out roughly every access-token lifetime (~8h). Re-issuing the credential never sticks — the race recurs (this is why emo's "standalone token" fix kept regressing). Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope user:inference) kept in the user's OWN Vault path (field `setup_token`). claude-auth-sync materializes it to a user-owned ~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the rotating-credential validate/backup/restore (so no false WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating token and there is nothing to race on. Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so users on the normal per-user Enterprise-SSO flow are unaffected. This is each user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook documents enable/disable/rotate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 08:07:43 +00:00
Viktor Barzin	0097bddf9f	docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno admission / Keel backstop / Vault pull-cred supporting cast and the serving path. Appended to the existing plotting-book section. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:49:58 +00:00
Viktor Barzin	bbc797b30e	ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI The nightly drift-detection cron and every vault-touching push apply have been failing because CI runs terragrunt plan/apply on the Tier-0 `vault` stack, which manages Vault's own transit mount + ACL policies. The CI `ci` Vault role intentionally lacks those admin perms (sys/mounts, sys/policies/acl), so the run always errors: - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an Invalid for_each (local.k8s_users from secret/platform is deferred) - drift: terragrunt plan exits 1 → fails the whole nightly run vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines: - default.yml: skip `vault` in the platform-apply loop (kept in PLATFORM_STACKS so the app-stack detector still excludes it) - drift-detection.yml: skip `vault` in the per-stack plan loop - ci-cd.md: document the exclusion on both pipeline rows Found during a CI-health sweep (user reported many failures): GitHub Actions all green; all Woodpecker repos green except this recurring infra-repo failure, doubled by the legacy repo-1 + repo-82 dual registration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:48:41 +00:00
Viktor Barzin	c13a3f1694	plotting-book: pull image from private ghcr instead of public DockerHub Anca's plotting-book app now builds its image in her own GitHub repo to the private package ghcr.io/passionprojectsanca/book-plotter (off public DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it: - stacks/plotting-book: point the deployment baseline image at the ghcr package and add imagePullSecrets {ghcr-credentials} so the pod can pull the private image (the live tag is still CI-owned via ignore_changes). - stacks/kyverno: add the plotting-book namespace to the ghcr-credentials allowlist so the Kyverno generate policy clones the pull secret into it. Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo) can read the private package before wiring this. Docs: correct ci-cd.md (it wrongly listed plotting-book as already on ghcr — it was on DockerHub) and note the special arrangement; amend ADR-0003 to record that this GitHub-first repo builds to its own org's ghcr namespace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:32:19 +00:00
Viktor Barzin	bf40409141	docs(security): note crowdsec-cf-sync rate-limit resilience All checks were successful ci/woodpecker/push/default Pipeline was successful Details Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the cf-sync architecture description, with the why (the backoff_limit=2 retry-storm that escalated Cloudflare's Lists-API throttle into a stuck state). Follow-up to `5b49634f`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:27:44 +00:00
Viktor Barzin	b371ae6eee	homelab vault: install bw system-wide + onboarding runbook Two remaining gaps to let non-admins (emo) use `homelab vault`: - setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw` failed, which an admin's own ~/.local/bin/bw satisfied — so the system-wide copy was never installed and non-admins had no `bw` backend. Install to the npm /usr prefix and guard on the system path (/usr/bin/bw) instead. - Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the shared Organization/Collection flow for sharing passwords, admin deploy + verification, security model) and repoint the two code comments that cited a design-spec path which never existed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:16:52 +00:00
Viktor Barzin	19d0f0933a	chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-novnc / build (push) Has been cancelled Details The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc sidecar) attaches to the browser container's Xvfb over localhost:6099, and when that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X connection and exited. Because the entrypoint ran x11vnc as an unsupervised background child and then exec'd websockify as PID 1, the dead x11vnc was never relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning 'Connection refused', and the view was black until a manual pod restart. Fix: the entrypoint now runs both x11vnc and websockify as supervised background children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge now self-heals across browser-container restarts. Mirrors the android-emulator stack's supervision pattern. Architecture doc updated with the new failure mode, diagnosis, immediate-recovery, and SHA-pin deploy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:03:29 +00:00
Viktor Barzin	fd33d1a447	monitoring: consolidate all Slack alerting to #alerts, abandon #security Some checks are pending ci/woodpecker/push/default Pipeline is running Details The dedicated #security Slack channel was unreachable: the shared incoming webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a Slack app that isn't a member of #security, so any channel override on it returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently failing for that reason. Per request ("dump the security channel, post in an existing one"), route everything to #alerts instead: - alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>] title styling so security-lane alerts still stand out in the shared channel) - goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value was already switched and applied last change) - AggregatorDown / DigestFailing alert summaries reworded to say #alerts - docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook, .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the "invite the app / flip back to #security" caveats and state the #security abandonment + #alerts consolidation as the current routing. Monitoring stack applied (alertmanager rolled, live config verified: slack-security channel is now #alerts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 13:29:44 +00:00
Viktor Barzin	d105713ae7	fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path All checks were successful ci/woodpecker/push/default Pipeline was successful Details cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because `homelab vault setup` co-locates the user's vaultwarden_* credentials on that same path, every six-hourly sync silently deleted them — so `homelab vault` reported "not configured" within hours of each setup. (Reported as: homelab vault "keeps getting reset / logged out", set up 3 times.) Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no `patch` capability) when the path exists, and `kv put` only to create it on the first backup. Add a regression test with a fake vault asserting a pre-existing sibling key survives a backup, and document the merge requirement in the renewal runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:33:41 +00:00
Viktor Barzin	6f1951af93	fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc All checks were successful ci/woodpecker/push/default Pipeline was successful Details ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:25:33 +00:00
Viktor Barzin	8121d8a4ac	docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius. ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:22:29 +00:00
Viktor Barzin	6c5288998f	goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts All checks were successful ci/woodpecker/push/default Pipeline was successful Details Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last): - #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack. - #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48. - #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place. - #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope). Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver — flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 17:49:25 +00:00
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Some checks failed ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline failed Details Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	ae0d7984c4	docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability All checks were successful ci/woodpecker/push/default Pipeline was successful Details Records the design reached in a /grill-with-docs session: how to track which Service talks to which as more Services are added, using k8s-native options. Decision: service identity = the workload's namespace (primary) plus a `service-identity` label only in the few multi-Service namespaces; east-west observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7, currently disabled) emitting to Loki for a durable trail; enforcement reuses the existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade forensics on a trusted, etcd-constrained cluster, not cryptographic non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy enricher) are recorded with rationale. Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 10:00:36 +00:00
Viktor Barzin	68c240b8de	Merge remote-tracking branch 'origin/master' Some checks failed ci/woodpecker/push/default Pipeline failed Details	2026-06-23 09:56:25 +00:00
Viktor Barzin	7d297dc6b1	eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker). Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time, each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s 1.34 -> 1.35 on its next nightly run. Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox, but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent dependency lock file: no version selected"). Reconciled via `tg init -upgrade` and committed so `terragrunt apply`/CI work cleanly again. Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc marked COMPLETE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:55:51 +00:00
Viktor Barzin	e88ea50304	docs(multi-tenancy): document install_skills (vendored per-user agent skills) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Record the new reconcile step alongside install_memory/install_playwright: vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo), why it's vendored not npx (upstream drift), and that if-absent keys on the user's own copy so it heals a stale/cross-user ~/.claude/skills symlink (emo's grill-me pointed into the admin's home). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:30:27 +00:00
Viktor Barzin	59f2beda21	chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser All checks were successful ci/woodpecker/push/default Pipeline was successful Details Point the chrome-service container at the new chrome-service-browser image and launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the noVNC view — bundled Chromium has those codecs compiled out; only real Chrome carries them. connect_over_cdp callers (tripit fare scrape, homelab browser, snapshot-harvester) attach over raw CDP (version-tolerant) — validated after rollout. Image is built off-infra on GHA (prior commit) → public ghcr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:15:36 +00:00
Viktor Barzin	a3cdc0d6d0	chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The noVNC view showed the browser in the top-left with the rest of the framebuffer black. Cause: Chrome launched with no --window-size, and there's no window manager, so it opened at its profile-persisted (smaller) size inside the 1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window fills the screen on every launch (fresh pods/profiles too). Live windows were already resized via CDP as a stopgap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:00:20 +00:00
Viktor Barzin	c7ead032ec	chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31) Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-novnc / build (push) Has been cancelled Details The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc sweeps the entire fd table (fcntl per fd) on every client connection, and containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes (websockify accepts the WS and dials localhost:5900, but x11vnc never sends its banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU spinning). Same bug + fix the android-emulator stack already carries. Cap nofile before x11vnc starts, in two places: - files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct) - main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]` so the cap applies deterministically on rollout even though the image is :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled). Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and notes the black-when-idle behaviour + the autoconnect URL. (A live x11vnc relaunch with the cap already unblocked the running pod; this makes it survive restarts.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:34:03 +00:00
Viktor Barzin	7dbbb74163	homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build infra CLI / build (push) Has been cancelled Details Make `homelab browser --help` and chrome-service.md state the same tiered rule now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all routine automation; reach for `homelab browser` ONLY when headless is blocked (loads-but-submit-fails / one request errors while siblings 200 / explicit bot wall). Removes the "co-equal choice" framing so agents have one non-conflicting instruction. Adds a test asserting the tiered wording so it can't regress. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 15:44:43 +00:00
Viktor Barzin	a6b52a5839	homelab v0.8.0: browser verbs for headful anti-bot web automation Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Add `homelab browser run|open` so agents can drive the cluster's headful Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp browser can load anti-bot sites and fill their forms, but the gated submit silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing. Driving the real headful Chrome submits first try. That capability already existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to find; now it is one command, versioned, test-covered, and `browser --help` carries the when-to-use signature + an error-code cheat-sheet so the right tool is reached at the right moment (the failure was judgment, not setup). - port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses the :9222 NetworkPolicy), assert non-headless via /json/version, connect_over_cdp, inject the same vendored stealth.js the in-cluster callers use; the port-forward is always torn down, on success and on error. - node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no per-user setup. - default is a fresh incognito context (safe for the shared browser + concurrent callers); --shared-context reuses the warmed persistent profile. - TDD: cmd_browser_test.go covers arg parsing, headless detection, the version pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL spoofed) and `browser open`. - docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from outside the cluster" section. Closes: code-nepg Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:22:22 +00:00
Viktor Barzin	de163aa6af	workstation: switch devvm OOM backstop from systemd-oomd to earlyoom All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details The systemd-oomd backstop added in the previous commit is INERT on this box. oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at 96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon). The very swap=0 that kills the IO storm also neuters oomd. Replace it with earlyoom, which watches free RAM (MemAvailable%) and is swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored (-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the admin's way in always survives) and --prefers the agent/browser hogs. Verified via --dryrun: fires on the RAM threshold and selects a chrome process, not a protected daemon. The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user, docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config + ManagedOOM drop-ins removed. Post-mortem updated with the finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:39:16 +00:00
Viktor Barzin	3a59f4a8bf	workstation: per-user memory caps + systemd-oomd backstop on devvm All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details The shared devvm keeps overloading and had to be hard-killed again today (2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus stacked max-effort agents) grew unbounded, spilled into the disk swap, and swap-thrashed the throttled virtual disk into an IO storm until the box wedged. Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree was capped. So one user could starve everyone. This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap), adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under box-wide pressure while leaving system.slice (sshd/services/your way in) protected, gives system.slice a fair-share CPU/IO priority edge, and routes docker containers into a capped, oomd-policed docker.slice so they can't dodge the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild reproduces them; systemd-oomd added to packages.txt. Applied live and verified: oomctl shows the backstop armed (not dry-run) on the work slices with system.slice protected; a capped-balloon stress test OOM-killed locally at the ceiling with swap flat (no thrash). Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:25:09 +00:00
Viktor Barzin	aeed461591	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)" All checks were successful ci/woodpecker/push/default Pipeline was successful Details This reverts commit `1595bddfc2`.	2026-06-22 08:31:17 +00:00
Viktor Barzin	1595bddfc2	feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Re-land Phase 2 after the first attempt's two failure modes, both fixed: - tempo.resources set under the correct single-binary chart key (was OOMKilled on the namespace LimitRange default when mis-placed at top level). - atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479). Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp -> redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:17:59 +00:00

429 commits