infra

Author	SHA1	Message	Date
Viktor Barzin	a6b52a5839	homelab v0.8.0: browser verbs for headful anti-bot web automation Add `homelab browser run|open` so agents can drive the cluster's headful Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp browser can load anti-bot sites and fill their forms, but the gated submit silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing. Driving the real headful Chrome submits first try. That capability already existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to find; now it is one command, versioned, test-covered, and `browser --help` carries the when-to-use signature + an error-code cheat-sheet so the right tool is reached at the right moment (the failure was judgment, not setup). - port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses the :9222 NetworkPolicy), assert non-headless via /json/version, connect_over_cdp, inject the same vendored stealth.js the in-cluster callers use; the port-forward is always torn down, on success and on error. - node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no per-user setup. - default is a fresh incognito context (safe for the shared browser + concurrent callers); --shared-context reuses the warmed persistent profile. - TDD: cmd_browser_test.go covers arg parsing, headless detection, the version pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL spoofed) and `browser open`. - docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from outside the cluster" section. 
Closes: code-nepg Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:22:22 +00:00
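The "assert non-headless via /json/version" step can be illustrated with a minimal Python sketch. It is not the CLI's Go implementation: it assumes the DevTools `/json/version` payload carries `Browser` and `User-Agent` fields and that headless builds advertise `HeadlessChrome` in one of them. The function name and both sample payloads are hypothetical.

```python
import json


def is_headless(version_payload: str) -> bool:
    """Return True if a CDP /json/version payload looks like headless Chrome.

    Assumption: headless builds report "HeadlessChrome" in the Browser
    and/or User-Agent field of the payload.
    """
    info = json.loads(version_payload)
    browser = info.get("Browser", "")
    user_agent = info.get("User-Agent", "")
    return "HeadlessChrome" in browser or "HeadlessChrome" in user_agent


# Hypothetical payloads, shaped like what a port-forwarded :9222 might return.
headful_payload = json.dumps({
    "Browser": "Chrome/130.0.0.0",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36",
})
headless_payload = json.dumps({
    "Browser": "HeadlessChrome/130.0.0.0",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/130.0.0.0 Safari/537.36",
})

for label, payload in (("headful", headful_payload), ("headless", headless_payload)):
    print(label, "-> headless detected:", is_headless(payload))
```

Checking both fields is a defensive choice in this sketch; a caller would refuse to connect when the function returns True.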
Viktor Barzin	de163aa6af	workstation: switch devvm OOM backstop from systemd-oomd to earlyoom All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details The systemd-oomd backstop added in the previous commit is INERT on this box. oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at 96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon). The very swap=0 that kills the IO storm also neuters oomd. Replace it with earlyoom, which watches free RAM (MemAvailable%) and is swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored (-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the admin's way in always survives) and --prefers the agent/browser hogs. Verified via --dryrun: fires on the RAM threshold and selects a chrome process, not a protected daemon. The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user, docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config + ManagedOOM drop-ins removed. Post-mortem updated with the finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:39:16 +00:00
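The threshold logic described above (SIGTERM at 5% available RAM, SIGKILL at 3%, swap ignored) can be sketched in Python as follows. This illustrates the decision rule only and is not earlyoom's code; the function names, the inclusive comparison, and the sample `/proc/meminfo` text are assumptions made for the example.

```python
from typing import Optional


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in KiB}."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def oom_action(meminfo_text: str, term_pct: float = 5.0,
               kill_pct: float = 3.0) -> Optional[str]:
    """Pick a signal from MemAvailable as a percentage of MemTotal.

    Swap is deliberately not consulted, mirroring the swap-ignored
    setup described in the commit message.
    """
    fields = parse_meminfo(meminfo_text)
    available_pct = 100.0 * fields["MemAvailable"] / fields["MemTotal"]
    if available_pct <= kill_pct:
        return "SIGKILL"
    if available_pct <= term_pct:
        return "SIGTERM"
    return None


# Hypothetical snapshots of a 16 GB box at three pressure levels.
healthy = "MemTotal: 16000000 kB\nMemAvailable: 8000000 kB\n"
tight = "MemTotal: 16000000 kB\nMemAvailable: 700000 kB\n"
critical = "MemTotal: 16000000 kB\nMemAvailable: 400000 kB\n"

for name, snapshot in (("healthy", healthy), ("tight", tight), ("critical", critical)):
    print(name, "->", oom_action(snapshot))
```

Because the rule reads only `MemAvailable`, it keeps working with `MemorySwapMax=0`, which is the property the commit relies on.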
Viktor Barzin	3a59f4a8bf	workstation: per-user memory caps + systemd-oomd backstop on devvm All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details The shared devvm keeps overloading and had to be hard-killed again today (2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus stacked max-effort agents) grew unbounded, spilled into the disk swap, and swap-thrashed the throttled virtual disk into an IO storm until the box wedged. Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree was capped. So one user could starve everyone. This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap), adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under box-wide pressure while leaving system.slice (sshd/services/your way in) protected, gives system.slice a fair-share CPU/IO priority edge, and routes docker containers into a capped, oomd-policed docker.slice so they can't dodge the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild reproduces them; systemd-oomd added to packages.txt. Applied live and verified: oomctl shows the backstop armed (not dry-run) on the work slices with system.slice protected; a capped-balloon stress test OOM-killed locally at the ceiling with swap flat (no thrash). Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:25:09 +00:00
Viktor Barzin	2169e0de5f	workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store All checks were successful ci/woodpecker/push/default Pipeline was successful Details wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive pass. install_memory() rm -rf's that dir, so those entries are dangling — and a missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt and froze emo's sessions (2026-06-22). The plugin shares basenames with the homelab hooks, so the old additive-only logic saw the dead plugin path as "already present" and skipped installing the real ~/.claude/hooks/ copy; pruning first fixes that. Verified against emo's exact original config: yields the correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun. auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor; never local files). No-ops silently when no API key is in env (e.g. ancamilea). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 09:24:42 +00:00
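The prune-before-add ordering can be illustrated with a small Python sketch. It is not the contents of wire-memory-hooks.py: it assumes a `settings.json` layout of `hooks -> event -> [groups] -> hooks -> [{"command": ...}]`, and the function name and the sample file paths are hypothetical.

```python
import copy

# Path fragment of the retired plugin, as named in the commit message.
DEAD_MARKER = "plugins/claude-memory/hooks/"


def prune_dead_hooks(settings: dict, dead_marker: str = DEAD_MARKER) -> dict:
    """Return a copy of settings without hook commands under dead_marker.

    Empty groups and empty events are dropped; other keys (e.g. env)
    are left untouched.
    """
    result = copy.deepcopy(settings)
    hooks = result.get("hooks", {})
    for event in list(hooks):
        kept_groups = []
        for group in hooks[event]:
            live = [h for h in group.get("hooks", [])
                    if dead_marker not in h.get("command", "")]
            if live:
                kept_groups.append({**group, "hooks": live})
        if kept_groups:
            hooks[event] = kept_groups
        else:
            del hooks[event]
    return result


# Hypothetical settings with one live hook and two dangling plugin hooks.
sample_settings = {
    "env": {"MEMORY_API_KEY": "redacted"},
    "hooks": {
        "UserPromptSubmit": [{"hooks": [
            {"type": "command", "command": "~/.claude/plugins/claude-memory/hooks/recall.py"},
            {"type": "command", "command": "~/.claude/hooks/homelab-memory-recall.py"},
        ]}],
        "PermissionRequest": [{"hooks": [
            {"type": "command", "command": "~/.claude/plugins/claude-memory/hooks/permission.py"},
        ]}],
    },
}

pruned = prune_dead_hooks(sample_settings)
print(sorted(pruned["hooks"]))
```

Running the prune first means a later additive pass compares against live entries only, so a dead path that shares a basename with a real hook can no longer mask it.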
Viktor Barzin	aeed461591	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)" This reverts commit `1595bddfc2`.	2026-06-22 08:31:17 +00:00
Viktor Barzin	1595bddfc2	feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Re-land Phase 2 after the first attempt's two failure modes, both fixed: - tempo.resources set under the correct single-binary chart key (was OOMKilled on the namespace LimitRange default when mis-placed at top level). - atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479). Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp -> redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:17:59 +00:00
Viktor Barzin	a0897de7c3	workstation: document homelab-memory hooks + provisioner self-deploy [ci skip] multi-tenancy.md never mentioned the homelab-memory hooks rollout and still listed claude_memory credential injection as purely "future". Document what is actually true now: install_memory provisions the recall/auto-learn/compaction hooks per user, the provisioner binary self-deploys from the repo (step 0), the set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI defaults the URL) — emo has a key, ancamilea is keyless until one is minted. Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing edits self-deploy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:04:38 +00:00
Viktor Barzin	92f35550f2	workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip] Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e, Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the pre-memory binary and never wired the new memory hooks for emo/anca. Close the gap the same way the script already treats managed-settings.json and start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the authoring surface. At the top of the run, if the repo copy differs from the deployed binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when unchanged). Verified across all four paths in isolation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:02:31 +00:00
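The four guards named above (re-exec flag, `bash -n`, DRY_RUN, `cmp`) amount to a small decision table. The Python sketch below models only that decision; the real provisioner is a shell script, and the function name, argument names, and returned action strings are hypothetical.

```python
def self_deploy_action(repo_copy: bytes, deployed_copy: bytes, *,
                       already_reexeced: bool, syntax_ok: bool,
                       dry_run: bool) -> str:
    """Decide what the top-of-run self-deploy step should do."""
    if already_reexeced:
        return "continue"            # re-exec flag set: never loop
    if repo_copy == deployed_copy:
        return "continue"            # cmp: no churn when unchanged
    if not syntax_ok:
        return "keep-deployed"       # bash -n failed: never deploy a broken script
    if dry_run:
        return "would-install"       # DRY_RUN: report, do not mutate
    return "install-and-reexec"


old, new = b"#!/bin/bash\necho v1\n", b"#!/bin/bash\necho v2\n"
print(self_deploy_action(new, old, already_reexeced=False, syntax_ok=True, dry_run=False))
print(self_deploy_action(new, old, already_reexeced=False, syntax_ok=False, dry_run=False))
print(self_deploy_action(new, old, already_reexeced=False, syntax_ok=True, dry_run=True))
print(self_deploy_action(old, old, already_reexeced=False, syntax_ok=True, dry_run=False))
```

The order in which the guards are checked is a choice made for this sketch; the commit message lists the guards but does not state their order.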
Viktor Barzin	0b11a28d66	workstation: stop install_memory aborting the reconcile under set -e install_memory (added in `44562535`) ended with `[[ -d <plugin-dir> ]] && rm && log` and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir or settings file is absent — the normal case for users who never had the claude-memory plugin — those return non-zero, and under `set -euo pipefail` the function returns non-zero and kills the whole hourly reconcile after the FIRST user, before the rest are processed. It never fired before because the rollout was committed but the deployed /usr/local/bin/t3-provision-users was never updated, so install_memory had never run. On first real run it aborted right after ancamilea, so emo (and wizard) never got their memory hooks wired — the reason emo's sessions lost memory. Wrap the cleanup in an if-block, guard the chmod, and end the function with return 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 07:59:47 +00:00
Viktor Barzin	464e0bfb97	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)" This reverts commit `7513468a2d`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	72dcb125d5	Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources" This reverts commit `a02782d11f`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	a02782d11f	fix(monitoring): tempo OOMKilled — move resources under tempo.resources Some checks failed ci/woodpecker/push/default Pipeline failed Details Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The single-binary grafana/tempo chart (v1.24.4) takes container resources at tempo.resources, not a top-level resources: — so my block was ignored and the pod fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly (req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:44:31 +00:00
Viktor Barzin	7513468a2d	feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry spans (Phase 1, already live in prod) export and correlate with logs: - Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d) - OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo) - Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the Loki datasource (no uid change, so existing dashboards are unaffected) - tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline 'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a local plan as non-admin). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:31:11 +00:00
Viktor Barzin	1a32c07ffe	docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings All checks were successful ci/woodpecker/push/default Pipeline was successful Details Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2 (both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets + empirically validate GC-survival on the first live ES + per-stack two-phase -target apply (fallback: state rm + import). Corrected the doc's k8s assumption (cluster is on 1.34; whole climb stays on 1.34, no interleave). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:44:50 +00:00
Viktor Barzin	ac27e41fde	Merge remote-tracking branch 'origin/master'	2026-06-21 20:41:35 +00:00
Viktor Barzin	296deda3b4	eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md), clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2. Each hop applied + verified: controller healthy, all 108 live ExternalSecrets stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest — missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now ["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before 0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:41:30 +00:00
Viktor Barzin	0cd59d2c55	state(external-secrets): update encrypted state	2026-06-21 20:41:10 +00:00
Viktor Barzin	b8612e788d	state(external-secrets): update encrypted state	2026-06-21 20:39:45 +00:00
Viktor Barzin	877e5c73b2	state(external-secrets): update encrypted state	2026-06-21 20:38:34 +00:00
Viktor Barzin	de2250f667	immich-frame: set photo date format to dd/MM/yyyy All checks were successful ci/woodpecker/push/default Pipeline was successful Details The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:36:43 +00:00
Viktor Barzin	8e6eff03dd	state(external-secrets): update encrypted state	2026-06-21 20:36:37 +00:00
Viktor Barzin	0bae025b9b	wealth dashboard: spend-down figures in today's money (inflation-adjusted) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked whether the spend-down numbers were inflation-adjusted — they were not (all nominal). He chose to switch the card to today's money, so every row now shows constant purchasing power for life. Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1 (3% inflation), spending a constant inflation-adjusted amount (the actual pounds withdrawn rise with inflation) until net worth hits £0 at age 100: • No growth (0%) → £12/day, £370/mo, £4,446/yr (negative real: loses to inflation) • Inflation (3%) → £43/day, £1,315/mo, £15,776/yr (0% real: holds value) • Market (7%) → £130/day, £3,942/mo, £47,300/yr (~3.9% real) Title now flags "(today's £)". Same panel/layout; only the SQL, title, and tooltip changed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:13:59 +00:00
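The arithmetic behind the three rows can be reproduced with a short Python sketch: convert each nominal growth rate g to a real rate (1+g)/1.03−1, then apply the annuity formula PMT = NW·r/(1−(1+r)^−n). The dashboard itself computes this in SQL; the net worth and horizon below are illustrative placeholders, not the dashboard's live values, and the function names are hypothetical.

```python
def real_rate(nominal_growth: float, inflation: float = 0.03) -> float:
    """Real rate implied by a nominal growth rate: (1+g)/(1+i) - 1."""
    return (1 + nominal_growth) / (1 + inflation) - 1


def die_with_zero_payment(net_worth: float, rate: float, years: float) -> float:
    """Constant yearly spend that drains net_worth to zero after `years`.

    Uses PMT = NW*r / (1 - (1+r)^-n); at r == 0 this reduces to NW / n.
    """
    if abs(rate) < 1e-12:
        return net_worth / years
    return net_worth * rate / (1 - (1 + rate) ** -years)


# Illustrative inputs only (assumptions, not the dashboard's figures).
NET_WORTH = 1_000_000.0
YEARS = 72.0

for label, growth in (("No growth", 0.00), ("Inflation", 0.03), ("Market", 0.07)):
    r = real_rate(growth)
    yearly = die_with_zero_payment(NET_WORTH, r, YEARS)
    print(f"{label:10s} real={r:+.4f} per-year={yearly:,.0f} per-day={yearly / 365.25:,.0f}")
```

At 3% nominal growth the real rate is exactly zero, so the payment collapses to net worth divided by years. That is why the inflation row in today's money equals the earlier no-growth nominal floor.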
Viktor Barzin	3fb6284e2b	immich-frame: use 24-hour clock (ClockFormat HH:mm) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to switch the Immich photo-frame shown on the Portal kitchen appliance to a 24-hour clock. immichFrame defaults ClockFormat to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing 12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the frame Settings.yml ConfigMap; Reloader restarts the pod on apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:10:51 +00:00
Viktor Barzin	e89de86af0	wealth dashboard: spend-down table → three growth scenarios All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted the spend-down card to compare three portfolio-growth scenarios rather than the previous floor-vs-4%-real pair. The table now has three rows, each a die-with-zero annuity (drain net worth to £0 by age 100) spending a constant number of ACTUAL (nominal) pounds, differing only by the assumed nominal growth rate: • No growth (0%) → £43/day, £1,315/mo, £15,776/yr (= NW ÷ years) • Inflation (3%) → £106/day, £3,233/mo, £38,792/yr (NEW) • Avg market (7%) → £220/day, £6,703/mo, £80,435/yr This keeps the £43 no-growth floor he anchored on. The old third row was "4% real" (£133) expressed in today's money; it's replaced by the 7%-nominal market row (£220, actual pounds) so all three rows share one basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded (one-line SQL edit). Table height 4→5 for the extra row; panels below shifted down 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:06:29 +00:00
Viktor Barzin	85d42f2c13	wealth dashboard: merge spend-down tiles into one compact table All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted the six separate spend-down stat tiles consolidated into a single, more compact card with the figures laid out as rows. Replaces stat panels 9220-9225 with one table panel (id 9220) in the Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month / year). Same underlying math and live values (£43/£1,315/£15,776 floor; £133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row, so it takes ~a third of the width. Note: this intentionally overrides the "table panels live at the bottom" layout convention — Viktor chose to keep this headline KPI glanceable at the top of the dashboard rather than scroll for it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:55:57 +00:00
Viktor Barzin	63add2a126	feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF All checks were successful ci/woodpecker/push/default Pipeline was successful Details Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:50:20 +00:00
Viktor Barzin	166a2bcab4	wealth dashboard: add "spend-down to £0 at 100" stat tiles All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted a glanceable number on the Wealth dashboard for how much he can spend for the rest of his life — spending the whole net worth down to zero by age 100. Adds a third line of six stat tiles to the Overview section, two equations × three cadences (per day / month / year): • FLOOR — net worth ÷ time remaining to age 100. Treats the money as cash (no growth, no inflation): a conservative lower bound. ≈ £43/day, £1.3k/mo, £15.8k/yr. • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted spend that drains the balance to £0 at 100 while it keeps earning 4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr. Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04), computed live so the figures tick as net worth and the horizon move. Net worth reuses the existing latest-per-account dav_corrected math, so the tiles always agree with the "Net worth (current)" stat (pension included; target £0). The 4% real rate is hard-coded per his "keep it simple, just a number" steer — a one-line SQL edit to change later. Layout: tiles inserted at y=9; all sections below shifted down 4 rows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:48:30 +00:00
viktor	c830f9f462	Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master	2026-06-21 17:45:39 +00:00
Viktor Barzin	9aa2438e75	workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring) install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users can't traverse -> wiring silently failed for emo/anca (hooks copied but never wired into settings.json). Run the helper as root (it reads both the repo helper and the user's home) and chown the result back to the user. Verified by the live all-users rollout: emo + anca now wired correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:45:36 +00:00
viktor	f318773cb0	Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master	2026-06-21 17:42:51 +00:00
Viktor Barzin	44562535a2	workstation: provision homelab-memory hooks for all users (retire claude-memory MCP) Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user, idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into ~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py never touches env / the per-user MEMORY_API_KEY), and removes ONLY the claude_memory MCP + plugin if present. Reuses each user's existing key (no minting; per-user isolation stays deferred per the 2026-06-07 design). The homelab CLI hits the same remote HTTP API the MCP used; recall runs via the homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills symlinked from base; root+infra CLAUDE.md) already cover all users. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:42:42 +00:00
Viktor Barzin	79749d7324	Merge remote-tracking branch 'origin/master'	2026-06-21 17:27:42 +00:00
Viktor Barzin	5e3fe2e8e2	docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all 104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at a time (no skipping); chart==app version; downstream Secrets survive. 5-phase ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:27:37 +00:00
viktor	3f81b20fa6	Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master	2026-06-21 17:24:10 +00:00
Viktor Barzin	e2018f9b6c	docs: memory via homelab CLI, not the retired memory-tool/MCP The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the `homelab memory` CLI, which hits the same remote HTTP API). Updates the .claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo CLAUDE.md + ~/.claude/rules/execution.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:24:00 +00:00
Viktor Barzin	51838a4ec7	kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block All checks were successful ci/woodpecker/push/default Pipeline was successful Details kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35. Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes): 3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore (admissions stay open mid-roll) kept it safe. Values schema confirmed compatible across 3.6->3.8 (forceFailurePolicyIgnore still under features:). Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller 2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs 1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x (v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:21:38 +00:00
Viktor Barzin	ead876ec65	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases All checks were successful ci/woodpecker/push/default Pipeline was successful Details Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.` to `k8s-upgrade-(preflight\|master\|worker\|postflight)-.` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. 
Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:57:44 +00:00
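The effect of scoping the job-name regex can be checked with a quick Python sketch. The two patterns are copied as they appear in the commit message; matching here uses Python's prefix-style `re.match`, which is an assumption of this example (Prometheus anchors its regex matchers, so the deployed rule may differ in its trailing part). The numeric job-name suffixes are hypothetical.

```python
import re

# Patterns as written in the commit message (before and after the scoping).
OLD_PATTERN = re.compile(r"k8s-upgrade-.")
NEW_PATTERN = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.")

# Hypothetical Job names: four chain phases plus the new report job.
job_names = [
    "k8s-upgrade-preflight-29170000",
    "k8s-upgrade-master-29170000",
    "k8s-upgrade-worker-29170000",
    "k8s-upgrade-postflight-29170000",
    "k8s-upgrade-nightly-report-29170000",
]

old_hits = [name for name in job_names if OLD_PATTERN.match(name)]
new_hits = [name for name in job_names if NEW_PATTERN.match(name)]

print("old pattern matches:", len(old_hits))
print("new pattern matches:", len(new_hits))
print("no longer matched:", sorted(set(old_hits) - set(new_hits)))
```

Listing the phases explicitly means any future helper job with the `k8s-upgrade-` prefix stays out of the alert unless it is added on purpose.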
Viktor Barzin	7270e2be3b	monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block Some checks failed ci/woodpecker/push/default Pipeline failed Details Last night (2026-06-20) the detector + compat-gate fixes worked: the chain resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno 1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked fired as designed. But the refusal also made the preflight Job exit 1 (block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm for what is the intended halt-and-alert outcome. Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate block sets that gauge (and it stays 1 until the next preflight resets it), so the chain-job-failed alert is suppressed for the blocked period; a genuine wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires (preserving the alert's original purpose — catching the pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:35:35 +00:00
Viktor Barzin	b0ccaf1c65	state(vault): update encrypted state	2026-06-21 15:07:01 +00:00
Viktor Barzin	f84e6818b2	state(vault): update encrypted state	2026-06-21 15:07:01 +00:00
Viktor Barzin	cc4bb8ffe8	wealth dashboard: show price freshness for all 3 holdings, not just worst Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor wanted the freshness tile to cover all three main holdings (META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so the stat renders one value per held position (worst-first), switched the tile to horizontal orientation so the three values sit side-by-side, and updated the description. Each value is coloured by its own age threshold (META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 14:49:33 +00:00
viktor	6c2c56ab3b	Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master	2026-06-21 13:40:41 +00:00
Viktor Barzin	ceae4d5f06	docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler never invoked) and has been removed. Document the replacement: in-kernel nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List + zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts. Both add zero per-request latency and fail open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:39:26 +00:00
viktor	4df741f6de	Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master	2026-06-21 13:36:03 +00:00
Viktor Barzin	c23b03864e	traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2) Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware CRD and the broken Yaegi bouncer plugin can be removed without orphaning any router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static config + initContainer download + state.json entry), the captcha template ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts, and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the catch-all error-pages IngressRoute chain (the one remaining CRD-level reference, which an Ingress-annotation grep does not surface) so that router is not orphaned when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) + Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is deliberately preserved (still used by paperless-mcp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:35:13 +00:00
viktor	df86075c3d	Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master	2026-06-21 13:33:23 +00:00
Viktor Barzin	68d9058f85	cleanup: fully remove orphaned council-complaints app The council-complaints app (Islington civic-reporting pilot) has been abandoned. It was already dead in the cluster (deployments scaled 0/0, image only on the decommissioned registry.viktorbarzin.me which 404s), and it was never in Terraform; only docs + a kyverno comment referenced it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses) were torn down out-of-band via kubectl (nothing in TF to drift from); the DB-dump PVC was backed up to NFS first. This removes the remaining repo references to the live app: - service-catalog.md: drop the council-complaints row - ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list - kyverno require-trusted-registries: the registry.viktorbarzin.me/* allowlist comment claimed council-complaints as the last referencer; rewrite it (no live workload pulls from that registry now; only stale completed Job records still carry the ref). The allowlist line itself is kept (registry-scoped, not app-specific). Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated repos (memory id=388)" snapshot; left as-is so the dated record stays accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:32:10 +00:00
Viktor Barzin	6dc3ce139f	wealth dashboard: expand all rows by default + inline the freshness stat Two follow-ups Viktor asked for on the Price freshness panel: - Expand every section by default. Grafana's collapsed rows hide their child panels; just flipping collapsed=false leaves a non-canonical shape (confirmed via the Grafana API that it keeps the panels nested rather than hoisting them), so each row is now collapsed=false + panels=[] with its children hoisted to top-level -- the exact form Grafana writes when you expand-and-save. Row headers revert to their original y (the child y-coords were already expanded-layout coordinates). - Stop the freshness stat from taking its own line. It's now the 6th tile in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4 at y=5; the collapsed-row y-shift from the previous commit is undone. No query or threshold changes. The large diff is mechanical: 12 child panels re-indent from nested to top-level. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:29:25 +00:00
Viktor Barzin	92ff0b92f1	Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate	2026-06-21 12:41:33 +00:00
Viktor Barzin	5a136c7d53	docs: t3-migrate-idle runbook section + service-catalog + design status Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:40:46 +00:00

4552 commits