Commit graph

4480 commits

Author SHA1 Message Date
Viktor Barzin
3fb6284e2b immich-frame: use 24-hour clock (ClockFormat HH:mm)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. immichFrame defaults ClockFormat
to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:10:51 +00:00
Viktor Barzin
e89de86af0 wealth dashboard: spend-down table → three growth scenarios
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.

The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
  • No growth (0%)      → £43/day,  £1,315/mo, £15,776/yr  (= NW ÷ years)
  • Inflation (3%)      → £106/day, £3,233/mo, £38,792/yr  (NEW)
  • Avg market (7%)     → £220/day, £6,703/mo, £80,435/yr

This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:06:29 +00:00
Viktor Barzin
85d42f2c13 wealth dashboard: merge spend-down tiles into one compact table
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.

Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.

Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:55:57 +00:00
Viktor Barzin
63add2a126 feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:50:20 +00:00
Viktor Barzin
166a2bcab4 wealth dashboard: add "spend-down to £0 at 100" stat tiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.

Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):

  • FLOOR  — net worth ÷ time remaining to age 100. Treats the money as
    cash (no growth, no inflation): a conservative lower bound.
    ≈ £43/day, £1.3k/mo, £15.8k/yr.
  • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
    spend that drains the balance to £0 at 100 while it keeps earning
    4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.

Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.

Layout: tiles inserted at y=9; all sections below shifted down 4 rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:48:30 +00:00
c830f9f462 Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:45:39 +00:00
Viktor Barzin
9aa2438e75 workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring)
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:45:36 +00:00
f318773cb0 Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:42:51 +00:00
Viktor Barzin
44562535a2 workstation: provision homelab-memory hooks for all users (retire claude-memory MCP)
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:42:42 +00:00
Viktor Barzin
79749d7324 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:27:42 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
3f81b20fa6 Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:24:10 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
51838a4ec7 kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block
All checks were successful
ci/woodpecker/push/default Pipeline was successful
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.

Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).

Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:21:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
b0ccaf1c65 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
f84e6818b2 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
cc4bb8ffe8 wealth dashboard: show price freshness for all 3 holdings, not just worst
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:49:33 +00:00
6c2c56ab3b Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:40:41 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
4df741f6de Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:36:03 +00:00
Viktor Barzin
c23b03864e traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:35:13 +00:00
df86075c3d Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:33:23 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
6dc3ce139f wealth dashboard: expand all rows by default + inline the freshness stat
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two follow-ups Viktor asked for on the Price freshness panel:

- Expand every section by default. Grafana's collapsed rows hide their
  child panels; just flipping collapsed=false leaves a non-canonical shape
  (confirmed via the Grafana API that it keeps the panels nested rather
  than hoisting them), so each row is now collapsed=false + panels=[] with
  its children hoisted to top-level -- the exact form Grafana writes when
  you expand-and-save. Row headers revert to their original y (the child
  y-coords were already expanded-layout coordinates).

- Stop the freshness stat from taking its own line. It's now the 6th tile
  in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
  at y=5; the collapsed-row y-shift from the previous commit is undone.

No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:29:25 +00:00
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
ddbdbca7e9 wealth dashboard: add "Price freshness" stat for stalest held quote
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.

Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:23:45 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
a091689603 Merge pull request 'traefik/crowdsec: remove dead plugin middleware reference (PR1/2)' (#8) from wizard/cs-deplugin-refs into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-21 00:17:51 +00:00
Viktor Barzin
71d0af084e traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2)
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:17:40 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
84a18a5529 traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2)
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.

This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:12 +00:00
9774ae3d19 Merge pull request 'crowdsec: firewall-bouncer cluster-wide (remove node2 pin)' (#7) from wizard/cs-fw-allnodes into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:08:15 +00:00
Viktor Barzin
c92590ae85 crowdsec: roll firewall-bouncer cluster-wide (remove node2 validation pin)
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:07:45 +00:00
4f1c998468 Merge pull request 'rybbit sync: exclude CAPI + per_page=500 fix' (#6) from wizard/crowdsec-syncfix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:05:50 +00:00
Viktor Barzin
f55bb6c422 rybbit: sync excludes CAPI blocklist + fix CF items per_page (500)
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:05:05 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
600f1f933c Create Claude auth state directories
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.
2026-06-20 20:25:55 +00:00