infra

Author	SHA1	Message	Date
Viktor Barzin	df1ec1879d	chrome-service: build a real-Chrome browser image (H.264/AAC codecs) Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-browser / build (push) Has been cancelled Details Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA build workflow. The bundled Chromium ships proprietary codecs compiled out, so H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs (libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips main.tf's launch to it once the image exists + is public. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:01:17 +00:00
Viktor Barzin	c670cb7118	eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1 Some checks failed ci/woodpecker/push/default Pipeline failed Details The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1 and v1, so this is the safe window — MUST land before 0.17 removes v1beta1 (there is no conversion webhook). Pure apiVersion bump, schema is byte-identical: 106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database) across 73 .tf files, v1beta1 -> v1, no other field changes. Validated live first on tandoor (single, non-coupled, synced ES): the kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods keep their mounted copy through the sub-second blip. All 110 target Secrets were snapshotted to /tmp first as a backstop. CI applies the changed stacks serially (staged rollout); watching aggregate ES sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest). Next: Phase 3 climb 0.16.2 -> 2.6.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 19:13:04 +00:00
Viktor Barzin	98cd535b97	authentik: lock chrome.viktorbarzin.me noVNC to Viktor only All checks were successful ci/woodpecker/push/default Pipeline was successful Details The chrome-service noVNC exposes Viktor's live logged-in browser sessions (Instagram etc. — he'll sign in there for homelab browser to reuse). It was auth="required" = any authenticated user, and "Home Server Admins" includes emo (emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a host-specific case to the domain-wide forward-auth restriction allowing only Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else, incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser (read-only RBAC blocks port-forward); this closes the human noVNC path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:09:27 +00:00
Viktor Barzin	a3cdc0d6d0	chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The noVNC view showed the browser in the top-left with the rest of the framebuffer black. Cause: Chrome launched with no --window-size, and there's no window manager, so it opened at its profile-persisted (smaller) size inside the 1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window fills the screen on every launch (fresh pods/profiles too). Live windows were already resized via CDP as a stopgap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:00:20 +00:00
Viktor Barzin	c7ead032ec	chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31) Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-novnc / build (push) Has been cancelled Details The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc sweeps the entire fd table (fcntl per fd) on every client connection, and containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes (websockify accepts the WS and dials localhost:5900, but x11vnc never sends its banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU spinning). Same bug + fix the android-emulator stack already carries. Cap nofile before x11vnc starts, in two places: - files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct) - main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]` so the cap applies deterministically on rollout even though the image is :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled). Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and notes the black-when-idle behaviour + the autoconnect URL. (A live x11vnc relaunch with the cap already unblocked the running pod; this makes it survive restarts.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:34:03 +00:00
Viktor Barzin	20ca5ee624	tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption) All checks were successful ci/woodpecker/push/default Pipeline was successful Details REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env (covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/ caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were already real (prior commits); this was the last fake link in the chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:30:47 +00:00
Viktor Barzin	f46b69f372	tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current llama-swap image) with the ADR-0033 claude-agent-service fallback, plus REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus: this also un-fakes the in-app booking share import, which used the same fake LLM. MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 16:50:04 +00:00
Viktor Barzin	59f2070e56	tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image ("common_init_from_params: failed to create context"; llama-swap returns 502), which broke ALL tripit mail ingest text extraction — booking emails AND forwarded reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather than pin the SHARED llama-swap image (cross-user blast radius), repoint the ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads fine and extracts flight numbers + places reliably. Restores the auto-path (reels resolve via the Nominatim geocoder; bookings parse again). The broken qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 15:52:09 +00:00
Viktor Barzin	f96cde35bd	tripit: enable Nominatim POI geocoding for reel→Wishlist ingest All checks were successful ci/woodpecker/push/default Pipeline was successful Details Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a country + city, but the reel route was wired to the global geocoder, which here is GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved Place was created. The app fix (tripit 3c62d596) gave the reel route its own geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans CronJob (the only one running the reel route) so forwarded reels resolve to real venue coords + city + country. Isolated from the global geocoder, which stays openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 14:59:37 +00:00
Viktor Barzin	aeed461591	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)" All checks were successful ci/woodpecker/push/default Pipeline was successful Details This reverts commit `1595bddfc2`.	2026-06-22 08:31:17 +00:00
Viktor Barzin	1595bddfc2	feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Re-land Phase 2 after the first attempt's two failure modes, both fixed: - tempo.resources set under the correct single-binary chart key (was OOMKilled on the namespace LimitRange default when mis-placed at top level). - atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479). Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp -> redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:17:59 +00:00
Viktor Barzin	464e0bfb97	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)" All checks were successful ci/woodpecker/push/default Pipeline was successful Details This reverts commit `7513468a2d`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	72dcb125d5	Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources" This reverts commit `a02782d11f`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	a02782d11f	fix(monitoring): tempo OOMKilled — move resources under tempo.resources Some checks failed ci/woodpecker/push/default Pipeline failed Details Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The single-binary grafana/tempo chart (v1.24.4) takes container resources at tempo.resources, not a top-level resources: — so my block was ignored and the pod fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly (req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:44:31 +00:00
Viktor Barzin	7513468a2d	feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry spans (Phase 1, already live in prod) export and correlate with logs: - Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d) - OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo) - Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the Loki datasource (no uid change, so existing dashboards are unaffected) - tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline 'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a local plan as non-admin). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:31:11 +00:00
Viktor Barzin	ac27e41fde	Merge remote-tracking branch 'origin/master' All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-21 20:41:35 +00:00
Viktor Barzin	296deda3b4	eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md), clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2. Each hop applied + verified: controller healthy, all 108 live ExternalSecrets stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest — missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now ["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before 0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:41:30 +00:00
Viktor Barzin	de2250f667	immich-frame: set photo date format to dd/MM/yyyy All checks were successful ci/woodpecker/push/default Pipeline was successful Details The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:36:43 +00:00
Viktor Barzin	0bae025b9b	wealth dashboard: spend-down figures in today's money (inflation-adjusted) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked whether the spend-down numbers were inflation-adjusted — they were not (all nominal). He chose to switch the card to today's money, so every row now shows constant purchasing power for life. Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1 (3% inflation), spending a constant inflation-adjusted amount (the actual pounds withdrawn rise with inflation) until net worth hits £0 at age 100: • No growth (0%) → £12/day, £370/mo, £4,446/yr (negative real: loses to inflation) • Inflation (3%) → £43/day, £1,315/mo, £15,776/yr (0% real: holds value) • Market (7%) → £130/day, £3,942/mo, £47,300/yr (~3.9% real) Title now flags "(today's £)". Same panel/layout; only the SQL, title, and tooltip changed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:13:59 +00:00
Viktor Barzin	3fb6284e2b	immich-frame: use 24-hour clock (ClockFormat HH:mm) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to switch the Immich photo-frame shown on the Portal kitchen appliance to a 24-hour clock. ImmichFrame defaults ClockFormat to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing 12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the frame Settings.yml ConfigMap; Reloader restarts the pod on apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:10:51 +00:00
Viktor Barzin	e89de86af0	wealth dashboard: spend-down table → three growth scenarios All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted the spend-down card to compare three portfolio-growth scenarios rather than the previous floor-vs-4%-real pair. The table now has three rows, each a die-with-zero annuity (drain net worth to £0 by age 100) spending a constant number of ACTUAL (nominal) pounds, differing only by the assumed nominal growth rate: • No growth (0%) → £43/day, £1,315/mo, £15,776/yr (= NW ÷ years) • Inflation (3%) → £106/day, £3,233/mo, £38,792/yr (NEW) • Avg market (7%) → £220/day, £6,703/mo, £80,435/yr This keeps the £43 no-growth floor he anchored on. The old third row was "4% real" (£133) expressed in today's money; it's replaced by the 7%-nominal market row (£220, actual pounds) so all three rows share one basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded (one-line SQL edit). Table height 4→5 for the extra row; panels below shifted down 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:06:29 +00:00
Viktor Barzin	85d42f2c13	wealth dashboard: merge spend-down tiles into one compact table All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted the six separate spend-down stat tiles consolidated into a single, more compact card with the figures laid out as rows. Replaces stat panels 9220-9225 with one table panel (id 9220) in the Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month / year). Same underlying math and live values (£43/£1,315/£15,776 floor; £133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row, so it takes ~a third of the width. Note: this intentionally overrides the "table panels live at the bottom" layout convention — Viktor chose to keep this headline KPI glanceable at the top of the dashboard rather than scroll for it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:55:57 +00:00
Viktor Barzin	63add2a126	feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF All checks were successful ci/woodpecker/push/default Pipeline was successful Details Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:50:20 +00:00
Viktor Barzin	166a2bcab4	wealth dashboard: add "spend-down to £0 at 100" stat tiles All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted a glanceable number on the Wealth dashboard for how much he can spend for the rest of his life — spending the whole net worth down to zero by age 100. Adds a third line of six stat tiles to the Overview section, two equations × three cadences (per day / month / year): • FLOOR — net worth ÷ time remaining to age 100. Treats the money as cash (no growth, no inflation): a conservative lower bound. ≈ £43/day, £1.3k/mo, £15.8k/yr. • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted spend that drains the balance to £0 at 100 while it keeps earning 4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr. Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04), computed live so the figures tick as net worth and the horizon move. Net worth reuses the existing latest-per-account dav_corrected math, so the tiles always agree with the "Net worth (current)" stat (pension included; target £0). The 4% real rate is hard-coded per his "keep it simple, just a number" steer — a one-line SQL edit to change later. Layout: tiles inserted at y=9; all sections below shifted down 4 rows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 19:48:30 +00:00
Viktor Barzin	51838a4ec7	kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block All checks were successful ci/woodpecker/push/default Pipeline was successful Details kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35. Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes): 3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore (admissions stay open mid-roll) kept it safe. Values schema confirmed compatible across 3.6->3.8 (forceFailurePolicyIgnore still under features:). Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller 2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs 1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x (v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:21:38 +00:00
Viktor Barzin	ead876ec65	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases All checks were successful ci/woodpecker/push/default Pipeline was successful Details Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.` to `k8s-upgrade-(preflight\|master\|worker\|postflight)-.` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:57:44 +00:00
Viktor Barzin	7270e2be3b	monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block Some checks failed ci/woodpecker/push/default Pipeline failed Details Last night (2026-06-20) the detector + compat-gate fixes worked: the chain resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno 1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked fired as designed. But the refusal also made the preflight Job exit 1 (block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm for what is the intended halt-and-alert outcome. Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate block sets that gauge (and it stays 1 until the next preflight resets it), so the chain-job-failed alert is suppressed for the blocked period; a genuine wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires (preserving the alert's original purpose — catching the pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:35:35 +00:00
Viktor Barzin	cc4bb8ffe8	wealth dashboard: show price freshness for all 3 holdings, not just worst Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor wanted the freshness tile to cover all three main holdings (META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so the stat renders one value per held position (worst-first), switched the tile to horizontal orientation so the three values sit side-by-side, and updated the description. Each value is coloured by its own age threshold (META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 14:49:33 +00:00
Viktor Barzin	c23b03864e	traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2) Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware CRD and the broken Yaegi bouncer plugin can be removed without orphaning any router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static config + initContainer download + state.json entry), the captcha template ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts, and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the catch-all error-pages IngressRoute chain (the one remaining CRD-level reference, which an Ingress-annotation grep does not surface) so that router is not orphaned when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) + Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is deliberately preserved (still used by paperless-mcp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:35:13 +00:00
Viktor Barzin	68d9058f85	cleanup: fully remove orphaned council-complaints app The council-complaints app (Islington civic-reporting pilot) has been abandoned. It was already dead in the cluster (deployments scaled 0/0, image only on the decommissioned registry.viktorbarzin.me which 404s), and it was never in Terraform — only docs + a kyverno comment referenced it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses) were torn down out-of-band via kubectl (nothing in TF to drift from); the DB-dump PVC was backed up to NFS first. This removes the remaining repo references to the live app: - service-catalog.md: drop the council-complaints row - ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list - kyverno require-trusted-registries: the registry.viktorbarzin.me/* allowlist comment claimed council-complaints as the last referencer; rewrite it (no live workload pulls from that registry now; only stale completed Job records still carry the ref). The allowlist line itself is kept (registry-scoped, not app-specific). Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated repos (memory id=388)" snapshot; left as-is so the dated record stays accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:32:10 +00:00
Viktor Barzin	6dc3ce139f	wealth dashboard: expand all rows by default + inline the freshness stat Some checks failed ci/woodpecker/push/default Pipeline failed Details Two follow-ups Viktor asked for on the Price freshness panel: - Expand every section by default. Grafana's collapsed rows hide their child panels; just flipping collapsed=false leaves a non-canonical shape (confirmed via the Grafana API that it keeps the panels nested rather than hoisting them), so each row is now collapsed=false + panels=[] with its children hoisted to top-level -- the exact form Grafana writes when you expand-and-save. Row headers revert to their original y (the child y-coords were already expanded-layout coordinates). - Stop the freshness stat from taking its own line. It's now the 6th tile in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4 at y=5; the collapsed-row y-shift from the previous commit is undone. No query or threshold changes. The large diff is mechanical: 12 child panels re-indent from nested to top-level. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:29:25 +00:00
Viktor Barzin	ddbdbca7e9	wealth dashboard: add "Price freshness" stat for stalest held quote Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor was worried about stale prices silently distorting net worth. Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days old) while the dashboard keeps valuing the ~55-share position at that stale close; the Vanguard ETFs are current. Nothing flagged it. Adds one compact stat to the Overview row showing the most out-of-date HELD position's quote age (symbol + humanised age), colour-coded: green <=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of the quote_latest mirror via the wealth-pg datasource, held positions only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale. The six collapsed rows below shift down 4 grid units to make room; no other panel touched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:23:45 +00:00
Viktor Barzin	b1bbe42821	homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details `ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only cluster admins can read — so it hung/failed for the non-admin operator it was built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose identity is deliberately barred from secrets in the openclaw namespace). Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london) with a Role + RoleBinding granting `get` on JUST that secret to the Home Server Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object). emo now resolves the HA token with their own identity, WITHOUT gaining the rest of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment keeps reading openclaw-secrets — purely additive. - stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding - cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse - README + ADR-0012 updated; VERSION -> v0.7.1 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 10:45:32 +00:00
Viktor Barzin	71d0af084e	traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2) The first PR1 commit only dropped the ingress_factory reference + the 8 exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded (not via the variable) in 6 more ingresses that build their middleware chain by hand: owntracks, the monitoring Helm values (grafana + prometheus + alertmanager), and the reverse-proxy module + its own separate ingress factory. Remove all 6 so that after the full-cluster apply NO live ingress references traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 00:17:40 +00:00
Viktor Barzin	84a18a5529	traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2) The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5 (handler never invoked) and is fully superseded by the cs-firewall-bouncer (in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule (proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites, and delete the now-unused `exclude_crowdsec` variable. This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module change → full-cluster apply re-renders every ingress without the middleware) so that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT leaving any ingress pointing at a missing middleware (which would error those routers). PR2 MUST NOT land until this has fully applied and zero live ingresses reference traefik-crowdsec@kubernetescrd. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 00:15:12 +00:00
Viktor Barzin	c92590ae85	crowdsec: roll firewall-bouncer cluster-wide (remove node2 validation pin) One-node validation on k8s-node2 passed: kernel nftables sets created in both input and forward chains (policy accept), ~31k decisions loaded, a known banned scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the nodeSelector so the DaemonSet runs on every node — direct-host enforcement now survives a MetalLB VIP failover to any worker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 00:07:45 +00:00
Viktor Barzin	f55bb6c422	rybbit: sync excludes CAPI blocklist + fix CF items per_page (500) The edge CF IP List can't hold the ~31k CAPI community blocklist (already enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI and carries only high-signal local/curated decisions (+ a 9000 safety cap). Also fixes the list-items GET: per_page=1000 returned a misleading CF 400 'invalid or expired cursor' (10027); the endpoint max is 500. Verified live: crowdsec_ban populates (4 IPs) and the sync exits 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 00:05:05 +00:00
Viktor Barzin	46166c63b2	fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor's passkeys all vanished and he was suddenly being asked to log in multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE each device of users[0] — but the fuzzy search returned the REAL account, so it wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and the social-login stage (default-source-authentication-login) had the provider default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h — hence the constant re-logins. (Password + passkey logins were already weeks=4.) Changes: - authentik: adopt default-source-authentication-login into Terraform (import) and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long as password/passkey. Immediate relief without re-enrolling. - authentik: document the provider-schema gotcha — authentik_stage_identification exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in ignore_changes (commit `4e882989` removed them for this reason; re-adding breaks every apply). The passkey break was purely the missing device records, not drift. - edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login — carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule, make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block), and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting). 
- docs: record the session-duration change, the edge enforcement + auth carve-out (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert). Passkey re-enrollment is a manual user action (devices are gone from the DB); nothing auto-re-deletes them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 23:40:22 +00:00
Viktor Barzin	7050b0441e	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:11:09 +00:00
Viktor Barzin	bc2fbc712c	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:10:48 +00:00
Viktor Barzin	02d14796cc	feat(mailserver): add trips@ send-as alias for TripIt native auth email (ADR-0028) Some checks failed ci/woodpecker/push/default Pipeline was canceled Details TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:10:47 +00:00
Viktor Barzin	5549fc3672	Add per-user Claude auth renewal Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.	2026-06-20 20:10:40 +00:00
Viktor Barzin	3278588325	chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028) TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:04:24 +00:00
Viktor Barzin	7cf93a0587	crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban list + one WAF block rule; the sync writes both ban and captcha decisions into it (captcha downgraded to block at the edge). Drops the second list + managed_challenge rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate the DaemonSet (tar fix already in master; stale orphan was cleared). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:29:43 +00:00
Viktor Barzin	f2b089e267	rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the crowdsec_ban/captcha lists exist before the WAF rules reference them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:12:29 +00:00
Viktor Barzin	a351a66843	crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset - initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob. - cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:04:30 +00:00
viktor	70e8ce1021	Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master	2026-06-20 09:42:41 +00:00
Viktor Barzin	ca8d617e72	rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4) tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:41:41 +00:00
Viktor Barzin	0c56290af0	chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip) `910d5892` landed the [git.timeout] + [git.config] env in master, but the CI apply skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the Forgejo deployment never picked it up. A trivial comment touch to force a clean apply of the stack so the durable push-mirror fix actually takes effect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:19:53 +00:00
Viktor Barzin	cc4bfb593b	rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists (crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks `(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`. No per-request Worker, no cookie machinery — the rybbit Worker stays analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI (fail-safe: a LAPI blip skips the run and freezes the last-known-good block set; serializes CF bulk ops since CF allows one pending op per account). A least-privilege CF API token (Account Filter Lists Edit) is minted in TF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:18:33 +00:00


1535 commits