Commit graph

303 commits

Author SHA1 Message Date
Viktor Barzin
bb0f9f59ef docs: CI-compute doctrine — leverage external infra for builds AND tests [ci skip]
Viktor's standing instruction (2026-06-12): lean on external infra as
much as possible for CI — builds, running tests, lint, releases all on
GitHub Actions hosted runners, never on cluster nodes; in-cluster
pipelines only for cluster-touching steps (deploys, terragrunt,
certbot). Also: watch any triggered pipeline chain to completion and
fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md
CI sections (ADR-0002 companions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:39:27 +00:00
Viktor Barzin
97dcf49b8e monitoring: reduce Slack alert noise (alert-on-change + daily digest)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for the
  alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
  (no pip/apk at runtime -> none of the per-run disk-write footprint that
  disabled status-page-pusher). Reuses the existing Alertmanager Slack
  webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg — two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
  pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
  annotation, so notification volume was unmeasurable — now we can verify
  this change worked (alertmanager_notifications_total et al.).
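
The cascade-inhibition bullet above can be pictured as a small filter. This is a minimal Python sketch, not the Alertmanager configuration from this commit: the `Alert` type, the `notifiable` function, and above all matching source and target on a shared `node` label are assumptions made for illustration. Only the alert names come from the message.

```python
from dataclasses import dataclass

# Alert names are taken from the commit message. Matching on a shared
# "node" label is an assumption of this sketch, not the committed rule.
SOURCES = {"NodeDown", "NodeConditionBad", "NodeDiskPressure"}
TARGETS = {
    "PodCrashLooping",
    "PodImagePullBackOff",
    "PodsStuckContainerCreating",
    "ScrapeTargetDown",
}


@dataclass(frozen=True)
class Alert:
    name: str
    node: str


def notifiable(firing):
    """Return the alerts that would still notify after inhibition."""
    bad_nodes = {a.node for a in firing if a.name in SOURCES}
    return [a for a in firing if not (a.name in TARGETS and a.node in bad_nodes)]


firing = [
    Alert("NodeDiskPressure", "k8s-node3"),
    Alert("PodCrashLooping", "k8s-node3"),
    Alert("PodImagePullBackOff", "k8s-node3"),
    Alert("PodCrashLooping", "k8s-node1"),
]
remaining = notifiable(firing)
```

In this model the node-level alert still notifies, the pod alerts on the pressured node stay quiet, and a crash loop on a healthy node is still reported.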

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:35:56 +00:00
Viktor Barzin
18f524c265 docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip]
Same-change doc sync for infra#12: the tripit-ns-scoped interim secret
paragraph described the pre-ClusterPolicy state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:31:55 +00:00
Viktor Barzin
0216e993dc etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:

1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
   (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
   writes + a pod-creation admission webhook, purely to feed a dashboard. krr
   (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.

2. Disable kyverno reporting (admission/aggregate/background). policyReports were
   already off, so the pipeline generated ephemeralreports + an hourly
   all-resource etcd re-scan for NO user-facing output. Admission enforcement
   (deny-* policies) and Keel mutation are unaffected; violations surface via
   Loki->Slack.

3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).

Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.

Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 19:41:22 +00:00
Viktor Barzin
6bf216751b Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000,
which rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs are under sshd-session, not
  sshd, so the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00
Viktor Barzin
c3a63fcd38 apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.
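
The idempotency fix named in the subject line can be sketched as an order-free comparison. This is an illustrative Python sketch under stated assumptions: the real script may well be shell, the `normalize` and `needs_update` names are hypothetical, and the disk option strings below are made-up examples rather than values from VM 102.

```python
def normalize(options: str) -> frozenset:
    """Split a comma-separated option string into an order-free set."""
    return frozenset(part.strip() for part in options.split(",") if part.strip())


def needs_update(desired: str, current: str) -> bool:
    """True only when the option sets differ, regardless of key order."""
    return normalize(desired) != normalize(current)


# Hypothetical values: the same options in two different key orders.
desired = "local-lvm:vm-102-disk-0,mbps_rd=100,mbps_wr=100,size=64G"
current = "local-lvm:vm-102-disk-0,size=64G,mbps_wr=100,mbps_rd=100"

raw_strings_match = desired == current           # the old comparison
update_required = needs_update(desired, current)  # the normalized comparison
```

With a raw string compare the two values never match, so every timer run would re-issue the change; compared as sets they are equal and the run becomes a no-op.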

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:00:08 +00:00
Viktor Barzin
2e0cebff87 docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip]
Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories:

- compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified)
- storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit 484b4c71) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox
- proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 17:50:43 +00:00
Viktor Barzin
980ec55418 tripit: enable live flight-fare scrape via shared chrome-service CDP
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the
planning workspace's flight_fare cells auto-fetch live Google Flights quotes
through the existing in-cluster headed browser (tripit issue #18, ADR-0007:
rate-limited, cached, degrades to manual entry). Viktor asked to complete the
trip-planning tickets; this is the infra leg of the fare-scrape slice.

Docs: chrome-service architecture + service catalog updated (tripit is now the
second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP
:9222).

HOLD-ORDER NOTE: pushed only after the tripit image containing
FareMode.playwright rolled out (older images crash-loop on the unknown enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:23:53 +00:00
Viktor Barzin
9b19caff47 t3: connection logging across the path for drop attribution
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki, the three layers attribute any future drop to a segment
without needing a repro. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00
Viktor Barzin
4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00
Viktor Barzin
97ccdbecb8 authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:58:10 +00:00
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
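
The "whichever leg drops convicts its segment" reasoning can be written as a small decision function. This is a hedged Python sketch: the leg names come from the message, but the `attribute` function, the segment labels it returns, and the order of the checks (narrowest leg first) are one reading of the design, not the probe's actual code.

```python
def attribute(legs: dict) -> str:
    """Map per-leg health (True = clean) to the segment a drop points at.

    Checks run from the shortest path to the longest, so a failure is
    blamed on the innermost segment that can explain it.
    """
    if not legs["t3serve"]:
        return "t3-serve"
    if not legs["internal"]:
        return "Traefik/dispatch"
    if not legs["cloudflare"]:
        return "WAN/Cloudflare/tunnel"
    return "infra exonerated"


only_cloudflare_dropped = attribute(
    {"cloudflare": False, "internal": True, "t3serve": True}
)
all_clean = attribute({"cloudflare": True, "internal": True, "t3serve": True})
```

If every leg stays clean while a user still drops, the function returns the exonerating verdict, which matches the "exonerates infra with data" case described above.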

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
acb847b858 actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).

Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.
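
The arithmetic behind the 429s can be checked with a simplified token-bucket model. This Python sketch makes several assumptions: it treats the ~70 boot requests as arriving at the same instant, reads "50/300" as average 50 and burst 300, and uses a hypothetical `TokenBucket` class that only approximates how a real rate-limit middleware behaves.

```python
class TokenBucket:
    """Simplified token bucket: refills at `average` tokens per second."""

    def __init__(self, average: float, burst: float):
        self.rate = float(average)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def rejected(average: float, burst: float, arrivals) -> int:
    """Count the requests that would be answered with 429."""
    bucket = TokenBucket(average, burst)
    return sum(not bucket.allow(t) for t in arrivals)


boot = [0.0] * 70  # ~70 near-parallel requests, modelled as simultaneous
shared = rejected(10, 50, boot)      # shared middleware: average 10, burst 50
dedicated = rejected(50, 300, boot)  # dedicated middleware, as read above
```

Under these assumptions the shared limit rejects the last 20 requests of a single cold boot, while the dedicated limit admits all 70 with room for an overlapping second load.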

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:36:42 +00:00
Viktor Barzin
59a531b8e0 coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip]
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).

Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.

Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:21:34 +00:00
Viktor Barzin
a1b7b0ca53 forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).

Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
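
One way to picture a container-aware keep-set is to expand the kept versions with everything their indexes reference. This Python sketch is hypothetical throughout: it is not taken from cleanup.sh or from the post-mortem's options, and the `keep_set` function, its input shapes, and the digests in the example are invented for illustration.

```python
def keep_set(versions, children_of, newest=10):
    """Return the digests to keep.

    versions: list of (digest, tags), newest first.
    children_of: digest -> digests referenced by that index.
    """
    keep = set()
    for position, (digest, tags) in enumerate(versions):
        protected = "latest" in tags or any("cache" in tag for tag in tags)
        if position < newest or protected:
            keep.add(digest)

    # Container-aware step: anything a kept index references must survive,
    # followed transitively in case an index points at another index.
    stack = list(keep)
    while stack:
        for child in children_of.get(stack.pop(), ()):
            if child not in keep:
                keep.add(child)
                stack.append(child)
    return keep


# Ten newer untagged versions push an old :latest index out of the window.
versions = [(f"sha256:new{i}", []) for i in range(10)] + [
    ("sha256:index", ["latest"]),
    ("sha256:amd64", []),
    ("sha256:arm64", []),
    ("sha256:stale", []),
]
children_of = {"sha256:index": ["sha256:amd64", "sha256:arm64"]}
kept = keep_set(versions, children_of)
```

Here the untagged children of the kept index survive even though they sort outside the newest-10 window, while the unrelated stale version is still pruned.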

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:22:47 +00:00
Viktor Barzin
2b8c0def30 dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
  node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
  for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:32:34 +00:00
Viktor Barzin
1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00
Viktor Barzin
b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
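
The version-agnostic pairing in the first bullet amounts to "try the new endpoint, fall back on 404". This is a minimal Python sketch of that pattern only: the real dispatch is a separate binary, and the `auto_pair` and `fake_server` helpers, the injected `call` function, and the response bodies are assumptions made for illustration. The two endpoint paths come from the message.

```python
NEW_PATH = "/api/auth/browser-session"  # 0.0.25 name, per the commit message
OLD_PATH = "/api/auth/bootstrap"        # 0.0.24 name, per the commit message


def auto_pair(call):
    """call(path) -> (status, body). Try the new endpoint, fall back on 404."""
    status, body = call(NEW_PATH)
    if status == 404:
        status, body = call(OLD_PATH)
    return status, body


def fake_server(known_path):
    """Stand-in for a server that only knows one of the two endpoints."""
    def call(path):
        return (200, "paired") if path == known_path else (404, "")
    return call


on_old_version = auto_pair(fake_server(OLD_PATH))
on_new_version = auto_pair(fake_server(NEW_PATH))
```

Because only a 404 triggers the fallback, other failures from the new endpoint are returned as-is instead of being masked by a second request.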

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
edaee13be3 docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]
2026-06-09 21:41:53 +00:00
93ec0c66fd docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip]
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0452611b5 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
37626cb89b workstation: docs — mark RBAC + Authentik gate applied [ci skip]
multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:51:44 +00:00
Viktor Barzin
c611ecf84d workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip]
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:27:17 +00:00
Viktor Barzin
55d4b4cf2d workstation: correct devvm RAM (8->24GB) + record 8G swap & capacity budget [ci skip]
devvm is the t3code Workstation host. Added an 8 GiB swapfile (swappiness=10, fstab-persisted) to turn multi-user OOM-kills into graceful paging (was 0 swap, ~1.2 GiB free of 23). Capacity budget: ~4-5G RAM per active user, max ~3-4 concurrent active sessions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:48:52 +00:00
Viktor Barzin
3d6c5b8bc7 matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel)
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.

Doc-only [ci skip].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:32:49 +00:00
Viktor Barzin
23602f393e matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated)
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).

Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.

Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:58:17 +00:00
Viktor Barzin
838343184b stem95su: document on-demand Drive→NFS deploy (no scheduled job)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
CI/CD for the stem95su site is intentionally ON-DEMAND, not a CronJob:
the content is short-term and a scheduled job + Vault secret + ESO +
GCP "publish to Production" would be rotting artifacts. Instead, mirror
the source Google Drive folder "claude" → /srv/nfs/stem-site via a
throwaway rclone container using the existing google_workspace OAuth
creds (secret/viktor), rsync to NFS with an empty-source guard, then
shred the temp config. Verified end-to-end. Recipe in claude-memory.

Doc-only: corrects the service-catalog update-mechanism note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:10:06 +00:00
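The "empty-source guard" in 838343184b above can be sketched as follows. This is a minimal illustration, not the actual recipe (which lives in claude-memory and uses rclone + rsync): the function names are hypothetical, and `shutil.copytree` stands in for the rsync step. The guard matters most when the sync also deletes destination files missing from the source; whether the real recipe does that is not stated in the commit.

```python
import shutil
from pathlib import Path


def source_has_files(src: Path) -> bool:
    """Return True if at least one regular file exists anywhere under src."""
    return any(p.is_file() for p in src.rglob("*"))


def guarded_mirror(src: Path, dst: Path) -> bool:
    """Copy src into dst only when src is non-empty.

    Returns False (and leaves dst untouched) when the source is empty,
    e.g. because the Drive download failed or produced nothing.
    copytree is a stand-in for rsync (assumption for illustration).
    """
    if not source_has_files(src):
        return False
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return True
```

An empty staging directory therefore leaves the served content as it was instead of propagating the empty state to NFS.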
Viktor Barzin
0d445d948c stem95su: host STEM platform for 95. СУ (public NFS-backed static site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New public static site at stem95su.viktorbarzin.me serving the school's
Bulgarian STEM platform (dashboard + lessons/games, externally authored
HTML/media exported from Gemini).

- Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume),
  NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool"
  or rsync), no rebuild; auto-backed-up offsite by nfs-mirror.
- ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge),
  dns_type="proxied" (Cloudflare CNAME auto-created).
- nginx ConfigMap sets index stem_board.html (the dashboard) for "/".
- Docs: service-catalog entry + new "Static Site Hosting" pattern
  (NFS-backed vs image-baked) in patterns.md.

Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB
page, video byte-range, no Authentik redirect) through the public
Cloudflare path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 15:21:21 +00:00
Viktor Barzin
c7ffbaa204 aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix)
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved candidates. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).

Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
  and was silently dropping the bulk of its results at the 5s cutoff;
  Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
  -> 500 Internal Server Error), contributed 0 usable streams, only a
  dead [X] entry in every list.

This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
  breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
  success gated on Comet being alive. The old probe counted only Breaking
  Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.

[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 07:21:42 +00:00
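The hardened probe logic in c7ffbaa204 above (series and movie paths, per-source breakdown, error-stream count, success gated on Comet) might look roughly like this. It is a sketch under stated assumptions: the stream record shape (`source` and `error` keys) and the exact success condition are hypothetical, since the commit only says success is "gated on Comet being alive".

```python
from collections import Counter


def summarize(streams):
    """Split one response into a per-source count of usable streams and an error count."""
    errors = sum(1 for s in streams if s.get("error"))
    by_source = Counter(s["source"] for s in streams if not s.get("error"))
    return by_source, errors


def probe(series_streams, movie_streams):
    """Combine a series probe and a movie probe into one result dict."""
    series_src, series_err = summarize(series_streams)
    movie_src, movie_err = summarize(movie_streams)
    by_source = series_src + movie_src
    result = {
        "series": sum(series_src.values()),
        "movie": sum(movie_src.values()),
        "comet": by_source["comet"],  # Counter returns 0 for a missing key
        "errors": series_err + movie_err,
    }
    # Assumed gating: both paths return streams AND Comet contributes some.
    result["success"] = int(
        result["series"] > 0 and result["movie"] > 0 and result["comet"] > 0
    )
    return result
```

Under this gating, a probe that only counts streams for one well-seeded series can no longer stay green while movie playback or the main source is broken.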
Viktor Barzin
9529eedfe0 docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
  stacks/platform path -> traefik stack; drop misleading
  restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
  startup-read secret consumers need a Reloader annotation (matrix root
  cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
a42f4f7b26 trek: trial-deploy TREK group-trip planner behind Authentik (solo eval)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.

- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
  service 80->3000, ingress_factory auth=required + proxied DNS at
  trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
  uploads) -- encrypted per the sensitive-data rule and to avoid the
  SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
  bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
  in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
  policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
  Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
  companion deferred per solo-trial scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:30:07 +00:00
Viktor Barzin
52f5de905d docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip]
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):

- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
  factory modules (never existed); name the real four (ingress_factory,
  nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
  / flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
  reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).

Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
  modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
  table + stale cross-reference (vault migrated off it 2026-04-25).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:49 +00:00
Viktor Barzin
ddc8bfa8cf tripit: remove Gmail-scrape ingest-mail CronJob; plans@ becomes sole channel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants
mail ingested when forwarded to plans@viktorbarzin.me, and only from actual
users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL
from ingest-plans (the app now ignores mail from non-users instead of filing it
under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for
the new sender notifications. Service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:50:53 +00:00
Viktor Barzin
3796a84e04 docs: f1-stream is Woodpecker-native (Forgejo viktor/f1-stream), not GHA/repo-10
f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo
registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims:
- CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native
  owned-app group; note old github source archived + GHA Woodpecker repo 10
  deactivated; f1-stream is now Woodpecker repo 166.
- service-catalog: note the source repo + deploy model.
2026-06-05 09:19:12 +00:00
Viktor Barzin
deb031cc2c feat(tripit): encrypted personal-document vault PVC + DOCUMENT_ENCRYPTION_KEY
Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at
/data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the
DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A
root chown init-container makes the block volume writable by the non-root app
without touching the NFS doc vault. Backs the new owner-only encrypted personal
document vault in the tripit app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
147a8cff40 Restore f1-stream stack — undo accidental bundling into 63fe7d2b
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).

This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
90ad6b9125 fan-control: presence-aware IPMI fan curve for the R730 PVE host
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design:  docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
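The presence-aware curve selection and hysteresis stepping described in 90ad6b9125 above can be illustrated as below. The real daemon is bash; this Python sketch only mirrors the pure logic. The 15-minute presence window and the 83°C hand-back threshold come from the commit message, while the curve points, the 3°C hysteresis band, and all function names are hypothetical.

```python
PRESENCE_WINDOW_S = 15 * 60   # from the commit: <15min since last activity
FAILSAFE_C = 83               # from the commit: CPU>=83°C hands fans back to Dell auto

# Hypothetical (threshold °C, fan duty %) steps; the real curves are not in the commit.
CURVES = {
    "COOL": [(0, 20), (50, 35), (60, 55), (70, 80)],
    "QUIET": [(0, 10), (60, 20), (70, 40), (78, 70)],
}


def pick_mode(door_open, secs_since_activity):
    """QUIET while someone is (or recently was) in the garage, else COOL."""
    if door_open or secs_since_activity < PRESENCE_WINDOW_S:
        return "QUIET"
    return "COOL"


def target_duty(mode, temp_c, current_duty, hysteresis_c=3):
    """Return the next fan duty in percent, or None to hand back to auto mode."""
    if temp_c >= FAILSAFE_C:
        return None
    curve = CURVES[mode]
    # Step up immediately once a threshold is reached.
    up = max(duty for thresh, duty in curve if temp_c >= thresh)
    if up >= current_duty:
        return up
    # Step down only after the temperature falls hysteresis_c below the threshold.
    down = max(duty for thresh, duty in curve if temp_c + hysteresis_c >= thresh)
    return min(current_duty, down)
```

With these illustrative numbers, a CPU hovering around 60°C in COOL mode holds the 55% step until it drops below 57°C, which avoids the fans flapping between two steps.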
Viktor Barzin
9858a1c44b docs(add-user): document dashboard auto-login home-ns scope + foreign-namespace exception [ci skip]
Auto-login covers a user's k8s_users home namespace only (dashboard SA bound
there). For workloads in a separate/pre-existing namespace (gheorghe→novelapp),
that namespace must also grant the dashboard SA, not just the OIDC User. Best
practice: set k8s_users namespace = where the workload runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
8f13fdeaf7 docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip]
Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list
namespaces/nodes (for dashboard nav) but not read other tenants' resources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
c4bd64f88a docs: dashboard now auto-injects per-user SA token (no token-paste)
Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to
reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the
extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map
regen). [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
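The token-injector flow named in c4bd64f88a above (X-authentik-username -> SA token -> Bearer) reduces to a header rewrite. The sketch below is illustrative only: the function name, the in-memory `token_map`, and the choice to strip any client-supplied Authorization header are assumptions, not details taken from the repo.

```python
def inject_bearer(headers, token_map):
    """Map the forward-auth username header to that user's SA token.

    headers:   incoming request headers (dict of str -> str)
    token_map: username -> ServiceAccount token (hypothetical in-memory map)
    """
    # Assumption: never trust an Authorization header sent by the client.
    out = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    user = next(
        (v for k, v in headers.items() if k.lower() == "x-authentik-username"),
        None,
    )
    token = token_map.get(user)
    if token:
        out["Authorization"] = f"Bearer {token}"
    return out
```

A user missing from the map gets no Bearer header at all, which is consistent with the note that onboarding a namespace-owner needs the injector map regenerated.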
Viktor Barzin
8e44ccaa65 docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked)
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
b64d8d6168 cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash
Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real
virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the
attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost"
disks left by failed detaches (query-pci QMP timeouts) that the scheduler's
28-LUN guard can't see — exactly the drift that wedged the MAM grabber on
node3 (13 tracked vs 23 real). PASS when reconciled; WARN when drift>0 or real
is 20-24; FAIL when real ≥25 (near the LUN cap). Already flagging node6 at 21 disks.
Single `qm list` + one `qm config` per VM keeps it ~3s (the naive
once-per-VM version timed out the parallel runner).

Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by
138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were
unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and
tripped `set -e`, killing the check before json_add. It silently dropped
from every parallel report and broke --serial entirely (whole run aborted).
Guarded both substitutions with `|| true`; the existing `=~` numeric checks
already handle the empty case. immich_search now reports PASS/WARN instead
of vanishing.
2026-06-05 09:19:10 +00:00
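The PASS/WARN/FAIL thresholds of check #47 in b64d8d6168 above can be written out as a small classifier. The real check is a shell function that gathers counts over SSH and kubectl; this sketch covers only the decision step. The function name, the definition of drift as real minus tracked, and the precedence of FAIL over WARN are assumptions.

```python
def classify_ghost_disk_drift(real_disks, tracked_attachments):
    """Classify one node from its real CSI disk count and k8s-tracked attachments."""
    drift = real_disks - tracked_attachments  # assumed definition of drift
    if real_disks >= 25:                      # near the 28-LUN cap
        return "FAIL"
    if drift > 0 or 20 <= real_disks <= 24:
        return "WARN"
    return "PASS"
```

The node3 incident from the commit (13 tracked vs 23 real) comes out as WARN here, as does a reconciled node already sitting at 21 disks.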
Viktor Barzin
ad3432d685 docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver)
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00