Commit graph

3907 commits

Author SHA1 Message Date
Viktor Barzin
fe8db19aaf job-hunter: build-triggers-deploy model; CronJob :latest + docs
CI now drives the Deployment rollout (kubectl set image to the build SHA in
.woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment
runs whatever CI last set (image ignore_changes keeps TF from fighting it),
and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly
run). Keel stays enrolled in parallel as a redundant net.

Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc (owned
apps drive their own rollout, Keel parallel — upstream stays Keel-only); add
the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section.

[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:24:50 +00:00
Viktor Barzin
052c776eba immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:16:11 +00:00
Viktor Barzin
cda858d560 job-hunter: weekly refresh CronJob + ops/analyst runbook
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.

Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:37:57 +00:00
Viktor Barzin
87f1dcb72d wealth: consolidation chunk 2 — net-pay $grain merge, Trend projection, row reorg
Completes the 36->17 consolidation:
- 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/
  yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg).
- Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it
  renders regardless of the dashboard time range — fixes empty-by-default. Drops
  the duplicate projection-row stat cards + the how-to-view text panel.
- Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns &
  contributions / Income vs market / Holdings / RSUs (META) / Projections.

All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg
source. Visual review pending.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
848cc7211f t3code: track t3 nightly via health-checked auto-updater
Move t3 from pinned stable (0.0.24, catalog capped at opus-4-7) to the nightly
channel so new models (Opus 4.8) land as t3 ships them. t3-autoupdate (daily
systemd timer) pulls t3@nightly, but applies the Keel-incident lesson: it
health-checks the new binary on a throwaway serve and AUTO-ROLLS-BACK on
failure, and restarts only IDLE per-user instances (defers any with an active
agent child) so an in-flight session is never killed by an update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
de09e8f294 immich runbook: note force=false re-kick gotcha after row deletion [ci skip]
The videoConversion enqueue is an async scan; deleting encoded_video rows while a
prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the
first pass). Re-trigger force=false once the queue first drains to waiting:0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
d27df1f321 t3code: dispatch — strip @domain from X-authentik-username (Authentik injects email)
Authentik injects the full email (e.g. vbarzin@gmail.com), but /etc/ttyd-user-map
and dispatch.json key on the local part (vbarzin), so every real login hit
403 'no instance provisioned'. Strip @domain before lookup, matching the
terminal stack's tmux-attach.sh. Verified: vbarzin@gmail.com / emil.barzin@gmail.com
-> 302 (own instance); unmapped/no-header -> 403.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
b651f137b9 docs(kms): SXSMSI/1603 is client-machine-specific (VM 300 pilot) + deep-repair/escalation
Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap +
the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install,
CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent
[Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer
subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep
repair, the DeepRepairDone marker + in-place-repair escalation, and the
low-disk/guest-agent-drop gotchas hit during the pilot.
2026-06-02 19:24:30 +00:00
Viktor Barzin
481585f6e6 immich: cap streaming transcode bitrate to fix 4K video stutter [ci skip]
Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast +
targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback
streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single
stream needs ~10-13.5 MB/s and stuttered for every client, local and remote.

Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium,
transcode=bitrate. 4K resolution preserved; originals never modified. Existing
oversized transcodes regenerated by deleting their asset_file encoded_video rows
+ videoConversion force=false (concurrency 1).

Document config + add runbook docs/runbooks/immich-transcode-bitrate.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
deec540fad t3code: docs — auto-provisioning service-catalog entry + design status implemented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
a587f0ee55 t3code: ingress -> devvm dispatch+autopair (retire in-cluster nginx)
stacks/t3code now points the Authentik-gated ingress at the DevVM t3-dispatch
service (Service+Endpoints -> 10.0.10.10:3780) instead of the in-cluster nginx,
which is removed. Per-user routing + session auto-injection now live on DevVM.
Verified: external 302->Authentik; in-cluster vbarzin/emil.barzin->302 (auto-pair
to own instance), unmapped->403.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
9f551e3c13 t3code: harden dispatch — dedicated user + validated t3-mint + scoped sudoers
Run t3-dispatch as an unprivileged dedicated user instead of wizard (who has
full sudo). Privileged minting goes through /usr/local/bin/t3-mint, which
validates the target against /etc/ttyd-user-map before minting as that user;
sudoers permits t3-dispatch to run only that wrapper. Compromise of the
network-facing service can mint pairing tokens for mapped users at most.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
0472f67d49 t3code: devvm dispatch + auto-pair service (Go)
Routes X-authentik-username -> per-user t3 instance; on no t3_session
cookie, mints a pairing token (as the OS user) and exchanges it at
/api/auth/bootstrap, injecting the session cookie. Listens :3780, reads
/etc/t3-serve/dispatch.json. Constants from the Task-1 auth-contract spike.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
72aba7da32 t3code: reconcile per-user t3 instances from /etc/ttyd-user-map
Sticky port allocation (3773+), enables t3-serve@<user>, emits
/etc/t3-serve/dispatch.json for the dispatch service. systemd timer
(OnBootSec+hourly) mirrors the apply-mbps-caps pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
f8a63fdacd t3code: per-user t3-serve@ systemd template (User=%i file isolation)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
2152430b70 docs(t3code): record discovered t3 web-auth contract 2026-06-02 19:24:30 +00:00
Viktor Barzin
5e4f83d4e7 wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo
36 -> 19 panels (chunk 1 of 2), zero metric loss:
- 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)"
- 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over
  time windows" table (window × Δall/Δmkt/return%)
- 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line,
  timeFrom=10y so full history always shows)

All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel,
row reorg) to follow.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:27:09 +00:00
Viktor Barzin
a09b0b3612 docs(t3code): implementation plan for per-user auto-provisioning
Task-by-task plan pairing with the design doc: Task 1 discovers the t3
web-auth contract (cookie name + bootstrap body), then systemd template,
reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:19:22 +00:00
Viktor Barzin
1a0647c7ed docs(t3code): design for per-user auto-provisioning (Authentik login → instance + session)
Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service
template (User=%i enforces file permissions); devvm reconcile; devvm
dispatch+auto-pair service (mints + injects the t3 session cookie on first
authenticated visit, replacing the in-cluster nginx). Spec for review before
writing the implementation plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:10:05 +00:00
Viktor Barzin
55ed50b932 docs(plans): wealth dashboard consolidation design
Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric
loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ
stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the
3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into
collapsed rows. Also rebuild the projection as a Trend panel (numeric
years-from-today x-axis) so it renders regardless of the dashboard time range
(fixes empty-by-default). Philosophy: merge duplicates, keep every metric.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:52:59 +00:00
Viktor Barzin
73cb0aab8b t3code: per-user isolation via Authentik + nginx username dispatcher
t3 is single-owner (no in-app multi-user), so each person runs their own
`t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service),
emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the
Authentik-injected X-authentik-username to the right instance; unmapped
identities get 403 (no shared fallback). Flipped the ingress auth app→required
(Authentik forward-auth) — the same-origin self-served UI works behind it (WS
carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate.
Mirrors the terminal stack's per-user model.

Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403;
t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes
intentionally unsupported here — deferred until the native app is published.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:38:06 +00:00
Viktor Barzin
9fb3e6e851 docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip]
Real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but never landed that
change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled live back to the stale .200, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).

Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:25:33 +00:00
Viktor Barzin
f807050eb5 cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip]
The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.

Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.

Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).

Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
30a644d3cd docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status
The bundled consumer Office removal leaves a pending reboot; a same-run VL
install (or re-run before rebooting) fails with setup.exe 1603. Document the two
guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and
the on-disk completion poll. Record that the uninstall path is now verified on a
real M365 box (O365HomePremRetail removed) and the install needs a reboot first.
2026-06-01 21:22:05 +00:00
Viktor Barzin
a382683c0e infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify)
Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the
containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the
now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept
running, so it stayed hidden until a new image tag was pulled). Retarget to
.203 and add skip_verify (node dials Traefik by IP; cert is for
forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node
deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd,
no drain). Doc fix in .claude/CLAUDE.md.
2026-06-01 21:22:05 +00:00
Viktor Barzin
82855848d1 plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief)
Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
2026-06-01 21:22:05 +00:00
Viktor Barzin
599d67db51 docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
f364399ede wealth: add 30y net-worth projection row + align net-pay panel
Implements the committed projections design (docs/plans/2026-05-28-wealth-
projections-{design,plan}.md): a collapsed "Projections" row on the wealth
dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto,
horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-
3y historical line + a base-rate compounding-only line), 3 stat cards, and a
text panel with one-click future time-range links.

Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from
today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF
so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical
rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns
(~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window
is only ~4 months, and the true all-time geomean is skewed by 2021's +86%.

Also aligns "Net pay vs market gain — per month" to consecutive month-end
deltas (same fix as the other monthly panels). Verified all SQL live.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
32e1042ca8 t3code: expose t3 serve (DevVM) publicly at t3.viktorbarzin.me (app-tier)
New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints →
10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied,
auth="app"). t3 ships its own owner-pairing + bearer-session auth, so
Authentik forward-auth is intentionally omitted — it would break the
cross-origin native mobile app and app.t3.codes (bearer-only, no
Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier)
rate-limit the public surface; t3's pairing is the gate. TLS is
auto-synced into the namespace by Kyverno's sync-tls-secret policy.

Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200.
Trade-off (public RCE surface behind app-native auth, no Authentik SSO)
accepted 2026-06-01 to keep the native app + app.t3.codes working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
c5e4b1ea71 kms: add /diag anonymous telemetry collector behind Anubis carve-out
The PowerShell activation scripts POST small JSON diagnostics to
/diag so script execution errors are captured. The collector
(python:3.12-alpine, ConfigMap-mounted) prints each event to stdout
as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making
events searchable in Grafana (Loki only — no Slack, no Prometheus).

Like /scripts, /diag needs a second ingress_factory carve-out with
full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW
challenge that PowerShell/curl can't solve. Without full_host the
factory would derive kms-diag.viktorbarzin.me and the carve-out
would never match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
3fa9e2409c runbook: K8s worker scaling for PVC capacity headroom
Documents the 6-worker cluster shape (post 2026-05-26 scale-up after
the proxmox-csi LUN-cap incident), the six binding constraints (plugin
LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration
on node1, PVE host memory, no Terraform management for K8s VMs), and
the playbooks for adding/removing workers.

Scale-up triggers:
  - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
  - cluster memory requests > 90%
  - LUN-cap incident
  - planned ≥3 net-new block PVCs when max VA already ≥ 22
Scale-down conditions:
  - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days

Playbooks lean on scripts/provision-k8s-worker (clones template 2000,
cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete
node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md,
beads code-oflt (IO contention long-term fix).
2026-06-01 19:50:41 +00:00
Viktor Barzin
5c77482a8c fire-planner: LLM_MODEL env var → qwen3vl-4b default (fits in current GPU headroom; immich-ml is holding ~10GB) 2026-06-01 19:50:41 +00:00
Viktor Barzin
fb1e47a20a nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor
Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around
both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
  `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
  Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
  CrashLoop): chart_values renders the live tag via a plural
  kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
  fresh install/DR), so a re-render never downgrades below live.

Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and
its background-controller overrides a TF-set value, and patch == minor for
Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations like other workloads.

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
2026-06-01 19:50:41 +00:00
Viktor Barzin
50d0f1affa kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix)
The 2026-05-26 migration flipped the keel default force->patch and dropped
match-tag from the inject-keel-annotations patch, but Kyverno's add-only
mutate can't remove an annotation that's no longer listed -- 194 workloads
kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in
multi-image pods: the blog's nginx<->nginx-exporter images were swapped and
the site was down 2026-05-26 -> 06-01 (nginx received the exporter's
-nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently
swapped (app lost its /datastore PVC + env, ran ephemeral for days).

- policy now sets keel.sh/match-tag=null (strips on admission, never re-added)
- swept the annotation off all 194 existing workloads (kubectl, no pod restart)
- AGENTS.md: documents the strip; post-mortem added

blog + changedetection un-swapped via kubectl set image (TF-ignored images);
both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG
state authoritative). [ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
769ae7a6d3 traefik: bot-block-proxy buffer 256k + document the real HTTP/2 limit
Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers
to 256k and corrected the rationale. Investigation found the *binding* limit
for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not
exposed by Traefik config) — oversized authentik_proxy_* cookie piles are
rejected at the h2 layer upstream of bot-block regardless of these buffers.
The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation
(or clearing cookies); these buffers only prevent bot-block being a tighter
bottleneck for sub-64KB piles + HTTP/1.1 clients.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:27 +00:00
Viktor Barzin
1c165ce5b4 docs(kms): document the consequence-gated edition switch (changepk + ODT)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:27 +00:00
Viktor Barzin
3d28870e25 nextcloud: fix backup retention to sort by name, not mtime
The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used
`ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's
mtime, so the freshest backup didn't sort as newest — the retention step
deleted the new backup and kept a stale one. Sort lexically (chronological
for these names) and keep the last.

Also exclude html/ (the app code, reproducible from the now-pinned image;
the real config lives at config/config.php, html/config is empty) so the
backup is config+data+custom_apps only → ~4.3G (<5G target).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
root
84ab4c998c Woodpecker CI deploy [CI SKIP] 2026-06-01 15:15:26 +00:00
Viktor Barzin
ddd582a28c backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image
The offsite Synology hit 97% — the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.

- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
  (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
  SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
  PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
  now excludes html/ (app code, from image), logs, and preview cache and keeps
  only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
  the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
  32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
  CrashLooped on the downgrade. Pinning eliminates the drift.

Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
Viktor Barzin
0dd4a31eff docs(immich): cap server-side job concurrency to protect sdc + log recurrence
A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail
backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc
HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident
on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2
in the Immich DB system-config; documented in the Immich row and recorded the
recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this
also commits that previously-untracked post-mortem).

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
Viktor Barzin
af4bfbe046 kms: revert files accidentally bundled into the docs commit
The previous commit (81a7d804) swept in 23 unrelated working-tree files because
a rebase --autostash had left them staged in the index — including 4 files with
leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf,
url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is
invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock
+ the llama-cpp markers) to their prior committed state; terragrunt regenerates
the generated files on the next run. Net effect of the docs commit is now just
the runbook doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
bdb0cef242 docs(kms): document /keys.json carve-out + script auto-key selection
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
170a3bb052 traefik: bump bot-block-proxy large_client_header_buffers to 8x64k
The ai-bot-block forward-auth copies the full request (incl. the
accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy.
With 30+ Authentik Proxy Providers under viktorbarzin.me the combined
Cookie header exceeds openresty's default 4x8k buffers, so the auth
check returned 400 "Request Header Or Cookie Too Large" (surfaced as
error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo
OAuth sign-in for affected browsers.

Mirror the existing auth-proxy-config fix: 8x64k accepts the pile.
Applied live via tg apply + bot-block-proxy rollout restart.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
6f0bdf2993 kms: carve /keys.json out of Anubis for script auto-key-selection
The activation scripts now fetch the published GVLK list from /keys.json to
auto-select the right key for the detected edition. Like the .ps1 scripts,
that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the
PoW). Add /keys.json to the ingress_scripts carve-out path list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
root
7a297deb24 Woodpecker CI deploy [CI SKIP] 2026-06-01 10:36:49 +00:00
Viktor Barzin
e63a812062 kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.

Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202).

Docs: kms-public-exposure runbook + service-catalog entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
root
de04ed099e Woodpecker CI Update TLS Certificates Commit 2026-06-01 10:36:49 +00:00
Viktor Barzin
e5d9160a88 monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip]
goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify
daemonset were missed by the cdb7d9a8 KEEL_LIFECYCLE sweep. The monitoring ns
is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh
annotations; TF kept trying to revert both, plus a live-stamped tier label —
which made `terragrunt plan -detailed-exitcode` return 2 every run and the
drift-detection cron fail daily. Add the standard KEEL ignore_changes (image +
keel.sh annotations) and ignore the tier label so these stop churning.

Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this
does not trigger a monitoring apply. Remaining (separate) drift: the grafana
ACL null_resource (triggers.always) + tls cert refresh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:33:30 +00:00
Viktor Barzin
935fb07df7 hermes-agent: gate PVC on parked flag (clears PVCStuckPending)
The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at
replicas=0 it had no consumer pod and sat Pending forever, falsely tripping
PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to
drive both replicas and the PVC count, so a parked service has no PVC at all.
Empty/never-bound PVC removed; recreated automatically when un-parked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:19:28 +00:00
Viktor Barzin
7b6a0e70af hermes-agent: opt out of external monitor while parked
hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its
auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence
firing, which halts kured node reboots. Set external_monitor=false so a
deliberately-down service stops tripping the divergence gate. Re-enable when
the deployment is brought back up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:12:33 +00:00