The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %),
so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) +
RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the
calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor
is unavailable — it blips out intermittently, so the readout never goes blank.
The Override readout likewise shows both "% · rpm". HA-side only; daemon
unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md
covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the
full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via
the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia
alert group; the dashboard; and the systemd watchdog. Notes the SD-card root
cause and that the Pi-side config is hand-managed + backed up off-box.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dashboard Override slider used to show a stale stored % (e.g. 5%) while the
fans were actually at ~53%, which was confusing. Add
automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it
mirrors the live commanded % (sensor.r730_fan_control_target) into the Override,
so it always shows the actual absolute fan speed and updates as the fan moves.
While locked it stops tracking and is the user's editable setpoint. The readout
under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side
only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an
Override % slider, and a Lock toggle. Lock now means "freeze the current speed,
algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo)
snapshots the live target % into Override and sets mode=manual on lock-ON, and
mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the
mode it already reads). cool/quiet stay reachable via the entity but are off the
simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified
live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a
Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a
deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on
the dashboard-it Server view so it isn't forgotten. The automation re-checks the
lock after the hour (locking mid-countdown cancels the revert) and the 83 °C
ceiling still wins. HA-side only (helper + automation + dashboard live on
ha-sofia, auto-git-tracked there); these docs are the infra-repo record.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a
cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the
2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads
live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est.
Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the
dashboard-it Server view, next to total power. 46 bash tests green; verified
live (9120rpm -> ~15W est).
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the step-band fan curve with a continuous linear ramp — the bands
flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis
is the homelab standard; PID is overkill for this slow thermal loop.
fan% now interpolates between env-tunable anchors:
COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium)
QUIET 68C/20% -> 83C/100% (near-silent until ~70C)
Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric
hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold.
41 bash tests green; deployed + verified live (59C -> 49%, smooth).
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the HA-control feature shipped in 8beca1df: the daemon reads the
ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation,
and the dashboard-it Server-view sensors + control tiles.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo
registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims:
- CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native
owned-app group; note old github source archived + GHA Woodpecker repo 10
deactivated; f1-stream is now Woodpecker repo 166.
- service-catalog: note the source repo + deploy model.
The actively-developed f1-stream (infra files/ copy: 12 active extractors +
Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is
the deployed app (replacing the stale March github build).
- main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}
+ image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE.
- Remove stacks/f1-stream/files/ (source now in viktor/f1-stream).
- docs/plans: extraction design + plan pair.
Applied via tg + kubectl set image to forgejo:24857a82; live /health green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Power/temp sweep (2026-06-05) located the cooling-per-watt knee at ~60%:
60->70% buys only -2C for +21W, and 70->100% buys 0C for +54W (the CPU
floors ~59C at cluster load, so more airflow does nothing). Re-tune the
COOL curve to cap its normal band at 60% (~303W, ~61C); 80/100% become a
high-load safety ramp (>=73/79C) before the 83C ceiling. QUIET unchanged
(already at the 281W / 4800rpm floor). Saves up to ~75W (~650 kWh/yr) vs
full-tilt for the last ~2C. Tests + design doc updated; verified live
(63C, 60%, ~267W).
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).
This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).
Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).
Design: docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list
namespaces/nodes (for dashboard nav) but not read other tenants' resources.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to
reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the
extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map
regen). [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
stacks/f1-stream/files/backend/playback_verifier.py and
chrome_browser.py describe an in-cluster CDP caller, but the deployed
f1-stream image is built from github.com/ViktorBarzin/f1-stream which
has neither file — verified by `kubectl exec ls /app/backend/` and
grepping for 'CHROME' in the deployed pod.
The infra/stacks/f1-stream/files/backend/ tree is a vestigial design
that was never wired up to a build pipeline. Calling it out so the
next reader doesn't waste time debugging why the migration "didn't
take effect" — it took effect on dead code.
The hourly snapshot-harvester CronJob is the only live in-cluster
caller of the CDP endpoint today.
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.
Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
header to bypass Chrome's hardcoded DNS-rebinding protection (no
`--remote-allow-hosts` flag exists in stock Chrome 130; verified by
binary string grep). Bridge also forces Connection: close on HTTP
responses so Node ws opens a fresh TCP for the WS upgrade rather than
trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
CDP, dumps storage_state() (cookies + localStorage), writes atomically
to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.
Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
longer used by code; ExternalSecret kept for symmetry with the snapshot
endpoint).
Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
snapshot via the bearer-gated HTTPS endpoint.
Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.
Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design for letting namespace-owner users (e.g. gheorghe/vabbit81) open the
K8s Dashboard with their Authentik account, mapped to their per-user RBAC.
oauth2-proxy fronts kong-proxy, runs the OIDC code-flow, and injects the
user's id_token as Bearer so the apiserver applies existing namespace-owner
bindings. Additive + one ingress repoint; multi-audience scope mapping
keeps the CLI flow untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.
- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In-cluster pods resolved forgejo.viktorbarzin.me to the public IP
(176.12.22.76) and hairpinned out through the WAN gateway, intermittently
timing out buildkit pushes from Woodpecker build pods (which, unlike
kubelet, don't use the per-node containerd Forgejo mirror). This silently
failed CI build-and-push for Forgejo-hosted repos (recruiter-responder
pipelines #15-#18 at the push step).
Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me
traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP
(reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target
auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's
*.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod
woodpecker-server hostAlias belt-and-suspenders.
Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7
unrelated pre-existing drifts in the stack) + verified:
- pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP)
- recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP
Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo
registry quick-ref). Advances beads code-yh33.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.
- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
(.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
kubernetes_service data source -- cannot rot on a future renumber and avoids
the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
-> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
registry + LB-IP renumber checklist (in-band + out-of-band consumers).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203
(Traefik LB IP) as the redirect source. That couples mail's LAN
path to Traefik's IP choice — if Traefik moves again (it just
moved .200 → .203 on 2026-05-30), the mail path silently breaks.
Removing the script and the matching doc paragraph; keeping the
networking.md .200 → .203 staleness fix (separate correction).
Follow-up: give the mail HAProxy listener a dedicated pfSense
Virtual IP (IP Alias on opt1), update Technitium internal zone
+ WAN port-forwards to target the VIP, so mail's LAN-side path
is decoupled from any other service's LB IP.
Reconciles the tripit stack source with live state and adds the forward
flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a
rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded),
filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's
Authentik login identity). Adds an ingest-plans CronJob that polls spam@
filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target)
so forwarded bookings are extracted and attached to the matching trip;
IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203
(Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has
no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me
(and imap./smtp.) get sent to .203 too — where Traefik does not
listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs
while Roundcube (port 443 via Traefik) keeps working.
Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993}
gets redirected to 10.0.20.1 (the mail HAProxy listener already
serving the public path). Loaded on every incoming interface by
pfSense rule generation, so any LAN/VPN client falling into the
split-horizon answer lands on the right service unchanged.
Includes idempotent reproducer script (mirrors the existing
pfsense-haproxy-bootstrap.php pattern) and the networking.md
mail carve-out paragraph plus the stale .200 → .203 reference.
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot
Service code: viktor/claude-agent-service @ 66104a3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Dashboard now shows two 'Me' bars: realized gross (~£409k, from
SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k,
levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT
salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
total_value (what the comparison bar uses) must be full TC; document storing
base+bonus+RSU components too so it's verifiable that RSU+bonus are included.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a barchart (panel 10) ranking every company's London p50 total comp
(COALESCE total/base) with the user's current comp shown in line, so it's a
direct "how do I compare" view. The user's figure is NOT hardcoded in the
dashboard JSON — it's a labeled comp_point in the DB (company_slug
'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number
out of git. It's below the £500k alert bar (no Slack ping) and ranks too low
to appear in analyze leaders. Runbook documents the panel + how to update the
baseline.
[ci skip] — dashboard ConfigMap applied locally (targeted).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh):
`python -m job_hunter alert --threshold 500000 --location london --slack`
posts to Slack the companies whose London p50 total comp >= £500k, flagging
any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via
the job-hunter-secrets ExternalSecret from Vault secret/job-hunter
slack_webhook_url (seeded from the shared workspace webhook; repointable to a
dedicated channel). Runbook gains an "above-target Slack alert" section.
[ci skip] — applied locally (stack-scoped).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI now drives the Deployment rollout (kubectl set image to the build SHA in
.woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment
runs whatever CI last set (image ignore_changes keeps TF from fighting it),
and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly
run). Keel stays enrolled in parallel as a redundant net.
Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc (owned
apps drive their own rollout, Keel parallel — upstream stays Keel-only); add
the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section.
[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.
Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.
Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The videoConversion enqueue is an async scan; deleting encoded_video rows while a
prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the
first pass). Re-trigger force=false once the queue first drains to waiting:0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap +
the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install,
CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent
[Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer
subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep
repair, the DeepRepairDone marker + in-place-repair escalation, and the
low-disk/guest-agent-drop gotchas hit during the pilot.
Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast +
targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback
streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single
stream needs ~10-13.5 MB/s and stuttered for every client, local and remote.
Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium,
transcode=bitrate. 4K resolution preserved; originals never modified. Existing
oversized transcodes regenerated by deleting their asset_file encoded_video rows
+ videoConversion force=false (concurrency 1).
Document config + add runbook docs/runbooks/immich-transcode-bitrate.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Task-by-task plan pairing with the design doc: Task 1 discovers the t3
web-auth contract (cookie name + bootstrap body), then systemd template,
reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service
template (User=%i enforces file permissions); devvm reconcile; devvm
dispatch+auto-pair service (mints + injects the t3 session cookie on first
authenticated visit, replacing the in-cluster nginx). Spec for review before
writing the implementation plan.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric
loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ
stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the
3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into
collapsed rows. Also rebuild the projection as a Trend panel (numeric
years-from-today x-axis) so it renders regardless of the dashboard time range
(fixes empty-by-default). Philosophy: merge duplicates, keep every metric.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but never landed that
change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled live back to the stale .200, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).
Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.
Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.
Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).
Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The bundled consumer Office removal leaves a pending reboot; a same-run VL
install (or re-run before rebooting) fails with setup.exe 1603. Document the two
guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and
the on-disk completion poll. Record that the uninstall path is now verified on a
real M365 box (O365HomePremRetail removed) and the install needs a reboot first.
Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.
Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.
Three disk-layout options:
A. Carve per-VM data disks from sdc — simple, no hardware,
IO contention unchanged
B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
C. Add a dedicated NVMe — also closes beads code-oflt (IO
contention), ~£200 hardware investment
Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.
Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.
Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).