ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.
This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.
The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).
Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New
dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod
dropdowns + free-text regex search; stats (lines/errors/warns/active-ns),
log-volume-by-namespace, error/warn rate, top-namespaces-by-errors,
top-pods-by-errors, a filterable live-logs panel, and a second row for the
node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel).
Error/warn use case-insensitive regex line-filters so they work regardless of
level-label availability. New "Logs" Grafana folder.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause:
loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly
from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d
the literal `*` path and shipped zero pod logs ("stat failed: no such file" for
every pod). Per Grafana Alloy docs, a local.file_match component must expand the
glob into concrete file targets first. Add it. Also add stage.cri {} so Loki
stores clean messages + real timestamps instead of raw containerd CRI-prefixed
lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26
state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the
IO-storm safeguard.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants
mail ingested when forwarded to plans@viktorbarzin.me, and only from actual
users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL
from ingest-plans (the app now ignores mail from non-users instead of filing it
under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for
the new sender notifications. Service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":`
rationale comment DIRECTLY above the `auth = "none"` line; mine was in the
module's top block comment, so the guard aborted the whole monitoring apply
(this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the
first push). Move the rationale to the required position.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA
sensors dead, and its local journal had been silent since Apr 27 — a
2017 SD card intermittently flipping the rootfs read-only). Nothing was
captured because logging lived only on the failing card. Ship telemetry
off-box so the next failure is diagnosable centrally:
- Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) —
node_exporter + a vcgencmd textfile collector on the Pi exporting
under-voltage/throttle/SoC-temp as rpi_* metrics.
- Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the
exact SD-failure signature), Under-voltage since boot, High SoC temp.
- LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's
promtail can push its journal — Loki was ClusterIP-only.
- Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/
throttle, temp, load, memory, disk, network.
The Pi separately got a systemd hardware watchdog (auto-reboot on a hard
hang; today it stayed down ~5h until a manual power-cycle).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over
99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app
now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty)
requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m).
Frees ~768Mi of reserved cluster memory.
woodpecker-server sets reloader.stakater.com/search="true" but the
woodpecker-db-creds ExternalSecret never carried the matching
reloader.stakater.com/match="true", so Stakater Reloader never
restarted the server when Vault rotated the pg-woodpecker password
(7-day static role). The DB DSN is injected via envFrom, which does not
hot-reload a running pod — so after each rotation the server kept using
the revoked password until some unrelated restart (Keel bump, drain,
manual) recreated it inside the window. A latent weekly DB-outage masked
by incidental restarts.
Add the match annotation to the ESO target template and correct the
stale "rotated every 24h" comment (actual rotation_period is 604800s =
7 days).
Verified end-to-end: forced 'vault write -f database/rotate-role/pg-woodpecker',
ESO updated the secret in ~3s, Reloader auto-restarted woodpecker-server
in ~36s, new pod reconnected with zero DB errors. [ci skip] because the
change was already applied via scripts/tg apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The actively-developed f1-stream (infra files/ copy: 12 active extractors +
Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is
the deployed app (replacing the stale March github build).
- main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}
+ image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE.
- Remove stacks/f1-stream/files/ (source now in viktor/f1-stream).
- docs/plans: extraction design + plan pair.
Applied via tg + kubectl set image to forgejo:24857a82; live /health green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at
/data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the
DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A
root chown init-container makes the block volume writable by the non-root app
without touching the NFS doc vault. Backs the new owner-only encrypted personal
document vault in the tripit app.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Enable bulk ingest job (run_examples_bulk_ingest=true) to populate
fire_example table from top/all + top/year across 12 FIRE subreddits.
Job fire-planner-examples-bulk-202606042150 is currently running.
- Upgrade examples_llm_model from qwen3vl-4b to qwen3-8b; GPU has 10.7GB
free (immich-ml using ~4GB of 15GB total), so higher-quality model fits.
- Add LLM_CONCURRENCY=3 to bulk job container — claude-agent-service is
now bounded-concurrency (MAX_CONCURRENCY=10), no longer single-flight.
Strictly serial extraction (default 1) is no longer necessary.
TODO: flip run_examples_bulk_ingest=false after job completes and re-apply
to push the weekly CronJob model upgrade (qwen3vl-4b→qwen3-8b) which
didn't land in this apply (TF timed out waiting for Job completion).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).
This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).
Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).
Design: docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
At h=4 the two stacked values per window panel were too small because each
also rendered its field-name label. Switch textMode value_and_name -> value
on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix
keep them self-identifying and the window stays in the panel title. Applied
via targeted tg apply of the configmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to
h=4 (matching the overview stat cards directly above) and pull every panel
below up by 4 so the layout stays gap-free. Layout-only change — no panel
content/query touched. Applied via targeted tg apply of the configmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Swap the single "Returns over time windows" table (panel 9201) for 5 stat
panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the
headline value + Δ market (£, net of contributions) as a second value,
colored red/green by sign.
Same per-window Modified-Dietz math as the old table, just scoped to one
interval per panel — verified against live wealthfolio_sync PG and reproduced
through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% /
£297,846, matching the prior table exactly). Kept the same 24×8 grid footprint
so nothing else on the dashboard reflows.
Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip]
because a full monitoring-stack CI apply would pull in unrelated pre-existing
drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Explicitly own the keel.sh/policy annotation in TF (was relying on the
Kyverno-stamped `patch` default). Set policy=all + trigger=poll +
pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover
Keel-written runtime annotations (change-cause, update-time, revision,
match-tag).
The agent acts only on newly-created todos; the Updated listener re-fired on
every edit (incl. the agent's own note-append). Live Updated webhook (id=2)
already deleted via OCS API.
Surfaced while installing the nextcloud-todos-api plugin (a pod roll):
- 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself
(commit 4b39cb72) -> crashloop on any restart. Pinned image back to 2026.2.26.
- startup steps (plugins enable / mcp set / memory index) backgrounded +
timeout-guarded so a hung npm-install can never block the gateway.
- install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull:
IfNotPresent served a stale cached :latest, so the plugin manifest
(configSchema) fix never landed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Frees a per-node SCSI-LUN slot on node6 (20->19, under the check #47 >=20
WARN). FreshRSS extensions are static plugin files (no embedded DB; app DB is
external MySQL) -> NFS-safe. Empty volume (re-installable). Applied
deadlock-safe: -target deployment+module first (Recreate releases old PVC),
then full apply destroys the now-unused proxmox PVC.
Frees a per-node SCSI-LUN slot on node6 (21->20). Volume holds only
config.json (no embedded DB) -> NFS-safe. config pre-seeded to
/srv/nfs/isponsorblocktv before cutover. RWO-destroy deadlock during apply
(TF deleting the in-use old PVC before rolling the deployment) was broken by
patching the deployment claim to the NFS PVC; TF reconciled to the same value.
His app lives in novelapp, but the dashboard injects his SA token
(system:serviceaccount:vabbit81:dashboard-vabbit81), while the existing
binding only granted the OIDC User vabbit81@gmail.com (OIDC blocked). Add the
SA as a second subject so the web dashboard (token-injector) can manage
novelapp. Verified: SA can list/create in novelapp; injector path returns 200.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
namespace-owners could read all tenants' pods/configmaps/etc cluster-wide
(read-only) via the broad namespace_owner_readonly role. Give the dashboard
SAs a dedicated dashboard-nav-readonly ClusterRole = namespaces + nodes (list)
only — enough for the dashboard namespace-picker/Nodes view, but no
cross-tenant resource reads. Own-namespace access (admin) unchanged. Verified:
gheorghe can list namespaces/nodes + full vabbit81, but list pods/configmaps -A
= no, other namespaces = no.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the
ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real
usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi.
prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup;
OOMs at 3Gi during WAL replay) so it stays; grafana's main container is
already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi
Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the
WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit
(bump the 20Gi quota) as the cluster expands.
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but
the alert threshold was 32d (2764800s) -> it false-fired every year for the
~3 days between day-32 and the next run. Raised to 40d (3456000s): clears
the max first-Sunday spacing with margin, still catches a genuinely missed
monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified:
live rule now > 3.456e6, alert state inactive.
Elevates the shared claude-agent-service pod (SA claude-agent, ns
claude-agent) so the nextcloud-todos-exec agent can run autonomously.
Viktor explicitly chose to elevate the SHARED service knowing every
agent on the pod inherits these creds — each grant is security-sensitive
and flagged inline for review.
Vault (stacks/vault/main.tf):
- terraform-state k8s-auth role: add `claude-agent` to
bound_service_account_names (was only `default` — the pod's own SA
token could not log in, so scripts/tg apply died fetching the PG
backend password). `default` kept.
- terraform-state policy broadened from `database/static-creds/pg-terraform-state`
read only to read on database/static-creds/*, database/creds/*,
secret/data/* and secret/metadata/* — what stacks read at plan/apply
time. FLAG: grants the shared pod broad Vault READ (effectively all app
secrets + rotating DB creds); not denied: secret/data/vault.
claude-agent-service stack (stacks/claude-agent-service/main.tf):
- ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token,
viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url).
- git-init: add url.insteadOf rewrite to authenticate git pushes to
forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API).
- New claude-agent-exec ClusterRole+Binding: cluster-wide
get/list/watch/create/update/patch/delete on core (incl. secrets),
apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the
existing read-only claude-agent role; does NOT bind cluster-admin. FLAG:
very broad — close to cluster-admin in blast radius.
- Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher
sidecar (k8s-auth login role=terraform-state every 30m -> shared
emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths.
- MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP,
${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster).
Not applied, not pushed — code only, for human review of the privilege grants.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The monthly wealthfolio-sync CronJob mounts the same RWO
wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the
always-running wealthfolio app Deployment. RWO attaches to only one node,
but the sync had no affinity — so the 2026-06-01 run landed on node4 while
the app held the volume on node3 and hung in ContainerCreating for 3 days
(FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN.
Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the
sync always schedules onto the app's node, where the volume is already
attached (RWO permits multiple pods on the same node). Verified: a fresh
sync run co-located on node3, attached cleanly, and broker-sync started.
Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me
was Home-Server-Admins-only. Carve-out: the dashboard host now also admits
kubernetes-admins/power-users/namespace-owners so they can reach the login page;
per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf).
All other admin-only hosts unchanged. Policy adopted from UI into TF via import.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner
(from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s)
+ cluster read-only, plus a long-lived token to paste into the dashboard
'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency.
Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.
Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
header to bypass Chrome's hardcoded DNS-rebinding protection (no
`--remote-allow-hosts` flag exists in stock Chrome 130; verified by
binary string grep). Bridge also forces Connection: close on HTTP
responses so Node ws opens a fresh TCP for the WS upgrade rather than
trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
CDP, dumps storage_state() (cookies + localStorage), writes atomically
to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.
Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
longer used by code; ExternalSecret kept for symmetry with the snapshot
endpoint).
Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
snapshot via the bearer-gated HTTPS endpoint.
Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.
Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
Dashboard back to the working forward-auth + kong-proxy state. The
oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects
ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured
AuthenticationConfiguration), despite verified signature, issuer, audience,
email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated
apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app
left deployed (idle, harmless) pending that.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the
Nextcloud Personal task list; classifies todos via local qwen3-8b and routes
research/mutating work through claude-agent-service). Clones the
recruiter-responder service pattern end-to-end. Written only — NOT applied.
- stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning
recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled),
two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate
deployment with alembic-migrate init-container, ClusterIP svc, /cb-only
HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register
null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated
events -> internal svc URL, Bearer auth).
- stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d
rotation) + pg-nextcloud-todos in the postgresql allowed_roles array.
- stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource
(clone of pg_tripit_db) — creates role+DB, pins role search_path, and
creates schema nextcloud_todos AUTHORIZATION nextcloud_todos.
- stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container,
nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins
enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path
ESO key (secret/nextcloud-todos.webhook_bearer_token).
- stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP
monitor. Prometheus /metrics scrape via svc annotations in the new stack.
- .gitleaksignore: allowlist two curl-auth-user false positives (the OCS
webhook curl uses a Vault-sourced shell var, not a literal credential).
KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add cp lines in the seed-beads-agent init-container so the two new
nextcloud-todos agent definitions (baked into the image at
/usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in
~/.claude/agents/ at pod start. Phase 3, task 3.3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two
plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded
DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because
its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM
(k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then
blocked every run → 0 grabs → MAMFarmingStuck.
NFS (nfs_volume module, matching the other 9 servarr apps) removes this
volume from the per-VM SCSI hotplug path entirely: it mounts over the
network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can
co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across
before the switch; verified a grabber run Succeeds on NFS on node4 with the
preserved dedup list (tracked IDs carried over). Lever #1 from
docs/architecture/storage.md "Per-VM SCSI-LUN cap".
The apiserver rejects the email username-claim when email_verified is false
(invalid bearer token 401). Authentik external/social users are unverified,
so the default scope-email mapping fails. Mirror the proven kubernetes
provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes
email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud
mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and
align oauth2-proxy scope to 'openid email profile groups'.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Provider had signing_key=null → Authentik signed id_tokens with HS256 and
served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature
verification (500 'failed to verify id token signature' on the callback).
Use the same 'authentik Self-signed Certificate' keypair the kubernetes
provider uses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Authentik group policy denied admins: it gated on kubernetes-* group
membership, but cluster access is email-based RBAC (User bindings from
k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets
cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group,
so the gate locked him out. Apiserver RBAC is now the sole gate — matching
the kubelogin CLI (authenticate freely, RBAC decides actions).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard
issuer) and applies each user's own RBAC via the apiserver multi-issuer
AuthenticationConfiguration. Committed so CI converges (uncommitted local
applies were being reverted by the Woodpecker terragrunt-apply pipeline).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped,
silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1
AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and
k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the
dashboard via SSO while keeping the CLI issuer working. Remote script
health-gates /livez and auto-rolls-back on failure (single master).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>