The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.
Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.
Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.
The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.
Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG
Postgres and already has hostname anti-affinity) + restore the PDB at
minAvailable=1 (safe now — the drain deadlock that justified removing it
only existed at 1 replica) + descheduler evict=false to stop the needless
5-min churn. All five disruption sources become zero-downtime rolling events.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to turn lodging prices on and stop using the fake provider.
Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging
scraper at the shared chrome-service browser over CDP (the namespace is already
admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging
code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after
that rollout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Auto-tagging silently no-op'd: the container env vars set in the deployment
shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader
does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with
no TAGS) made the scan select zero documents.
Strip the wizard-owned behavioural config (Paperless URL, AI provider, model,
scan interval, tagging flags) from the container env, keeping only
infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced
secret refs. The app's setup-written .env on the PVC is now authoritative,
so processing runs and tags all documents. Qwen3 thinking is disabled via
SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted real semantic search over his ~300 Paperless documents and
preferred a ready-made solution over building one. paperless-ai provides
local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus
LLM-driven auto-analysis/tagging.
Wiring:
- LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b
(OpenAI-compatible); embeddings + vector store are local on the PVC.
- Reads Paperless over the internal service via a dedicated `paperless-ai`
superuser token (Vault secret/paperless-ai); app-admin creds also in Vault.
- Encrypted PVC for /app/data (SQLite + ChromaDB + model cache).
- Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required).
- Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel.
Runtime config persists to the PVC .env via the app's one-time setup; the
deployment env vars are pre-fill/documentation only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edge-tts apply was blocked by the require-trusted-registries Kyverno policy —
a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-
trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission
with no policy change. Verified live: bg synth round-trips through Whisper
verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian
reply ("Добър ден! Добре съм, благодаря!...").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The landed portal-stt source still declared the setup_tls_secret module +
tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this
stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the
sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live
deployment never had it (removed during the original apply); this aligns the
source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt
apply failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.
Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.
portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.
The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor has no near-term plans to use the autonomous AFK pipeline's in-cluster T3
cockpit/executor, so stop its pod to free node resources while keeping it
trivially revivable. Only the deployment replica count changes (1 -> 0); the SSD
PVC (state.sqlite + repo checkouts), Service, Ingress, and ExternalSecret are all
left in place — reviving is just setting replicas back to 1 and applying.
Already applied live via scripts/tg (PG state now 0 replicas, pod terminated);
this commit syncs git so drift-detection / the next apply won't re-scale it up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around
the master upgrade so it can't crashloop during the apiserver blip + I/O-storm
kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore
was a plain line at the end of the phase, so any abort AFTER quiescing left the
operator at 0 — and the idempotent retry then skipped the already-on-target
master phase and never restored it. Observed 2026-06-17: a post-upgrade gate
aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane
fine — calico-node keeps running — but no Calico reconciliation).
Fix:
- Drain first (drain doesn't blip the apiserver), THEN quiesce right before
`kubeadm upgrade apply`, and install an EXIT trap that restores the operator
no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success).
Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared
after the explicit happy-path restore.
- postflight also force-restores replicas=1 as a final guarantee (covers the
skip-on-retry path that never quiesces or restores).
Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)
- stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
- scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
(was next_daily_noon_utc).
- docs (runbook, architecture) + upgrade-state SKILL: schedule references
updated to 23:00 UTC nightly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.
Refactor (upgrade-step.sh):
- Worker set + order derived live from `kubectl get nodes` (worker_nodes /
next_pending_worker), so EVERY worker still off-target is upgraded and a
newly-joined node is covered with zero script change.
- SSH targets are node InternalIPs (ssh_target), removing the dependency on
node DNS records entirely — a new node is reachable the moment it joins.
- The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
enumerate workers/all-nodes dynamically too.
- Topology preserved: master-drain Job runs on the first worker; every
worker-drain Job runs on the already-upgraded k8s-master (self-preemption
invariant intact).
- next_pending_worker returns 0 explicitly on the no-match path — the
`while read … done < <(…)` loop exits 1 at EOF, which under set -e would
abort the LAST worker's Job before it spawns postflight (cluster upgraded
but no cleanup / in_flight reset). Caught in review.
Docs (runbook + architecture + headers) updated to the dynamic topology.
NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing
the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The
k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log`
subresource, so the read failed with Forbidden in the default ns and aborted
preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale
out-of-band grant had masked this until a `terragrunt apply` in this session
reconciled the role back to its TF definition. Codify pods/log:get.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade
apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's
kubeadm binary is on the target version (the first attempt's apt step already
bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under
set -euo pipefail and abort:
- preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the
target version; the failing pipeline killed the whole preflight.
- update_k8s.sh master path line 85 — bare `plan` before the apply.
Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins.
Verified read-only on master: plan exits 0 and still emits
"kubeadm upgrade apply v1.34.9".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").
- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
`--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
--skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
our custom split-horizon Corefile with kubeadm's default AND downgrade the
image; --skip-phases leaves CoreDNS 100% untouched while the control plane
upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fb638cd8 landed as two commits; the apply pipeline diffed against HEAD~1 (the
monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the
retry-on-failure logic isn't live yet. This single-commit retrigger forces it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:
* the detection guard and spawn_next only checked whether the phase Job
EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
* the abort happens before the in-flight metric is pushed, so neither
K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
"never ran" while actually being stuck.
Fixes:
D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
(upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
instead of skipping it, so a transient gate self-corrects next cycle
rather than wedging the pipeline for a week.
D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
web-terminal is not cluster infrastructure and must not block upgrades.
D3 observability: new K8sUpgradeChainJobFailed alert
(kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
a Failed chain Job as "chain failed" — closing the pre-in-flight blind
spot so a wedge is visible immediately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline 214 failed: the pinned goauthentik 2024.x provider models EmailStage.token_expiry as an integer, but the live 2026.2.x server requires a duration string ('hours=24') and 400s any number (even the provider default 30). Bumping the provider is a global terragrunt.hcl change re-applying every platform stack + breaking 3 other authentik-using stacks' lockfiles — disproportionate. Instead the two email-verification stages + their flow bindings move into an Authentik blueprint (tripit-email-stages.yaml) applied server-side via authentik_blueprint; the server parses token_expiry natively. Validated on the live server + terraform validate. Restores the ADR-0020 email-verification security gate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Makes the WebLanding 'Sign up' button work (it was 404ing — the tripit-enrollment flow didn't exist). Open passwordless registration: prompt(email,name) -> user_write(INACTIVE, external, group 'TripIt External') -> email verification (activates) -> passkey -> login. The inactive-until-verified gate is the security boundary: tripit trusts X-authentik-email, so activation must require proving inbox ownership. Passwordless login already works via the built-in webauthn flow. tripit-recovery (email -> new passkey) is built but intentionally NOT wired into the global brand recovery, so admin recovery is unchanged. Schema validated with terraform validate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
All of Anca's photos are imported. The Job was declared as
kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the
6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver.
Recovery actions taken before this commit:
- Throttled nfsd 64→8 on PVE host to give apiserver headroom
- `kubectl delete job -n immich anca-elements-import` + force-delete pod
- Restored nfsd to 64; cluster healthy
Code change here:
- Remove `kubernetes_job_v1.anca_elements_import` block
- Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` —
no live consumer; videos batch deferred per user, source dump remains on
PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
- Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
or a runbook-captured `kubectl create job` ad-hoc invocation instead).
Per Viktor: show the whole Immich library from the last 2 years instead of the
single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and
add the weather overlay (London, metric). OpenWeatherMap key is read from Vault
(secret/immich -> frame_weather_api_key), not hardcoded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B *vision*
model) for text extraction, which reliably dropped the flight number — so the
matcher had no key to link on and a forwarded flight update created a duplicate
instead of amending the existing segment.
Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it
emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b
for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on
demand; the GPU swap cost is accepted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The monitoring stack apply was create-failing on every push with `configmaps
"alert-digest-script" already exists` + `secrets "alert-digest" already exists`
(modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell
out of Terraform state, so apply tried to CREATE them and errored. Pre-existing
(failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add
import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles
instead of failing. Idempotent once imported; safe to remove after a green apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.
- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
the real-user browser-session->bootstrap fallback rate is observable. A
non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
(real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
the post-mortem's open "nothing monitors end-to-end pairing" detection gap.
The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).
- chrome-service: note the live pod's container order had drifted from this file's
order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but
grabbing 0 is normal: the grabber searches a random catalogue offset each run and
legitimately finds nothing when freeleech is dry (account ratio was a healthy
37.5; the alert even misreported it as "0.00" because $value was the grabbed
count, not the ratio). The alert's real intent was to catch the grabber not
running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0
cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway
serves the last pushed value forever.
The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run
(main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert
fires only when that heartbeat is >4h stale — the true stuck condition. Cookie
expiry and qBittorrent-down keep their own dedicated alerts.
Surfaced by /cluster-health as a false-firing alert.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chrome-service container (container[0]) runs the pinned Microsoft Playwright
image, which ships chromium under /ms-playwright. Its image was still listed in
the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before
ADR-0002 #29 moved the novnc container to TF management. With that field ignored,
a stray clobber of container[0] to ghcr chrome-service-novnc:latest (which has no
chromium there) stuck permanently: the container crash-looped ~12h on "chromium
binary not found under /ms-playwright" (273 restarts) and TF could not revert it.
Remove container[0].image from ignore_changes so Terraform pins it to local.image
and re-asserts it on every apply. Both containers are TF-managed now (novnc since
ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here.
Surfaced by /cluster-health. Live state was already restored transiently via
kubectl set image; this commit makes the fix durable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of "the agent never commits": the issue-implementer CLAUDE.md was
subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude
root-owned. The agent (uid 1000) then couldn't create its Bash session-env
there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited
but never committed). Found by reading the agent transcripts from
state.sqlite -> projection_thread_messages.
Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway).
Behaviour is injected via the dispatch message preamble by the control plane;
files/issue-implementer-CLAUDE.md kept as the canonical source text.
Verified post-fix: a preamble-dispatched task edited README and COMMITTED
(073ab28) unattended.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which
didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts
t3 serve, and on a cookieless (already Authentik-gated) document load it mints a
pairing credential (`t3 auth pairing create`) and exchanges it at
/api/auth/browser-session for the t3_session cookie, then 302s back. Everything
else — including WebSocket upgrades for the live cockpit — reverse-proxies to
:3773. The Service now targets the sidecar (:8080).
Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA.
Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the
cockpit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2,
which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and
the pod crash-looped (~160 restarts / 13h).
Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the
policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is
opt-out, so it stamped policy=patch and Keel acted on it.
Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the
Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays
TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The docker.io fix created the deployment, but wait_for_rollout (default true)
then hung on the OOMing pod and the apply failed — leaving the deployment in
the cluster but NOT in terraform state, so every later apply hit
'deployments.apps "anisette" already exists'. Deleted that orphan and set
wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness
probe still gates Service traffic.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated
in-cluster T3 Code instance the control plane dispatches issues into; runs the
issue-implementer agent in a git worktree with a live cockpit. Applied + live
2026-06-14 (9 resources).
Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude
CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer
behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system
prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so
unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false
(slow first start). Image fully-qualified for the Kyverno trusted-registries
allowlist; container mem limit 4Gi (tier-aux LimitRange cap).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pod CrashLooped with OOMKilled (exit 137): anisette downloads and
initializes Apple's CoreADI provisioning library on startup, spiking past the
128Mi limit before it can bind :6969 (empty logs, liveness 'connection
refused'). Bump request 256Mi / limit 512Mi; steady state is much lower.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256
ref isn't in the trusted-registries allowlist (only enumerated DockerHub
user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit
registry prefix; still pulls via the 10.0.20.10 pull-through cache.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.
- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
releases / semver / sha tags), so pinned by manifest digest instead of a tag
per the "never :latest" rule. Pulled from DockerHub via the registry-VM
pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
download; upstream issue #23 means it doesn't preserve client auth across
restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
allow_local_access_only, ssl_redirect off) — SideStore is a native client
that can't do the Authentik cookie dance, same reasoning as android-emulator's
adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
publicly exposed.
Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>