Origin and forgejo had drifted since 2026-05-05 (merge base b45c45e4).
Each remote was receiving Viktor's commits independently — origin since
2026-05-23 and forgejo from 2026-05-06 to 2026-05-22 14:15. Both had
~30 substantive commits. This merge brings forgejo's work into the
local branch.
13 conflict files resolved as follows (all favoured HEAD = origin/local,
which is newer in every case):
- secrets/{fullchain,privkey}.pem — kept HEAD (renewed 2026-05-24,
vs forgejo's 2026-05-17 renewal)
- stacks/blog/main.tf — kept HEAD (ingress-www intentionally removed
today after DNS+monitor cleanup; forgejo had the old block)
- stacks/xray/modules/xray/main.tf — kept HEAD (vless dropped today
as dead ingress; forgejo had the old 3-port service)
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh — kept HEAD
(allowlist refactor, master-phase idempotency skip, tigera-operator
quiesce/restore, IngressTTFBCritical ignore — all newer than forgejo)
- stacks/k8s-version-upgrade/main.tf — kept HEAD (deployments/scale
RBAC, oldest-kubelet detection — both added 2026-05-23)
- scripts/update_k8s.sh — kept HEAD (--etcd-upgrade=false fallback)
- stacks/llama-cpp/main.tf — kept HEAD (KEEL_LIFECYCLE_V1 ignore_changes
block added today, commit 0b1282a1)
- stacks/openclaw/main.tf — kept HEAD (nim/meta/llama-3.1-70b primary)
- stacks/trading-bot/main.tf — kept HEAD (claude-haiku-4-5 pin +
kevin-signal-bridge container)
- stacks/postiz/modules/postiz/main.tf — kept HEAD (memory 2Gi/3Gi
bump, despite postiz being destroyed today — kept TF intent)
- stacks/nvidia/modules/nvidia/values.yaml — kept HEAD (mem 822Mi)
- stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl —
kept HEAD (richer alert list + raised StatefulSet `for: 3m`)
- stacks/kyverno/modules/kyverno/security-policies.tf — kept HEAD
(expanded registry allowlist + comments)
- docs/architecture/security.md — kept HEAD (detailed W1.7 analysis)
- docs/plans/2026-05-21-ha-control-plane-design.md — kept HEAD
(178-line superset incl. 2026-05-23 deferral rationale)
Auto-merged (no conflict): broker-sync, claude-agent-service,
cloudflared, mailserver, n8n, technitium, traefik, url, proxmox-csi,
xray (deployment portion). Brings in forgejo-only substantive commits:
fire-planner, openclaw v3 flow + recruiter-responder wiring, several
k8s-version-upgrade hardening passes (kill-switch, RecentNodeReboot
ignore, pipefail fixes), HA control plane design, security wave 1
expansion to tier 3+4, alloy file-tail switch, prometheus scrape 2m,
authentik replica cut, forgejo archive disable.
Meta: forgejo and origin drift is a coordination bug. Going forward we
need to either (a) have one CI mirror to the other, or (b) standardize
on one remote. Filed mentally; not addressed in this commit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every `tg apply` was reverting the annotations that keel patches when it
detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped),
`keel.sh/update-time` (on the pod template; what actually triggers the
rollout), plus the K8s-managed `kubernetes.io/change-cause` and
`deployment.kubernetes.io/revision`. The revert forced a rollout, then
the next keel poll re-stamped the annotations, forcing another. With
llama-swap's ~10s cold-load on each pod recreate the user noticed.
Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag —
keel still drives one legitimate rollout per day at ~07:25 UTC; this
patch stops the apply-driven extra rollouts on top of that.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5th worker container running in audit-only mode. Writes
kevin_signal_bridge_state rows showing what it WOULD trade but never
publishes to signals:generated. Kill-switch flipped in Phase 2.
The xray-vless ingress, Service port 6443, and container port 6443 had
no backing listener — xray.config.json only binds 7443 (REALITY), 8443
(WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502
since the module was created.
Side effect: removing the first Service port slot ("vless"/6443) caused
the kubernetes provider to shift targetPort values on the remaining
two ports (defaulting only worked at create time, not on port removal).
Pinning target_port explicitly makes Service routing deterministic.
End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080
-> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all
three transports proxied successfully through a test pod, egress IP
correctly resolves to the home WAN.
krr 2026-05-22 flagged postiz-app as critically under-requested when it
was running (gap 2.2 GiB above the 512Mi request). Postiz is currently
uninstalled in the cluster — this change is only for when the stack is
re-deployed later. No apply triggered now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-
requested by ~588 MiB across the cluster. Live usage around the
80-128 MiB mark for active log parsing — 64 MiB request risked eviction
ahead of more-needed pods. Limit stays at 512 MiB.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged nvidia-driver-daemonset as critically
under-requested (~566 MiB gap). Live driver process holds ~600-800Mi
once the kernel module is loaded. Limit stays at 2Gi so the DKMS build
during a kernel upgrade still has headroom (documented in values.yaml
to need ~1.4 GiB peak).
May help unblock code-8vr0 (GPU driver crashloop on node1) if the
crashloop was OOM-driven.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working
set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB
request meant scheduler didn't reserve enough room and the pod risked
eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB
during unlock (memory id=712 + already-documented in the limits=1280Mi).
Request was 64 MiB — meaning the unlock burst ran "best-effort", first in
line for OOM under node pressure. krr 2026-05-22 flagged this as a top
under-request. Bumping request matches the documented requirement.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The www subdomain was internal-only (no Cloudflare DNS record) but the
external uptime-kuma monitor still flagged it as down because public DNS
resolution failed. Removing the ingress along with the Technitium CNAME
makes the failure mode disappear and lets the cluster reach an
autonomous-clean state.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest
was cache headroom. Primary keeps more (1 GiB) since it owns authoritative
zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the
MetalLB external IP (10.0.20.201) post-rollout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request)
so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation
was measured idle and is not safe for an active job.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with
the live Keel-managed version to avoid an inadvertent downgrade on apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod
working set is ~80 MiB; 512Mi leaves comfortable headroom for the
Symfony+RoadRunner footprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.
Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.
Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.
In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous version-check read RUNNING from .items[0].nodeInfo.kubeletVersion
— which is just k8s-master. If master is upgraded but workers aren't
(e.g. a chain that completed master phase but failed mid-worker), the
version-check sees v1.34.8 and decides "no upgrade needed", never
spawning the resume phase. Workers stay behind forever.
Today's chain hit exactly this: master + node4 upgraded to v1.34.8,
worker-node4 Failed mid-soak (alert sensitivity, since loosened),
chain dead. Re-triggering the version-check looked at master only,
decided cluster was "done", and refused to resume worker chain.
Fix: read all node kubelet versions, sort -V, take head -1 (oldest).
A partial chain now correctly reports the un-upgraded version and the
chain resumes.
Trivial change; tested live — chain now correctly reports v1.34.7
(workers' version) and spawns preflight → master → worker chain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.
10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.
node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.
- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
routinely shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
cause brief 5xx spikes that clear in 1-2m).
Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
- PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
- IngressTTFBHigh (Traefik latency, transient)
- NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
- RecentNodeReboot (chain causes this itself)
severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).
Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).
- architecture/vpn.md — Technitium component cell + AdGuard forwarder
example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
example
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).
Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).
Closes: code-da4h
Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.
This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:
- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
Pushgateway. Python+dnspython, ~30 LOC.
- 3 Prometheus alerts:
- `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
other than 10.0.20.200.
- `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
succeeding.
- `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
reported.
- Added the new alert names to the Slack receiver matcher in both routes
alongside EmailRoundtrip*.
Verified: rules loaded as inactive (apex is correct), metric flowing, manual
probe job pass observed.
Three changes unblocking the autonomous chain for k8s patch upgrades:
1. **phase_master quiesces tigera-operator before drain, restores after.**
Tigera crashes immediately if apiserver is unreachable (no retry logic)
and crashlooping it during master static-pod swaps generates ~500MB/s
disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
limit. Quiesce removes the storm contributor; calico data plane keeps
running unchanged (data plane is the DaemonSet+Typha, operator is just
the reconciler).
2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
writes an identical manifest, hash doesn't update, watch times out and
rolls back forever. The skip-etcd retry sidesteps it for the legitimate
no-change case while still doing a full etcd upgrade on the first
attempt (correct for minor-version bumps).
3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
Both are symptoms-not-causes: ingress latency spikes briefly during any
pod-restart wave; high IOwait is exactly what upgrade activity causes
(chicken-and-egg). The inline quiet-baseline check (Ready transition
<10min) is the real cluster-churn gate.
RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.
These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Browsers accumulate one authentik_proxy_<random> cookie per Authentik
Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the
combined Cookie header exceeds nginx's default 4 x 8k
large_client_header_buffers and trips '431 Request Header Fields Too
Large' at the forward-auth nginx (traefik/auth-proxy).
Bumped to:
client_header_buffer_size 8k
large_client_header_buffers 8 64k
Matches the pattern used on the London Flint 2 router nginx
(memory id=647).
Investigated, designed, and planned the 3-master HA control plane
migration triggered by 2026-05-21's autonomous k8s upgrade cascade.
Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack
multi-master refactor, HTTPS /readyz health check, expanded blast
radius to include /home/wizard/code/infra/config root kubeconfig,
config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector,
k8s-version-upgrade chain extension as Phase 7)
Plan covers 11 phases end-to-end including panic-mode rollback.
DEFERRED before execution. PVE host is 98% RAM-committed
(262 GB allocated / 267 GB physical, 1.5 GB swap active); the
planned 3 x 32 GB masters would push allocation to 326 GB and OOM
the host. k8s-master currently uses only 4.6 GB of its 32 GB
allocation (5-6x oversized).
Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.
Standalone candidate worth lifting independently: Phase 1.5's
rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning
to loop over k8s_master_hosts list) — future-proofs the cluster
without committing to the HA migration.
Refs: code-n0ow (open, deferred via bd note).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth
bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour
quota. Haiku-4-5 has much higher RPM and processes the 16-video
backfill cleanly (~30s/video with inter-call throttle).
Comment above the env line documents the rationale for future re-evaluation.
Remove leftover SendGrid references after the Brevo migration was completed:
- Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`,
SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge
once the CNAME is removed).
- Clean up commented-out `smtp.sendgrid.net` relayhost references and the
`# For sendgrid` comment on `sasl_passwd` in the mailserver module.
DNS records deleted out-of-band (not TF-managed):
- CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries)
- Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`,
`s2._domainkey` CNAMEs → sendgrid.net
Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive,
roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey`
local + `brevo1/brevo2._domainkey` Brevo) untouched.
Auth audit on 2026-05-22 — all the broken paths and the one that works:
- openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com)
- secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota
- openrouter_api_key: "Key limit exceeded (total limit)"
- llama_api_key: region-blocked
- anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real
x-api-key — won't auth via x-api-key header)
- nvidia_api_key (NIM): WORKS. The key was already baked into the
openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key.
Two NIM models verified end-to-end (call from inside openclaw pod
with tool-call schema, both returned proper {tool_calls:[…]} JSON):
- meta/llama-3.1-70b-instruct — 0.58s, primary
- meta/llama-4-maverick-17b-128e — 16s, smarter, fallback
Fallback chain: maverick → openai-codex (auto-promotes once re-authed)
→ modelrelay/auto-fastest (last resort, hallucinates instead of
tool-calling, but at least responds).
Models registered in both `agents.defaults.models` (allowlist) and
`models.providers.nim.models` (capability declarations) so the agent
sees them as available tools. Startup `models set` updated to pin
the new primary across `doctor --fix` runs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.
## Findings
**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379
**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
permanently or move to dedicated egress proxy
## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.
## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
+ vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.
Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.
Added entries:
- amruthpillai/* (resume — reactive-resume)
- athomasson2/* (ebook2audiobook)
- netboxcommunity/* (netbox)
- nousresearch/* (hermes-agent)
- opentripplanner/* (osm-routing)
- rhasspy/* (whisper, piper)
- registry.viktorbarzin.me/* (legacy private registry — council-complaints
still references; should migrate to forgejo)
The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.
## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now PASS:
- netboxcommunity/netbox:v4.5.0-beta1 ✓
- rhasspy/wyoming-whisper:3.1.0 ✓
- registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode
## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn:
CNPG default 'expiringCheckThreshold = 7' means the operator only
regenerates the self-signed webhook cert when remaining lifetime drops
BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result:
~23 days of WARN before CNPG would even attempt rotation.
Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the
operator now regenerates with 30d buffer, aligning with our monitoring
threshold. Cert lifetime stays at chart default 90d.
Verified after apply: operator runtime config shows
'expiringCheckThreshold:30'. Companion in-session action: deleted the
existing soon-to-expire secret and bounced the operator to force an
immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).
Two latent issues found while diagnosing why the May 2026 META vest
didn't land:
1. broker-sync-imap CronJob's command was 'broker-sync imap', but the
actual CLI subcommand is 'imap-ingest'. Every scheduled run had
been failing with 'No such command imap' since day-one.
2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775
group=10001. Without fsGroup in the pod's securityContext the
pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't
create journal/WAL files next to sync.db -- hits
'attempt to write a readonly database'. fsGroup=10001 adds the
matching gid to the pod's supplemental groups so writes work.
Schwab email-sender regex fix is in broker-sync@d860aef.
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:
stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
age toward the 1-year TTL boundary (within 7 days). Calls
python -m fire_planner col-refresh-stale, upserts via cache.upsert.
monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
$baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.
Initial population needs a one-shot: post-Keel-rollout,
kubectl -n fire-planner exec deploy/fire-planner -- \\
python -m fire_planner col-seed
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.
Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:
kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
-l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
-c openclaw -- node /app/openclaw.mjs models auth login \
--provider openai-codex
modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.
Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.
Updated:
PASS < 65 C (within 55-65 baseline)
WARN 65-82 C (elevated; check top kvm processes for the culprit)
FAIL >= 83 C (at/above TjMax — throttling imminent)
Verified live: 67 C now WARN (was PASS under the 75 C threshold).
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."
The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:
1. **SOUL.md** — new marker-delimited section "Learning is your
identity" inserted before ## Boundaries. AGENTS.md tells the
agent to read SOUL.md first every session, so this is now the
FIRST thing the agent loads about itself. Frames learning as
character, not procedure.
2. **TOOLS.md v4** — section moved from the END of the file to
right after the `# TOOLS.md` title (first substantive content
on file load). Title strengthened: "THE FLOW — run this on
EVERY task. Not just hard ones." Concrete examples explicitly
call out diverse domains (calendar, frigate restart, disk
usage, inbox summary, deploys) so the universality is
unmistakable.
3. **learn-from-tasks skill** — opens with "This is universal.
EVERY task runs through this flow — not just hard ones, not
just unfamiliar ones. The save at the end is mandatory."
The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.
Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.
Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
viktorbarzin/trading-bot-service:latest, command
python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
(poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices
vault: re-enable pg-trading static role
- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
(was disabled 2026-04-06 with the rest of trading-bot stack)
kyverno: add postgres* to trusted-registries allowlist
- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
alpine*, nginx*, python* which were already there)
Final workers Pod containers (in order):
[0] signal-generator
[1] learning-engine
[2] market-data
[3] meet-kevin-watcher (NEW)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.
## The flow
1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
share the steps + credentials needed so I can do it on my own
next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
that's OK.
## Storage moved to memory-indexed location
Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:
- `scripts/<task>.md` runnable recipes
- `knowledge/<topic>.md` decisions, paths, gotchas
- `credentials/<name>.md` **POINTERS to Vault, never values**
## Credentials = Vault pointers only
Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.
## Init container also migrates
Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:
1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
- "TRY DEVVM before giving up" — when stuck, ssh devvm before
telling the user "I can't do that".
- "After every task, introspect → save a faster way" — for any
non-trivial task (especially recurring ones), save the recipe
to /workspace/learned/ and update INDEX.md.
2. New cc-skill `learn-from-tasks` at
/home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
both triggers: (A) you're stuck → check INDEX → ask devvm → save;
(B) you just finished → introspect → save if recurring.
3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
INDEX.md BEFORE reaching for devvm, so saved recipes are
findable on the next run.
4. Marker migration: strips both v1 and v2 markers before re-inserting
so user edits outside the markers always survive future restarts.
Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).
Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.
The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Give the OpenClaw pod two new capabilities:
1. Host-tools bundle. New init container `install-host-tools` extracts
openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
friends into /tools/host-tools/, with the bookworm-slim libs the
binaries need. PATH + LD_LIBRARY_PATH on the main container point
ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
marker; smoke test (ldd-based) fails the init at deploy time if any
binary has unresolved deps. Bundle is ~558 MB on the existing
/srv/nfs/openclaw/tools NFS.
2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
container startup symlinks /home/node/.ssh → there. New
/usr/local/bin/openclaw-task wrapper on devvm manages long-running
work as tmux sessions on devvm (sessions and logs survive pod
restarts — they live on devvm, not in the pod). New init container
`seed-devvm-memory-note` drops a markdown note teaching the pattern;
main container startup now runs `openclaw memory index --force` so
the note is searchable on first boot.
Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.
Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.
#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
Thresholds tuned to the live TjMax=83 / Tcrit=93:
PASS < 75 °C package
WARN 75-82 °C (approaching max, action time)
FAIL >= 83 °C (at/above TjMax, throttling imminent)
Reports hottest core label too so a single hot core doesn't hide in
the package average.
#44 PVE Host Load — load avg vs 44-thread capacity
Reads /proc/loadavg, compares 5-min to thread count (44):
PASS load_5 < 30 (< 70% threads busy)
WARN 30-37 (oversubscribed but not saturating)
FAIL >= 38 (~85%+ threads busy — scheduler saturation)
Uses 5-min so brief work spikes don't false-fail.
Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:
1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
only 'instagram-standalone' connected; all other Postiz providers
were idle-polling Temporal task queues. List excludes x, linkedin,
reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
wordpress, nostr, farcaster. Keeps facebook + instagram + the
standalone variant active.
2. temporal deployment needs keel.sh/policy=never (set live via kubectl
annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
on every helm reconcile because :0.20.0 is published in the same
registry path but is a DIFFERENT (legacy Cassandra-based) image
stream. Memory id 1933 trap; new variant captured in id 2315-2319.
The annotation is set live (not in TF) because the existing TF block
has lifecycle.ignore_changes = [keel.sh/policy] so the chart
reconcile won't reset it. Long-term fix: add temporal to the
Kyverno keel-mutate-existing exclude list so it survives a
namespace re-label.
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.
Tracked in beads code-n0ow. Plan doc to follow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.
The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.
Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.
Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three changes from today's autonomous-pipeline validation session:
1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
start of version-check. Existence halts the chain (exit 0) with a Slack
message. Single-command emergency stop:
kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
--from-literal=reason="storm response"
Resume: kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
Role rule for `configmaps` get/list/watch added (resourceName-scoped).
2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
chain itself causes reboots. The pre-drain master check, post-upgrade
worker check, postflight check, and preflight halt-on-alert all now
pass `RecentNodeReboot` as the extra-ignore. Previously only worker
phase's post-upgrade gate did this. Master Failed silently this morning
on the pre-drain check after my own master reboot.
3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
Ready transition meant the chain refused to run for an hour after
every kured reboot. 10 min is enough for kubelet/control-plane to
settle; the 24h-between-cluster-reboots invariant lives in
kured-sentinel-gate, not here.
Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.
Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.
Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.
Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).
(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.
Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
PowerOutage) for stability above the new 2m global cadence
Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.
Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing — apiserver settles ~minutes after Prometheus
config reload due to initial-target-scrape burst.
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.
Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.
Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m
(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.
Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).
- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)
Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.
- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
manage their own Authentik objects
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
83ffd9fa).
Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:
Line 354 (phase_master): control-plane Running check —
`grep -v Running | wc -l` returns 1 when all pods are Running
(the happy path), aborting the chain right after master upgrades.
Line 419 (phase_postflight): on-target node check —
`grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes
are on the target version (the happy path, exactly when postflight
should succeed). Aborts at the moment of victory.
Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.
Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.
The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.
Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Change
- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with
kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'`
to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted
(apply_only=true means TF rename does NOT destroy the live old resource;
cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan
(cluster infra + GPU workloads, deferred)
## Verification (live cluster, 2026-05-19)
- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata
from multiple namespaces with distinct destinations:
- east-west pod-to-pod (10.10.108.48, 10.10.122.131)
- in-cluster service VIP (10.96.0.10 — kube-dns)
- external (149.154.166.110 — Telegram API from recruiter-responder)
## W1.7 next step (calendar-bound, ~1 week)
- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.
## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
`types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
`cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
SRC/DST/PROTO/PORT for every NEW egress connection.
## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`
The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).
## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.
Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.
Extend benign_re to cover:
- failed to check digest + (i/o timeout | connection refused
| connection reset | context deadline exceeded | TLS handshake
timeout | no such host | EOF)
- failed to check digest + non-successful response (status=5xx)
Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.
Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes to make the 8.4.8 pin durable:
1. Add `keel.sh/policy: never` annotation on the mysql-standalone
StatefulSet. The dbaas namespace was already excluded from the
Kyverno mutate, but the StatefulSet carried orphan Keel annotations
(force/poll/match-tag) from an earlier policy version that lacked
the exclusion list. Keel kept watching :8.4.8 for digest changes.
Now explicitly opted out; Keel logged "image no longer tracked".
2. Expand the inline comment to a banner pointing at the upgrade plan
docs and the gating beads task. Anyone touching this line sees the
warning + the path to do it right.
Closes the loop on the 2026-05-18 outage. Real upgrade tracked in
code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.
Not scheduled yet. Tracked in beads code-963q.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Runbook rewritten for the standalone setup (InnoDB Cluster gone since
2026-04-16) and now covers the full disaster-recovery flow we just
executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain
→ Delete), re-apply TF, restore via in-namespace Job, drop+create
static users with fresh Vault passwords, restart dependents.
CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose
data-dictionary upgrade got stuck mid-flight on every attempt
(no progress, no CPU, never completing). Pinning to 8.4.8 +
restoring from the 2026-05-18 00:30 UTC mysqldump puts us back
on a known-good binary.
Closes: code-eme8
Closes: code-k40p
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.
## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
because the kubectl provider couldn't patch it cleanly (resourceVersion=0
invalid for update — gotcha when adopting a resource previously
kubernetes_manifest-owned).
## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.
## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)
## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.
## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes: code-e2dp
Keel 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN is
set, then fails because we don't supply an `xapp-` app-level token:
bot.slack.Configure(): SLACK_APP_TOKEN must have the prefix "xapp-".
bot.Run(): can not get configuration for bot [slack]
We don't want the interactive bot — opt-out auto-update + no approval flow
(see stacks/keel/main.tf comment). The Slack NOTIFICATION sender works
independently and continues posting rollout messages to #general fine.
But /upgrade-state's broad `grep level=error` was counting these as real
errors → ⚠ on the Apps row every run. Add a small skip-pattern list so the
two recurring benign lines drop out; any new genuine Keel error still
shows. Reuses `bot.Run()` + `SLACK_APP_TOKEN must have the prev?if|prefix`
(typo in Keel's actual log message preserved as alternation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
`tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
from the chart's auditStorage), emits JSON audit events to stdout. kubelet
captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
audit lines from ESO token issuance, KV reads, etc.
## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.
- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x
## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).
## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
vault audit log entry shows Traefik's pod IP instead of the real client IP —
the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
for_each unknown issues unrelated to this change; -target documented in error
message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
within ~10s each. Verified XFF config visible in pod's
/vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
/vault/audit/vault-audit.log) — no change needed.
## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
from memory id=1970 + `frigate, kured, default, changedetection` discovered
during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
(current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.
**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.
## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Companion to the GHA migration in immovika/realestate-crawler@c2acbf5.
Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four
Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler
is private, the Deployments had no imagePullSecrets, and Keel's poll-secret
discovery list came up empty. Pods kept running only because the image
landed in containerd cache months ago.
Adds:
- ExternalSecret `dockerhub-pull-secret` synced from Vault
secret/viktor.dockerhub_registry_password. ESO template renders the
dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in
cleartext in any K8s manifest.
- image_pull_secrets { name = "dockerhub-pull-secret" } on all 4
Deployments (ui, api, celery, celery-beat).
- Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts
:latest. CI no longer patches the image to a numeric tag — Keel now
drives rollouts from digest changes on :latest.
Live state after apply: all 4 Deployments on :latest with
imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True.
Once a GHA build pushes a new digest, Keel will roll all four within ~1h.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Removed panel 27 (META RSU vest value over time) — superseded by
vest-cadence chart which carries the same value signal plus the
share-count overlay.
- Removed panel 28 (per-vest value at vest vs today) — duplicative with
panel 31's FIFO realized PNL.
- Removed panel 29 (per-sell realized PNL) — same data as panel 31,
just rolled up by sell date instead of vest date.
- Resized panel 26 (Positions) to w=12 and moved panel 30
(META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side
next to the Positions table.
- Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU
chart used to live.
Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads
code-8ywc and follow-up commits. Captures:
- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies
with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the
K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7,
S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy
Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1)
and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action
steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity
allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy,
rationale for not adopting canary tokens.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New table panel below the per-sell breakdown. For each vest, FIFO-match
its shares against the subsequent sells (shares from earlier vests get
sold first), and aggregate the matched portions:
realized_pnl = SUM(matched_qty * (sell_price - vest_price))
pnl_pct = realized_pnl / SUM(matched_qty * vest_price) * 100
days_held = AVG(sell_date - vest_date) per matched portion
Footer reducer sums shares, vest value, sell value, and realized PNL
so the bottom row is the full-portfolio realized take.
Per-vest event line chart. Left Y axis (blue): vest value at the
time = SUM(quantity * unit_price), in USD. Right Y axis (orange):
number of shares vested. One point per vest date (aggregated when
multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares).
Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 ->
60s) and how the per-vest USD value tracked META's price ride
across 2020-2026. timeFrom='6y' override pins the panel to the full
vesting window.
Two new bottom-of-dashboard tables:
Panel 28 'META vests — value at vest vs today': one row per BUY
activity. Shows vest-day price * shares + what those same shares
would be worth at today's META quote, plus the hypo P&L if Viktor
had held everything (color-text on the gain columns).
Panel 29 'META sells — realized PNL vs if held until today':
one row per SELL with FIFO-matched cost basis (LEAST/GREATEST
overlap in cumulative-share space). Shows realized P&L, the
counterfactual P&L had he held until today, and the
'missed by' delta = (today_price - sell_price) * shares.
Both pull today_price dynamically from quote_latest via a CTE so
they self-update as Yahoo updates the META quote. Schwab account
is empty so no live activity is expected.
Dashboard default time range is now-180d, but the META vesting + sell
arc spans 2020-11 → 2026-02. With the default window the panel just
showed a flat line at $64 (the empty post-sell residual). timeFrom='6y'
override makes panel 27 always render the full vesting curve regardless
of the dashboard-level time selector.
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by Forbid + deterministic job-name idempotency (the detection
job exits early if a preflight Job for the same target already exists),
so back-to-back days can't pile up parallel runs.
- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
(now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Daily total_value timeseries for the Schwab workplace account
(account_id 72d34e09-...). Single-asset account holding META RSUs
that vested 2020-11 → 2026-02 and were sold opportunistically over
the same window. Currency USD (account_currency). Yahoo quote on
META powers WF's daily mark; the historical DAV mirrored into
wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.
- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
container port 9300. Surfaces pending_approvals,
poll_trigger_tracked_images, and registries_scanned_total{image}
so the skill has a real timeseries (also opens the door to a
future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
(Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
SSH fan-out (parallel subshells) to all five nodes for apt
state + reboot-required + uu log; Prometheus query for Keel;
Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
~/.claude/skills/upgrade-state/SKILL.md so the skill is
discoverable from both monorepo-rooted and global sessions.
Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Task 1's recovery from the broken `:latest` image rollout left
keel.sh/policy=never set imperatively via `kubectl annotate` — out of
TF, which violates the "all infra via TF" rule. Now codified alongside
match-tag, trigger, pollSchedule. Removed those three keys from
ignore_changes (was the original "Keel manages these" pattern, no
longer correct for this deployment).
Also added KYVERNO_LIFECYCLE_V1 ignore_changes on the presence_schema
migration Job so future applies don't try to replace it over the
Kyverno-injected ndots dns_config.
Verified: 0 added, 3 changed (unrelated pre-existing drift on
beadboard/workbench/service), 0 destroyed. Dolt pod uninterrupted
(revision 13 preserved).
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).
Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).
Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.
Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the schema for the new agent presence board. Live Dolt is updated
via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC
init.
Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server
currently resolves to 0.50.10, whose docker-entrypoint.sh references an
undefined docker_process_sql function and crash-loops on every init
script in /docker-entrypoint-initdb.d. Keel can still bump this tag
in-cluster (image is in lifecycle.ignore_changes).
Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run ran post-26.04 upgrade today:
1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
Unqualified `k8s-master` falls through all of those and then queries
upstream Technitium for the bare name → NXDOMAIN. The FQDN
`k8s-master.viktorbarzin.lan` is what Technitium actually serves.
Suffix every node SSH target with `$NODE_DOMAIN`.
2. **envsubst missing**: claude-agent-service image doesn't ship
`gettext-base`. Replace `envsubst <template | apply` with
`python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
sys.stdin.read()))' <template | apply`. Same semantics, image
already has python3. Multi-line $SCHEDULING_BLOCK is preserved
correctly through expandvars.
Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and
started; killed before it touched the cluster (will land on Sunday
2026-05-24 12:00 UTC like the schedule says).
Root cause of why these bugs lay dormant: yesterday's first
manual-test detection found "no upgrade needed" so neither code path
exercised SSH or envsubst. Today's apt-source restore (do-release-
upgrade had mangled them) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
Per user decision, removed authentik, kyverno, metallb-system,
external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets,
infra-maintenance from the policy-level exclude list, and added
keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite
being earlier flagged as scaled-to-0) and woodpecker.
Net cluster coverage: 197/227 workloads on safe-force (86%), up from
170/227 (74%). All 197 are paired with match-tag=true (digest-only).
Remaining 7 namespaces in Kyverno exclude list (irreducible):
- keel (self-update)
- calico-system + tigera-operator (operator-managed Installation CR)
- cnpg-system + dbaas (state-coupled)
- nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships
ubuntu26.04 driver images)
- kube-system (k8s built-ins)
Files:
- stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list
trimmed from 16 → 7
- stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets,
servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker —
added keel.sh/enrolled=true label on kubernetes_namespace resource
- infra-maintenance was in the policy exclude but the namespace doesn't
actually exist in the cluster; the removal is a no-op there
Applied via kubectl patch on the live ClusterPolicy + kubectl label on
namespaces because the kubernetes provider v3.1.0 panics on Kyverno
ClusterPolicy refresh — TF source has the desired state for next clean
apply on a fixed provider.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.
Two-part change:
stacks/kyverno/modules/kyverno/keel-annotations.tf
Trim the policy-level namespace exclude list from 31 → 16. The 16
remaining exclusions are the irreducible cluster-operator + state-
coupled set: keel itself, calico-system + tigera-operator (operator
loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
dbaas (state-coupled), kyverno, metallb-system, external-secrets,
proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
kube-system, vpa, sealed-secrets, infra-maintenance.
stacks/<each-of-15>/.../main.tf
Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
resource so the Kyverno mutate policy can target the workloads via
its namespaceSelector matchLabels.
Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.
Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
- wireguard: sclevine/wg:latest (image fixed today via iptables-nft
postStart shim)
- xray: teddysun/xray
- crowdsec-web: viktorbarzin/crowdsec_web
- monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
- traefik: nginx:1-alpine, openresty/openresty:alpine,
ghcr.io/tarampampam/error-pages:3
- redis: haproxy:3.1-alpine, redis:8-alpine
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing
/etc/os-release to 24.04 so the operator picked the matching
ubuntu24.04 driver image (everything per the workaround documented in
docs/known-issues.md), the driver container still went into a restart
loop. Container status:
lastState.terminated: { reason: "OOMKilled", exitCode: 137 }
The driver-installer was hitting the namespace LimitRange default of
128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the
last log line on every restart was "Installing Linux kernel
headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step
enough headroom; peak observed during a successful compile in a test
container was ~1.4Gi.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the workaround applied on k8s-node1 today (kernel rolled back
to 6.8.0-117-generic, apt-mark hold on kernel meta-packages,
/etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and
the gpu-operator picks an existing ubuntu24.04 driver image), plus the
trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on
nvcr.io/nvidia/driver.
Linked from the post-mortem and from beads code-8vr0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET,
HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed.
- In-cluster only egress to paperless-ngx svc; no Cloudflare hop on
MCP-internal traffic.
- Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser)
in new `claude-mcp-readers` group with view-only Django perms; existing
279 docs bulk-granted view perm via /api/documents/bulk_edit/;
workflow #2 auto-grants the group on new docs (Consumption Added).
- Gateway-level bearer auth via new Traefik plugin
Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack
alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth`
pulls token list from Vault `secret/paperless-mcp/bearer_tokens`.
- Vault `secret/paperless-mcp` holds: paperless_api_token (synced to
K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens
(JSON array, read at plan time), bearer_token_viktor_laptop (mirror
for laptop wiring), paperless_user_password (paperless UI fallback).
- Image auto-update via Keel (semver minor policy, hourly poll).
- Ingress dns_type=proxied → Uptime Kuma external monitor auto-created
by external-monitor-sync CronJob.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).
Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.
Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.
Files:
- stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
explanatory comment
- stacks/nvidia/modules/nvidia/values.yaml — comment block
documenting the situation; driver pinned at 570.195.03
- docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
full timeline, root causes, recovery procedure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing:
iptables v1.8.4 (legacy): can't initialize iptables table `nat':
Table does not exist (do you need to insmod?)
sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's xt-tables. K8s nodes nowadays initialize their
nat table via nftables (calico-node sets it up), so iptables-legacy in
the container sees "no nat table" and bails. Reproduced by ephemerally
debugging the live pod's namespaces (kubectl debug --copy-to + same
mounts as the real pod) — wg-quick output matched verbatim.
Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.
This commit adds the long-term mitigation:
stacks/terminal/main.tf
- kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
/token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
last_success_timestamp). Verified the probe end-to-end:
token=302 ws=302 ttyd=200 ok=1
stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
- Webterminal group: WebterminalTokenDegraded (warning, 10m),
WebterminalWebsocketDegraded (critical, 10m),
WebterminalTtydUnreachable (critical, 10m),
WebterminalProbeStale (warning, 15m).
- Traefik Router Parity group: TraefikRouterCountSkew fires when any
Traefik replica's router count diverges from siblings for >10m —
catches the same class of issue cluster-wide, not just for terminal.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.
- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
the broken chart (defense in depth; nfs-csi namespace is already in the
Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
captures the root cause + recovery procedure (kubelet restart via
nsenter is the escalation path when crictl rmp -f fails)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729)
which is the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.
Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.
With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.
Phase 1 rollout state (2026-05-17):
- 10 workloads on force+match-tag in 10 namespaces (Phase 1)
enrolled via keel.sh/enrolled=true namespace label:
linkwarden, excalidraw, diun, echo, foolery, city-guesser,
jsoncrack, privatebin, ntfy, speedtest
- 216 workloads on policy=never (out-of-band kubectl annotate)
- 31 critical namespaces excluded at policy level
Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.
Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.
Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.
Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.
This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User: 'i'm happy with occasional breakages. we have alerts.'
Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).
Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.
Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).
Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi (the 384Mi
floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady
state).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.
Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).
Alembic 0003 added messages.reply_to_addr column.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.
With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
skipped — has non-standard lifecycle from earlier work).
* status-page: enrolled (was missing from original sweep).
* v6 retrigger marker on 17 stacks that never reached terragrunt
apply (#704 exit-1 halted mid-loop).
After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
+ KEEL_IGNORE_IMAGE; namespace label.
* llama-cpp: 1 Deployment — extended V1→V2; namespace label.
* novelapp: namespace label only (Deployment has non-standard
lifecycle without V1 dns_config — drift expected, accept for now).
* plotting-book: namespace label only (same as novelapp).
* trading-bot: namespace label only (same as novelapp).
immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.
The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work to fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.
The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.
Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.
Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.
claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.
Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.
Idempotent — re-applying a stack whose state already matches is a no-op.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).
Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.
Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.
CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.
Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.
Removing keel.sh/enrolled label and live label from calico-system ns.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.
Caveats:
* Patch causes Terraform image drift for semver-pinned services —
drift-detection pipeline will surface it; lifecycle ignore_changes
on container[].image can be added per stack later if drift is
noisy.
* Tags that aren't parseable as semver (:latest, :11, :nightly,
SHA tags) are ignored by patch — those workloads stay on their
current image until promoted to `force` policy individually.
Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
recruiter-responder, claude-agent-service, claude-memory,
chrome-service, fire-planner, job-hunter, payslip-ingest
Live state already updated via kubectl apply + per-workload patches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to 191ed5dd (new AI section in agent
template — leadership stance, approved tools, usage limits / quotas,
code-gen safety, product-side AI depth, follow-up questions for the
recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
for AI culture; warm_engage template adds a written-only ask for
IDE assistants, chat tools, per-seat limits, source-to-external
model policy).
Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.
Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.
To opt in a workload now:
1. Verify the Deployment image is on a mutable tag (:latest,
:<major>, or a vendor "stable" tag) — change in Terraform first
if needed.
2. Add to the Deployment's metadata.annotations:
"keel.sh/policy" = "force" (digest tracking)
OR
"keel.sh/policy" = "patch" (semver patch bumps — also
requires ignore_changes on the image)
Live policy already updated via kubectl apply + per-workload
override (force → never).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.
Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.
Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).
Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).
Services included:
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
Skipped from Phase 1 for follow-up:
- claude-agent-service (user has WIP on main.tf)
- claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
- kms (two Deployments; needs per-resource review)
- wealthfolio (sync sidecar pattern; needs review)
- chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
- GHA-migrated repos (10) (need per-repo CI cleanup)
- beadboard, freedify (no CI)
See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to f764fef6 (agent system prompt adds
the Perks block: food/health/pension/equity/PTO/parental/equipment/
learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
serves cached payload if fetched_at + ttl_seconds > now; cache
writes upsert; new force flag bypasses).
Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart
derived hostPath from configuration.rebootSentinel) and the companion
work to harden the rolling-reboot pipeline against single-replica
PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured
drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the
mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source
drift audit (benign).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:
kured (stacks/kured/main.tf):
- Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
a future PDB or finalizer stalls drain, kured retries forever and
the node stays cordoned silently. 30m caps the silent-failure
window — after timeout kured logs the abort and waits for the
next period; the node stays Schedulable so cluster capacity isn't
lost. Lets us fail closed instead of fail-silent.
CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
- Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
failover during a primary-node drain depended on the lone replica
being caught up; a WAL backlog would stall the drain until the
replica was current. With 3 instances CNPG always has at least one
fully-current replica to promote, and the PDB's
`minAvailable=1` on the primary selector is satisfied throughout
the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
35Gi after autoresize). Memory: +3Gi pod limit.
- Updated the `triggers.instances` so the null_resource's local-exec
actually re-applies the YAML (kubectl apply with the new spec). The
YAML is the source-of-truth but the trigger is what tells terraform
to re-run the provisioner.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
init container also copies recruiter-triage.md
into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
Standard root include + platform/vault/external-secrets dependencies.
Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the per-repo domain glossary that engineering skills
(diagnose, tdd, improve-codebase-architecture, grill-with-docs)
read before working in this repo. Terms only — no implementation
detail. Six clusters (code organization, cluster, networking,
storage, secrets, CI/CD), 22 terms, plus relationships, an example
dialogue, and five flagged ambiguities.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.
Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log says
"Reboot not required" for the entire period.
Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run — same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
the symlink chain works out: /sentinel/<file> inside the pod resolves
to /var/run/<file> on the host.
Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stremio stream aggregator now has its own row in the Active Use tier.
Captures the auth model (own UUID+password, not Authentik), monitoring
posture (canary probe + 3 alerts), and backup pipeline (weekly NFS
dumps of both decrypted config and the Stremio account addon
collection).
Follow-up from the 2026-05-15/16 hardening session: 5 commits on
servarr/aiostreams, none previously catalogued.
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):
- Logs into api.strem.io with credentials from Vault
(secret/viktor.stremio_email + stremio_password, now also synced
into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
(NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
addon_count,duration_seconds,last_run_timestamp}
Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).
Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
- Add ingress_factory module (auth=none, HMAC + expiry are the gate);
ingress_path=["/cb"] only — /api stays internal, /healthz cluster.
dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
.gitleaksignore — git-crypt encrypts at rest but the working-tree
copy is plaintext, so gitleaks can't tell.
Smoke-tested end-to-end 2026-05-15 23:45:
synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl
-> 'Sent' HTML page -> thread.status=sent, decision row recorded
with decided_via=telegram_button, outbound message threaded correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
(TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
Tamtaro's ISE/PSE/ESE-standard
Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.
User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}
NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).
Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.
Verified: manual job wrote 12931 bytes, file present on NFS.
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
(300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
recruiter-responder image otherwise drops to uid 10001 which can't
write or chown.
Smoke-tested end-to-end 2026-05-15 ~23:15:
Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
(12.1s, JSON output complete with company/role/salary/location/tech)
-> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- stacks/vault/main.tf: register pg-recruiter-responder static role on
the postgresql connection (7d password rotation). Adds the role to
allowed_roles and creates vault_database_secret_backend_static_role
for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
Updated header doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three coupled changes for the new recruiter-responder pipeline:
1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
download Job script + cmd renderer to handle text_only=true (skip
mmproj download + --mmproj flag). The 3 existing vision models stay
on text_only=false; no behaviour change for them.
2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
(app secrets from secret/recruiter-responder, DB creds from Vault DB
engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
Recreate -- IMAP IDLE + APScheduler want single leader), Service
ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.
3. stacks/openclaw/: add init container `install-recruiter-plugin` that
uses the recruiter-responder image to copy the .mjs plugin into
/home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
version to the recruiter-responder image tag. Also injects
RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
from openclaw-secrets.recruiter_responder_bearer_token, optional).
Pre-apply checklist for recruiter-responder stack:
- Vault: seed secret/recruiter-responder with webhook_bearer_token,
imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
task_webhook_token.
- Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
above webhook_bearer_token).
- dbaas: create DB recruiter_responder + role recruiter_responder,
and Vault DB-engine role static-creds/pg-recruiter-responder.
- Build + push image via Woodpecker (recruiter-responder repo CI).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Hardening pass following the empty-stream-list incident:
1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
disabled). Default behaviour hit all 5 upstream addons on every
Stremio request; with a 1h TTL repeat requests for the same title
are instant, while RD cache invalidations still propagate quickly.
2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
encryptedPassword via the internal ClusterIP, runs a canary stream
search for Breaking Bad S01E01, pushes streams_count + probe_success
to Pushgateway. Uses an ExternalSecret pulling UUID + password from
Vault secret/viktor. Same pattern as email-roundtrip-monitor.
3. Three alerts in monitoring's prometheus_chart_values.tpl:
- AIOStreamsStreamCountLow (< 50 streams for 30m)
- AIOStreamsProbeFailing (probe_success == 0 for 30m)
- AIOStreamsProbeStale (last_run_timestamp > 30min for 10m)
Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
- Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid
stale-pull cache, matches 8-char SHA convention for rolling tags)
- Switch ingress auth tier required → app: Authentik forward-auth
blocks Stremio clients (cannot follow OAuth 302), and AIOStreams
already enforces UUID + password on /configure and /api/*, with
Stremio addon URLs using encryptedPassword as a bearer token.
Result: empty-stream-list issue fixed for public Stremio clients.
Verified: 410 streams returned via public URL for Breaking Bad S01E01
with no cookies, vs 0 before (502→Authentik OIDC redirect).
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).
Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
The lobby has grown enough (frontend, two Go services, devvm units +
scripts + config) that it earns its own repo. Code now lives at
https://forgejo.viktorbarzin.me/viktor/terminal-lobby with
scripts/deploy.sh covering the manual deploy until CI activation
lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions
not yet enabled).
This stack now owns only the K8s side — Services, Endpoints,
IngressRoutes, middlewares. main.tf comment block updated to point
at the new repo and the full DevVM port map.
Removed:
- stacks/terminal/files/ (index.html + DevVM artefacts)
- stacks/terminal/tmux-api/ (Go service)
- stacks/terminal/clipboard-upload/ (Go service)
Drops the hardcoded violet/indigo palette. Four themes are defined as
CSS variables on body.theme-{carbon,slate,mono,ink}:
- Carbon (default): warm dark, ivory text, restrained amber accent.
- Slate: cool dark, GitHub/Linear-ish charcoal with electric blue.
- Mono: strict greyscale, off-white accent.
- Ink: warm paper light, deep ink, terracotta accent.
The lobby reads the choice from localStorage and applies the class
before render. The picker lives at the bottom of the sidebar
(margin-top: auto pins it). On change, the iframe is bounced through
about:blank so the inner xterm picks up the new computed CSS vars
(--terminal-bg/fg/cursor/selection) on the next mount.
Picker UI uses native buttons, current theme highlighted with the
accent border + color. No gradients, hairline borders only.
Backend: POST /sessions/<name>/rename in tmux-api runs tmux
rename-session as the mapped OS user. 400 on bad name, 404 on missing
source, 409 on duplicate target, 401 on missing auth header.
Frontend:
- Rename button per card → prompt() dialog, validates against the
shared regex. Updates currentActive + hash + iframe.src if the
renamed session was active.
- Session order is now user-driven, persisted in localStorage
keyed per osUser. New sessions append at the bottom. The previous
sort-by-lastActivity is gone.
- HTML5 drag-and-drop reorders cards live during dragover; dragend
captures the DOM order into localStorage.
- Polling renderLobby is suppressed while a drag is in flight so the
5s tick doesn't yank the list out from under the user.
Replace full-page navigation with a two-pane lobby. Sidebar holds the
session list as clickable cards; an iframe in the content pane swaps
its src on click so switching sessions takes one click instead of two
navigations.
- #lobby-shell grid (260px sidebar + iframe pane)
- Cards become role=button, kill button stops propagation
- activateSession/deactivateSession with hash routing
(location.hash <-> active session, replaceState so back stack stays
clean)
- Killed active session deactivates the iframe before re-render
- 5s session poll preserves currentActive; deactivates if gone
- Mobile media query collapses to one column
CSP frame-ancestors already permits same-origin embedding
(*.viktorbarzin.me), no infra changes needed. Direct-link
?arg=<name> path is unchanged.
Emo isn't using the instance and the daily bank-sync CronJob has been
failing because the budget has zero accounts (deleted from the UI),
triggering BankSyncStale. Adds an `enabled` toggle that gates the core
Deployment + Service + Ingress + http-api + CronJob behind a single
plan-time bool while preserving the PVC, so we can flip back to true
later to restore the instance as-was.
Also fixes a latent bug where the http-api Service was always created
even when `enable_http_api=false`.
Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api
deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks
migrated their state cleanly to the new [0] addresses). Pushgateway
job bank-sync-emo cleared manually; orphaned external-monitor
synced out by external-monitor-sync.
Restores the kernel-level isolation the pre-cutover ttyd-session.sh had,
but keeps the multi-session lobby UX:
- ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads
$TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the
connection (no fallback to wizard) if there's no mapping, otherwise
`sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own
Unix user → its own `/tmp/tmux-<uid>/default` socket.
- tmux-api scopes every request to the same OS user via the same header.
Adds /whoami so the lobby HTML can preflight access and render
"logged in as <os_user> (<authentik>)" instead of leaving the user to
discover the deny via a reconnect loop.
- Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users
fragment under files/devvm/ so future operators see one canonical
source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo.
Adding a user is now: append a line to ttyd-user-map + a NOPASSWD
sudoers line + `useradd -m`. README walks through it.
No Terraform changes — this is all DevVM-side + lobby JS.
Promotes the staged multi-session UX from term.viktorbarzin.me to the
primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM
moves to the same ExecStart that `ttyd-multi.service` was running:
`/usr/local/bin/ttyd -W -a -t enableClipboard=true -I
/usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`.
The lobby HTML supersedes the old per-user-attach index.html
(ttyd-session.sh wrapper retired alongside).
Terraform: retires the `terminal-multi` Service+Endpoints and the
term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is
released by module deletion). The tmux-api Service+Endpoints stay, but
its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix
specificity wins against the catch-all ingress.
DevVM follow-up (applied manually as before — see files/devvm/README.md):
restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.
New hostname term.viktorbarzin.me serves a session-picker UI that lists,
creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that
session (auto-creates via tmux -A). Builds on a fresh ttyd instance
(7685) plus a tmux-api Go binary (7684) on the DevVM, both running as
User=wizard alongside (not replacing) the existing ttyd.service (7681),
ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of
terminal.viktorbarzin.me to the multi-session setup is deferred.
Terraform diff is purely additive — terminal-multi/tmux-api Service +
Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an
IngressRoute that path-prefixes /api/sessions/* to tmux-api with the
matching strip-prefix Middleware.
DevVM-side units ship under files/devvm/ with a README — manual scp +
systemctl install (see files/devvm/README.md). ttyd 1.7.7 already
deployed there (≥1.7 needed for -a).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
Identified during alert-noise review as steady sources of JobFailed.
Suspending them stops the noise; unsuspend after the per-job blocker is
cleared.
* payslip-ingest/actualbudget-payroll-sync — blocked on Vault
`secret/payslip-ingest` missing `actualbudget_encryption_password`.
`actualbudget_api_key` and `actualbudget_budget_sync_id` were added
(copied from `secret/fire-planner`) in the same session; the
encryption password is not stored anywhere in Vault and needs to be
populated separately. ExternalSecret sync has been failing since
2026-04-25.
* instagram-poster/ig-refresh-token — the deployed image (:da5b4191)
does not contain the `POST /ig-refresh-token` route; the route is
defined in uncommitted working-copy changes at
`instagram-poster/instagram_poster/app.py:695`. Unsuspend after the
new image rolls.
Each `suspend = true` line carries an inline comment with the unsuspend
trigger.
Three improvements identified in the 7d alert-noise review:
A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
node-level pull error rate, which doesn't catch a single pod stuck
in ImagePullBackOff — council-complaints sat broken for ~10h on
2026-05-12 without paging. The new rule fires per-pod after 30m.
B. Two new inhibit_rules:
- PVFillingUp (95% used, critical) suppresses PVPredictedFull
(linear projection, warning) on the same PVC. Pair was producing
~24h of redundant firing per 7d.
- EmailRoundtripFailing (active probe failure) suppresses
EmailRoundtripStale (derivative >60min no-success). Same outage
windows, ~14.5h of duplicate firing per 7d.
C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
30-minute window paged on the first failed iteration before the
next run could recover. 2h means "still failing across at least
two cron iterations" — much more actionable.
Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).
Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
short rate dips don't flap
Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.
* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
for-clause tolerates legitimate post-restart ramp-up; the bug
pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
IngressErrorRate5xxHigh, TraefikHighOpenConnections,
ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
HomeAssistantCriticalSensorUnavailable and
HomeAssistantMetricsMissing — when HA itself is down, every sensor
going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
suppress HomeAssistantCriticalSensorUnavailable.
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.
Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:
preflight (k8s-node1)
→ master (k8s-node1) drains k8s-master
→ worker × 4 (k8s-node1) drains k8s-node{4,3,2}
→ worker (k8s-master + control-plane toleration) drains k8s-node1
→ postflight (no pinning)
Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).
Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and
paperless-ngx data volumes with -encrypted variants but never removed
the original -proxmox PVC blocks from TF — both were sitting orphaned
with no pod mounting them, occupying 1Gi each of LVM thin pool. The
autoresizer also logged repeated "failed to get volume stats" for them
(no kubelet stats without a mounted pod), masking real signal.
* wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox
* paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox
(the paperless PVC turned out to be out-of-TF-state, so deleted via
kubectl after the TF block removal.)
The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was
declared in TF but never wired into the deployment — the `workspace`
volume_mount pointed at an emptyDir, so the PVC sat allocated and idle
from 2026-04-15 to 2026-05-11.
Restructured per the design intent:
* `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones.
Each agent job clones the infra repo fresh, so persistence doesn't
buy anything and emptyDir avoids RWO contention if the deployment
is ever scaled past 1 replica.
* `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases
where the agent needs to write state that should survive pod
restarts (caches, ad-hoc outputs). RWX so all replicas share it;
the service's sequential-mutex lock prevents concurrent writes.
Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR
/workspace/infra` causes kubelet to create that path inside the
emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't
write to. Pre-create the path + chmod 0775 to make it writable.
NFS export already exists on the PVE host
(/srv/nfs/claude-agent-persistent, owned 1000:1000).
Verified: pod runs 1/1; `/persistent` writable as agent uid 1000;
git-init successfully clones infra into /workspace/infra.
PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's
threshold is 10% free (= 90% used) — the alert always fired ~10 points
before any action would have been taken, and there was nothing for an
operator to do during that window either. It was a "heads up" that
didn't surface a problem.
Real failure modes are already covered:
* PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up
* PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion
Sharpened PVFillingUp's annotation to spell out the likely causes
(storage_limit reached, expansion failing, or missing autoresizer
annotations) so the responder doesn't have to recall the runbook.
Add a second SQL column on panel 5 that returns net_worth only when the
current point's previous or next neighbor is lower — i.e. the point is
part of a declining segment (including the peak and trough endpoints).
A field override draws this 'decline' series in red with no fill and
spanNulls=false, overlaying the green base line so down periods show
up as red on top of the climb.
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.
Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.
Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
Background: 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.
Fix:
* Remove `annotations` block from `server.auditStorage` values
(with a comment recording why it can't live there).
* Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
with `force = true`, so Terraform adopts the existing annotations
and tracks the desired-state in IaC going forward. The autoresizer
cares about PVC annotations, not StatefulSet template annotations,
so this is functionally equivalent.
Done out-of-band before commit (helm state was already corrupted):
`helm rollback vault 15 -n vault` → revision 20 deployed (clean).
Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
Replace the legacy `protected = true` reference with the four-tier
`auth` enum that's been live for weeks. Document the anti-exposure
guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`)
that enforces the inline-comment convention. Fix two stale paths:
- `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/`
- `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf`
Replace the single `protected = true` example with three: a
default Authentik-gated admin UI, an app-managed backend, and an
intentionally-public webhook receiver. Each example shows the
required comment line above the auth assignment.
[ci skip]
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:
- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
£1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
£400k-1.2M
Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.
Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.
Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Commit 0712a1b6 added a Python-based ingress_factory auth-comment check
that runs from scripts/tg on every plan/apply. The CI image
(forgejo.viktorbarzin.me/viktor/infra-ci) doesn't ship python3, so every
CI apply has been failing since with:
env: can't execute 'python3': No such file or directory
Adding python3 to the apk install line restores CI applies for all stacks.
The build-ci-image.yml pipeline auto-fires on this commit (path filter
on ci/Dockerfile), so the rebuild + retag happens without manual action.
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
Every `tg plan/apply/destroy/refresh` now runs
`scripts/check-ingress-auth-comments.py` against the current stack
before invoking terragrunt. The check fails closed if any
`auth = "app"` or `auth = "none"` line in the stack's .tf files lacks
an immediately-preceding `# auth = "<tier>": ...` comment documenting
what gates the app (for "app") or why the endpoint is intentionally
public (for "none").
Why tg-level (not git pre-commit): tg is the universal entry point
for all infra changes. CI runs it, headless agents run it, humans
run it. A pre-commit hook only catches the human path. Wiring the
check into tg means the anti-exposure guard fires regardless of who
or what is invoking terragrunt.
Stack-scoped: each stack documents itself the next time it's edited.
The 30+ existing `auth = "none"` stacks that predate this guard are
not blocked from operating today; they'll need the comment added the
next time someone runs `tg plan` on them — at which point the gate
forces a conscious "yes, this is intentional" moment before any
state change can land.
Skipped on: init, fmt, validate, output, etc. — anything that doesn't
read or write infra state.
Adds a fourth auth tier alongside required/public/none. "app" is
functionally identical to "none" — no Authentik middleware attached —
but the distinct name records intent at the call site: this backend
has its own user login (NextAuth, Django, OAuth, bearer-token API,
etc.) and Authentik would only break it.
Why the new tier: with only required/none, every "the app has its
own auth so drop Authentik" decision looked identical at the call
site to "this is an OAuth callback / webhook receiver / native-client
API". Future readers couldn't tell whether a stack was intentionally
unauthenticated or relying on backend auth. Now they can.
Migrates the 8 stacks flipped earlier this session (novelapp, immich,
linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf)
from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed
"No changes" — same middleware chain, same live state.
The variable description and the .claude/CLAUDE.md Auth section now
spell out the anti-exposure rule: only pick "app" or "none" AFTER
verifying the app has its own user auth ("app") or the endpoint is
intentionally public ("none"). Default stays "required" so accidental
omission fails closed.
[ci skip]
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.
Fix:
* CronJob: enumerate accounts via GET /accounts, POST per-account
/accounts/{id}/banksync, emit bank_sync_account_success and
bank_sync_account_last_success_timestamp labelled by account name.
Roll up bank_sync_success = 1 iff any account succeeded.
* Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
at 48h (global drought). Add BankSyncAccountStale at 72h (catches
single-account auth expiry — the real signal we wanted).
Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
Apps with their own user auth + bearer-token APIs were being broken by
Traefik → Authentik forward-auth: every iOS/Android/native client got a
302 to authentik.viktorbarzin.me instead of the JSON they expected.
Authentik's 302+cookie dance can only be followed by a real browser.
Changed:
- immich (Immich mobile app + bearer-token /api)
- linkwarden (NextAuth + Linkwarden mobile clients)
- tandoor (Django auth + Tandoor mobile clients)
- freshrss (Fever/GReader API used by Reeder/FeedMe/etc.)
- affine (workspace auth + AFFiNE desktop/mobile sync)
- actualbudget (server password + Actual mobile/sync clients)
- ebooks/abs (Audiobookshelf iOS/Android app)
Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI
UA filter still front the ingresses. Same pattern as the novelapp
change earlier this session.
[ci skip]
novelapp handles its own user auth via NextAuth + Google OAuth, so the
ingress-level Authentik forward-auth was double-gating. Mobile webviews
(iOS/Android) can't follow the Authentik 302/cookie dance — they saw
HTML challenges where they expected JSON. CrowdSec + rate-limit +
anti-AI UA filter remain in front; novelapp's own login handles users.
[ci skip]
Two pre-existing apply failures uncovered during the Phase 4 mass apply,
unrelated to the auth refactor but blocking 100% rollout.
claude-memory:
- `var.claude_memory_db_password` had no default and wasn't passed by
terragrunt → fall back to Vault `secret/claude-memory.db_password` via
`coalesce(var.x, data.vault.data["db_password"])`.
- db-init Job was failing with `database "root" does not exist` because
psql defaults the database name to the user when -d is omitted. Added
`-d postgres` to all five psql invocations.
resume:
- `var.resume_database_url` had no default and wasn't passed → default to
empty string. Vault carries the real value at `secret/resume.database_url`
consumed at the deployment env-var level; the variable here just needs
a value to satisfy the apply.
Also: priority-pass had lost most of its TF state (only 3 of 8 resources
tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to
re-bind state with live K8s resources. No code change needed there.
Verified after re-apply:
- claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses)
- priority-pass.viktorbarzin.me → 302 → authentik (auth=required)
- resume.viktorbarzin.me → 302 → authentik public outpost (auth=public)
- 6 of 7 previously-failing applies now green; only vault remains, blocked
by an unrelated helm chart immutable-StatefulSet-field issue.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":
1. prometheus_alerts: only count severity=warning|critical. Info-level
alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
alert rule itself sets severity; the script should respect it.
2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
Lets Encrypt wildcard renews weekly; <14d is the only window where
human attention is genuinely useful.
3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
event/image/update domains (transient by design), skip friendly
names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
only count entities whose last_changed > 24h. Was 431/1470,
most of which were "phone in standby" noise.
4. ha_automations: only flag DISABLED automations as abandoned if
they've also been untouched (last_changed) for >180 days; raise
stale threshold 30d → 180d. Was flagging seasonal/holiday-only
automations as broken.
5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
CronJob retry leftovers (Error/Failed phase pods that K8s keeps
around for log inspection) aren't problematic at the cluster level.
6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
shot failures were a recurring false-positive even though the
service was healthy.
Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in
front. HA Sofia's frigate integration polls /api/config and only knows
how to use Frigate's own API key (not browser SSO), so every poll got
a 302 to authentik.viktorbarzin.me and the integration entered the
errors-state. Same pattern as idrac-redfish-exporter (5c594291).
allow_local_access_only IP allowlist + Frigate's API key are enough.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.
Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.
Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.
1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
auth=none. HA Sofia REST sensors scrape these endpoints
programmatically; with Authentik forward-auth in front, every
request got a 302 to authentik.viktorbarzin.me and the REST
sensors parsed the HTML login page instead of metrics — leaving
the R730, UPS, and ~20 other sensors permanently unavailable.
The allow_local_access_only IP allowlist (192.168.0.0/16 +
10.0.0.0/8) already gates external access, so authentik on top
was breaking machine-to-machine traffic for no security gain.
2. prometheus_server_pvc + technitium primary_config_encrypted:
add lifecycle.ignore_changes = [spec[0].resources[0].requests].
The autoresizer expands these PVCs; PVCs can't shrink. Without
the ignore, every TF apply tried to revert the live size back
to the TF spec value, hit K8s's shrink-forbidden rule, and
force-replaced the PVC. Because the pod still mounted it, the
PVC went into Terminating-but-protected limbo — fine until a
pod restart would have orphaned the volume. Root cause of the
2026-05-10 PVC Terminating incident.
Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.
Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML
is buried inside a null_resource local-exec heredoc so the regex
didn't catch it. CNPG operator inherits these annotations onto each
member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every
reconcile — patching the live PVCs alone bounces back within seconds.
Live state already patched via kubectl patch cluster, this just keeps
TF in sync.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.
Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).
The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.
Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for "Backup done"
line + byte count.
Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.
Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
- patch namespaces/k8s-upgrade (in-flight annotation)
- create batch/jobs (trigger etcd snapshot Job)
- patch nodes (cordon/uncordon)
- create pods/eviction (drain)
- delete pods (drain fallback)
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
-> etcd snapshot save
-> optional master containerd skew fix
-> apt repo URL rewrite (minor bumps only)
-> drain/upgrade/uncordon master via ssh < update_k8s.sh
-> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
-> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
The Phase 4 audit promoted three "smoke-test candidates" from `protected = false`
to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch()
calls, automation scripts) that don't survive the 302+cookie redirect dance
that the public-auto-login flow requires on first visit. fire-planner's SPA
broke immediately — every fetch() to /api/* hit a cross-origin redirect and
CORS preflight rejected it.
Important learning for the `auth = "public"` design:
`auth = "public"` is functionally equivalent to a normal Authentik forward-auth
for the FIRST request — it issues a 302 to authentik to set a guest session
cookie, then 302s back. This is invisible for top-level browser navigation
but BREAKS:
- XHR/fetch() under CORS preflight (preflight rejects redirects)
- curl/automation scripts that don't preserve cookies across requests
- Mobile / native clients that can't follow OAuth-style redirects
Use `auth = "public"` only for top-level HTML pages where the user navigates
via the browser address bar (or links). For XHR APIs, native-client surfaces,
webhooks, OAuth callbacks — use `auth = "none"`.
The plan's "smoke test 3 candidates" were misjudged on this front. Reverting
all three to `auth = "none"` (their previous behaviour). The end-to-end public
flow IS verified working via curl + flow API — the design is sound, just the
test targets were wrong.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Phase 4 audit pass missed this site because the previous agent scoped
out owntracks (it overrides the factory's middleware list via
extra_annotations to use its own basic-auth middleware). Adding the explicit
auth = "none" satisfies Phase 5's "every ingress has an explicit decision"
goal and makes the intent visible — mobile OwnTracks clients post location
data via HTTP basic-auth and can't follow Authentik forward-auth 302s.
Closes the loop on Phase 5: 122/122 active ingress_factory call sites now
carry an explicit auth = "..." decision (zero callers rely on the default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The endpoint exists in the working copy of instagram_poster/app.py
but isn't committed/built/deployed, so every cron fire returned 404
and triggered JobFailed alerts every 30 min.
Set count = 0 to leave the resource declaration in place — re-enable
by removing that line once the endpoint is in a built image.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a
corrupted snapshot or a sqlite held open by a writer) blocked the
whole script until systemd's 4h TimeoutStartSec kicked in, leaving
every later PVC silently unbacked. Today's run hung on
mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover
— hence WeeklyBackupFailing alert.
Now:
- rsync per PVC: timeout 30 min, exit 124 logged separately
- sqlite3 per database: timeout 5 min
- /etc/pve rsync: timeout 5 min
Each timed-out PVC bumps PVC_FAIL but the loop keeps moving.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CF zone was returning 403 to declared AI-bot UAs at the edge
(`ai_bots_protection: "block"`). That meant the in-cluster x402
gateway never saw the request and could never issue an HTTP 402 with
the wallet payment requirements — the bot just bounced.
Adopt `cloudflare_bot_management.zone` via root-module import block,
flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`),
crawler challenge (`crawler_protection`), and managed robots.txt are
unaffected — generic automated traffic still gets the bot fight gate.
End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/
1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block)
with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`.
Trade-off: bots that don't pay still hit origin (instead of CF
blackholing them), so a small bandwidth uptick. Negligible at our
traffic level.
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.
Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"`
ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI
prompt), so public sites are still recorded as authenticated by Authentik for audit
purposes — but as `guest`, not by leaking the standard catchall flow.
- guest user in `Public Guests` group (NOT `Allow Login Users`).
- `public-auto-login` flow: stage_binding policy sets `pending_user = guest`,
`evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is
populated when the policy mutates it; `authentication = none` lets anonymous
requests enter.
- `Provider for Public` proxy provider (forward_domain, cookie_domain
viktorbarzin.me) with `authentication_flow = public-auto-login`.
- Dedicated `public` outpost: only the public provider bound, deployed as
`ak-outpost-public` Deployment+Service in the `authentik` namespace by
Authentik's K8s controller.
- `public-auth.viktorbarzin.me` ingress exposes the public outpost's
`/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded
outpost doesn't know about the public provider, so `authentik.viktorbarzin.me`
callbacks would fail).
- `authentik-forward-auth-public` traefik middleware points at the public
outpost service (not via the auth-proxy nginx fallback). The plan's
`?app=public` dispatch idea was tested and rejected — the embedded outpost
dispatches purely by Host header, so a dedicated outpost was the only way
to isolate the public flow without conflicts.
No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory
`auth` variable refactor + audit pass) wires it up. This commit is additive
and behaviour-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without this annotation on the StorageClass, pvc-autoresizer's controller
filters the SC out at the index lookup stage and never patches any of its
PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit
annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but
no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts
firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any
ResizeStarted events ever following.
The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer
reads kubelet_volume_stats_available_bytes) but not sufficient on its own.
Also pinning chart version to 0.5.6 so the next apply doesn't incidentally
bump to 0.5.7.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the traefik stack to read two new fields from secret/viktor:
* x402_wallet_address -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f
* alertmanager_slack_api_url (existing) -> reused as the per-payment
notification webhook so payment events arrive in the same Slack
channel as other infra alerts.
Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end:
- Browser UA on all 9 sites -> 200 (passes through to Anubis)
- python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with
PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000
micro-USDC, network=base, asset=Base USDC contract
- Direct Slack-webhook test from inside cluster -> HTTP 200
Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format
notification payload (text=..., username=x402-gateway,
icon_emoji=💰; auxiliary fields preserved for richer receivers).
Notifications fire on every successful X-PAYMENT validation; failures
on Slack webhook are logged at WARN, never block the request, never
double-charge the bot.
Test 3 validation surfaced two latent bugs in the sentinel-gate
DaemonSet that have been masked since 2026-04-18 (when uu was off,
nothing wrote /var/run/reboot-required, so the gate never had to
fire):
1. automount_service_account_token=false on both the SA and the
pod spec → kubectl in the script falls back to localhost:8080
on every call. Each check (`kubectl get nodes`, `kubectl get
pods -n calico-system`, transition-time read) errors to stderr
and emits empty stdout. `wc -l` reports 0 → checks "pass" with
no real data.
2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
/var/run is root:root 0755 → final
`touch /host/var-run/gated-reboot-required` failed with EACCES.
Fail-safe by accident — but if anything had ever loosened those
perms, the broken checks above would have green-lit the gate
with no real validation.
Fix: enable token mount on the SA + pod, set
securityContext.run_as_user=0 on the container.
Verified post-fix: kubectl returns all 5 nodes, touch succeeds,
sentinel-gate now reports the correct
`BLOCKED: A node transitioned Ready within the last 24 hours
(soak window)` when triggered with k8s-node1's recent reboot
within the cool-down period.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.
Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.
Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).
Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).
jq is already in the CI image and produces the same output.
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:
- Allowed-Origins limited to security/updates pockets
- Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
hold on the cluster-critical components)
- Automatic-Reboot disabled — kured drives the actual reboots
- Compatible with the existing kured + sentinel-gate flow
kured side:
- rebootDelay 30s, concurrency 1
- Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
window from the post-mortem)
- prometheusUrl + alertFilterRegexp wired so any firing non-ignored
alert halts the rollout. Ignore-list excludes self-referential
alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
InfoInhibitor) that would otherwise deadlock kured.
Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
- Refine `KubeQuotaAlmostFull` to include the resourcequota label in
both the on-clause and the summary, so multi-quota namespaces
(authentik, beads-server, frigate) report the quota name correctly.
grafana.tf: terraform fmt whitespace only.
Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
signed in seamlessly (no double-login UX)
Prometheus and Alertmanager already had forward-auth — no change.
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.
Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.
Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.
Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.
- dbaas: null_resource.pg_instagram_poster_db creates role + DB
(idempotent CREATE IF NOT EXISTS, password placeholder) — same
shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
+ add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
reloader.stakater.com/match=true bounces the pod on rotation.
Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.
Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update `.claude/reference/authentik-state.md`:
- Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
Duration table with the gotcha that the gorilla session store binds
the value once at outpost startup (rollout restart needed).
- Replace the "session storage moved to Postgres in 2025.10" note that
falsely implied the migration was automatic — explain that the
`Outpost.managed` field gates the postgres path and our outpost
silently stayed on `FilesystemStore` until 2026-05-10.
- Document the goauthentik 2026.2.2 service-selector bug
(service.py:52) and the JSON-patch workaround.
- Document that the standalone embedded-outpost deployment needs
`AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
`app.kubernetes.io/component=server` pod label.
- Note the "Terraform doesn't expose `Outpost.managed`" assumption
that holds the `managed=embedded` value in place across applies.
Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
- P2 codify-in-Terraform: DONE.
- P3 access_token_validity reduce: DONE-alt (we did the opposite —
bumped to 4 weeks — because postgres backend mooted the storage
concern).
- P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
the loss-of-state class on the embedded outpost itself).
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.
Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.
Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
- kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
requests/limits, the app.kubernetes.io/component=server pod label
(workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
the shared `goauthentik` Secret so the postgres session backend can
connect to the dbaas cluster.
- kubernetes_json_patches.service replaces the controller-set selector
(which targets app.kubernetes.io/name=authentik / the goauthentik-server
pods) with the outpost's own labels — without this, endpoints are
empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".
The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.
Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.
Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7 of the vision-LLM benchmark plan. Adds:
- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
per-model analysis, top-N agreement, cost vs cloud APIs, sample
captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
of the flash-attention flag; without the value llama-server exits
before serving any request).
- llama-cpp.md architecture doc links the report so future operators
land on the deployed-and-evaluated model from one entry point.
300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.
Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:
* AI scrapers consume GET response bodies — they don't POST.
* State-mutating XHRs and OPTIONS preflight need to bypass the
challenge or the app breaks.
* CrowdSec + per-route rate-limit + app-level auth already cover
abuse on mutating methods, so this gives up nothing.
* Hard-deny rules for known-bad bots run first, so a declared bad
bot can't sneak through by sending a POST.
Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.
f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.
Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.
Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.
This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.
Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.
Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.
GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.
Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.
DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
(was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202 (added)
- kms.viktorbarzin.lan A 10.0.20.200 (unchanged — still the
Traefik LB for the user-facing website at kms.viktorbarzin.lan/)
vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.
Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).
Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.
Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.
Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product=Windows 10 Professional,
status=Licensed} 1 — real WAN client IP preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.
Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
Two coupled fixes for the hourly Slack noise + missing client IPs:
1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
Sharing 10.0.20.200 is blocked because all 10 services there are
ETP=Cluster and MetalLB requires consistent ETP per shared IP.
2. Slack notifier now suppresses Slack posts for bare TCP open/close
pairs (no Application/Activation block) — these are Uptime Kuma's
port monitor and the new kubelet readiness/liveness probes. Probe
counts go to a new metric kms_connection_probes_total{source} where
source classifies the IP as internal_pod / cluster_node / external.
Real activations are unaffected.
Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.
pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
smtps, etc.) untouched
Runbook updated. Tests added for classify_source / is_probe / process_line.
Adds 3 new keys (ACTUALBUDGET_API_URL/KEY/SYNC_ID) sourced from
Vault secret/fire-planner so the FastAPI backend can read viktor's
spending from the in-cluster actualbudget HTTP API and prefill the
Annual spending field on the WhatIf form. Vault keys seeded manually
ahead of this commit; ESO has already synced the K8s Secret.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The per-site `x402_instance` module created one Deployment + Service +
PDB per protected host (9 in total, 9×64Mi). Every pod was running the
exact same logic with the same config — the only thing that varied
was the upstream URL, which we don't even need since the gateway can
return 200 to "allow" and Traefik handles the upstream itself.
Refactor to the same pattern as `ai-bot-block`:
* single deployment + service in `traefik` namespace, 2 replicas, HA
* Traefik `Middleware` CRD `x402` (forwardAuth → x402-gateway:8080/auth)
* each consumer ingress just appends `traefik-x402@kubernetescrd` to
its middleware chain via `extra_middlewares`
x402-gateway gains a `MODE=forwardauth` env var that returns 200 (allow)
or 402 (with x402 PaymentRequiredResponse body) instead of reverse-
proxying. Image: ghcr ... f4804d62.
Pod count: 9 → 2 (78% memory saved). All 9 sites verified still
serving the Anubis challenge to plain curl with identical TTFB.
DRY_RUN until `var.x402_wallet_address` is set on the traefik stack.
Removes `modules/kubernetes/x402_instance/` (dead code now).
Adds modules/kubernetes/x402_instance/ — a small Go reverse proxy
(forgejo.viktorbarzin.me/viktor/x402-gateway:ce333419) that selectively
issues HTTP 402 Payment Required to declared AI-bot User-Agents and
validates X-PAYMENT headers against a Coinbase x402 facilitator.
Browsers are forwarded transparently to Anubis (which then handles the
JS PoW gate as before).
Wired into all nine Anubis-fronted sites:
ingress -> x402-X -> anubis-X -> backend
While `wallet_address` is empty the gateway runs in DRY_RUN — every
request is transparent-proxied, no 402s issued. This lets the pod sit
in the request path with zero behavioural impact today; flipping the
wallet variable in the per-stack module call activates payment-required
mode for AI-bot UAs.
Default config: Base mainnet USDC, $0.01/req, x402.org/facilitator,
catch-all UA list (ClaudeBot|GPTBot|Bytespider|meta-externalagent|
PerplexityBot|GoogleOther|cohere-ai|Diffbot|Amazonbot|
Applebot-Extended|FacebookBot|ImagesiftBot|YouBot|anthropic-ai|
Claude-Web|petalbot|spawning-ai|scrapy|python-requests).
Verified post-apply: 9/9 pods Running, all 9 sites still serve the
Anubis challenge to plain curl with identical TTFB, x402 logs confirm
"dry_run":true on every instance.
Earlier f1 revert left the host fully unprotected (no Anubis,
exclude_crowdsec=true on the ingress already). Re-add Anubis with
a custom policy_yaml that:
- ALLOWs /_app/* (SvelteKit immutable JS/CSS chunks loaded before
any cookie exists), /openapi.json, /docs, /api/* (FastAPI meta).
- ALLOWs the 9 known JSON/proxy routes (schedule, streams,
embed, embed-asset, extract, extractors, health, proxy, relay)
so the SvelteKit SPA's XHRs return JSON instead of the challenge
HTML.
- Catch-all CHALLENGE for everything else — the SPA HTML pages
(which fall through to FastAPI's `/{path}` catch-all) get the
PoW gate.
The ALLOWed JSON routes are technically scrapeable by a determined
bot, but the user's stated goal is "avoid accidental scrapes" — the
HTML/SPA is the AI-training target, and that stays gated.
Verified: / → Anubis challenge HTML; /schedule, /streams → JSON;
/_app/.../app.js → text/javascript; ClaudeBot UA → Anubis deny page.
f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed,
/embed-asset, … on the same path tree. With Anubis fronting `/`,
those XHRs land on the challenge HTML even when the cookie *should*
be valid, breaking the page with `Unexpected token '<', "<!doctype "
... is not valid JSON`. Removed Anubis from f1 — would need a path
carve-out (the way wrongmove does for /api) to re-enable. Added a
top-of-block comment so future me remembers why.
Plus four new Prometheus alerts in `Slow Ingress Latency` group
(stacks/monitoring/.../prometheus_chart_values.tpl):
- IngressTTFBHigh (warn, 10m, avg latency >1s)
- IngressTTFBCritical (crit, 5m, avg latency >3s)
- IngressErrorRate5xxHigh (crit, 5m, 5xx >5%)
- AnubisChallengeStoreErrors (crit, 5m, any 5xx on *anubis* services
via Traefik — proxies for the in-pod challenge-store error since
Anubis itself only exposes Go-runtime metrics)
Notes from the alert author: avg-not-p95 because the existing
Prometheus scrape config drops traefik bucket series; once those
are restored, swap to histogram_quantile(0.95). TraefikDown inhibit
rule extended to suppress these four during a Traefik outage.
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).
Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.
Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).
End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
The default upstream policy only WEIGHs Mozilla|Opera UAs and lets
everything else (curl, wget, python-requests, scrapy, headless CLI
scrapers) fall through to the implicit ALLOW. On non-CDN-fronted
hosts (kms, anything dns_type=non-proxied) this meant a plain
`curl https://kms.viktorbarzin.me/` returned the real backend
content with no challenge — defeating the whole point of the
"avoid casual scrapers" intent.
Now the module ships a custom POLICY_FNAME mounted via ConfigMap:
- Imports the upstream deny-pathological / ai-block-aggressive /
allow-good-crawlers / keep-internet-working snippets unchanged
- Adds a final `path_regex: .*` → action: CHALLENGE catch-all
Result: only IP-verified search engines (Googlebot from Google IPs,
Bingbot, etc.) and well-known paths (robots.txt, .well-known,
favicon, sitemap) skip the challenge. Everything else — including
spoofed-Googlebot-UA-from-random-IP — solves PoW or gets nothing.
Verified post-apply: curl default UA on viktorbarzin.me + kms +
travel returns the Anubis challenge HTML; /robots.txt still 200s
straight through.
The SPA can't carry an Authentik session on its own fetch() XHRs in
all cases (cross-origin redirect to authentik.viktorbarzin.me on a
stale cookie returns HTML, fetch().json() parse fails). Splitting
the ingress so /api/ paths skip forward-auth lets the React app talk
to its API end-to-end. The browser still has to log in via
Authentik to load the SPA at /.
Verified end-to-end via chrome-service Playwright: dashboard load,
scenario list, what-if run with real Monte Carlo, save-as-scenario
round-trip, run-now on detail, delete — all pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy
instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance
issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny
proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The
shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key)
makes a single solve good across every Anubis-fronted subdomain.
Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and
travel.viktorbarzin.me — each with anti_ai_scraping=false on the
ingress so the redundant ai-bot-block forwardAuth is dropped from the
chain. Skipped forgejo (Git/API clients can't solve PoW) and resume
(replicas=0).
Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so
any ingress still using the ai-bot-block forwardAuth pays at most
~150 ms when poison-fountain is scaled down, instead of 3 s.
End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms.
Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to
4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and
Forgejo/API caveat.
ingress_factory's port var defaults to 80, but fire-planner publishes
on 8080. Traefik logged 'Cannot create service error="service port
not found"' and 404'd every request. Cloudflare's standard
origin-error decoy page (with the noindex meta + cdn-cgi/content
honeypot link) made it look like a bot-block, but it was just the
upstream coming back 404.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.
Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.
The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).
Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
threshold expired)
Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.
Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.
Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
validate the fix end-to-end
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wealth, Payslips, and Job-Hunter Grafana datasources all baked the
rotating PG password into their ConfigMap at TF-apply time, so every
7-day Vault static-role rotation silently broke the panels until a
manual `terragrunt apply`. Same family as the recurring grafana-mysql
backend bug — Grafana caches creds at startup and never picks up the
new ESO-synced password without a restart.
Fix:
- Each source stack now creates an ExternalSecret in `monitoring`
exposing the rotating password as `<NAME>_PG_PASSWORD` env-var.
- Grafana mounts those via `envFromSecrets` (optional=true so a
missing source stack doesn't block boot) and the datasource
ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a
literal password.
- `reloader.stakater.com/auto: "true"` on the Grafana pod restarts
it whenever any of the four DB-cred Secrets is updated.
Tested end-to-end: forced `vault write -force database/rotate-role/
pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired →
Grafana booted with new env in ~50s total → all three /api/datasources
/uid/*/health endpoints return "Database Connection OK".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.
/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing
during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB
free.
The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
instagram-approval: after every tap, immediately fetch /candidates?limit=1
and send the next photo as a fresh inline-keyboard message — the user's
tap chains back into this same workflow, so the loop is user-paced.
When the pool is exhausted, send an 'all caught up' summary with the
backlog count + cumulative training stats.
instagram-discover: cron throttled from every-30-min to daily 09:00.
The chain handles ongoing training; the daily run only kickstarts a
session if the user hasn't been tapping. Limit reduced from 3 → 1 so
each kickstart sends a single photo (chain takes over).
Stories+feed posts via Postiz failed with state=ERROR and Postiz
mistranslated the cause as 'Invalid Instagram image resolution
max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL
under https://postiz.viktorbarzin.me/uploads/... and Meta gets a
302 to the Authentik login page instead of bytes. Meta returns
error 36001 (image not fetchable) which Postiz maps to that
misleading resolution string.
Split the ingress: /uploads/* on a public ingress (matches the
instagram-poster /image+/original pattern), everything else
remains behind Authentik forward-auth. /uploads contents are
random UUIDs, low blast radius if scraped.
The canonical proxmox-lvm and proxmox-lvm-encrypted PVC templates were
missing `lifecycle { ignore_changes = [spec[0].resources[0].requests] }`.
Without it, every PVC created from these templates becomes a drift bomb
the moment pvc-autoresizer expands it: the next `tg apply` on that stack
will try to shrink the PVC back to the TF-declared size, K8s rejects the
shrink, and apply fails.
This was latent because pvc-autoresizer was silently broken cluster-wide
(commit 9d5da4d8 fixed it by allow-listing kubelet_volume_stats_available_bytes
in Prometheus). Now that the autoresizer actually works, every existing
proxmox-lvm/encrypted PVC without ignore_changes is at risk.
Sweep needed (separate task): grep for kubernetes_persistent_volume_claim
across stacks/ and add ignore_changes to any with resize.topolvm.io
annotations.
Standalone provider (instagram-standalone OAuth flow) is what the user
is trying after the FB-Login path was blocked by their Business Account
ad-policy flag. Uses modern scope names (instagram_business_*), so no
JS patch needed unlike the FB-Login provider.
Same fix as default.yml — drift-detection cron also runs terragrunt
plan on every stack, which requires the kubeconfig at <repo>/config
that terragrunt.hcl injects via -var kube_config_path. Pipeline #547
(latest scheduled drift-detection run) failed with the same
'config_path refers to an invalid path' error.
terragrunt.hcl injects -var kube_config_path=${repo_root}/config for
every terraform invocation, but the pipeline never created that file.
Every commit that touched a TF stack since #545 (2026-05-08) failed
with 'config_path refers to an invalid path: \"../../config\": no such
file or directory' followed by the kubernetes provider falling back
to localhost:80.
Add a step that writes a kubeconfig at <repo>/config using the
projected SA token + cluster CA. The woodpecker namespace's default
SA is already cluster-admin (woodpecker-default ClusterRoleBinding),
so the projected token is sufficient for any stack apply. Using
tokenFile (not an inline token) lets the provider re-read it if
kubelet rotates the projected token mid-pipeline.
#545 was the last green run because that commit only changed the
build-cli pipeline — 0 stacks applied so the missing kubeconfig
never mattered.
The old port-5050 R/W private registry was decommissioned 2026-05-07
(forgejo-registry-consolidation Phase 4). The reverse-proxy ingress
+ ExternalName service + Cloudflare DNS record kept pointing at the
dead backend, returning 502 to anyone hitting registry.viktorbarzin.me.
This was driving 3 monitoring artifacts that auto-cleared on cleanup:
- Uptime Kuma external monitor #586 (deleted)
- Pushgateway stale registry-integrity-probe metrics (deleted)
- ExternalAccessDivergence + RegistryIntegrityProbeStale alerts
The Prometheus scrape config for the kubernetes-nodes job kept
capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer
computes utilization from available/capacity, so without that metric it
was silent for every PVC in the cluster — including mailserver, which
filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with
'452 4.3.1 Insufficient system storage' (15+ hours, all real senders:
Brevo, Gmail, Facebook).
Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo
(15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds
ignore_changes on requests.storage so future autoresizer expansions
don't cause TF drift.
User dropped Postiz/Instagram OAuth (Meta Business Account flagged
+ Postiz scope drift). New pipeline ends at Telegram — full-quality
JPEG delivered to the bot chat, manually uploaded to IG by the user.
- Image bumped to 25e46efd: adds /deliver/{asset_id} endpoint that
multipart-uploads to Telegram (URL-fetch fails through Cloudflare
for >5MB), then tags 'posted' in Immich.
- ESO now syncs telegram_bot_token + telegram_chat_id from Vault.
- Public ingress paths grow to ['/image', '/original'] (Authentik
bypass on /original is harmless — files are user-tagged, low blast
radius — and useful for ad-hoc browser downloads).
- Memory limit 512Mi -> 1500Mi: full-resolution Pillow HEIC decode
was OOMing on 12MP+ phone photos.
- discover.json simplified to scan -> deliver per item; approval and
post workflows already deactivated. Telegram bot webhook removed.
Postiz backend was crashlooping on connect ECONNREFUSED ::1:7233 —
Postiz needs Temporal for cron/scheduled posts and the Helm chart
doesn't bundle it. Added a single-replica temporalio/auto-setup:1.28.1
Deployment in the postiz namespace, backed by the bundled
postiz-postgresql (separate `temporal` + `temporal_visibility`
databases pre-created via init container), ENABLE_ES=false (Postiz
only uses the workflow engine, not visibility search). Skips
DYNAMIC_CONFIG_FILE_PATH because that file isn't bundled in
auto-setup.
Auth audit:
- postiz: ingress now `protected = true` (Authentik forward-auth).
Postiz also has its own login on top, but registration is no
longer exposed to the open internet.
- instagram-poster: split into two ingresses on the same host.
`/image/*` stays public (Meta + Telegram fetch the 9:16
derivatives). Everything else (/healthz, /queue, /scan,
/enqueue, /reject, /post-next) sits behind Authentik. The
protected ingress sets dns_type=none — the public one already
created the CF DNS record.
- postiz: set DATABASE_URL/REDIS_URL pointing at the bundled subcharts;
the chart does NOT auto-wire even when postgresql.enabled=true, so
the prisma db:push was failing with empty DATABASE_URL.
- n8n approval workflow: swap telegramTrigger -> webhook node so it
works without an n8n-stored Telegram credential. Telegram bot's
webhook is set via setWebhook to https://n8n.viktorbarzin.me/webhook/instagram-approval.
Parse-callback Code node tolerates both shapes ({body:{callback_query:...}}
vs {callback_query:...}) so a future move back to telegramTrigger doesn't break.
- postiz: pin chart name to 'postiz-app' (was 'postiz', wrong path)
and override bundled bitnami subchart images to bitnamilegacy/* —
Bitnami removed bitnami/postgresql + bitnami/redis from DockerHub
in Aug 2025 (Broadcom acquisition).
- postiz: enable initial registration (DISABLE_REGISTRATION=false)
so first admin user can be created in UI; tighten after.
- instagram-poster: add securityContext (fsGroup/runAsUser=10001)
so kubelet chowns the PVC mount for the non-root 'poster' user;
was crashing on alembic with 'unable to open database file'.
- instagram-poster: bump image_tag to 24935ab4 (uvicorn now binds
to port 8000 to match Service contract; was 8080 -> probe 404).
New stacks:
- stacks/postiz/ — Postiz scheduler (Helm chart v1.0.5, image v2.21.7)
with bundled PG/Redis, /uploads PVC on proxmox-lvm, JWT_SECRET
via ESO from secret/instagram-poster.
- stacks/instagram-poster/ — custom Python service that polls Immich
for the 'instagram' tag, reformats photos to 9:16 with blurred-bg
letterbox, exposes /image/<asset_id> publicly so Postiz can fetch.
Image: forgejo.viktorbarzin.me/viktor/instagram-poster.
n8n: 3 new workflows (discover, approval, post) for the Telegram
inline-button approval UX. Adds ExternalSecret + env vars for
TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, IMMICH_API_KEY, plus static
URLs for the new service.
Vault: seed secret/instagram-poster with telegram_bot_token,
telegram_chat_id, immich_api_key, postiz_api_token,
postiz_jwt_secret before applying.
The previously-baked kubeconfig at /home/node/.openclaw/kubeconfig retained
a service-account token bound to the original (long-dead) pod, so kubectl
calls from inside the openclaw container failed with "the server has asked
for the client to provide credentials" even though the openclaw SA has
cluster-admin and kubelet projects a fresh token at
/var/run/secrets/kubernetes.io/serviceaccount/token.
Add init-container "setup-kubeconfig" that writes a kubeconfig with
tokenFile + certificate-authority paths pointing at the projected
SA volume — kubelet auto-rotates the token, kubectl always reads
fresh creds, no Vault K8s-creds-engine refresh needed.
Verified end-to-end: agent ran `kubectl get nodes -o wide` inside the
pod and delivered a correct one-line summary to Telegram via
openai-codex/gpt-5.4-mini.
The build-cli pipeline was still pushing to the
registry.viktorbarzin.me:5050/infra path that no longer exists
post Phase 4 — failing with 'error authenticating: exit status 1'
on every infra push. Drop the second repo + login; DockerHub +
Forgejo are the canonical destinations now.
The helm provider in this Terraform version doesn't support
list-index access on helm_release.metadata[0]. Switch the
woodpecker_server_host_alias trigger to {helm_version, sha256(values)}
which works regardless of provider quirks. (Original fix landed
2026-05-07; got reverted by a linter pass.)
Companion commit to 92474254 — the new extractor wasn't being
registered, only the file was added. Add the import + register call
in create_registry().
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four-agent parallel investigation finally pinned down what's happening
with the hmembeds.one streams. The TL;DR is unexpected: there is no
fingerprint check, no decoder failure, no broken JS — the obfuscated
decoder is trivial to reproduce, but the upstream origin is dead.
Findings (saved at /tmp/jwre/{findings.md, blob-analysis.md,
fingerprint-gap.md, trace-summary.md}):
1. **The "ZpQw9XkLmN8c3vR3" blob is decoy.** It's an Adcash adblock-
bypass config — not the stream URL. The actual stream URL is in a
different inline `<script>` block of the embed HTML.
2. **The real decoder is base64 + XOR with a hardcoded key**, the key
appears literally in the HTML (e.g. `var k="bux7ver6mow4trh1"`).
No browser-derived inputs. We can run it in Python in 50µs.
3. **The decoded URL is JWT-bound to /24 of the requestor's IP**. JWT
payload: `{stream, ip:"176.12.22.0/24", session_id, exp}`. From our
cluster (egress 176.12.22.76) the JWT IP-binding is satisfied.
4. **The origin still returns 404 (GET) / 403 (HEAD).** Tested both
curated embeds (Sky F1 888520f3..., DAZN F1 fc3a5463...) — same
404. Origin landing page (`/`) returns 200, so the host is up;
the `/sec/<JWT>/<embed_id>.m3u8` endpoint specifically refuses.
5. **No fingerprint surface trips this.** Runtime trace via
chrome-service hooks confirmed: decoder reads navigator.userAgent
(heavy), screen dimensions, and a single WebGL getParameter call.
No canvas, audio, fonts, fetch-to-fingerprint-API. JW Player setup
is given a valid file URL — the playlist stays empty because JW
can't fetch the manifest from the (dead) origin.
Verdict: **the legacy curated hmembeds embeds (`888520f3...` Sky F1,
`fc3a5463...` DAZN F1) are upstream-dead.** No browser-side fix is
possible. The community uses these IDs as "24/7 channels" but they're
in a perpetually-offline state right now.
This commit ships the offline decoder anyway, registered as a new
extractor. Two reasons:
- If those origins come back online, no code change needed.
- Future curated hmembeds IDs (added by hand or discovered via
subreddit posts) will resolve through the same path.
Files added: `extractors/hmembeds.py` (~120 lines incl. the decoder
and a `decode_embed(html) -> str | None` helper that's reusable).
Registered in `__init__.py`. The existing CuratedExtractor stays
disabled; this replaces its mechanism with one that can absorb new
embed IDs without code changes.
Bonus from the agent work:
- Confirmed our stealth.js is sufficient — the runtime trace showed
the decoder reads only the surfaces we already cover.
- Identified ~10 fingerprint surfaces we don't spoof (platform,
userAgentData, hardwareConcurrency, deviceMemory, timezone,
AudioContext, ICE candidates) but proved they're not what's
blocking us, so no change needed for now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier I claimed the OAuth Web UI flow was the only way to onboard
new Forgejo repos in Woodpecker. That's wrong.
Two parts to the actual workaround:
1. Woodpecker session JWTs are HS256 signed with the user's per-user
`hash` column from the PG `users` table (NOT the global agent
secret). Mint a session JWT for the Forgejo viktor user (id=2,
forge_id=2), and you're authenticated as that user.
2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls
Forgejo with viktor's stored OAuth access_token to create the
webhook + per-repo signing key. Works.
The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub
admin), whose user row has no Forgejo OAuth token — Woodpecker's
forge-API call fails for that user, surfacing as a 500.
scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow:
extract hash from PG → mint JWT → activate repo. Verified against
viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in
this session — all activated cleanly.
Also updated the runbook with the actual mechanism + the
WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the
'context deadline exceeded' failures, NOT the v3.14 upgrade).
The kms-web-page deployment now pulls
forgejo.viktorbarzin.me/viktor/kms-website:${var.image_tag} (source
in the new Forgejo repo viktor/kms-website). The ConfigMap-mounted
index.html is gone — the new site is a Hugo build with full GVLK
catalog for every Microsoft KMS-eligible Windows + Office edition,
copy-to-clipboard, dark/light themes.
The container image tag is managed by CI (kubectl set image), so
add lifecycle ignore_changes on container[0].image alongside the
existing dns_config (Kyverno) ignore.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 parallel research agents surveyed Stremio addons, F1 TV / Sky / DAZN
official APIs, IPTV M3U lists, and free-to-air broadcasters. The clean
finding: two community Stremio addons already index Sky Sports F1 +
DAZN F1 via their public HTTP APIs — no Stremio client required, just
GET /stream/<type>/<id>.json on the addon's hosted instance.
New `stremio.py` extractor pulls from:
- **TvVoo** (`https://tvvoo.hayd.uk/manifest.json`) — wraps Vavoo IPTV.
Lists Sky Sports F1 UK + Sky Sports F1 HD + Sky Sport F1 IT + Sky
Sport F1 HD DE + DAZN F1 ES. Returns 2 IP-bound m3u8 URLs per
channel. Source: github.com/qwertyuiop8899/tvvoo. Vavoo's CDN SSL
certs are currently expired so most clients fail verification today
— addon framework is right but delivery is degraded.
- **StremVerse** (`https://stremverse.onrender.com/manifest.json`) —
Returns 11+ streams per id (`stremevent_591` = F1, `stremevent_866`
= MotoGP). Mix of DRM-walled DASH, JW-broken-chain JWT URLs, and
HuggingFace-Space proxies that 404 without a per-instance api_password.
The extractor surfaces 15 candidate URLs per run; verifier filters to
the playable subset. Today that subset is 0 (Vavoo cert expiry + JW
chain + proxy auth), but the wiring is correct: as the addons fix
delivery or rotate to fresh URLs, candidates will start passing.
Other agent findings worth noting (not coded but documented):
- F1 TV Pro live = Widevine DASH; impossible without a CDM. VOD is
clean HLS but only post-session.
- Sky Go / DAZN / Viaplay / Canal+ = all Widevine + geo-fenced + active
DMCA enforcement. Pursuing not feasible.
- ServusTV AT (free F1 race weekends) = clean public HLS at
rbmn-live.akamaized.net/hls/live/2002825/geoSTVATweb/master.m3u8 but
geo-fenced; needs an Austrian-IP egress proxy/VPN.
- iptv-org/iptv has an F1 Channel (Pluto TV IE) at
jmp2.uk/plu-6661739641af6400080cd8f1.m3u8 — 24/7 free, BG works,
but only historic races + shoulder programming. Worth adding as a
curated entry later.
- boxboxbox.* (community-favourite F1 race-weekend domain) is dead
across all known TLDs as of today.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default forge-API timeout is 3 seconds. The config-loader makes
4-6 sequential calls per pipeline trigger (probing for .woodpecker dir
then each .woodpecker.{yaml,yml} variant), and Forgejo responses on
this cluster spike to 1-2s under load — easy to trip the cumulative
3s deadline. Result: 'could not load config from forge: context
deadline exceeded' on virtually every pipeline trigger.
This was the actual root cause of the 'Woodpecker forge-API bug'
that v3.13 → v3.14 was supposed to fix — turns out v3.14 didn't
change the timeout default, and the v3.13 successes I saw earlier
were warm-cache flukes.
User asked specifically for r/motorsportstreams. Reddit banned that sub
years ago; the active 12.5k-subscriber successor is r/motorsportsstreams2.
Added it to SUBREDDITS plus r/f1streams (709 subs, public).
Also extended:
- SEARCH_QUERIES with three Sky Sports F1 / live-stream phrases that
catch the `[F1 STREAM]` post pattern the community uses on race
weekends (titles like "[F1 STREAM] Bahrain GP - Live Race | No Buffer
| Mobile Friendly" linking to boxboxbox.pro/stream-1).
- _INTERESTING_HOSTS allowlist with boxboxbox.{pro,live,lol},
pitsport.live, ppv.to, streamed.pk, acestrlms/aceztrims, and the
Super Formula direct CDNs (racelive.jp, cdn.sfgo.jp) — all observed
in last-50-posts on r/motorsportsstreams2.
Where this leaves us, honestly:
- The r/motorsportsstreams2 megathread "Where to watch every F1 race"
recommends EXACTLY the four sites we already pull from: pitsport.xyz,
streamed.pk, ppv.to, acestrlms. The community has the same broken JW
Player chain we have for Sky Sports F1 24/7 streams. There is no
free-and-working alternative they know about.
- boxboxbox.pro (the most-promoted F1 stream domain in race-weekend
posts) is currently NXDOMAIN; .live is parked, .lol unreachable. The
domain rotates after takedowns; Reddit posts will surface fresh ones
when posters share them.
- For F1 specifically: extractor surfaces 2 motomundo.net candidates
(MotoGP wrappers) and lights up to ~6+ during F1 race weekends as
posters share fresh boxboxbox/equivalent URLs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User asked for two lines instead of side-by-side bars at monthly
granularity. Converts panel 25 from barchart to timeseries:
* type: barchart -> timeseries
* format: table -> time_series, SELECT month::timestamp AS time
* drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto
* Same blue (contributions) / green (market gain) colour overrides
Where the green line rises above the blue line is the visual cue that
the market out-earned new contributions for that month -- the trend
the user wants to track.
Diff is small (15 ins / 28 del) because the bar-chart-only fields
(barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation)
are dropped.
Goal stated by user: see when monthly market gain starts to exceed
monthly contributions, i.e. the inflection point where the market is
out-earning savings rather than the other way around.
New panel id=25 between the annual decomposition (13) and per-account
ROI (14): bar chart with two side-by-side bars per month --
contributions (blue) and market gain (green). Same calculation as
panel 13 but month-grain instead of year-grain. Months where the
green bar dwarfs the blue one are visible at a glance.
SQL: same endpoints CTE pattern as panel 13, with date_trunc('month',
valuation_date) as the grouping key. Uses max_complete cutoff so
partial-today doesn't skew the latest month.
Layout: panels at y >= 75 shifted down by 11 (chart height). New
chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10
(activity log) -> y=96.
Spot check (recent months from PG):
2025-07: contrib +£5,601 market +£42,295 <- big market month
2025-09: contrib +£1,501 market +£24,206
2026-02: contrib +£35,501 market +£41,382
2026-03: contrib +£5,501 market -£38,483 <- correction
2026-04: contrib +£73,267 market +£21,448
The CronJob has been broken since registry-private lost the
wealthfolio-sync image (last successful run 36+ days ago). The image
is built from /home/wizard/code/broker-sync (the brokerage data sync —
Trading 212, Schwab, Fidelity, IMAP-CSV → wealthfolio).
Set up: viktor/broker-sync repo on Forgejo with .woodpecker/build.yml
that pushes to forgejo.viktorbarzin.me/viktor/wealthfolio-sync. Until
Woodpecker recognises the new repo's webhook, the image was bootstrapped
via 'docker pull viktorbarzin/broker-sync:latest && docker tag … &&
docker push forgejo.viktorbarzin.me/viktor/wealthfolio-sync:latest' so
the CronJob unblocks immediately.
Fixes the 'could not load config from forge: context deadline exceeded'
issue that blocked every Forgejo-triggered pipeline during the
forgejo-registry-consolidation cutover. Helm chart 3.5.1 stays
(no 3.6 yet); only the image tag overrides change.
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.
* infra/modules/docker-registry/docker-compose.yml — registry-private
service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
integrity step removed (the every-15min forgejo-integrity-probe
in monitoring covers it). Break-glass tarball step still runs but
pulls from Forgejo (the only registry left).
The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Phase 3 commit 3148d15d ran into a disk-full ENOSPC during edit
of stacks/claude-memory/main.tf, and the file was committed truncated
at line 286 mid-string ('Cor instead of 'Core Platform' / closing
braces). terraform validate failed with 'Unterminated template string'.
Restoring the trailing 2 lines + re-applying the
viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/
claude-memory-mcp:17 cutover that Phase 3 was meant to do.
Existing NetworkPolicy only admitted port 3000 (Playwright WS) from
labelled client namespaces, blocking Traefik's traffic to the noVNC
sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang
forever — page never loads, eventually times out.
Adds a second ingress rule allowing TCP/6080 from the traefik
namespace only. Authentik forward-auth still gates external access
at the Traefik layer.
Also reconciles the noVNC image to the new Forgejo registry path
(:v4 unchanged) — already declared in TF, just live-state drift from
the Phase 3 registry consolidation.
Updates the architecture doc; the previous text still described the
old nginx static health stub that noVNC replaced.
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.
Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two fixes for the previously-dormant subreddit extractor + a chrome-browser TARGETS pivot to MotoGP weekend live URLs.
1. **Reddit fetch was 403'd by `Accept: application/json`**. Cluster IP +
that header trips Reddit's anti-bot fingerprint and returns HTML 403.
Removing the explicit Accept (default `*/*`) restores HTTP 200 with
JSON. Confirmed via direct httpx test from the f1-stream pod.
2. **Search the right things**. The community uses a stable
`[Watch / Download] <Series> <Year> - <Round> | <Event>` post pattern
with selftext links to admin-curated WordPress sites (motomundo.net
for MotoGP, sister sites for F1 when active). New extractor:
- Hits both /new.json and /search.json across r/MotorsportsReplays
and three smaller motorsport subs.
- Filters posts where title contains `[watch`, `watch online`, or
flair = `live`.
- Extracts URLs from selftext (regex), filters to a positive
`_INTERESTING_HOSTS` allowlist (motomundo, freemotorsports,
pitsport, rerace, dd12, etc.) so we don't drown the verifier in
YouTube/Discord/gofile links.
- Returns each as embed-type so the chrome-service verifier visits.
3. **chrome_browser.TARGETS pivoted** to the live MotoMundo MotoGP
French GP iframes (motomundo.top/e/<id> + motomundo.upns.xyz/#<id>)
while the weekend is on. The previous DD12 NASCAR + Acestrlms F1
targets were both broken JW Player paths anyway.
State after deploy:
- /streams: 3 verified live (WRC Rally Portugal, NASCAR 24/7, Premier League Darts) — Darts is currently active because UK is mid-match.
- Subreddit extractor surfaces the live MotoMundo URL but the verifier
marks the WordPress wrapper page playable=False (no top-level <video>
element; the m3u8 lives in nested iframes). Next iteration: drill the
verifier into iframe contentDocument and capture from there.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Updated to handle the actual situation: wealthfolio-sync and
fire-planner have registry repos but no tags (broken/abandoned
deployments). Skip those with a SKIP marker. Migrate everything
else as a stop-gap until Woodpecker pipelines start producing
Forgejo images on their own.
The image list now covers all private images currently in scope.
Helm chart 3.5.1 has no `server.hostAliases` field, so the YAML
addition I made earlier was a no-op. Apply via kubectl patch in a
null_resource keyed on helm revision so it re-asserts on every
chart upgrade. Same pattern as the CoreDNS replicas/affinity patch
in stacks/technitium/.
Without this, every helm upgrade on woodpecker reverts the
hostAliases fix and the Forgejo pipeline triggers start failing
with context-deadline-exceeded again.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline triggers from Forgejo were failing with "could not load
config from forge: context deadline exceeded" — Woodpecker's
forge-API fetch path was round-tripping through Cloudflare via the
public IP, hitting 30s deadline timeouts on cold connections. The
in-cluster path via the Traefik LB (10.0.20.200) is consistently
sub-100ms.
Same trick we use for the containerd hosts.toml redirect on each
node — Traefik serves the *.viktorbarzin.me wildcard cert so SNI
verification still passes. OAuth callbacks still use the public
hostname (correct, those come from the user's browser).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Forgejo→Woodpecker webhooks were timing out on first request after
pod restart. The default 5s deadline is too tight for the cold
Cloudflare-tunnel TLS handshake (observed 6-8s). 30s comfortably
covers retries.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Forgejo→Woodpecker webhook (so Woodpecker fires on each push to
viktor/<repo>) was being blocked by the existing ALLOWED_HOST_LIST
of *.svc.cluster.local — ci.viktorbarzin.me resolves to the public IP
because Cloudflare proxying wasn't covering that path. Without this
fix, no Woodpecker pipeline run was triggered on push, the dual-push
bake would never start, and Forgejo's package catalog stays empty.
Add ci.viktorbarzin.me explicitly + *.viktorbarzin.me as a future-
proofing wildcard. The list still excludes arbitrary external hosts,
so this is not a security regression — just unblocking the webhook
to our own CI.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 0 enabled packages but the pod crashloops on
`mkdir /data/tmp: permission denied` — Forgejo loads the chunked
upload path (default /data/tmp/package-upload) before s6-overlay
gets a chance to chown /data. fsGroup tells kubelet to recursively
chown the volume to GID 1000 on mount, which fixes it.
Pre-23-day Forgejo deployed with packages off so this code path
never ran.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Setting it to /data/tmp/package-upload triggers a CrashLoopBackOff
because /data is the volume mount root and is owned by root, not
the forgejo user (uid 1000) — Forgejo can't `mkdir /data/tmp`.
The default value resolves under the AppDataPath (a subdir Forgejo
itself owns) which works fine. Keep the ENABLED=true override; v11
ships packages on but explicit is safer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User asked to broaden the source pipeline so f1-stream can find F1 (and
adjacent motorsport) streams from Sky Sports / DAZN / Reddit / etc.,
using the in-cluster chrome-service headed browser where needed. Four
changes:
1. **streamed.py**: BASE_URL streamed.su → streamed.pk. The .su domain
stopped serving the API host in 2026 (only the marketing page is
left); .pk hosts the JSON API now. Adds 3 events/round (currently
all routed through embedsports.top — see #2 caveat).
2. **chrome_browser.py** (new): generic chrome-service-driven extractor.
Connects to the existing chrome-service WS (CHROME_WS_URL +
CHROME_WS_TOKEN env), navigates a list of TARGETS, captures any HLS
playlist URL the page fetches at runtime, returns one ExtractedStream
per discovery. Uses the same stealth init script as the verifier so
anti-bot checks don't trip the page. Handles iframes (DD12-style
/nas → /new-nas/jwplayer) and probes child-frame <video>/source
elements after settle. Caveat: most aggregator sites (pooembed,
embedsports, hmembeds, even DD12's JW Player path) use a broken
runtime decoder that produces no m3u8 in our environment, so the
TARGETS list is currently 0-yielding; the framework is the
contribution and concrete sites can be added as they're discovered.
3. **subreddit.py** (new): scans r/MotorsportsReplays, r/motorsports,
r/formula1, r/motogp via the public old.reddit.com JSON API for
posts whose flair/title indicates a live stream. Discovered URLs
are returned as embed-type streams; the verifier visits each via
chrome-service to confirm playability. Note: Reddit currently HTTP
403's our cluster outbound IP for anonymous JSON requests; the
extractor returns 0 in that state and logs a debug message. Will
work from any IP Reddit isn't blocking.
4. **dd12.py** (new): inline-HTML scraper for DD12Streams. The site
embeds `playerInstance.setup({file: "..."})` directly in HTML — no
JS decoder needed. Currently surfaces NASCAR Cup Series 24/7 (clean
BunnyCDN-hosted HLS at w9329432hnf3h34.b-cdn.net/pdfs/master.m3u8);
add new `(path, label, title)` tuples to CHANNELS as DD12 expands.
Result: /streams now shows 2 verified live streams (Rally TV via
pitsport + DD12 NASCAR Cup 24/7). When the next F1 weekend (Canadian
GP, May 22-24) goes live, pitsport will surface F1 sessions
automatically via the existing pushembdz path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Companion to forgejo-registry-breakglass.md but for the more common
case: the Forgejo registry is healthy as a whole, but one image's
manifest/blob references are broken (orphan child, half-pushed
upload, retention-vs-pull race). The
RegistryManifestIntegrityFailure alert annotation already points
here.
Mirrors registry-rebuild-image.md (the registry-private equivalent)
in structure: confirm via probe + curl, delete broken version
through Forgejo API, rebuild via Woodpecker manual run, force
consumers to re-pull, verify integrity recovery.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds Forgejo as a second push target on the build-ci-image pipeline
and saves the just-pushed image as a gzipped tarball on the registry
VM disk (/opt/registry/data/private/_breakglass/) so we can recover
infra-ci with `ctr images import` if both registries are down.
* Dual-push: registry.viktorbarzin.me:5050/infra-ci AND
forgejo.viktorbarzin.me/viktor/infra-ci, in the same
woodpeckerci/plugin-docker-buildx step. Same image bytes; the
Forgejo integrity probe (every 15min) catches any divergence.
* Break-glass step: SSHes to 10.0.20.10, docker pulls + saves +
gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant
so a transient registry blip doesn't fail the build pipeline.
* Runbook docs/runbooks/forgejo-registry-breakglass.md documents
the recovery flow (when to use, scp+ctr import, node cordon,
underlying-issue fix).
Tarball mirrors to Synology automatically through the existing
daily offsite-sync-backup job — no new sync wiring needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap the three new Vault key reads in try(...) so the first apply
succeeds even when forgejo_pull_token / forgejo_cleanup_token /
secret/ci/global haven't been populated yet. Without this, CI
auto-apply blocks on the very push that introduces the references —
chicken-and-egg with the runbook order (which is: apply Forgejo bumps,
then create users + PATs, then apply the rest).
Empty tokens are intentionally visible-broken (auth fails, probe
reports auth failure, cleanup CronJob errors) — that's the signal
to run the bootstrap runbook. Subsequent apply picks up the real
values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User feedback: every stream on /watch shows ads but the player fails
to load. Three causes, three fixes:
1. CuratedExtractor's two hmembeds 24/7 channels (Sky F1, DAZN F1)
sat at the top of the list and ALWAYS failed: they load the
upstream's ad overlay then JW Player throws error 102630 (empty
playlist; the obfuscated decoder produces no fileURL in our
environment). Disabled the registration in extractors/__init__.py
until/unless we find a working bypass — leaving the existing
`CURATED_BYPASS = {"curated"}` shim in service.py so the swap is
reversible.
2. Pitsport surfaces every WRC stage / MotoGP session as its own
/watch UUID, but they all resolve to the same upstream m3u8 URL
(e.g. RallyTV one master.m3u8 across all 22 Rally de Portugal
stages). Added URL-keyed dedupe in service.run_extraction so the
/streams response shows one row per actual stream.
3. The pitsport category filter was still narrowed to motorsport.
Pitsport.xyz only lists curated sports broadcasts (WRC, MotoGP,
IndyCar, NASCAR, Premier League Darts, Premier League football…),
so the site's own selection is the right filter. Replaced the
hand-maintained MOTORSPORT_KEYWORDS list with `bool(category or
title)` — anything pitsport returns goes through. Streams that
aren't actually live get filtered out downstream when the embed
API returns an empty manifest.
Frontend: hls.js `lowLatencyMode` was on by default but RallyTV (and
most non-LL-HLS providers) don't ship the LL-HLS extensions, which
broke playback in real browsers. Default to `lowLatencyMode: false`.
Result: /streams is now 1 verified live entry (Rally TV WRC stage
currently airing); was 24 with the top 2 always broken + 22 dupes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous extractor only surfaced Formula 1/2/3 and never returned
anything outside race weekends. Two fixes:
1. Broadened category filter from {formula 1/2/3} to a motorsport set
(MotoGP/Moto2/Moto3, WRC/WEC/IndyCar/NASCAR + the F1 series).
Replaces the NON_F1_KEYWORDS exclusion list with a positive-match
MOTORSPORT_KEYWORDS set; removes the F1-specific filter on title
keywords. Old `_is_f1_*` aliases retained as compat shims.
2. Updated `_parse_stream_config` for the current pushembdz.store embed
payload — Next.js now serves `safeStream` (just title + method) and
the actual stream URL is fetched at runtime from
`pushembdz.store/api/stream/<slug>`. Extractor now hits that endpoint
when the inline link is missing. Treats `method=jwp` as HLS and
accepts URLs ending in `.css` (pushembdz disguises some HLS playlists
with a `.css` extension).
End-to-end result: /streams went from 2 (curated, broken JW decoder) to
24 streams marked `is_live=True`. The verifier confirms each via
`manifest_parsed_codec_missing_in_verifier` (Playwright Chromium has no
H.264 — manifest fetch alone is the codec-independent positive signal).
Currently surfaces Rally de Portugal SS1–SS22 (WRC); MotoGP starts
appearing once the French GP weekend goes live tomorrow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The static nginx stub at chrome.viktorbarzin.me wasn't useful for
debugging anti-bot interactions. Swap it for a live noVNC HTML5 view
of the headed Chromium session: x11vnc taps Xvfb's :99 over localhost
TCP (added `-listen tcp -ac` to Xvfb), websockify wraps it as a WS
endpoint, and noVNC's vendored web client serves it on :6080.
The ingress chain is unchanged — chrome.viktorbarzin.me stays
Authentik-gated, dns_type=proxied, port 3000 (the Playwright WS) stays
internal-only behind the NetworkPolicy + token. Custom image
`registry.viktorbarzin.me/chrome-service-novnc:v4` (ubuntu:24.04 +
x11vnc + websockify + novnc apt packages) needs imagePullSecrets, so
also added registry-credentials reference to the deployment spec.
x11vnc flags: `-noshm -noxdamage -nopw -shared -forever`. SHM is
disabled because each container has its own /dev/shm so the X server
can't grant access; XDAMAGE isn't compiled into the noble Xvfb. The
sidecar entrypoint waits up to 30s for both Xvfb (:6099) and x11vnc
(:5900) to bind before exec'ing websockify.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The f1-stream verifier's in-process headless Chromium kept tripping
hmembeds' disable-devtool.js Performance detector (CDP latency on
console.log vs console.table) and getting redirected to google.com.
This adds a single-replica chrome-service stack running Playwright
launch-server under Xvfb so callers can connect via WS+token to a
shared headed browser. f1-stream's _ensure_browser now prefers
chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored
stealth init script (webdriver/plugins/languages/Permissions/WebGL
spoofs + querySelector hijack to disarm disable-devtool-auto) on
every new context. Falls back to in-process headless if the env
vars aren't set.
Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000
gated by client-namespace label, 6h tar.gz backup CronJob to NFS,
Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human
liveness checks. Image pinned to playwright:v1.48.0-noble in
lockstep with the Python client's playwright==1.48.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in
2026.4.21+). Auth profile is OAuth via the device-pairing flow against the
Codex backend (account ancaelena98@gmail.com); token persists in
/home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives
pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h);
gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin
gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model
(gpt-5-pro) after model discovery, so the container command pins the mini
back as default after doctor runs but before gateway start.
Per user feedback: the demo Big Buck Bunny / Apple test streams aren't
useful in an F1-streams app. Removed DemoExtractor entirely. Tightened
the discord-extractor path filter from "any stream-shaped path" to
"direct embed/player path only" — the previous filter still let
sportsurge `/event/...` landing pages through, which the verifier
mistook for playable because they render player-class divs without a
real player.
Embed proxy now also rewrites window.fetch + XMLHttpRequest.open inside
the upstream HTML so that cross-origin XHRs (e.g. the hmembeds
`/sec/<JWT>` token-binding endpoint) go through our /embed-asset relay.
This avoids the CORS reject that fired when the player JS tried to call
hghndasw.gbgdhdffhf.shop/sec/... from an `f1.viktorbarzin.me` origin.
The verifier now requires a `<video>` element to mark embed streams
playable (not just a player-class div). Curated streams bypass the
verifier — hmembeds aggressively detects headless Chromium (devtool
trap, console-clear timing, automation flags) and won't progress past
JW Player init in our pod, but the user's real browser should clear
those checks. We can't honestly headless-verify hmembeds, so we trust
the curator instead of falsely rejecting them.
Image: viktorbarzin/f1-stream:v6.1.1
Cuts the stream list from 23 mostly-broken entries to ~6 confirmed-playable
ones, and adds an iframe-stripping proxy so embed sources (hmembeds, etc.)
load through our origin without X-Frame-Options / CSP / JS frame-buster
blocks.
Why: the previous list was dominated by Discord-shared news article URLs,
hardcoded aggregator landing pages, and other non-stream URLs that all sat
at is_live=true because embed streams skipped the health check entirely.
Users could not tell which links would actually play.
What:
- backend/playback_verifier.py: new headless-Chromium verifier (Playwright)
that polls each candidate stream for a codec-independent "playable" signal
(hls.js MANIFEST_PARSED for m3u8; <video>/player div for embed). Replaces
the unconditional is_live=True for embed streams in service.py.
- backend/embed_proxy.py: new /embed and /embed-asset routes that fetch
upstream embed pages, strip X-Frame-Options/CSP/Set-Cookie, and inject a
<base href> + frame-buster-defeat <script> that locks down window.top,
document.referrer, console.clear/table, and window.location so the
hmembeds disable-devtool.js redirect-to-google trap can't fire.
- extractors/curated.py: new always-on extractor with two known-good 24/7
hmembeds embeds (Sky Sports F1, DAZN F1) so the list isn't empty between
race weekends.
- extractors/__init__.py: register CuratedExtractor first; drop
FallbackExtractor (its 10 aggregator landing-pages can't iframe-play).
- extractors/discord_source.py: positive-match path filter (must look like
/embed/, /stream, /watch, /live, /player, *.m3u8, *.php) plus expanded
domain blocklist for news sites — was 10 noise URLs, now ~1.
- extractors/service.py: run_extraction now health-checks AND verifier-
checks both stream types; only verified-playable streams reach is_live.
- main.py: register /embed + /embed-asset routes; defer initial extraction
by 8s so the verifier can reach the local /embed proxy on 127.0.0.1:8000.
- frontend/lib/api.js + watch/+page.svelte: route embed iframes through
/embed proxy instead of the upstream URL, so X-Frame-Options/CSP can't
block them.
- Dockerfile: install Playwright chromium + system codec-runtime libs.
- main.tf: bump pod memory 256Mi → 1Gi for chromium.
Verified end-to-end with Playwright against
https://f1.viktorbarzin.me/watch — 6/6 streams reach a player UI; the 3
demo m3u8s actually play (codec-bearing browser); the 3 embeds (Sky
Sports F1, DAZN F1, sportsurge) render iframes through the proxy.
Image: viktorbarzin/f1-stream:v6.0.5
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent
TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
returns 404. With both gone the openclaw failover never reached
mistral-large-3 in time, so every message hung until the 120s embedded-run
timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP
~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.
Better visual grouping: instead of 8 paired panels in a single row at
w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row
("all" — wealth change incl new money), bottom row ("mkt" — pure
market gain). Columns are timeframes 1d / 7d / 30d / 90d.
Reading vertically: same window, two interpretations side by side.
Reading horizontally: same metric across timeframes.
Layout shift: delta row goes from y=4 (4 wide) to y=4..11 (8 high).
All chart/log panels with y >= 8 shift down by another 4 rows
(net-worth chart 8->12, activity log 81->85, etc.).
User feedback: net-worth delta panels (1d/7d/30d/90d) confused
because +£174k over 90d looked too big against the £271k cumulative
unrealised gain. Decomposition showed the 90d delta was £114k of new
money in (contributions) + £60k of actual market gain.
So now the delta row shows BOTH:
Δ Nd (all) — net-worth change incl new money (the original number)
Δ Nd (mkt) — pure market gain, contributions stripped out
Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but
selecting both total_value and net_contribution, then computing
(nw_delta - contrib_delta) = market_gain over window.
Layout: 8 panels at w=3 each on the y=4 row, paired by window
(all next to mkt for each timeframe), so you can see "wealth
change vs investment performance" at a glance.
Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.
New row at y=4 with 4 stat panels showing net-worth change over the
trailing windows. Each uses the latest-per-account stitching pattern
(skew-resilient against partial-day syncs) and computes:
delta = SUM(latest per account) - SUM(latest per account at or
before max_complete - N)
Where max_complete is the most recent date all accounts have a row.
For each window: 1d, 7d, 30d, 90d.
Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612.
All panels at y >= 4 shifted down by 4 rows to make room (Net worth
chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.).
Note: this commit also reformats the dashboard JSON from compact-
object form to indented form (json.dump indent=2 side effect from the
Python patch script). No semantic changes outside the new panels and
y-shifts.
Cloudflare cannot proxy raw TCP/1688 (KMS protocol). Switch
kms.viktorbarzin.me from CF-proxied CNAME to direct A/AAAA so
clients can reach the vlmcsd LoadBalancer (10.0.20.200) via the
existing pfSense WAN port-forward for 1688.
Verified end-to-end: vlmcs against 176.12.22.76:1688 completes
the KMS V4 handshake for Office Professional Plus 2019.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
nginx's not_modified_filter evaluated If-Match headers forwarded by
Traefik's forwardAuth, returning 412 and breaking CalDAV VTODO updates
from macOS/iOS Reminders. Switch to OpenResty and clear conditional
headers with Lua before proxy processing.
2026-05-05 22:52:43 +01:00
36 changed files with 1167 additions and 244 deletions
**Status**: **DEFERRED** — design + plan complete, NOT scheduled. Awaiting either PVE host capacity expansion OR a separate right-sizing pass on the existing master before this becomes affordable. Paired plan: `2026-05-21-ha-control-plane-plan.md`.
**Beads**: code-n0ow (open, deferred — see `bd show code-n0ow`)
**Trigger**: 2026-05-21 k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
## Why deferred (2026-05-23)
Measured during the locking pass:
- **k8s-master uses 4.6 GB of 32 GB allocated** (kube-apiserver 2.6 GB + etcd 660 MB + cm 360 MB + ~1 GB everything else). The 32 GB sizing is ~5-6× oversized vs working set.
- **PVE host is already 98% RAM-committed** — 262 GB allocated to VMs against 267 GB physical, with 1.5 GB of active swap. The planned 3 × 32 GB control plane (+64 GB net) would push allocation to 326 GB → OOM on the host.
- **Software-only HA on a single PVE host has bounded value** — a hypervisor crash still loses all 3 masters. The big resilience wins (kubeadm upgrades, cert rotation, planned reboots) are real but the disaster-recovery angle is limited until a second PVE host exists.
### Revisit triggers — any of:
1. **Second PVE host added** to the lab. Hardware HA becomes possible; HA control plane becomes the natural follow-up. Spread the 3 masters across 2 hosts (2+1).
2. **Cluster-wide right-sizing pass** that frees enough headroom for the original 3 × 32 GB plan, OR pre-agreed amendment to provision 16 GB masters (right-sized to actual usage; 3-4× current working-set headroom).
3. **Storm cascade burns enough hours** that the operational cost outweighs the memory cost — track minutes spent manually nursing kubeadm upgrades; if cumulative > ~10h over a few months, revisit.
### What's still good
The design + plan in this directory remain authoritative. When we revisit:
- All 14 locked decisions stand.
- Challenger amendments (cloud-init template bump, rbac multi-master refactor, HTTPS `/readyz` health check, expanded blast radius, etcd-backup nodeSelector, chain extension as Phase 7) are baked in.
- Only the sizing decision needs revisiting — likely 16 GB per master instead of 32 GB.
- Adding `k8s_master_hosts` list-based refactor to the rbac stack (Phase 1.5) is a **standalone win** that could be done independently of HA — it would future-proof the cluster against the day HA lands. Consider lifting that as its own task.
## Problem statement
@ -50,18 +73,24 @@ The k8s upgrade chain doesn't need to be aware of *any* of this — the
underlying availability of apiserver makes the chain's gates
naturally pass on each iteration.
## Decisions (proposed — to be confirmed)
## Decisions (locked 2026-05-22)
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2` (VMID 205, 10.0.20.110), `k8s-master-3` (VMID 206, 10.0.20.111). |
| 3 | **Apiserver LB**: **pfSense HAProxy** — new TCP frontend on `10.0.20.99:6443` mirroring the mailserver pattern. Idempotent via `scripts/pfsense-haproxy-bootstrap.php`. | Pros: no per-node moving parts, mirrors existing mailserver layout. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (gateway/DNS/ingress). |
| 4 | **VIP**: `10.0.20.99` (one below current master `.100`, well clear of MetalLB pool `.200-.220`). Internal-only — external API access stays via Cloudflared. | All kubeconfigs + kubelet.conf entries flip from `10.0.20.100:6443` → `10.0.20.99:6443`. |
| 5 | **etcd**: kubeadm-managed stacked; `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend the bash loop in `stacks/kured/main.tf` with a "≥2 control-plane nodes Ready" check between the existing all-nodes-Ready and calico-Ready checks | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: `etcdctl snapshot save` from any member is a consistent point-in-time of the full quorum state — but the existing CronJob is pinned `node_name = "k8s-master"`. Phase 4.5 flips this to a control-plane label + toleration so backups don't silently skip when master-1 is drained. | Snapshot CORRECTNESS unchanged; SCHEDULING needs fixing. |
| 9 | **VM provisioning**: cloud-init via `create-template-vm` module, **but the template needs an apt-source bump first** (v1.32 → v1.34) and a control-plane gate on `k8s_join_command` so master VMs don't auto-join as workers. Existing master stays as the legacy manual VM (not rebuilt). | The repo has zero VMs using cloud-init for provisioning today — we're the first user. Update template first, then use it. |
| 10 | **Cert SAN + controlPlaneEndpoint retrofit**: Phase 0, before any new master joins. Patch `kubeadm-config` via `kubeadm init phase upload-config kubeadm --config <file>` (kubeadm-owned write, future-proof against `kubeadm upgrade apply`), regen `apiserver.crt` via `kubeadm init phase certs apiserver`, restart the kube-apiserver pod (~30s outage on the existing master only). | Standard kubeadm retrofit path; `kubeadm join --control-plane` requires controlPlaneEndpoint to be set. |
| 11 | **Multi-master config propagation (Phase 1.5)**: refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` to loop over a list of master hosts. Apply BEFORE master-2/3 join so they boot with OIDC, audit policy, and etcd tuning already in place. | Today these stacks SSH into a single master and sed into `kube-apiserver.yaml` — if not propagated, Authentik login flaps depending on which master the LB lands on. |
| 12 | **k8s-version-upgrade chain extension (Phase 7)**: extend `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` to discover and iterate over all control-plane nodes (drain → upgrade → uncordon, gated by quorum check). | Without this, chain only upgrades master-1; masters 2/3 drift behind one version per upgrade. Original autonomous-upgrades goal unmet. |
| 13 | **LB health check**: HTTPS `GET /readyz` (with `verify none` for self-signed apiserver cert), NOT plain TCP. | Plain TCP misses apiserver-NotReady states (etcd unreachable, controller-manager flapping). |
| 14 | **VIP DNS name**: add `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` BEFORE Phase 4. Delete stale `kubernetes IN A 10.0.20.100`. Consumers reference the FQDN, not the bare IP — future renumbering is then a single record change. | |
## Out of scope
@ -74,12 +103,15 @@ naturally pass on each iteration.
| Risk | Mitigation |
|---|---|
| Phase 0 cert regen on existing master triggers a brief apiserver outage (~30s) | Already a known cluster behaviour during static-pod restart. Schedule during a low-activity window. Tigera/operators will crash-loop briefly but recover — same blast radius as today's k8s upgrade. **Once HA is up, future restarts won't have this surface at all.** |
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master directly (bypass LB) before flipping clients. Keep a kubeconfig pointing at `10.0.20.100:6443` as fallback. |
| Existing kubeconfigs (Woodpecker pipelines, agents, dev VM, in-cluster RBAC default) need updating | Single Terraform apply touches `stacks/rbac/modules/rbac/apiserver-oidc.tf` (default), `.woodpecker/*.yml` (committed kubeconfigs). Worker `kubelet.conf` files patched in Phase 4 via ssh loop. |
| New masters get scheduled workload pods unintentionally | Verify `node-role.kubernetes.io/control-plane:NoSchedule` taint is applied at join time (default with `--control-plane`). |
| Cert rotation propagation | kubeadm join uses the `--certificate-key` from `kubeadm init phase upload-certs` to fetch existing CA materials. Single short-lived secret in `kube-system/kubeadm-certs` (**2h TTL** — Phases 2 + 3 must complete within the window, or re-upload between them). |
| 32GB per master × 3 = 96GB RAM used for control plane alone | PVE host has 272GB total, 176GB allocated to cluster pre-HA. Post-HA: 240GB allocated, 32GB headroom. Sufficient. |
| Pre-existing kubeadm-config does NOT have `controlPlaneEndpoint` set | Phase 0 patches it. Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml \| grep controlPlaneEndpoint` (absent → `10.0.20.99:6443` post-Phase 0). |
| Existing master cert SANs are `[k8s-master, 10.96.0.1, 10.0.20.100]` only — missing VIP | Phase 0 regens with `--apiserver-cert-extra-sans 10.0.20.99` after patching kubeadm-config. |
## Verification
@ -91,12 +123,23 @@ kubectl get nodes -l node-role.kubernetes.io/control-plane=
| `.woodpecker/{default,drift-detection,renew-tls,provision-user}.yml` (4 files × 2 refs each — kubeconfig `server:` AND `curl` lines) | repo root | 4.1 |
| `stacks/k8s-portal/.../files/src/routes/{download,setup/script}/+server.ts` (`CLUSTER_SERVER` const used to generate user kubeconfigs) | k8s-portal module | 4.1 |
- **Phase 0 apiserver restart (~30s)** = same blast radius as today's k8s upgrade (tigera/cnpg/gpu-operator briefly crash). The LB doesn't help here because the new cert isn't yet trusted by clients. Accept the brief outage. Schedule during a low-activity window.
- **`kubeadm-certs` secret TTL = 2h** (NOT 24h as initially stated). Phase 2 + 3 must complete within the window, or re-upload between them.
- **pfSense haproxy bootstrap = reset-to-declared-state** on each run (lines 155-158 of the script). Adding master-2 means the apiserver pool is briefly torn down + rebuilt. TCP frontends bounce. Long-poll connections from kubelets break + reconnect. Expect ~2-5s of "kubectl: unable to connect" during pool rewrites.
- **TCP health check is too lax** for apiserver (listener up ≠ ready). Phase 1 uses HTTPS `GET /readyz` with `verify none` — catches NotReady (etcd unreachable, controller-manager flapping).
- **Worker kubelet.conf flip**: kubelet TLS bootstrap re-auths against new endpoint on restart. Expect 5-10s NotReady per node during the Phase 4.2 loop.
- **VIP cannot be the existing master IP**: confirmed `.99` is free (no grep matches, no MetalLB pool conflict — pool is .200-.220).
- **pfSense reboot windows**: pre-Phase-4 OK (clients still on direct IP), post-Phase-4 breaks everything. Don't migrate near a pfSense maintenance window.
## Phased plan
Reversible up to Phase 4. Phase 4+ reverse via the rollback section.
- [ ] Hand-edit /tmp/kubeadm-new.yaml: take the existing CM as base, add `controlPlaneEndpoint: 10.0.20.99:6443` under ClusterConfiguration, add `apiServer.certSANs: [10.0.20.99, k8s-apiserver.viktorbarzin.lan]`
- [ ] Wrap each `null_resource` / `provisioner "remote-exec"` block in `for_each = toset(var.k8s_master_hosts)` so the same sed runs on every master
- [ ] In `stacks/rbac/main.tf` set `k8s_master_hosts = ["10.0.20.100"]` (still single-master in this phase — variable is forward-looking, no behaviour change yet)
- [ ] **1.5.2 `scripts/tg apply` rbac stack** — confirm zero diff against today (no-op refactor)
- line 49: apt source `pkgs.k8s.io/core:/stable:/v1.32/deb/` → `pkgs.k8s.io/core:/stable:/v1.34/deb/`
- line 135: wrap `${k8s_join_command}` in a conditional via cloud-init `if:` template logic, or simpler: add `${k8s_join_command_or_noop}` and let the module pass `""` for masters and the real worker join command for workers (default)
- [ ] Update `infra/modules/create-template-vm/main.tf` to add `variable "k8s_join_command" { default = "" }` and a conditional in the templatefile to skip the runcmd line when empty
- [ ] Rebuild the template: `scripts/tg apply -target=module.k8s_template` (or whatever the existing template-build target name is in `stacks/infra/main.tf`)
- [ ] Verify new template registered in Proxmox at the same template_id
- [ ] **2.1 Add master-2 VM to Terraform**
- [ ] In `stacks/infra/main.tf`: add `module "k8s-master-2"` using `create-vm` from the (now-v1.34) k8s template, with master sizing (8 vCPU / 32GB / 64GB), VMID 205, IP `10.0.20.110`, unique MAC, `vmbr1/vlan 20`, `use_cloud_init = true`, and explicitly pass `k8s_join_command = ""` (so first-boot does NOT auto-join as worker)
- [ ] Worker loop: `for n in k8s-{master,node1,node2,node3,node4,master-2,master-3}; do ssh wizard@$n.viktorbarzin.lan "sudo grep server: /etc/kubernetes/kubelet.conf"; done` — all show VIP
- [ ] Trigger a no-op Woodpecker pipeline (commit a typo fix in a runbook) — verify the kubeconfig path through the new VIP
- [ ] **4.5.3 Verify backup runs** — trigger a manual job-from-cronjob, confirm it lands on one of the 3 masters and produces a valid snapshot
### Phase 5 — kured-sentinel-gate quorum check (~15 min)
- [ ] **5.1 Edit `infra/stacks/kured/main.tf`** (insert into the bash heredoc in the sentinel-gate ConfigMap, between all-nodes-Ready and calico-Ready checks)
```bash
# Check 3b: control-plane quorum safety (HA invariant)
- [ ] During the 50-90s reboot: tight loop `while true; do kubectl get nodes -o name | wc -l; sleep 2; done` from devvm — line count never drops to 0 (LB transparent)
- [ ] After boot: `kubectl uncordon k8s-master`, verify apiserver static pod re-registers in LB pool (op_state=2)
- [ ] **6.2 All-masters apiserver flag parity**
- [ ] `for h in k8s-master k8s-master-2 k8s-master-3; do echo "=== $h ==="; ssh wizard@$h.viktorbarzin.lan 'sudo grep -E "oidc-issuer-url|audit-policy|auto-compaction-retention|snapshot-count" /etc/kubernetes/manifests/{kube-apiserver,etcd}.yaml | sort'; done`
- [ ] Expect identical flag set across all 3 masters
- [ ] **6.3 Update documentation**
- [ ] Add `docs/architecture/control-plane.md` — HA topology, etcd member list, LB config location
- [ ] `kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)` (no actual upgrade pending — chain should noop the upgrade phase but exercise the discovery loop)
- [ ] Verify logs show 3 masters discovered in correct order
- [ ] **7.5 (Real test on next patch release)** — when 1.34.8 ships:
- [ ] Watch the chain execute drain → upgrade → uncordon across all 3 masters in turn
- [ ] Confirm no manual intervention needed
### Phase 8 — Close out
- [ ] **8.1 Update beads** — `bd close code-n0ow` once all 6 acceptance criteria met (see below)
## Rollback plan
### Before Phase 4 (no clients flipped)
- **Phase 0**: restore apiserver cert/key from `/tmp/apiserver-backup/`, edit kubeadm-config back, restart kubelet on master.
- **Phase 1**: remove `apiserver_proxy_6443` + `apiserver_nodes` from `pfsense-haproxy-bootstrap.php`, re-run; revert DNS A record in config.tfvars.
- **Phase 1.5**: revert rbac stack to single `k8s_master_host` var; apply.
- **Phase 2/3**: on failed master `sudo kubeadm reset --force`; from a surviving master `etcdctl member remove <id>`; `tg destroy -target=module.k8s-master-N`.
### After Phase 4 (clients flipped)
- Revert all the Phase 4.1 file changes (single revert commit).
- Reverse the kubelet.conf sed loop (VIP → direct IP) on all 7 nodes.
- Phase 0 controlPlaneEndpoint can stay — harmless even on full rollback.
### Worst case (etcd corruption / multi-master split-brain)
- Restore from latest etcd snapshot via `etcdctl snapshot restore` to a single master.
- Rebuild master VM from the Proxmox snapshot taken in Phase 0.1.
- Cluster back to single-master.
## Acceptance criteria (beads `code-n0ow`)
- [ ] 1. Design doc + plan doc written ✓ (this commit)
- [ ] 2. Plan approved by user
- [ ] 3. 3 masters online, etcd quorum healthy, apiserver LB working
- [ ] 4. k8s upgrade chain runs end-to-end across **all 3 masters** without manual intervention (Phase 7)
- [ ] 5. kured-sentinel-gate respects quorum (Phase 5)
- [ ] 6. etcd backup runs from any control-plane node (Phase 4.5)
## Open questions
None — all locked via 2026-05-22 decision pass + challenger amendment pass.
# change → kubeadm waits forever for a change that will never come).
# Skipping the etcd phase on retry is safe IF etcd is already on the
# right version (which is the only case where this timeout fires).
attempt=1
while ! sudo kubeadm upgrade apply "v$RELEASE" -y;do
extra_flags=""
while ! sudo kubeadm upgrade apply "v$RELEASE" -y $extra_flags;do
if(( attempt >=3));then
echo"ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
exit1
fi
echo"==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
echo"==> kubeadm apply attempt $attempt failed. Retrying with --etcd-upgrade=false (etcd image is unchanged for patch upgrades; kubeadm's static-pod-hash watch is the only thing failing)."
extra_flags="--etcd-upgrade=false"
sleep 30
attempt=$(( attempt +1))
done
echo"==> kubeadm upgrade apply succeeded on attempt $attempt"
echo"==> kubeadm upgrade apply succeeded on attempt $attempt (flags: '$extra_flags')"
summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."