Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.
- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
(.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
kubernetes_service data source -- cannot rot on a future renumber and avoids
the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
-> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
registry + LB-IP renumber checklist (in-band + out-of-band consumers).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shlink-web admin SPA (shlink.viktorbarzin.me) showed "Something went
wrong while loading short URLs". Root cause: the web client was untagged
(:latest) and Keel's 2026-05-26 match-tag rewrite downgraded it to the
ancient 0.1.1 (2019 image), which speaks the removed /rest/v1/authenticate
API (404) and serves nginx on port 80. Backend (shlink:5.0.2) was healthy.
Pin shlink-web-client to 4.7.1 (current stable; :latest/:stable resolve to
it) and align container port + both probes + service target_port to 8080
(the port the 4.x nginx listens on). A clean semver anchor can no longer be
Keel-downgraded to 0.1.1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reconciles the tripit stack source with live state and adds the forward
flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a
rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded),
filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's
Authentik login identity). Adds an ingest-plans CronJob that polls spam@
filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target)
so forwarded bookings are extracted and attached to the matching trip;
IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The recruiter-api plugin's announceEvent() sends recruiter cards to Telegram
via OPENLOBSTER_CHANNELS_TELEGRAM_TOKEN (its fallback path, since OpenClaw
doesn't pass api.bot to "kind: tools" plugins). That env was never set in the
container, so every hourly poll threw on the send, events were never marked
consumed, and no Telegram notification ever went out — the rest of the
"recruiter pipeline has no responses" problem (the GPU/triage half was fixed
separately). Wire it from openclaw-secrets.telegram_bot_token (same token as
channels.telegram.botToken). Verified: the 3 backlogged events were announced
+ consumed on the openclaw restart.
Drafting (the /api/draft 500 that also degraded the cards) was fixed in
parallel by swapping Vault secret/recruiter-responder gpt_mini_model from the
slow/timing-out qwen3-coder-480b to meta/llama-3.3-70b-instruct (~1.6s).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The task-webhook host is an inbound webhook receiver: Forgejo (a machine
with no Authentik SSO cookie) POSTs issue/comment events, so forward-auth
302-bounced every delivery and silently dropped all webhooks. Flip only
this ingress to auth=none; the do_POST handler gates on payload action +
bot-user filtering. Gateway (openclaw) and openlobster stay auth=required.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The app ships complete auth — WebAuthn/passkey (RP_ID=trading.viktorbarzin.me)
+ JWT bearer on every /api/* route + a /ws?token=<JWT> WebSocket. Authentik
forward-auth on / was 302-bouncing the WebAuthn XHR flow and the WS upgrade,
making the app unusable. Flip to auth = "app" so the backend's own auth is the
gate (same-origin SPA + bearer-token API, same pattern as immich). Verified all
11 route modules enforce Depends(get_current_user) and dev_mode defaults False
before flipping.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The single uptime ingress gated the ENTIRE site (path "/") behind
Authentik forward-auth, so public-by-design endpoints 302-bounced to
SSO: status pages (/status/<slug>), push-monitor ingest
(/api/push/<key>), status-page API + heartbeat (/api/status-page),
badges (/api/badge), and static assets. Status pages are for
logged-out viewers and push monitors POST from machines — neither can
follow the Authentik OAuth cookie dance, so all were broken.
Fix mirrors the meshcentral agent carve-out (9a15f3f2): add a second
path-scoped ingress_factory (auth="none") pointing at the same
uptime-kuma Service. Traefik routes longest-rule-first, so these
out-prioritise the "/" catch-all; the dashboard (/, /dashboard,
/manage-*, /settings, etc.) stays Authentik-gated via the original
ingress. WebSocket status UI keeps working — the default middleware
chain passes Upgrade/Connection through.
Verified: /status/infra, /api/status-page/{,heartbeat/}infra,
/api/badge no longer 302 (200); / still 302s to authentik.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two root causes kept all 8 mesh agents (incl. family laptops) offline:
1. The single ingress gated the ENTIRE site (path "/") behind Authentik
forward-auth, so the agent/relay endpoints (/agent.ashx, /meshrelay.ashx,
/control.ashx, etc.) got 302-bounced to SSO. Native mesh clients can't do
the OAuth cookie dance. Fix: add a second ingress_factory (auth="none")
path-scoped to the agent endpoints, pointing at the same meshcentral
service. Traefik routes by rule length so these out-prioritise the "/"
catch-all; the human web UI stays Authentik-gated.
2. After the auth fix, agents reached /agent.ashx but were rejected with
"Agent bad web cert hash" — MeshCentral pins the OUTER TLS cert, but with
TLS offload the agent sees Traefik's Let's Encrypt cert (which differs
between the internal .203 LB and the external Cloudflare path, and rotates
monthly), not MeshCentral's own webserver cert. Fix: set
ignoreAgentHashCheck=true in the init-container config so MeshCentral
echoes back the agent-reported hash. The separate mesh-certificate
(ServerID) handshake still authenticates the server.
Verified: agent paths no longer 302->authentik; web UI root still does;
laptop "Valia_Laptop" enrolled in group "laptops" and ONLINE.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MACHINE_LEARNING_MODEL_TTL=600 is a single global knob, so it unloads the
CLIP textual (smart-search) encoder after idle exactly like OCR/face —
immich has no per-model pin. This CronJob pings the textual encoder every
5 min (< the 600s TTL) via immich-ml /predict, so a search query never
pays the ~1.5s cold-load, while idle OCR/face still free their VRAM on the
shared T4. Textual-only (search = text->embedding->pgvector); the visual
encoder is import-time and left to unload. curl baked into the image (no
runtime install).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot
Service code: viktor/claude-agent-service @ 66104a3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The role panels (Top roles, Top companies by role volume, New roles/day,
Roles by source, Salary distribution) had no location filter, so they showed
all locations regardless of the $location dropdown. Add
'primary_location IN (${location:sqlstring})' to each (matching the comp
panels' pattern). Also switch the 'Your comp vs the market' panel from
hardcoded 'london' to the same $location filter for consistency. Data was
fine (all london-tagged roles genuinely contain 'london').
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a barchart (panel 10) ranking every company's London p50 total comp
(COALESCE total/base) with the user's current comp shown in line, so it's a
direct "how do I compare" view. The user's figure is NOT hardcoded in the
dashboard JSON — it's a labeled comp_point in the DB (company_slug
'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number
out of git. It's below the £500k alert bar (no Slack ping) and ranks too low
to appear in analyze leaders. Runbook documents the panel + how to update the
baseline.
[ci skip] — dashboard ConfigMap applied locally (targeted).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh):
`python -m job_hunter alert --threshold 500000 --location london --slack`
posts to Slack the companies whose London p50 total comp >= £500k, flagging
any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via
the job-hunter-secrets ExternalSecret from Vault secret/job-hunter
slack_webhook_url (seeded from the shared workspace webhook; repointable to a
dedicated channel). Runbook gains an "above-target Slack alert" section.
[ci skip] — applied locally (stack-scoped).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik forward-auth on the shlink REST API + short-link domain
(url.viktorbarzin.me) 302s shlink-web's cross-origin API XHR (CORS
preflight) and SSO-bounces every public short link. Result: the admin
UI showed "Something went wrong while loading short URLs" and short
links never resolved for logged-out clients.
The shlink REST API is self-gated by its X-Api-Key and short links are
public by design, so Authentik must not front this domain. CrowdSec +
rate-limit + anti-AI bot-block still apply. The admin web UI
(shlink.viktorbarzin.me) stays auth=required via module.ingress-web.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI now drives the Deployment rollout (kubectl set image to the build SHA in
.woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment
runs whatever CI last set (image ignore_changes keeps TF from fighting it),
and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly
run). Keel stays enrolled in parallel as a redundant net.
Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc (owned
apps drive their own rollout, Keel parallel — upstream stays Keel-only); add
the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section.
[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.
Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.
Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 36->17 consolidation:
- 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/
yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg).
- Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it
renders regardless of the dashboard time range — fixes empty-by-default. Drops
the duplicate projection-row stat cards + the how-to-view text panel.
- Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns &
contributions / Income vs market / Holdings / RSUs (META) / Projections.
All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg
source. Visual review pending.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
stacks/t3code now points the Authentik-gated ingress at the DevVM t3-dispatch
service (Service+Endpoints -> 10.0.10.10:3780) instead of the in-cluster nginx,
which is removed. Per-user routing + session auto-injection now live on DevVM.
Verified: external 302->Authentik; in-cluster vbarzin/emil.barzin->302 (auto-pair
to own instance), unmapped->403.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
t3 is single-owner (no in-app multi-user), so each person runs their own
`t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service),
emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the
Authentik-injected X-authentik-username to the right instance; unmapped
identities get 403 (no shared fallback). Flipped the ingress auth app→required
(Authentik forward-auth) — the same-origin self-served UI works behind it (WS
carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate.
Mirrors the terminal stack's per-user model.
Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403;
t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes
intentionally unsupported here — deferred until the native app is published.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.
Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.
Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).
Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the committed projections design (docs/plans/2026-05-28-wealth-
projections-{design,plan}.md): a collapsed "Projections" row on the wealth
dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto,
horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-
3y historical line + a base-rate compounding-only line), 3 stat cards, and a
text panel with one-click future time-range links.
Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from
today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF
so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical
rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns
(~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window
is only ~4 months, and the true all-time geomean is skewed by 2021's +86%.
Also aligns "Net pay vs market gain — per month" to consecutive month-end
deltas (same fix as the other monthly panels). Verified all SQL live.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints →
10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied,
auth="app"). t3 ships its own owner-pairing + bearer-session auth, so
Authentik forward-auth is intentionally omitted — it would break the
cross-origin native mobile app and app.t3.codes (bearer-only, no
Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier)
rate-limit the public surface; t3's pairing is the gate. TLS is
auto-synced into the namespace by Kyverno's sync-tls-secret policy.
Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200.
Trade-off (public RCE surface behind app-native auth, no Authentik SSO)
accepted 2026-06-01 to keep the native app + app.t3.codes working.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The PowerShell activation scripts POST small JSON diagnostics to
/diag so script execution errors are captured. The collector
(python:3.12-alpine, ConfigMap-mounted) prints each event to stdout
as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making
events searchable in Grafana (Loki only — no Slack, no Prometheus).
Like /scripts, /diag needs a second ingress_factory carve-out with
full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW
challenge that PowerShell/curl can't solve. Without full_host the
factory would derive kms-diag.viktorbarzin.me and the carve-out
would never match.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around
both failure modes:
- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
`occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
CrashLoop): chart_values renders the live tag via a plural
kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
fresh install/DR), so a re-render never downgrades below live.
Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and
its background-controller overrides a TF-set value, and patch == minor for
Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations like other workloads.
Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.
Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
The 2026-05-26 migration flipped the keel default force->patch and dropped
match-tag from the inject-keel-annotations patch, but Kyverno's add-only
mutate can't remove an annotation that's no longer listed -- 194 workloads
kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in
multi-image pods: the blog's nginx<->nginx-exporter images were swapped and
the site was down 2026-05-26 -> 06-01 (nginx received the exporter's
-nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently
swapped (app lost its /datastore PVC + env, ran ephemeral for days).
- policy now sets keel.sh/match-tag=null (strips on admission, never re-added)
- swept the annotation off all 194 existing workloads (kubectl, no pod restart)
- AGENTS.md: documents the strip; post-mortem added
blog + changedetection un-swapped via kubectl set image (TF-ignored images);
both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG
state authoritative). [ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers
to 256k and corrected the rationale. Investigation found the *binding* limit
for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not
exposed by Traefik config) — oversized authentik_proxy_* cookie piles are
rejected at the h2 layer upstream of bot-block regardless of these buffers.
The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation
(or clearing cookies); these buffers only prevent bot-block being a tighter
bottleneck for sub-64KB piles + HTTP/1.1 clients.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used
`ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's
mtime, so the freshest backup didn't sort as newest — the retention step
deleted the new backup and kept a stale one. Sort lexically (chronological
for these names) and keep the last.
Also exclude html/ (the app code, reproducible from the now-pinned image;
the real config lives at config/config.php, html/config is empty) so the
backup is config+data+custom_apps only → ~4.3G (<5G target).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The offsite Synology hit 97% — the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.
- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
(regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
now excludes html/ (app code, from image), logs, and preview cache and keeps
only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
CrashLooped on the downgrade. Pinning eliminates the drift.
Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (81a7d804) swept in 23 unrelated working-tree files because
a rebase --autostash had left them staged in the index — including 4 files with
leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf,
url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is
invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock
+ the llama-cpp markers) to their prior committed state; terragrunt regenerates
the generated files on the next run. Net effect of the docs commit is now just
the runbook doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The ai-bot-block forward-auth copies the full request (incl. the
accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy.
With 30+ Authentik Proxy Providers under viktorbarzin.me the combined
Cookie header exceeds openresty's default 4x8k buffers, so the auth
check returned 400 "Request Header Or Cookie Too Large" (surfaced as
error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo
OAuth sign-in for affected browsers.
Mirror the existing auth-proxy-config fix: 8x64k accepts the pile.
Applied live via tg apply + bot-block-proxy rollout restart.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The activation scripts now fetch the published GVLK list from /keys.json to
auto-select the right key for the detected edition. Like the .ps1 scripts,
that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the
PoW). Add /keys.json to the ingress_scripts carve-out path list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.
Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202).
Docs: kms-public-exposure runbook + service-catalog entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify
daemonset were missed by the cdb7d9a8 KEEL_LIFECYCLE sweep. The monitoring ns
is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh
annotations; TF kept trying to revert both, plus a live-stamped tier label —
which made `terragrunt plan -detailed-exitcode` return 2 every run and the
drift-detection cron fail daily. Add the standard KEEL ignore_changes (image +
keel.sh annotations) and ignore the tier label so these stop churning.
Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this
does not trigger a monitoring apply. Remaining (separate) drift: the grafana
ACL null_resource (triggers.always) + tls cert refresh.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at
replicas=0 it had no consumer pod and sat Pending forever, falsely tripping
PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to
drive both replicas and the PVC count, so a parked service has no PVC at all.
Empty/never-bound PVC removed; recreated automatically when un-parked.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its
auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence
firing, which halts kured node reboots. Set external_monitor=false so a
deliberately-down service stops tripping the divergence gate. Re-enable when
the deployment is brought back up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.
Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.
Docs: post-mortem + automated-upgrades.md gate note.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>