The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used
`ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's
mtime, so the freshest backup didn't sort as newest — the retention step
deleted the new backup and kept a stale one. Sort lexically (chronological
for these names) and keep the last.
Also exclude html/ (the app code, reproducible from the now-pinned image;
the real config lives at config/config.php, html/config is empty) so the
backup is config+data+custom_apps only → ~4.3G (<5G target).
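The retention fix can be sketched as follows. This is a minimal Python illustration, not the actual cleanup script; the function name `stale_backups` and the `keep` parameter are hypothetical. It relies only on the fact that YYYYMMDD_HHMMSS names sort chronologically when sorted lexically.

```python
import re

# Matches the dated backup directory names, e.g. 20260530_021500.
STAMP = re.compile(r"^\d{8}_\d{6}$")


def stale_backups(names, keep=1):
    """Return the dated directory names to delete, oldest first.

    Sorting is purely lexical on the name, so directory mtimes
    (which rsync -a copies from the source) play no part.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dated = sorted(n for n in names if STAMP.match(n))
    return dated[:-keep]


if __name__ == "__main__":
    dirs = ["20260530_021500", "20260528_021500", "20260529_021500", "lost+found"]
    # Prints the two older dated directories; the newest one is kept.
    print(stale_backups(dirs))
```

Names that do not match the timestamp pattern are never returned, so unrelated directories are left alone.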
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The offsite Synology hit 97% capacity: the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.
- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
(regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
now excludes html/ (app code, from image), logs, and preview cache and keeps
only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
CrashLooped on the downgrade. Pinning eliminates the drift.
Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (81a7d804) swept in 23 unrelated working-tree files because
a rebase --autostash had left them staged in the index, including 4 files with
leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf,
url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash; files with
conflict markers are invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock
+ the llama-cpp markers) to their prior committed state; terragrunt regenerates
the generated files on the next run. Net effect of the docs commit is now just
the runbook doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The ai-bot-block forward-auth copies the full request (incl. the
accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy.
With 30+ Authentik Proxy Providers under viktorbarzin.me the combined
Cookie header exceeds openresty's default 4x8k buffers, so the auth
check returned 400 "Request Header Or Cookie Too Large" (surfaced as
error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo
OAuth sign-in for affected browsers.
Mirror the existing auth-proxy-config fix: 8x64k accepts the pile.
Applied live via tg apply + bot-block-proxy rollout restart.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The activation scripts now fetch the published GVLK list from /keys.json to
auto-select the right key for the detected edition. Like the .ps1 scripts,
that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the
PoW). Add /keys.json to the ingress_scripts carve-out path list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.
Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr, ospp and both PowerShell one-liners all reached Licensed via vlmcs (10.0.20.202).
Docs: kms-public-exposure runbook + service-catalog entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify
daemonset were missed by the cdb7d9a8 KEEL_LIFECYCLE sweep. The monitoring ns
is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh
annotations; TF kept trying to revert both, plus a live-stamped tier label —
which made `terragrunt plan -detailed-exitcode` return 2 every run and the
drift-detection cron fail daily. Add the standard KEEL ignore_changes (image +
keel.sh annotations) and ignore the tier label so these stop churning.
Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this
does not trigger a monitoring apply. Remaining (separate) drift: the grafana
ACL null_resource (triggers.always) + tls cert refresh.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at
replicas=0 it had no consumer pod and sat Pending forever, falsely tripping
PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to
drive both replicas and the PVC count, so a parked service has no PVC at all.
Empty/never-bound PVC removed; recreated automatically when un-parked.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its
auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence
firing, which halts kured node reboots. Set external_monitor=false so a
deliberately-down service stops tripping the divergence gate. Re-enable when
the deployment is brought back up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.
Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.
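The self-exit idea, sketched in Python for illustration only (the real gate loop is bash). The 300-second interval is an assumption inferred from 72 iterations covering roughly 6h; `run_gate_loop` and its parameters are hypothetical names.

```python
import time


def run_gate_loop(check, max_iter=72, interval_s=300, sleep=time.sleep):
    """Run `check` at most `max_iter` times, then return so the process exits.

    Ending the process is what lets the kubelet start a fresh container,
    discarding whatever memory the long-lived loop had accumulated.
    """
    for _ in range(max_iter):
        check()
        sleep(interval_s)
    return max_iter


if __name__ == "__main__":
    calls = []
    # Inject a no-op sleep so the demonstration finishes immediately.
    done = run_gate_loop(lambda: calls.append(1), sleep=lambda s: None)
    print(done, len(calls))
```

Bounding the iteration count caps how long any leak can grow, independent of how long the node stays pending-reboot.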
Docs: post-mortem + automated-upgrades.md gate note.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers did blocking pops on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).
Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
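A toy model of why the policy change protects job keys. This is not Redis code; it only encodes the documented rule that volatile-lru picks eviction candidates among keys with an expiry set, while allkeys-lru considers every key. The key names in the example are hypothetical.

```python
def eviction_candidates(keys, policy):
    """keys maps key name -> TTL in seconds, or None when no expiry is set."""
    if policy == "allkeys-lru":
        return set(keys)
    if policy == "volatile-lru":
        return {name for name, ttl in keys.items() if ttl is not None}
    raise ValueError(f"unsupported policy: {policy}")


if __name__ == "__main__":
    keys = {
        "cache:page:42": 300,          # hypothetical TTL'd cache entry
        "bull:thumbnails:wait": None,  # hypothetical TTL-less job queue key
    }
    print(eviction_candidates(keys, "allkeys-lru"))
    print(eviction_candidates(keys, "volatile-lru"))
```

Under volatile-lru the TTL-less queue key is never a candidate, which is the property the shared instance depends on.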
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].
Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire); lower RedisMemoryPressure 85%->80% as a
volatile-lru backstop.
Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Large Immich video downloads and uploads failed at a hard ~60s wall. The
websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike
nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps
on total request/response duration, so every transfer slower than 60s was cut
mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s
with an HTTP/2 stream reset.
- writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance
assumes): unlimited download size/duration.
- readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop
(Immich has no resumable upload, so the window must exceed real upload times).
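The arithmetic behind the failure, as a small Python sketch. `completes` is a hypothetical helper that models a hard cap on total duration; it treats a cap of 0 as unlimited, matching the writeTimeout=0 setting above, and assumes a constant transfer rate.

```python
def completes(size_mb, rate_mb_s, hard_cap_s):
    """True if a transfer finishes before a hard cap on total duration.

    A cap of 0 means "no limit".
    """
    if hard_cap_s == 0:
        return True
    return size_mb / rate_mb_s <= hard_cap_s


if __name__ == "__main__":
    # 650MB at a nominal 6 MB/s needs about 108s.
    print(completes(650, 6, 60))    # old 60s cap
    print(completes(650, 6, 0))     # writeTimeout=0
    print(completes(650, 6, 3600))  # readTimeout=3600s
```

At 6 MB/s, any transfer above roughly 360MB exceeds a 60s cap, which lines up with the observed cut at 386MB / 62s.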
Verified: the same 650MB download now completes fully (650MB / 102s, exit 0).
IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are
inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative
state); this commit syncs source + docs only, hence [ci skip].
Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting),
.claude/CLAUDE.md networking note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated
the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale
.200 literals — breaking every *.viktorbarzin.lan ingress host (internal
exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and
tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7).
- apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong
value -> false ViktorBarzinApexDrift "critical").
- split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan
hairpin-NAT target).
- ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the
LIVE Traefik LB IP (queried from svc/traefik) every run, so a future
Traefik IP move can't silently break the .lan zone again. Added
services get/list to its ClusterRole.
Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob
triggers; verified apex correct=1 and the .lan anchor self-pins to .203.
[ci skip] because a full technitium apply would also pick up unrelated
pre-existing deployment drift (DNS pod restart risk) — left untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 2026-05-24 apply was interrupted with the Helm release stuck in
pending-install, leaving only 2 of ~12 resources in TF state (any apply
errored "already exists"). Adopted the live resources back via import {}
sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both
ingresses, temporal Service, nfs backup PV+PVC) — plan now reaches zero.
Reconciled code to live reality (zero runtime change to running postiz):
- Removed kubernetes_deployment.temporal +
kubernetes_job.temporal_search_attr_cleanup: the temporal Deployment is gone
from the cluster (only the
Service survives). Scheduled posts remain unavailable until temporal is
restored; immediate posting works.
- Removed helm_release.postiz from TF entirely: importing it would force a
helm upgrade (provider can't match merged values to config) and the
release is stuck pending-install. Left Helm-managed outside TF.
- Removed keel.sh/enrolled=true from the namespace (postiz was opted out of
Keel on 2026-05-29; this would have re-enrolled it on apply).
- Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility
DBs don't exist) and no longer depends_on the removed helm_release.
Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client
as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so
real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2
only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients
(ETP=Local, own source IP) are unaffected. Mail-over-IPv6 is now also routed
through the mail NodePorts (send-proxy-v2). The bridge is TCP/h2 only (no QUIC
over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh
(config.xml shellcmd), keeping the nginx-off-[::] patch.
Also fixes stale networking.md: Traefik was still documented on the
shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to
the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF —
remote cloudflared tunnel config, done via CF API): tunnel ingress repointed
from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443
so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb
alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Postiz was generating hourly Slack spam and a wedged rollout, both
Keel-driven:
- Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel
tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries
denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless
deny/retry/Slack-ping loop.
- Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod
couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout
for 3 days.
postiz Terraform state is heavily drifted (~2/30 resources tracked), so
per-workload opt-out can't be applied from the postiz stack. Durable
guard is here (clean kyverno state). Operational steps applied live via
kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from
the namespace, set keel.sh/policy=never (annotation+label) on all 4
workloads, rolled postiz back to the running v2.21.7. Keel restarted
(scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it
no longer tracks postiz.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice
so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software.
This frees the ~3-4 CPU cores the software transcoder was burning inside
the request-serving pod (which was slowing thumbnail/photo browsing), and
makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is
ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app
config is DB-managed here, like oauth/smtp — not Terraform).
Also give immich-frame the same Keel ignore_changes immich-server already
has, so an untargeted apply no longer churns it (pre-existing drift).
Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery
(TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the
GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit
examples ingest, which needs qwen3-8b for structured extraction.
fire-planner's llama_cpp_base_url default pointed at a non-existent
service:port (llama-cpp:8000) — the real service is `llama-swap` on
port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of
this. Point the default at the real service.
Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube),
servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects
keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added
the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container
KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver
deployment had no resource-level lifecycle at all — added one.
Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas`
validation `var.replicas == null || (...)` doesn't null-short-circuit in the current
TF version, failing apply on every single-replica Anubis site (blog, cyberchef,
f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with
"argument must not be null". Switched to a null-safe ternary.
Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented).
The anubis module change triggers a full platform apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the kms-website pattern: deployment image now points to
forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the
new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the
rollout via `kubectl set image` on every push.
Changes:
- Extract `tls_secret_name` and add `image_tag` (default "latest")
to a new variables.tf, matching the kms / fire-planner /
payslip-ingest convention.
- Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno
ClusterPolicy sync-registry-credentials already syncs the Secret
into every namespace).
- Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged
images are immutable, no need to re-pull on every restart.
The image attribute remains in `lifecycle.ignore_changes` (line was
already there from the prior Keel-managed era), so future `tg apply`s
do not fight Woodpecker's `kubectl set image`. Keel is still enrolled
on the namespace but will skip SHA-tagged images under `policy: patch`
(non-semver), so the CI pipeline is the sole rollout mechanism.
Backstory: the 2026-05-26 cluster-health incident was tuya-bridge
crashlooping after Keel rewrote `:latest` to a stale broken `:0.1`
tag on Docker Hub (which predated the `prometheus_exporter.py`
addition). Manual rebuild + push was the immediate fix; this commit
plus tuya_bridge/.woodpecker.yml close the underlying gap so a
source change reliably produces a fresh registry image.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/`
+ ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006`
hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW)
intermittently, false-tripping ExternalAccessDivergence. A kuma DB
restore would have lost the SQL fix. Declare the monitor in
`internal_monitors` so the existing sync CronJob self-heals it.
Extends the schema with optional `url` / `accepted_statuscodes` /
`ignore_tls` fields (null on the existing DB/port entries) and
teaches the sync script the MonitorType.HTTP branch — url +
accepted_statuscodes + ignoreTls (camelCase on the API), matching
drift fields the same way PORT does for hostname/port.
Verified: manually triggered the sync after apply; it found monitor
313 by name and reported "already in desired state".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Completes the enrolled-workload sweep from cdb7d9a8. fire-planner was held
back because a parallel session was mid-apply on it (presence board); that
claim has since cleared.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the
inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites
the image tag and restamps keel.sh/update-time, change-cause and the rollout
revision on each poll; without ignore_changes every `tg apply` reverted those
— downgrading the image and forcing a spurious rollout that Keel then re-did.
Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads
drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle:
- container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE
- keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause,
deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1
Verified via `tg plan` on speedtest (single-container: image downgrade
0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container:
both container images no longer drift). AGENTS.md drift-suppression section
updated with the canonical block + marker legend.
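To show how the per-index entries scale with container count, here is a small Python helper that generates the paths. The attribute path prefixes (`spec[0].template[0].spec[0]` and `metadata[0].annotations[...]`) are an assumption about how the kubernetes provider addresses these fields; the canonical block is the one in AGENTS.md. `keel_ignore_paths` is a hypothetical name.

```python
KEEL_ANNOTATIONS = [
    "keel.sh/match-tag",
    "keel.sh/update-time",
    "kubernetes.io/change-cause",
    "deployment.kubernetes.io/revision",
]

POD_SPEC = "spec[0].template[0].spec[0]"  # assumed provider attribute path


def keel_ignore_paths(containers, init_containers=0):
    """List ignore_changes entries: one image path per container index,
    one per init container index, then the Keel-stamped annotations."""
    paths = [f"{POD_SPEC}.container[{i}].image" for i in range(containers)]
    paths += [f"{POD_SPEC}.init_container[{i}].image" for i in range(init_containers)]
    paths += [f'metadata[0].annotations["{name}"]' for name in KEEL_ANNOTATIONS]
    return paths


if __name__ == "__main__":
    for path in keel_ignore_paths(containers=2, init_containers=1):
        print(path)
```

A multi-container workload such as changedetection needs one image entry per index, which is why a single `container[0].image` entry is not enough.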
fire-planner deferred (parallel session mid-apply per presence board).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two correctness fixes to the wealth dashboard, found while validating
contribution data against actual-viktor (source of truth):
1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension.
A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16,
which cratered net worth and produced a phantom -£97,457 "contribution"
in Feb then +£100,458 in Mar. Carry the last non-zero day forward across
the gap (a £0 pension valuation is always a scrape gap, never real).
2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual
change decomposition" now use consecutive period-end deltas instead of
within-period first-to-last-obs, so contributions landing near a period
boundary are no longer dropped/mis-attributed.
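Both fixes can be sketched in a few lines of Python. This is an illustration with made-up numbers, not the dashboard's actual queries; `locf_fill_zeros` and `period_end_deltas` are hypothetical names.

```python
def locf_fill_zeros(values):
    """Replace each 0 with the most recent non-zero value seen before it."""
    filled, last = [], None
    for v in values:
        if v == 0 and last is not None:
            filled.append(last)
        else:
            filled.append(v)
            if v != 0:
                last = v
    return filled


def period_end_deltas(observations):
    """observations: chronologically ordered (period, value) pairs.

    Returns {period: end_value(period) - end_value(previous period)}.
    """
    ends = {}
    for period, value in observations:
        ends[period] = value  # the last observation in a period wins
    periods = list(ends)  # dicts preserve insertion order
    return {cur: ends[cur] - ends[prev] for prev, cur in zip(periods, periods[1:])}


if __name__ == "__main__":
    print(locf_fill_zeros([100, 0, 0, 120]))
    obs = [("2026-01", 100), ("2026-01", 110), ("2026-02", 150), ("2026-02", 160)]
    print(period_end_deltas(obs))
```

In the second example a within-period first-to-last calculation would report 10 for 2026-02 and lose the 40 that landed at the boundary; the period-end delta reports the full 50.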
Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212
RSU-proceeds investment, reconciles with actual-viktor), no spurious
negatives. Brokerage contributions unchanged (already correct).
Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap).
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the K8s plumbing for the Reddit FIRE-examples ingest path:
- ExternalSecret fire-planner-examples-reddit (Reddit OAuth from
Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}).
- ExternalSecret fire-planner-examples-claude (claude-agent-service
bearer from Vault secret/claude-agent-service.api_bearer_token).
- kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled
via var.run_examples_bulk_ingest (default false). Timestamp-named so
each (true) transition creates a fresh Job; lifecycle ignores the
name so re-plans don't propose phantom renames.
- kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC
--top=week --limit=200 incremental run.
Both runners share the env_from plumbing of the existing recompute
CronJob (fire-planner-secrets, fire-planner-db-creds,
wealthfolio-sync-db-creds) plus examples-specific vars
(REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL,
plus the three secret-backed env vars).
Plan-only this commit — actual apply lands in Task 17 after the
ingest image build.
Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and
creates matching CNAME records in Technitium if missing. Prevents
the desync where Cloudflare has the DNS record (via ingress_factory)
but internal DNS returns NXDOMAIN because Technitium was never updated.
Includes ServiceAccount + ClusterRole for ingress list permissions.
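The reconcile step reduces to a set difference. A minimal Python sketch, assuming the ingress hosts and the existing Technitium record names have already been fetched; `missing_cnames` and the example hostnames are hypothetical, and the real job may differ.

```python
def missing_cnames(ingress_hosts, existing_records, zone="viktorbarzin.me"):
    """Return in-zone ingress hosts that have no internal DNS record yet."""
    suffix = "." + zone
    wanted = {h.lower() for h in ingress_hosts if h.lower().endswith(suffix)}
    present = {r.lower() for r in existing_records}
    return sorted(wanted - present)


if __name__ == "__main__":
    hosts = ["app.viktorbarzin.me", "New.viktorbarzin.me", "exporter.viktorbarzin.lan"]
    records = ["app.viktorbarzin.me"]
    print(missing_cnames(hosts, records))
```

Only creating what is missing keeps the job idempotent, so running it every 15 minutes is safe.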
Three new panels comparing employment income to investment returns over
time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest,
portfolio in wealthfolio_sync — separate DBs, so per-target datasources):
- cumulative net take-home pay vs cumulative market gain (line race)
- net pay vs market gain per year (grouped bars)
- net pay vs market gain per month (grouped bars)
Inserted after the "Growth over time" panel; existing panels shifted down,
full-width tables remain at the bottom.
The user is no longer actively using foolery. Removed:
- TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute,
Authentik forward-auth integration, K8s Service+Endpoints)
- Devvm systemd unit /etc/systemd/system/foolery.service
- Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery
- Stale foolery reference in .claude/CLAUDE.md auth="required" examples
Uptime Kuma [External] foolery monitor will auto-prune on next
external-monitor-sync reconcile.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis.
Adds a second ingress_factory pointing /net-diag.sh at the bare blog
service (port 80), keeping every other path on the existing Anubis
chain. Path-prefix specificity wins in Traefik routing — / stays gated.
dns_type = "none" because the apex viktorbarzin.me CF record already
exists from the main ingress.
Doc update: CLAUDE.md Anubis section notes blog now follows the
wrongmove carve-out pattern.
The catchall-error-pages IngressRoute matches
HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1: it is the wildcard
handler that returns 404 for any unmatched hostname (typos + scanner traffic).
By design its 4xx rate sits at ~100%, so HighService4xxRate was a
permanent false positive for traefik-catchall-error-pages-*@kubernetescrd.
Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory
(services with legitimately high 4xx counts).
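A quick check of what the catchall pattern matches, using Python's `re` module as a stand-in for Traefik's regex engine (the two agree on a pattern this simple, but that equivalence is an assumption). The test hostnames are hypothetical.

```python
import re

CATCHALL = re.compile(r"^(.+\.)?viktorbarzin\.me$")


def hits_catchall(host):
    """True if the host would fall under the wildcard handler's pattern."""
    return CATCHALL.match(host) is not None


if __name__ == "__main__":
    for host in ["viktorbarzin.me", "typo.viktorbarzin.me",
                 "viktorbarzin.me.example.com"]:
        print(host, hits_catchall(host))
```

The apex and every subdomain match, so any hostname without a more specific route lands on this handler and is answered with 404.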
Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on
2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate
the DB schema, which Keel does not run, so Nextcloud landed in
maintenance mode (needsDbUpgrade=true) and stayed there for ~22h —
external probes saw 503, ExternalAccessDivergence kept firing.
Disable Keel for this workload:
- Drop the `keel.sh/enrolled=true` label from the namespace so
Kyverno's `inject-keel-annotations` policy no longer matches.
- Layer `keel.sh/policy=never` label + annotation onto the
Helm-managed Deployment via `kubernetes_labels` /
`kubernetes_annotations` (the chart at 8.8.1 doesn't expose
Deployment-level commonLabels/commonAnnotations). Keel reads the
annotation; the label is defense-in-depth for the Kyverno
exclude rule should the namespace ever get re-enrolled.
Verified: Keel logged `image no longer tracked, removing watcher`
within seconds of the annotation landing, and `tg plan` is clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host
(1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's
default 10s scrape_timeout and flapping ProxmoxMetricsMissing +
ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape
to prometheus.io/scrape_slow so the scrape moves to the existing
kubernetes-service-endpoints-slow job (5m interval, 30s timeout).
As of fire-planner@4da58fe the account_snapshot cache is refreshed
lazily on each /networth, /networth/history, /progress request when
older than NETWORTH_CACHE_TTL_DAYS (default 1). The recompute CronJob
runs Monte Carlo only — no longer assumed to coordinate with the
wealthfolio-sync schedule.
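The lazy-refresh check amounts to a staleness test on each request. A minimal Python sketch, not the fire-planner implementation; `needs_refresh` is a hypothetical name, and treating a missing timestamp as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(last_refreshed, now, ttl_days=1):
    """True when the cache has never been filled or is older than the TTL."""
    if last_refreshed is None:
        return True
    return now - last_refreshed > timedelta(days=ttl_days)


if __name__ == "__main__":
    now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
    print(needs_refresh(now - timedelta(hours=25), now))
    print(needs_refresh(now - timedelta(hours=3), now))
```

Because the check runs on the request path, the cache stays fresh without any ordering dependency between CronJobs.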
[ci skip]
E2E test (manual one-shot of all 3 broker-sync CronJobs) confirmed
idempotent behaviour with zero new activities and net worth unchanged.
The IE-via-IMAP path is now default-skipped inside
broker_sync.providers.imap (commit 0d23487), so unsuspending the cron is
safe — Schwab vests get parsed, IE messages get ie_skipped at the parser
level regardless of which entry point triggers the run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
39 IMAP-source InvestEngine BUYs + their cash-flow DEPOSITs were
re-inserted into Wealthfolio at 2026-05-27T09:22:18 UTC — exactly the
rows the £252k dedup removed yesterday. The broker-sync-imap cron at
02:30 UTC today correctly logged `ie_skipped=53`, so the IMAP cron itself
isn't the immediate culprit, but the rows DO carry broker-sync's IMAP-path
signature (`[rfc2822-v1]` notes + `sync:imap:invest-engine:...` cash-flow
markers).
Suspending kills one possible vector while a researcher subagent
investigates the root cause. Schwab vest ingestion is the only function
lost; can be unsuspended once the IE re-dup source is identified.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>