infra

Author	SHA1	Message	Date
Viktor Barzin	3d28870e25	nextcloud: fix backup retention to sort by name, not mtime The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used `ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's mtime, so the freshest backup didn't sort as newest — the retention step deleted the new backup and kept a stale one. Sort lexically (chronological for these names) and keep the last. Also exclude html/ (the app code, reproducible from the now-pinned image; the real config lives at config/config.php, html/config is empty) so the backup is config+data+custom_apps only → ~4.3G (<5G target). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
root	84ab4c998c	Woodpecker CI deploy [CI SKIP]	2026-06-01 15:15:26 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	af4bfbe046	kms: revert files accidentally bundled into the docs commit The previous commit (81a7d804) swept in 23 unrelated working-tree files because a rebase --autostash had left them staged in the index — including 4 files with leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf, url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock + the llama-cpp markers) to their prior committed state; terragrunt regenerates the generated files on the next run. Net effect of the docs commit is now just the runbook doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	170a3bb052	traefik: bump bot-block-proxy large_client_header_buffers to 8x64k The ai-bot-block forward-auth copies the full request (incl. the accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy. With 30+ Authentik Proxy Providers under viktorbarzin.me the combined Cookie header exceeds openresty's default 4x8k buffers, so the auth check returned 400 "Request Header Or Cookie Too Large" (surfaced as error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo OAuth sign-in for affected browsers. Mirror the existing auth-proxy-config fix: 8x64k accepts the pile. Applied live via tg apply + bot-block-proxy rollout restart. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	6f0bdf2993	kms: carve /keys.json out of Anubis for script auto-key-selection The activation scripts now fetch the published GVLK list from /keys.json to auto-select the right key for the detected edition. Like the .ps1 scripts, that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the PoW). Add /keys.json to the ingress_scripts carve-out path list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	7a297deb24	Woodpecker CI deploy [CI SKIP]	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr | iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	935fb07df7	hermes-agent: gate PVC on parked flag (clears PVCStuckPending) The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at replicas=0 it had no consumer pod and sat Pending forever, falsely tripping PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to drive both replicas and the PVC count, so a parked service has no PVC at all. Empty/never-bound PVC removed; recreated automatically when un-parked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:19:28 +00:00
Viktor Barzin	7b6a0e70af	hermes-agent: opt out of external monitor while parked hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence firing, which halts kured node reboots. Set external_monitor=false so a deliberately-down service stops tripping the divergence gate. Re-enable when the deployment is brought back up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:12:33 +00:00
Viktor Barzin	51313ee088	kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts — just silent oom_events. Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot. Docs: post-mortem + automated-upgrades.md gate note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:49:04 +00:00
Viktor Barzin	0c64fc2948	travel-agent: switch from Slack webhook to bot token (chat.postMessage)	2026-05-30 22:44:11 +00:00
Viktor Barzin	46f63bb70e	infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)	2026-05-30 18:24:13 +00:00
Viktor Barzin	e1ab23193d	redis: revert 3-node Sentinel HA to single standalone instance [ci skip] The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:49:43 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	89561c7779	technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip] Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale .200 literals — breaking every *.viktorbarzin.lan ingress host (internal exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7). - apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong value -> false ViktorBarzinApexDrift "critical"). - split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan hairpin-NAT target). - ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the LIVE Traefik LB IP (queried from svc/traefik) every run, so a future Traefik IP move can't silently break the .lan zone again. Added services get/list to its ClusterRole. Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob triggers; verified apex correct=1 and the .lan anchor self-pins to .203. [ci skip] because a full technitium apply would also pick up unrelated pre-existing deployment drift (DNS pod restart risk) — left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 16:54:09 +00:00
Viktor Barzin	c2b820dc55	postiz: adopt drifted resources into TF state; exclude stuck Helm release The 2026-05-24 apply was interrupted with the Helm release stuck in pending-install, leaving only 2 of ~12 resources in TF state (any apply errored "already exists"). Adopted the live resources back via import {} sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both ingresses, temporal Service, nfs backup PV+PVC) — plan now reaches zero. Reconciled code to live reality (zero runtime change to running postiz): - Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_attr_cleanup: the temporal Deployment is gone from the cluster (only the Service survives). Scheduled posts remain unavailable until temporal is restored; immediate posting works. - Removed helm_release.postiz from TF entirely: importing it would force a helm upgrade (provider can't match merged values to config) and the release is stuck pending-install. Left Helm-managed outside TF. - Removed keel.sh/enrolled=true from the namespace (postiz was opted out of Keel on 2026-05-29; this would have re-enrolled it on apply). - Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility DBs don't exist) and no longer depends_on the removed helm_release. Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:36:07 +00:00
Viktor Barzin	01351e4ce2	tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip] - stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment (alembic init + app), Service, NFS document PVC, ingress (Authentik forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated), and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent (skips seen message_ids), owner me@viktorbarzin.me. - stacks/dbaas: create CNPG role+db `tripit`. - stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry. Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied out-of-band via scripts/tg this session; a CI re-apply would also apply unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC). Refs: code-bb9g, code-muqi Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 10:23:11 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	0c01adac95	traefik: dedicate LB IP 10.0.20.203 + externalTrafficPolicy=Local Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF — remote cloudflared tunnel config, done via CF API): tunnel ingress repointed from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443 so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:09:37 +00:00
Viktor Barzin	0f26bf030b	kyverno: exclude postiz namespace from Keel auto-update injection Postiz was generating hourly Slack spam and a wedged rollout, both Keel-driven: - Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless deny/retry/Slack-ping loop. - Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout for 3 days. postiz Terraform state is heavily drifted (~2/30 resources tracked), so per-workload opt-out can't be applied from the postiz stack. Durable guard is here (clean kyverno state). Operational steps applied live via kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from the namespace, set keel.sh/policy=never (annotation+label) on all 4 workloads, rolled postiz back to the running v2.21.7. Keel restarted (scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it no longer tracks postiz. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 19:16:58 +00:00
root	ae72ad51bb	Woodpecker CI deploy [CI SKIP]	2026-05-29 18:07:00 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	b10233975b	llama-cpp: restore replicas to 1; fire-planner: fix llama-swap URL llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery (TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit examples ingest, which needs qwen3-8b for structured extraction. fire-planner's llama_cpp_base_url default pointed at a non-existent service:port (llama-cpp:8000) — the real service is `llama-swap` on port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of this. Correcting.	2026-05-29 06:20:03 +00:00
Viktor Barzin	478629c1ee	keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube), servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver deployment had no resource-level lifecycle at all — added one. Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas` validation `var.replicas == null || (...)` doesn't null-short-circuit in the current TF version, failing apply on every single-replica Anubis site (blog, cyberchef, f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with "argument must not be null". Switched to a null-safe ternary. Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented). The anubis module change triggers a full platform apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 06:02:24 +00:00
root	fe1a16a5f5	Woodpecker CI deploy [CI SKIP]	2026-05-29 05:48:10 +00:00
Viktor Barzin	5bc7a76630	tuya-bridge: switch to Forgejo image + CI-driven deploy Mirrors the kms-website pattern: deployment image now points to forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the rollout via `kubectl set image` on every push. Changes: - Extract `tls_secret_name` and add `image_tag` (default "latest") to a new variables.tf, matching the kms / fire-planner / payslip-ingest convention. - Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno ClusterPolicy sync-registry-credentials already syncs the Secret into every namespace). - Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged images are immutable, no need to re-pull on every restart. The image attribute remains in `lifecycle.ignore_changes` (line was already there from the prior Keel-managed era), so future `tg apply`s do not fight Woodpecker's `kubectl set image`. Keel is still enrolled on the namespace but will skip SHA-tagged images under `policy: patch` (non-semver), so the CI pipeline is the sole rollout mechanism. Backstory: the 2026-05-26 cluster-health incident was tuya-bridge crashlooping after Keel rewrote `:latest` to a stale broken `:0.1` tag on Docker Hub (which predated the `prometheus_exporter.py` addition). Manual rebuild + push was the immediate fix; this commit plus tuya_bridge/.woodpecker.yml close the underlying gap so a source change reliably produces a fresh registry image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:45:16 +00:00
Viktor Barzin	7870e62a07	uptime-kuma: declare Proxmox UI monitor in TF Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/` + ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006` hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW) intermittently, false-tripping ExternalAccessDivergence. A kuma DB restore would have lost the SQL fix. Declare the monitor in `internal_monitors` so the existing sync CronJob self-heals it. Extends the schema with optional `url` / `accepted_statuscodes` / `ignore_tls` fields (null on the existing DB/port entries) and teaches the sync script the MonitorType.HTTP branch — url + accepted_statuscodes + ignoreTls (camelCase on the API), matching drift fields the same way PORT does for hostname/port. Verified: manually triggered the sync after apply; it found monitor 313 by name and reported "already in desired state". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:40:18 +00:00
Viktor Barzin	7c73c69f9b	keel: add KEEL_LIFECYCLE_V1 + image-ignore to fire-planner Completes the enrolled-workload sweep from `cdb7d9a8`. fire-planner was held back because a parallel session was mid-apply on it (presence board); that claim has since cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:12:49 +00:00
Viktor Barzin	cdb7d9a81a	keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites the image tag and restamps keel.sh/update-time, change-cause and the rollout revision on each poll; without ignore_changes every `tg apply` reverted those — downgrading the image and forcing a spurious rollout that Keel then re-did. Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle: - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause, deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1 Verified via `tg plan` on speedtest (single-container: image downgrade 0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container: both container images no longer drift). AGENTS.md drift-suppression section updated with the canonical block + marker legend. fire-planner deferred (parallel session mid-apply per presence board). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:09:30 +00:00
Viktor Barzin	4f71ce6bc5	wealth: fix Fidelity Feb-2026 zero-gap + month-boundary contribution smear Two correctness fixes to the wealth dashboard, found while validating contribution data against actual-viktor (source of truth): 1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension. A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16, which cratered net worth and produced a phantom -£97,457 "contribution" in Feb then +£100,458 in Mar. Carry the last non-zero day forward across the gap (a £0 pension valuation is always a scrape gap, never real). 2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual change decomposition" now use consecutive period-end deltas instead of within-period first-to-last-obs, so contributions landing near a period boundary are no longer dropped/mis-attributed. Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212 RSU-proceeds investment, reconciles with actual-viktor), no spurious negatives. Brokerage contributions unchanged (already correct). Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:58:59 +00:00
Viktor Barzin	0044c3a8ea	fire-planner: add examples ingest Job (toggled) + weekly CronJob Adds the K8s plumbing for the Reddit FIRE-examples ingest path: - ExternalSecret fire-planner-examples-reddit (Reddit OAuth from Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}). - ExternalSecret fire-planner-examples-claude (claude-agent-service bearer from Vault secret/claude-agent-service.api_bearer_token). - kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled via var.run_examples_bulk_ingest (default false). Timestamp-named so each (true) transition creates a fresh Job; lifecycle ignores the name so re-plans don't propose phantom renames. - kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC --top=week --limit=200 incremental run. Both runners share the env_from plumbing of the existing recompute CronJob (fire-planner-secrets, fire-planner-db-creds, wealthfolio-sync-db-creds) plus examples-specific vars (REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL, plus the three secret-backed env vars). Plan-only this commit — actual apply lands in Task 17 after the ingest image build.	2026-05-28 22:51:14 +00:00
Viktor Barzin	4dff834c8a	reduce ingress-dns-sync frequency to hourly [ci skip]	2026-05-28 22:30:08 +00:00
Viktor Barzin	5ac8d625b9	add ingress-dns-sync CronJob to auto-create Technitium CNAME records Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and creates matching CNAME records in Technitium if missing. Prevents the desync where Cloudflare has the DNS record (via ingress_factory) but internal DNS returns NXDOMAIN because Technitium was never updated. Includes ServiceAccount + ClusterRole for ingress list permissions.	2026-05-28 22:22:42 +00:00
Viktor Barzin	58cced5dab	monitoring: render market-vs-salary periodic panels as lines, not bars	2026-05-28 22:18:59 +00:00
Viktor Barzin	388a7f60c7	monitoring: add net-pay-vs-market-gains panels to wealth dashboard Three new panels comparing employment income to investment returns over time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest, portfolio in wealthfolio_sync — separate DBs, so per-target datasources): - cumulative net take-home pay vs cumulative market gain (line race) - net pay vs market gain per year (grouped bars) - net pay vs market gain per month (grouped bars) Inserted after the "Growth over time" panel; existing panels shifted down, full-width tables remain at the bottom.	2026-05-28 22:13:44 +00:00
Viktor Barzin	1af412b461	trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt)	2026-05-28 21:40:17 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00
Viktor Barzin	8b4bcc0ca2	blog: Anubis carve-out for /net-diag.sh curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis. Adds a second ingress_factory pointing /net-diag.sh at the bare blog service (port 80), keeping every other path on the existing Anubis chain. Path-prefix specificity wins in Traefik routing — / stays gated. dns_type = "none" because the apex viktorbarzin.me CF record already exists from the main ingress. Doc update: CLAUDE.md Anubis section notes blog now follows the wrongmove carve-out pattern.	2026-05-28 13:22:57 +00:00
Viktor Barzin	fc5a4b66ad	monitoring: exclude catchall-error-pages from HighService4xxRate The catchall-error-pages IngressRoute matches HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1 — it's the wildcard handler that returns 404 for any unmatched hostname (typos + scanner traffic). By design its 4xx rate sits at ~100%, so HighService4xxRate was a permanent false positive for traefik-catchall-error-pages-*@kubernetescrd. Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory (services with legitimately high 4xx counts).	2026-05-27 19:46:40 +00:00
github-actions[bot]	b8cd1219a6	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-05-27 18:46:19 +00:00
root	d0ede3773b	Woodpecker CI deploy [CI SKIP]	2026-05-27 18:38:09 +00:00
Viktor Barzin	ee159b02ba	nextcloud: disable Keel auto-upgrades Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on 2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate the DB schema, which Keel does not run, so Nextcloud landed in maintenance mode (needsDbUpgrade=true) and stayed there for ~22h — external probes saw 503, ExternalAccessDivergence kept firing. Disable Keel for this workload: - Drop the `keel.sh/enrolled=true` label from the namespace so Kyverno's `inject-keel-annotations` policy no longer matches. - Layer `keel.sh/policy=never` label + annotation onto the Helm-managed Deployment via `kubernetes_labels` / `kubernetes_annotations` (the chart at 8.8.1 doesn't expose Deployment-level commonLabels/commonAnnotations). Keel reads the annotation; the label is defense-in-depth for the Kyverno exclude rule should the namespace ever get re-enrolled. Verified: Keel logged `image no longer tracked, removing watcher` within seconds of the annotation landing, and `tg plan` is clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:37:05 +00:00
Viktor Barzin	d72c7169c0	monitoring: route proxmox-exporter to scrape_slow job (fix flapping alerts) PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host (1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's default 10s scrape_timeout and flapping ProxmoxMetricsMissing + ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape to prometheus.io/scrape_slow so the scrape moves to the existing kubernetes-service-endpoints-slow job (5m interval, 30s timeout).	2026-05-27 18:36:11 +00:00
Viktor Barzin	f121bee121	fire-planner: update recompute CronJob comment to reflect lazy refresh As of fire-planner@4da58fe the account_snapshot cache is refreshed lazily on each /networth, /networth/history, /progress request when older than NETWORTH_CACHE_TTL_DAYS (default 1). The recompute CronJob runs Monte Carlo only — no longer assumed to coordinate with the wealthfolio-sync schedule. [ci skip]	2026-05-27 18:23:21 +00:00
Viktor Barzin	4b77aa65a1	broker-sync: unsuspend broker-sync-imap (IE structurally skipped at code level now) E2E test (manual one-shot of all 3 broker-sync CronJobs) confirmed idempotent behaviour with zero new activities and net worth unchanged. The IE-via-IMAP path is now default-skipped inside broker_sync.providers.imap (commit 0d23487), so unsuspending the cron is safe — Schwab vests get parsed, IE messages get ie_skipped at the parser level regardless of which entry point triggers the run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:57:26 +00:00
Viktor Barzin	06fb1f9ea9	broker-sync: update imap-cron comment to reflect default-skip IE (post-incident)	2026-05-27 17:25:42 +00:00
Viktor Barzin	501f2c6b37	broker-sync: re-suspend broker-sync-imap CronJob 39 IMAP-source InvestEngine BUYs + their cash-flow DEPOSITs were re-inserted into Wealthfolio at 2026-05-27T09:22:18 UTC — exactly the rows the £252k dedup removed yesterday. The broker-sync-imap cron at 02:30 UTC today correctly logged `ie_skipped=53`, so the IMAP cron itself isn't the immediate culprit, but the rows DO carry broker-sync's IMAP-path signature (`[rfc2822-v1]` notes + `sync:imap:invest-engine:...` cash-flow markers). Suspending kills one possible vector while a researcher subagent investigates the root cause. Schwab vest ingestion is the only function lost; can be unsuspended once the IE re-dup source is identified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:09:09 +00:00
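
The retention fix in 3d28870e25 rests on one property: directory names of the form YYYYMMDD_HHMMSS sort lexically in chronological order, so the name is a safer ordering key than the mtime that `rsync -a` copies from the source. A minimal sketch of that selection logic follows. The function name, the `keep` parameter, and the sample directory names are hypothetical; the actual cleanup script is not shown in this log.

```python
import re

# Matches the YYYYMMDD_HHMMSS directory names described in the commit message.
BACKUP_NAME = re.compile(r"^\d{8}_\d{6}$")


def backups_to_delete(names, keep=1):
    """Return the dated backup names to prune, keeping the `keep` newest by name.

    Names that do not look like dated backups are ignored, so unrelated
    directories are never selected for deletion.
    """
    dated = sorted(n for n in names if BACKUP_NAME.match(n))
    if keep <= 0:
        return dated
    # Lexical order equals chronological order for this naming scheme,
    # so everything except the last `keep` entries is stale.
    return dated[:-keep]


if __name__ == "__main__":
    sample = ["20260531_030000", "20260601_030000", "html", "20260530_030000"]
    print(backups_to_delete(sample))
```

Sorting on the name keeps the result independent of filesystem timestamps, which is the failure mode the commit describes.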


1142 commits
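
The gap-fill described in 4f71ce6bc5 (Fix 1) is a last-observation-carried-forward (LOCF) pass: a zero valuation is treated as a scrape gap and replaced by the most recent non-zero value. A minimal sketch of that idea on a plain list of daily values is below. The function name and the list-based representation are assumptions for illustration; how `dav_corrected` is actually implemented is not shown in this log.

```python
def locf_zero_gaps(values):
    """Carry the last non-zero value forward across zero entries (LOCF).

    Leading zeros have nothing to carry forward and are left unchanged.
    """
    filled = []
    last = None  # most recent non-zero observation seen so far
    for v in values:
        if v != 0:
            last = v
        filled.append(last if last is not None else v)
    return filled


if __name__ == "__main__":
    # Hypothetical daily valuations with a two-day scrape gap.
    print(locf_zero_gaps([100, 0, 0, 120]))
```

The commit scopes this to a single account on the grounds that a zero valuation there is never real; applied to a series where zero is a legitimate value, the same pass would hide genuine data.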