infra

Author	SHA1	Message	Date
Viktor Barzin	3d28870e25	nextcloud: fix backup retention to sort by name, not mtime The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used `ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's mtime, so the freshest backup didn't sort as newest — the retention step deleted the new backup and kept a stale one. Sort lexically (chronological for these names) and keep the last. Also exclude html/ (the app code, reproducible from the now-pinned image; the real config lives at config/config.php, html/config is empty) so the backup is config+data+custom_apps only → ~4.3G (<5G target). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
root	84ab4c998c	Woodpecker CI deploy [CI SKIP]	2026-06-01 15:15:26 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	0dd4a31eff	docs(immich): cap server-side job concurrency to protect sdc + log recurrence A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2 in the Immich DB system-config; documented in the Immich row and recorded the recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this also commits that previously-untracked post-mortem). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	af4bfbe046	kms: revert files accidentally bundled into the docs commit The previous commit (81a7d804) swept in 23 unrelated working-tree files because a rebase --autostash had left them staged in the index — including 4 files with leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf, url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock + the llama-cpp markers) to their prior committed state; terragrunt regenerates the generated files on the next run. Net effect of the docs commit is now just the runbook doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	170a3bb052	traefik: bump bot-block-proxy large_client_header_buffers to 8x64k The ai-bot-block forward-auth copies the full request (incl. the accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy. With 30+ Authentik Proxy Providers under viktorbarzin.me the combined Cookie header exceeds openresty's default 4x8k buffers, so the auth check returned 400 "Request Header Or Cookie Too Large" (surfaced as error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo OAuth sign-in for affected browsers. Mirror the existing auth-proxy-config fix: 8x64k accepts the pile. Applied live via tg apply + bot-block-proxy rollout restart. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	6f0bdf2993	kms: carve /keys.json out of Anubis for script auto-key-selection The activation scripts now fetch the published GVLK list from /keys.json to auto-select the right key for the detected edition. Like the .ps1 scripts, that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the PoW). Add /keys.json to the ingress_scripts carve-out path list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	7a297deb24	Woodpecker CI deploy [CI SKIP]	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr \| iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	de04ed099e	Woodpecker CI Update TLS Certificates Commit	2026-06-01 10:36:49 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	935fb07df7	hermes-agent: gate PVC on parked flag (clears PVCStuckPending) The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at replicas=0 it had no consumer pod and sat Pending forever, falsely tripping PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to drive both replicas and the PVC count, so a parked service has no PVC at all. Empty/never-bound PVC removed; recreated automatically when un-parked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:19:28 +00:00
Viktor Barzin	7b6a0e70af	hermes-agent: opt out of external monitor while parked hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence firing, which halts kured node reboots. Set external_monitor=false so a deliberately-down service stops tripping the divergence gate. Re-enable when the deployment is brought back up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:12:33 +00:00
Viktor Barzin	51313ee088	kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts — just silent oom_events. Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot. Docs: post-mortem + automated-upgrades.md gate note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:49:04 +00:00
Viktor Barzin	0c64fc2948	travel-agent: switch from Slack webhook to bot token (chat.postMessage)	2026-05-30 22:44:11 +00:00
Viktor Barzin	46f63bb70e	infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)	2026-05-30 18:24:13 +00:00
Viktor Barzin	e1ab23193d	redis: revert 3-node Sentinel HA to single standalone instance [ci skip] The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:49:43 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	89561c7779	technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip] Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale .200 literals — breaking every *.viktorbarzin.lan ingress host (internal exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7). - apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong value -> false ViktorBarzinApexDrift "critical"). - split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan hairpin-NAT target). - ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the LIVE Traefik LB IP (queried from svc/traefik) every run, so a future Traefik IP move can't silently break the .lan zone again. Added services get/list to its ClusterRole. Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob triggers; verified apex correct=1 and the .lan anchor self-pins to .203. [ci skip] because a full technitium apply would also pick up unrelated pre-existing deployment drift (DNS pod restart risk) — left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 16:54:09 +00:00
Viktor Barzin	a222c024fd	docs: correct tripit DNS classification to proxied [ci skip] tripit's ingress is dns_type="proxied" (Cloudflare), not non-proxied. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 15:00:49 +00:00
Viktor Barzin	b78378eda9	docs: catalog tripit service (service-catalog + databases) [ci skip] Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the service catalog Optional tier and Non-Proxied DNS list, and to the CNPG consumer + PostgreSQL rotation lists in the databases doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:59:01 +00:00
Viktor Barzin	c2b820dc55	postiz: adopt drifted resources into TF state; exclude stuck Helm release The 2026-05-24 apply was interrupted with the Helm release stuck in pending-install, leaving only 2 of ~12 resources in TF state (any apply errored "already exists"). Adopted the live resources back via import {} sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both ingresses, temporal Service, nfs backup PV+PVC) — plan now reaches zero. Reconciled code to live reality (zero runtime change to running postiz): - Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_ attr_cleanup: the temporal Deployment is gone from the cluster (only the Service survives). Scheduled posts remain unavailable until temporal is restored; immediate posting works. - Removed helm_release.postiz from TF entirely: importing it would force a helm upgrade (provider can't match merged values to config) and the release is stuck pending-install. Left Helm-managed outside TF. - Removed keel.sh/enrolled=true from the namespace (postiz was opted out of Keel on 2026-05-29; this would have re-enrolled it on apply). - Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility DBs don't exist) and no longer depends_on the removed helm_release. Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:36:07 +00:00
Viktor Barzin	01351e4ce2	tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip] - stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment (alembic init + app), Service, NFS document PVC, ingress (Authentik forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated), and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent (skips seen message_ids), owner me@viktorbarzin.me. - stacks/dbaas: create CNPG role+db `tripit`. - stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry. Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied out-of-band via scripts/tg this session; a CI re-apply would also apply unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC). Refs: code-bb9g, code-muqi Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 10:23:11 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	16c9aafafa	docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2) Records the successful cutover and the key fix that made it safe: decouple cloudflared from the LB IP first (point its tunnel ingress at the in-cluster Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:12:57 +00:00
Viktor Barzin	0c01adac95	traefik: dedicate LB IP 10.0.20.203 + externalTrafficPolicy=Local Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF — remote cloudflared tunnel config, done via CF API): tunnel ingress repointed from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443 so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:09:37 +00:00
Viktor Barzin	d6a61f00ad	state(vault): update encrypted state	2026-05-30 07:59:28 +00:00
Viktor Barzin	aceee34889	state(dbaas): update encrypted state	2026-05-30 07:55:42 +00:00
Viktor Barzin	1473a94f29	docs/plans: Traefik dedicated-IP cutover attempt 1 post-mortem (rolled back) Attempt rolled back to .200 baseline. Root blocker: cloudflared is a token/dashboard-managed tunnel whose ingress targets the Traefik LB IP (10.0.20.200), so moving Traefik to .203 took down all proxied apps. Retry must also repoint the tunnel ingress (Cloudflare API). Also documents the vault-ingress circular dep, SIGPIPE->stuck PG state-lock gotcha, and the ETP=Local hairpin caveat. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 01:27:29 +00:00
Viktor Barzin	09a0c1fad4	docs/plans: Traefik dedicated IP + ETP=Local migration (design + plan) Move Traefik off shared MetalLB IP 10.0.20.200 to a dedicated 10.0.20.203 with externalTrafficPolicy=Local, to (1) restore real client IPs for CrowdSec on the 24 non-proxied apps (currently SNAT'd to a node IP) and (2) enable QUIC. Forced off the shared IP because MetalLB forbids mixed ETP on a shared IP (10.0.20.200 also carries the Terraform state DB). In-place cutover selected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 00:27:04 +00:00
Viktor Barzin	0f26bf030b	kyverno: exclude postiz namespace from Keel auto-update injection Postiz was generating hourly Slack spam and a wedged rollout, both Keel-driven: - Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless deny/retry/Slack-ping loop. - Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout for 3 days. postiz Terraform state is heavily drifted (~2/30 resources tracked), so per-workload opt-out can't be applied from the postiz stack. Durable guard is here (clean kyverno state). Operational steps applied live via kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from the namespace, set keel.sh/policy=never (annotation+label) on all 4 workloads, rolled postiz back to the running v2.21.7. Keel restarted (scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it no longer tracks postiz. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 19:16:58 +00:00
root	ae72ad51bb	Woodpecker CI deploy [CI SKIP]	2026-05-29 18:07:00 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	b10233975b	llama-cpp: restore replicas to 1; fire-planner: fix llama-swap URL llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery (TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit examples ingest, which needs qwen3-8b for structured extraction. fire-planner's llama_cpp_base_url default pointed at a non-existent service:port (llama-cpp:8000) — the real service is `llama-swap` on port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of this. Correcting.	2026-05-29 06:20:03 +00:00
Viktor Barzin	478629c1ee	keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube), servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver deployment had no resource-level lifecycle at all — added one. Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas` validation `var.replicas == null \|\| (...)` doesn't null-short-circuit in the current TF version, failing apply on every single-replica Anubis site (blog, cyberchef, f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with "argument must not be null". Switched to a null-safe ternary. Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented). The anubis module change triggers a full platform apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 06:02:24 +00:00
root	fe1a16a5f5	Woodpecker CI deploy [CI SKIP]	2026-05-29 05:48:10 +00:00
Viktor Barzin	5bc7a76630	tuya-bridge: switch to Forgejo image + CI-driven deploy Mirrors the kms-website pattern: deployment image now points to forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the rollout via `kubectl set image` on every push. Changes: - Extract `tls_secret_name` and add `image_tag` (default "latest") to a new variables.tf, matching the kms / fire-planner / payslip-ingest convention. - Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno ClusterPolicy sync-registry-credentials already syncs the Secret into every namespace). - Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged images are immutable, no need to re-pull on every restart. The image attribute remains in `lifecycle.ignore_changes` (line was already there from the prior Keel-managed era), so future `tg apply`s do not fight Woodpecker's `kubectl set image`. Keel is still enrolled on the namespace but will skip SHA-tagged images under `policy: patch` (non-semver), so the CI pipeline is the sole rollout mechanism. Backstory: the 2026-05-26 cluster-health incident was tuya-bridge crashlooping after Keel rewrote `:latest` to a stale broken `:0.1` tag on Docker Hub (which predated the `prometheus_exporter.py` addition). Manual rebuild + push was the immediate fix; this commit plus tuya_bridge/.woodpecker.yml close the underlying gap so a source change reliably produces a fresh registry image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:45:16 +00:00
Viktor Barzin	7870e62a07	uptime-kuma: declare Proxmox UI monitor in TF Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/` + ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006` hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW) intermittently, false-tripping ExternalAccessDivergence. A kuma DB restore would have lost the SQL fix. Declare the monitor in `internal_monitors` so the existing sync CronJob self-heals it. Extends the schema with optional `url` / `accepted_statuscodes` / `ignore_tls` fields (null on the existing DB/port entries) and teaches the sync script the MonitorType.HTTP branch — url + accepted_statuscodes + ignoreTls (camelCase on the API), matching drift fields the same way PORT does for hostname/port. Verified: manually triggered the sync after apply; it found monitor 313 by name and reported "already in desired state". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:40:18 +00:00
Viktor Barzin	7c73c69f9b	keel: add KEEL_LIFECYCLE_V1 + image-ignore to fire-planner Completes the enrolled-workload sweep from `cdb7d9a8`. fire-planner was held back because a parallel session was mid-apply on it (presence board); that claim has since cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:12:49 +00:00
Viktor Barzin	cdb7d9a81a	keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites the image tag and restamps keel.sh/update-time, change-cause and the rollout revision on each poll; without ignore_changes every `tg apply` reverted those — downgrading the image and forcing a spurious rollout that Keel then re-did. Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle: - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause, deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1 Verified via `tg plan` on speedtest (single-container: image downgrade 0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container: both container images no longer drift). AGENTS.md drift-suppression section updated with the canonical block + marker legend. fire-planner deferred (parallel session mid-apply per presence board). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:09:30 +00:00
Viktor Barzin	4f71ce6bc5	wealth: fix Fidelity Feb-2026 zero-gap + month-boundary contribution smear Two correctness fixes to the wealth dashboard, found while validating contribution data against actual-viktor (source of truth): 1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension. A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16, which cratered net worth and produced a phantom -£97,457 "contribution" in Feb then +£100,458 in Mar. Carry the last non-zero day forward across the gap (a £0 pension valuation is always a scrape gap, never real). 2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual change decomposition" now use consecutive period-end deltas instead of within-period first-to-last-obs, so contributions landing near a period boundary are no longer dropped/mis-attributed. Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212 RSU-proceeds investment, reconciles with actual-viktor), no spurious negatives. Brokerage contributions unchanged (already correct). Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:58:59 +00:00
Viktor Barzin	0044c3a8ea	fire-planner: add examples ingest Job (toggled) + weekly CronJob Adds the K8s plumbing for the Reddit FIRE-examples ingest path: - ExternalSecret fire-planner-examples-reddit (Reddit OAuth from Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}). - ExternalSecret fire-planner-examples-claude (claude-agent-service bearer from Vault secret/claude-agent-service.api_bearer_token). - kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled via var.run_examples_bulk_ingest (default false). Timestamp-named so each (true) transition creates a fresh Job; lifecycle ignores the name so re-plans don't propose phantom renames. - kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC --top=week --limit=200 incremental run. Both runners share the env_from plumbing of the existing recompute CronJob (fire-planner-secrets, fire-planner-db-creds, wealthfolio-sync-db-creds) plus examples-specific vars (REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL, plus the three secret-backed env vars). Plan-only this commit — actual apply lands in Task 17 after the ingest image build.	2026-05-28 22:51:14 +00:00
Viktor Barzin	4dff834c8a	reduce ingress-dns-sync frequency to hourly [ci skip]	2026-05-28 22:30:08 +00:00
Viktor Barzin	5ac8d625b9	add ingress-dns-sync CronJob to auto-create Technitium CNAME records Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and creates matching CNAME records in Technitium if missing. Prevents the desync where Cloudflare has the DNS record (via ingress_factory) but internal DNS returns NXDOMAIN because Technitium was never updated. Includes ServiceAccount + ClusterRole for ingress list permissions.	2026-05-28 22:22:42 +00:00
Viktor Barzin	58cced5dab	monitoring: render market-vs-salary periodic panels as lines, not bars	2026-05-28 22:18:59 +00:00
Viktor Barzin	2a7124d266	docs(plans): wealth net-worth projections design Forward 30y net-worth projection on the existing wealth Grafana dashboard: multi-scenario lines (low/base/high + derived historical CAGR), pure-SQL over wealth-pg reusing the dashboard's Modified-Dietz and complete-days patterns, with/without-contributions at base rate, in a collapsed row that sidesteps Grafana's shared-time-range limit. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:15:03 +00:00
Viktor Barzin	388a7f60c7	monitoring: add net-pay-vs-market-gains panels to wealth dashboard Three new panels comparing employment income to investment returns over time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest, portfolio in wealthfolio_sync — separate DBs, so per-target datasources): - cumulative net take-home pay vs cumulative market gain (line race) - net pay vs market gain per year (grouped bars) - net pay vs market gain per month (grouped bars) Inserted after the "Growth over time" panel; existing panels shifted down, full-width tables remain at the bottom.	2026-05-28 22:13:44 +00:00
Viktor Barzin	1af412b461	trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt)	2026-05-28 21:40:17 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00

1 2 3 4 5 ...

3871 commits