infra

Author	SHA1	Message	Date
root	ae72ad51bb	Woodpecker CI deploy [CI SKIP]	2026-05-29 18:07:00 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	b10233975b	llama-cpp: restore replicas to 1; fire-planner: fix llama-swap URL llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery (TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit examples ingest, which needs qwen3-8b for structured extraction. fire-planner's llama_cpp_base_url default pointed at a non-existent service:port (llama-cpp:8000) — the real service is `llama-swap` on port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of this. Correcting.	2026-05-29 06:20:03 +00:00
Viktor Barzin	478629c1ee	keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube), servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver deployment had no resource-level lifecycle at all — added one. Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas` validation `var.replicas == null || (...)` doesn't null-short-circuit in the current TF version, failing apply on every single-replica Anubis site (blog, cyberchef, f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with "argument must not be null". Switched to a null-safe ternary. Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented). The anubis module change triggers a full platform apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 06:02:24 +00:00
root	fe1a16a5f5	Woodpecker CI deploy [CI SKIP]	2026-05-29 05:48:10 +00:00
Viktor Barzin	5bc7a76630	tuya-bridge: switch to Forgejo image + CI-driven deploy Mirrors the kms-website pattern: deployment image now points to forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the rollout via `kubectl set image` on every push. Changes: - Extract `tls_secret_name` and add `image_tag` (default "latest") to a new variables.tf, matching the kms / fire-planner / payslip-ingest convention. - Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno ClusterPolicy sync-registry-credentials already syncs the Secret into every namespace). - Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged images are immutable, no need to re-pull on every restart. The image attribute remains in `lifecycle.ignore_changes` (line was already there from the prior Keel-managed era), so future `tg apply`s do not fight Woodpecker's `kubectl set image`. Keel is still enrolled on the namespace but will skip SHA-tagged images under `policy: patch` (non-semver), so the CI pipeline is the sole rollout mechanism. Backstory: the 2026-05-26 cluster-health incident was tuya-bridge crashlooping after Keel rewrote `:latest` to a stale broken `:0.1` tag on Docker Hub (which predated the `prometheus_exporter.py` addition). Manual rebuild + push was the immediate fix; this commit plus tuya_bridge/.woodpecker.yml close the underlying gap so a source change reliably produces a fresh registry image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:45:16 +00:00
Viktor Barzin	7870e62a07	uptime-kuma: declare Proxmox UI monitor in TF Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/` + ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006` hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW) intermittently, false-tripping ExternalAccessDivergence. A kuma DB restore would have lost the SQL fix. Declare the monitor in `internal_monitors` so the existing sync CronJob self-heals it. Extends the schema with optional `url` / `accepted_statuscodes` / `ignore_tls` fields (null on the existing DB/port entries) and teaches the sync script the MonitorType.HTTP branch — url + accepted_statuscodes + ignoreTls (camelCase on the API), matching drift fields the same way PORT does for hostname/port. Verified: manually triggered the sync after apply; it found monitor 313 by name and reported "already in desired state". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:40:18 +00:00
Viktor Barzin	7c73c69f9b	keel: add KEEL_LIFECYCLE_V1 + image-ignore to fire-planner Completes the enrolled-workload sweep from `cdb7d9a8`. fire-planner was held back because a parallel session was mid-apply on it (presence board); that claim has since cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:12:49 +00:00
Viktor Barzin	cdb7d9a81a	keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites the image tag and restamps keel.sh/update-time, change-cause and the rollout revision on each poll; without ignore_changes every `tg apply` reverted those — downgrading the image and forcing a spurious rollout that Keel then re-did. Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle: - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause, deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1 Verified via `tg plan` on speedtest (single-container: image downgrade 0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container: both container images no longer drift). AGENTS.md drift-suppression section updated with the canonical block + marker legend. fire-planner deferred (parallel session mid-apply per presence board). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:09:30 +00:00
Viktor Barzin	4f71ce6bc5	wealth: fix Fidelity Feb-2026 zero-gap + month-boundary contribution smear Two correctness fixes to the wealth dashboard, found while validating contribution data against actual-viktor (source of truth): 1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension. A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16, which cratered net worth and produced a phantom -£97,457 "contribution" in Feb then +£100,458 in Mar. Carry the last non-zero day forward across the gap (a £0 pension valuation is always a scrape gap, never real). 2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual change decomposition" now use consecutive period-end deltas instead of within-period first-to-last-obs, so contributions landing near a period boundary are no longer dropped/mis-attributed. Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212 RSU-proceeds investment, reconciles with actual-viktor), no spurious negatives. Brokerage contributions unchanged (already correct). Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:58:59 +00:00
Viktor Barzin	0044c3a8ea	fire-planner: add examples ingest Job (toggled) + weekly CronJob Adds the K8s plumbing for the Reddit FIRE-examples ingest path: - ExternalSecret fire-planner-examples-reddit (Reddit OAuth from Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}). - ExternalSecret fire-planner-examples-claude (claude-agent-service bearer from Vault secret/claude-agent-service.api_bearer_token). - kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled via var.run_examples_bulk_ingest (default false). Timestamp-named so each (true) transition creates a fresh Job; lifecycle ignores the name so re-plans don't propose phantom renames. - kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC --top=week --limit=200 incremental run. Both runners share the env_from plumbing of the existing recompute CronJob (fire-planner-secrets, fire-planner-db-creds, wealthfolio-sync-db-creds) plus examples-specific vars (REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL, plus the three secret-backed env vars). Plan-only this commit — actual apply lands in Task 17 after the ingest image build.	2026-05-28 22:51:14 +00:00
Viktor Barzin	4dff834c8a	reduce ingress-dns-sync frequency to hourly [ci skip]	2026-05-28 22:30:08 +00:00
Viktor Barzin	5ac8d625b9	add ingress-dns-sync CronJob to auto-create Technitium CNAME records Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and creates matching CNAME records in Technitium if missing. Prevents the desync where Cloudflare has the DNS record (via ingress_factory) but internal DNS returns NXDOMAIN because Technitium was never updated. Includes ServiceAccount + ClusterRole for ingress list permissions.	2026-05-28 22:22:42 +00:00
Viktor Barzin	58cced5dab	monitoring: render market-vs-salary periodic panels as lines, not bars	2026-05-28 22:18:59 +00:00
Viktor Barzin	2a7124d266	docs(plans): wealth net-worth projections design Forward 30y net-worth projection on the existing wealth Grafana dashboard: multi-scenario lines (low/base/high + derived historical CAGR), pure-SQL over wealth-pg reusing the dashboard's Modified-Dietz and complete-days patterns, with/without-contributions at base rate, in a collapsed row that sidesteps Grafana's shared-time-range limit. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:15:03 +00:00
Viktor Barzin	388a7f60c7	monitoring: add net-pay-vs-market-gains panels to wealth dashboard Three new panels comparing employment income to investment returns over time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest, portfolio in wealthfolio_sync — separate DBs, so per-target datasources): - cumulative net take-home pay vs cumulative market gain (line race) - net pay vs market gain per year (grouped bars) - net pay vs market gain per month (grouped bars) Inserted after the "Growth over time" panel; existing panels shifted down, full-width tables remain at the bottom.	2026-05-28 22:13:44 +00:00
Viktor Barzin	1af412b461	trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt)	2026-05-28 21:40:17 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00
Viktor Barzin	8b4bcc0ca2	blog: Anubis carve-out for /net-diag.sh curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis. Adds a second ingress_factory pointing /net-diag.sh at the bare blog service (port 80), keeping every other path on the existing Anubis chain. Path-prefix specificity wins in Traefik routing — / stays gated. dns_type = "none" because the apex viktorbarzin.me CF record already exists from the main ingress. Doc update: CLAUDE.md Anubis section notes blog now follows the wrongmove carve-out pattern.	2026-05-28 13:22:57 +00:00
Viktor Barzin	fc5a4b66ad	monitoring: exclude catchall-error-pages from HighService4xxRate The catchall-error-pages IngressRoute matches HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1 — it's the wildcard handler that returns 404 for any unmatched hostname (typos + scanner traffic). By design its 4xx rate sits at ~100%, so HighService4xxRate was a permanent false positive for traefik-catchall-error-pages-*@kubernetescrd. Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory (services with legitimately high 4xx counts).	2026-05-27 19:46:40 +00:00
Viktor Barzin	f677794379	cluster_healthcheck.sh: run checks in parallel (~3x speedup) Each check function only reads cluster state and mutates in-memory counters; that makes it safe to isolate each one in a subshell, write stdout to a per-check temp file, and replay outputs in original order after all jobs finish. Counters/JSON_RESULTS replicated through marker lines (###HCK###PASS:N etc.) so the aggregate state matches the serial run exactly. Pre-fetch the HA Sofia cache once in the parent so the four HA checks share a single API round-trip instead of each subshell re-fetching. Auto-fix mode forces --serial so mutation order stays deterministic. New flags: --parallel N (default 12, env HEALTHCHECK_PARALLEL_JOBS), --serial. Diminishing returns past ~12 workers. Benchmark (--quiet, 44 checks): 53s serial -> 18s parallel-12.	2026-05-27 19:46:40 +00:00
github-actions[bot]	b8cd1219a6	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-05-27 18:46:19 +00:00
root	d0ede3773b	Woodpecker CI deploy [CI SKIP]	2026-05-27 18:38:09 +00:00
Viktor Barzin	ee159b02ba	nextcloud: disable Keel auto-upgrades Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on 2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate the DB schema, which Keel does not run, so Nextcloud landed in maintenance mode (needsDbUpgrade=true) and stayed there for ~22h — external probes saw 503, ExternalAccessDivergence kept firing. Disable Keel for this workload: - Drop the `keel.sh/enrolled=true` label from the namespace so Kyverno's `inject-keel-annotations` policy no longer matches. - Layer `keel.sh/policy=never` label + annotation onto the Helm-managed Deployment via `kubernetes_labels` / `kubernetes_annotations` (the chart at 8.8.1 doesn't expose Deployment-level commonLabels/commonAnnotations). Keel reads the annotation; the label is defense-in-depth for the Kyverno exclude rule should the namespace ever get re-enrolled. Verified: Keel logged `image no longer tracked, removing watcher` within seconds of the annotation landing, and `tg plan` is clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:37:05 +00:00
Viktor Barzin	d72c7169c0	monitoring: route proxmox-exporter to scrape_slow job (fix flapping alerts) PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host (1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's default 10s scrape_timeout and flapping ProxmoxMetricsMissing + ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape to prometheus.io/scrape_slow so the scrape moves to the existing kubernetes-service-endpoints-slow job (5m interval, 30s timeout).	2026-05-27 18:36:11 +00:00
Viktor Barzin	f121bee121	fire-planner: update recompute CronJob comment to reflect lazy refresh As of fire-planner@4da58fe the account_snapshot cache is refreshed lazily on each /networth, /networth/history, /progress request when older than NETWORTH_CACHE_TTL_DAYS (default 1). The recompute CronJob runs Monte Carlo only — no longer assumed to coordinate with the wealthfolio-sync schedule. [ci skip]	2026-05-27 18:23:21 +00:00
Viktor Barzin	4b77aa65a1	broker-sync: unsuspend broker-sync-imap (IE structurally skipped at code level now) E2E test (manual one-shot of all 3 broker-sync CronJobs) confirmed idempotent behaviour with zero new activities and net worth unchanged. The IE-via-IMAP path is now default-skipped inside broker_sync.providers.imap (commit 0d23487), so unsuspending the cron is safe — Schwab vests get parsed, IE messages get ie_skipped at the parser level regardless of which entry point triggers the run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:57:26 +00:00
Viktor Barzin	06fb1f9ea9	broker-sync: update imap-cron comment to reflect default-skip IE (post-incident)	2026-05-27 17:25:42 +00:00
Viktor Barzin	501f2c6b37	broker-sync: re-suspend broker-sync-imap CronJob 39 IMAP-source InvestEngine BUYs + their cash-flow DEPOSITs were re-inserted into Wealthfolio at 2026-05-27T09:22:18 UTC — exactly the rows the £252k dedup removed yesterday. The broker-sync-imap cron at 02:30 UTC today correctly logged `ie_skipped=53`, so the IMAP cron itself isn't the immediate culprit, but the rows DO carry broker-sync's IMAP-path signature (`[rfc2822-v1]` notes + `sync:imap:invest-engine:...` cash-flow markers). Suspending kills one possible vector while a researcher subagent investigates the root cause. Schwab vest ingestion is the only function lost; can be unsuspended once the IE re-dup source is identified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:09:09 +00:00
Viktor Barzin	54919e3abc	trading-bot: TRADING_SLACK_BOT_TOKEN + TRADING_SLACK_CHANNEL env	2026-05-27 10:06:51 +00:00
Viktor Barzin	17c59a280b	broker-sync: drop IBKR_ACCOUNT_ID env (now derived via ensure_account)	2026-05-27 09:25:02 +00:00
Viktor Barzin	6d13ba12da	broker-sync: add fsGroup=10001 to trading212 cron pod spec Without supplementary GID 10001, the broker user (uid=10001 gid=999) cannot write sqlite3 journal files next to /data/sync.db. The cron hits a "readonly database" error in dedup.record() AFTER successfully importing fills to Wealthfolio — so data lands but the dedup store never updates, leaving every subsequent run to re-fetch the same window and exit 1 again. Same fix that's already on imap + ibkr crons. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:20:16 +00:00
root	9e8314183f	Woodpecker CI deploy [CI SKIP]	2026-05-26 22:53:29 +00:00
Viktor Barzin	9b68dbc788	wealthfolio: dav_corrected — also exclude Schwab synthetic cash flows The Net-contribution chart was showing huge negative monthly swings because broker-sync emits a synthetic cash-flow-match DEPOSIT for every vest BUY and a WITHDRAWAL for every sell-to-cover SELL. Cumulatively WITHDRAWALs ($1.06M) exceed DEPOSITs ($498k) — the user perceives this as having "withdrawn" money even though they never moved cash out of Schwab. The proceeds left for the bank and surface as real DEPOSITs on the next account (IE/T212) that the user transfers them to. Extend the dav_corrected view to subtract Schwab cash-flow-match flows (DEPOSIT-positive, WITHDRAWAL-negative, account-scoped) in addition to the existing Fidelity unrealised-gains-offset correction. InvestEngine and Trading212 cash-flow-match entries are REAL deposits and must be preserved — scope by Schwab account_id only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 22:52:17 +00:00
Viktor Barzin	30ba6860b9	broker-sync: add IBKR Flex daily CronJob (02:00 UK)	2026-05-26 22:34:54 +00:00
Viktor Barzin	2df9700d70	trading-bot: add slack_webhook_url ESO secret + env var	2026-05-26 21:55:59 +00:00
Viktor Barzin	15c88bc683	keel: belt-and-suspenders opt-out for mysql/redis/nvidia-exporter After re-enabling Keel with `policy: patch` (commit `f325b949`), 3 of the 60 first-hour bumps broke things and need explicit cluster-wide opt-out so future Kyverno reconciles can't put them back under auto-update: - `dbaas/mysql-standalone`: patch-bumped `mysql:8.4.8 → :8.4.9` and the DD upgrade stalled (we explicitly track that as beads `code-963q` — the 8.4.9 jump needs a wipe+reinit, not a rolling upgrade). The StatefulSet already had `annotation=never` from TF but was missing the LABEL — Kyverno's selector exclude reads the LABEL, so a reconcile that dropped the annotation could resume auto-update. Added the LABEL. - `redis/redis-v2`: patch-bumped `redis:8-alpine → :8.0.6-alpine` and the new image rejected the `aof-load-corrupt-tail-max-size` directive from commit `1eee56d0` → redis-v2-2 CrashLoopBackOff. Plus :8.0.6 is semantically older than :8-alpine (which resolves to :8.6.2) — same Keel tag-picking pathology as the 2026-05-26 morning incident, just in a different shape. LABEL + ANNOTATION both added. - `nvidia/nvidia-exporter`: Keel rewrote `:latest → :4.5.2-4.8.1-ubuntu22.04` and the new dcgm-exporter OOMKilled at the 192Mi memory limit (4 restarts before I caught it). Added LABEL + ANNOTATION for opt-out, AND bumped memory request/limit 192Mi → 256Mi/512Mi so the bumped image doesn't OOM (older versions fit in 192Mi; the bumped one needs ~250Mi steady-state). The 56 other Keel bumps in that 10-minute window (coredns 1.12.1→1.12.4, kyverno 1.16.1→1.16.4, nextcloud 32.0.3→32.0.9, grafana 12.3.1→12.3.6, cnpg, mailserver, csi-nfs, metrics-server, etc.) landed cleanly — the `patch` policy is the right default. Per-workload `never` opt-out is the maintenance cost. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 21:53:10 +00:00
Viktor Barzin	1abe6465e0	state(dbaas): update encrypted state	2026-05-26 21:40:56 +00:00
Viktor Barzin	498b01396c	status-page: disable pusher CronJob to stop sdc write storm The CronJob ran every 5 min on a vanilla python:3.12-alpine image, doing `apk add git` + `pip install uptime-kuma-api` from scratch on every invocation. Caught at ~3.2 MB/s on k8s-node4's root LV, contributing to ~8 MB/s sustained on the pve-data thin pool (sdc) — ~804 GB written over the prior 18 h. Commented out the kubernetes_cron_job_v1.status_page_pusher resource (kept ns / SA / RBAC / ConfigMap intact for trivial revert). Re-enable once a custom image with git + uptime-kuma-api baked in is published so no per-run cold install happens. status.viktorbarzin.me stops updating until then.	2026-05-26 21:40:14 +00:00
Viktor Barzin	84404fd0d6	broker-sync: skip InvestEngine in IMAP CronJob Sets BROKER_SYNC_IMAP_EXCLUDE_PROVIDERS=invest-engine on broker-sync-imap, so the IMAP path no longer parses InvestEngine emails (handled by the bearer-token API path now). Stops duplicate BUYs in Wealthfolio. The terraform fmt run also realigned two adjacent label assignments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 21:19:31 +00:00
root	2becd0ff6f	Woodpecker CI deploy [CI SKIP]	2026-05-26 21:09:48 +00:00
Viktor Barzin	8605181c53	trading-bot: Phase 2 — add trade-executor + flip kevin kill-switch	2026-05-26 21:07:37 +00:00
Viktor Barzin	047a1189c9	backup-dr docs: refresh diagrams for daily/immich-only architecture - Add new "Data Routing" flowchart up front showing which paths go where (sda mirror vs Synology-direct vs not-backed-up). - Overall Backup Flow: split Layer 2 into 2a (nfs-mirror daily 02:00) and 2b (daily-backup 05:00); show nfs-mirror as an explicit component; clarify Step 2 is immich-only direct + nfs-ssd. - Weekly Backup Timeline → Daily Backup Timeline: actual schedule (00:00 LVM, 00:15 PG, 00:45 MySQL, 02:00 nfs-mirror, 05:00 daily-backup, 06:00 offsite-sync, 12:00 second LVM); explicit inotify feeding Step 2. - Physical Disk Layout: current capacity numbers + dual sdc→sda and sdc→Synology arrows (immich-only) reflecting the two-leg design. - Restore Decision Tree: refreshed age tiers (< 12h LVM, 12h-4w sda, > 4w Synology) + dedicated branch for immich photos (which only have an offsite copy). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 20:00:31 +00:00
Viktor Barzin	3f0c429d46	offsite-sync: add `|| true` to Step 2 HDD grep|while pipeline Mirrors the SSD section's pattern. If the LAST iteration of the `while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs/}"; done` body sees a file that was deleted between inotify capture and now (e.g. an immich encoded-video temp file that got cleaned up), the while loop returns 1, pipefail propagates, set -e kills the script silently before reaching the rsync. No log line, just disappears. Pre-existing bug; only exposed today after pruning the bypass regex to immich-only — when the regex was broader, the last match in the sorted dedup'd inotify log happened to be a live file often enough that the bug stayed dormant. Validated by full e2e run: 1120 nfs/immich files + 2285 nfs-ssd files shipped successfully. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:55:33 +00:00
Viktor Barzin	3526089457	docs: Talos migration design v7 — staged plan after 6 rounds of critique [ci skip] Handoff artifact for next session. v7 is the converged staged plan (Stage A hardened Ubuntu → B DR primitives → C 6-week soak → D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h) vs v7 (staged, 30-37h to decision point) vs hybrid. Full context in ~/.claude/plans/distributed-humming-sonnet.md.	2026-05-26 19:45:48 +00:00
Viktor Barzin	f325b949be	keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged Re-enables Keel after the 2026-05-26 emergency stop, with a safer default. Switch Kyverno-injected default from `force + match-tag=true` (proven unreliable — it rewrote tag strings cluster-wide despite the design intent) to `patch`, which is semver-parser-bounded: - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never 1.3.x or 2.x — the parser does the math, not string compare). - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are IGNORED entirely. No tag rewriting under any code path. - 151 stale `force` annotations migrated to `patch` cluster-wide during this apply (anchor `+()` dropped, then re-added). Live state after this commit: 0 workloads on `force`, 209 on `patch`, 22 on `never`. Keel deployment back to 1/1 on `:0.21.1`. Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation mutated to `patch` during the migration despite Kyverno's matchLabels-based exclude rule — appears to be a quirk of `mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno reconciles preserve them. Also fixes CI build-cli workflow which was blocked by `deny-privileged-containers` since wave 1 enforce flip on 2026-05-18: woodpecker namespace added to the shared security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use). The `default` workflow (terragrunt apply) was already passing — only the parallel `build-cli` workflow (which builds the infra-cli docker image) was failing, but it took the overall pipeline status down with it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:06:51 +00:00
Viktor Barzin	37d88ce50e	nfs-mirror: weekly Mon 04:00 → daily 02:00 Steady-state delta runs in 10-20 min and the weekly cadence left a real RPO gap: app data under /srv/nfs/<svc>/ that isn't a PVC (captured by daily-backup) or a *-backup CronJob (captured daily by the CronJob writing to /srv/nfs/<svc>-backup/) was on a 7-day worst case for off-disk durability. Affected paths include nextcloud shared files, audiobookshelf library, mailserver Maildir, calibre, servarr metadata, real-estate-crawler scraped data, openclaw agent state. Daily cadence drops their RPO to ~24h at negligible cost. Slot: 02:00, 3h ahead of daily-backup (05:00) so the manifest is populated before offsite-sync reads it at 06:00. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:00:10 +00:00
Viktor Barzin	1eee56d0ba	redis: tolerate up to 1KB of AOF tail corruption on load Post-2026-05-26 unclean node2 reboot left redis-v2-2's incremental AOF truncated at offset 84799139. With aof-load-corrupt-tail-max-size at its default 0, redis refuses to load any corruption and crashloops. Setting 1024 lets it truncate the corrupted tail and continue, which is the right call for a non-source-of-truth cache fronted by sentinel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:58 +00:00
Viktor Barzin	60b2b1cdfc	cluster-health: emergency-stop Keel + roll back image downgrades + quota raises Keel was rewriting tag strings (not just digests) despite the keel.sh/match-tag=true annotation injected by the Kyverno inject-keel-annotations ClusterPolicy. That annotation was supposed to constrain Keel to digest-only watches under the deployment's CURRENT tag. It didn't. Casualties confirmed today (live image rewritten to a lower version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into SQLite mode and can't read the v2 db-config.json → MariaDB store); n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop); beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate); plus historical ones previously fixed (claude-memory :71b32438 → :17, forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1). Changes: * stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to 0/0. Keep off until either match-tag is root-caused or every enrolled workload migrates to a content-addressed (SHA) pin. * stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2, bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the deployment label (matches Kyverno's exclude rule so the inject-keel-annotations ClusterPolicy stops mutating) AND the annotation (so Keel itself respects). Removed keel.sh/policy from lifecycle.ignore_changes so TF owns it as `never` and can't drift back to `force`. * stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73 on both seed-config and workbench containers (was :latest, Keel rolled to :0.1.0). * stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated by Keel from the prior live :3.2.1). * stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to 100% and blocked every new pod create with FailedCreate. Raising the cap unblocked the four affected DaemonSets in one shot. * stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory 32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's face-detection burst behaviour. * stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07 (matches the 21 other stacks that already declare it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:50 +00:00
Viktor Barzin	41fb7c4a76	backup pipeline: prune sda-bypass list to immich-only Previously /srv/nfs/{ollama,audiblez,ebook2audiobook,*-backup} took the sdc → Synology direct leg. They now ride sdc → sda → Synology pve-backup/ via nfs-mirror like every other NFS subtree, so sda becomes the single canonical mirror and Synology only has to ingest one feed for the bulk of cluster state. frigate + temp dropped from BOTH legs (no backup anywhere) per explicit user ask — frigate is a 14d camera ring, temp is scratch. prometheus/loki/alertmanager dropped as no-op (orphan dirs that no longer exist on /srv/nfs). Also: nfs-mirror's manifest collection switched from find -newer (mtime) to find -cnewer (ctime) — rsync -t preserves source mtime on dest, so freshly-written files looked "older than $STAMP" and the 2026-05-26 full mirror run captured only 2 of 800k transferred files. Hit during this session, recovered via .force-full-sync. Operational result post-rollout: - sda 87% → 70% (anca-elements 423G deleted, +260G new dirs) - /Viki/nfs/ on Synology: was 24 stale dirs (~430G), now immich only - Synology free: ~300G → ~430G+ once btrfs reclaim catches up Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:22:01 +00:00

3839 commits