infra

Author	SHA1	Message	Date
Viktor Barzin	0044c3a8ea	fire-planner: add examples ingest Job (toggled) + weekly CronJob Adds the K8s plumbing for the Reddit FIRE-examples ingest path: - ExternalSecret fire-planner-examples-reddit (Reddit OAuth from Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}). - ExternalSecret fire-planner-examples-claude (claude-agent-service bearer from Vault secret/claude-agent-service.api_bearer_token). - kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled via var.run_examples_bulk_ingest (default false). Timestamp-named so each (true) transition creates a fresh Job; lifecycle ignores the name so re-plans don't propose phantom renames. - kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC --top=week --limit=200 incremental run. Both runners share the env_from plumbing of the existing recompute CronJob (fire-planner-secrets, fire-planner-db-creds, wealthfolio-sync-db-creds) plus examples-specific vars (REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL, plus the three secret-backed env vars). Plan-only this commit — actual apply lands in Task 17 after the ingest image build.	2026-05-28 22:51:14 +00:00
Viktor Barzin	4dff834c8a	reduce ingress-dns-sync frequency to hourly [ci skip]	2026-05-28 22:30:08 +00:00
Viktor Barzin	5ac8d625b9	add ingress-dns-sync CronJob to auto-create Technitium CNAME records Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and creates matching CNAME records in Technitium if missing. Prevents the desync where Cloudflare has the DNS record (via ingress_factory) but internal DNS returns NXDOMAIN because Technitium was never updated. Includes ServiceAccount + ClusterRole for ingress list permissions.	2026-05-28 22:22:42 +00:00
Viktor Barzin	58cced5dab	monitoring: render market-vs-salary periodic panels as lines, not bars	2026-05-28 22:18:59 +00:00
Viktor Barzin	388a7f60c7	monitoring: add net-pay-vs-market-gains panels to wealth dashboard Three new panels comparing employment income to investment returns over time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest, portfolio in wealthfolio_sync — separate DBs, so per-target datasources): - cumulative net take-home pay vs cumulative market gain (line race) - net pay vs market gain per year (grouped bars) - net pay vs market gain per month (grouped bars) Inserted after the "Growth over time" panel; existing panels shifted down, full-width tables remain at the bottom.	2026-05-28 22:13:44 +00:00
Viktor Barzin	1af412b461	trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt)	2026-05-28 21:40:17 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00
Viktor Barzin	8b4bcc0ca2	blog: Anubis carve-out for /net-diag.sh curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis. Adds a second ingress_factory pointing /net-diag.sh at the bare blog service (port 80), keeping every other path on the existing Anubis chain. Path-prefix specificity wins in Traefik routing — / stays gated. dns_type = "none" because the apex viktorbarzin.me CF record already exists from the main ingress. Doc update: CLAUDE.md Anubis section notes blog now follows the wrongmove carve-out pattern.	2026-05-28 13:22:57 +00:00
Viktor Barzin	fc5a4b66ad	monitoring: exclude catchall-error-pages from HighService4xxRate The catchall-error-pages IngressRoute matches HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1 — it's the wildcard handler that returns 404 for any unmatched hostname (typos + scanner traffic). By design its 4xx rate sits at ~100%, so HighService4xxRate was a permanent false positive for traefik-catchall-error-pages-*@kubernetescrd. Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory (services with legitimately high 4xx counts).	2026-05-27 19:46:40 +00:00
github-actions[bot]	b8cd1219a6	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-05-27 18:46:19 +00:00
root	d0ede3773b	Woodpecker CI deploy [CI SKIP]	2026-05-27 18:38:09 +00:00
Viktor Barzin	ee159b02ba	nextcloud: disable Keel auto-upgrades Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on 2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate the DB schema, which Keel does not run, so Nextcloud landed in maintenance mode (needsDbUpgrade=true) and stayed there for ~22h — external probes saw 503, ExternalAccessDivergence kept firing. Disable Keel for this workload: - Drop the `keel.sh/enrolled=true` label from the namespace so Kyverno's `inject-keel-annotations` policy no longer matches. - Layer `keel.sh/policy=never` label + annotation onto the Helm-managed Deployment via `kubernetes_labels` / `kubernetes_annotations` (the chart at 8.8.1 doesn't expose Deployment-level commonLabels/commonAnnotations). Keel reads the annotation; the label is defense-in-depth for the Kyverno exclude rule should the namespace ever get re-enrolled. Verified: Keel logged `image no longer tracked, removing watcher` within seconds of the annotation landing, and `tg plan` is clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:37:05 +00:00
Viktor Barzin	d72c7169c0	monitoring: route proxmox-exporter to scrape_slow job (fix flapping alerts) PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host (1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's default 10s scrape_timeout and flapping ProxmoxMetricsMissing + ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape to prometheus.io/scrape_slow so the scrape moves to the existing kubernetes-service-endpoints-slow job (5m interval, 30s timeout).	2026-05-27 18:36:11 +00:00
Viktor Barzin	f121bee121	fire-planner: update recompute CronJob comment to reflect lazy refresh As of fire-planner@4da58fe the account_snapshot cache is refreshed lazily on each /networth, /networth/history, /progress request when older than NETWORTH_CACHE_TTL_DAYS (default 1). The recompute CronJob runs Monte Carlo only — no longer assumed to coordinate with the wealthfolio-sync schedule. [ci skip]	2026-05-27 18:23:21 +00:00
Viktor Barzin	4b77aa65a1	broker-sync: unsuspend broker-sync-imap (IE structurally skipped at code level now) E2E test (manual one-shot of all 3 broker-sync CronJobs) confirmed idempotent behaviour with zero new activities and net worth unchanged. The IE-via-IMAP path is now default-skipped inside broker_sync.providers.imap (commit 0d23487), so unsuspending the cron is safe — Schwab vests get parsed, IE messages get ie_skipped at the parser level regardless of which entry point triggers the run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:57:26 +00:00
Viktor Barzin	06fb1f9ea9	broker-sync: update imap-cron comment to reflect default-skip IE (post-incident)	2026-05-27 17:25:42 +00:00
Viktor Barzin	501f2c6b37	broker-sync: re-suspend broker-sync-imap CronJob 39 IMAP-source InvestEngine BUYs + their cash-flow DEPOSITs were re-inserted into Wealthfolio at 2026-05-27T09:22:18 UTC — exactly the rows the £252k dedup removed yesterday. The broker-sync-imap cron at 02:30 UTC today correctly logged `ie_skipped=53`, so the IMAP cron itself isn't the immediate culprit, but the rows DO carry broker-sync's IMAP-path signature (`[rfc2822-v1]` notes + `sync:imap:invest-engine:...` cash-flow markers). Suspending kills one possible vector while a researcher subagent investigates the root cause. Schwab vest ingestion is the only function lost; can be unsuspended once the IE re-dup source is identified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:09:09 +00:00
Viktor Barzin	54919e3abc	trading-bot: TRADING_SLACK_BOT_TOKEN + TRADING_SLACK_CHANNEL env	2026-05-27 10:06:51 +00:00
Viktor Barzin	17c59a280b	broker-sync: drop IBKR_ACCOUNT_ID env (now derived via ensure_account)	2026-05-27 09:25:02 +00:00
Viktor Barzin	6d13ba12da	broker-sync: add fsGroup=10001 to trading212 cron pod spec Without supplementary GID 10001, the broker user (uid=10001 gid=999) cannot write sqlite3 journal files next to /data/sync.db. The cron hits a "readonly database" error in dedup.record() AFTER successfully importing fills to Wealthfolio — so data lands but the dedup store never updates, leaving every subsequent run to re-fetch the same window and exit 1 again. Same fix that's already on imap + ibkr crons. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:20:16 +00:00
root	9e8314183f	Woodpecker CI deploy [CI SKIP]	2026-05-26 22:53:29 +00:00
Viktor Barzin	9b68dbc788	wealthfolio: dav_corrected — also exclude Schwab synthetic cash flows The Net-contribution chart was showing huge negative monthly swings because broker-sync emits a synthetic cash-flow-match DEPOSIT for every vest BUY and a WITHDRAWAL for every sell-to-cover SELL. Cumulatively WITHDRAWALs ($1.06M) exceed DEPOSITs ($498k) — the user perceives this as having "withdrawn" money even though they never moved cash out of Schwab. The proceeds left for the bank and surface as real DEPOSITs on the next account (IE/T212) that the user transfers them to. Extend the dav_corrected view to subtract Schwab cash-flow-match flows (DEPOSIT-positive, WITHDRAWAL-negative, account-scoped) in addition to the existing Fidelity unrealised-gains-offset correction. InvestEngine and Trading212 cash-flow-match entries are REAL deposits and must be preserved — scope by Schwab account_id only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 22:52:17 +00:00
Viktor Barzin	30ba6860b9	broker-sync: add IBKR Flex daily CronJob (02:00 UK)	2026-05-26 22:34:54 +00:00
Viktor Barzin	2df9700d70	trading-bot: add slack_webhook_url ESO secret + env var	2026-05-26 21:55:59 +00:00
Viktor Barzin	15c88bc683	keel: belt-and-suspenders opt-out for mysql/redis/nvidia-exporter After re-enabling Keel with `policy: patch` (commit `f325b949`), 3 of the 60 first-hour bumps broke things and need explicit cluster-wide opt-out so future Kyverno reconciles can't put them back under auto-update: - `dbaas/mysql-standalone`: patch-bumped `mysql:8.4.8 → :8.4.9` and the DD upgrade stalled (we explicitly track that as beads `code-963q` — the 8.4.9 jump needs a wipe+reinit, not a rolling upgrade). The StatefulSet already had `annotation=never` from TF but was missing the LABEL — Kyverno's selector exclude reads the LABEL, so a reconcile that dropped the annotation could resume auto-update. Added the LABEL. - `redis/redis-v2`: patch-bumped `redis:8-alpine → :8.0.6-alpine` and the new image rejected the `aof-load-corrupt-tail-max-size` directive from commit `1eee56d0` → redis-v2-2 CrashLoopBackOff. Plus :8.0.6 is semantically older than :8-alpine (which resolves to :8.6.2) — same Keel tag-picking pathology as the 2026-05-26 morning incident, just in a different shape. LABEL + ANNOTATION both added. - `nvidia/nvidia-exporter`: Keel rewrote `:latest → :4.5.2-4.8.1-ubuntu22.04` and the new dcgm-exporter OOMKilled at the 192Mi memory limit (4 restarts before I caught it). Added LABEL + ANNOTATION for opt-out, AND bumped memory request/limit 192Mi → 256Mi/512Mi so the bumped image doesn't OOM (older versions fit in 192Mi; the bumped one needs ~250Mi steady-state). The 56 other Keel bumps in that 10-minute window (coredns 1.12.1→1.12.4, kyverno 1.16.1→1.16.4, nextcloud 32.0.3→32.0.9, grafana 12.3.1→12.3.6, cnpg, mailserver, csi-nfs, metrics-server, etc.) landed cleanly — the `patch` policy is the right default. Per-workload `never` opt-out is the maintenance cost. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 21:53:10 +00:00
Viktor Barzin	498b01396c	status-page: disable pusher CronJob to stop sdc write storm The CronJob ran every 5 min on a vanilla python:3.12-alpine image, doing `apk add git` + `pip install uptime-kuma-api` from scratch on every invocation. Caught at ~3.2 MB/s on k8s-node4's root LV, contributing to ~8 MB/s sustained on the pve-data thin pool (sdc) — ~804 GB written over the prior 18 h. Commented out the kubernetes_cron_job_v1.status_page_pusher resource (kept ns / SA / RBAC / ConfigMap intact for trivial revert). Re-enable once a custom image with git + uptime-kuma-api baked in is published so no per-run cold install happens. status.viktorbarzin.me stops updating until then.	2026-05-26 21:40:14 +00:00
Viktor Barzin	84404fd0d6	broker-sync: skip InvestEngine in IMAP CronJob Sets BROKER_SYNC_IMAP_EXCLUDE_PROVIDERS=invest-engine on broker-sync-imap, so the IMAP path no longer parses InvestEngine emails (handled by the bearer-token API path now). Stops duplicate BUYs in Wealthfolio. The terraform fmt run also realigned two adjacent label assignments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 21:19:31 +00:00
root	2becd0ff6f	Woodpecker CI deploy [CI SKIP]	2026-05-26 21:09:48 +00:00
Viktor Barzin	8605181c53	trading-bot: Phase 2 — add trade-executor + flip kevin kill-switch	2026-05-26 21:07:37 +00:00
Viktor Barzin	f325b949be	keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged Re-enables Keel after the 2026-05-26 emergency stop, with a safer default. Switch Kyverno-injected default from `force + match-tag=true` (proven unreliable — it rewrote tag strings cluster-wide despite the design intent) to `patch`, which is semver-parser-bounded: - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never 1.3.x or 2.x — the parser does the math, not string compare). - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are IGNORED entirely. No tag rewriting under any code path. - 151 stale `force` annotations migrated to `patch` cluster-wide during this apply (anchor `+()` dropped, then re-added). Live state after this commit: 0 workloads on `force`, 209 on `patch`, 22 on `never`. Keel deployment back to 1/1 on `:0.21.1`. Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation mutated to `patch` during the migration despite Kyverno's matchLabels-based exclude rule — appears to be a quirk of `mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno reconciles preserve them. Also fixes CI build-cli workflow which was blocked by `deny-privileged-containers` since wave 1 enforce flip on 2026-05-18: woodpecker namespace added to the shared security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use). The `default` workflow (terragrunt apply) was already passing — only the parallel `build-cli` workflow (which builds the infra-cli docker image) was failing, but it took the overall pipeline status down with it. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:06:51 +00:00
Viktor Barzin	1eee56d0ba	redis: tolerate up to 1KB of AOF tail corruption on load Post-2026-05-26 unclean node2 reboot left redis-v2-2's incremental AOF truncated at offset 84799139. With aof-load-corrupt-tail-max-size at its default 0, redis refuses to load any corruption and crashloops. Setting 1024 lets it truncate the corrupted tail and continue, which is the right call for a non-source-of-truth cache fronted by sentinel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:58 +00:00
Viktor Barzin	60b2b1cdfc	cluster-health: emergency-stop Keel + roll back image downgrades + quota raises Keel was rewriting tag strings (not just digests) despite the keel.sh/match-tag=true annotation injected by the Kyverno inject-keel-annotations ClusterPolicy. That annotation was supposed to constrain Keel to digest-only watches under the deployment's CURRENT tag. It didn't. Casualties confirmed today (live image rewritten to a lower version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into SQLite mode and can't read the v2 db-config.json → MariaDB store); n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop); beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate); plus historical ones previously fixed (claude-memory :71b32438 → :17, forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1). Changes: * stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to 0/0. Keep off until either match-tag is root-caused or every enrolled workload migrates to a content-addressed (SHA) pin. * stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2, bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the deployment label (matches Kyverno's exclude rule so the inject-keel-annotations ClusterPolicy stops mutating) AND the annotation (so Keel itself respects). Removed keel.sh/policy from lifecycle.ignore_changes so TF owns it as `never` and can't drift back to `force`. * stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73 on both seed-config and workbench containers (was :latest, Keel rolled to :0.1.0). * stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated by Keel from the prior live :3.2.1). * stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. 
Cluster grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to 100% and blocked every new pod create with FailedCreate. Raising the cap unblocked the four affected DaemonSets in one shot. * stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory 32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's face-detection burst behaviour. * stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07 (matches the 21 other stacks that already declare it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:50 +00:00
Viktor Barzin	b3dcccfc41	vaultwarden: track :latest tag for Keel auto-upgrade (was 1.35.7) Earlier today Keel's hourly poll caught vaultwarden's deployment in a window where the `keel.sh/match-tag` annotation wasn't set, fell into 'watch repository tags' mode, and rewrote 1.35.7 -> 1.21.0. Vaultwarden 1.21.0 doesn't have the API endpoints the modern Bitwarden clients call (/identity/accounts/prelogin/password, /api/devices/knowndevice, /api/config), so the Chrome extension started 404-ing on login. Same race shape as the 2026-05-17 authentik/pgbouncer incident. The fundamental issue: `policy: force` on a semver-pinned tag is unsafe because Keel happily rewrites the tag string if it can't find a stable 'current tag' to digest-watch. Fix: switch to `:latest` (the mutable tag vaultwarden publishes for the newest stable release). Keel now digest-watches `:latest` (safe mode) and rolls forward on each upstream release. Matches cluster convention (128 other Keel-managed workloads use the same `:latest` + force + match-tag pattern). Also added imagePullPolicy=Always (required with :latest so the kubelet revalidates the manifest on each rollout instead of using a cached layer), and extended the lifecycle.ignore_changes to cover the match-tag annotation and kubernetes.io/change-cause (Keel rewrites this on every rollout). Current `:latest` digest -> vaultwarden 1.36.0 (released 2026-05-03). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 13:26:36 +00:00
Viktor Barzin	8ed427a7e4	cloud-init: hands-off k8s worker provisioning + 5 bug fixes Goal: re-clone the worker template, boot, and have it appear as `kubectl get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five distinct ways on a clean boot. Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today): 1. `indent(6, containerd_config_update_command)` indented the bodies of `cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so [plugins.*] TOML sections landed in /etc/containerd/config.toml at col 6 — containerd refused to parse them. Source is now a normal .sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`) base64-embedded into `write_files`; YAML whitespace never touches the heredoc bodies. 2. The same script tried to `cat >> /etc/containerd/config.toml` `[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd v2.2.4's `config default` ALREADY emits. Result: `toml: table … already exists`. Patched with sed-in-place overrides instead. 3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from the containerd setup script — BEFORE `kubeadm join` writes that file. Sed aborted with "No such file or directory", `set -e` killed the script, post-script cloud-init steps kept going (cloud-init doesn't stop on runcmd failure). Split into a dedicated `k8s-node-post-join-tune.sh` invoked AFTER kubeadm join. 4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE kubeadm join. kubelet defaults to failSwapOn=true → exited 1 immediately. Replaced the swap setup with `swapoff -a` (node4 already runs this way and the cluster is fine). 5. Without `hostname:` in the shared user-data snippet, Proxmox's auto-generated meta-data does NOT include local-hostname when `cicustom user=…` is set — so cloud-init falls back to the cloud image's default `ubuntu` and `kubeadm join` registers the wrong node name. 
`provision-k8s-worker` now writes a per-node `<NAME>-meta.yaml` snippet and passes both via `cicustom user=…,meta=…`. Other improvements rolled in while fixing the above: - `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`, added today) instead of `var.ssh_public_key`. The last `terragrunt apply` was run with that var empty, leaving the snippet's `ssh_authorized_keys` with a single blank entry; the wizard user was effectively locked out of every fresh node. - `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf` with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it, systemd-resolved only consulted Technitium (link-level), which returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from the Forgejo registry then failed DNS until I patched it manually on node5. - k8s apt repo bumped v1.32 → v1.34 (matches cluster). - The containerd setup script now creates hosts.toml for forgejo, quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4 had these added by hand post-bootstrap; now they're baked in. - `config_path` sed matches both `""` (containerd v1) and `''` (containerd v2.x). Without the v2 match, the certs.d mirror dir was silently ignored. - `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI topology labels (region/zone, max-volume-attachments=28) apply on next `tg apply`. - `stacks/infra/main.tf` shed the 160-line inline containerd setup heredoc — that whole thing now lives in the module as a .sh file. Known unsolved gaps (deferred): - iscsid restart hangs ~90s on first boot before SIGKILL releases it (systemd-resolved restart kicks iscsid via dependency). Adds wall- clock time but doesn't block the join. - `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi` afterward, so the CSI topology labels need a manual apply after the node joins. Solving cleanly needs the CSI map to derive from `kubectl get nodes` instead of a static local — separate work. 
- `var.containerd_config_update_command` is now ignored when is_k8s_template=true (replaced by the bundled .sh file). Variable kept with a deprecation note to avoid breaking other call sites. E2E proof: k8s-node6 (VMID 206) boots hands-off from `provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as `kubectl get nodes …Ready` ~7 min later (most of which is the apt package_upgrade — separate optimization). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 11:52:00 +00:00
Viktor Barzin	bb9d8f1b38	kyverno: GPU priority mutate uses add (was replace) — fixes silent skip The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902 op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate) have no priorityClassName field at all — replace requires the path to exist, so the patch fails with "doc is missing key: /spec/priorityClassName" and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier) gets a chance to add the field. Result: GPU pods never got priorityClassName set, sat at priority=0, and could not preempt lower-tier pods on the GPU node. Observed today on frigate post-node4-recovery — pod stayed Pending with "Preemption is not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000) occupied node1's memory budget. Fix: op=add for all three paths. add works whether or not the key is present, so the policy is robust to the upstream pod shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:04:51 +00:00
Viktor Barzin	12b4f6f81a	dbaas: require pod anti-affinity on pg-cluster (one PG per node) Default CNPG affinity was `preferred` (soft). During the 2026-05-26 node4 outage, all 3 pg-cluster pods drifted onto k8s-node1 — losing that node would have taken the whole PG cluster down (no quorum) AND the 9.2 GiB pg-cluster footprint was the dominant reason frigate couldn't fit on the GPU node. With 3 instances + 4 worker nodes, `required` is safe under 1-node drain (3 distinct nodes always available, even excluding the drained one). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:00:37 +00:00
Viktor Barzin	400ee88967	state(dbaas): update encrypted state	2026-05-26 08:59:40 +00:00
root	daa41a2eb1	Woodpecker CI deploy [CI SKIP]	2026-05-26 08:29:09 +00:00
Viktor Barzin	00bbbe0838	url/shlink-web: containerPort 8080 -> 80 shlinkio/shlink-web-client:0.1.1 listens on port 80 (nginx default), not 8080 like the prior :latest images. Keel auto-bumped the tag on 2026-05-23; liveness/readiness probes have been failing ever since because they still hit :8080. Pod was stuck restarting, the DeploymentReplicasMismatch alert fired. Aligns containerPort + both probes + service target_port with the image.	2026-05-26 08:19:24 +00:00
Viktor Barzin	44c3770a5c	infra: pull all VMs out of Terraform — telmate provider can't represent them safely The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent fields back from live state — every plan after a qm-set cap is applied proposes to "fix" mbps 0 → N and the apply errors with the spurious "the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes does NOT block either failure mode. Decision: stop trying to manage Linux VMs in this stack. The cloud-init bootstrap stays in TF (via k8s-node-template, non-k8s-node-template, docker-registry-template above), so a fresh node still clones the right template and runs the same bootstrap. VM lifecycle stays in the Proxmox UI. I/O caps are managed via qm-set on the PVE host (idempotent script at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j). Removed from TF state + HCL: - module "k8s-master" (vmid 200) - module "k8s-node2" (vmid 202) — pre-existing drift, never in state - module "docker-registry-vm" (vmid 220) — was in state, hit refresh bug Already hand-managed (never in HCL): - 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough), 203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10. Live I/O caps (qm set, all verified): 102=60/60 103=40/40 200=100/60 201=150/120 202=150/120 203=150/120 204=150/120 220=40/40 Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox provider migration — telmate can't represent these VMs at all). Closes: code-75ds	2026-05-26 07:12:46 +00:00
Viktor Barzin	9b75b2817b	cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes) Two bugs found while rebuilding k8s-node4 (2026-05-26): 1. runcmd YAML breakage: `- $${containerd_config_update_command}` interpolated a multi-line heredoc as bare list-item content. The trailing lines lost their list-item prefix, breaking cloud-config parsing. Cloud-init silently fell back to the minimal default (hostname + package_upgrade only) — kubeadm join, containerd config, kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible in `cloud-init status`. Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`. 2. containerd v2 single-quote mismatch: `containerd config default` in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double). The sed pattern matched only double quotes → silent no-op on fresh containerd 2.x nodes → registry-mirror hosts.toml ignored → all image pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop. Fix: match any value with `config_path = .*`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 07:06:50 +00:00
Viktor Barzin	445feb118f	infra: per-VM I/O caps + terragrunt v0.77 plumbing + state recovery WHAT LANDED: - terragrunt.hcl (root): added telmate/proxmox to k8s_providers required_providers. Other stacks just don't instantiate a provider block — harmless. Replaces the same-name override trick the infra stack used to do, which stopped working under Terragrunt v0.77 ("Detected generate blocks with the same name"). - stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block writes proxmox_provider.tf with the provider config; credentials read from Vault secret/viktor at plan/apply time (no env vars). - modules/create-vm: new mbps_rd / mbps_wr number variables (default 0 = uncapped), wired into scsi0/scsi1 disk{} blocks as mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots), plus scsihw and qemu_os (vary per-VM; non-trivial live changes). - stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40, mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26. WHAT FAILED AND WAS ROLLED BACK: - Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200 k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204 k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07 provider mangled proxmox-csi PVC slots on apply for vmid 202 and 203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/ etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data loss, K8s CSI reconciled PVC attachments within minutes. Removed the 7 imports from state via `terraform state rm` and re-encrypted. Tracked in beads code-xzbl: blocked on bpg/proxmox provider migration (telmate has the same dynamic-disk defect that bit us on iSCSI back in 2026-04-02; see memory id=539). 
LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC): 102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60 201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120 204 k8s-node4 150/120 220 docker-registry 40/40 (pfSense 101 BSD + Windows10 300 intentionally out of scope.) PRE-EXISTING DRIFT EXPOSED (NOT NEW): - HCL declares k8s-master (200) and k8s-node2 (202) but neither was ever imported into TF state — confirmed against the SOPS-encrypted state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06). This commit leaves both declarations in place but does NOT import them; that's part of the code-xzbl follow-up. Closes: code-s9xr	2026-05-26 06:46:47 +00:00
Viktor Barzin	07bd2e0017	onlyoffice: restore replicas 0 → 1 post IO-storm recovery Cluster is fully stable (all 5 nodes Ready, vaultwarden recovered, node4 rebuilt 2026-05-26). Removing the TEMP-SCALEDOWN guard. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 03:08:17 +00:00
Viktor Barzin	7ad0e578ae	f1-stream: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. The PVC stores 5 small JSON state files (health_state, schedule, scraped_links, sessions, streams) and a lost+found — total 30KB, no DB, regenerable from upstream APIs. Standard scale-to-0 → rsync → swap pattern (deployment was at replicas=1). Pod came back up on k8s-node4 (now Ready again). Net: -1 SCSI LUN on k8s-node1 (was the previous host).	2026-05-26 02:49:43 +00:00
Viktor Barzin	aded77d5ab	monitoring: alerts for proxmox-csi LUN saturation per node Vaultwarden + 18 pods got stuck for 7h on 2026-05-26 when k8s-node4 went down: surviving workloads piled onto node1 and hit the csi.proxmox.sinextra.dev/max-volume-attachments=28 cap. The Proxmox VM also had 5 stale scsi entries (PVCs long-migrated to other nodes but never removed from VM config), which bypassed the K8s scheduler safety until the plugin returned 'no free lun found' at attach time. Three new alerts on the kube_volumeattachment_info count per node: - warning at 24/28 (>= 85%), 10m - critical at 27/28 (1 slot left), 3m - critical at 28/28 (cap reached), 1m Also whitelisted kube_volumeattachment_info: the metric was being dropped by the disk-write-reduction filter (id=559), so the alert queries return zero series unless it is kept. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:45:13 +00:00
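The three thresholds in the commit above can be read as a simple tiering on the per-node attachment count. The following is a minimal Python sketch of that tiering, not the actual Prometheus rules: the function name `lun_alert` and the choice to return only the most severe matching tier are assumptions made for illustration, while the cap of 28, the thresholds, and the `for` durations are taken from the commit message.

```python
from typing import Optional, Tuple

# Cap quoted in the commit message (max-volume-attachments=28).
LUN_CAP = 28


def lun_alert(attachments: int, cap: int = LUN_CAP) -> Optional[Tuple[str, str]]:
    """Return (severity, hold duration) for a node's attachment count.

    Hypothetical helper: real alert rules would each fire independently;
    here only the most severe matching tier is reported.
    """
    if attachments >= cap:
        return ("critical", "1m")   # cap reached
    if attachments >= cap - 1:
        return ("critical", "3m")   # one slot left
    if attachments >= 24:
        return ("warning", "10m")   # 24/28, roughly 85% of the cap
    return None                     # below every threshold


if __name__ == "__main__":
    for count in (23, 24, 27, 28):
        print(count, lun_alert(count))
```

Returning a single tier keeps the sketch easy to read; in an alerting system the lower tiers would normally stay active alongside the higher ones.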
Viktor Barzin	a0b5cbc922	onlyoffice: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. OnlyOffice document server keeps only 2 WOPI key files + a .private dir on the PVC (~24K) — the real DB lives in its external Postgres + Redis stack, not on this PVC. Service is at replicas=0 (IO-storm temp scaledown — TEMP-SCALEDOWN comment preserved). Migration trivia: scheduler tried to put the rsync helper on k8s-node4 (PVC's last-known location) but node4 had just come back online and its proxmox-csi/nfs-csi node pods were still in ContainerCreating — failed. Retried pinned to k8s-node2 via nodeSelector; rsync template updated to take an optional node arg. Net: -1 SCSI LUN once onlyoffice is brought back up.	2026-05-26 02:43:47 +00:00
Viktor Barzin	681f6daf10	whisper: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. Whisper PVC holds Piper TTS .onnx voice model + a HuggingFace faster-whisper-small-int8 model cache — read-mostly model artefacts, no DB, 303M total. Both whisper and piper deployments are at replicas=0 (GPU-node memory pressure, unrelated). Switched access_modes to ReadWriteMany since both whisper + piper deployments reference the same PVC; on proxmox-lvm RWO they could only colocate on the same node when both come back. Net: -1 SCSI LUN once these are brought back up.	2026-05-26 02:38:34 +00:00
Viktor Barzin	a2b410f6c9	resume: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. Reactive Resume stores user-uploaded PDFs + 3 .txt counters under uploads/ and statistics/ — no embedded DB, 112K of data. Service is at replicas=0 (browserless OOM scaledown, unrelated to this work) so the migration was no-downtime. Net: -1 SCSI LUN once resume is brought back up.	2026-05-26 02:36:20 +00:00
Viktor Barzin	cdbb418f45	monitoring: alert when cluster can't tolerate losing a non-GPU worker ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it than the rest of the workers (incl. node1 GPU node) currently have free. If that node went down, its pods would not fit elsewhere and would stay Pending — exactly what happened today (2026-05-26) with node4 NotReady: 4 kyverno pods + woodpecker PVCs + several deployments stuck Pending because node2/node3 were at 99% memory-request saturation. Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0)) over Ready workers. node1 included on the right because its taint is PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure. Currently fires with a 33.96 GiB shortage. Remediation: right-size top reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus 4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on k8s-node2/k8s-node3 from 32GB → 48GB to match node1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:34:13 +00:00
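The inequality in the commit above compares the largest memory reservation on any non-GPU worker against the spare capacity left elsewhere. Below is a minimal Python sketch of that arithmetic, under stated assumptions: the function name `cannot_tolerate_loss` is hypothetical, the spare capacity is summed over the Ready workers other than the heaviest node (following the "rest of the workers" wording; the commit's formula does not say whether that node is excluded), and all figures in the usage example are made up for illustration rather than taken from the cluster.

```python
from typing import Dict, Iterable, Tuple


def cannot_tolerate_loss(
    requests: Dict[str, float],      # R(n): memory requests per node, GiB
    allocatable: Dict[str, float],   # A(n): allocatable memory per node, GiB
    non_gpu_workers: Iterable[str],  # candidates for the node that is lost
    ready_workers: Iterable[str],    # nodes that could absorb evicted pods
) -> Tuple[bool, float]:
    """Return (fires, shortage in GiB) for the node-loss tolerance check."""
    # Most heavily reserved non-GPU worker: max over X of R(X).
    heaviest = max(non_gpu_workers, key=lambda n: requests[n])
    # Spare capacity on the remaining Ready workers, clamped at zero so an
    # over-committed node cannot offset free space elsewhere (clamp_min).
    spare = sum(
        max(allocatable[n] - requests[n], 0.0)
        for n in ready_workers
        if n != heaviest  # assumption: the lost node cannot host its own pods
    )
    shortage = requests[heaviest] - spare
    return shortage > 0, shortage


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formula.
    requests = {"node1": 40.0, "node2": 31.0, "node3": 31.5, "node4": 20.0}
    allocatable = {"node1": 48.0, "node2": 32.0, "node3": 32.0, "node4": 32.0}
    fires, shortage = cannot_tolerate_loss(
        requests,
        allocatable,
        non_gpu_workers=["node2", "node3", "node4"],
        ready_workers=["node1", "node2", "node3", "node4"],
    )
    print(fires, shortage)
```

With these made-up inputs the heaviest non-GPU worker holds 31.5 GiB of requests while the other nodes have 21 GiB free, so the check fires with a 10.5 GiB shortage.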
Viktor Barzin	467fa1631d	excalidraw: migrate PVC from proxmox-lvm to NFS Wave 1 of the per-VM SCSI-LUN cap relief. The proxmox-csi-plugin hardcodes a `lun < 30` loop (pkg/csi/utils.go:394) — cap is 29 attachable PVCs per K8s node VM, and k8s-node1 was sitting at 29 with 4 stuck `no free lun found` PVCs queued behind it. Excalidraw stores per-user .excalidraw scene files (no SQLite, no embedded DB) — confirmed safe on NFS. 1.5 MiB of data, 4 active scenes. Migration: - Add nfs_volume module → apply - Scale to 0, rsync helper, swap claim_name → apply - Remove old proxmox-lvm PVC → apply Net: -1 SCSI LUN on k8s-node2. Refs: docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md (separate concern; this is for the upstream LUN-cap pressure).	2026-05-26 02:33:41 +00:00


1109 commits