Commit graph

1116 commits

Author SHA1 Message Date
Viktor Barzin
478629c1ee keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation
Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube),
servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects
keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added
the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container
KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver
deployment had no resource-level lifecycle at all — added one.

Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas`
validation `var.replicas == null || (...)` doesn't null-short-circuit in the current
TF version, failing apply on every single-replica Anubis site (blog, cyberchef,
f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with
"argument must not be null". Switched to a null-safe ternary.

Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented).
The anubis module change triggers a full platform apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 06:02:24 +00:00
root
fe1a16a5f5 Woodpecker CI deploy [CI SKIP] 2026-05-29 05:48:10 +00:00
Viktor Barzin
5bc7a76630 tuya-bridge: switch to Forgejo image + CI-driven deploy
Mirrors the kms-website pattern: deployment image now points to
forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the
new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the
rollout via `kubectl set image` on every push.

Changes:
- Extract `tls_secret_name` and add `image_tag` (default "latest")
  to a new variables.tf, matching the kms / fire-planner /
  payslip-ingest convention.
- Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno
  ClusterPolicy sync-registry-credentials already syncs the Secret
  into every namespace).
- Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged
  images are immutable, no need to re-pull on every restart.

The image attribute remains in `lifecycle.ignore_changes` (line was
already there from the prior Keel-managed era), so future `tg apply`s
do not fight Woodpecker's `kubectl set image`. Keel is still enrolled
on the namespace but will skip SHA-tagged images under `policy: patch`
(non-semver), so the CI pipeline is the sole rollout mechanism.

Backstory: the 2026-05-26 cluster-health incident was tuya-bridge
crashlooping after Keel rewrote `:latest` to a stale broken `:0.1`
tag on Docker Hub (which predated the `prometheus_exporter.py`
addition). Manual rebuild + push was the immediate fix; this commit
plus tuya_bridge/.woodpecker.yml close the underlying gap so a
source change reliably produces a fresh registry image.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 05:45:16 +00:00
Viktor Barzin
7870e62a07 uptime-kuma: declare Proxmox UI monitor in TF
Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/`
+ ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006`
hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW)
intermittently, false-tripping ExternalAccessDivergence. A kuma DB
restore would have lost the SQL fix. Declare the monitor in
`internal_monitors` so the existing sync CronJob self-heals it.

Extends the schema with optional `url` / `accepted_statuscodes` /
`ignore_tls` fields (null on the existing DB/port entries) and
teaches the sync script the MonitorType.HTTP branch — url +
accepted_statuscodes + ignoreTls (camelCase on the API), matching
drift fields the same way PORT does for hostname/port.
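
The drift check for the HTTP branch can be pictured roughly as below. This is a minimal sketch, not the actual sync script: the function name `http_drift`, the dict shapes, and the sample `accepted_statuscodes` value are assumptions made for illustration. Only the field names (`url`, `accepted_statuscodes`, and `ignore_tls` in the schema versus `ignoreTls` on the API) come from the message above.

```python
# Hypothetical sketch: compare a desired HTTP monitor (schema-side field
# names) against the live monitor (API-side field names) and report drift.

# Schema-side key -> API-side key. The ignore_tls/ignoreTls spelling split is
# taken from the commit message; the rest of this mapping is an assumption.
FIELD_MAP = {
    "url": "url",
    "accepted_statuscodes": "accepted_statuscodes",
    "ignore_tls": "ignoreTls",
}


def http_drift(desired: dict, live: dict) -> dict:
    """Return {api_key: desired_value} for every managed field that differs."""
    drift = {}
    for schema_key, api_key in FIELD_MAP.items():
        want = desired.get(schema_key)
        if want is None:  # null in the schema means "not managed"
            continue
        if live.get(api_key) != want:
            drift[api_key] = want
    return drift


desired = {
    "url": "https://192.168.1.127:8006/",
    "accepted_statuscodes": ["200-299"],  # assumed value, for illustration
    "ignore_tls": True,
}
live = {
    "url": "http://proxmox.reverse-proxy.svc.cluster.local:8006",
    "accepted_statuscodes": ["200-299"],
    "ignoreTls": False,
}
print(http_drift(desired, live))
```

An empty result corresponds to the "already in desired state" case mentioned in the verification note.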

Verified: manually triggered the sync after apply; it found monitor
313 by name and reported "already in desired state".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 05:40:18 +00:00
Viktor Barzin
7c73c69f9b keel: add KEEL_LIFECYCLE_V1 + image-ignore to fire-planner
Completes the enrolled-workload sweep from cdb7d9a8. fire-planner was held
back because a parallel session was mid-apply on it (presence board); that
claim has since cleared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 23:12:49 +00:00
Viktor Barzin
cdb7d9a81a keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads
Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the
inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites
the image tag and restamps keel.sh/update-time, change-cause and the rollout
revision on each poll; without ignore_changes every `tg apply` reverted those
— downgrading the image and forcing a spurious rollout that Keel then re-did.

Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads
drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle:
  - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE
  - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause,
    deployment.kubernetes.io/revision  # KEEL_LIFECYCLE_V1

Verified via `tg plan` on speedtest (single-container: image downgrade
0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container:
both container images no longer drift). AGENTS.md drift-suppression section
updated with the canonical block + marker legend.

fire-planner deferred (parallel session mid-apply per presence board).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 23:09:30 +00:00
Viktor Barzin
4f71ce6bc5 wealth: fix Fidelity Feb-2026 zero-gap + month-boundary contribution smear
Two correctness fixes to the wealth dashboard, found while validating
contribution data against actual-viktor (source of truth):

1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension.
   A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16,
   which cratered net worth and produced a phantom -£97,457 "contribution"
   in Feb then +£100,458 in Mar. Carry the last non-zero day forward across
   the gap (a £0 pension valuation is always a scrape gap, never real).

2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual
   change decomposition" now use consecutive period-end deltas instead of
   within-period first-to-last-obs, so contributions landing near a period
   boundary are no longer dropped/mis-attributed.
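
Both fixes can be illustrated with a small sketch. It is an assumption-laden illustration rather than the real view or dashboard query: the function names and all numbers are made up, and the real logic lives in SQL, not Python.

```python
# Hypothetical sketch of the two ideas above; names and figures are invented.

def locf_zero_gaps(values):
    """Carry the last non-zero value forward across zero entries (LOCF)."""
    filled, last = [], None
    for v in values:
        if v == 0 and last is not None:
            filled.append(last)  # treat 0 as a scrape gap, reuse last value
        else:
            filled.append(v)
            if v != 0:
                last = v
    return filled


def within_period_deltas(periods):
    """Old approach: last observation minus first observation per period."""
    return [p[-1] - p[0] for p in periods]


def period_end_deltas(periods):
    """New approach: difference between consecutive period-end values."""
    ends = [p[-1] for p in periods]
    return [b - a for a, b in zip(ends, ends[1:])]


# A valuation series with a two-day scrape gap.
print(locf_zero_gaps([100, 0, 0, 120]))

# Cumulative contributions: 50 lands between the last observation of the
# first period and the first observation of the second period.
periods = [[100, 100], [150, 150]]
print(within_period_deltas(periods))  # the 50 is invisible to both periods
print(period_end_deltas(periods))     # the 50 is attributed to period two
```

The within-period calculation misses any change that happens between one period's last observation and the next period's first, which is the boundary smear the second fix addresses.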

Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212
RSU-proceeds investment, reconciles with actual-viktor), no spurious
negatives. Brokerage contributions unchanged (already correct).

Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap).

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 22:58:59 +00:00
Viktor Barzin
0044c3a8ea fire-planner: add examples ingest Job (toggled) + weekly CronJob
Adds the K8s plumbing for the Reddit FIRE-examples ingest path:

- ExternalSecret fire-planner-examples-reddit (Reddit OAuth from
  Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}).
- ExternalSecret fire-planner-examples-claude (claude-agent-service
  bearer from Vault secret/claude-agent-service.api_bearer_token).
- kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled
  via var.run_examples_bulk_ingest (default false). Timestamp-named so
  each (true) transition creates a fresh Job; lifecycle ignores the
  name so re-plans don't propose phantom renames.
- kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC
  --top=week --limit=200 incremental run.

Both runners share the env_from plumbing of the existing recompute
CronJob (fire-planner-secrets, fire-planner-db-creds,
wealthfolio-sync-db-creds) plus examples-specific vars
(REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL,
plus the three secret-backed env vars).

Plan-only this commit — actual apply lands in Task 17 after the
ingest image build.
2026-05-28 22:51:14 +00:00
Viktor Barzin
4dff834c8a reduce ingress-dns-sync frequency to hourly [ci skip] 2026-05-28 22:30:08 +00:00
Viktor Barzin
5ac8d625b9 add ingress-dns-sync CronJob to auto-create Technitium CNAME records
Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and
creates matching CNAME records in Technitium if missing. Prevents
the desync where Cloudflare has the DNS record (via ingress_factory)
but internal DNS returns NXDOMAIN because Technitium was never updated.
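
At its core the reconcile step is a set difference between ingress hosts and existing records. The sketch below shows only that step, under stated assumptions: `missing_cnames` is a hypothetical name, the inputs are plain lists, and nothing here calls the Kubernetes or Technitium APIs.

```python
# Hypothetical sketch: which in-zone ingress hosts lack an internal record?

def missing_cnames(ingress_hosts, existing_records, zone="viktorbarzin.me"):
    """Return sorted subdomain hosts of `zone` that have no record yet."""
    suffix = "." + zone
    wanted = {h.lower() for h in ingress_hosts if h.lower().endswith(suffix)}
    have = {r.lower() for r in existing_records}
    return sorted(wanted - have)


hosts = ["forgejo.viktorbarzin.me", "status.viktorbarzin.me", "example.org"]
records = ["forgejo.viktorbarzin.me"]
print(missing_cnames(hosts, records))
```

Because only the missing names are returned, repeating the run after the records exist yields an empty list, which keeps the job idempotent.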

Includes ServiceAccount + ClusterRole for ingress list permissions.
2026-05-28 22:22:42 +00:00
Viktor Barzin
58cced5dab monitoring: render market-vs-salary periodic panels as lines, not bars 2026-05-28 22:18:59 +00:00
Viktor Barzin
388a7f60c7 monitoring: add net-pay-vs-market-gains panels to wealth dashboard
Three new panels comparing employment income to investment returns over
time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest,
portfolio in wealthfolio_sync — separate DBs, so per-target datasources):
- cumulative net take-home pay vs cumulative market gain (line race)
- net pay vs market gain per year (grouped bars)
- net pay vs market gain per month (grouped bars)

Inserted after the "Growth over time" panel; existing panels shifted down,
full-width tables remain at the bottom.
2026-05-28 22:13:44 +00:00
Viktor Barzin
1af412b461 trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt) 2026-05-28 21:40:17 +00:00
Viktor Barzin
188bdd50a0 infra: decommission foolery agent UI
User no longer actively using foolery. Removed:
- TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute,
  Authentik forward-auth integration, K8s Service+Endpoints)
- Devvm systemd unit /etc/systemd/system/foolery.service
- Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery
- Stale foolery reference in .claude/CLAUDE.md auth="required" examples

Uptime Kuma [External] foolery monitor will auto-prune on next
external-monitor-sync reconcile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:08:41 +00:00
Viktor Barzin
8b4bcc0ca2 blog: Anubis carve-out for /net-diag.sh
curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis.
Adds a second ingress_factory pointing /net-diag.sh at the bare blog
service (port 80), keeping every other path on the existing Anubis
chain. Path-prefix specificity wins in Traefik routing — / stays gated.

dns_type = "none" because the apex viktorbarzin.me CF record already
exists from the main ingress.

Doc update: CLAUDE.md Anubis section notes blog now follows the
wrongmove carve-out pattern.
2026-05-28 13:22:57 +00:00
Viktor Barzin
fc5a4b66ad monitoring: exclude catchall-error-pages from HighService4xxRate
The catchall-error-pages IngressRoute matches HostRegexp(^(.+\.)?
viktorbarzin\.me$) at priority=1 — it's the wildcard handler that
returns 404 for any unmatched hostname (typos + scanner traffic).
By design its 4xx rate sits at ~100%, so HighService4xxRate was a
permanent false positive for traefik-catchall-error-pages-*@kubernetescrd.

Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory
(services with legitimately high 4xx counts).
2026-05-27 19:46:40 +00:00
github-actions[bot]
b8cd1219a6 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-05-27 18:46:19 +00:00
root
d0ede3773b Woodpecker CI deploy [CI SKIP] 2026-05-27 18:38:09 +00:00
Viktor Barzin
ee159b02ba nextcloud: disable Keel auto-upgrades
Keel bumped library/nextcloud :32.0.3-apache → :32.0.9-apache on
2026-05-26 19:42 UTC. The new image needs `occ upgrade` to migrate
the DB schema, which Keel does not run, so Nextcloud landed in
maintenance mode (needsDbUpgrade=true) and stayed there for ~22h —
external probes saw 503, ExternalAccessDivergence kept firing.

Disable Keel for this workload:
- Drop the `keel.sh/enrolled=true` label from the namespace so
  Kyverno's `inject-keel-annotations` policy no longer matches.
- Layer `keel.sh/policy=never` label + annotation onto the
  Helm-managed Deployment via `kubernetes_labels` /
  `kubernetes_annotations` (the chart at 8.8.1 doesn't expose
  Deployment-level commonLabels/commonAnnotations). Keel reads the
  annotation; the label is defense-in-depth for the Kyverno
  exclude rule should the namespace ever get re-enrolled.

Verified: Keel logged `image no longer tracked, removing watcher`
within seconds of the annotation landing, and `tg plan` is clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:37:05 +00:00
Viktor Barzin
d72c7169c0 monitoring: route proxmox-exporter to scrape_slow job (fix flapping alerts)
PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host
(1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's
default 10s scrape_timeout and flapping ProxmoxMetricsMissing +
ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape
to prometheus.io/scrape_slow so the scrape moves to the existing
kubernetes-service-endpoints-slow job (5m interval, 30s timeout).
2026-05-27 18:36:11 +00:00
Viktor Barzin
f121bee121 fire-planner: update recompute CronJob comment to reflect lazy refresh
As of fire-planner@4da58fe the account_snapshot cache is refreshed
lazily on each /networth, /networth/history, /progress request when
older than NETWORTH_CACHE_TTL_DAYS (default 1). The recompute CronJob
runs Monte Carlo only — no longer assumed to coordinate with the
wealthfolio-sync schedule.

[ci skip]
2026-05-27 18:23:21 +00:00
Viktor Barzin
4b77aa65a1 broker-sync: unsuspend broker-sync-imap (IE structurally skipped at code level now)
E2E test (manual one-shot of all 3 broker-sync CronJobs) confirmed
idempotent behaviour with zero new activities and net worth unchanged.
The IE-via-IMAP path is now default-skipped inside
broker_sync.providers.imap (commit 0d23487), so unsuspending the cron is
safe — Schwab vests get parsed, IE messages get ie_skipped at the parser
level regardless of which entry point triggers the run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 17:57:26 +00:00
Viktor Barzin
06fb1f9ea9 broker-sync: update imap-cron comment to reflect default-skip IE (post-incident) 2026-05-27 17:25:42 +00:00
Viktor Barzin
501f2c6b37 broker-sync: re-suspend broker-sync-imap CronJob
39 IMAP-source InvestEngine BUYs + their cash-flow DEPOSITs were
re-inserted into Wealthfolio at 2026-05-27T09:22:18 UTC — exactly the
rows the £252k dedup removed yesterday. The broker-sync-imap cron at
02:30 UTC today correctly logged `ie_skipped=53`, so the IMAP cron itself
isn't the immediate culprit, but the rows DO carry broker-sync's IMAP-path
signature (`[rfc2822-v1]` notes + `sync:imap:invest-engine:...` cash-flow
markers).

Suspending kills one possible vector while a researcher subagent
investigates the root cause. Schwab vest ingestion is the only function
lost; can be unsuspended once the IE re-dup source is identified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 17:09:09 +00:00
Viktor Barzin
54919e3abc trading-bot: TRADING_SLACK_BOT_TOKEN + TRADING_SLACK_CHANNEL env 2026-05-27 10:06:51 +00:00
Viktor Barzin
17c59a280b broker-sync: drop IBKR_ACCOUNT_ID env (now derived via ensure_account) 2026-05-27 09:25:02 +00:00
Viktor Barzin
6d13ba12da broker-sync: add fsGroup=10001 to trading212 cron pod spec
Without supplementary GID 10001, the broker user (uid=10001 gid=999)
cannot write sqlite3 journal files next to /data/sync.db. The cron
hits a "readonly database" error in dedup.record() AFTER successfully
importing fills to Wealthfolio — so data lands but the dedup store
never updates, leaving every subsequent run to re-fetch the same
window and exit 1 again. Same fix that's already on imap + ibkr crons.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:20:16 +00:00
root
9e8314183f Woodpecker CI deploy [CI SKIP] 2026-05-26 22:53:29 +00:00
Viktor Barzin
9b68dbc788 wealthfolio: dav_corrected — also exclude Schwab synthetic cash flows
The Net-contribution chart was showing huge negative monthly swings
because broker-sync emits a synthetic cash-flow-match DEPOSIT for every
vest BUY and a WITHDRAWAL for every sell-to-cover SELL. Cumulatively
WITHDRAWALs ($1.06M) exceed DEPOSITs ($498k) — the user perceives this
as having "withdrawn" money even though they never moved cash out of
Schwab. The proceeds left for the bank and surface as real DEPOSITs on
the next account (IE/T212) that the user transfers them to.

Extend the dav_corrected view to subtract Schwab cash-flow-match flows
(DEPOSIT-positive, WITHDRAWAL-negative, account-scoped) in addition to
the existing Fidelity unrealised-gains-offset correction. InvestEngine
and Trading212 cash-flow-match entries are REAL deposits and must be
preserved — scope by Schwab account_id only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 22:52:17 +00:00
Viktor Barzin
30ba6860b9 broker-sync: add IBKR Flex daily CronJob (02:00 UK) 2026-05-26 22:34:54 +00:00
Viktor Barzin
2df9700d70 trading-bot: add slack_webhook_url ESO secret + env var 2026-05-26 21:55:59 +00:00
Viktor Barzin
15c88bc683 keel: belt-and-suspenders opt-out for mysql/redis/nvidia-exporter
After re-enabling Keel with `policy: patch` (commit f325b949), 3 of the
60 first-hour bumps broke things and need explicit cluster-wide opt-out
so future Kyverno reconciles can't put them back under auto-update:

- `dbaas/mysql-standalone`: patch-bumped `mysql:8.4.8 → :8.4.9` and the
  DD upgrade stalled (we explicitly track that as beads `code-963q` —
  the 8.4.9 jump needs a wipe+reinit, not a rolling upgrade). The
  StatefulSet already had `annotation=never` from TF but was missing the
  LABEL — Kyverno's selector exclude reads the LABEL, so a reconcile
  that dropped the annotation could resume auto-update. Added the LABEL.

- `redis/redis-v2`: patch-bumped `redis:8-alpine → :8.0.6-alpine` and
  the new image rejected the `aof-load-corrupt-tail-max-size` directive
  from commit 1eee56d0 → redis-v2-2 CrashLoopBackOff. Plus :8.0.6 is
  semantically older than :8-alpine (which resolves to :8.6.2) — same
  Keel tag-picking pathology as the 2026-05-26 morning incident, just
  in a different shape. LABEL + ANNOTATION both added.

- `nvidia/nvidia-exporter`: Keel rewrote `:latest → :4.5.2-4.8.1-ubuntu22.04`
  and the new dcgm-exporter OOMKilled at the 192Mi memory limit
  (4 restarts before I caught it). Added LABEL + ANNOTATION for opt-out,
  AND bumped memory request/limit 192Mi → 256Mi/512Mi so the bumped image
  doesn't OOM (older versions fit in 192Mi; the bumped one needs ~250Mi
  steady-state).

The 56 other Keel bumps in that 10-minute window (coredns 1.12.1→1.12.4,
kyverno 1.16.1→1.16.4, nextcloud 32.0.3→32.0.9, grafana 12.3.1→12.3.6,
cnpg, mailserver, csi-nfs, metrics-server, etc.) landed cleanly — the
`patch` policy is the right default. Per-workload `never` opt-out is
the maintenance cost.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 21:53:10 +00:00
Viktor Barzin
498b01396c status-page: disable pusher CronJob to stop sdc write storm
The CronJob ran every 5 min on a vanilla python:3.12-alpine image, doing
`apk add git` + `pip install uptime-kuma-api` from scratch on every
invocation. Caught at ~3.2 MB/s on k8s-node4's root LV, contributing to
~8 MB/s sustained on the pve-data thin pool (sdc) — ~804 GB written
over the prior 18 h.

Commented out the kubernetes_cron_job_v1.status_page_pusher resource
(kept ns / SA / RBAC / ConfigMap intact for trivial revert). Re-enable
once a custom image with git + uptime-kuma-api baked in is published so
no per-run cold install happens.

status.viktorbarzin.me stops updating until then.
2026-05-26 21:40:14 +00:00
Viktor Barzin
84404fd0d6 broker-sync: skip InvestEngine in IMAP CronJob
Sets BROKER_SYNC_IMAP_EXCLUDE_PROVIDERS=invest-engine on broker-sync-imap,
so the IMAP path no longer parses InvestEngine emails (handled by the
bearer-token API path now). Stops duplicate BUYs in Wealthfolio.

The terraform fmt run also realigned two adjacent label assignments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 21:19:31 +00:00
root
2becd0ff6f Woodpecker CI deploy [CI SKIP] 2026-05-26 21:09:48 +00:00
Viktor Barzin
8605181c53 trading-bot: Phase 2 — add trade-executor + flip kevin kill-switch 2026-05-26 21:07:37 +00:00
Viktor Barzin
f325b949be keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged
Re-enables Keel after the 2026-05-26 emergency stop, with a safer default.

Switch Kyverno-injected default from `force + match-tag=true` (proven
unreliable — it rewrote tag strings cluster-wide despite the design intent)
to `patch`, which is semver-parser-bounded:

  - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never
    1.3.x or 2.x — the parser does the math, not string compare).
  - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are
    IGNORED entirely. No tag rewriting under any code path.
  - 151 stale `force` annotations migrated to `patch` cluster-wide
    during this apply (anchor `+()` dropped, then re-added).
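
The rule described in the first two bullets can be sketched as follows. This is an illustration of the stated behaviour, not Keel's implementation: the strict three-part parse, the function names, and the decision to treat suffixed tags as non-semver are all assumptions of this sketch.

```python
import re

# Hypothetical sketch of a "patch-only" update rule as described above.
# Strict MAJOR.MINOR.PATCH only (optional leading "v"); anything else,
# including suffixed tags, is treated as non-semver here by assumption.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag):
    """Return (major, minor, patch) as ints, or None for non-semver tags."""
    m = _SEMVER.match(tag)
    return tuple(int(x) for x in m.groups()) if m else None


def patch_bump_allowed(current, candidate):
    """True only for a higher patch within the same major.minor."""
    cur, cand = parse(current), parse(candidate)
    if cur is None or cand is None:
        return False  # non-semver tags are ignored entirely
    return cand[:2] == cur[:2] and cand[2] > cur[2]


print(patch_bump_allowed("1.2.3", "1.2.4"))    # same major.minor, higher patch
print(patch_bump_allowed("1.2.3", "1.3.0"))    # minor bump
print(patch_bump_allowed("latest", "1.2.4"))   # non-semver current tag
print(patch_bump_allowed("1.2.9", "1.2.10"))   # numeric compare
print("1.2.10" > "1.2.9")                      # plain string compare
```

The last two calls show why parsing matters: compared as strings, `1.2.10` sorts below `1.2.9`, while the numeric comparison orders them correctly.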

Live state after this commit:
  0 workloads on `force`, 209 on `patch`, 22 on `never`.
  Keel deployment back to 1/1 on `:0.21.1`.

Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation
mutated to `patch` during the migration despite Kyverno's
matchLabels-based exclude rule — appears to be a quirk of
`mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched
all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then
restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno
reconciles preserve them.

Also fixes CI build-cli workflow which was blocked by
`deny-privileged-containers` since wave 1 enforce flip on 2026-05-18:
woodpecker namespace added to the shared security_policy_exclude_namespaces
list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use).

The `default` workflow (terragrunt apply) was already passing — only the
parallel `build-cli` workflow (which builds the infra-cli docker image) was
failing, but it took the overall pipeline status down with it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 19:06:51 +00:00
Viktor Barzin
1eee56d0ba redis: tolerate up to 1KB of AOF tail corruption on load
Post-2026-05-26 unclean node2 reboot left redis-v2-2's incremental AOF
truncated at offset 84799139. With aof-load-corrupt-tail-max-size at its
default 0, redis refuses to load any corruption and crashloops. Setting
1024 lets it truncate the corrupted tail and continue, which is the
right call for a non-source-of-truth cache fronted by sentinel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:48:58 +00:00
Viktor Barzin
60b2b1cdfc cluster-health: emergency-stop Keel + roll back image downgrades + quota raises
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).

Changes:

* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
  0/0. Keep off until either match-tag is root-caused or every enrolled
  workload migrates to a content-addressed (SHA) pin.

* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
  bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
  deployment label (matches Kyverno's exclude rule so the inject-keel-
  annotations ClusterPolicy stops mutating) AND the annotation (so Keel
  itself respects). Removed keel.sh/policy from lifecycle.ignore_changes
  so TF owns it as `never` and can't drift back to `force`.

* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
  on both seed-config and workbench containers (was :latest, Keel rolled
  to :0.1.0).

* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
  by Keel from the prior live :3.2.1).

* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
  grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
  per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
  DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
  100% and blocked every new pod create with FailedCreate. Raising the cap
  unblocked the four affected DaemonSets in one shot.

* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
  32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
  face-detection burst behaviour.

* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
  updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
  (matches the 21 other stacks that already declare it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:48:50 +00:00
Viktor Barzin
b3dcccfc41 vaultwarden: track :latest tag for Keel auto-upgrade (was 1.35.7)
Earlier today Keel's hourly poll caught vaultwarden's deployment in a
window where the `keel.sh/match-tag` annotation wasn't set, fell into
'watch repository tags' mode, and rewrote 1.35.7 -> 1.21.0. Vaultwarden
1.21.0 doesn't have the API endpoints the modern Bitwarden clients call
(/identity/accounts/prelogin/password, /api/devices/knowndevice,
/api/config), so the Chrome extension started 404-ing on login.

Same race shape as the 2026-05-17 authentik/pgbouncer incident. The
fundamental issue: `policy: force` on a semver-pinned tag is unsafe
because Keel happily rewrites the tag string if it can't find a stable
'current tag' to digest-watch.

Fix: switch to `:latest` (the mutable tag vaultwarden publishes for the
newest stable release). Keel now digest-watches `:latest` (safe mode)
and rolls forward on each upstream release. Matches cluster convention
(128 other Keel-managed workloads use the same `:latest` + force +
match-tag pattern).

Also added imagePullPolicy=Always (required with :latest so the kubelet
revalidates the manifest on each rollout instead of using a cached
layer), and extended the lifecycle.ignore_changes to cover the
match-tag annotation and kubernetes.io/change-cause (Keel rewrites
this on every rollout).

Current `:latest` digest -> vaultwarden 1.36.0 (released 2026-05-03).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:26:36 +00:00
Viktor Barzin
8ed427a7e4 cloud-init: hands-off k8s worker provisioning + 5 bug fixes
Goal: re-clone the worker template, boot, and have it appear as `kubectl
get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker
NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five
distinct ways on a clean boot.

Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today):

1. `indent(6, containerd_config_update_command)` indented the bodies of
   `cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so
   [plugins.*] TOML sections landed in /etc/containerd/config.toml at
   col 6 — containerd refused to parse them. Source is now a normal
   .sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`)
   base64-embedded into `write_files`; YAML whitespace never touches
   the heredoc bodies.

2. The same script tried to `cat >> /etc/containerd/config.toml`
   `[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd
   v2.2.4's `config default` ALREADY emits. Result: `toml: table …
   already exists`. Patched with sed-in-place overrides instead.

3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from
   the containerd setup script — BEFORE `kubeadm join` writes that
   file. Sed aborted with "No such file or directory", `set -e` killed
   the script, post-script cloud-init steps kept going (cloud-init
   doesn't stop on runcmd failure). Split into a dedicated
   `k8s-node-post-join-tune.sh` invoked AFTER kubeadm join.

4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE
   kubeadm join. kubelet defaults to failSwapOn=true → exited 1
   immediately. Replaced the swap setup with `swapoff -a` (node4
   already runs this way and the cluster is fine).

5. Without `hostname:` in the shared user-data snippet, Proxmox's
   auto-generated meta-data does NOT include local-hostname when
   `cicustom user=…` is set — so cloud-init falls back to the cloud
   image's default `ubuntu` and `kubeadm join` registers the wrong
   node name. `provision-k8s-worker` now writes a per-node
   `<NAME>-meta.yaml` snippet and passes both via
   `cicustom user=…,meta=…`.
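
Bug 1 above is easier to see with a small reproduction. The sketch approximates Terraform's `indent()` (which pads every line except the first) in Python and contrasts it with a base64 round trip. `tf_indent` is a hypothetical helper and the heredoc is abbreviated to a single body line for illustration.

```python
import base64

# Hypothetical approximation of Terraform's indent(): pad all lines but the
# first with the given number of spaces.
def tf_indent(spaces, text):
    first, *rest = text.split("\n")
    pad = " " * spaces
    return "\n".join([first] + [pad + line for line in rest])


# Abbreviated stand-in for the setup command; the real script is much longer.
script = "\n".join([
    "cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'",
    '[plugins."io.containerd.gc.v1.scheduler"]',
    "CONTAINERD_GC",
])

# Indenting for YAML placement also shifts the heredoc body.
indented = tf_indent(6, script)
print(indented)

# Base64 embedding carries the bytes through unchanged.
encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == script)
```

The indented variant moves the section header (and the heredoc terminator) six columns to the right, while the base64 path reproduces the script byte for byte.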

Other improvements rolled in while fixing the above:

- `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`,
  added today) instead of `var.ssh_public_key`. The last
  `terragrunt apply` was run with that var empty, leaving the snippet's
  `ssh_authorized_keys` with a single blank entry; the wizard user
  was effectively locked out of every fresh node.
- `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf`
  with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it,
  systemd-resolved only consulted Technitium (link-level), which
  returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from
  the Forgejo registry then failed DNS until I patched it manually
  on node5.
- k8s apt repo bumped v1.32 → v1.34 (matches cluster).
- The containerd setup script now creates hosts.toml for forgejo,
  quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4
  had these added by hand post-bootstrap; now they're baked in.
- `config_path` sed matches both `""` (containerd v1) and `''`
  (containerd v2.x). Without the v2 match, the certs.d mirror dir was
  silently ignored.
- `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI
  topology labels (region/zone, max-volume-attachments=28) apply on
  next `tg apply`.
- `stacks/infra/main.tf` shed the 160-line inline containerd setup
  heredoc — that whole thing now lives in the module as a .sh file.

Known unsolved gaps (deferred):

- iscsid restart hangs ~90s on first boot before SIGKILL releases it
  (systemd-resolved restart kicks iscsid via dependency). Adds wall-
  clock time but doesn't block the join.
- `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi`
  afterward, so the CSI topology labels need a manual apply after
  the node joins. Solving cleanly needs the CSI map to derive from
  `kubectl get nodes` instead of a static local — separate work.
- `var.containerd_config_update_command` is now ignored when
  is_k8s_template=true (replaced by the bundled .sh file). Variable
  kept with a deprecation note to avoid breaking other call sites.

E2E proof: k8s-node6 (VMID 206) boots hands-off from
`provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as
`kubectl get nodes …Ready` ~7 min later (most of which is the apt
package_upgrade — separate optimization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 11:52:00 +00:00
Viktor Barzin
bb9d8f1b38 kyverno: GPU priority mutate uses add (was replace) — fixes silent skip
The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902
op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate)
have no priorityClassName field at all — replace requires the path to
exist, so the patch fails with "doc is missing key: /spec/priorityClassName"
and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier)
gets a chance to add the field.

Result: GPU pods never got priorityClassName set, sat at priority=0, and
could not preempt lower-tier pods on the GPU node. Observed today on
frigate post-node4-recovery — pod stayed Pending with "Preemption is
not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000)
occupied node1's memory budget.

Fix: op=add for all three paths. add works whether or not the key is
present, so the policy is robust to the upstream pod shape.
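
The difference between the two operations can be sketched in a few lines. This is a simplified model of JSON Patch (RFC 6902) semantics for object members only, not Kyverno's implementation. The `apply_op` helper, the `PatchError` class, and the `gpu-workload` class name are hypothetical and exist only for this example.

```python
import copy


class PatchError(Exception):
    """Raised when a patch operation cannot be applied."""


def apply_op(doc: dict, op: str, path: str, value):
    """Apply one JSON Patch 'add' or 'replace' to an object member.

    Simplified: object keys only (no arrays, no ~0/~1 escapes).
    """
    if op not in ("add", "replace"):
        raise PatchError(f"unsupported op: {op}")
    *parents, key = path.lstrip("/").split("/")
    result = copy.deepcopy(doc)
    target = result
    for part in parents:
        target = target[part]  # parent objects must exist for both ops
    # replace requires the target member to exist; add does not.
    if op == "replace" and key not in target:
        raise PatchError(f"doc is missing key: {path}")
    target[key] = value
    return result


# Incoming pod with no priorityClassName field at all.
pod = {"spec": {"containers": [{"name": "frigate"}]}}

replace_error = None
try:
    apply_op(pod, "replace", "/spec/priorityClassName", "gpu-workload")
except PatchError as err:
    replace_error = str(err)
print("replace:", replace_error)

patched = apply_op(pod, "add", "/spec/priorityClassName", "gpu-workload")
print("add:", patched["spec"]["priorityClassName"])
```

In this model `add` also overwrites an existing member, which is why it is safe regardless of the incoming pod shape.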

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:04:51 +00:00
Viktor Barzin
12b4f6f81a dbaas: require pod anti-affinity on pg-cluster (one PG per node)
Default CNPG affinity was `preferred` (soft). During the 2026-05-26
node4 outage, all 3 pg-cluster pods drifted onto k8s-node1. Losing
that node would have taken the whole PG cluster down (no quorum), and
the 9.2 GiB pg-cluster footprint was the dominant reason frigate
couldn't fit on the GPU node.

With 3 instances + 4 worker nodes, `required` is safe under 1-node
drain (3 distinct nodes always available, even excluding the drained
one).
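
The drain arithmetic above can be checked with a small sketch. It assumes every worker node is eligible for PG pods and has enough capacity, which the scheduler does not guarantee. The function name is hypothetical.

```python
def required_anti_affinity_fits(instances: int, worker_nodes: int, drained: int = 0) -> bool:
    """True if every instance can land on its own schedulable node."""
    return instances <= worker_nodes - drained


# 3 instances, 4 workers, one node drained: 3 nodes remain, so it fits.
print(required_anti_affinity_fits(3, 4, drained=1))
# Two nodes out at once: only 2 remain, so one instance stays Pending.
print(required_anti_affinity_fits(3, 4, drained=2))
```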

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:00:37 +00:00
Viktor Barzin
400ee88967 state(dbaas): update encrypted state 2026-05-26 08:59:40 +00:00
root
daa41a2eb1 Woodpecker CI deploy [CI SKIP] 2026-05-26 08:29:09 +00:00
Viktor Barzin
00bbbe0838 url/shlink-web: containerPort 8080 -> 80
shlinkio/shlink-web-client:0.1.1 listens on port 80 (nginx default),
not 8080 like the prior :latest images. Keel auto-bumped the tag
on 2026-05-23; liveness/readiness probes have been failing ever since
because they still hit :8080. The pod was stuck restarting and the
DeploymentReplicasMismatch alert fired.

Aligns containerPort + both probes + service target_port with the image.
2026-05-26 08:19:24 +00:00
Viktor Barzin
44c3770a5c infra: pull all VMs out of Terraform — telmate provider can't represent them safely
The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached
disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent
fields back from live state — every plan after a qm-set cap is applied
proposes to "fix" mbps 0 → N and the apply errors with the spurious
"the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes
does NOT block either failure mode.

Decision: stop trying to manage Linux VMs in this stack. The cloud-init
bootstrap stays in TF (via k8s-node-template, non-k8s-node-template,
docker-registry-template above), so a fresh node still clones the right
template and runs the same bootstrap. VM lifecycle stays in the Proxmox
UI. I/O caps are managed via qm-set on the PVE host (idempotent script
at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j).

Removed from TF state + HCL:
  - module "k8s-master"          (vmid 200)
  - module "k8s-node2"            (vmid 202) — pre-existing drift, never in state
  - module "docker-registry-vm"   (vmid 220) — was in state, hit refresh bug

Already hand-managed (never in HCL):
  - 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough),
    203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10.

Live I/O caps (qm set, all verified):
  102=60/60  103=40/40  200=100/60  201=150/120  202=150/120
  203=150/120  204=150/120  220=40/40

Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox
provider migration — telmate can't represent these VMs at all).

Closes: code-75ds
2026-05-26 07:12:46 +00:00
Viktor Barzin
9b75b2817b cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes)
Two bugs found while rebuilding k8s-node4 (2026-05-26):

1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
   interpolated a multi-line heredoc as bare list-item content. The
   trailing lines lost their list-item prefix, breaking cloud-config
   parsing. Cloud-init silently fell back to the minimal default
   (hostname + package_upgrade only) — kubeadm join, containerd config,
   kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
   in `cloud-init status`.

   Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.

2. **containerd v2 single-quote mismatch**: `containerd config default`
   in v2 writes `config_path = ''` (single quotes); v1 writes
   `config_path = ""` (double quotes). The sed pattern matched only double
   quotes → silent no-op on fresh containerd 2.x nodes → registry-mirror
   hosts.toml ignored → all image pulls hit upstream registries →
   DNS-to-MetalLB chicken-and-egg loop.

   Fix: match any value with `config_path = .*`.
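
The runcmd breakage in bug 1 can be reproduced outside Terraform. The sketch below approximates Terraform's `indent()` in Python and renders the list item both ways. The sample command and the two-space list layout are assumptions, not the real template.

```python
def tf_indent(spaces: int, text: str) -> str:
    """Approximate Terraform's indent(): pad every line except the first."""
    return text.replace("\n", "\n" + " " * spaces)


# Hypothetical multi-line command standing in for the real heredoc.
command = "cat > /tmp/setup.sh <<'EOF'\necho configuring containerd\nEOF"

# Bare interpolation: only the first line carries the "- " prefix, so the
# remaining lines fall back to column 0 and break the YAML list.
broken = "runcmd:\n  - " + command + "\n"

# Literal block: "- |" opens a block scalar and indent(4, ...) keeps every
# continuation line inside it.
fixed = "runcmd:\n  - |\n    " + tf_indent(4, command) + "\n"

print(broken)
print(fixed)
```

In the fixed rendering every line of the command sits four spaces deep under `- |`, so the whole heredoc is parsed as one list item.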

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 07:06:50 +00:00
Viktor Barzin
445feb118f infra: per-VM I/O caps + terragrunt v0.77 plumbing + state recovery
WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
  required_providers. Other stacks just don't instantiate a provider
  block — harmless. Replaces the same-name override trick the infra
  stack used to do, which stopped working under Terragrunt v0.77
  ("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
  writes proxmox_provider.tf with the provider config; credentials
  read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
  = uncapped), wired into scsi0/scsi1 disk{} blocks as
  mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
  extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
  plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
  mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.

WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
  k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
  k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
  provider mangled proxmox-csi PVC slots on apply for vmid 202 and
  203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
  the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
  the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
  etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
  loss, K8s CSI reconciled PVC attachments within minutes. Removed
  the 7 imports from state via `terraform state rm` and re-encrypted.
  Tracked in beads code-xzbl: blocked on bpg/proxmox provider
  migration (telmate has the same dynamic-disk defect that bit us on
  iSCSI back in 2026-04-02; see memory id=539).

LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC):
  102 devvm 60/60   103 home-assistant 40/40   200 k8s-master 100/60
  201 k8s-node1 150/120   202 k8s-node2 150/120   203 k8s-node3 150/120
  204 k8s-node4 150/120   220 docker-registry 40/40
  (pfSense 101 BSD + Windows10 300 intentionally out of scope.)

PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
  ever imported into TF state — confirmed against the SOPS-encrypted
  state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
  This commit leaves both declarations in place but does NOT import
  them; that's part of the code-xzbl follow-up.

Closes: code-s9xr
2026-05-26 06:46:47 +00:00
Viktor Barzin
07bd2e0017 onlyoffice: restore replicas 0 → 1 post IO-storm recovery
Cluster is fully stable (all 5 nodes Ready, vaultwarden recovered,
node4 rebuilt 2026-05-26). Removing the TEMP-SCALEDOWN guard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 03:08:17 +00:00