Commit graph

3742 commits

Author SHA1 Message Date
root
9e2163040b Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
Viktor Barzin
d6590612b2 immich: bulk-import Anca's Elements photo archive into her account
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.

Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
  consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
  posting to immich-server.immich.svc:2283 with Anca's API key (synced
  via the existing immich-secrets ExternalSecret from
  secret/immich.anca_api_key). Filters to image extensions, bans the
  non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
  docs, etc.), puts every asset in the album "Poze (Elements)". Default
  `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
  procedure (no pve-host TF stack exists for this).

Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:12:30 +00:00
Viktor Barzin
4d756be4f5 backup: consolidate to one local-mirror script + invert offsite filter
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.

This commit collapses both into a clean 3-2-1:

  Copy 1 (sdc):       live /srv/nfs/* + cluster block PVCs
  Copy 2 (sda):       /mnt/backup/{pvc-data,sqlite-backup,pfsense,
                                   pve-config,<critical-nfs>/}
                      ← daily-backup + nfs-mirror (one script each)
  Copy 3 (Synology):  /Backup/Viki/{pve-backup,nfs,nfs-ssd}
                      ← offsite-sync-backup Step 1 (sda → Synology)
                        + Step 2 (sda-BYPASS paths only → Synology direct)

scripts/nfs-mirror.{sh,service,timer}:
  New consolidated weekly mirror. Replaces anca-elements-mirror (to be
  removed in a follow-up once the current in-flight rsync completes,
  parity is verified, and the Synology source-of-truth is deleted). Single
  rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
  drops paths not worth a local 2nd copy: immich (1.2T — too big),
  frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
  audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
  temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.

scripts/offsite-sync-backup.sh:
  Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
  anca-elements/`, it now `--include`s only the sda-BYPASS paths
  (immich, frigate, prometheus, *-backup, …). The bypass-include
  regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
  complementary and any drift creates either gaps or duplication on
  Synology. Comment in the script flags this.

monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.

docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.

NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
2026-05-24 12:49:20 +00:00
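The lockstep requirement in 4d756be4f5 (nfs-mirror's EXCLUDES versus the Step 2 bypass includes) lends itself to a mechanical check. A minimal sketch, assuming both lists can be reduced to sets of top-level directory names and that the two sets are meant to be identical; `lockstep_drift` is a hypothetical helper, and the real scripts hold rsync patterns rather than Python sets.

```python
def lockstep_drift(mirror_excludes, bypass_includes):
    """Compare the local-mirror exclude list with the direct-to-Synology include list.

    Returns (gaps, duplicates):
      gaps       -- excluded from the local mirror but not sent directly offsite
      duplicates -- sent directly offsite although the local mirror also keeps them
    Both lists are empty when the two inputs are in lockstep.
    """
    mirror_excludes = set(mirror_excludes)
    bypass_includes = set(bypass_includes)
    gaps = sorted(mirror_excludes - bypass_includes)
    duplicates = sorted(bypass_includes - mirror_excludes)
    return gaps, duplicates
```

A non-empty first element means some path would reach neither sda nor Synology; a non-empty second element means some path would reach Synology by both routes.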
Viktor Barzin
416c2a0468 monitoring: add AncaElementsMirror{Stale,Failing} alerts
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Layer 3a (anca-elements local mirror) now has the same alert coverage
as offsite-sync-backup:
- AncaElementsMirrorStale fires if last_run_timestamp is older than 16d
  (2 weekly cycles, matches the 8d → 9d slack used elsewhere)
- AncaElementsMirrorFailing fires if last_status != 0

BackupDiskFull (existing) covers the sda fill-up risk at 85%.

Not applied this commit — pick up on next monitoring stack apply.
2026-05-24 11:55:19 +00:00
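The two conditions in 416c2a0468 reduce to a timestamp comparison and a status check. A rough sketch in Python rather than the actual Prometheus rules, which this log does not show; `mirror_alerts` and its parameters are hypothetical names, and timestamps are assumed to be Unix seconds.

```python
STALE_AFTER_SECONDS = 16 * 24 * 3600  # the 16d threshold from the commit message


def mirror_alerts(last_run_timestamp, last_status, now):
    """Return the names of the alerts that would be firing at time `now`."""
    alerts = []
    # Stale: the last run is more than 16 days in the past.
    if now - last_run_timestamp > STALE_AFTER_SECONDS:
        alerts.append("AncaElementsMirrorStale")
    # Failing: the script reported any non-zero status on its last run.
    if last_status != 0:
        alerts.append("AncaElementsMirrorFailing")
    return alerts
```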
Viktor Barzin
6db64fe060 anca-elements: weekly local mirror sdc → sda (replaces Synology as 2nd copy)
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Synology is being removed as a host for the Anca/Elements archive
(770G). /srv/nfs/anca-elements on PVE becomes the source of truth;
sda /mnt/backup/anca-elements becomes the single-disk-failure mirror.
No offsite for this archive — by design.

- scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent,
  pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to
  Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort).
- .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout.
- .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay.

Deployed to PVE host; timer enabled; initial 770G sync running in
background. Synology original to be deleted after first run completes
and parity is verified.

docs/architecture/backup-dr.md: documents Layer 3a + updated path
exclusion rationale (PVE is now upstream, not downstream).
2026-05-24 11:51:52 +00:00
Viktor Barzin
34f8c0f537 docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
  section documenting mount-per-archive + applicable_users model,
  why mount-level ACL beats Files Access Control on NC 30/31, the
  manifest shape (with current applicableUsers + enableSharing
  fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
  a new directory under /srv/nfs/* to specific NC users via the
  bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
  /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
  Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
  resumable). The PVE replica is what the NC /anca-elements mount
  serves; the offsite-sync pipeline excludes this path (committed
  earlier this session) so we don't write it back to Synology

NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
2026-05-24 11:45:01 +00:00
Viktor Barzin
c624caf65a nextcloud(external_storage): add per-mount enableSharing option
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Lets admin natively share folders from inside an external mount with
internal users/groups or via public link. The two PVE pool browsers
(visible to admin only) get enableSharing=true so they can act as a
"share-from picker" over /srv/nfs and /srv/nfs-ssd; /anca-elements
stays false so anca manages re-sharing inside her own view.

- Manifest schema gains enableSharing on rootMounts + archiveMounts.
- Bootstrap Job adds sync_option() and reconciles enable_sharing via
  occ files_external:option (idempotent — occ no-ops same-value set).
2026-05-24 11:39:16 +00:00
root
37e563d5a9 Woodpecker CI deploy [CI SKIP] 2026-05-24 11:31:53 +00:00
Viktor Barzin
cb1a34fd00 nextcloud: expose PVE NFS roots + /anca-elements via Files External
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Mounts the Proxmox host NFS exports (/srv/nfs and /srv/nfs-ssd) into
the NC pod and surfaces them through occ files_external:create:

- /PVE NFS Pool      → /mnt/pve-nfs       (admin group only)
- /PVE NFS-SSD Pool  → /mnt/pve-nfs-ssd   (admin group only)
- /anca-elements     → /mnt/pve-nfs/anca-elements  (admin, anca users)

Mount visibility is controlled by occ files_external:applicable; no
Files Access Control. ACL state is reconciled idempotently by a
bootstrap Job that diffs desired vs current applicable_users /
applicable_groups (via files_external:list --output=json).

Bootstrap fixes vs initial design:
- Sync loop used `[ -n "$U" ] && cmd` which returns 1 on empty input,
  triggering set -e on no-op re-runs. Switched to process substitution
  `< <(jq ...)` so empty diff -> loop body never runs -> 0 exit.
- RBAC missed `watch` verb (kubectl wait spammed reflector errors).
- Manifest used display-name "viktor" instead of NC username "admin"
  for the /anca-elements applicable list.

Chart values: added two PV-backed volume mounts at /mnt/pve-nfs[+ssd]
and pinned securityContext to fsGroup=33 with fsGroupChangePolicy:
OnRootMismatch (chart default Always would recurse 600k+ files on
every pod restart).
2026-05-24 11:27:26 +00:00
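The reconciliation described in cb1a34fd00 (diff desired vs current applicable users, do nothing when they already match) can be sketched as a set difference. This is an illustration only: the real bootstrap Job is a shell script driving `occ` and `jq`, and `applicable_diff` is a hypothetical name.

```python
def applicable_diff(desired, current):
    """Return (to_add, to_remove) so that applying both makes current == desired."""
    desired = set(desired)
    current = set(current)
    to_add = sorted(desired - current)
    to_remove = sorted(current - desired)
    return to_add, to_remove


def reconcile(desired, current, apply):
    """Call apply(action, user) once per change; an empty diff calls nothing."""
    to_add, to_remove = applicable_diff(desired, current)
    for user in to_add:
        apply("add", user)
    for user in to_remove:
        apply("remove", user)
    return len(to_add) + len(to_remove)
```

When the lists already match, both loops have nothing to iterate over, which mirrors the intent of the shell fix: a no-op re-run performs zero commands and still exits cleanly.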
Viktor Barzin
7a649ce7eb crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion
to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on
2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply
(b59acbc1 agent memory bump) re-rendered the deployment from chart
defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI
watcher-auth bug against the current api.crowdsec.net behaviour, so
every fresh replica started 403'ing on startup.

Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives
future TF applies independently of the chart's appVersion. Verified
CAPI auth succeeds on all 3 fresh pods with v1.7.8.

Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…`
is single-shot and was already consumed by the first replica;
subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT
console enrollment (separate flows). Re-enable console reporting by
generating a fresh enroll key at app.crowdsec.net (procedure
documented in the values.yaml comment block).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 11:11:29 +00:00
Viktor Barzin
f55eaae682 docs/backup-dr: document /srv/nfs/anca-elements offsite-sync exclusion
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
2026-05-24 11:03:50 +00:00
Viktor Barzin
05f047f290 offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
The 771G under /srv/nfs/anca-elements is a downstream replica synced
FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh.
The offsite-sync pipeline was copying it back to Synology under
/volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate
(~122G already partially copied during the last monthly full sync).

- nfs-change-tracker.service: drop anca-elements/ from inotify watch
  (incremental syncs no longer queue these paths)
- offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly
  full rsync; grep -v on the incremental files-from list

Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup +
/etc/systemd/system/nfs-change-tracker.service; service reloaded.
2026-05-24 11:03:09 +00:00
Viktor Barzin
41786b0fca crowdsec: DISABLE_ONLINE_API=true — break the recurring 403 crashloop
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
CAPI auth at api.crowdsec.net is rejecting watcher logins from inside
the cluster within ~1h of registration, even after rotating creds via
`cscli capi register`. The same login successfully authenticates from
devvm but fails from cluster pods → IP-throttle or account-state issue
at the central API. Until that's resolved with CrowdSec support (or
the throttle window resets), running with CAPI on is just chronic
crashloops on every fresh replica.

`DISABLE_ONLINE_API=true` makes the chart entrypoint
`conf_set 'del(.api.server.online_client)'`, removing the online_client
block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off:
no community blocklists. Local scenarios + bouncers continue
unchanged.

Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml`
is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job`
is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So
the hook runs without the Role it needs, and atomic apply rolls back.
Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret
manually (the hook short-circuits when the secret already exists) and
re-applied the missing Role for future re-enablement.

Re-enable path documented in the comment block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 10:31:03 +00:00
Viktor Barzin
1f6facc8e4 Merge forgejo/master — reconcile 18-day divergence with origin
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Origin and forgejo had drifted since 2026-05-05 (merge base b45c45e4).
Each remote was receiving Viktor's commits independently — origin since
2026-05-23 and forgejo from 2026-05-06 to 2026-05-22 14:15. Both had
~30 substantive commits. This merge brings forgejo's work into the
local branch.

13 conflict files resolved as follows (all favoured HEAD = origin/local,
which is newer in every case):

- secrets/{fullchain,privkey}.pem — kept HEAD (renewed 2026-05-24,
  vs forgejo's 2026-05-17 renewal)
- stacks/blog/main.tf — kept HEAD (ingress-www intentionally removed
  today after DNS+monitor cleanup; forgejo had the old block)
- stacks/xray/modules/xray/main.tf — kept HEAD (vless dropped today
  as dead ingress; forgejo had the old 3-port service)
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh — kept HEAD
  (allowlist refactor, master-phase idempotency skip, tigera-operator
  quiesce/restore, IngressTTFBCritical ignore — all newer than forgejo)
- stacks/k8s-version-upgrade/main.tf — kept HEAD (deployments/scale
  RBAC, oldest-kubelet detection — both added 2026-05-23)
- scripts/update_k8s.sh — kept HEAD (--etcd-upgrade=false fallback)
- stacks/llama-cpp/main.tf — kept HEAD (KEEL_LIFECYCLE_V1 ignore_changes
  block added today, commit 0b1282a1)
- stacks/openclaw/main.tf — kept HEAD (nim/meta/llama-3.1-70b primary)
- stacks/trading-bot/main.tf — kept HEAD (claude-haiku-4-5 pin +
  kevin-signal-bridge container)
- stacks/postiz/modules/postiz/main.tf — kept HEAD (memory 2Gi/3Gi
  bump, despite postiz being destroyed today — kept TF intent)
- stacks/nvidia/modules/nvidia/values.yaml — kept HEAD (mem 822Mi)
- stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl —
  kept HEAD (richer alert list + raised StatefulSet `for: 3m`)
- stacks/kyverno/modules/kyverno/security-policies.tf — kept HEAD
  (expanded registry allowlist + comments)
- docs/architecture/security.md — kept HEAD (detailed W1.7 analysis)
- docs/plans/2026-05-21-ha-control-plane-design.md — kept HEAD
  (178-line superset incl. 2026-05-23 deferral rationale)

Auto-merged (no conflict): broker-sync, claude-agent-service,
cloudflared, mailserver, n8n, technitium, traefik, url, proxmox-csi,
xray (deployment portion). Brings in forgejo-only substantive commits:
fire-planner, openclaw v3 flow + recruiter-responder wiring, several
k8s-version-upgrade hardening passes (kill-switch, RecentNodeReboot
ignore, pipefail fixes), HA control plane design, security wave 1
expansion to tier 3+4, alloy file-tail switch, prometheus scrape 2m,
authentik replica cut, forgejo archive disable.

Meta: forgejo and origin drift is a coordination bug. Going forward we
need to either (a) have one CI mirror to the other, or (b) standardize
on one remote. Filed mentally; not addressed in this commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:41:36 +00:00
Viktor Barzin
0b1282a13c llama-cpp: ignore_changes for keel/k8s-managed annotations
Every `tg apply` was reverting the annotations that keel patches when it
detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped),
`keel.sh/update-time` (on the pod template; what actually triggers the
rollout), plus the K8s-managed `kubernetes.io/change-cause` and
`deployment.kubernetes.io/revision`. The revert forced a rollout, then
the next keel poll re-stamped the annotations, forcing another. With
llama-swap's ~10s cold-load on each pod recreate, the user noticed.

Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag —
keel still drives one legitimate rollout per day at ~07:25 UTC; this
patch stops the apply-driven extra rollouts on top of that.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:01:17 +00:00
67f8be4598 trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1)
5th worker container running in audit-only mode. Writes
kevin_signal_bridge_state rows showing what it WOULD trade but never
publishes to signals:generated. Kill-switch flipped in Phase 2.
2026-05-24 01:22:53 +00:00
Viktor Barzin
6218868ea5 xray: drop dead vless ingress + pin Service target_port
The xray-vless ingress, Service port 6443, and container port 6443 had
no backing listener — xray.config.json only binds 7443 (REALITY), 8443
(WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502
since the module was created.

Side effect: removing the first Service port slot ("vless"/6443) caused
the kubernetes provider to shift targetPort values on the remaining
two ports (defaulting only worked at create time, not on port removal).
Pinning target_port explicitly makes Service routing deterministic.

End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080
-> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all
three transports proxied successfully through a test pod, egress IP
correctly resolves to the home WAN.
2026-05-24 01:13:54 +00:00
Viktor Barzin
ae874e028d postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy)
krr 2026-05-22 flagged postiz-app as critically under-requested when it
was running (gap 2.2 GiB above the 512Mi request). Postiz is currently
uninstalled in the cluster — this change is only for when the stack is
re-deployed later. No apply triggered now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:25 +00:00
Viktor Barzin
b59acbc1db crowdsec/agent: bump memory request 64Mi → 128Mi
krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as
under-requested by ~588 MiB across the cluster. Live usage sits around
the 80-128 MiB mark during active log parsing, so the 64 MiB request
risked eviction ahead of more-needed pods. Limit stays at 512 MiB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:16 +00:00
Viktor Barzin
7108843b38 nvidia/driver-daemonset: bump memory request 256Mi → 822Mi
krr 2026-05-22 flagged nvidia-driver-daemonset as critically
under-requested (~566 MiB gap). Live driver process holds ~600-800Mi
once the kernel module is loaded. Limit stays at 2Gi so the DKMS build
during a kernel upgrade still has headroom (documented in values.yaml
to need ~1.4 GiB peak).

May help unblock code-8vr0 (GPU driver crashloop on node1) if the
crashloop was OOM-driven.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:06 +00:00
Viktor Barzin
2711d4af05 monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit)
krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working
set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB
request meant the scheduler didn't reserve enough room and the pod risked
eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:55 +00:00
Viktor Barzin
c77984a713 proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation)
The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB
during unlock (memory id=712 + already-documented in the limits=1280Mi).
Request was 64 MiB — meaning the unlock burst ran "best-effort", first in
line for OOM under node pressure. krr 2026-05-22 flagged this as a top
under-request. Bumping request matches the documented requirement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
467460cccd k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
447bfef507 blog: remove www.viktorbarzin.me ingress
The www subdomain was internal-only (no Cloudflare DNS record) but the
external uptime-kuma monitor still flagged it as down because public DNS
resolution failed. Removing the ingress along with the Technitium CNAME
makes the failure mode disappear and lets the cluster reach an
autonomous-clean state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
root
10ac174627 Woodpecker CI Update TLS Certificates Commit 2026-05-24 00:03:48 +00:00
Viktor Barzin
b4aa8eaf58 technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi
Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest
was cache headroom. Primary keeps more (1 GiB) since it owns authoritative
zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the
MetalLB external IP (10.0.20.201) post-rollout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:51 +00:00
Viktor Barzin
931d7b6c9d claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi)
Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request)
so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation
was measured idle and is not safe for an active job.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:42 +00:00
Viktor Barzin
d76f4c4827 n8n: cut memory request 1Gi → 512Mi (+ image bump 1.80.0 → 1.80.5)
Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with
the live Keel-managed version to avoid an inadvertent downgrade on apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:28 +00:00
Viktor Barzin
17c1ef73be url/shlink: cut memory request 960Mi → 512Mi
Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod
working set is ~80 MiB; 512Mi leaves comfortable headroom for the
Symfony+RoadRunner footprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:02:45 +00:00
Viktor Barzin
02ea5da8dc k8s-version-upgrade: skip phase_master/phase_worker if node already on target
The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.

Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.

Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.

In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:53:57 +00:00
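The idempotency guard in 02ea5da8dc amounts to comparing each node's kubelet version with the target before doing any work. A small sketch under the assumption that versions are available as a node-to-version mapping; `nodes_needing_upgrade` is a hypothetical helper, not the function in upgrade-step.sh.

```python
def nodes_needing_upgrade(kubelet_versions, target_version):
    """Return the nodes whose kubelet is not yet on target_version.

    Nodes already on the target are skipped entirely, so re-running the
    chain on a partially upgraded cluster does not drain them again.
    """
    return [
        node
        for node, version in kubelet_versions.items()
        if version != target_version
    ]
```

With master and node4 already on v1.34.8, only the remaining workers are returned, and a fully upgraded cluster yields an empty list.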
Viktor Barzin
a0f3e15562 k8s-version-upgrade: version-check uses oldest kubelet, not master
Previous version-check read RUNNING from .items[0].nodeInfo.kubeletVersion
— which is just k8s-master. If master is upgraded but workers aren't
(e.g. a chain that completed master phase but failed mid-worker), the
version-check sees v1.34.8 and decides "no upgrade needed", never
spawning the resume phase. Workers stay behind forever.

Today's chain hit exactly this: master + node4 upgraded to v1.34.8,
worker-node4 Failed mid-soak (alert sensitivity, since loosened),
chain dead. Re-triggering the version-check looked at master only,
decided cluster was "done", and refused to resume worker chain.

Fix: read all node kubelet versions, sort -V, take head -1 (oldest).
A partial chain now correctly reports the un-upgraded version and the
chain resumes.

Trivial change; tested live — chain now correctly reports v1.34.7
(workers' version) and spawns preflight → master → worker chain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:48:50 +00:00
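The fix in a0f3e15562 (sort all kubelet versions and take the oldest) depends on comparing versions numerically rather than as strings. A minimal Python sketch of the same idea; the commit itself uses `sort -V | head -1` in shell, and `version_key` and `oldest_kubelet` are hypothetical names.

```python
import re


def version_key(version):
    """Turn 'v1.34.8' into (1, 34, 8) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def oldest_kubelet(versions):
    """Return the lowest kubelet version among all nodes.

    Raises ValueError on an empty list, which is the right failure mode
    when no nodes were returned at all.
    """
    return min(versions, key=version_key)
```

A plain string comparison would rank `v1.34.10` below `v1.34.9`; the numeric key avoids that, which is the same reason the shell version uses `sort -V`.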
Viktor Barzin
68f8514e61 monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)
Earlier in this session, commit 503ac4c1 changed the `for:` from 5m to 2m
based on an inaccurate brief I wrote. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.

node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:32:41 +00:00
Viktor Barzin
503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.

- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
  drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
  speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
  with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
  shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
  cause brief 5xx spikes that clear in 1-2m).

Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:28:53 +00:00
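The KubeletRunningContainersDrop change in 503ac4c192 swaps an absolute threshold for a relative one. The arithmetic can be sketched as follows; this is plain Python, not the PromQL expression from the rule (which this log does not show), and the container counts in the usage note are made-up illustrative numbers.

```python
def absolute_rule(now, ten_minutes_ago):
    """Old rule: fire when more than 10 containers disappeared."""
    return (now - ten_minutes_ago) < -10


def relative_rule(now, ten_minutes_ago):
    """New rule: fire when more than 50% of containers disappeared."""
    if ten_minutes_ago == 0:
        return False  # nothing was running, so there is no drop to measure
    return (now - ten_minutes_ago) / ten_minutes_ago < -0.5
```

For a hypothetical node going from 120 to 95 containers during a drain, the old rule fires (a change of -25) while the new one does not (about -21%). A drop from 120 to 40 (about -67%) still fires under the new rule.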
Viktor Barzin
ad9f6c8f41 k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from a denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).

Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:14:39 +00:00
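The allowlist approach in ad9f6c8f41 can be illustrated with a small filter over firing alerts. A sketch only: the real check is a query inside the upgrade script, the alert dictionaries here are an assumed shape, and `extra_ignore` is modelled as a collection of alert names although the script may treat it differently (for example as a regex).

```python
def halting_alerts(firing, extra_ignore=()):
    """Return the names of alerts that should halt the upgrade chain.

    Only severity=critical alerts qualify; anything named in extra_ignore
    is dropped even if it is critical.
    """
    return [
        alert["alertname"]
        for alert in firing
        if alert.get("severity") == "critical"
        and alert["alertname"] not in extra_ignore
    ]
```

New warning-level alerts never need to be added anywhere: they are excluded by default, which is the robustness argument the commit makes against maintaining a denylist.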
0025511b6a docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).

- architecture/vpn.md — Technitium component cell + AdGuard forwarder
  example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
  example
2026-05-23 08:53:52 +00:00
Viktor Barzin
68a503e29f kyverno: allowlist woodpeckerci/* for CI step pods
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).

Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).

Closes: code-da4h
2026-05-23 08:52:48 +00:00
000d306542 technitium: add viktorbarzin.me apex DNS drift probe + alerts
Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:

- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
  resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
  and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
  Pushgateway. Python+dnspython, ~30 LOC.

- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
    other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
    succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
    reported.

- Added the new alert names to the Slack receiver matcher in both routes
  alongside EmailRoundtrip*.

Verified: rules loaded as inactive (apex is correct), metric flowing, manual
probe job pass observed.
2026-05-23 08:41:14 +00:00
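The core decision of the probe in 000d306542 is whether the apex resolved to the expected address. A minimal sketch of that decision and of one Pushgateway text line, assuming the DNS lookup has already produced a list of addresses; the actual probe uses dnspython and is not shown in this log, the function names are hypothetical, and the timestamp metric is left out because its full name is not spelled out in the commit message.

```python
EXPECTED_APEX = "10.0.20.200"  # the address the drift alert treats as correct


def apex_correct(resolved_addresses, expected=EXPECTED_APEX):
    """Return 1 if the apex resolved to exactly the expected address, else 0.

    An empty answer or any additional address counts as incorrect.
    """
    return 1 if set(resolved_addresses) == {expected} else 0


def exposition_line(value):
    """Format the gauge in Prometheus text exposition format."""
    return f"viktorbarzin_apex_correct {value}\n"
```

Treating an empty answer as incorrect is a design choice of this sketch; a probe that cannot resolve at all might instead skip the push and let the stale alert catch it.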
Viktor Barzin
4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore
Three changes unblocking the autonomous chain for k8s patch upgrades:

1. **phase_master quiesces tigera-operator before drain, restores after.**
   Tigera crashes immediately if apiserver is unreachable (no retry logic)
   and crashlooping it during master static-pod swaps generates ~500MB/s
   disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
   limit. Quiesce removes the storm contributor; calico data plane keeps
   running unchanged (data plane is the DaemonSet+Typha, operator is just
   the reconciler).

2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
   For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
   writes an identical manifest, hash doesn't update, watch times out and
   rolls back forever. The skip-etcd retry sidesteps it for the legitimate
   no-change case while still doing a full etcd upgrade on the first
   attempt (correct for minor-version bumps).

3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
   Both are symptoms-not-causes: ingress latency spikes briefly during any
   pod-restart wave; high IOwait is exactly what upgrade activity causes
   (chicken-and-egg). The inline quiet-baseline check (Ready transition
   <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.
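
The skip-etcd retry from change 2 can be sketched as follows. This is an illustration in Python, not the contents of update_k8s.sh; the function names and the injectable `runner` are hypothetical, while `--etcd-upgrade=false` is the kubeadm flag named above.

```python
import subprocess


def upgrade_command(version, attempt):
    """Build the kubeadm command line for a given 1-based attempt."""
    cmd = ["kubeadm", "upgrade", "apply", version, "--yes"]
    if attempt >= 2:
        # The retry skips the etcd static-pod upgrade, whose manifest is
        # unchanged on a patch bump and therefore never changes hash.
        cmd.append("--etcd-upgrade=false")
    return cmd


def run_upgrade(version, runner=subprocess.run, max_attempts=2):
    """Try a full upgrade first, then retry once without the etcd step."""
    for attempt in range(1, max_attempts + 1):
        result = runner(upgrade_command(version, attempt))
        if result.returncode == 0:
            return attempt  # which attempt succeeded
    raise RuntimeError(f"kubeadm upgrade to {version} failed after {max_attempts} attempts")
```

Keeping the first attempt unmodified preserves the full etcd upgrade for minor-version bumps, where the image does change.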

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00
6f4a569d1c traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile
Browsers accumulate one authentik_proxy_<random> cookie per Authentik
Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the
combined Cookie header exceeds nginx's default 4 x 8k
large_client_header_buffers and trips '431 Request Header Fields Too
Large' at the forward-auth nginx (traefik/auth-proxy).

Bumped to:
  client_header_buffer_size 8k
  large_client_header_buffers 8 64k

Matches the pattern used on the London Flint 2 router nginx
(memory id=647).
2026-05-23 08:34:33 +00:00
Viktor Barzin
7f63d35d0a docs/plans: HA control plane — design + plan + deferral
Investigated, designed, and planned the 3-master HA control plane
migration triggered by 2026-05-21's autonomous k8s upgrade cascade.

Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack
  multi-master refactor, HTTPS /readyz health check, expanded blast
  radius to include /home/wizard/code/infra/config root kubeconfig,
  config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector,
  k8s-version-upgrade chain extension as Phase 7)

Plan covers 11 phases end-to-end including panic-mode rollback.

DEFERRED before execution. PVE host is 98% RAM-committed
(262 GB allocated / 267 GB physical, 1.5 GB swap active); the
planned 3 x 32 GB masters would push allocation to 326 GB and OOM
the host. k8s-master currently uses only 4.6 GB of its 32 GB
allocation (5-6x oversized).
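
The 326 GB figure can be reproduced with a short calculation. This sketch assumes the existing 32 GB k8s-master is already counted in the 262 GB allocated, so only two additional masters add to the total; that reading is inferred from the numbers above, not stated in the design doc.

```python
PHYSICAL_GB = 267
ALLOCATED_GB = 262      # assumed to include the existing 32 GB k8s-master
MASTER_GB = 32
PLANNED_MASTERS = 3
EXISTING_MASTERS = 1


def projected_allocation_gb():
    """Allocation after adding the masters that do not exist yet."""
    extra_masters = PLANNED_MASTERS - EXISTING_MASTERS
    return ALLOCATED_GB + extra_masters * MASTER_GB


projected = projected_allocation_gb()   # 262 + 2 * 32 = 326
overcommit = projected - PHYSICAL_GB    # 326 - 267 = 59
print(f"committed now: {ALLOCATED_GB / PHYSICAL_GB:.0%}")
print(f"projected: {projected} GB, over physical by {overcommit} GB")
```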

Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.

Standalone candidate worth lifting independently: Phase 1.5's
rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning
to loop over k8s_master_hosts list) — future-proofs the cluster
without committing to the HA migration.

Refs: code-n0ow (open, deferred via bd note).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:32:15 +00:00
70a334e431 trading-bot: pin Meet Kevin LLM model to claude-haiku-4-5
Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth
bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour
quota. Haiku-4-5 has much higher RPM and processes the 16-video
backfill cleanly (~30s/video with inter-call throttle).

Comment above the env line documents the rationale for future re-evaluation.
2026-05-22 20:43:05 +00:00
5258f09230 mailserver: decommission SendGrid
Remove leftover SendGrid references after the Brevo migration was completed:

- Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`,
  SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge
  once the CNAME is removed).
- Clean up commented-out `smtp.sendgrid.net` relayhost references and the
  `# For sendgrid` comment on `sasl_passwd` in the mailserver module.

DNS records deleted out-of-band (not TF-managed):
- CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries)
- Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`,
  `s2._domainkey` CNAMEs → sendgrid.net

Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive,
roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey`
local + `brevo1/brevo2._domainkey` Brevo) untouched.
2026-05-22 20:08:38 +00:00
Viktor Barzin
b233aba710 openclaw: switch primary to nim/meta/llama-3.1-70b-instruct
Auth audit on 2026-05-22 — all the broken paths and the one that works:

- openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com)
- secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota
- openrouter_api_key: "Key limit exceeded (total limit)"
- llama_api_key: region-blocked
- anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real
  x-api-key — won't auth via x-api-key header)
- nvidia_api_key (NIM): WORKS. The key was already baked into the
  openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key.

Two NIM models verified end-to-end (call from inside openclaw pod
with tool-call schema, both returned proper {tool_calls:[…]} JSON):
- meta/llama-3.1-70b-instruct      — 0.58s, primary
- meta/llama-4-maverick-17b-128e   — 16s, smarter, fallback

Fallback chain: maverick → openai-codex (auto-promotes once re-authed)
→ modelrelay/auto-fastest (last resort, hallucinates instead of
tool-calling, but at least responds).

Models registered in both `agents.defaults.models` (allowlist) and
`models.providers.nim.models` (capability declarations) so the agent
sees them as available tools. Startup `models set` updated to pin
the new primary across `doctor --fix` runs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:23:17 +00:00
3962513036 security(wave1): W1.7 analysis snapshot — observation data → allowlist plan
First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.

## Findings

**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379

**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
  needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
  permanently or move to dedicated egress proxy

## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
  to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
  will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.

## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
   set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
   + vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.
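
One way to make the stability check in step 1 concrete is sketched below. It is not the analysis script referenced above: the data shape (a set of destination tuples per ISO date) and the `quiet_days` parameter are hypothetical, and it compares each day against everything seen earlier rather than only the previous day.

```python
def new_destinations_per_day(by_day):
    """Map each day to the destinations not seen on any earlier day."""
    seen = set()
    new_by_day = {}
    for day in sorted(by_day):  # ISO dates sort chronologically as strings
        new_by_day[day] = by_day[day] - seen
        seen |= by_day[day]
    return new_by_day


def is_stable(by_day, quiet_days=3):
    """True when the most recent `quiet_days` days introduced nothing new."""
    new_by_day = new_destinations_per_day(by_day)
    recent = sorted(new_by_day)[-quiet_days:]
    # Require more history than the quiet window, so day one never counts.
    return len(new_by_day) > quiet_days and all(not new_by_day[d] for d in recent)
```

Tracking only newly seen destinations reflects that an allowlist breaks on additions; a destination that goes quiet for a day does not invalidate it.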

Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:22:25 +00:00
2d35d72a53 kyverno(wave1): add 7 missing registries to trusted-registries allowlist
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.

Added entries:
- amruthpillai/*       (resume — reactive-resume)
- athomasson2/*        (ebook2audiobook)
- netboxcommunity/*    (netbox)
- nousresearch/*       (hermes-agent)
- opentripplanner/*    (osm-routing)
- rhasspy/*            (whisper, piper)
- registry.viktorbarzin.me/*  (legacy private registry — council-complaints
                                still references; should migrate to forgejo)

The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.

## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
  same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously blocked images now passes:
  - netboxcommunity/netbox:v4.5.0-beta1 ✓
  - rhasspy/wyoming-whisper:3.1.0 ✓
  - registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode
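
The dry-run results above can be approximated offline with a small matcher. This is a sketch that uses Python's `fnmatch` as a stand-in for Kyverno's wildcard matching, which is an assumption; the patterns are the seven entries added in this commit, and `is_trusted` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# The seven entries added in this commit.
ADDED_PATTERNS = [
    "amruthpillai/*",
    "athomasson2/*",
    "netboxcommunity/*",
    "nousresearch/*",
    "opentripplanner/*",
    "rhasspy/*",
    "registry.viktorbarzin.me/*",
]


def is_trusted(image, patterns=ADDED_PATTERNS):
    """Return True if the image reference matches any allowlist pattern."""
    return any(fnmatchcase(image, pattern) for pattern in patterns)


for image in (
    "netboxcommunity/netbox:v4.5.0-beta1",
    "rhasspy/wyoming-whisper:3.1.0",
    "registry.viktorbarzin.me/council-complaints:1c56f8f",
):
    print(image, "PASS" if is_trusted(image) else "BLOCKED")
```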

## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:17:16 +00:00
Viktor Barzin
c11ac7d486 cnpg: bump webhook-cert renewal threshold 7d -> 30d
Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn:

CNPG default 'expiringCheckThreshold = 7' means the operator only
regenerates the self-signed webhook cert when remaining lifetime drops
BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result:
~23 days of WARN before CNPG would even attempt rotation.

Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the
operator now regenerates with 30d buffer, aligning with our monitoring
threshold. Cert lifetime stays at chart default 90d.

Verified after apply: operator runtime config shows
'expiringCheckThreshold:30'. Companion in-session action: deleted the
existing soon-to-expire secret and bounced the operator to force an
immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).
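
The timing in this message can be checked with a few lines. The constants are the values quoted above; the year 2026 for the certificate dates is taken from the commit date, and the helper names are hypothetical.

```python
from datetime import date, timedelta

CERT_LIFETIME_DAYS = 90    # chart default, per this message
ALERT_THRESHOLD_DAYS = 30  # cluster-health check #22


def warn_days_before_renewal(renew_threshold_days):
    """Days the health check warns before the operator starts rotating."""
    return max(0, ALERT_THRESHOLD_DAYS - renew_threshold_days)


def not_after(not_before):
    """Expiry date of a cert issued on `not_before`."""
    return not_before + timedelta(days=CERT_LIFETIME_DAYS)


print(warn_days_before_renewal(7))    # old threshold: 23 days of WARN
print(warn_days_before_renewal(30))   # new threshold: 0
print(not_after(date(2026, 5, 22)))   # 2026-08-20
```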
2026-05-22 15:00:41 +00:00
Viktor Barzin
96f9db0b13 state(cnpg): update encrypted state 2026-05-22 15:00:04 +00:00
6367b783c7 broker-sync(imap): fix command name + add fsGroup for sync.db writes
Two latent issues found while diagnosing why the May 2026 META vest
didn't land:

1. broker-sync-imap CronJob's command was 'broker-sync imap', but the
   actual CLI subcommand is 'imap-ingest'. Every scheduled run had
   been failing with 'No such command imap' since day one.

2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775
   group=10001. Without fsGroup in the pod's securityContext the
   pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't
   create journal/WAL files next to sync.db -- hits
   'attempt to write a readonly database'. fsGroup=10001 adds the
   matching gid to the pod's supplemental groups so writes work.
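
The permission outcome in item 2 follows from the standard POSIX owner/group/other class selection, sketched here. The directory's owner uid is not stated in this message, so `OWNER_UID = 0` is a hypothetical placeholder; the mode, group, uid, and gid values are the ones quoted above.

```python
def dir_access(mode, owner_uid, owner_gid, uid, gids):
    """Return the rwx string for the permission class that applies to a process."""
    if uid == owner_uid:
        shift = 6          # owner class
    elif owner_gid in gids:
        shift = 3          # group class
    else:
        shift = 0          # other class
    bits = (mode >> shift) & 0o7
    return "".join(flag if bits & mask else "-" for flag, mask in (("r", 4), ("w", 2), ("x", 1)))


DIR_MODE = 0o2775
OWNER_UID = 0       # hypothetical: the owner uid is not given in the commit
OWNER_GID = 10001

# Without fsGroup the pod only has its primary gid 999.
print(dir_access(DIR_MODE, OWNER_UID, OWNER_GID, uid=10001, gids={999}))         # r-x
# fsGroup=10001 adds the directory's group as a supplemental group.
print(dir_access(DIR_MODE, OWNER_UID, OWNER_GID, uid=10001, gids={999, 10001}))  # rwx
```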

Schwab email-sender regex fix is in broker-sync@d860aef.
2026-05-22 14:41:54 +00:00
Viktor Barzin
fa536cc08b ci: retry after Keel rollout cascade settled 2026-05-22 14:41:54 +00:00
a3bcb5e12f fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:

stacks/fire-planner:
- fire-planner-col-refresh CronJob: Sun 04:00 UTC, a no-op until rows are
  within 7 days of the 1-year TTL boundary. Calls
  python -m fire_planner col-refresh-stale, upserts via cache.upsert.
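
The refresh rule above reduces to a single comparison, sketched here. This is not the code in fire-planner; `needs_refresh` and the `fetched_at` parameter are hypothetical names, and treating the 1-year TTL as 365 days is an assumption.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=365)            # assumption: "1-year TTL" taken as 365 days
REFRESH_WINDOW = timedelta(days=7)   # refresh once within 7 days of expiry


def needs_refresh(fetched_at, now):
    """True once a row is inside the refresh window before its TTL expires."""
    return now >= fetched_at + TTL - REFRESH_WINDOW


fetched = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(needs_refresh(fetched, datetime(2026, 5, 24, tzinfo=timezone.utc)))  # False
print(needs_refresh(fetched, datetime(2026, 5, 25, tzinfo=timezone.utc)))  # True
```

With a weekly schedule and a 7-day window, each row gets at least one refresh attempt before it expires.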

monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
  $baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.

Initial population needs a one-shot: post-Keel-rollout,
  kubectl -n fire-planner exec deploy/fire-planner -- \
    python -m fire_planner col-seed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00