infra

Author	SHA1	Message	Date
Viktor Barzin	7e558de8f0	openclaw: SSH + tmux task fallback to devvm Give the OpenClaw pod two new capabilities: 1. Host-tools bundle. New init container `install-host-tools` extracts openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq + friends into /tools/host-tools/, with the bookworm-slim libs the binaries need. PATH + LD_LIBRARY_PATH on the main container point ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1 marker; smoke test (ldd-based) fails the init at deploy time if any binary has unresolved deps. Bundle is ~558 MB on the existing /srv/nfs/openclaw/tools NFS. 2. devvm SSH + async task pattern. New init `setup-ssh-config` writes id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main container startup symlinks /home/node/.ssh → there. New /usr/local/bin/openclaw-task wrapper on devvm manages long-running work as tmux sessions on devvm (sessions and logs survive pod restarts — they live on devvm, not in the pod). New init container `seed-devvm-memory-note` drops a markdown note teaching the pattern; main container startup now runs `openclaw memory index --force` so the note is searchable on first boot. Design + verified E2E flow in docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test green: spawned a 50s task from pod A, deleted pod A, new pod B saw the task finish and read its full log. Pre-existing keel.sh annotation drift on openclaw/{openlobster, task_webhook} cleaned up in the same apply. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	052404301b	docs: HA control plane design (3 masters) Captures today's k8s-upgrade-pipeline session findings — root cause of repeated upgrade failures is the single-master apiserver outage window cascading into operator crashloops + storm I/O. HA control plane with 3 masters + apiserver LB removes the cascade entirely. Tracked in beads code-n0ow. Plan doc to follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	2f9ac0110a	security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`. ## Mechanics (verified end-to-end on 2026-05-19) 1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`. 2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5. 3. Linux kernel emits LOG entries to ring buffer with transport=kernel. 4. systemd-journald captures kernel transport entries. 5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`. 6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection. ## Verified output sample `calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...` The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply — verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed). ## Wave 1 status W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`. Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
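The W1.7 step described in commit 2f9ac0110a (turning a week of `calico-packet` samples into an empirical egress allowlist) could start from something like the sketch below. It is not code from the repository: the function names are hypothetical, and it assumes the fields after `TTL=` follow the usual iptables LOG layout (`PROTO=`, `SPT=`, `DPT=`), which the truncated sample in the commit message does not show.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# KEY=VALUE pairs of an iptables LOG line. Bare flags such as DF or SYN
# carry no "=" and are deliberately skipped.
_KV = re.compile(r"\b([A-Z]+)=(\S*)")
_MARKER = "calico-packet: "  # log prefix named in the commit message


def parse_calico_packet(line: str) -> Optional[Dict[str, str]]:
    """Return the KEY=VALUE fields of a calico-packet log line, or None."""
    idx = line.find(_MARKER)
    if idx == -1:
        return None
    return dict(_KV.findall(line[idx + len(_MARKER):]))


def egress_destinations(lines: Iterable[str]) -> Counter:
    """Count (DST, PROTO, DPT) tuples seen across the given log lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = parse_calico_packet(line)
        if fields and "DST" in fields:
            key = (fields["DST"], fields.get("PROTO", "?"), fields.get("DPT", "?"))
            counts[key] += 1
    return counts
```

Feeding it the lines returned by the LogQL query would give a frequency table of destinations, which is one plausible input for the `[Allow <specific dests>, Deny]` rules.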
Viktor Barzin	e4b9e97ac9	docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade Captures the wipe+reinit strategy (sidestep the broken DD upgrade path), the IO config bump (innodb_io_capacity 100→2000), root-cause analysis with explicit uncertainty, verification gates, and rollback. Not scheduled yet. Tracked in beads code-963q. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	a048b37f60	security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `/` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos. 
- Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	fd1490ae15	docs: update MySQL restore runbook + CLAUDE.md after 8.4.9 recovery Runbook rewritten for the standalone setup (InnoDB Cluster gone since 2026-04-16) and now covers the full disaster-recovery flow we just executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain → Delete), re-apply TF, restore via in-namespace Job, drop+create static users with fresh Vault passwords, restart dependents. CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	c9289192c7	security(wave1): Vault audit-tail sidecar (live) + doc reality-check ## Vault audit-tail sidecar (APPLIED + VERIFIED) - Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume from the chart's auditStorage), emits JSON audit events to stdout. kubelet captures the stdout; once Loki+Alloy are deployed (blocked on code-146x), these logs flow automatically to Loki with `container="audit-tail"`. - Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly. - Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s). - Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON audit lines from ESO token issuance, KV reads, etc. ## Doc reality-check While verifying logs reached Loki, discovered Loki is NOT actually deployed. `stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but has a self-referencing `depends_on = [helm_release.loki]` that prevented apply. No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The monitoring.md "Loki: deployed" claim was aspirational. - security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on code-146x) - security.md W1.3 row: gated on code-146x added - monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x ## New beads task - code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug, investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0). 
## Wave 1 status update - W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x - W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR) - W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	b3cf75dc61	docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads code-8ywc and follow-up commits. Captures: - security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7, S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4. - monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1) and the Loki ruler → Alertmanager → #security routing path. - runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action steps, false-positive triage, and SEV1 escalation. - .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy, rationale for not adopting canary tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	6de4549a96	docs/plans: add agent presence implementation plan (2026-05-17) 15-task plan for a shared presence board so Claude Code sessions can see which shared infra resources are being actively mutated by other sessions. Resource-scoped claims on the existing Dolt server, heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.	2026-05-22 14:16:56 +00:00
Viktor Barzin	63cbd0aba5	docs: known-issues entry for the Ubuntu 26.04 / NVIDIA driver gap Captures the workaround applied on k8s-node1 today (kernel rolled back to 6.8.0-117-generic, apt-mark hold on kernel meta-packages, /etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and the gpu-operator picks an existing ubuntu24.04 driver image), plus the trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on nvcr.io/nvidia/driver. Linked from the post-mortem and from beads code-8vr0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	c72b839a2f	nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some point. NVIDIA has NOT published ubuntu26.04 driver images yet (skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04 tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04). Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 + driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart applied cleanly but the v26.3.1 operator auto-detects host OS via NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404 on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so future `terraform apply` doesn't surface the same trap again. Note: chart rollback alone does NOT restore GPU functionality on k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the ubuntu26.04 suffix (the NFD label is sticky once detected). The actual recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver images, or (b) rolling the host kernel back to 6.8.0-117-generic (still installed in /boot, headers in /usr/src) + `apt-mark hold` to prevent re-upgrade. That step needs explicit user authorization for a node reboot — left as the next action item on code-8vr0. Files: - stacks/nvidia/modules/nvidia/main.tf — explicit version pin, explanatory comment - stacks/nvidia/modules/nvidia/values.yaml — comment block documenting the situation; driver pinned at 570.195.03 - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md — full timeline, root causes, recovery procedure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	0480477f44	nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	411524a10d	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	020f62555b	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	9476649539	docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16) Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart derived hostPath from configuration.rebootSentinel) and the companion work to harden the rolling-reboot pipeline against single-replica PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source drift audit (benign). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	448bc0c0f6	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
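The idempotency argument in commit 448bc0c0f6 rests on Job names being a pure function of phase, target version and node. A minimal sketch of that naming scheme follows; the helper names are hypothetical, the version string in the usage note is a made-up example, and which phases actually carry the `-<node>` suffix is a guess based on the `k8s-upgrade-<phase>-<target_version>[-<node>]` pattern quoted in the message.

```python
from typing import List, Optional, Tuple


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build k8s-upgrade-<phase>-<target_version>[-<node>]."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


# Chain order as listed in the commit message. The node here is the node
# being drained; attaching it only to worker Jobs is an assumption.
CHAIN: List[Tuple[str, Optional[str]]] = [
    ("preflight", None),
    ("master", None),
    ("worker", "k8s-node4"),
    ("worker", "k8s-node3"),
    ("worker", "k8s-node2"),
    ("worker", "k8s-node1"),
    ("postflight", None),
]


def chain_names(target_version: str) -> List[str]:
    """Names of every Job in the chain for one target version."""
    return [job_name(phase, target_version, node) for phase, node in CHAIN]
```

Because `chain_names("v1.34.7")` returns the same list on every call, re-creating a failed Job reuses its existing name and cannot spawn a second copy of the downstream Jobs.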
Viktor Barzin	b278a8f158	docs/auth: sync to current `auth` enum (required/app/public/none) Replace the legacy `protected = true` reference with the four-tier `auth` enum that's been live for weeks. Document the anti-exposure guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`) that enforces the inline-comment convention. Fix two stale paths: - `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/` - `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf` Replace the single `protected = true` example with three: a default Authentik-gated admin UI, an app-managed backend, and an intentionally-public webhook receiver. Each example shows the required comment line above the auth assignment. [ci skip]	2026-05-22 14:16:44 +00:00
Viktor Barzin	e75bcaf394	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-22 14:16:42 +00:00
Viktor Barzin	f5b1fb179a	docs: add k8s node auto-upgrade runbook + architecture section The OS-side counterpart to the service-upgrade pipeline. Covers the unattended-upgrades + kured + sentinel-gate + Prometheus halt-on-alert design landed in c0991f7f8. Runbook: ops procedures (verify health, halt rollout, restore config to a re-imaged node, roll back a bad upgrade, investigate which alert is blocking). Architecture doc: extends the existing service-upgrade flow with a "K8s Node OS Upgrades" section (stack, sources of truth, day-2 mechanism, why-this-design rationale tied to the March 2026 post-mortem). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	b99e30e798	docs/plans: 2026-04-20 infra audit design (post-research, post-challenge) Adds the infra audit plan: 5 parallel research agents (Reliability, Declarative, Maintenance, Scalability, Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog. Already incorporates the challenger corrections (drops bad metric pulls, reframes intentional-by-design items). Source for several follow-ups already shipped this week (kured-prometheus gating, NFS fsid post-mortem fixes, Authentik outpost postgres-backend).	2026-05-22 14:16:41 +00:00
Viktor Barzin	93ee45bd25	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items Update `.claude/reference/authentik-state.md`: - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session Duration table with the gotcha that the gorilla session store binds the value once at outpost startup (rollout restart needed). - Replace the "session storage moved to Postgres in 2025.10" note that falsely implied the migration was automatic — explain that the `Outpost.managed` field gates the postgres path and our outpost silently stayed on `FilesystemStore` until 2026-05-10. - Document the goauthentik 2026.2.2 service-selector bug (service.py:52) and the JSON-patch workaround. - Document that the standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the `app.kubernetes.io/component=server` pod label. - Note the "Terraform doesn't expose `Outpost.managed`" assumption that holds the `managed=embedded` value in place across applies. Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`: - P2 codify-in-Terraform: DONE. - P3 access_token_validity reduce: DONE-alt (we did the opposite — bumped to 4 weeks — because postgres backend mooted the storage concern). - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses the loss-of-state class on the embedded outpost itself).	2026-05-22 14:16:41 +00:00
Viktor Barzin	63fc1e00de	infra/compute: bump k8s-node1 RAM 32 -> 48 GiB Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap + immich-ml) was hitting 94% memory-request saturation on the old size. The benchmark on 2026-05-10 surfaced this when llama-swap stayed Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100) - the actual constraint was node1 RAM, not GPU. Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152, qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB allocatable), uncordon, restored llama-swap + immich-ml. Out-of-band qm set is the path here (not Terraform) because VMID 201 is intentionally not managed by TF yet - the telmate/proxmox provider trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442). Adopt this VM into TF once we migrate to bpg/proxmox. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	6e7fe96a40	infra/llama-cpp: benchmark report + -fa flag fix Phase 7 of the vision-LLM benchmark plan. Adds: - docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR, per-model analysis, top-N agreement, cost vs cloud APIs, sample captions). Verdict: qwen3vl-4b for the request path (3.55 s p50, 100% parse, decisive top-N distro); qwen3vl-8b for caption polish. - docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump for diff-checking against future runs. - main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form of the flash-attention flag; without the value llama-server exits before serving any request). - llama-cpp.md architecture doc links the report so future operators land on the deployed-and-evaluated model from one entry point. 300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the GPU exclusively allocated. immich-ml was scaled to 0 for the run (node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a follow-up). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	9c617e6d38	infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc. Idle TTL 10min so models unload between benchmark batches. Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot download Job pulls Q4_K_M GGUF + mmproj per model, creates stable model.gguf / mmproj.gguf symlinks so the llama-swap config is filename-agnostic, then warms the kernel page cache. GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml to 0 during benchmark windows. wait_for_rollout=false so apply doesn't block on GPU availability. Initial use case: vision-LLM benchmark for instagram-poster candidate scoring; future consumers (HA, agentic tooling) hit the same endpoint via LiteLLM at the gateway. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:40 +00:00
Viktor Barzin	0752bd49c8	kms: document native DNS auto-discovery (no client config needed) LAN clients with DNS suffix viktorbarzin.lan now activate with zero configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by default and the chain resolves through vlmcs.viktorbarzin.lan to the new 10.0.20.202 KMS IP. DNS state (Technitium primary, replicated to secondary+tertiary by the existing technitium-zone-sync CronJob every 30 min): - _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan (was: target=kms.viktorbarzin.lan) - vlmcs.viktorbarzin.lan A 10.0.20.202 (added) - kms.viktorbarzin.lan A 10.0.20.200 (unchanged — still the Traefik LB for the user-facing website at kms.viktorbarzin.lan/) vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname rather than retargeting kms.viktorbarzin.lan so the LAN-direct website keeps working without depending on hairpin NAT through pfSense. Verified end-to-end on WIN10Pro-DS32 (192.168.1.230): slmgr /ckms → slmgr /ato → "Product activated successfully" with "KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and "KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230 appears in vlmcsd log and in the slack-notifier sent line; second activation within the dedup window correctly increments kms_activations_dedup_skipped_total.	2026-05-22 14:16:40 +00:00
Viktor Barzin	67b11a964a	kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise Two coupled fixes for the hourly Slack noise + missing client IPs: 1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP 10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19. Sharing 10.0.20.200 is blocked because all 10 services there are ETP=Cluster and MetalLB requires consistent ETP per shared IP. 2. Slack notifier now suppresses Slack posts for bare TCP open/close pairs (no Application/Activation block) — these are Uptime Kuma's port monitor and the new kubelet readiness/liveness probes. Probe counts go to a new metric kms_connection_probes_total{source} where source classifies the IP as internal_pod / cluster_node / external. Real activations are unaffected. Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod Ready on the listener actually being up — required for ETP=Local so MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving. pfSense side (applied separately, not codified): - New alias k8s_kms_lb = 10.0.20.202 (KMS-only) - WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb - All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks, smtps, etc.) untouched Runbook updated. Tests added for classify_source / is_probe / process_line.	2026-05-22 14:16:40 +00:00
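The probe filtering in commit 67b11a964a can be illustrated roughly as follows. This is a hedged sketch, not the notifier's actual code: only the names `classify_source` and `is_probe` come from the commit message, their signatures here are invented, and the two CIDRs are assumptions borrowed from the wave-1 allowlist elsewhere in this log (10.10.0.0/16 for pods, 10.0.20.0/22 taken to be the node subnet).

```python
import ipaddress
from collections import Counter

# Assumed ranges; the real notifier may use different ones.
POD_CIDR = ipaddress.ip_network("10.10.0.0/16")
NODE_CIDR = ipaddress.ip_network("10.0.20.0/22")

# Stand-in for the kms_connection_probes_total{source} metric.
probe_counts: Counter = Counter()


def classify_source(ip: str) -> str:
    """Label an IP as internal_pod, cluster_node, or external."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_CIDR:
        return "internal_pod"
    if addr in NODE_CIDR:
        return "cluster_node"
    return "external"


def is_probe(saw_open: bool, saw_close: bool, saw_activation_block: bool) -> bool:
    """A bare TCP open/close pair with no Application/Activation block."""
    return saw_open and saw_close and not saw_activation_block


def record_connection(ip: str, saw_open: bool, saw_close: bool,
                      saw_activation_block: bool) -> bool:
    """Count probes by source; return True when Slack should be notified."""
    if is_probe(saw_open, saw_close, saw_activation_block):
        probe_counts[classify_source(ip)] += 1
        return False
    return saw_activation_block
```

Keeping classification separate from the probe check mirrors the split implied by the test names in the commit, and lets each piece be unit-tested without a log line.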
Viktor Barzin	08edd92b22	kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts activations and dedup-skips by product, gauges last-activation timestamp. Pod template gets the standard prometheus.io/scrape annotations so the cluster-wide kubernetes-pods job picks it up via pod IP. Memory request bumped to 48Mi to cover counter dicts + HTTPServer. Plus docs: networking.md footnotes the windows-kms row noting public WAN exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60, overload <virusprot> flush) pfSense filter rule, and a new runbook covers log locations, rate-limit tuning, and how to revoke the WAN forward. The matching pfSense rule was tightened in place (TCP-only + rate limits) via SSH; pfSense isn't Terraform-managed.	2026-05-10 11:12:39 +00:00
Viktor Barzin	0d8e0ca6fc	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. 
Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
Viktor Barzin	57250cfda2	mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit). innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom without changing the buffer pool config. /srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV 1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB free. The post-mortem also covers the stale-NFS-client trigger (legacy /usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP) and the resulting wedged kthread on the PVE host. Script removed and node_exporter restarted out-of-band; kthread will clear at next PVE reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:38 +00:00
Viktor Barzin	ffa1d6d5dc	[woodpecker] Programmatic Forgejo repo registration Earlier I claimed the OAuth Web UI flow was the only way to onboard new Forgejo repos in Woodpecker. That's wrong. Two parts to the actual workaround: 1. Woodpecker session JWTs are HS256 signed with the user's per-user `hash` column from the PG `users` table (NOT the global agent secret). Mint a session JWT for the Forgejo viktor user (id=2, forge_id=2), and you're authenticated as that user. 2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls Forgejo with viktor's stored OAuth access_token to create the webhook + per-repo signing key. Works. The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub admin), whose user row has no Forgejo OAuth token — Woodpecker's forge-API call fails for that user, surfacing as a 500. scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow: extract hash from PG → mint JWT → activate repo. Verified against viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in this session — all activated cleanly. Also updated the runbook with the actual mechanism + the WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the 'context deadline exceeded' failures, NOT the v3.14 upgrade).	2026-05-10 11:12:36 +00:00
Viktor Barzin	afafc9928f	[docs] Onboarding runbook for new Forgejo repos in Woodpecker	2026-05-07 23:29:35 +00:00
Viktor Barzin	3f3e5fc954	chrome-service: open NP for Traefik → noVNC sidecar (port 6080) Existing NetworkPolicy only admitted port 3000 (Playwright WS) from labelled client namespaces, blocking Traefik's traffic to the noVNC sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang forever — page never loads, eventually times out. Adds a second ingress rule allowing TCP/6080 from the traefik namespace only. Authentik forward-auth still gates external access at the Traefik layer. Also reconciles the noVNC image to the new Forgejo registry path (:v4 unchanged) — already declared in TF, just live-state drift from the Phase 3 registry consolidation. Updates the architecture doc; the previous text still described the old nginx static health stub that noVNC replaced.	2026-05-07 23:29:34 +00:00
Viktor Barzin	4ec40ea804	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull-through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end-to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:34 +00:00
Viktor Barzin	a3024d1f51	[docs] Forgejo registry image-rebuild runbook Companion to forgejo-registry-breakglass.md but for the more common case: the Forgejo registry is healthy as a whole, but one image's manifest/blob references are broken (orphan child, half-pushed upload, retention-vs-pull race). The RegistryManifestIntegrityFailure alert annotation already points here. Mirrors registry-rebuild-image.md (the registry-private equivalent) in structure: confirm via probe + curl, delete broken version through Forgejo API, rebuild via Woodpecker manual run, force consumers to re-pull, verify integrity recovery. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	fbb41eff9d	[ci] Phase 1: infra-ci dual-push + break-glass tarball Adds Forgejo as a second push target on the build-ci-image pipeline and saves the just-pushed image as a gzipped tarball on the registry VM disk (/opt/registry/data/private/_breakglass/) so we can recover infra-ci with `ctr images import` if both registries are down. * Dual-push: registry.viktorbarzin.me:5050/infra-ci AND forgejo.viktorbarzin.me/viktor/infra-ci, in the same woodpeckerci/plugin-docker-buildx step. Same image bytes; the Forgejo integrity probe (every 15min) catches any divergence. * Break-glass step: SSHes to 10.0.20.10, docker pulls + saves + gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant so a transient registry blip doesn't fail the build pipeline. * Runbook docs/runbooks/forgejo-registry-breakglass.md documents the recovery flow (when to use, scp+ctr import, node cordon, underlying-issue fix). Tarball mirrors to Synology automatically through the existing daily offsite-sync-backup job — no new sync wiring needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	f793a5f50b	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (*/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. 
Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	f18cd1d314	chrome-service: in-cluster headed Chromium pool for f1-stream verifier The f1-stream verifier's in-process headless Chromium kept tripping hmembeds' disable-devtool.js Performance detector (CDP latency on console.log vs console.table) and getting redirected to google.com. This adds a single-replica chrome-service stack running Playwright launch-server under Xvfb so callers can connect via WS+token to a shared headed browser. f1-stream's _ensure_browser now prefers chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored stealth init script (webdriver/plugins/languages/Permissions/WebGL spoofs + querySelector hijack to disarm disable-devtool-auto) on every new context. Falls back to in-process headless if the env vars aren't set. Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000 gated by client-namespace label, 6h tar.gz backup CronJob to NFS, Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human liveness checks. Image pinned to playwright:v1.48.0-noble in lockstep with the Python client's playwright==1.48.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:32 +00:00
Viktor Barzin	4c8d12229f	mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported for ai_socktype` — the daemon respawns get throttled by postfix master, and real client connections that land mid-respawn time out. We saw this as ~50% timeout rate on public 587 from inside the cluster. Layer 1 (book-search) — stacks/ebooks/main.tf: SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local Internal services should use ClusterIP, not hairpin through pfSense+HAProxy. 12/12 OK in <28ms vs ~6/12 timeouts on the public path. Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php: Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc: 30145 → pod 25 (stock postscreen) 30146 → pod 465 (stock smtps) 30147 → pod 587 (stock submission) HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to redirect L4 health probes to those ports while real client traffic keeps going to 30125-30128 with PROXY v2. Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s. Inter dropped 120000 → 5000 since log-spam concern is gone. `option smtpchk EHLO` was tried first but flapped against postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP). Plain TCP accept-on-port check is sufficient for both submission and postscreen. Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck path and mark the "Known warts" entry as resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 19:45:33 +00:00
Viktor Barzin	cd96fb64a8	phpipam-pfsense-import: every 5min → hourly Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the heaviest single contributor in our hourly fan-out investigation (11.2 MB/s burst when it fired). Kea DDNS still handles real-time DNS auto-registration; phpIPAM inventory just lags by up to 1h, which we don't need fresher. Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.	2026-04-26 22:48:43 +00:00
Viktor Barzin	51bf38815c	vault: record Phase 3 vault Released-PV cleanup Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the proxmox-lvm/encrypted side stays out of scope.	2026-04-25 23:08:45 +00:00
Viktor Barzin	484b4c7190	vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1 + vault-2 today). The NFS fsync incompatibility identified in the 2026-04-22 raft-leader-deadlock post-mortem is no longer reachable — raft consensus log + audit log live on LUKS2 block storage with real fsync semantics. Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox dropped to zero after the rolling, so the resource is removed from infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster and will be reclaimed in Phase 3 cleanup. Lesson learned (recorded in plan): pvc-protection finalizer races the StatefulSet controller — pod recreates on the OLD PVCs unless the finalizer is patched out before pod delete. Force-finalize technique applied to vault-1 + vault-2 successfully. Closes: code-gy7h	2026-04-25 17:10:00 +00:00
Viktor Barzin	ac8d2f548b	paperless-ngx: migrate to proxmox-lvm-encrypted Document scans (receipts, contracts, IDs) are unambiguously sensitive PII. Storage decision rule defaults sensitive data to `proxmox-lvm-encrypted`, but paperless-ngx had been left on plain `proxmox-lvm` by an abandoned migration attempt that left a dormant, non-Terraform-managed encrypted PVC sitting unbound for 11 days. Cleaned up the orphan, added the encrypted PVC properly via Terraform, rsynced data with deployment scaled to 0, swapped claim_name. Plain `proxmox-lvm` PVC retained for a 7-day soak before removal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:48:53 +00:00
Viktor Barzin	288efa89b3	vault: migrate vault-0 storage to proxmox-lvm-encrypted Phase 2 of the NFS-hostile migration: data + audit storageClass on the vault helm release switches from nfs-proxmox to proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between). vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part is what makes this safe (raft quorum maintained by 2 healthy pods while one is replaced). Also restores chart-default pod securityContext fields. The previous `statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}` block REPLACED (not merged) the chart's defaults — fsGroup, runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS exports were permissive enough to mask the missing fsGroup; ext4 LV volume root is root:root and the vault user (UID 100) couldn't open vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly, survives future chart bumps. vault-1 and vault-2 retained their correct securityContext from when their pod specs were written to etcd, before the partial customization landed — the bug only surfaces when a pod is recreated. Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap (recovery anchor). Refs: code-gy7h Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:19:49 +00:00
Viktor Barzin	43e4f3f68e	immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per commit on NFS contributed to the 2026-04-22 NFS writeback storm (2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS (append-only, NFS-tolerant). The init container that writes postgresql.override.conf is now gated on PG_VERSION presence — on a fresh PVC the file would otherwise make initdb refuse the non-empty PGDATA. First boot skips the override and initdb's cleanly; second boot (after a forced restart) writes the override so vchord/vectors/pg_prewarm load before the dump restore. Idempotent on initialised PVCs. Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC → REINDEX clip_index/face_index → 111,843 assets verified, external HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only). LV created on PVE host, picked up by lvm-pvc-snapshot. See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md. Phase 2 (Vault Raft) follows under code-gy7h. Closes: code-ahr7 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:47:30 +00:00
Viktor Barzin	4315ed5c2a	[backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count) cmd_prune_count's `log " Pruned: ..."` wrote to stdout, which the caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward (7d retention kicked in), pruned snapshots polluted the captured value with multi-line log text, breaking the Prometheus exposition format on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from Pushgateway). Snapshots themselves were always fine; only the metric push silently failed for ~9 nights, eventually triggering LVMSnapshotNeverRun (alert has 48h `for:`). Fix: redirect the inner log call to stderr so cmd_prune_count's stdout contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh` as the source-of-truth (was edited only on the PVE host) and updates backup-dr.md to point at the .sh and document the scp deploy. Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:30:58 +00:00
Viktor Barzin	344fce3692	[monitoring][poison-fountain] pushgateway persistence + cronjob uid-0 Two independent root-cause fixes surfaced by the 2026-04-22 cluster health check: 1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-backup-sync"} until the next 06:01 UTC push — a ~18h false-negative window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with --persistence.interval=1m. Chart note: values key is `prometheus-pushgateway:` (subchart alias), not `pushgateway:`. 2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100 but the NFS mount /srv/nfs/poison-fountain is root:root 755 and the main Deployment runs as root, so mkdir /data/cache fails every 6h. Set run_as_user=0 on the CronJob container (no_root_squash is set on the export). Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite sync; closes the recurring poison-fountain evicted-pod noise on the next 00:00 UTC cron tick. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:32:29 +00:00
Viktor Barzin	7dfe89a6e0	[redis] stabilise against node-crash flap cascade — RC1-RC5 fixes Five compounding factors produced the 2026-04-22 flap cascade: soft anti-affinity let 2/3 pods co-locate on k8s-node3 (which bounced NotReady→Ready at 11:42Z and took quorum), aggressive sentinel/probe timing amplified LUKS-encrypted LVM I/O stalls into spurious +switch-master loops, HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters, publish_not_ready_addresses=true fed not-yet-ready pods into HAProxy DNS, and realestate-crawler-celery CrashLoopBackOff closed the feedback loop. Changes: - Anti-affinity: preferred → required (one redis pod per node, hard) - Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000 - Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5 - HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s - Headless svc: publish_not_ready_addresses true→false Post-rollout verification clean: 0 flaps, 0 +switch-master events, 0 celery ReadOnlyError in the 60s window after settle. Docs updated.	2026-04-22 15:59:00 +00:00
Viktor Barzin	e2146e6916	gpu: schedule off NFD label, not k8s-node1 hostname Remove every hardcoded reference to k8s-node1 that pinned GPU scheduling to a specific host: - GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez, audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is auto-applied by gpu-feature-discovery on any node carrying an NVIDIA PCI device, so the selector follows the card. - null_resource.gpu_node_config: rewrite to enumerate NFD-labeled nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual 'kubectl label gpu=true' since NFD handles labeling. - MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] -> nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off the GPU node) but portable when the card relocates. Net effect: moving the GPU card between nodes no longer requires any Terraform edit. Verified no-op for current scheduling — both old and new labels resolve to node1 today. Docs updated to match: AGENTS.md, compute.md, overview.md, proxmox-inventory.md, k8s-portal agent-guidance string.	2026-04-22 13:43:07 +00:00
Viktor Barzin	134d6b9a82	vault runbook + raft/HA stuck-leader alerts Post-2026-04-22 Step 5 deliverables: - docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart sequence that avoids zombie containerd-shim + kernel NFS corruption, qm reset no-op gotcha, boot-order gotcha. - prometheus_chart_values.tpl — VaultRaftLeaderStuck + VaultHAStatusUnavailable. Silent until vault telemetry scraping lands (tracked as beads code-vkpn). Epic for moving vault off NFS tracked as beads code-gy7h.	2026-04-22 12:44:46 +00:00
Viktor Barzin	4cb2c157da	post-mortem 2026-04-22: full timeline — second regression + node4 reboot The initial recovery at 11:03 was premature; vault-1's audit writes over NFS started hanging ~15 min later and the cluster regressed to 503. Full recovery required rebooting node4 (to free vault-0's stuck NFS mount and shed PVE NFS thread contention) and a second reboot of node3 (to clear another round of kernel NFS client degradation). Final recovery at 11:43:28 UTC with vault-2 as active leader on the quorum vault-0 + vault-2. vault-1 remains stuck in ContainerCreating on node2 — a third node2 reboot is required for full 3/3 quorum, but 2/3 is operationally sufficient, so that's deferred.	2026-04-22 11:44:56 +00:00

172 commits