infra

Author	SHA1	Message	Date
Viktor Barzin	3cc8f9f661	paperless-ngx: keep mem limit at 8Gi (tier LimitRange caps containers) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The prior commit set the limit to 10Gi, but the shared tier-defaults LimitRange caps per-container memory at 8Gi, so the rollout's new pod was forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for 6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored service live via kubectl patch; this commit matches TF to the live 8Gi so drift detection won't re-revert it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 19:37:59 +00:00
Viktor Barzin	21d20dccf8	paperless-ngx: bulk-import via PVC consume dir (restart-safe) + 6 workers Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Emo's ~13.7k-document import was going through the API upload path, which stages each file on the pod's EPHEMERAL scratch before queuing it. Any paperless pod or redis restart therefore destroyed all in-flight work (the "File not found" failures we hit) and required manual re-uploads. Move bulk ingest to paperless's consume directory placed on the encrypted PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned periodically (and on startup) with a file-stability check. Files now live on durable storage and survive any restart — the folder is the queue and self-heals, so we can copy everything in fast and let it process over time with zero retry/integrity risk. RECURSIVE preserves the source tree (avoids basename collisions); owner+tag come from a consumption workflow. Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6 has the core headroom for one pod) and mem limit 8->10Gi for the extra workers. Revert workers/mem/consume envs to defaults once the import ends. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 19:35:10 +00:00
Viktor Barzin	2cb37d51d4	paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import All checks were successful ci/woodpecker/push/default Pipeline was successful Details Bottleneck found: single Gotenberg 503s under concurrent workers (office docs failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so: - Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel office conversion). - paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota (requests.memory 3840/4096Mi). - PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born- digital/office docs (big IO saver for the work-doc set). Guard + etcd watch stay in place; revert to defaults after the import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 18:45:25 +00:00
Viktor Barzin	d6bd9486e3	Merge remote-tracking branch 'origin/master' into wizard/portal-onboarding-paths Some checks are pending Build k8s-portal / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details	2026-06-27 16:34:44 +00:00
Viktor Barzin	fca948a23d	k8s-portal: document all three cluster-access paths in onboarding The Getting Started portal only walked through the heaviest path (local VPN + kubectl + Vault + sops install) and never mentioned the two zero-setup routes that users actually reach first. Restructure onboarding to lead with all three, recommendation first: (A) the t3 web terminal, which drops you into a ready shell with kubectl/Vault/repos preinstalled; (B) the k8s web dashboard, auto-authenticated per user; and (C) the existing own-machine setup. Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable, reframe the misleading home-page 'VPN required' banner (only path C needs it), add the access endpoints to the service catalog, and fix a stale Vaultwarden URL (was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is vaultwarden.viktorbarzin.me). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 16:34:36 +00:00
Viktor Barzin	9599beadc9	paperless-ngx: 2 task workers + 2 threads/worker + 4Gi limit for the Emo bulk import Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc = multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs. Kept deliberately modest: archive writes hit the shared sdc HDD that etcd also lives on (IO-storm risk, code-oflt); watch etcd apply latency and revert workers to 1 if it degrades. Revert to defaults once the import is done. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 16:33:43 +00:00
Viktor Barzin	c13a3f1694	plotting-book: pull image from private ghcr instead of public DockerHub Anca's plotting-book app now builds its image in her own GitHub repo to the private package ghcr.io/passionprojectsanca/book-plotter (off public DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it: - stacks/plotting-book: point the deployment baseline image at the ghcr package and add imagePullSecrets {ghcr-credentials} so the pod can pull the private image (the live tag is still CI-owned via ignore_changes). - stacks/kyverno: add the plotting-book namespace to the ghcr-credentials allowlist so the Kyverno generate policy clones the pull secret into it. Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo) can read the private package before wiring this. Docs: correct ci-cd.md (it wrongly listed plotting-book as already on ghcr — it was on DockerHub) and note the special arrangement; amend ADR-0003 to record that this GitHub-first repo builds to its own org's ghcr namespace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:32:19 +00:00
Viktor Barzin	5b49634fe0	rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The edge-ban sync was failing every 2 min on Cloudflare HTTP 429 (rate-limited) and never recovering, leaving the crowdsec_ban list empty. Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within seconds, so each /2 cycle fired a burst of POSTs into Cloudflare's per-60s Lists-API write limit. That kept the throttle perpetually tripped (it stopped clearing even after minutes of quiet) — a self-inflicted DoS. Two changes make the sync gentle and self-healing: - backoff_limit 2 -> 0: one attempt per /2 cycle (the schedule IS the retry cadence), no rapid-fire burst. - lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s retry. Any other CF error still fails loud. Found during a cluster health check (AIOStreams CSI + pfSense SSH issues handled separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 15:23:42 +00:00
Viktor Barzin	f92ab04dae	vault: grant emo read-only access to his own secret/emo emo (power-user tier) had no Vault policy granting his personal secret path, so `vault kv get secret/emo` failed. Viktor asked to give him that access. Adds a read-only `personal-emo` policy (read on secret/data/emo + metadata) and attaches it to emo's OIDC identity by adopting the entity/alias Vault auto-created on his first login. Scoped explicitly to emo; does not widen the power-user tier (which stays secret-less). Verified live: a personal-emo token reads secret/emo, is denied writes, and is denied other paths (secret/viktor -> 403). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 13:35:57 +00:00
Emil Barzin	a7117e0bfe	immich(frame-emo): bump photo-frame Interval 30->45s All checks were successful ci/woodpecker/push/default Pipeline was successful Details Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame slideshow from 30s to 45s per image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 13:07:00 +00:00
Viktor Barzin	d50962b00e	immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app. Dedicated API key minted on Emo's account and stored in Vault (secret/immich -> frame_api_key_emo). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 12:40:29 +00:00
Viktor Barzin	e8b72019b5	paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi All checks were successful ci/woodpecker/push/default Pipeline was successful Details Emo's import scope now includes his work-PC document set (C/Documents, Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless can only archive/OCR/index those if it can convert them, so add the standard Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them. Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC. Total in-scope document set across all NAS locations is now ~13,700 PDF+Office files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm autoresizer grows on demand up to the ceiling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 12:02:04 +00:00
Viktor Barzin	7988a690ed	paperless-ngx: add Bulgarian OCR (bul+eng) + raise data PVC ceiling to 30Gi Preparing Paperless for Emo's document import from the NAS. His archive is Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no 'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned BG documents would OCR to garbage and be unsearchable. Add bul to the install list and set OCR_LANGUAGE=bul+eng. Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything (originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap mid-ingest. The topolvm autoresizer grows the volume on demand up to the ceiling; 30Gi gives ample headroom. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:17:13 +00:00
Viktor Barzin	82a7b2585b	chrome-service: reconcile state after pipeline #366 was killed mid-apply + document cancel-previous hazard All checks were successful ci/woodpecker/push/default Pipeline was successful Details Pipeline #366 (the SHA-pin apply, commit `7b4a8ba8`) was SIGKILLed mid-apply by Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it was still running — the apply log ends at '[chrome-service] Starting apply...' with no 'Apply complete!', so the terraform state write did not finish. The live deployment is correct (image = the supervised SHA, verified, self-healing), but the stored state may be stale; this commit re-triggers a clean changed-stack apply to reconcile it (comment-only change → 0 resource changes, no rollout). Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for the apply pipeline to finish before pushing again (memory id=1957). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:15:41 +00:00
Viktor Barzin	7b4a8ba867	chrome-service: pin noVNC image to the x11vnc-supervision build Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Deploys the self-heal fix from the previous commit. Keel is off for this deployment (keel.sh/policy=never, because the browser container's playwright image is version-pinned to f1-stream) and the novnc image was :latest with imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a rollout — the supervised entrypoint would never reach the running pod. Pin novnc to :`19d0f0933a` (the build of the prior commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the sidecar onto the new image. Future novnc entrypoint changes deploy by bumping this digest after build-chrome-service-novnc.yml publishes a new SHA tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:04:55 +00:00
Viktor Barzin	19d0f0933a	chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-novnc / build (push) Has been cancelled Details The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc sidecar) attaches to the browser container's Xvfb over localhost:6099, and when that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X connection and exited. Because the entrypoint ran x11vnc as an unsupervised background child and then exec'd websockify as PID 1, the dead x11vnc was never relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning 'Connection refused', and the view was black until a manual pod restart. Fix: the entrypoint now runs both x11vnc and websockify as supervised background children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge now self-heals across browser-container restarts. Mirrors the android-emulator stack's supervision pattern. Architecture doc updated with the new failure mode, diagnosis, immediate-recovery, and SHA-pin deploy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:03:29 +00:00
Viktor Barzin	fd33d1a447	monitoring: consolidate all Slack alerting to #alerts, abandon #security Some checks are pending ci/woodpecker/push/default Pipeline is running Details The dedicated #security Slack channel was unreachable: the shared incoming webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a Slack app that isn't a member of #security, so any channel override on it returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently failing for that reason. Per request ("dump the security channel, post in an existing one"), route everything to #alerts instead: - alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>] title styling so security-lane alerts still stand out in the shared channel) - goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value was already switched and applied last change) - AggregatorDown / DigestFailing alert summaries reworded to say #alerts - docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook, .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the "invite the app / flip back to #security" caveats and state the #security abandonment + #alerts consolidation as the current routing. Monitoring stack applied (alertmanager rolled, live config verified: slack-security channel is now #alerts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 13:29:44 +00:00
Viktor Barzin	196d0db4bd	rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests All checks were successful ci/woodpecker/push/default Pipeline was successful Details The SSO restore script backed up the live manifest with `cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/. The kubelet treats every file in that dir as a static pod, so the .bak became a SECOND kube-apiserver static pod. While both copies were identical it was harmless, but the instant `kubeadm upgrade` changed the real manifest's image to v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped (pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed out on "static Pod hash did not change after 5m" and rolled back. THIS was the real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a downstream symptom of the flip-flopping apiserver hammering etcd). Fix: write backups to a dedicated dir OUTSIDE the static-pod dir (/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The stray .bak that planted the landmine on 2026-06-18 was moved out manually 2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh, which is the same script) from ever re-creating it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 10:29:19 +00:00
Viktor Barzin	5d33327c30	postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host) Some checks failed ci/woodpecker/push/default Pipeline failed Details The postiz-postgres-backup CronJob still dumped from the chart's bundled `postiz-postgresql` host with a hardcoded `postiz-password`. That bundled PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so the host no longer resolves (NXDOMAIN) and every nightly run failed — firing BackupCronJobFailed, and leaving the postiz DB with no logical dump in the offsite pipeline. Connect via the app's own DATABASE_URL (from the postiz-secrets Secret, postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead of a hardcoded host/user/password, so the backup tracks the live DB and credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect to CNPG 16.9 and produce a 180K custom-format dump. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 09:34:42 +00:00
Viktor Barzin	1bca799bb4	monitoring: give kube-state-metrics a 512Mi memory limit (Burstable) Some checks failed ci/woodpecker/push/default Pipeline failed Details kube-state-metrics had no explicit resources, so the monitoring-namespace LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles around 45Mi but momentarily spikes past 256Mi during a full object relist (450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM blacks out the KSM-exported series that ~10 alert rules read, so they all fire false "<svc>Down" criticals at once and self-resolve when KSM recovers ~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC. Set explicit Burstable resources: keep the request low (64Mi, just above idle) so we don't reserve memory we don't use, and raise only the limit to 512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 09:06:31 +00:00
Viktor Barzin	ebc8b6588f	ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep) Some checks failed ci/woodpecker/push/default Pipeline failed Details The 2026-06-22 external-secrets v1 migration made the ESO controller the server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any stack defining one via kubernetes_manifest fails `terraform apply` with a field-manager conflict the next time it's applied (instagram-poster + grafana hit this on 2026-06-24; it was latent across the whole fleet). Add field_manager { force_conflicts = true } to all 101 remaining ExternalSecret manifests across 70 stacks, matching the fix already on grafana / woodpecker / traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value, so it's stable (no perpetual drift). Defuses the landmine before each stack's next apply trips it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 21:28:11 +00:00
Viktor Barzin	6c5288998f	goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts All checks were successful ci/woodpecker/push/default Pipeline was successful Details Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last): - #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack. - #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48. - #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place. - #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope). Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver — flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 17:49:25 +00:00
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Some checks failed ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline failed Details Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master' Some checks failed ci/woodpecker/push/default Pipeline failed Details	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/ k8s-version-upgrade stacks.	2026-06-24 12:25:36 +00:00
Viktor Barzin	1d0388da12	Merge remote-tracking branch 'origin/master' All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-24 12:22:58 +00:00
Viktor Barzin	92361f36db	calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability) Turns on Calico 3.30's native east-west flow observability so we can see which Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker notifications=Disabled so the UI doesn't call the external Tigera endpoint. Applied supervised: creating the Goldmane CR re-rendered calico-node with the FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy, goldmane is receiving flows from all nodes, Whisker UI serves. Durable Loki persistence is NOT included here: the Goldmane emitter is Calico Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override only name+resources, not env), so a durable trail needs a small custom gRPC consumer of goldmane:7443 — tracked in issue #58. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:22:48 +00:00
Viktor Barzin	e711b2f971	feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume) Some checks failed ci/woodpecker/push/default Pipeline failed Details Build infra CLI / build (push) Has been cancelled Details Adds a Loki ruler group (lane=security -> #security) for the homelab vault op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine (Vault audit device, reads of secret/data/workstation/claude-users/*) is already captured. True CLI-bypass detection needs cross-stream correlation (follow-up).	2026-06-24 10:31:32 +00:00
Viktor Barzin	0293b5c634	android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep All checks were successful ci/woodpecker/push/default Pipeline was successful Details Caught live-testing the previous commit: every sleeper run exited 141 (SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause: `set -o pipefail` + `dumpsys power | awk '...; exit'`: awk closes the pipe after the first match while `kubectl exec` is still streaming dumpsys, so the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the script before any echo. (My earlier dry-run missed it because it didn't run under `set -euo pipefail`.) Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and a failed/booting exec falls through to the fail-safe "do not sleep" branch. Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the `sh -c` wrapper. Verified live: ran the corrected script as the gate ServiceAccount against the stuck emulator (idle ~120h); it logged "idle >= 6h ... scaling to zero" and patched the deployment to replicas=0. The 6+ day pod is now asleep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:57:36 +00:00
Viktor Barzin	839fdb33c2	android-emulator: sleep after 6h idle (activity-based), fix never-sleeping All checks were successful ci/woodpecker/push/default Pipeline was successful Details The emulator was meant to scale to zero when idle but had been up 6+ days straight despite ~5 days with no real use. Two bugs: 1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC ports. A forgotten `adb connect` (no disconnect) holds that transport open forever, so every 15-min run saw "active" and reset the counter -- it never reached the sleep branch. (Right now: 4 such stale transports from pods on k8s-node3/node4.) 2. Even when it did reach the sleep branch, `kubectl scale --replicas=0` failed Forbidden -- the gate ServiceAccount can patch `deployments` but not `deployments/scale`. Switch the sleeper to measure actual use: time since last user activity (taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest uptime. No interaction for 6h -> sleep. This ignores idle/forgotten connections entirely. Scale down with a direct replicas patch on the named deployment (same path the wake gate scales up), so it needs only the existing `deployments` patch grant -- no `deployments/scale`. Now stateless (drops the idle-counter annotation; gate.py no longer sets it) and lighter on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep. Requested by Viktor: turn the dev-only emulator off when it hasn't been used for 6h. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:49:23 +00:00
Viktor Barzin	566447a698	k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan` with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort. That gate worked for patch upgrades but never for minors. Fix: pass the explicit `v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits "kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job. Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added field_manager.force_conflicts=true (benign — interval is semantically identical). This pattern affects all 104 migrated ESs fleet-wide (follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 06:06:14 +00:00
Viktor Barzin	98d2b89614	calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi startup spike (re-listing resources to build informer caches), both at/over the 256Mi limit, so the first time the pod restarted it could never finish startup (exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit was always too tight; it only bit once the pod restarted. Data plane was never affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom (now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration (which never touched calico); cluster churn was at most the trigger that exposed the tight limit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 12:46:28 +00:00
Viktor Barzin	68c240b8de	Merge remote-tracking branch 'origin/master' Some checks failed ci/woodpecker/push/default Pipeline failed Details	2026-06-23 09:56:25 +00:00
Viktor Barzin	7d297dc6b1	eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker). Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time, each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s 1.34 -> 1.35 on its next nightly run. Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox, but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent dependency lock file: no version selected"). Reconciled via `tg init -upgrade` and committed so `terragrunt apply`/CI work cleanly again. Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc marked COMPLETE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:55:51 +00:00
Viktor Barzin	59f2beda21	chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser All checks were successful ci/woodpecker/push/default Pipeline was successful Details Point the chrome-service container at the new chrome-service-browser image and launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the noVNC view — bundled Chromium has those codecs compiled out; only real Chrome carries them. connect_over_cdp callers (tripit fare scrape, homelab browser, snapshot-harvester) attach over raw CDP (version-tolerant) — validated after rollout. Image is built off-infra on GHA (prior commit) → public ghcr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:15:36 +00:00
Viktor Barzin	df1ec1879d	chrome-service: build a real-Chrome browser image (H.264/AAC codecs) Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-browser / build (push) Has been cancelled Details Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA build workflow. The bundled Chromium ships proprietary codecs compiled out, so H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs (libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips main.tf's launch to it once the image exists + is public. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:01:17 +00:00
Viktor Barzin	c670cb7118	eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1 Some checks failed ci/woodpecker/push/default Pipeline failed Details The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1 and v1, so this is the safe window — MUST land before 0.17 removes v1beta1 (there is no conversion webhook). Pure apiVersion bump, schema is byte-identical: 106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database) across 73 .tf files, v1beta1 -> v1, no other field changes. Validated live first on tandoor (single, non-coupled, synced ES): the kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods keep their mounted copy through the sub-second blip. All 110 target Secrets were snapshotted to /tmp first as a backstop. CI applies the changed stacks serially (staged rollout); watching aggregate ES sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest). Next: Phase 3 climb 0.16.2 -> 2.6.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 19:13:04 +00:00
Viktor Barzin	98cd535b97	authentik: lock chrome.viktorbarzin.me noVNC to Viktor only All checks were successful ci/woodpecker/push/default Pipeline was successful Details The chrome-service noVNC exposes Viktor's live logged-in browser sessions (Instagram etc. — he'll sign in there for homelab browser to reuse). It was auth="required" = any authenticated user, and "Home Server Admins" includes emo (emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a host-specific case to the domain-wide forward-auth restriction allowing only Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else, incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser (read-only RBAC blocks port-forward); this closes the human noVNC path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:09:27 +00:00
Viktor Barzin	a3cdc0d6d0	chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The noVNC view showed the browser in the top-left with the rest of the framebuffer black. Cause: Chrome launched with no --window-size, and there's no window manager, so it opened at its profile-persisted (smaller) size inside the 1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window fills the screen on every launch (fresh pods/profiles too). Live windows were already resized via CDP as a stopgap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:00:20 +00:00
Viktor Barzin	c7ead032ec	chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31) Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build chrome-service-novnc / build (push) Has been cancelled Details The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc sweeps the entire fd table (fcntl per fd) on every client connection, and containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes (websockify accepts the WS and dials localhost:5900, but x11vnc never sends its banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU spinning). Same bug + fix the android-emulator stack already carries. Cap nofile before x11vnc starts, in two places: - files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct) - main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]` so the cap applies deterministically on rollout even though the image is :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled). Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and notes the black-when-idle behaviour + the autoconnect URL. (A live x11vnc relaunch with the cap already unblocked the running pod; this makes it survive restarts.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:34:03 +00:00
Viktor Barzin	20ca5ee624	tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption) All checks were successful ci/woodpecker/push/default Pipeline was successful Details REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env (covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/ caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were already real (prior commits); this was the last fake link in the chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:30:47 +00:00
Viktor Barzin	f46b69f372	tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The web Deployment ran LLM_MODE=fake with no reel geocoder; only the ingest-plans CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current llama-swap image) with the ADR-0033 claude-agent-service fallback, plus REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus: this also un-fakes the in-app booking share import, which used the same fake LLM. MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 16:50:04 +00:00
Viktor Barzin	59f2070e56	tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image ("common_init_from_params: failed to create context"; llama-swap returns 502), which broke ALL tripit mail ingest text extraction — booking emails AND forwarded reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather than pin the SHARED llama-swap image (cross-user blast radius), repoint the ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads fine and extracts flight numbers + places reliably. Restores the auto-path (reels resolve via the Nominatim geocoder; bookings parse again). The broken qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 15:52:09 +00:00
Viktor Barzin	f96cde35bd	tripit: enable Nominatim POI geocoding for reel→Wishlist ingest All checks were successful ci/woodpecker/push/default Pipeline was successful Details Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a country + city, but the reel route was wired to the global geocoder, which here is GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved Place was created. The app fix (tripit 3c62d596) gave the reel route its own geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans CronJob (the only one running the reel route) so forwarded reels resolve to real venue coords + city + country. Isolated from the global geocoder, which stays openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 14:59:37 +00:00
Viktor Barzin	aeed461591	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)" All checks were successful ci/woodpecker/push/default Pipeline was successful Details This reverts commit `1595bddfc2`.	2026-06-22 08:31:17 +00:00

1575 commits