infra

Author	SHA1	Message	Date
Viktor Barzin	5c378dd5e3	workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4) New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API). Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:50:40 +00:00
Viktor Barzin	173b1fc116	workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2) New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged. Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:47:00 +00:00
Viktor Barzin	c611ecf84d	workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip] multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:27:17 +00:00
Viktor Barzin	08bf1e0a3a	workstation: per-user writable git-crypt-locked infra clone (Phase 3.1) install_locked_clone: non-admins get their OWN ~/code = a keyless clone of the public infra repo (the monorepo has no remote, so the locked clone is of infra). filter.git-crypt=cat + --no-checkout ⇒ code/docs plaintext, secret files (.tfvars/.tfstate/secrets/**) stay \0GITCRYPT\0 ciphertext. Writable + ungated (push != apply). Skip-if-exists ⇒ never touches emo's existing ~/code symlink (gated cutover handles that). Verified live on ancamilea: secrets ciphertext, code plaintext, commit works, emo untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:23:57 +00:00
Viktor Barzin	2c1865eabb	workstation: roster-driven provisioner (SSoT reconcile, additive-only) t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target). Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:18:12 +00:00
Viktor Barzin	3feb69e379	workstation: pin verified config-inheritance mechanism in design §4 [ci skip] Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:09:13 +00:00
Viktor Barzin	1757cb59e7	workstation: machine-wide config inheritance (managed claudeMd + setup-devvm.sh + skel) Spike confirmed (claude 2.1.168): /etc/claude-code/managed-settings.json claudeMd reaches a session (sentinel echoed). Hybrid inheritance = enforced org claudeMd machine-wide (top precedence, non-overridable) + per-user ~/.claude/{skills,rules,...} symlinks to the config base (live, the proven emo pattern) seeded via /etc/skel. setup-devvm.sh is idempotent: apt toolset, node>=18 + claude-code, system-wide kubelogin (NOT the Azure apt pkg), the managed config, and /etc/skel (launcher that cd's $HOME/code, tmux UX, inheritance symlinks). Verified: emo unchanged (groups/symlinks/live sessions intact), emo can read the managed config, idempotent re-run clean. Security fix (host state): /home/wizard/.claude/settings.json was 0664, exposing MEMORY_API_KEY to all devvm users -> chmod 0600. chezmoi source needs a private_ prefix + the key templated out to persist this (dotfiles-repo follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:07:04 +00:00
Viktor Barzin	55d4b4cf2d	workstation: correct devvm RAM (8->24GB) + record 8G swap & capacity budget [ci skip] devvm is the t3code Workstation host. Added an 8 GiB swapfile (swappiness=10, fstab-persisted) to turn multi-user OOM-kills into graceful paging (was 0 swap, ~1.2 GiB free of 23). Capacity budget: ~4-5G RAM per active user, max ~3-4 concurrent active sessions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:48:52 +00:00
Viktor Barzin	3033e2c355	workstation: roster source-of-truth + host package manifest [ci skip] roster.yaml is the single source of truth for the devvm Workstation lifecycle (os_user -> authentik_user/k8s_user/tier/namespaces); wizard listed as admin so the regenerated ttyd-map/dispatch never drops his instance. packages.txt is the declarative apt toolset (non-apt tools — node/claude-code/kubectl/vault/kubelogin — noted with their real install paths; the apt pkg named 'kubelogin' is the wrong Azure tool). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:38:20 +00:00
Viktor Barzin	7ab4c1e1e2	workstation: tested roster derivation + offboarding-diff engine [ci skip] Pure functional core (PRD ViktorBarzin/infra#9 modules #1 roster engine + #5 offboarding diff) that the bash provisioner will consume as JSON: roster parse/validate, fail-loud tier-vs-k8s_users check, sticky-port + ttyd-map + dispatch derivation, additive-only group reconcile, and the staged offboarding diff (reversible cut vs gated userdel, never auto). 27 pytest cases, ruff-clean; no host I/O in the tested path. Verified to reproduce the live dispatch.json byte-for-byte from the real roster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:38:06 +00:00
Viktor Barzin	6504911a77	matrix: open (tokenless) registration + bot mitigations + #security alert User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser challenges break native clients). Bot defense is layered instead: - Traefik rate-limit Middleware on a path-scoped /register ingress carve-out, keyed on request Host (GLOBAL /register cap) not source IP — the host is reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min, burst 20, per replica; CrowdSec is the hard backstop on both paths. - Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing #security Slack receiver (matches "registered on this server", never the rejection line). tuwunel's admin bot also posts signups to the admin room. Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert). Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so [ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:27:02 +00:00
Viktor Barzin	bb7bcf803b	multi-user-workstation: design + phased implementation plan devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 12:58:29 +00:00
Viktor Barzin	3d6c5b8bc7	matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel) The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel uses native password auth, so nothing consumed it. Deleted application `matrix` + OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale Matrix rows from the SSO reference tables and update the plan's residual list. Doc-only [ci skip]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 12:32:49 +00:00
Viktor Barzin	23602f393e	matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated) Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB drops the CNPG dependency (both init-containers, the db ESO, the Reloader annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation on, tuwunel-served well-known delegation to :443. server_name unchanged (matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path). Registered @viktor admin then disabled registration (403). Cleanup: removed the orphaned pg-matrix Vault static role and dropped the matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*. Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so [ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC tune-TTL drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:58:17 +00:00
Viktor Barzin	09514a234b	state(vault): update encrypted state	2026-06-08 11:51:06 +00:00
Viktor Barzin	7501ea286b	tripit: wire planner subsystem (merged trip-planner) secrets + Slack webhook ingress - ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD / CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges). - New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated, same pattern as the calendar + emails-confirm carve-outs). - Remove the superseded standalone stacks/trip-planner (merged into tripit per the "future travel logic goes in tripit" policy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 09:26:21 +00:00
Viktor Barzin	838343184b	stem95su: document on-demand Drive→NFS deploy (no scheduled job) CI/CD for the stem95su site is intentionally ON-DEMAND, not a CronJob: the content is short-term and a scheduled job + Vault secret + ESO + GCP "publish to Production" would be rotting artifacts. Instead, mirror the source Google Drive folder "claude" → /srv/nfs/stem-site via a throwaway rclone container using the existing google_workspace OAuth creds (secret/viktor), rsync to NFS with an empty-source guard, then shred the temp config. Verified end-to-end. Recipe in claude-memory. Doc-only: corrects the service-catalog update-mechanism note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	d4ec5768b2	vault-token-renew: version the devvm renewer + user units in the repo The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.) - scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access). - scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions). - docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery. - docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	f9d5cd6243	feat(tripit): wire real flight (AeroDataBox) + rail (RealtimeTrains) status Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was fabricated from a hash and never matched reality. Switch to real providers: - FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the tripit-secrets ExternalSecret) - RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault) - poll-flights cron */30 -> hourly to respect the free 600 req/month cap (provider also self-throttles to <=1 req/sec) Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with real schedule/terminal/aircraft. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
root	b1ccbd12e8	Woodpecker CI Update TLS Certificates Commit	2026-06-07 22:10:06 +00:00
Viktor Barzin	0d445d948c	stem95su: host STEM platform for 95. СУ (public NFS-backed static site) New public static site at stem95su.viktorbarzin.me serving the school's Bulgarian STEM platform (dashboard + lessons/games, externally authored HTML/media exported from Gemini). - Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume), NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool" or rsync), no rebuild; auto-backed-up offsite by nfs-mirror. - ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge), dns_type="proxied" (Cloudflare CNAME auto-created). - nginx ConfigMap sets index stem_board.html (the dashboard) for "/". - Docs: service-catalog entry + new "Static Site Hosting" pattern (NFS-backed vs image-baked) in patterns.md. Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB page, video byte-range, no Authentik redirect) through the public Cloudflare path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 15:21:21 +00:00
Viktor Barzin	c7ffbaa204	aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix) Root cause of "barely serving films": Real-Debrid's May-2026 infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate new content), while degraded sources starved candidates. RD account + popular-title availability were healthy throughout (library 32/36 unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams). Runtime config (AIOStreams PG, applied via API — not in this diff): - Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title) and was silently dropping the bulk of its results at the 5s cutoff; Interstellar 430 -> 987 streams after the bump. - Removed MediaFusion preset: broken upstream ("Invalid configuration" -> 500 Internal Server Error), contributed 0 usable streams, only a dead [X] entry in every list. This diff (Terraform): - Harden aiostreams-stream-probe: test series AND movie paths, per-source breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count, success gated on Comet being alive. The old probe counted only Breaking Bad streams and stayed green while new-content playback was broken. - service-catalog: reflect source set + probe behaviour. [ci skip] — probe already applied via targeted `tg apply` + verified (series=378 movie=898 comet=206 errors=0 success=1); skipping the full servarr reconcile to avoid touching unrelated pre-existing drift (qbittorrent MetalLB annotation, tls_secret cert revert). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 07:21:42 +00:00
Viktor Barzin	4cdb9e1886	novelapp: switch Keel to semver (policy=major) now upstream tags are valid Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so drop the :latest+force+match-tag digest workaround and track semver properly: policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to climb to higher semver tags), image floor pinned to v1.1.3. Pull policy -> IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for the mutable :latest). Running v1.1.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 22:56:46 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Some checks failed ci/woodpecker/push/default Pipeline failed Details ci/woodpecker/push/build-cli Pipeline was successful Details Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	3696ff5922	novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as `v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see past the highest parseable tag. :latest correctly points at the newest release, so switch to force + match-tag digest-tracking of :latest (Kyverno does not manage match-tag, contrary to the stale code comment). Imports the live Deployment (recreated out-of-band 2026-06-06) back into TF state; running image flipped to :latest -> now on v.1.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4d8b782df1	feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress) Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database (static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m cpu request), ClusterIP service port 8080, and ingress_factory with auth=none (Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied; requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	7c12fbba95	monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1 Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1 Endpoints API will essentially never be removed (clients keep working indefinitely), and even the latest Calico still watches Endpoints (projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact deprecation message), mirroring the mailserver drop. Real calico warnings/errors kept; reversible. Validated with alloy fmt (exit 0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4b13be6d48	dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables) dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit rails_pulse_routes (no primary key) and retry-looped, logging ~125 UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream removed RailsPulse entirely in 1.7.x (commit a5172cc) with a DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only auto-applies patch bumps within 1.6.x, so the minor jump is manual. Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm. The 5 rails_pulse_* tables are empty (feature never collected data), so cleanup is zero-data-risk; location data (tracks/points/visits/places) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	8a3bbde38c	mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error source, from the 2026-06-06 log triage): 1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file in our postfix-main.cf override were IGNORED and triggered postfix's 'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two legacy lines (functional no-op; chain_files already wins). Verified via live postconf. 2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen half-open drops, rate-limit-exceeded from unknown). Real delivery logs + real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so security posture is unchanged. Validated with 'alloy fmt' (exit 0). Reversible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	de181a9afc	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	27211acda1	rybbit: recreate missing Postgres database via idempotent init Job rybbit's 'rybbit' PG database was missing from CNPG (the role survived a past cluster rebuild but the database did not), so the app's node-cron logged 'database "rybbit" does not exist' every minute (found via Loki 2026-06-06). Created the DB manually to restore service (app auto-migrated 11 tables); this adds a self-contained init Job so the DB is recreated on any future rebuild -- connects as the rybbit role (has CREATEDB) using the existing rybbit-secrets password, idempotent CREATE DATABASE if absent. Deployment now depends_on the job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9529eedfe0	docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip] Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain. - security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step). - .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9ad7756a94	traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0 bot-block-proxy is the forward-auth target for the ai-bot-block middleware (applied to every anti-AI ingress). It proxied /auth to the poison-fountain bot trap with error_page 5xx=200 fail-open. But poison-fountain is intentionally scaled to 0, so proxy_pass only ever failed and fell open to '200 allowed' -- while logging ~51k errors/hr (the #1 Loki source once pod logs began shipping 2026-06-05) and paying up to 100ms connect-timeout per authed request. Short-circuit /auth to 'return 200 "allowed"' directly (drop the upstream + proxy_pass + fallback). Identical effective behaviour (allow-all), no upstream attempt, no noise, no latency. Reversible: restore the upstream + proxy_pass and scale poison-fountain up. Also add the missing configmap.reloader.stakater.com/reload annotation so openresty picks up ConfigMap changes (it does not hot-reload on its own -- the root reason stale config ran for days). replicas stays 2: critical-path forward-auth target (anti-AI ingresses fail closed if it is down), so HA is retained though each request is now trivial. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	d70a99dc48	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d661d074ef	matrix: auto-reload Synapse on DB credential rotation (Reloader) Synapse injects the Postgres password into homeserver.yaml only at startup (inject-db-password initContainer). matrix-db-creds is rotated by Vault via ESO (15m refresh), so each rotation left the running pod with a stale password and Synapse DB auth failed silently until a manual rollout restart. Found today via Loki: ~12.9k/hr 'password authentication failed for user matrix' lines; secret password verified working against the DB while the 10-day-old pod held the pre-rotation value. Add the explicit secret.reloader.stakater.com/reload annotation so Reloader rolls the deployment whenever the secret changes (explicit form, not auto/search, because the secret is referenced only in an initContainer env var). Live pod already restarted to restore service; this prevents recurrence on the next rotation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	e7ece3eaf9	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
root	02366103ef	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d808694af4	docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile) 2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction, beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the 2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated (noted its 3.26GB image re-pull blip on node reschedule). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:39:36 +00:00
Viktor Barzin	1b9d4f1233	storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1) Some checks failed ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was canceled Details Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot. NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image services incur a pull-delay blip when migration moves them to a fresh node. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:38:01 +00:00
Viktor Barzin	355ca3ee91	proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention) Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN with no VolumeAttachment -> invisible oversubscription -> query-pci wedge). Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks (Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT .../config delete=scsiN -> frees the LUN slot, retains the LV). - detection mirrors cluster-health check #47 - SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5 - scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP - verified live: read 66 VAs, 0 ghosts, no false positives - pushes csi_ghosts_detected/detached to Pushgateway Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:25:36 +00:00
Viktor Barzin	e311cbe103	chore(modules): remove vestigial audiblez-web copy + fix glossary note [ci skip] modules/kubernetes/ebook2audiobook/ held a tracked copy of the audiblez-web app source (24 files), sourced by no stack and built by no CI — audiblez-web is GHA-built from its own repo. Bulk-swept in 2026-04-15; removed. Also corrected CONTEXT.md: the "vestigial per-app dirs (immich/, ollama/, ...)" note was wrong — those were untracked local macOS cruft (._main.tf AppleDouble turds), never in the repo; cleaned from the working tree. modules/kubernetes/ now holds exactly the four factory modules (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:38:13 +00:00
Viktor Barzin	a42f4f7b26	trek: trial-deploy TREK group-trip planner behind Authentik (solo eval) Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment trial to evaluate the self-hosted group-trip use case before building a custom app. Solo, single shared instance, Authentik forward-auth. - stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel), service 80->3000, ingress_factory auth=required + proxied DNS at trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data + uploads) -- encrypted per the sensitive-data rule and to avoid the SQLite-over-NFS locking hazard. - Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC, bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO). - kyverno: add mauriceboe/* to require-trusted-registries allowlist (the policy is Enforce since 2026-05-19 -- also fixed the stale "stays in Audit" header comment that said otherwise and misled the deploy). - Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll companion deferred per solo-trial scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:30:07 +00:00
Viktor Barzin	63182730f9	docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1) Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS (tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots. - storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history, updated cap-relief levers - topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the decision doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:15:21 +00:00
Viktor Barzin	a0b34750ee	storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief) codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath hackmd, 4.5M) — no embedded DB, NFS-safe. Drops LUKS-at-rest for these low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6. - swap hackmd-data-encrypted -> nfs_volume module (subPath preserved) - uploads copied + verified (20 files, HTTP 200, codimd listening) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:11:31 +00:00
Viktor Barzin	e35d693972	storage: migrate send off proxmox-lvm to NFS (LUN-cap relief) Send (timvisee/send) stores encrypted upload blobs on disk with metadata in Redis — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2. - swap send-data-proxmox -> nfs_volume module - blobs copied + verified (273M, 22 entries, HTTP 200 on NFS) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:04:37 +00:00
Viktor Barzin	c24b4a21d8	docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip] Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago) but several docs still said "5 nodes". Corrected with live specs: - overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said 32GB), node2-6 are 8c/32GB general workers - compute.md: "5-node" -> "7-node" cluster description - dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes" - mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6" Illustrative "0/5 nodes available" scheduler-error examples left as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:03:58 +00:00
Viktor Barzin	bf3608052b	tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather Enables Open-Meteo geocoding of lodging addresses (results cached in the new geocode_cache table) so the itinerary can show per-city weather. Applied manually via scripts/tg apply. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:01:31 +00:00
Viktor Barzin	6eb683b6e0	storage: migrate speedtest off proxmox-lvm to NFS (LUN-cap relief) speedtest-tracker is MySQL-backed (config dir = Laravel config + logs, no embedded DB), NFS-safe. Frees one proxmox-csi SCSI-LUN slot. - swap speedtest-config-proxmox -> nfs_volume module - config copied + verified (HTTP 302->login,200); excluded 383MB laravel.log - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:59:56 +00:00
Viktor Barzin	060aefbd0b	storage: migrate changedetection off proxmox-lvm to NFS (LUN-cap relief) changedetection uses a file-based JSON datastore (url-watches.json + per-watch dirs + brotli snapshots) — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of harden-proxmox-csi+NFS plan. - swap changedetection-data-proxmox -> nfs_volume module - data copied + verified (HTTP 200, 4 watches loaded); excluded 200MB test cruft - block PVC removed; block LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:55:03 +00:00
Viktor Barzin	52f5de905d	docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip] Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs): - Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app factory modules (never existed); name the real four (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local / flat distinction; flag vestigial modules/kubernetes/<app> dirs. - Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers); reserve "tier" for State tier + Namespace tier only. - Add local-path entry (cluster default SC; node-local footgun warning). - Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico. - Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC). - Fix node count 5 -> 7 (k8s-master + k8s-node1..6). Doc-sync (same commit per repo rules): - overview.md: replace fictional factory modules with the real shared modules + the flat/stack-local pattern. - .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision table + stale cross-reference (vault migrated off it 2026-04-25). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:49 +00:00

4074 commits