infra

Author	SHA1	Message	Date
Viktor Barzin	ae874e028d	postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy) krr 2026-05-22 flagged postiz-app as critically under-requested when it was running (gap 2.2 GiB above the 512Mi request). Postiz is currently uninstalled in the cluster — this change is only for when the stack is re-deployed later. No apply triggered now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:25 +00:00
Viktor Barzin	1b21d4819e	postiz: disable unused providers + pin temporal vs Keel force-policy Two changes in one commit because they are coupled — the DISABLED_PROVIDERS addition cannot land safely without the Keel exclusion on temporal: 1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed only 'instagram-standalone' connected; all other Postiz providers were idle-polling Temporal task queues. List excludes x, linkedin, reddit, threads, youtube, tiktok, pinterest, dribbble, slack, discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram, wordpress, nostr, farcaster. Keeps facebook + instagram + the standalone variant active. 2. temporal deployment needs keel.sh/policy=never (set live via kubectl annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0 on every helm reconcile because :0.20.0 is published in the same registry path but is a DIFFERENT (legacy Cassandra-based) image stream. Memory id 1933 trap; new variant captured in id 2315-2319. The annotation is set live (not in TF) because the existing TF block has lifecycle.ignore_changes = [keel.sh/policy] so the chart reconcile won't reset it. Long-term fix: add temporal to the Kyverno keel-mutate-existing exclude list so it survives a namespace re-label.	2026-05-21 10:04:22 +00:00
Viktor Barzin	eb99ee5635	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:10:38 +00:00
Viktor Barzin	53657d9952	infra: document auth = "app\|none" tier on every legacy ingress Sweep through the 30+ stacks that predated the auth = "app" tier and were tagged auth = "none" without a comment explaining why they weren't behind Authentik. Each is now self-documenting at the call site, so the tg-level anti-exposure guard passes and future readers don't have to reverse-engineer the intent. Flipped 6 stacks from "none" to "app" — their backends have their own user auth and the new tier records that more accurately: - navidrome (Subsonic user/password) - ntfy (deny-all default + user.db tokens) - nextcloud (WebDAV/CalDAV/CardDAV app passwords) - vaultwarden (Bitwarden-compatible token auth) - headscale (OIDC + preauth keys for Tailscale nodes) - paperless-ngx (app-layer login + API tokens) Kept "none" with a comment on the rest — they're genuinely public, webhook receivers, native-protocol endpoints, OAuth callbacks, or Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt), claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api, fire-planner /api, forgejo (git/OCI native clients), frigate (HA integration), immich/frame, insta2spotify /api, instagram-poster (meta fetcher), k8s-portal, matrix (native bearer), monitoring×2 (HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT), owntracks (HTTP Basic), postiz, privatebin (client-side enc), rybbit (analytics tracker), send (E2E file drop), tuya-bridge (API key), vault (own auth + CLI), webhook_handler, woodpecker (forgejo webhooks + OAuth), xray (×3 VPN transports). real-estate-crawler/main.tf:400 already had its comment from a prior edit — not touched here. No live state changes — auth = "app" produces the same middleware chain as auth = "none" (verified earlier this session). This commit is purely documentation + intent-tagging.	2026-05-11 19:25:48 +00:00
Viktor Barzin	bf752dffa5	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 21:57:01 +00:00
Viktor Barzin	fecfa211fd	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (`1e4eac53`) and the inode metrics fix landed (`02a12f1a`), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 19:56:16 +00:00
Viktor Barzin	e4f806abe3	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:53:49 +00:00
Viktor Barzin	5ea89ebcd7	postiz: disable signups (DISABLE_REGISTRATION=true) Admin account already exists; we don't want random users registering on the public-facing instance. Sign-in only from now on.	2026-05-10 00:02:27 +00:00
Viktor Barzin	cfe969fe43	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 17:41:04 +00:00
Viktor Barzin	36d5cebb5c	postiz: expose /uploads publicly so Meta IG fetcher can pull JPEGs Stories+feed posts via Postiz failed with state=ERROR and Postiz mistranslated the cause as 'Invalid Instagram image resolution max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL under https://postiz.viktorbarzin.me/uploads/... and Meta gets a 302 to the Authentik login page instead of bytes. Meta returns error 36001 (image not fetchable) which Postiz maps to that misleading resolution string. Split the ingress: /uploads/* on a public ingress (matches the instagram-poster /image+/original pattern), everything else remains behind Authentik forward-auth. /uploads contents are random UUIDs, low blast radius if scraped.	2026-05-09 12:29:39 +00:00
Viktor Barzin	e22c023c8c	postiz: wire INSTAGRAM_APP_ID/SECRET via ESO for IG-standalone provider Standalone provider (instagram-standalone OAuth flow) is what the user is trying after the FB-Login path was blocked by their Business Account ad-policy flag. Uses modern scope names (instagram_business_*), so no JS patch needed unlike the FB-Login provider.	2026-05-09 11:43:01 +00:00
Viktor Barzin	c2e61cdf31	postiz: wire FACEBOOK_APP_ID/SECRET via ESO for IG-Business integration	2026-05-09 09:19:43 +00:00
Viktor Barzin	60dd6c61b5	postiz: idempotent Job to drop default Text search attributes (Temporal SQL visibility caps at 3 Text attrs; auto-setup ships with 2, Postiz adds 2 more — gitroomhq/postiz-app#1504 )	2026-05-09 09:16:07 +00:00
Viktor Barzin	d3db617b1b	postiz: bump memory limit to 4Gi (was OOMing during NestJS startup)	2026-05-09 09:12:39 +00:00
Viktor Barzin	8c4a370a34	postiz: add Temporal sidecar; lock both stacks behind Authentik Postiz backend was crashlooping on connect ECONNREFUSED ::1:7233 — Postiz needs Temporal for cron/scheduled posts and the Helm chart doesn't bundle it. Added a single-replica temporalio/auto-setup:1.28.1 Deployment in the postiz namespace, backed by the bundled postiz-postgresql (separate `temporal` + `temporal_visibility` databases pre-created via init container), ENABLE_ES=false (Postiz only uses the workflow engine, not visibility search). Skips DYNAMIC_CONFIG_FILE_PATH because that file isn't bundled in auto-setup. Auth audit: - postiz: ingress now `protected = true` (Authentik forward-auth). Postiz also has its own login on top, but registration is no longer exposed to the open internet. - instagram-poster: split into two ingresses on the same host. `/image/*` stays public (Meta + Telegram fetch the 9:16 derivatives). Everything else (/healthz, /queue, /scan, /enqueue, /reject, /post-next) sits behind Authentik. The protected ingress sets dns_type=none — the public one already created the CF DNS record.	2026-05-09 09:08:21 +00:00
Viktor Barzin	7f7698991e	postiz + n8n: real DB URL + webhook-trigger approval - postiz: set DATABASE_URL/REDIS_URL pointing at the bundled subcharts; the chart does NOT auto-wire even when postgresql.enabled=true, so the prisma db:push was failing with empty DATABASE_URL. - n8n approval workflow: swap telegramTrigger -> webhook node so it works without an n8n-stored Telegram credential. Telegram bot's webhook is set via setWebhook to https://n8n.viktorbarzin.me/webhook/instagram-approval. Parse-callback Code node tolerates both shapes ({body:{callback_query:...}} vs {callback_query:...}) so a future move back to telegramTrigger doesn't break.	2026-05-09 00:55:19 +00:00
Viktor Barzin	71e3439650	postiz + instagram-poster: deploy fixes after first apply - postiz: pin chart name to 'postiz-app' (was 'postiz', wrong path) and override bundled bitnami subchart images to bitnamilegacy/* — Bitnami removed bitnami/postgresql + bitnami/redis from DockerHub in Aug 2025 (Broadcom acquisition). - postiz: enable initial registration (DISABLE_REGISTRATION=false) so first admin user can be created in UI; tighten after. - instagram-poster: add securityContext (fsGroup/runAsUser=10001) so kubelet chowns the PVC mount for the non-root 'poster' user; was crashing on alembic with 'unable to open database file'. - instagram-poster: bump image_tag to 24935ab4 (uvicorn now binds to port 8000 to match Service contract; was 8080 -> probe 404).	2026-05-09 00:47:14 +00:00
Viktor Barzin	7a93e09d2f	add postiz + instagram-poster stacks for IG Stories pipeline New stacks: - stacks/postiz/ — Postiz scheduler (Helm chart v1.0.5, image v2.21.7) with bundled PG/Redis, /uploads PVC on proxmox-lvm, JWT_SECRET via ESO from secret/instagram-poster. - stacks/instagram-poster/ — custom Python service that polls Immich for the 'instagram' tag, reformats photos to 9:16 with blurred-bg letterbox, exposes /image/<asset_id> publicly so Postiz can fetch. Image: forgejo.viktorbarzin.me/viktor/instagram-poster. n8n: 3 new workflows (discover, approval, post) for the Telegram inline-button approval UX. Adds ExternalSecret + env vars for TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, IMMICH_API_KEY, plus static URLs for the new service. Vault: seed secret/instagram-poster with telegram_bot_token, telegram_chat_id, immich_api_key, postiz_api_token, postiz_jwt_secret before applying.	2026-05-09 00:07:44 +00:00

18 commits