Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.
Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.
Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.
Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).
The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).
Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
threshold expired)
Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.
Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.
Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
validate the fix end-to-end
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stories+feed posts via Postiz failed with state=ERROR and Postiz
mistranslated the cause as 'Invalid Instagram image resolution
max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL
under https://postiz.viktorbarzin.me/uploads/... and Meta gets a
302 to the Authentik login page instead of bytes. Meta returns
error 36001 (image not fetchable) which Postiz maps to that
misleading resolution string.
Split the ingress: /uploads/* on a public ingress (matches the
instagram-poster /image+/original pattern), everything else
remains behind Authentik forward-auth. /uploads contents are
random UUIDs, low blast radius if scraped.
Standalone provider (instagram-standalone OAuth flow) is what the user
is trying after the FB-Login path was blocked by their Business Account
ad-policy flag. Uses modern scope names (instagram_business_*), so no
JS patch needed unlike the FB-Login provider.
Postiz backend was crashlooping on connect ECONNREFUSED ::1:7233 —
Postiz needs Temporal for cron/scheduled posts and the Helm chart
doesn't bundle it. Added a single-replica temporalio/auto-setup:1.28.1
Deployment in the postiz namespace, backed by the bundled
postiz-postgresql (separate `temporal` + `temporal_visibility`
databases pre-created via init container), ENABLE_ES=false (Postiz
only uses the workflow engine, not visibility search). Skips
DYNAMIC_CONFIG_FILE_PATH because that file isn't bundled in
auto-setup.
Auth audit:
- postiz: ingress now `protected = true` (Authentik forward-auth).
Postiz also has its own login on top, but registration is no
longer exposed to the open internet.
- instagram-poster: split into two ingresses on the same host.
`/image/*` stays public (Meta + Telegram fetch the 9:16
derivatives). Everything else (/healthz, /queue, /scan,
/enqueue, /reject, /post-next) sits behind Authentik. The
protected ingress sets dns_type=none — the public one already
created the CF DNS record.
- postiz: set DATABASE_URL/REDIS_URL pointing at the bundled subcharts;
the chart does NOT auto-wire even when postgresql.enabled=true, so
the prisma db:push was failing with empty DATABASE_URL.
- n8n approval workflow: swap telegramTrigger -> webhook node so it
works without an n8n-stored Telegram credential. Telegram bot's
webhook is set via setWebhook to https://n8n.viktorbarzin.me/webhook/instagram-approval.
Parse-callback Code node tolerates both shapes ({body:{callback_query:...}}
vs {callback_query:...}) so a future move back to telegramTrigger doesn't break.
- postiz: pin chart name to 'postiz-app' (was 'postiz', wrong path)
and override bundled bitnami subchart images to bitnamilegacy/* —
Bitnami removed bitnami/postgresql + bitnami/redis from DockerHub
in Aug 2025 (Broadcom acquisition).
- postiz: enable initial registration (DISABLE_REGISTRATION=false)
so first admin user can be created in UI; tighten after.
- instagram-poster: add securityContext (fsGroup/runAsUser=10001)
so kubelet chowns the PVC mount for the non-root 'poster' user;
was crashing on alembic with 'unable to open database file'.
- instagram-poster: bump image_tag to 24935ab4 (uvicorn now binds
to port 8000 to match Service contract; was 8080 -> probe 404).
New stacks:
- stacks/postiz/ — Postiz scheduler (Helm chart v1.0.5, image v2.21.7)
with bundled PG/Redis, /uploads PVC on proxmox-lvm, JWT_SECRET
via ESO from secret/instagram-poster.
- stacks/instagram-poster/ — custom Python service that polls Immich
for the 'instagram' tag, reformats photos to 9:16 with blurred-bg
letterbox, exposes /image/<asset_id> publicly so Postiz can fetch.
Image: forgejo.viktorbarzin.me/viktor/instagram-poster.
n8n: 3 new workflows (discover, approval, post) for the Telegram
inline-button approval UX. Adds ExternalSecret + env vars for
TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, IMMICH_API_KEY, plus static
URLs for the new service.
Vault: seed secret/instagram-poster with telegram_bot_token,
telegram_chat_id, immich_api_key, postiz_api_token,
postiz_jwt_secret before applying.