infra

Author	SHA1	Message	Date
Viktor Barzin	b98890d799	fix(beads-server): fix Workbench GraphQL URL for remote hosting Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS. For k8s hosting, init container patches this to relative /graphql path. Second ingress routes /graphql to port 9002 behind Authentik auth. - Init container copies static JS to writable emptyDir, patches URL - Pre-seeds store.json with Dolt connection config - Added /graphql ingress with Authentik forward-auth Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:53:57 +00:00
Viktor Barzin	375a3d91d5	[monitoring] Exclude websocket protocol from HighServiceLatency alert Traefik records websocket connection lifetimes (minutes to hours) as "request duration." When websockets close, the full lifetime pollutes the average latency metric — Authentik showed 6.7s avg (201s websocket avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across 12 services (Authentik, Vaultwarden, Terminal, HA, etc.). Changes: - Add protocol!="websocket" filter to HighServiceLatency alert expr - Raise minimum traffic threshold from 0.01 to 0.05 rps to filter statistical noise from services with <3 req/min - Remove .githooks/pre-commit file-size hook (blocked state commits) Validated against 7-day historical data: 637 breaches → ~2 with both filters applied (99.7% reduction). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:51:19 +00:00
Viktor Barzin	3e273399c1	fix(ci): add registry.viktorbarzin.me:5050 to imagePullSecrets Pipeline pods pull from registry.viktorbarzin.me:5050 but the registry-credentials secret only had auth for registry.viktorbarzin.me (without port). Containerd requires exact hostname:port match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:50:51 +00:00
Viktor Barzin	116fdcf82d	fix(ci): Woodpecker secret sync includes all event types The vault-woodpecker-sync script was creating global secrets with only push/tag/deployment events. Manual and cron-triggered pipelines couldn't access secrets, causing "secret not found" errors and pipeline failures. Also fixes three root causes of CI failures: 1. Pull-through cache corruption: purged stale blobs, added post-GC registry restart cron to prevent recurrence 2. Missing repo-level secrets: added registry_user/registry_password for the infra repo's build-ci-image workflow 3. Stuck pipelines: cleaned up 3 pipelines stuck in "running" since March Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:43:48 +00:00
Viktor Barzin	d9ed166640	fix(beads-server): add Authentik auth to Dolt Workbench - Set protected=true on ingress (Authentik forward-auth) - Remove unused DATABASE_URL env var (Workbench uses browser-based connection config) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:43:22 +00:00
Viktor Barzin	c33f597111	feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN Pipeline: DIUN detects new image versions every 6h → webhook to n8n → n8n filters (skip databases/custom/infra/:latest) and rate-limits (max 5/6h) → SSH to dev VM → claude -p runs upgrade agent. Agent workflow: resolve GitHub repo → fetch changelogs → classify risk (SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push → wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on failure. Changes: - stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict - stacks/n8n/workflows: version-controlled n8n workflow backup - docs/architecture/automated-upgrades.md: full pipeline documentation - AGENTS.md: add upgrade agent section - service-catalog.md: update DIUN description Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:38:27 +00:00
Viktor Barzin	27d7c91608	feat(beads-server): add Dolt Workbench web UI Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server namespace. Provides SQL console, spreadsheet editor, and commit graph visualization for the centralized beads task database. - Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied) - Connects to Dolt server via in-cluster service DNS - Added to cloudflare_proxied_names for external access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:32:45 +00:00
Viktor Barzin	c124a23390	fix(ci): add K8s pull secrets to Woodpecker agents Pipeline pods were failing with "authorization failed: no basic auth credentials" when pulling from the private registry. The WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES env var was in values.yaml but never deployed to the agents. Also removes the stale db-init job that used `-U root` (incompatible with CNPG's `postgres` superuser). The database already exists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:21:12 +00:00
Viktor Barzin	f7411327d1	fix(affine): update image tag 0.20.7 → 0.26.6 Image ghcr.io/toeverything/affine:0.20.7 was removed from ghcr.io, causing persistent ImagePullBackOff. Updated to latest stable 0.26.6. Prisma migrations run via init container on startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:49:46 +00:00
Viktor Barzin	8b004c4c94	feat(storage): migrate all sensitive services to proxmox-lvm-encrypted Reconcile Terraform with cluster state after manual encrypted PVC migrations and complete the remaining unfinished migrations. All services storing sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI plugin. ## Context Only Technitium DNS was using encrypted storage in Terraform. Many services had been manually migrated to encrypted PVCs in the cluster, but Terraform was never updated — creating dangerous state drift where a `tg apply` could recreate unencrypted PVCs. ## This change Phase 0 — Infrastructure: - Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters) - Add ExternalSecret for LUKS encryption passphrase to Terraform - Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`) with 1280Mi limit for LUKS2 Argon2id key derivation Phase 1 — TF state reconciliation (zero downtime): - Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import - Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates Phase 2 — Data migration (encrypted PVCs existed but unused): - Headscale, Frigate, MeshCentral: rsync + switchover - Nextcloud (20Gi): rsync + chart_values update Phase 3 — New encrypted PVCs: - Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover Phase 4 — Cleanup: - Deleted 5 orphaned unencrypted PVCs ## Services migrated (18 PVCs across 14 namespaces) ``` vaultwarden → vaultwarden-data-encrypted dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted nextcloud → nextcloud-data-encrypted forgejo → forgejo-data-encrypted matrix → matrix-data-encrypted n8n → n8n-data-encrypted affine → affine-data-encrypted health → health-uploads-encrypted hackmd → hackmd-data-encrypted redis → redis-data-redis-node-{0,1} headscale → headscale-data-encrypted frigate → frigate-config-encrypted meshcentral → meshcentral-{data,files}-encrypted ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:15:30 +00:00
Viktor Barzin	1613003d00	upgrade: vaultwarden 1.35.4 -> 1.35.7 Security fixes (1.35.5): 3 CVEs — org vault purge by unconfirmed owner (GHSA-937x-3j8m-7w7p), cross-org group binding unauthorized access (GHSA-569v-845w-g82p), refresh tokens not invalidated on stamp rotation (GHSA-6j4w-g4jh-xjfx). 2FA remember tokens now max 30 days. 1.35.6: Fix 2FA remember tokens broken in 1.35.5. 1.35.7: Fix 2FA for Android. Risk: SAFE (patch bump, no breaking changes) DB backup: yes (job: pre-upgrade-vaultwarden-1776280439, SQLite, 7 MiB) Config changes applied: none Flagged for manual review: none Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-15 19:14:21 +00:00
Viktor Barzin	cf578516e9	feat: auto-cleanup failed/evicted pods via Kyverno ClusterCleanupPolicy Add cleanup-failed-pods policy that runs hourly (at :15) to delete all pods in Failed phase cluster-wide. Prevents stale evicted and failed CronJob pods from accumulating and creating healthcheck noise. Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup controller permission to delete Pods (not included by default). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:37:49 +00:00
Viktor Barzin	de42acd68e	fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump - daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS noload mounts — in-flight writes have corrupt metadata from skipped journal replay, but core data is intact - daily-backup: clean up stale LUKS dm mappings from previous crashed runs before attempting to open - daily-backup: capture rsync exit code safely with set -e (\|\| pattern) - kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%) - actualbudget: patched custom quota 5Gi→6Gi (was at 82%) Verified: backup now completes status=0 (96 PVCs OK, 0 failed) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:21:51 +00:00
Viktor Barzin	9baefa22ab	fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape - technitium-password-sync: remove RWO encrypted PVC mount that caused pods to stick in ContainerCreating on wrong nodes. Plugin install now warns instead of failing when zip unavailable. - daily-backup: add LUKS decryption support for encrypted PVC snapshots using /root/.luks-backup-key. Uses noload mount option to skip ext4 journal replay. Also installed cryptsetup-bin on PVE host. - speedtest: disable prometheus.io/scrape annotation (no /prometheus endpoint exists, causing ScrapeTargetDown alert). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:12:32 +00:00
Viktor Barzin	601a83d84e	fix: CI pipeline image pull auth + shallow clone resilience [ci skip] - Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step pods can pull from private registry (registry.viktorbarzin.me:5050) - Add fallback in default.yml when HEAD~1 is unavailable (shallow clone with depth=1): fetch more history, or apply all platform stacks as safe default - Root cause: pipeline #243 failed because infra-ci:latest image couldn't be pulled (no imagePullSecrets on step pods) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:41:08 +00:00
Viktor Barzin	e23153cf03	chore: add pre-commit size guard and harden .gitignore - Add .githooks/pre-commit that blocks files >2MB (configurable via GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks - Expand .gitignore to block common binary/archive patterns (.tar.gz, .tgz, .iso, .img, .bin, .exe, *.dmg) - Add explicit root-level terraform.tfstate ignore rules - Remove stale redis-25.3.2.tgz helm chart (unreferenced) Prevents re-accumulation of large blobs after git history cleanup that reduced .git from 2.6GB to 128MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:13:18 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	bd41bb9230	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 - Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore and stepped migration. Switch to existingSecret, PgBouncer session mode. - Mailserver: migrate email roundtrip probe from Mailgun to Brevo API - Redis: fix HAProxy tcp-check regex (rstring), faster health intervals - Nextcloud: fix Redis fallback to HAProxy service, update dependency - MeshCentral: fix TLSOffload + certUrl init container for first-run - Monitoring: remove authentik from latency alert exclusion - Diun: simplify to webhook notifier, remove git auto-update [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:41:56 +00:00
Viktor Barzin	0256ccdccc	feat: add per-database backups for PostgreSQL and MySQL Add separate CronJobs that dump each database individually: - postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15) - mysql-backup-per-db: mysqldump per DB (daily 00:45) Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC. Enables single-database restore without affecting other databases. Also fixed CNPG superuser password sync and added --single-transaction --set-gtid-purged=OFF to MySQL per-db dumps. Updated restore runbooks with per-database restore procedures. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:39:33 +00:00
Viktor Barzin	3e9231ae0d	feat: augment outage report template with debugging context - Expand service list: add Home Assistant, Actual Budget, Audiobookshelf, Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD, Excalidraw, Wealthfolio, Send, Stirling PDF - Add structured debugging fields: error type, scope (just me vs others), when it started, URL accessed - Fix user report parser to extract all form fields into status.json - Show error type, scope, and start time in status page report cards [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:03:44 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	ca2680c189	fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14] - Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade - Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted eliminating the circular dependency where alertmanager couldn't alert about NFS failures - Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:05:33 +00:00
Viktor Barzin	0901dd5f61	state(monitoring): update encrypted state	2026-04-14 17:52:13 +00:00
Viktor Barzin	803cb5fd26	fix: convert Technitium zone sync from one-time Job to CronJob Secondary/tertiary DNS instances had no custom zones — only the primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job ran once at deployment and never synced new zones. New CronJob runs every 30 minutes: - Gets all zones from primary - Enables zone transfer on primary - Creates missing zones as Secondary type on replicas - Resyncs existing zones via AXFR Fixes .lan resolution failures (2/3 queries returned NXDOMAIN). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:18:19 +00:00
Viktor Barzin	30cdeefb1c	chore: sync terraform state after nfsvers=4 convergence Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4). State files encrypted and committed. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:20:18 +00:00
Viktor Barzin	ea18116da9	fix: NFS outage recovery — migrate to NFSv4, add alerting NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14). All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE. Changes: - nfs_volume module: add nfsvers=4 mount option - nfs-csi StorageClass: add nfsvers=4 mount option - dbaas: MySQL serverInstances 3→1, mysql-native-password=ON - monitoring: add NFSCSINodeDown and NFSMountFailures alerts [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 10:28:27 +00:00
Viktor Barzin	68c8c5b4a0	fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory mounts from k8s (NFSv4 pseudo-root path resolution). Combined with lockd failure, both NFSv4 and NFSv3 mount paths broken. Cascaded into DNS primary, Vault (2/3 pods), Alertmanager, 20+ services. Changes: - Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted - Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted - Removed NFS module dependency from technitium stack - Added full post-mortem with prevention plan [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:18:59 +00:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
root	b0303ab17d	Woodpecker CI deploy commit [CI SKIP]	2026-04-12 21:39:26 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	5da6d75094	fix(monitoring): PodCrashLooping alert now fires only for active CrashLoopBackOff Switch from restart-count based detection (increase restarts[1h] > 5) to waiting-reason based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}). Alert auto-resolves when pod recovers, making it clear whether the issue is active.	2026-04-12 12:41:07 +01:00
Viktor Barzin	65551e4602	fix(dbaas): relax MySQL anti-affinity from required to preferred Avoids pods getting stuck Pending during node outages while still preferring to spread across nodes.	2026-04-11 16:26:24 +01:00
Viktor Barzin	160e8980e5	perf(immich): restore PostgreSQL vector search optimizations - shared_buffers: 1GB → 2GB (clip_index is 452MB, needs headroom) - effective_cache_size: 1536MB → 2560MB - PG memory: 2Gi → 3Gi to support larger shared_buffers - Add pg_prewarm to shared_preload_libraries with autoprewarm - First search after restart: 999ms → 25ms	2026-04-11 10:30:44 +01:00
Viktor Barzin	aa58565ecc	upgrade immich to v2.7.4 and increase rate limit burst - Immich version: v2.7.3 → v2.7.4 - Immich rate limit: avg 200→500, burst 2000→5000 (both traefik and platform stacks)	2026-04-11 10:15:42 +01:00
Viktor Barzin	75255d22a2	fix(phpipam): fix London SSH via WG MTU reduction (1420→1200) Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes), effective inner MTU is 1200 — but both sides were configured at 1420. SSH kex packets >1200 bytes were silently dropped. Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob. All 3 sites now fully automated: - Sofia: Kea leases + ARP every 5min - London: DHCP + ARP via pfSense→London SSH hop, hourly - Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 08:18:42 +00:00
Viktor Barzin	d7de5de07c	fix(monitoring): add pve_* metrics to Prometheus whitelist ProxmoxMetricsMissing alert was firing because pve_* metrics were excluded from the kubernetes-service-endpoints metric_relabel_configs whitelist. The exporter was scraping successfully but metrics were being dropped before ingestion.	2026-04-10 22:58:49 +01:00
Viktor Barzin	04beb123eb	feat(phpipam): split CronJobs - Sofia 5min, remote sites hourly - Sofia import (every 5min): Kea leases + pfSense ARP via SSH - Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop - London SSH (dropbear) hangs during kex on low-power router — disabled for now, data imported manually. TODO: lightweight push agent - Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys - No more ping sweeping anywhere — all passive DHCP/ARP data [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:24:58 +00:00
Viktor Barzin	a0392a9617	fix(nextcloud): auto-sync DB password from Vault rotation into config.php Nextcloud persists dbpassword in config.php on its PVC and ignores MYSQL_PASSWORD env var after initial install. When Vault rotates the MySQL password, config.php goes stale causing HTTP 500 crash loops. Adds a before-starting hook that patches config.php with the current MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader annotation, the full rotation chain is now automated: Vault rotates → ESO syncs Secret → Reloader restarts pod → hook patches config.php → Nextcloud connects with new password. Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).	2026-04-10 22:23:52 +01:00
Viktor Barzin	92e0c18e81	feat(phpipam): pull Valchedrym devices from OpenWRT DHCP/ARP via SSH - CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table - Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices - London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet) - 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:06:39 +00:00
Viktor Barzin	cddbb1c8b0	feat(phpipam): scan London/Valchedrym via WireGuard tunnel - pfsense-import CronJob now scans remote subnets (192.168.8.0/24, 192.168.0.0/24) via parallel ping sweep through pfSense WG tunnel - 13 London devices + 1 Valchedrym device discovered - Known hosts named: ha-london, rpi-london, openwrt-london - fping cron container fully removed [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:58:49 +00:00
Viktor Barzin	bba2de9eb1	refactor(phpipam): remove fping cron container All device discovery now handled by phpipam-pfsense-import CronJob which queries Kea DHCP leases + pfSense ARP table every 5min. No active scanning needed — pfSense sees all devices passively. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:38:59 +00:00
Viktor Barzin	d71c636269	feat(phpipam): replace fping scanning with pfSense Kea+ARP import - New CronJob `phpipam-pfsense-import` runs every 5min - Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients) - Queries pfSense ARP table (IP + MAC for static IP devices) - Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates - Reduced fping scan interval from 15min to 24h (weekly audit only) - Faster, quieter, gets MACs (fping didn't), gets Kea hostnames - SSH key (RSA PEM) stored in Vault, synced via ExternalSecret [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:37:28 +00:00
Viktor Barzin	e09996302a	feat(phpipam): bidirectional DNS sync + Kea DHCP for 192.168.1.x - CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries (reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname) - Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0) - 42 MAC→IP reservations for all known LAN devices - Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse) - DHCP pool .150-.199 for unknown devices - Technitium update ACL extended to include 192.168.1.2 (pfSense WAN) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:29:25 +00:00
Viktor Barzin	a86394f12b	feat(phpipam): add DNS sync CronJob and Kea DDNS integration - CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min - Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API - Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed) - Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets - Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1) - 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 14:47:24 +00:00
Viktor Barzin	4d3d3316ab	feat(phpipam): deploy phpIPAM for live IP address management Lightweight IPAM with auto-discovery scanning every 15min via fping. Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster with Vault-rotated credentials. Cloudflare DNS + Authentik auth. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 14:19:25 +00:00
Viktor Barzin	2a10c1821f	fix(nextcloud): enable mysql.utf8mb4 to fix collation errors with MySQL 8.4 MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in queries, causing SQLSTATE[42000] errors that broke occ commands and sync. Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the DB filecache (stored as `Work .csv`), causing permanent sync errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 09:16:48 +00:00

1 2 3 4 5 ...

505 commits