Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
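For reference, the policy's shape is roughly the following — a sketch assuming Kyverno's ClusterCleanupPolicy API (field names per upstream docs, not copied from the actual manifest):

```yaml
apiVersion: kyverno.io/v2
kind: ClusterCleanupPolicy
metadata:
  name: cleanup-failed-pods
spec:
  schedule: "15 * * * *"        # hourly at :15
  match:
    any:
    - resources:
        kinds:
        - Pod
  conditions:
    any:
    - key: "{{ target.status.phase }}"
      operator: Equals
      value: Failed
```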
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: treat rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — files mid-write at snapshot time have corrupt metadata
because journal replay is skipped, but the core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
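The exit-code capture pattern from the third bullet, sketched (`run_tolerant` is an illustrative name, not the script's):

```shell
# The `|| rc=$?` pattern: capture a command's exit code without
# tripping `set -e` — the failure lands in rc instead of aborting.
run_tolerant() {
  rc=0
  "$@" || rc=$?
  case "$rc" in
    0|23) return 0 ;;      # 23 = rsync partial transfer: expected for
                           # in-flight files on noload-mounted snapshots
    *)    return "$rc" ;;
  esac
}
```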
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove the RWO encrypted PVC mount that
caused pods to get stuck in ContainerCreating on the wrong nodes.
Plugin install now warns instead of failing when `zip` is unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
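The decryption path, sketched (device, mapper name, and mountpoint are illustrative; requires root and cryptsetup-bin on the PVE host):

```shell
SNAP_DEV=/dev/mapper/pve-snap-example   # snapshot of the encrypted PV
MAP_NAME=backup-luks

# Clean up a stale mapping left behind by a previous crashed run.
cryptsetup status "$MAP_NAME" >/dev/null 2>&1 && cryptsetup close "$MAP_NAME"

# Open with the backup key, then mount read-only with `noload` to skip
# ext4 journal replay (the snapshot was never cleanly unmounted).
cryptsetup open --key-file /root/.luks-backup-key "$SNAP_DEV" "$MAP_NAME"
mount -o ro,noload "/dev/mapper/$MAP_NAME" /mnt/backup-src
```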
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
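The fallback logic, sketched (`changed_paths` and the `ALL` sentinel are illustrative names, not necessarily what default.yml uses):

```shell
# In a shallow clone (depth=1), HEAD~1 does not exist, so diffing
# against it fails. Try it, then try deepening, then fall back to
# applying everything.
changed_paths() {
  if git rev-parse --quiet --verify HEAD~1 >/dev/null 2>&1; then
    git diff --name-only HEAD~1 HEAD
  elif git fetch --deepen=50 2>/dev/null &&
       git rev-parse --quiet --verify HEAD~1 >/dev/null 2>&1; then
    git diff --name-only HEAD~1 HEAD
  else
    echo ALL     # safe default: apply all platform stacks
  fi
}
```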
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
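The core of the hook's size check, sketched (`check_file_size` is an illustrative name):

```shell
# Returns 0 if the file is within the limit, 1 otherwise.
# Limit is configurable via GIT_MAX_FILE_SIZE (bytes), default 2MB.
check_file_size() {
  f=$1
  max=${GIT_MAX_FILE_SIZE:-2097152}
  size=$(wc -c < "$f")
  if [ "$size" -gt "$max" ]; then
    echo "ERROR: $f is ${size} bytes (limit: ${max} bytes)" >&2
    return 1
  fi
  return 0
}

# The hook runs it over every staged (added/modified) file:
#   git diff --cached --name-only --diff-filter=AM | while read -r f; do
#     check_file_size "$f" || exit 1
#   done
```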
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
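The MySQL per-db loop, sketched (host and credential handling are illustrative):

```shell
# --single-transaction gives a consistent InnoDB snapshot without locks;
# --set-gtid-purged=OFF keeps the dump importable on the same cluster.
for db in $(mysql -h mysql -N -e "SHOW DATABASES" \
    | grep -Ev '^(information_schema|performance_schema|mysql|sys)$'); do
  mkdir -p "/backup/per-db/$db"
  mysqldump -h mysql --single-transaction --set-gtid-purged=OFF \
    "$db" > "/backup/per-db/$db/$db.sql"
done
```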
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges (e.g. 200-299)
  instead of arbitrary ones
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
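The divergence push, sketched (metric and job names are illustrative; the real CronJob computes the value from Uptime Kuma monitor states):

```shell
# Pushgateway accepts text-format metrics POSTed to /metrics/job/<job>.
cat <<EOF | curl --data-binary @- \
    http://pushgateway.monitoring:9091/metrics/job/external-monitor-sync
# TYPE external_access_divergence gauge
external_access_divergence{service="example"} 1
EOF
```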
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
Switch from restart-count-based detection (increase(restarts[1h]) > 5) to
waiting-reason-based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
The alert now auto-resolves when the pod recovers, making it clear whether
the issue is still active.
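Roughly, the before/after expressions (the old metric name is inferred from kube-state-metrics; treat as a sketch):

```promql
# Old: restart-count based — keeps firing long after the pod recovers
increase(kube_pod_container_status_restarts_total[1h]) > 5

# New: waiting-reason based — clears as soon as the container
# leaves CrashLoopBackOff
max by (namespace, pod) (
  kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
) > 0
```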
Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over
the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes),
effective inner MTU is 1200 — but both sides were configured at 1420.
SSH kex packets >1200 bytes were silently dropped.
Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on
London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob.
All 3 sites now fully automated:
- Sofia: Kea leases + ARP every 5min
- London: DHCP + ARP via pfSense→London SSH hop, hourly
- Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly
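A quick way to confirm the black hole and the fix (peer address is a placeholder; `-M do` sets the don't-fragment bit on Linux ping):

```shell
# 1172 bytes of payload + 28 bytes of ICMP/IP headers = 1200 on the wire:
# should succeed once both ends are set to MTU 1200.
ping -c 3 -M do -s 1172 <wg-peer-ip>
# A 1420-sized probe (1392 + 28) reproduces the original silent drop.
ping -c 3 -M do -s 1392 <wg-peer-ip>
```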
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was being scraped successfully, but its metrics
were dropped before ingestion.
- Sofia import (every 5min): Kea leases + pfSense ARP via SSH
- Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop
- London SSH (dropbear) hangs during kex on low-power router — disabled
for now, data imported manually. TODO: lightweight push agent
- Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys
- No more ping sweeping anywhere — all passive DHCP/ARP data
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.
Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.
Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
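The hook, sketched (config path per the standard Nextcloud image; using PHP itself to rewrite the file avoids sed-escaping issues with special characters in the password):

```shell
#!/bin/sh
# Rewrite dbpassword in config.php from MYSQL_PASSWORD on every start.
CONFIG=/var/www/html/config/config.php
php -r '
  include $argv[1];                                // defines $CONFIG
  $CONFIG["dbpassword"] = getenv("MYSQL_PASSWORD");
  file_put_contents($argv[1],
    "<?php\n\$CONFIG = " . var_export($CONFIG, true) . ";\n");
' "$CONFIG"
```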
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients
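The parsing, sketched (function names are illustrative; field layouts per dnsmasq's lease file and the Linux /proc/net/arp format):

```shell
# /tmp/dhcp.leases: <expiry> <mac> <ip> <hostname> <client-id>
parse_leases() {
  awk '{ print $3, $2, $4 }' "$1"     # -> ip mac hostname
}

# /proc/net/arp: IP address, HW type, Flags, HW address, Mask, Device
# (skip the header row and incomplete all-zero entries)
parse_arp() {
  awk 'NR > 1 && $4 != "00:00:00:00:00:00" { print $1, $4 }' "$1"
}
```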
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New CronJob `phpipam-pfsense-import` runs every 5min
- Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients)
- Queries pfSense ARP table (IP + MAC for static IP devices)
- Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates
- Reduced fping scan interval from 15min to 24h (weekly audit only)
- Faster, quieter, gets MACs (fping didn't), gets Kea hostnames
- SSH key (RSA PEM) stored in Vault, synced via ExternalSecret
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries
(reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname)
- Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0)
- 42 MAC→IP reservations for all known LAN devices
- Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse)
- DHCP pool .150-.199 for unknown devices
- Technitium update ACL extended to include 192.168.1.2 (pfSense WAN)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min
- Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API
- Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed)
- Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets
- Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1)
- 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.
Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.
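The config in question is presumably Nextcloud's 4-byte-UTF-8 flag (sketch of the config.php fragment):

```php
// config.php: tell Nextcloud to speak utf8mb4 to MySQL,
// matching what MySQL 8.4 resolves `utf8` to.
'mysql.utf8mb4' => true,
```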
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump immich server + ML from v2.6.3 to v2.7.3
- Increase PG shared_buffers to 2GB (memory 3Gi) to prevent
clip_index eviction by background jobs
- Switch DB_STORAGE_TYPE to SSD (effective_io_concurrency=200,
random_page_cost=1.2)
- Add pg_prewarm autoprewarm for warm restarts
- Add postgresql.override.conf via init container for tuning
- Add postStart hook to prewarm vector tables on startup
Search latency: ~1.3s → ~130ms (external), ~60ms (internal)
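The override file, sketched from the commit's numbers (parameter names per PostgreSQL docs; the preload line is an assumption — append pg_prewarm to whatever shared_preload_libraries Immich's image already sets):

```ini
# postgresql.override.conf
shared_buffers = 2GB
effective_io_concurrency = 200
random_page_cost = 1.2

# pg_prewarm must be preloaded for autoprewarm to run
shared_preload_libraries = 'pg_prewarm'
pg_prewarm.autoprewarm = on
```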
192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me
domains because the TP-Link router doesn't support hairpin NAT.
Adds a CronJob that configures Technitium's Split Horizon
AddressTranslation post-processor on all 3 instances to translate
176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses
for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS
Rebinding Protection privateDomains allowlist so the translated private
IP isn't stripped.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technitium DNS was moved to its own dedicated MetalLB LoadBalancer IP
(10.0.20.201) but several references still pointed to the old shared IP
(10.0.20.200, now used by traefik/coturn/etc). This caused DNS resolution
failures for *.viktorbarzin.lan from pfSense and LAN clients.
- Update CoreDNS Corefile forward in both technitium and platform modules
- Update MetalLB annotation and remove stale allow-shared-ip
- Update zone NS records and apex A record in config.tfvars
- Update legacy BIND forwarder reference
Also fixed on pfSense (not in repo):
- Removed NAT rule redirecting UDP 53 to wrong IP (10.0.20.200)
- Added dnsmasq listen on WAN (192.168.1.2) for LAN clients
- Added domain-specific forwarding (viktorbarzin.lan -> 10.0.20.201)
- Created aliases (technitium_dns, k8s_shared_lb) for all NAT rules
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backup CronJob was stuck in ContainerCreating because it couldn't
mount the proxmox-lvm RWO PVC from a different node. Fixed by:
- Adding pod_affinity to co-locate with the headscale pod (same node)
- Mounting both data PVC (read-only) and NFS backup PVC (write)
- Adding integrity check pattern from vaultwarden backup
- Setting concurrency_policy=Replace and ttl_seconds_after_finished=10
iOS Safari doesn't support reading images via navigator.clipboard.read().
Added a camera button that opens the native file/photo picker, which works
reliably on all platforms including iOS.
- Custom index.html with xterm.js for reliable Ctrl+V text paste
- Go clipboard-upload service saves pasted images to /tmp/clipboard-images/
- Traefik IngressRoute routes /clipboard/* to upload service (same-origin)
- Authentik-protected upload path with strip-prefix middleware
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added the sidecar resources to the
CR spec for documentation purposes, but raised the quota from 32Gi to
40Gi as the practical fix.
Quota usage drops from 99% to 79%.
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.