infra

Author	SHA1	Message	Date
Viktor Barzin	ca57c8c15c	state(uptime-kuma): update encrypted state	2026-04-03 15:11:53 +03:00
Viktor Barzin	8d8534df8a	state(crowdsec): update encrypted state	2026-04-03 15:01:08 +03:00
Viktor Barzin	d86c62b6ef	state(health): update encrypted state	2026-04-03 15:00:30 +03:00
Viktor Barzin	b49e9d7e69	state(woodpecker): update encrypted state	2026-04-03 14:59:49 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	2d8aa5ed89	docs: update hardware inventory for R730 RAM upgrade to 272GB Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400. Added physical DIMM slot diagram, channel layout, and BIOS speed override notes. Updated compute architecture with correct CPU (single socket), VM memory values, and capacity figures.	2026-04-02 00:48:13 +03:00
Viktor Barzin	87c858f026	state(platform): update encrypted state	2026-04-01 20:08:32 +03:00
Viktor Barzin	5af6558935	state(platform): update encrypted state	2026-04-01 20:08:29 +03:00
Viktor Barzin	c7369d8a2b	state(platform): update encrypted state	2026-04-01 20:07:42 +03:00
Viktor Barzin	d1059d6017	registry: set proxy TTL to 0 to prevent stale :latest images Blob caching (content-addressed by SHA256) is unaffected — only manifest re-validation changes. Every pull now checks upstream for the current manifest digest, eliminating stale :latest tag issues.	2026-03-30 00:02:48 +03:00
Viktor Barzin	28587c674d	fix-broken-blobs: use argparse for proper flag handling. --dry-run as first arg was being parsed as the BASE directory path.	2026-03-29 22:33:33 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	facf959ecf	fix registry healthchecks: use 127.0.0.1 instead of localhost. localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4 only), causing wget to fail with "Connection refused". The nginx proxy had 18,462 consecutive health check failures because of this. Also cleared corrupted pull-through cache for mghee/novelapp — the registry had layer link files pointing to non-existent blob data, causing containerd to get 200 responses with 0 bytes (unexpected EOF).	2026-03-29 22:29:27 +03:00
Viktor Barzin	a2b1b0e817	remove caretta network mapper to free 3Gi cluster memory Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for non-critical network topology visualization. Removing it to free memory for novelapp and aiostreams which were stuck in Pending.	2026-03-29 22:17:35 +03:00
Viktor Barzin	b27b508f10	state(terminal): update encrypted state	2026-03-29 21:45:49 +03:00
Viktor Barzin	7ad01661f0	novelapp: migrate NEXTAUTH env vars to Auth.js v5 (AUTH_*) Replace NEXTAUTH_URL/NEXTAUTH_SECRET with AUTH_URL/AUTH_SECRET and add AUTH_TRUST_HOST=true for Auth.js v5 compatibility.	2026-03-29 20:37:26 +03:00
Viktor Barzin	c71a784e1c	state(novelapp): update encrypted state	2026-03-29 20:37:16 +03:00
Viktor Barzin	8bf83147db	add SLACK_WEBHOOK_URL env var to book-search deployment	2026-03-29 13:53:24 +03:00
Viktor Barzin	10f22350c5	exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip] Expanded cloud sync excludes to reduce sync time and Synology disk usage. All excluded data is either regenerable or low-value. TrueNAS Task 1 and incremental script already updated live.	2026-03-29 13:44:32 +03:00
Viktor Barzin	78eff9ab11	fix: bump book-search memory to 512Mi for file upload/email [ci skip] Downloads and sends ebook files via HTTP — needs more than 128Mi for large PDFs. Applied live via kubectl, persisting in Terraform.	2026-03-29 13:24:19 +03:00
Viktor Barzin	914e0b08e2	add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip]	2026-03-29 12:42:45 +03:00
Viktor Barzin	cbea959966	feat(ebooks): mount calibre-library PVC in book-search for permission fixing CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as root. book-search now mounts the library to periodically fix permissions on recently imported books.	2026-03-29 11:31:41 +03:00
Viktor Barzin	fed9df8c0e	feat(ebooks): mount stacks-config PVC in book-search for force re-download Adds stacks-config volume mount to book-search pod so it can delete Stacks history entries and force re-downloads when a book was consumed by CWA but failed to import.	2026-03-29 11:26:30 +03:00
Viktor Barzin	6d44b4292f	add /api/download-status to book-search unprotected API ingress [ci skip] Needed for async polling from iOS Shortcuts — status endpoint doesn't need Authentik auth (job IDs are unguessable UUIDs).	2026-03-29 10:11:22 +03:00
Viktor Barzin	46444e0306	openclaw: remove install-dotfiles init container to reduce NFS writes The init container was cloning the dotfiles repo via git on every pod start, causing 200+ small NFS writes that amplified through ZFS. Dotfiles already exist on NFS from a previous clone — no need to re-clone on every restart. To update dotfiles, run git pull manually. Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB error log left over from migration to MariaDB).	2026-03-29 01:11:33 +02:00
Viktor Barzin	d89321ff3b	state(openclaw): update encrypted state	2026-03-29 01:10:42 +02:00
Viktor Barzin	878b556179	state(monitoring): update encrypted state	2026-03-29 01:04:11 +02:00
Viktor Barzin	d41211ddd5	add API key + unprotected API ingress for book-search iOS Shortcut - API_KEY env var from calibre-secrets for /api/download-url auth - SHORTCUT_ICLOUD_URL env var for /shortcut redirect - Separate ingress for /api/download-url and /shortcut (bypasses Authentik)	2026-03-29 00:43:34 +02:00
Viktor Barzin	06490b0634	reduce Prometheus cardinality round 3: drop 44k more series - cadvisor: drop unused network error/dropped counters, unused cpu metrics (load_avg, system, user), unused memory metrics (cache, failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive) - kubelet: drop all unused histogram buckets (storage_operation, csi, volume_operation, image_pull, http_requests, rest_client, pod_worker, volume_metric, cgroup_manager) + kubernetes_feature_enabled - apiserver: drop flowcontrol/rest_client histograms, longrunning_requests - traefik: drop all router-level metrics (keep service + entrypoint) - service-endpoints: drop coredns histograms, node_filesystem_*. Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)	2026-03-29 00:27:23 +02:00
Viktor Barzin	614d3c72bd	add liveness probe to annas-archive-stacks deployment Prevents corrupted SQLite DB from looping errors forever — K8s will auto-restart the pod if /api/version stops responding.	2026-03-29 00:17:29 +02:00
Viktor Barzin	a9ca65bc31	reduce Prometheus cardinality round 2: drop 137k more series - fix traefik double-scrape: kubernetes-pods job was scraping traefik pods again (43k duplicate series). Added namespace drop rule. - drop unused cadvisor metrics: container_fs_, container_blkio_, container_pressure_, container_spec_, and misc (30k series) - drop more apiserver histogram buckets: watch_list, watch_cache, response_sizes, watch_events, admission_controller, workqueue (11k) - drop unused kube-state-metrics: replicaset_, pod_tolerations, pod_labels, endpoint_, service_, configmap_, etc (53k series). Post-relabel samples: 332k → 142k (-57%). Ingestion rate: 5,480 → 3,239 samples/sec (-41%)	2026-03-28 23:51:24 +02:00
Viktor Barzin	aceea7db94	increase global rate limit from 10/50 to 50/200 HA frontend loads 30-50 JS bundles on page load, exhausting the burst. iOS Companion app reconnections also trigger bursts. 172 rate-limited (429) requests found in Traefik logs causing intermittent connectivity failures for ha-sofia iOS app.	2026-03-28 23:40:10 +02:00
Viktor Barzin	4b3851829b	feat: organize Grafana dashboards into folders Enable sidecar folderAnnotation + foldersFromFilesStructure to group 26 dashboards into 5 managed folders: - Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics - Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic - Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU - Operations (4): backup health, registry, audit logs, Loki - Applications (2): realestate-crawler, qBittorrent Dashboard-to-folder mapping defined in grafana.tf locals block. External stacks (headscale, technitium) annotated individually.	2026-03-28 16:23:49 +02:00
Viktor Barzin	9c49d4c39b	state(headscale): update encrypted state	2026-03-28 16:19:09 +02:00
Viktor Barzin	725fefe565	fix: add Headscale monitoring, alerts, and pin UI image - Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes, HighHTTPLatency, HighErrorRate - Add Grafana dashboard with node count, map responses, HTTP latency, nodestore operations, and memory panels - Pin headscale-ui to digest sha256:015f5ba0... (was :latest) - Set disable_check_updates: true to skip GitHub check on startup - Uptime Kuma monitor already existed (id=19, 300s interval)	2026-03-28 16:07:04 +02:00
Viktor Barzin	972edf4d30	state(headscale): update encrypted state	2026-03-28 16:05:24 +02:00
Viktor Barzin	069af6517e	state(freedify): update encrypted state	2026-03-28 16:05:11 +02:00
Viktor Barzin	f4ff654a69	perf: optimize Headscale for connectivity and latency - Remove viktorbarzin.me from split DNS (same IPs as public DNS, was adding unnecessary tunnel overhead for every DNS query) - Narrow reverse DNS split scope from 10.0.0.0/8 → 10.0.20.0/24 and 10.0.10.0/24 only; 192.168.0.0/16 → 192.168.1.0/24 only - Add extra_records for key internal services (technitium, k8s-master) for instant MagicDNS resolution without tunnel roundtrip - Replace full Tailscale DERP map (29 regions) with curated set: home + 8 European + 5 global fallback DERPs (14 total) - Add custom derp.yaml to ConfigMap, sourced from Vault. Port 80 DERP dropped — Traefik's global HTTP→HTTPS redirect prevents non-TLS DERP upgrades on the web entrypoint.	2026-03-28 15:44:13 +02:00
Viktor Barzin	29fe56aa68	state(headscale): update encrypted state	2026-03-28 15:43:54 +02:00
Viktor Barzin	8a5a53a832	fix alerts and reduce Prometheus disk write rate - linkwarden: add Reloader match annotation to DB secret so pods auto-restart on Vault credential rotation (was causing 100% 5xx) - authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi) to prevent OOM kills - prometheus: drop 113k high-cardinality series to reduce HDD write rate from ~8.8 to ~6.0 MB/s (31% reduction): - drop all traefik/apiserver/etcd histogram bucket metrics - drop goflow2_flow_process_nf_templates_total (9.3k series) - drop container_tasks_state and container_memory_failures_total - rewrite HighServiceLatency alert to use avg latency (_sum/_count) - update cluster_health dashboard to match - raise KubeletRuntimeOperationsLatency threshold from 30s to 60s	2026-03-28 15:42:14 +02:00
Viktor Barzin	7267e53e2f	state(headscale): update encrypted state	2026-03-28 15:41:32 +02:00
Viktor Barzin	85bbc67722	state(linkwarden): update encrypted state	2026-03-28 14:55:55 +02:00
Viktor Barzin	e79b996624	state(authentik): update encrypted state	2026-03-28 14:51:24 +02:00
Viktor Barzin	7e0b0d9362	fix: headscale VPN setup hardening - Add SQLite backup CronJob (every 6h to NFS for cloud sync pickup) - Move headscale-ui secrets (COOKIE_SECRET, ROOT_API_KEY) from hardcoded values to Vault-managed secrets - Add DERP IPv6 address (2001:470:6e:43d::2) for IPv6-capable clients - Clean up stale test nodes, duplicate users, rename "localhost" nodes. Also updated headscale_config in Vault to include DERP ipv6 field and headscale_ui_cookie_secret/headscale_ui_api_key secrets.	2026-03-28 14:38:12 +02:00
Viktor Barzin	b339d454dd	state(headscale): update encrypted state	2026-03-28 14:37:16 +02:00
Viktor Barzin	a42003fb8f	fix: add dedicated DERP IngressRoute bypassing middlewares CrowdSec, rate limiting, anti-AI, and error pages middlewares were interfering with the Upgrade: DERP protocol handshake. Also updated Headscale ACL in Vault to allow tailnet DNS traffic to Technitium (10.0.20.200:53).	2026-03-28 14:26:51 +02:00
Viktor Barzin	1ec11cdab4	state(headscale): update encrypted state	2026-03-28 14:22:44 +02:00
Viktor Barzin	eadc266691	state(headscale): update encrypted state	2026-03-28 14:06:03 +02:00
Viktor Barzin	04a96955c0	fix: exclude NFS PVs from PVFillingUp alert NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music shows 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out PVs with >1TiB capacity (always NFS mounts; iSCSI PVCs are 10-50Gi).	2026-03-28 01:14:05 +02:00
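
Illustration for dd461beb33 and 28587c674d (not the repository's actual script): a minimal Python sketch of the orphaned layer-link check those commits describe. It assumes the standard on-disk layout of the Distribution registry filesystem driver (docker/registry/v2/repositories/.../_layers/sha256/<digest>/link pointing at docker/registry/v2/blobs/sha256/<first two hex chars>/<digest>/data). The function names, the `base` argument, and the return values are hypothetical; only the `--dry-run` flag and the walk, check, remove behaviour come from the commit messages.

```python
"""Sketch: find registry layer links whose blob data is missing or empty."""
import argparse
from pathlib import Path


def find_orphaned_links(base: Path) -> list[Path]:
    """Return _layers link files that point at a missing or zero-byte blob."""
    v2 = base / "docker" / "registry" / "v2"  # assumed storage layout
    orphaned = []
    # Repository names can be nested (e.g. mghee/novelapp), so search recursively.
    for link in sorted((v2 / "repositories").glob("**/_layers/sha256/*/link")):
        digest = link.read_text().strip()  # e.g. "sha256:ab12..."
        algo, _, hexdigest = digest.partition(":")
        data = v2 / "blobs" / algo / hexdigest[:2] / hexdigest / "data"
        if not data.is_file() or data.stat().st_size == 0:
            orphaned.append(link)
    return orphaned


def main(argv=None) -> int:
    """Report orphaned links and delete them unless --dry-run is given."""
    parser = argparse.ArgumentParser(description=__doc__)
    # argparse keeps the flag and the positional apart regardless of order,
    # so "--dry-run" given first is never mistaken for the base directory.
    parser.add_argument("base", type=Path, help="registry storage root")
    parser.add_argument("--dry-run", action="store_true",
                        help="report orphaned links without deleting them")
    args = parser.parse_args(argv)

    orphaned = find_orphaned_links(args.base)
    for link in orphaned:
        print(f"orphaned: {link}")
        if not args.dry_run:
            link.unlink()  # registry re-fetches the layer on the next pull
    return len(orphaned)
```

Calling `main(["--dry-run", "/path/to/registry"])` only lists the orphaned links; dropping the flag removes them. The path here is a placeholder, and a real cleanup would likely also want to remove the now-empty digest directories, which this sketch leaves in place.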

2180 commits