Commit graph

2380 commits

Author SHA1 Message Date
Viktor Barzin
73531c12e0 docs(vpn): update with dual-stack WG, GL-iNet AllowedIPs fix, and troubleshooting [ci skip]
Document fixes from 2026-04-10 London network debugging session:
- pfSense WG now dual-stack (IPv4+IPv6 via HE tunnel gif0 pf rule)
- GL-iNet AllowedIPs must be single comma-separated UCI entry (parse bug)
- AdGuardHome/carrier-monitor must not use 1.1.1.1 (conntrack + rate limit)
- Expanded troubleshooting for site-to-site tunnel disconnects
2026-04-10 22:24:19 +01:00
Viktor Barzin
a0392a9617 fix(nextcloud): auto-sync DB password from Vault rotation into config.php
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.

Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.

Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
2026-04-10 22:23:52 +01:00
Viktor Barzin
92e0c18e81 feat(phpipam): pull Valchedrym devices from OpenWRT DHCP/ARP via SSH
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:06:39 +00:00
Viktor Barzin
cddbb1c8b0 feat(phpipam): scan London/Valchedrym via WireGuard tunnel
- pfsense-import CronJob now scans remote subnets (192.168.8.0/24,
  192.168.0.0/24) via parallel ping sweep through pfSense WG tunnel
- 13 London devices + 1 Valchedrym device discovered
- Known hosts named: ha-london, rpi-london, openwrt-london
- fping cron container fully removed

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:58:49 +00:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
bba2de9eb1 refactor(phpipam): remove fping cron container
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:38:59 +00:00
Viktor Barzin
0ac96def6b state(phpipam): update encrypted state 2026-04-10 20:38:53 +00:00
Viktor Barzin
d71c636269 feat(phpipam): replace fping scanning with pfSense Kea+ARP import
- New CronJob `phpipam-pfsense-import` runs every 5min
  - Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients)
  - Queries pfSense ARP table (IP + MAC for static IP devices)
  - Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates
- Reduced fping scan interval from 15min to 24h (weekly audit only)
- Faster, quieter, gets MACs (fping didn't), gets Kea hostnames
- SSH key (RSA PEM) stored in Vault, synced via ExternalSecret

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:37:28 +00:00
Viktor Barzin
93dee6030b state(phpipam): update encrypted state 2026-04-10 20:35:20 +00:00
Viktor Barzin
a34f8e48a7 state(phpipam): update encrypted state 2026-04-10 20:34:02 +00:00
Viktor Barzin
e09996302a feat(phpipam): bidirectional DNS sync + Kea DHCP for 192.168.1.x
- CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries
  (reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname)
- Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0)
- 42 MAC→IP reservations for all known LAN devices
- Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse)
- DHCP pool .150-.199 for unknown devices
- Technitium update ACL extended to include 192.168.1.2 (pfSense WAN)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:29:25 +00:00
Viktor Barzin
7f4a1127a0 state(phpipam): update encrypted state 2026-04-10 20:28:49 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
a86394f12b feat(phpipam): add DNS sync CronJob and Kea DDNS integration
- CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min
- Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API
- Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed)
- Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets
- Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1)
- 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:47:24 +00:00
Viktor Barzin
c07d8720c8 state(phpipam): update encrypted state 2026-04-10 14:46:43 +00:00
Viktor Barzin
3314e914fd state(phpipam): update encrypted state 2026-04-10 14:44:49 +00:00
Viktor Barzin
4d3d3316ab feat(phpipam): deploy phpIPAM for live IP address management
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:19:25 +00:00
Viktor Barzin
d2fdc481ef state(phpipam): update encrypted state 2026-04-10 14:11:52 +00:00
Viktor Barzin
934e5f6afb state(platform): update encrypted state 2026-04-10 13:41:55 +00:00
Viktor Barzin
c54a36e7ca state(vault): update encrypted state 2026-04-10 13:33:33 +00:00
Viktor Barzin
2a10c1821f fix(nextcloud): enable mysql.utf8mb4 to fix collation errors with MySQL 8.4
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.

Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 09:16:48 +00:00
Viktor Barzin
9820f2ced0 add foolery stack: agent orchestration UI on devvm [ci skip]
Service+Endpoints pattern proxying to 10.0.10.10:3210 (Foolery).
Protected by Authentik forward-auth. DNS via Cloudflare tunnel.
2026-04-10 00:21:59 +01:00
Viktor Barzin
795874fc21 immich: upgrade to v2.7.3, tune PG for vector search performance
- Bump immich server + ML from v2.6.3 to v2.7.3
- Increase PG shared_buffers to 2GB (memory 3Gi) to prevent
  clip_index eviction by background jobs
- Switch DB_STORAGE_TYPE to SSD (effective_io_concurrency=200,
  random_page_cost=1.2)
- Add pg_prewarm autoprewarm for warm restarts
- Add postgresql.override.conf via init container for tuning
- Add postStart hook to prewarm vector tables on startup

Search latency: ~1.3s → ~130ms (external), ~60ms (internal)
2026-04-09 23:04:13 +01:00
Viktor Barzin
1c7998163d extend backup-verify.sh with full 3-2-1 health inspection
18 checks covering all backup layers:
- Layer 1: LVM snapshot freshness/status/count, thin pool free, timer
- Layer 2: Weekly backup freshness/status/timer, sda mount/usage,
  PVC data/NFS mirror/pfsense freshness
- Layer 3: Offsite sync freshness/status/timer
- DB CronJobs: age check with per-service thresholds
- CNPG backups: existing check preserved

New --fix flag for conservative auto-remediation:
- Re-enable disabled timers
- Clear stale lockfiles
- Mount /mnt/backup if unmounted
2026-04-09 22:48:49 +01:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
98aaba98da docs: add Split Horizon hairpin NAT fix to networking architecture
[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:45:53 +00:00
Viktor Barzin
9de543c076 feat(technitium): add Split Horizon AddressTranslation for hairpin NAT fix
192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me
domains because the TP-Link router doesn't support hairpin NAT.

Adds a CronJob that configures Technitium's Split Horizon
AddressTranslation post-processor on all 3 instances to translate
176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses
for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS
Rebinding Protection privateDomains allowlist so the translated private
IP isn't stripped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:43:15 +00:00
Viktor Barzin
cfa7a50cb5 docs: update networking architecture for DNS consolidation
- Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200)
- Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging
- Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN)
- Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability
- Update MetalLB table with per-service IP assignments
- Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:49:33 +00:00
Viktor Barzin
64585e329c fix: update Technitium DNS IP from 10.0.20.200 to 10.0.20.201
Technitium DNS was moved to its own dedicated MetalLB LoadBalancer IP
(10.0.20.201) but several references still pointed to the old shared IP
(10.0.20.200, now used by traefik/coturn/etc). This caused DNS resolution
failures for *.viktorbarzin.lan from pfSense and LAN clients.

- Update CoreDNS Corefile forward in both technitium and platform modules
- Update MetalLB annotation and remove stale allow-shared-ip
- Update zone NS records and apex A record in config.tfvars
- Update legacy BIND forwarder reference

Also fixed on pfSense (not in repo):
- Removed NAT rule redirecting UDP 53 to wrong IP (10.0.20.200)
- Added dnsmasq listen on WAN (192.168.1.2) for LAN clients
- Added domain-specific forwarding (viktorbarzin.lan -> 10.0.20.201)
- Created aliases (technitium_dns, k8s_shared_lb) for all NAT rules

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 09:53:56 +00:00
Viktor Barzin
cdfa1b7e92 fix(headscale): backup CronJob uses pod_affinity for RWO PVC access
The backup CronJob was stuck in ContainerCreating because it couldn't
mount the proxmox-lvm RWO PVC from a different node. Fixed by:
- Adding pod_affinity to co-locate with the headscale pod (same node)
- Mounting both data PVC (read-only) and NFS backup PVC (write)
- Adding integrity check pattern from vaultwarden backup
- Setting concurrency_policy=Replace and ttl_seconds_after_finished=10
2026-04-08 08:20:08 +01:00
Viktor Barzin
f14c227a98 state(headscale): update encrypted state 2026-04-08 08:18:03 +01:00
Viktor Barzin
0498cc4ad1 feat(terminal): add image upload button for iOS paste support
iOS Safari doesn't support reading images via navigator.clipboard.read().
Added a camera button that opens the native file/photo picker, which works
reliably on all platforms including iOS.
2026-04-08 08:11:18 +01:00
Viktor Barzin
4d753a6486 fix(immich): improve thumbnail loading performance on iOS app
- Bump immich-server memory 1700Mi/2500Mi → 2000Mi/3500Mi to prevent OOM kills
- Disable anti-AI middleware chain for Immich (removes 3 unnecessary ForwardAuth
  hops per request — Immich content is behind auth, not crawlable)
- Double rate limit to 200 avg / 2000 burst for fast-scroll thumbnail requests
- Fix ImmichFrame image tag (1.7.4 → v1.0.32.0)
- Add PostgreSQL vector search prewarming and tuning (SSD storage type,
  init container for override conf, postStart pg_prewarm)
2026-04-08 08:08:53 +01:00
Viktor Barzin
37dbad47e9 state(immich): update encrypted state 2026-04-06 22:11:17 +03:00
Viktor Barzin
0a0cc9b7a2 state(traefik): update encrypted state 2026-04-06 21:22:52 +03:00
Viktor Barzin
15e45b95a9 feat(terminal): add clipboard paste support for text and images
- Custom index.html with xterm.js for reliable Ctrl+V text paste
- Go clipboard-upload service saves pasted images to /tmp/clipboard-images/
- Traefik IngressRoute routes /clipboard/* to upload service (same-origin)
- Authentik-protected upload path with strip-prefix middleware
2026-04-06 16:57:18 +03:00
Viktor Barzin
cbed5423ec state(terminal): update encrypted state 2026-04-06 16:50:52 +03:00
Viktor Barzin
4b2339bfae state(terminal): update encrypted state 2026-04-06 16:49:30 +03:00
Viktor Barzin
fea8519f51 update VPN architecture docs and Authentik state reference
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
  interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
  names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
  that was blocking all unauthenticated users from login flows

[ci skip]
2026-04-06 16:26:21 +03:00
Viktor Barzin
d2af5339af fix offsite sync: use --chmod for Synology permission compatibility
Synology Administrator user can't create dirs with root-owned permissions
from PVC snapshots. Switch from -az to -rltz --chmod to set writable
permissions on destination. Also updated Cloud Sync Task 1 excludes
to prevent duplication of backup dirs on Synology.
2026-04-06 16:01:42 +03:00
Viktor Barzin
9338af3c29 fix(dbaas): raise ResourceQuota to 40Gi and add sidecar resources
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added sidecar to CR spec for
documentation but raised quota from 32Gi to 40Gi as the practical fix.
Quota usage drops from 99% to 79%.
2026-04-06 15:57:47 +03:00
Viktor Barzin
fbdb57eb58 fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.
2026-04-06 15:43:47 +03:00
Viktor Barzin
b5689afe6d fix(monitoring): tune alert thresholds to reduce false positives
- HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W)
- HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO)
  from latency checks — both have inherently high avg response times
2026-04-06 15:39:23 +03:00
Viktor Barzin
ca4acaecd0 bd init: initialize beads issue tracking 2026-04-06 15:38:46 +03:00
Viktor Barzin
91242b0b40 feat(monitoring): add comprehensive hardware exporter alerts
Added 20 new alerts across 3 rule groups:

Power (8 new):
- UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal
- ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal

Server Health (10 new):
- iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy
- iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning
- iDRACServerPoweredOff, ProxmoxExporterDown
- FuseMainFault, FuseGarageFault

Metric Staleness (3 new):
- FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing

Plus 4 new inhibition rules for alert cascade protection.
2026-04-06 15:31:50 +03:00
Viktor Barzin
6abc0b9742 security(monitoring): remove public SNMP exporter ingress
snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the
public internet with no authentication. Removed the external ingress
and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter
via the existing .lan ingress (allow_local_access_only=true) using
direct IP 10.0.20.200 with Host header.
2026-04-06 15:23:56 +03:00
Viktor Barzin
7f141faa8c Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel
- Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter
- Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel)
- Update ha-sofia REST integration to use external HTTPS endpoint
- Fix ingress backend service routing to use existing snmp-exporter service
- All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)
2026-04-06 15:14:19 +03:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
d5b0990ed1 state(platform): update encrypted state 2026-04-06 15:04:39 +03:00
Viktor Barzin
9e2ac5fbb5 feat: add hardware exporter checks to cluster healthcheck (check #30)
Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and
tuya-bridge pods are running, plus checks Prometheus scrape targets
(snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.
2026-04-06 14:58:46 +03:00