Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
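For reference, the policy's shape is roughly the following — a sketch assuming Kyverno's ClusterCleanupPolicy API (field names per upstream docs, not copied from the actual manifest):

```yaml
apiVersion: kyverno.io/v2
kind: ClusterCleanupPolicy
metadata:
  name: cleanup-failed-pods
spec:
  schedule: "15 * * * *"        # hourly at :15
  match:
    any:
    - resources:
        kinds:
        - Pod
  conditions:
    any:
    - key: "{{ target.status.phase }}"
      operator: Equals
      value: Failed
```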
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: treat rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — files mid-write at snapshot time have corrupt metadata
because journal replay is skipped, but the core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
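The exit-code capture pattern from the third bullet, sketched (`run_tolerant` is an illustrative name, not the script's):

```shell
# The `|| rc=$?` pattern: capture a command's exit code without
# tripping `set -e` — the failure lands in rc instead of aborting.
run_tolerant() {
  rc=0
  "$@" || rc=$?
  case "$rc" in
    0|23) return 0 ;;      # 23 = rsync partial transfer: expected for
                           # in-flight files on noload-mounted snapshots
    *)    return "$rc" ;;
  esac
}
```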
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove the RWO encrypted PVC mount that
caused pods to get stuck in ContainerCreating on the wrong nodes.
Plugin install now warns instead of failing when `zip` is unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
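The decryption path, sketched (device, mapper name, and mountpoint are illustrative; requires root and cryptsetup-bin on the PVE host):

```shell
SNAP_DEV=/dev/mapper/pve-snap-example   # snapshot of the encrypted PV
MAP_NAME=backup-luks

# Clean up a stale mapping left behind by a previous crashed run.
cryptsetup status "$MAP_NAME" >/dev/null 2>&1 && cryptsetup close "$MAP_NAME"

# Open with the backup key, then mount read-only with `noload` to skip
# ext4 journal replay (the snapshot was never cleanly unmounted).
cryptsetup open --key-file /root/.luks-backup-key "$SNAP_DEV" "$MAP_NAME"
mount -o ro,noload "/dev/mapper/$MAP_NAME" /mnt/backup-src
```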
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
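The fallback logic, sketched (`changed_paths` and the `ALL` sentinel are illustrative names, not necessarily what default.yml uses):

```shell
# In a shallow clone (depth=1), HEAD~1 does not exist, so diffing
# against it fails. Try it, then try deepening, then fall back to
# applying everything.
changed_paths() {
  if git rev-parse --quiet --verify HEAD~1 >/dev/null 2>&1; then
    git diff --name-only HEAD~1 HEAD
  elif git fetch --deepen=50 2>/dev/null &&
       git rev-parse --quiet --verify HEAD~1 >/dev/null 2>&1; then
    git diff --name-only HEAD~1 HEAD
  else
    echo ALL     # safe default: apply all platform stacks
  fi
}
```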
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
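The core of the hook's size check, sketched (`check_file_size` is an illustrative name):

```shell
# Returns 0 if the file is within the limit, 1 otherwise.
# Limit is configurable via GIT_MAX_FILE_SIZE (bytes), default 2MB.
check_file_size() {
  f=$1
  max=${GIT_MAX_FILE_SIZE:-2097152}
  size=$(wc -c < "$f")
  if [ "$size" -gt "$max" ]; then
    echo "ERROR: $f is ${size} bytes (limit: ${max} bytes)" >&2
    return 1
  fi
  return 0
}

# The hook runs it over every staged (added/modified) file:
#   git diff --cached --name-only --diff-filter=AM | while read -r f; do
#     check_file_size "$f" || exit 1
#   done
```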
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
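The MySQL per-db loop, sketched (host and credential handling are illustrative):

```shell
# --single-transaction gives a consistent InnoDB snapshot without locks;
# --set-gtid-purged=OFF keeps the dump importable on the same cluster.
for db in $(mysql -h mysql -N -e "SHOW DATABASES" \
    | grep -Ev '^(information_schema|performance_schema|mysql|sys)$'); do
  mkdir -p "/backup/per-db/$db"
  mysqldump -h mysql --single-transaction --set-gtid-purged=OFF \
    "$db" > "/backup/per-db/$db/$db.sql"
done
```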
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges (e.g. 200-299)
  instead of arbitrary ones
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
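The divergence push, sketched (metric and job names are illustrative; the real CronJob computes the value from Uptime Kuma monitor states):

```shell
# Pushgateway accepts text-format metrics POSTed to /metrics/job/<job>.
cat <<EOF | curl --data-binary @- \
    http://pushgateway.monitoring:9091/metrics/job/external-monitor-sync
# TYPE external_access_divergence gauge
external_access_divergence{service="example"} 1
EOF
```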
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
Switch from restart-count-based detection (increase(restarts[1h]) > 5) to
waiting-reason-based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
The alert now auto-resolves when the pod recovers, making it clear whether
the issue is still active.
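Roughly, the before/after expressions (the old metric name is inferred from kube-state-metrics; treat as a sketch):

```promql
# Old: restart-count based — keeps firing long after the pod recovers
increase(kube_pod_container_status_restarts_total[1h]) > 5

# New: waiting-reason based — clears as soon as the container
# leaves CrashLoopBackOff
max by (namespace, pod) (
  kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
) > 0
```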
Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over
the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes),
effective inner MTU is 1200 — but both sides were configured at 1420.
SSH kex packets >1200 bytes were silently dropped.
Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on
London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob.
All 3 sites now fully automated:
- Sofia: Kea leases + ARP every 5min
- London: DHCP + ARP via pfSense→London SSH hop, hourly
- Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly
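A quick way to confirm the black hole and the fix (peer address is a placeholder; `-M do` sets the don't-fragment bit on Linux ping):

```shell
# 1172 bytes of payload + 28 bytes of ICMP/IP headers = 1200 on the wire:
# should succeed once both ends are set to MTU 1200.
ping -c 3 -M do -s 1172 <wg-peer-ip>
# A 1420-sized probe (1392 + 28) reproduces the original silent drop.
ping -c 3 -M do -s 1392 <wg-peer-ip>
```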
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was being scraped successfully, but its metrics
were dropped before ingestion.
- Sofia import (every 5min): Kea leases + pfSense ARP via SSH
- Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop
- London SSH (dropbear) hangs during kex on low-power router — disabled
for now, data imported manually. TODO: lightweight push agent
- Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys
- No more ping sweeping anywhere — all passive DHCP/ARP data
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.
Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.
Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
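The hook, sketched (config path per the standard Nextcloud image; using PHP itself to rewrite the file avoids sed-escaping issues with special characters in the password):

```shell
#!/bin/sh
# Rewrite dbpassword in config.php from MYSQL_PASSWORD on every start.
CONFIG=/var/www/html/config/config.php
php -r '
  include $argv[1];                                // defines $CONFIG
  $CONFIG["dbpassword"] = getenv("MYSQL_PASSWORD");
  file_put_contents($argv[1],
    "<?php\n\$CONFIG = " . var_export($CONFIG, true) . ";\n");
' "$CONFIG"
```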
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients
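The parsing, sketched (function names are illustrative; field layouts per dnsmasq's lease file and the Linux /proc/net/arp format):

```shell
# /tmp/dhcp.leases: <expiry> <mac> <ip> <hostname> <client-id>
parse_leases() {
  awk '{ print $3, $2, $4 }' "$1"     # -> ip mac hostname
}

# /proc/net/arp: IP address, HW type, Flags, HW address, Mask, Device
# (skip the header row and incomplete all-zero entries)
parse_arp() {
  awk 'NR > 1 && $4 != "00:00:00:00:00:00" { print $1, $4 }' "$1"
}
```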
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New CronJob `phpipam-pfsense-import` runs every 5min
- Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients)
- Queries pfSense ARP table (IP + MAC for static IP devices)
- Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates
- Reduced fping scan interval from 15min to 24h (weekly audit only)
- Faster, quieter, gets MACs (fping didn't), gets Kea hostnames
- SSH key (RSA PEM) stored in Vault, synced via ExternalSecret
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries
(reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname)
- Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0)
- 42 MAC→IP reservations for all known LAN devices
- Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse)
- DHCP pool .150-.199 for unknown devices
- Technitium update ACL extended to include 192.168.1.2 (pfSense WAN)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min
- Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API
- Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed)
- Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets
- Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1)
- 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.
Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.
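The config in question is presumably Nextcloud's 4-byte-UTF-8 flag (sketch of the config.php fragment):

```php
// config.php: tell Nextcloud to speak utf8mb4 to MySQL,
// matching what MySQL 8.4 resolves `utf8` to.
'mysql.utf8mb4' => true,
```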
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump immich server + ML from v2.6.3 to v2.7.3
- Increase PG shared_buffers to 2GB (memory 3Gi) to prevent
clip_index eviction by background jobs
- Switch DB_STORAGE_TYPE to SSD (effective_io_concurrency=200,
random_page_cost=1.2)
- Add pg_prewarm autoprewarm for warm restarts
- Add postgresql.override.conf via init container for tuning
- Add postStart hook to prewarm vector tables on startup
Search latency: ~1.3s → ~130ms (external), ~60ms (internal)
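The override file, sketched from the commit's numbers (parameter names per PostgreSQL docs; the preload line is an assumption — append pg_prewarm to whatever shared_preload_libraries Immich's image already sets):

```ini
# postgresql.override.conf
shared_buffers = 2GB
effective_io_concurrency = 200
random_page_cost = 1.2

# pg_prewarm must be preloaded for autoprewarm to run
shared_preload_libraries = 'pg_prewarm'
pg_prewarm.autoprewarm = on
```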
192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me
domains because the TP-Link router doesn't support hairpin NAT.
Adds a CronJob that configures Technitium's Split Horizon
AddressTranslation post-processor on all 3 instances to translate
176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses
for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS
Rebinding Protection privateDomains allowlist so the translated private
IP isn't stripped.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technitium DNS was moved to its own dedicated MetalLB LoadBalancer IP
(10.0.20.201) but several references still pointed to the old shared IP
(10.0.20.200, now used by traefik/coturn/etc). This caused DNS resolution
failures for *.viktorbarzin.lan from pfSense and LAN clients.
- Update CoreDNS Corefile forward in both technitium and platform modules
- Update MetalLB annotation and remove stale allow-shared-ip
- Update zone NS records and apex A record in config.tfvars
- Update legacy BIND forwarder reference
Also fixed on pfSense (not in repo):
- Removed NAT rule redirecting UDP 53 to wrong IP (10.0.20.200)
- Added dnsmasq listen on WAN (192.168.1.2) for LAN clients
- Added domain-specific forwarding (viktorbarzin.lan -> 10.0.20.201)
- Created aliases (technitium_dns, k8s_shared_lb) for all NAT rules
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backup CronJob was stuck in ContainerCreating because it couldn't
mount the proxmox-lvm RWO PVC from a different node. Fixed by:
- Adding pod_affinity to co-locate with the headscale pod (same node)
- Mounting both data PVC (read-only) and NFS backup PVC (write)
- Adding integrity check pattern from vaultwarden backup
- Setting concurrency_policy=Replace and ttl_seconds_after_finished=10
iOS Safari doesn't support reading images via navigator.clipboard.read().
Added a camera button that opens the native file/photo picker, which works
reliably on all platforms including iOS.
- Custom index.html with xterm.js for reliable Ctrl+V text paste
- Go clipboard-upload service saves pasted images to /tmp/clipboard-images/
- Traefik IngressRoute routes /clipboard/* to upload service (same-origin)
- Authentik-protected upload path with strip-prefix middleware
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added the sidecar resources to the
CR spec for documentation purposes, but raised the quota from 32Gi to
40Gi as the practical fix.
Quota usage drops from 99% to 79%.
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.