- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
- Expose STUN port 3479/UDP on container and LoadBalancer service
- Upgrade headscale from 0.23.0 to 0.28.0
- Vault config updated: auto DERP region with ipv4 field, ISP router
port forward for UDP 3479 added
Home DERP now shows ~3ms latency and is selected as nearest relay.
Kyverno generate+synchronize only manages secrets it created itself.
Existing Terraform-managed secrets in ~70 namespaces weren't updated.
Now loops through all namespaces and kubectl apply the new cert.
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
bitnami/kubectl runs as non-root UID 1001, cannot read git-crypt
decrypted secrets owned by root. Switch to alpine (runs as root)
with kubectl downloaded directly.
Kyverno ClusterPolicy clones tls-secret from kyverno namespace to all
namespaces with synchronize=true. Renewal pipeline now updates the source
secret via kubectl, verifies cert validity, and sends Slack notification.
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
Add 512Mi tmpfs emptyDir for /tmp/cache — Frigate writes 10s MP4
segments here continuously for all cameras. With motion-only retention,
segments without events are deleted immediately anyway, so losing them
on pod restart is acceptable.
Node1 disk writes: 3.55 MB/s → 2.08 MB/s (previous commit) → 96 KB/s (now)
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
to prevent caching truncated upstream responses (root cause of repeated
ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
exist, use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) — no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced with flowset records by type, separated bytes and flows
into own panels to avoid scale issues.