Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS.
For k8s hosting, init container patches this to relative /graphql path.
Second ingress routes /graphql to port 9002 behind Authentik auth.
- Init container copies static JS to writable emptyDir, patches URL
- Pre-seeds store.json with Dolt connection config
- Added /graphql ingress with Authentik forward-auth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).
Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)
Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods pull from registry.viktorbarzin.me:5050 but the
registry-credentials secret only had auth for registry.viktorbarzin.me
(without port). Containerd requires exact hostname:port match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vault-woodpecker-sync script was creating global secrets with only
push/tag/deployment events. Manual and cron-triggered pipelines couldn't
access secrets, causing "secret not found" errors and pipeline failures.
Also fixes three root causes of CI failures:
1. Pull-through cache corruption: purged stale blobs, added post-GC
registry restart cron to prevent recurrence
2. Missing repo-level secrets: added registry_user/registry_password
for the infra repo's build-ci-image workflow
3. Stuck pipelines: cleaned up 3 pipelines stuck in "running" since March
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set protected=true on ingress (Authentik forward-auth)
- Remove unused DATABASE_URL env var (Workbench uses browser-based connection config)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server
namespace. Provides SQL console, spreadsheet editor, and commit graph
visualization for the centralized beads task database.
- Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied)
- Connects to Dolt server via in-cluster service DNS
- Added to cloudflare_proxied_names for external access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods were failing with "authorization failed: no basic auth
credentials" when pulling from the private registry. The
WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES env var was in values.yaml but
never deployed to the agents.
Also removes the stale db-init job that used `-U root` (incompatible
with CNPG's `postgres` superuser). The database already exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image ghcr.io/toeverything/affine:0.20.7 was removed from ghcr.io,
causing persistent ImagePullBackOff. Updated to latest stable 0.26.6.
Prisma migrations run via init container on startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove RWO encrypted PVC mount that caused
pods to stick in ContainerCreating on wrong nodes. Plugin install now
warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
Switch from restart-count based detection (increase restarts[1h] > 5) to
waiting-reason based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
Alert auto-resolves when pod recovers, making it clear whether the issue is active.
Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over
the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes),
effective inner MTU is 1200 — but both sides were configured at 1420.
SSH kex packets >1200 bytes were silently dropped.
Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on
London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob.
All 3 sites now fully automated:
- Sofia: Kea leases + ARP every 5min
- London: DHCP + ARP via pfSense→London SSH hop, hourly
- Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was scraping successfully but metrics were
being dropped before ingestion.
- Sofia import (every 5min): Kea leases + pfSense ARP via SSH
- Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop
- London SSH (dropbear) hangs during kex on low-power router — disabled
for now, data imported manually. TODO: lightweight push agent
- Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys
- No more ping sweeping anywhere — all passive DHCP/ARP data
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.
Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.
Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New CronJob `phpipam-pfsense-import` runs every 5min
- Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients)
- Queries pfSense ARP table (IP + MAC for static IP devices)
- Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates
- Reduced fping scan interval from 15min to 24h (weekly audit only)
- Faster, quieter, gets MACs (fping didn't), gets Kea hostnames
- SSH key (RSA PEM) stored in Vault, synced via ExternalSecret
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries
(reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname)
- Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0)
- 42 MAC→IP reservations for all known LAN devices
- Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse)
- DHCP pool .150-.199 for unknown devices
- Technitium update ACL extended to include 192.168.1.2 (pfSense WAN)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min
- Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API
- Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed)
- Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets
- Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1)
- 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.
Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>