Commit graph

505 commits

Author SHA1 Message Date
Viktor Barzin
b98890d799 fix(beads-server): fix Workbench GraphQL URL for remote hosting
Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS.
For k8s hosting, init container patches this to relative /graphql path.
Second ingress routes /graphql to port 9002 behind Authentik auth.

- Init container copies static JS to writable emptyDir, patches URL
- Pre-seeds store.json with Dolt connection config
- Added /graphql ingress with Authentik forward-auth

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:53:57 +00:00
Viktor Barzin
375a3d91d5 [monitoring] Exclude websocket protocol from HighServiceLatency alert
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).

Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
  statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)

Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:51:19 +00:00
Viktor Barzin
3e273399c1 fix(ci): add registry.viktorbarzin.me:5050 to imagePullSecrets
Pipeline pods pull from registry.viktorbarzin.me:5050 but the
registry-credentials secret only had auth for registry.viktorbarzin.me
(without port). Containerd requires exact hostname:port match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:50:51 +00:00
Viktor Barzin
116fdcf82d fix(ci): Woodpecker secret sync includes all event types
The vault-woodpecker-sync script was creating global secrets with only
push/tag/deployment events. Manual and cron-triggered pipelines couldn't
access secrets, causing "secret not found" errors and pipeline failures.

Also fixes three root causes of CI failures:
1. Pull-through cache corruption: purged stale blobs, added post-GC
   registry restart cron to prevent recurrence
2. Missing repo-level secrets: added registry_user/registry_password
   for the infra repo's build-ci-image workflow
3. Stuck pipelines: cleaned up 3 pipelines stuck in "running" since March

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:43:48 +00:00
Viktor Barzin
d9ed166640 fix(beads-server): add Authentik auth to Dolt Workbench
- Set protected=true on ingress (Authentik forward-auth)
- Remove unused DATABASE_URL env var (Workbench uses browser-based connection config)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:43:22 +00:00
Viktor Barzin
c33f597111 feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN
Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:38:27 +00:00
Viktor Barzin
27d7c91608 feat(beads-server): add Dolt Workbench web UI
Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server
namespace. Provides SQL console, spreadsheet editor, and commit graph
visualization for the centralized beads task database.

- Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied)
- Connects to Dolt server via in-cluster service DNS
- Added to cloudflare_proxied_names for external access

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:32:45 +00:00
Viktor Barzin
c124a23390 fix(ci): add K8s pull secrets to Woodpecker agents
Pipeline pods were failing with "authorization failed: no basic auth
credentials" when pulling from the private registry. The
WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES env var was in values.yaml but
never deployed to the agents.

Also removes the stale db-init job that used `-U root` (incompatible
with CNPG's `postgres` superuser). The database already exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:21:12 +00:00
Viktor Barzin
f7411327d1 fix(affine): update image tag 0.20.7 → 0.26.6
Image ghcr.io/toeverything/affine:0.20.7 was removed from ghcr.io,
causing persistent ImagePullBackOff. Updated to latest stable 0.26.6.
Prisma migrations run via init container on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:49:46 +00:00
Viktor Barzin
8b004c4c94 feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.

## Context

Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.

## This change

Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
  with 1280Mi limit for LUKS2 Argon2id key derivation

Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates

Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update

Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover

Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs

## Services migrated (18 PVCs across 14 namespaces)

```
vaultwarden     → vaultwarden-data-encrypted
dbaas           → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver      → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud       → nextcloud-data-encrypted
forgejo         → forgejo-data-encrypted
matrix          → matrix-data-encrypted
n8n             → n8n-data-encrypted
affine          → affine-data-encrypted
health          → health-uploads-encrypted
hackmd          → hackmd-data-encrypted
redis           → redis-data-redis-node-{0,1}
headscale       → headscale-data-encrypted
frigate         → frigate-config-encrypted
meshcentral     → meshcentral-{data,files}-encrypted
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
Viktor Barzin
1613003d00 upgrade: vaultwarden 1.35.4 -> 1.35.7
Security fixes (1.35.5): 3 CVEs — org vault purge by unconfirmed owner
(GHSA-937x-3j8m-7w7p), cross-org group binding unauthorized access
(GHSA-569v-845w-g82p), refresh tokens not invalidated on stamp rotation
(GHSA-6j4w-g4jh-xjfx). 2FA remember tokens now max 30 days.
1.35.6: Fix 2FA remember tokens broken in 1.35.5.
1.35.7: Fix 2FA for Android.

Risk: SAFE (patch bump, no breaking changes)
DB backup: yes (job: pre-upgrade-vaultwarden-1776280439, SQLite, 7 MiB)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-15 19:14:21 +00:00
Viktor Barzin
cf578516e9 feat: auto-cleanup failed/evicted pods via Kyverno ClusterCleanupPolicy
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.

Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:37:49 +00:00
Viktor Barzin
de42acd68e fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
  noload mounts — in-flight writes have corrupt metadata from skipped
  journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
  runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)

Verified: backup now completes status=0 (96 PVCs OK, 0 failed)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:21:51 +00:00
Viktor Barzin
9baefa22ab fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape
- technitium-password-sync: remove RWO encrypted PVC mount that caused
  pods to stick in ContainerCreating on wrong nodes. Plugin install now
  warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
  using /root/.luks-backup-key. Uses noload mount option to skip ext4
  journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
  endpoint exists, causing ScrapeTargetDown alert).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:12:32 +00:00
Viktor Barzin
601a83d84e fix: CI pipeline image pull auth + shallow clone resilience [ci skip]
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
  pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
  clone with depth=1): fetch more history, or apply all platform
  stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
  couldn't be pulled (no imagePullSecrets on step pods)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:41:08 +00:00
Viktor Barzin
e23153cf03 chore: add pre-commit size guard and harden .gitignore
- Add .githooks/pre-commit that blocks files >2MB (configurable via
  GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
  (*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)

Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:13:18 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
0256ccdccc feat: add per-database backups for PostgreSQL and MySQL
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)

Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.

Updated restore runbooks with per-database restore procedures.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:39:33 +00:00
Viktor Barzin
3e9231ae0d feat: augment outage report template with debugging context
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
  Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
  Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
  when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:03:44 +00:00
Viktor Barzin
460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00
Viktor Barzin
c69eba9b46 fix: increase Uptime Kuma API timeout and fix status code format
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:28:18 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00
Viktor Barzin
0901dd5f61 state(monitoring): update encrypted state 2026-04-14 17:52:13 +00:00
Viktor Barzin
803cb5fd26 fix: convert Technitium zone sync from one-time Job to CronJob
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.

New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR

Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:18:19 +00:00
Viktor Barzin
30cdeefb1c chore: sync terraform state after nfsvers=4 convergence
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:20:18 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
68c8c5b4a0 fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem
SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory
mounts from k8s (NFSv4 pseudo-root path resolution). Combined with
lockd failure, both NFSv4 and NFSv3 mount paths broken. Cascaded
into DNS primary, Vault (2/3 pods), Alertmanager, 20+ services.

Changes:
- Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted
- Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted
- Removed NFS module dependency from technitium stack
- Added full post-mortem with prevention plan

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:18:59 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
root
b0303ab17d Woodpecker CI deploy commit [CI SKIP] 2026-04-12 21:39:26 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
5da6d75094 fix(monitoring): PodCrashLooping alert now fires only for active CrashLoopBackOff
Switch from restart-count based detection (increase restarts[1h] > 5) to
waiting-reason based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
Alert auto-resolves when pod recovers, making it clear whether the issue is active.
2026-04-12 12:41:07 +01:00
Viktor Barzin
65551e4602 fix(dbaas): relax MySQL anti-affinity from required to preferred
Avoids pods getting stuck Pending during node outages while still
preferring to spread across nodes.
2026-04-11 16:26:24 +01:00
Viktor Barzin
160e8980e5 perf(immich): restore PostgreSQL vector search optimizations
- shared_buffers: 1GB → 2GB (clip_index is 452MB, needs headroom)
- effective_cache_size: 1536MB → 2560MB
- PG memory: 2Gi → 3Gi to support larger shared_buffers
- Add pg_prewarm to shared_preload_libraries with autoprewarm
- First search after restart: 999ms → 25ms
2026-04-11 10:30:44 +01:00
Viktor Barzin
aa58565ecc upgrade immich to v2.7.4 and increase rate limit burst
- Immich version: v2.7.3 → v2.7.4
- Immich rate limit: avg 200→500, burst 2000→5000 (both traefik and platform stacks)
2026-04-11 10:15:42 +01:00
Viktor Barzin
75255d22a2 fix(phpipam): fix London SSH via WG MTU reduction (1420→1200)
Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over
the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes),
effective inner MTU is 1200 — but both sides were configured at 1420.
SSH kex packets >1200 bytes were silently dropped.

Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on
London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob.

All 3 sites now fully automated:
- Sofia: Kea leases + ARP every 5min
- London: DHCP + ARP via pfSense→London SSH hop, hourly
- Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 08:18:42 +00:00
Viktor Barzin
d7de5de07c fix(monitoring): add pve_* metrics to Prometheus whitelist
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was scraping successfully but metrics were
being dropped before ingestion.
2026-04-10 22:58:49 +01:00
Viktor Barzin
04beb123eb feat(phpipam): split CronJobs - Sofia 5min, remote sites hourly
- Sofia import (every 5min): Kea leases + pfSense ARP via SSH
- Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop
- London SSH (dropbear) hangs during kex on low-power router — disabled
  for now, data imported manually. TODO: lightweight push agent
- Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys
- No more ping sweeping anywhere — all passive DHCP/ARP data

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:24:58 +00:00
Viktor Barzin
a0392a9617 fix(nextcloud): auto-sync DB password from Vault rotation into config.php
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.

Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.

Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
2026-04-10 22:23:52 +01:00
Viktor Barzin
92e0c18e81 feat(phpipam): pull Valchedrym devices from OpenWRT DHCP/ARP via SSH
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:06:39 +00:00
Viktor Barzin
cddbb1c8b0 feat(phpipam): scan London/Valchedrym via WireGuard tunnel
- pfsense-import CronJob now scans remote subnets (192.168.8.0/24,
  192.168.0.0/24) via parallel ping sweep through pfSense WG tunnel
- 13 London devices + 1 Valchedrym device discovered
- Known hosts named: ha-london, rpi-london, openwrt-london
- fping cron container fully removed

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:58:49 +00:00
Viktor Barzin
bba2de9eb1 refactor(phpipam): remove fping cron container
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:38:59 +00:00
Viktor Barzin
d71c636269 feat(phpipam): replace fping scanning with pfSense Kea+ARP import
- New CronJob `phpipam-pfsense-import` runs every 5min
  - Queries Kea DHCP lease API (IP + MAC + hostname for all DHCP clients)
  - Queries pfSense ARP table (IP + MAC for static IP devices)
  - Imports into phpIPAM MySQL: new hosts get inserted, existing get MAC/hostname updates
- Reduced fping scan interval from 15min to 24h (weekly audit only)
- Faster, quieter, gets MACs (fping didn't), gets Kea hostnames
- SSH key (RSA PEM) stored in Vault, synced via ExternalSecret

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:37:28 +00:00
Viktor Barzin
e09996302a feat(phpipam): bidirectional DNS sync + Kea DHCP for 192.168.1.x
- CronJob now pulls hostnames FROM Technitium INTO phpIPAM for unnamed entries
  (reverse sync: Kea DDNS registers → Technitium PTR → phpIPAM hostname)
- Kea DHCP4 now serves 192.168.1.0/24 via pfSense WAN (vtnet0)
- 42 MAC→IP reservations for all known LAN devices
- Kea DDNS registers 192.168.1.x hosts in Technitium (forward + reverse)
- DHCP pool .150-.199 for unknown devices
- Technitium update ACL extended to include 192.168.1.2 (pfSense WAN)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:29:25 +00:00
Viktor Barzin
a86394f12b feat(phpipam): add DNS sync CronJob and Kea DDNS integration
- CronJob syncs phpIPAM hosts → Technitium DNS (A + PTR records) every 15min
- Queries phpIPAM MySQL directly for named hosts, pushes to Technitium API
- Covers 192.168.1.0/24 LAN (TP-Link DHCP, not Kea-managed)
- Kea DDNS configured on pfSense for 10.0.10.0/24 + 10.0.20.0/24 subnets
- Technitium zones accept dynamic updates from pfSense IPs (10.0.20.1, 10.0.10.1)
- 5 reverse DNS zones created (10.0.10, 20.0.10, 1.168.192, 2.3.10, 0.168.192)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:47:24 +00:00
Viktor Barzin
4d3d3316ab feat(phpipam): deploy phpIPAM for live IP address management
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:19:25 +00:00
Viktor Barzin
2a10c1821f fix(nextcloud): enable mysql.utf8mb4 to fix collation errors with MySQL 8.4
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.

Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 09:16:48 +00:00