Commit graph

31 commits

Viktor Barzin
f1f723be83 [technitium] zone-sync now reconciles primaryNameServerAddresses
When a zone is created against a stale primary IP (e.g. the old primary
pod IP 10.10.36.189 before the technitium-primary ClusterIP service
existed), AXFR refresh keeps failing forever while every other zone on
the same replica refreshes fine from 10.110.37.186. The resync-only
branch didn't touch zone options, so the bad IP was pinned indefinitely.

This surfaced as rpi-sofia.viktorbarzin.lan returning 192.168.1.16
(the pre-move address) on the secondaries while the primary had served
the correct .10 since the morning of 2026-04-22 — the Uptime Kuma Sofia
RPI monitor went DOWN and the cluster's cluster_healthcheck check FAILED.

The sync loop now re-applies primaryNameServerAddresses on every run
for existing zones. This is idempotent — Technitium accepts identical
values — and self-heals any drift within 30 minutes. The env var was
renamed PRIMARY_IP → PRIMARY_HOST for consistency with the reconcile
semantics.

The hostname form (technitium-primary.technitium.svc.cluster.local) was
tried, but Technitium's own resolver doesn't forward svc.cluster.local,
so the field must stay a literal IP. Terraform tracks the ClusterIP on
every apply and the reconcile loop propagates it to the replicas.
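
Roughly, one reconcile pass now looks like this (a sketch only — the endpoint
and parameter names follow Technitium's HTTP API, while the service names,
the 5380 console port, token handling, and the JSON response shape are
assumptions):

```bash
set -euo pipefail
# NOTE: hostnames, port, and the jq paths below are assumptions, not the real script.
for replica in technitium-secondary technitium-tertiary; do
  zones=$(curl -fsS "http://${replica}:5380/api/zones/list?token=${TOKEN}" \
    | jq -r '.response.zones[] | select(.type == "Secondary") | .name')
  for zone in $zones; do
    # Idempotent: Technitium accepts identical values, so re-applying every run is safe.
    curl -fsS -G "http://${replica}:5380/api/zones/options/set" \
      --data-urlencode "token=${TOKEN}" \
      --data-urlencode "zone=${zone}" \
      --data-urlencode "primaryNameServerAddresses=${PRIMARY_HOST}" \
      | jq -e '.status == "ok"' >/dev/null \
      || echo "WARN: failed to reconcile ${zone} on ${replica}" >&2
  done
done
```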
2026-04-22 17:47:18 +00:00
Viktor Barzin
364df9f2ea [dns] readiness gate — replace auth-required zone-count probe with DNS parity check
The zone-count parity check required hitting /api/zones/list, which needs an
auth token. The null_resource has no access to the Technitium admin password
(it's declared `sensitive = true` on the module variable), so we were probing
with an empty token and getting 200 OK with an error JSON — silently reporting
0 zones for every instance.

Replaced the HTTP probe with a second DNS check: dig idrac.viktorbarzin.lan
on each pod, require the same A record from all three. This catches both
"zone not loaded on an instance" and "zone drift between primary and
replicas" without needing any HTTP client or credentials. The AXFR chain
guarantees all three should converge on the same value.
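
The parity check amounts to roughly the following (a sketch — the namespace,
label selector, and pod access are assumptions; the real logic lives in the
readiness-gate provisioner):

```bash
set -euo pipefail
record="idrac.viktorbarzin.lan"
answers=""
# Label selector is assumed; adjust to however the three instances are labelled.
for pod in $(kubectl -n technitium get pods -l app=technitium -o name); do
  # Probe the serving path directly on each instance — no web API, no credentials.
  a=$(kubectl -n technitium exec "${pod#pod/}" -- dig +short "$record" A @127.0.0.1 | head -n1)
  [ -n "$a" ] || { echo "FAIL: ${pod} returned no answer for ${record}" >&2; exit 1; }
  answers="${answers}${a}"$'\n'
done
# Parity: every instance must return the same A record.
if [ "$(printf '%s' "$answers" | sort -u | wc -l)" -ne 1 ]; then
  echo "FAIL: A-record drift across instances:" >&2; printf '%s' "$answers" >&2; exit 1
fi
echo "PASS: all instances answer $(printf '%s' "$answers" | sort -u) for ${record}"
```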

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:24:56 +00:00
Viktor Barzin
91aa39ef96 [dns] readiness gate — reject all-zero zone counts as probe failure
The zone-count parity check was trivially passing when the ephemeral
curl pod failed to reach the Technitium web API: all three counts came
back as 0, UNIQ=1, and the gate claimed "PASSED". This happened during today's
DNS hardening apply when CoreDNS was in CrashLoopBackOff and the curl
pod couldn't resolve service names.

Added a MIN > 0 sanity check. Technitium always has built-in zones
(localhost, standard reverse PTRs), so a zero count means the probe
didn't reach the API, not that the instance truly has zero zones.
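
For illustration, the gate logic now looks roughly like this (variable names
are mine, not the script's):

```bash
# counts = zone counts probed from primary / secondary / tertiary
counts=(7 7 7)
uniq=$(printf '%s\n' "${counts[@]}" | sort -u | wc -l)
min=$(printf '%s\n' "${counts[@]}" | sort -n | head -n1)
if [ "$uniq" -ne 1 ]; then
  echo "FAIL: zone counts diverge: ${counts[*]}"; exit 1
elif [ "$min" -eq 0 ]; then
  # All-zero parity means the probe never reached the API: Technitium always has
  # built-in zones (localhost, reverse PTRs), so a live instance never reports 0.
  echo "FAIL: probe returned zero zones — API unreachable"; exit 1
fi
echo "PASS: zone counts in parity (${counts[0]})"
```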

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:07 +00:00
Viktor Barzin
af6574a006 [dns] Fix CoreDNS serve_stale syntax — 24h TTL, no refresh-mode arg
CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`:

  plugin/cache: invalid value for serve_stale refresh mode: 86400s

serve_stale takes one DURATION and an optional refresh_mode keyword
("immediate" or "verify"), not two durations. Simplified to
`serve_stale 86400s` (serve cached entries for up to 24h when upstream
is unreachable). The new CoreDNS pods were in CrashLoopBackOff; the two
old pods kept serving traffic, so there was no outage, but the partial
apply left the cluster wedged with the bad ConfigMap.

Also collapses the inline viktorbarzin.lan cache block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:18:43 +00:00
Viktor Barzin
5ea079181f [dns] Technitium — raise memory limit to 2Gi (was 1Gi, originally 512Mi)
Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi
leaves enough headroom for normal operation but a thin margin if blocklists or
the cache grow. User escalated: OOM cascades are the exact failure mode that
causes user-visible DNS outages, so give a full 2x safety margin across all
three instances. Replicas currently use 124-155Mi steady-state so they have
enormous headroom at 2Gi — accepted for symmetry and future growth (OISD
blocklists, in-memory cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:08:04 +00:00
Viktor Barzin
4b39fbb717 [dns] readiness gate — use dig-in-pod + retries, ephemeral curl pod for zone parity
Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod
health check from wget against /api to dig +short against 127.0.0.1. This
probes the actual DNS serving path, which is what we care about anyway.

Zone-count parity can't be done inside the Technitium pod (no HTTP client),
so the gate spawns a short-lived curlimages/curl pod via kubectl run --rm that
curls the three internal web services and exits.

Added a retry loop on the dig check (6 × 10s) to tolerate zone-load delay after
a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to
load into memory on a cold start.

Relaxed the A-record regex to match any IPv4 rather than 10.x — records may
legitimately live outside that range.
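
A sketch of the two probes (namespace, label selector, service names, and the
web-console port 5380 are assumptions; the real checks live in the
readiness-gate provisioner):

```bash
set -euo pipefail

# 1) Per-pod dig probe with retries (6 x 10s) to ride out slow zone load after a restart.
for pod in $(kubectl -n technitium get pods -l app=technitium -o name); do
  ok=""
  for attempt in 1 2 3 4 5 6; do
    if kubectl -n technitium exec "${pod#pod/}" -- dig +short viktorbarzin.lan SOA @127.0.0.1 | grep -q .; then
      ok=yes; break
    fi
    sleep 10
  done
  [ -n "$ok" ] || { echo "FAIL: ${pod} never answered" >&2; exit 1; }
done

# 2) Zone-count parity via an ephemeral curl pod, since Technitium images ship no HTTP client.
kubectl -n technitium run readiness-curl --rm -i --restart=Never \
  --image=curlimages/curl --command -- sh -c \
  'for svc in technitium-primary technitium-secondary technitium-tertiary; do
     curl -s -o /dev/null -w "$svc %{http_code}\n" "http://$svc:5380/api/"
   done'
```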

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:57:29 +00:00
Viktor Barzin
9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).
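
For example, the Pushgateway side of the CronJob could look roughly like this
(metric and job names are assumptions inferred from the alert names below):

```bash
# Hypothetical helper inside the zone-sync CronJob: push a status gauge, a
# failure count, and a last-run timestamp so the TechnitiumZoneSync* alerts have data.
PUSHGW="http://pushgateway.monitoring:9091"   # assumed in-cluster address
push_metrics() {
  local status=$1 failures=$2
  cat <<EOF | curl -fsS --data-binary @- "${PUSHGW}/metrics/job/technitium_zone_sync"
technitium_zone_sync_status ${status}
technitium_zone_sync_failures ${failures}
technitium_zone_sync_last_run_timestamp $(date +%s)
EOF
}

push_metrics 1 0            # success path
# push_metrics 0 3 && exit 1  # failure path: record it, then fail the Job
```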

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); the previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file, so it
  always equalled the current value (the alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.
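
The rollout portion of the gate is roughly the following (deployment names
are assumptions; the stats probe and parity check follow the same pattern):

```bash
for d in technitium-primary technitium-secondary technitium-tertiary; do
  # 180s budget per deployment; any timeout fails the whole apply unless skip_readiness=true.
  kubectl -n technitium rollout status deploy/"$d" --timeout=180s \
    || { echo "READINESS GATE FAILED: $d not ready"; exit 1; }
done
```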

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set` and `kubernetes_daemon_set` (plus their `_v1`
variants), `kubernetes_cron_job_v1`, and `kubernetes_job_v1` in the repo
and ensuring each carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  The Python script touched the file; the change was reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `kubernetes_namespace` resource
gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
afb8a16623 [infra] Scale down unused services + remove DoH ingress
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used

Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).

Clears 3 of 5 ExternalAccessDivergence services. The remaining 2 (pdf, travel)
should clear once the Uptime Kuma monitors report both as down.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:55:52 +00:00
Viktor Barzin
996bdfc9b6 [technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).

## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).

Only PostgreSQL query logging remains (90-day retention).
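
Sketch of the uninstall step (hosts, port, app names, and the JSON shape are
assumptions; the endpoints are Technitium's /api/apps/list and
/api/apps/uninstall):

```bash
for host in technitium-primary technitium-secondary technitium-tertiary; do
  # NOTE: jq path and app display names below are assumptions about the API response.
  installed=$(curl -fsS "http://${host}:5380/api/apps/list?token=${TOKEN}" \
    | jq -r '.response.apps[].name')
  for app in "Query Logs (MySQL)" "Query Logs (Sqlite)"; do
    if grep -qxF "$app" <<<"$installed"; then
      # Uninstalling deletes the app's files from the PVC, so it cannot re-enable on restart.
      curl -fsS -G "http://${host}:5380/api/apps/uninstall" \
        --data-urlencode "token=${TOKEN}" \
        --data-urlencode "name=${app}" >/dev/null
      echo "uninstalled '${app}' on ${host}"
    fi
  done
done
```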

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:20:55 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
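
As a rough sketch, the tier switch in scripts/tg behaves like this (the tier-0
list and the Vault path are illustrative; the pg backend reads its connection
string from PG_CONN_STR):

```bash
TIER0="infra platform cnpg vault dbaas external-secrets"
stack=$(basename "$PWD")

case " ${TIER0} " in
  *" ${stack} "*)
    # Tier 0: local state, SOPS-encrypted in git — unchanged bootstrap path.
    exec terragrunt "$@"
    ;;
  *)
    # Tier 1: PostgreSQL backend on CNPG; creds pulled from Vault, no SOPS/age needed.
    pg_user=$(vault kv get -field=username secret/terraform-state)   # illustrative Vault path
    pg_pass=$(vault kv get -field=password secret/terraform-state)
    export PG_CONN_STR="postgres://${pg_user}:${pg_pass}@10.0.20.200:5432/terraform_state"
    exec terragrunt "$@"
    ;;
esac
```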

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster with a raw kubernetes_stateful_set_v1
  using the official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
9baefa22ab fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape
- technitium-password-sync: remove the RWO encrypted PVC mount that caused
  pods to stick in ContainerCreating on the wrong nodes. Plugin install now
  warns instead of failing when zip is unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
  using /root/.luks-backup-key. Uses the noload mount option to skip ext4
  journal replay (see the sketch below). Also installed cryptsetup-bin on
  the PVE host.
- speedtest: disable the prometheus.io/scrape annotation (no /prometheus
  endpoint exists, which was causing a ScrapeTargetDown alert).
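
Sketch of the LUKS snapshot path referenced in the daily-backup item (device
path, mapper name, and destination are illustrative; only the keyfile path and
the noload option come from this change):

```bash
KEYFILE=/root/.luks-backup-key
SNAP=/dev/pve/vm-disk-snapshot            # illustrative LVM snapshot of the encrypted PVC
cryptsetup luksOpen --key-file "$KEYFILE" "$SNAP" backup-src
# noload: skip ext4 journal replay, which would otherwise attempt writes even on a
# read-only mount of a dirty (crash-consistent) snapshot.
mount -o ro,noload /dev/mapper/backup-src /mnt/backup-src
rsync -a /mnt/backup-src/ /mnt/backups/pvc-snapshot/   # illustrative destination
umount /mnt/backup-src
cryptsetup luksClose backup-src
```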

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:12:32 +00:00
Viktor Barzin
803cb5fd26 fix: convert Technitium zone sync from one-time Job to CronJob
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.

New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR

Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
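
One iteration of the CronJob, roughly (endpoint and parameter names per
Technitium's HTTP API; service names, port 5380, token handling, and the JSON
shapes are assumptions):

```bash
primary="http://technitium-primary:5380"
# NOTE: jq paths are assumptions about the API response shape.
zones=$(curl -fsS "${primary}/api/zones/list?token=${TOKEN}" \
  | jq -r '.response.zones[] | select(.type == "Primary") | .name')
# (zone transfer is assumed to already be allowed in the primary's zone options)

for replica in technitium-secondary technitium-tertiary; do
  existing=$(curl -fsS "http://${replica}:5380/api/zones/list?token=${TOKEN}" \
    | jq -r '.response.zones[].name')
  for zone in $zones; do
    if grep -qxF "$zone" <<<"$existing"; then
      # Already present: trigger an AXFR resync from the primary.
      curl -fsS "http://${replica}:5380/api/zones/resync?token=${TOKEN}&zone=${zone}" >/dev/null
    else
      # Missing: create it as a Secondary zone pointing at the primary.
      curl -fsS -G "http://${replica}:5380/api/zones/create" \
        --data-urlencode "token=${TOKEN}" \
        --data-urlencode "zone=${zone}" \
        --data-urlencode "type=Secondary" \
        --data-urlencode "primaryNameServerAddresses=${PRIMARY_IP}" >/dev/null
    fi
  done
done
```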

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:18:19 +00:00
Viktor Barzin
68c8c5b4a0 fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem
SEV1 outage: fsid=0 in the PVE /etc/exports broke all NFS subdirectory
mounts from k8s (NFSv4 pseudo-root path resolution). Combined with a lockd
failure, both the NFSv4 and NFSv3 mount paths were broken. This cascaded
into the DNS primary, Vault (2/3 pods), Alertmanager, and 20+ other services.
Changes:
- Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted
- Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted
- Removed NFS module dependency from technitium stack
- Added full post-mortem with prevention plan

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:18:59 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist the metric whitelist (keep rules) in the Helm template,
  preventing a regression from 33K back to 250K samples/scrape on the next apply.
  Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0 (unlimited) → 30 days (was generating
  unbounded writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
9de543c076 feat(technitium): add Split Horizon AddressTranslation for hairpin NAT fix
192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me
domains because the TP-Link router doesn't support hairpin NAT.

Adds a CronJob that configures Technitium's Split Horizon
AddressTranslation post-processor on all 3 instances to translate
176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses
for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS
Rebinding Protection privateDomains allowlist so the translated private
IP isn't stripped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:43:15 +00:00
Viktor Barzin
09b4bad958 feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline
Pin third-party images from :latest to current stable versions:
- Platform: cloudflared, technitium, snmp-exporter, pve-exporter,
  headscale, shadowsocks, xray
- Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse,
  n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe,
  cyberchef, networking-toolbox, echo, coturn, shlink, affine

Enable DIUN annotations on all pinned deployments with per-image
tag patterns. Add Woodpecker app-stacks pipeline for selective
terragrunt apply on changed app stacks.
2026-04-06 14:27:13 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to a password mismatch after the
MySQL cluster rebuild and a Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
aa7a7e74b2 fix: technitium secondary to proxmox-lvm + bootstrap TF state
- Migrate technitium-secondary-config from NFS to proxmox-lvm PVC
- Change secondary strategy from RollingUpdate to Recreate (RWO)
- Bootstrap encrypted state for insta2spotify and ebooks stacks
- Import servarr sub-module PVCs and reconcile state
2026-04-05 19:32:40 +03:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
c49e4561a3 consolidate MetalLB IPs: 5 → 1 (10.0.20.200)
Migrate all 11 LoadBalancer services to share 10.0.20.200:
- Update annotations: metallb.universe.tf → metallb.io
- Pin all services to 10.0.20.200 with allow-shared-ip: shared
- Standardize externalTrafficPolicy to Cluster (required for IP sharing)
- Remove redundant port 80 (roundcube) from mailserver LB
- Update CoreDNS forward: 10.0.20.204 → 10.0.20.200
- Update cloudflared tunnel target: 10.0.20.202 → 10.0.20.200

Services consolidated: coturn, headscale, kms, qbittorrent, shadowsocks,
torrserver, wireguard, mailserver, traefik, xray, technitium
2026-03-24 18:35:43 +02:00
Viktor Barzin
33037eba46 upgrade MetalLB v0.10.2 → v0.15.3 and update annotations
- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
2026-03-24 17:24:05 +02:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00