Commit graph

12 commits

Author SHA1 Message Date
Viktor Barzin
cd96fb64a8 phpipam-pfsense-import: every 5min → hourly
Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the
heaviest single contributor in our hourly fan-out investigation
(11.2 MB/s burst when it fired). Kea DDNS still handles real-time
DNS auto-registration; phpIPAM inventory just lags by up to 1h,
which we don't need fresher.

Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.
2026-04-26 22:48:43 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
f6685a23a9 [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)
Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.

Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24
  got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark.
- After:
  - 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
  - 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
- 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense
  Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2,
  94.140.14.14] so the end state is consistent across all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
  $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
  services_kea4_configure() implodes the array into "data: a, b" on the
  "domain-name-servers" option-data entry (services.inc L1214).
- Verified:
  - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
    "nameserver 94.140.14.14" after networkd renew.
  - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved
    restart.
  - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`;
    dig @10.0.20.1 google.com -> "no servers could be reached"; dig
    @94.140.14.14 google.com -> 216.58.204.110; system resolver
    (getent hosts) succeeds via the fallback IP. Blackhole route removed.

Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
  Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
  update vector was latent (DDNS wiring in Kea DHCP4 is actually off
  today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
  anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
  instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
  tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
  and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
  Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
  20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
  - update = UseSpecifiedNetworkACL
  - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
  - updateSecurityPolicies = [{tsigKeyName: kea-ddns,
                               domain: "*.<zone>", allowedTypes: [ANY]}]
  Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
  - Signed A-record update from pfSense -> "successfully processed",
    dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
    hmac-sha256; TSIG Error: NoError; RCODE: NoError").
  - Signed PTR update same zone pattern -> dig -x returns tsig-test
    FQDN.
  - Unsigned update from pfSense IP (in ACL) -> "update failed:
    REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
    Updates Security Policy").
  - Test records cleaned up via signed nsupdate.

Safety
- pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip
  (145898 bytes, pre-change snapshot — keep 30d).
- DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
  pfSense; not committed to git.

Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
  policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
  option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
  and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
  key rotation, emergency bypass, and enforcement-verification.

Closes: code-o6j

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:12:23 +00:00
Viktor Barzin
69474fae96 docs: add comprehensive DNS architecture documentation
Covers Technitium HA (3-instance AXFR replication), CoreDNS config,
Cloudflare external DNS, Split Horizon hairpin NAT fix, DHCP-DNS
auto-registration, 6 automation CronJobs, and troubleshooting guides.
Also fixes stale NFS reference in networking.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 18:10:27 +00:00
Viktor Barzin
c740ed1301 docs: update Technitium DNS docs after cache optimization
- Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md)
- Fix PDB minAvailable: 1 → 2 (networking.md)
- Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs
- Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only
- Update config storage: was generic PVC, now NFS path
2026-04-12 18:29:25 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
98aaba98da docs: add Split Horizon hairpin NAT fix to networking architecture
[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:45:53 +00:00
Viktor Barzin
cfa7a50cb5 docs: update networking architecture for DNS consolidation
- Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200)
- Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging
- Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN)
- Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability
- Update MetalLB table with per-service IP assignments
- Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:49:33 +00:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
6af47c7c89 docs: update networking architecture for single MetalLB IP
Reflect consolidation of all 11 LB services onto 10.0.20.200.
Add service port table, MetalLB v0.15 sharing key requirements,
and ETP matching troubleshooting guidance.
2026-03-24 18:44:47 +02:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00