infra

Author	SHA1	Message	Date
Viktor Barzin	be80ef23bb	ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable Viktor prefers not running two switches, so the TL-SG105PE takes over all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV segment moves onto a managed tagged trunk over the existing LAN1 cable: pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same MAC so vtnet3/dCCTV survived untouched). This is safe where the original 802.1Q rejection was not, because the managed switch is the only device on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the documented fallback. Old SG105E retires to cold spare; PE inherits 192.168.1.6. Glossary Segment term updated (all three segments are now bridge-tags feeding untagged pfSense vNICs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 09:15:52 +00:00
Viktor Barzin	e11bd6e893	ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere Viktor asked to verify free ports on the garage switch (192.168.1.6) before finalizing. Logging into it showed it is NOT the TL-SG105PE from the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use (apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch port-VLAN design written earlier today was based on conflating the two devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2 uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched, and no VLAN config exists anywhere. ADR, topology SVG and networking.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 08:37:15 +00:00
Viktor Barzin	248e186dce	CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera Viktor and emo are adding the first owned camera at the Sofia site (HiLook IPC-T241H-C watching the garage / server rack). Viktor asked to finalize emo's plan; the grilling session resolved emo's five open decisions and replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24), port-based VLAN split on the shared TL-SG105PE, camera default-deny with NTP-only egress, Frigate + ha-sofia as the only consumers. The PVE bridge, pfSense interface, Kea subnet and firewall rules were applied live this session (hand-managed hosts, backed up). This commit records the decision (ADR-0017), the glossary terms (Segment / CCTV segment), the as-built architecture doc, and bumps Frigate's ADR-0016 VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:01:45 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	308a174ad6	docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP (10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc still listed only four IPs in use / three dedicated. Add the .204 row to the allocation table, bump the counts (five in use, four dedicated, 5-IP layout), and add a LB-IP renumber-checklist entry for the out-of-band consumers (the go2rtc WebRTC candidate on the frigate config PVC and the HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE candidates, so the Service annotation is the single source of truth. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:42:27 +00:00
Viktor Barzin	b84b0021c2	authentik: dedicated rate-limit carve-out + per-router 5xx observability Unauthenticated users were getting a blank login screen (and the screen would sometimes just hang). Root-caused via a read-only fan-out + adversarial verify: the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was the only first-party SPA still on the default limiter (8 siblings already have a carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket). - traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000, mirroring the existing health/tripit carve-outs). The authentik / and /static ingresses switch to it in the authentik-stack commit. - monitoring: the `traefik` scrape job's drop-regex was a blanket `traefik_router_.`, which also dropped `traefik_router_requests_total`, so per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable. Narrowed it to keep the counter while still dropping the high-cardinality `_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh` for the episodic all-3-server-pods-NotReady 502/503/504 cascade. Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:10:34 +00:00
Viktor Barzin	ceae4d5f06	docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler never invoked) and has been removed. Document the replacement: in-kernel nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List + zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts. Both add zero per-request latency and fail open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:39:26 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff — measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget- ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7d7a0ad474	infra: fix stale Traefik LB-IP refs + accurate LB-IP registry Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead .200; this fixes the two in-Terraform ones and replaces the stale networking doc with an accurate registry + a renumber checklist. - woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200 (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and break pipeline creation). Now reads the Traefik ClusterIP dynamically via a kubernetes_service data source -- cannot rot on a future renumber and avoids the ETP=Local hairpin trap. - monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct). - docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	c7cf21a986	Revert mail LAN-redirect approach; pending VIP-based redesign The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203 (Traefik LB IP) as the redirect source. That couples mail's LAN path to Traefik's IP choice — if Traefik moves again (it just moved .200 → .203 on 2026-05-30), the mail path silently breaks. Removing the script and the matching doc paragraph; keeping the networking.md .200 → .203 staleness fix (separate correction). Follow-up: give the mail HAProxy listener a dedicated pfSense Virtual IP (IP Alias on opt1), update Technitium internal zone + WAN port-forwards to target the VIP, so mail's LAN-side path is decoupled from any other service's LB IP.	2026-06-03 10:24:25 +00:00
Viktor Barzin	fd35c4f303	pfSense: LAN-side NAT redirect for mail ports landing on Traefik LB IP Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203 (Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me (and imap./smtp.) get sent to .203 too — where Traefik does not listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs while Roundcube (port 443 via Traefik) keeps working. Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993} gets redirected to 10.0.20.1 (the mail HAProxy listener already serving the public path). Loaded on every incoming interface by pfSense rule generation, so any LAN/VPN client falling into the split-horizon answer lands on the right service unchanged. Includes idempotent reproducer script (mirrors the existing pfsense-haproxy-bootstrap.php pattern) and the networking.md mail carve-out paragraph plus the stale .200 → .203 reference.	2026-06-03 10:24:25 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	0025511b6a	docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201 Stragglers from the same drift as commit b288a59 (monorepo) / the 2026-05-22 viktorbarzin.me apex incident — the `.101` references were left over from the NodePort exposure era. Technitium's actual MetalLB LB IP is `.201` (in pool 10.0.20.200-220). - architecture/vpn.md — Technitium component cell + AdGuard forwarder example + nslookup troubleshooting hint - architecture/networking.md — 502 ingress troubleshooting snippet - plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers example	2026-05-23 08:53:52 +00:00
Viktor Barzin	572d6cd8e0	kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts activations and dedup-skips by product, gauges last-activation timestamp. Pod template gets the standard prometheus.io/scrape annotations so the cluster-wide kubernetes-pods job picks it up via pod IP. Memory request bumped to 48Mi to cover counter dicts + HTTPServer. Plus docs: networking.md footnotes the windows-kms row noting public WAN exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60, overload <virusprot> flush) pfSense filter rule, and a new runbook covers log locations, rate-limit tuning, and how to revoke the WAN forward. The matching pfSense rule was tightened in place (TCP-only + rate limits) via SSH; pfSense isn't Terraform-managed.	2026-05-09 22:12:46 +00:00
Viktor Barzin	cd96fb64a8	phpipam-pfsense-import: every 5min → hourly Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the heaviest single contributor in our hourly fan-out investigation (11.2 MB/s burst when it fired). Kea DDNS still handles real-time DNS auto-registration; phpIPAM inventory just lags by up to 1h, which we don't need fresher. Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.	2026-04-26 22:48:43 +00:00
Viktor Barzin	5a0b24f54e	[docs] TrueNAS decommission cleanup — remove references from active docs TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been served by Proxmox host (192.168.1.127) since. This commit scrubs remaining references from active docs. VM 9000 itself remains on PVE in stopped state pending user decision on deletion. In-session cleanup already landed: reverse-proxy ingress + Cloudflare record removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key} purged; homepage_credentials.reverse_proxy.truenas_token removed; truenas_homepage_token variable + module deleted; Loki + Dashy cleaned; config.tfvars deprecated DNS lines removed; historical-name comment added to the nfs-truenas StorageClass (48 bound PVs, immutable name — kept). Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally untouched — they describe state at a point in time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:55:43 +00:00
Viktor Barzin	f6685a23a9	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E) Workstream E of the DNS hardening push. Two independent pfSense-side changes to eliminate single-point DNS failures and the unauthenticated RFC 2136 update vector. Part 1 — Multi-IP DHCP option 6 - Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24 got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark. - After: - 10.0.10/24 -> [10.0.10.1, 94.140.14.14] - 10.0.20/24 -> [10.0.20.1, 94.140.14.14] - 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2, 94.140.14.14] so the end state is consistent across all three subnets. - Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's services_kea4_configure() implodes the array into "data: a, b" on the "domain-name-servers" option-data entry (services.inc L1214). - Verified: - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" + "nameserver 94.140.14.14" after networkd renew. - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved restart. - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`; dig @10.0.20.1 google.com -> "no servers could be reached"; dig @94.140.14.14 google.com -> 216.58.204.110; system resolver (getent hosts) succeeds via the fallback IP. Blackhole route removed. Part 2 — TSIG-signed Kea DHCP-DDNS - Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated update vector was latent (DDNS wiring in Kea DHCP4 is actually off today — "DDNS: disabled" in dhcpd.log) but would activate as soon as anyone turned on ddnsupdate on LAN/OPT1. - Generated HMAC-SHA256 secret, base64-encoded 32 random bytes. - Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27). 
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium instances via /api/settings/set (tsigKeys[]). - Updated kea-dhcp-ddns.conf on pfSense with tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …} and key-name: kea-ddns on each forward-ddns / reverse-ddns domain. Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - Configured viktorbarzin.lan + 10.0.10.in-addr.arpa + 20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary: - update = UseSpecifiedNetworkACL - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2] - updateSecurityPolicies = [{tsigKeyName: kea-ddns, domain: "*.<zone>", allowedTypes: [ANY]}] Technitium requires BOTH a source-IP match AND a valid TSIG signature. - Verified TSIG end-to-end: - Signed A-record update from pfSense -> "successfully processed", dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo: hmac-sha256; TSIG Error: NoError; RCODE: NoError"). - Signed PTR update same zone pattern -> dig -x returns tsig-test FQDN. - Unsigned update from pfSense IP (in ACL) -> "update failed: REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic Updates Security Policy"). - Test records cleaned up via signed nsupdate. Safety - pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip (145898 bytes, pre-change snapshot — keep 30d). - DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on pfSense; not committed to git. Docs - architecture/dns.md: zone dynamic-updates section records the TSIG policy; Incident History gets a WS E entry. - architecture/networking.md: DHCP Coverage table now shows the DNS option 6 values per subnet; pfSense block notes the TSIG-signed DDNS and config backup path. - runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers key rotation, emergency bypass, and enforcement-verification. 
Closes: code-o6j Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:12:23 +00:00
Viktor Barzin	69474fae96	docs: add comprehensive DNS architecture documentation Covers Technitium HA (3-instance AXFR replication), CoreDNS config, Cloudflare external DNS, Split Horizon hairpin NAT fix, DHCP-DNS auto-registration, 6 automation CronJobs, and troubleshooting guides. Also fixes stale NFS reference in networking.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 18:10:27 +00:00
Viktor Barzin	c740ed1301	docs: update Technitium DNS docs after cache optimization - Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md) - Fix PDB minAvailable: 1 → 2 (networking.md) - Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs - Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only - Update config storage: was generic PVC, now NFS path	2026-04-12 18:29:25 +01:00
Viktor Barzin	eec6af6aef	docs: add IPAM/DDNS architecture diagram and update docs - networking.md: Add mermaid diagram showing full device discovery pipeline (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync) - networking.md: Add data flow table, DHCP coverage table - networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM (passive import replaces fping), Technitium (192.168.1.2 in ACL) - CLAUDE.md: Update phpIPAM and networking descriptions [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:42:10 +00:00
Viktor Barzin	8cd8743140	docs: add phpIPAM, Kea DDNS, and DNS sync documentation - networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones, Technitium dynamic update policy - CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section - service-catalog.md: Add phpipam, mark netbox as disabled/replaced [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 16:01:32 +00:00
Viktor Barzin	98aaba98da	docs: add Split Horizon hairpin NAT fix to networking architecture [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 18:45:53 +00:00
Viktor Barzin	cfa7a50cb5	docs: update networking architecture for DNS consolidation - Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200) - Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging - Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN) - Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability - Update MetalLB table with per-service IP assignments - Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 17:49:33 +00:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	6af47c7c89	docs: update networking architecture for single MetalLB IP Reflect consolidation of all 11 LB services onto 10.0.20.200. Add service port table, MetalLB v0.15 sharing key requirements, and ETP matching troubleshooting guidance.	2026-03-24 18:44:47 +02:00
Viktor Barzin	5a42643176	add architecture documentation for all infrastructure subsystems [ci skip] 14 docs covering networking, VPN, storage, authentication, security, monitoring, secrets, CI/CD, backup/DR, compute, databases, and multi-tenancy. Each doc includes Mermaid diagrams, component tables, configuration references, decision rationale, and troubleshooting.	2026-03-24 00:55:25 +02:00

31 commits