31 commits

Author SHA1 Message Date
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).
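
The burst arithmetic can be sketched with a plain token bucket. This is a minimal model, not Traefik's implementation: it assumes every client starts with a full bucket and that all 80 requests of one page load arrive at the same instant (a hypothetical worst case).

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Plain token bucket: refills at `average` tokens/s, holds at most `burst`."""
    average: float
    burst: float
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.burst  # assumption: a fresh client starts with a full bucket

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the proxy would answer 429 here


def count_rejected(arrivals: list[float], average: float, burst: float) -> int:
    """Count how many requests (given as arrival times in seconds) get rejected."""
    bucket = TokenBucket(average=average, burst=burst)
    return sum(not bucket.allow(t) for t in arrivals)


page_load = [0.0] * 80  # hypothetical: 80 requests arriving at the same instant
default_rejected = count_rejected(page_load, average=10, burst=50)
dedicated_rejected = count_rejected(page_load, average=100, burst=1000)
print(default_rejected, dedicated_rejected)
```

Under these assumptions the default limiter rejects 30 of 80 and the dedicated one rejects none; the 28 seen in the repro is consistent with a couple of tokens refilling while the burst was in flight.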

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
308a174ad6 docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:42:27 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
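
The effect of narrowing the drop-regex can be illustrated as follows. The `NARROWED` pattern is a hypothetical stand-in (the actual regex in the scrape job is not quoted in this message), and `dropped` is an illustrative helper, not part of any Prometheus API.

```python
import re

BLANKET = r"traefik_router_.*"
# Hypothetical narrowed pattern; the real one lives in the scrape job config.
NARROWED = r"traefik_router_.*_duration_seconds_bucket"


def dropped(pattern: str, metric: str) -> bool:
    """True if a drop rule with this regex would discard the metric name."""
    # Prometheus relabel regexes are fully anchored, hence fullmatch.
    return re.fullmatch(pattern, metric) is not None


counter = "traefik_router_requests_total"
histogram = "traefik_router_request_duration_seconds_bucket"

for name in (counter, histogram):
    print(name, "blanket:", dropped(BLANKET, name), "narrowed:", dropped(NARROWED, name))
```

The blanket pattern discards both series, while the narrowed one keeps the request counter and still drops the histogram buckets.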

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
Viktor Barzin
d5fdc7ffe9 cloudflared: disable in-place autoupdate (--no-autoupdate)
Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
2026-06-10 21:00:05 +00:00
Viktor Barzin
acb847b858 actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).

Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:36:42 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
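
The two safety nets can be sketched as a pre-flight check. This is only an approximation under stated assumptions: the function names are hypothetical, the real guard's implementation is not shown in this message, and rclone enforces `--max-delete` during the run rather than from a precomputed plan.

```python
def planned_deletes(source: set[str], dest: set[str]) -> set[str]:
    """Files a one-way sync would remove from the destination."""
    return dest - source


def sync_allowed(source: set[str], dest: set[str], max_delete: int = 25) -> bool:
    """Refuse to sync from an empty source or when too many deletes are planned."""
    if not source:
        # Empty-source guard: an empty listing more likely means a failed
        # or expired Drive login than an intentionally emptied folder.
        return False
    return len(planned_deletes(source, dest)) <= max_delete


site = {f"page-{i}.html" for i in range(40)}  # hypothetical PVC content
print(sync_allowed(set(), site))              # empty source
print(sync_allowed({"page-0.html"}, site))    # would delete 39 files
print(sync_allowed(site - {"page-1.html"}, site))
```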

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
98f29edf34 technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP
In-cluster pods resolved forgejo.viktorbarzin.me to the public IP
(176.12.22.76) and hairpinned out through the WAN gateway, intermittently
timing out buildkit pushes from Woodpecker build pods (which, unlike
kubelet, don't use the per-node containerd Forgejo mirror). This silently
failed CI build-and-push for Forgejo-hosted repos (recruiter-responder
pipelines #15-#18 at the push step).

Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me
traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP
(reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target
auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's
*.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod
woodpecker-server hostAlias belt-and-suspenders.

Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7
unrelated pre-existing drifts in the stack) + verified:
- pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP)
- recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP

Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo
registry quick-ref). Advances beads code-yh33.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 07:34:30 +00:00
Viktor Barzin
7d7a0ad474 infra: fix stale Traefik LB-IP refs + accurate LB-IP registry
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
  kubernetes_service data source -- cannot rot on a future renumber and avoids
  the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
  -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
  KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
  registry + LB-IP renumber checklist (in-band + out-of-band consumers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
Viktor Barzin
c7cf21a986 Revert mail LAN-redirect approach; pending VIP-based redesign
The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203
(Traefik LB IP) as the redirect source. That couples mail's LAN
path to Traefik's IP choice — if Traefik moves again (it just
moved .200 → .203 on 2026-05-30), the mail path silently breaks.

Removing the script and the matching doc paragraph; keeping the
networking.md .200 → .203 staleness fix (separate correction).

Follow-up: give the mail HAProxy listener a dedicated pfSense
Virtual IP (IP Alias on opt1), update Technitium internal zone
+ WAN port-forwards to target the VIP, so mail's LAN-side path
is decoupled from any other service's LB IP.
2026-06-03 10:24:25 +00:00
Viktor Barzin
fd35c4f303 pfSense: LAN-side NAT redirect for mail ports landing on Traefik LB IP
Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203
(Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has
no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me
(and imap./smtp.) get sent to .203 too — where Traefik does not
listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs
while Roundcube (port 443 via Traefik) keeps working.

Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993}
gets redirected to 10.0.20.1 (the mail HAProxy listener already
serving the public path). Loaded on every incoming interface by
pfSense rule generation, so any LAN/VPN client falling into the
split-horizon answer lands on the right service unchanged.

Includes idempotent reproducer script (mirrors the existing
pfsense-haproxy-bootstrap.php pattern) and the networking.md
mail carve-out paragraph plus the stale .200 → .203 reference.
2026-06-03 10:24:25 +00:00
Viktor Barzin
5bcb4525a4 traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip]
Large Immich video downloads and uploads failed at a hard ~60s wall. The
websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike
nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps
on total request/response duration, so every transfer slower than 60s was cut
mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s
with an HTTP/2 stream reset.

- writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance
  assumes): unlimited download size/duration.
- readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop
  (Immich has no resumable upload, so the window must exceed real upload times).

Verified: the same 650MB download now completes fully (650MB / 102s, exit 0).
IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are
inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative
state); this commit syncs source + docs only, hence [ci skip].
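
The arithmetic behind the failure can be sketched as below. The helper names are hypothetical, the figures are the nominal ones from the repro (650 MB at 6 MB/s), and the upload example is an invented illustration, not a measurement.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Nominal duration of a transfer at a constant rate."""
    return size_mb / rate_mb_per_s


def completes_under_hard_cap(duration_s: float, cap_s: float) -> bool:
    """A hard cap limits total duration; 0 means no limit (as with writeTimeout=0)."""
    return cap_s == 0 or duration_s <= cap_s


download_s = transfer_seconds(650, 6)  # roughly 108 s at the nominal rate
print(round(download_s), completes_under_hard_cap(download_s, 60))  # old 60s cap
print(completes_under_hard_cap(download_s, 0))                      # uncapped

# Hypothetical upload: 10 GB at 5 MB/s stays inside the 3600 s read window.
upload_s = transfer_seconds(10_000, 5)
print(round(upload_s), completes_under_hard_cap(upload_s, 3600))
```

An idle-based timeout would behave differently: it only looks at the longest gap between reads, so a slow but steady transfer never trips it.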

Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting),
.claude/CLAUDE.md networking note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:46:59 +00:00
Viktor Barzin
e9046e5a26 traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge
Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client
as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so
real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2
only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients
(ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through
the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC
over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh
(config.xml shellcmd), keeping the nginx-off-[::] patch.
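
How the client address travels in the PROXY-v2 header can be sketched as follows. This is a simplified model for TCP over IPv6 only, with hypothetical function names; it is not HAProxy's or Traefik's code, and a real untrusted peer's header would be treated as ordinary request bytes rather than silently skipped.

```python
import ipaddress
import struct

SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte PROXY v2 signature


def build_v2_tcp6(src: str, dst: str, sport: int, dport: int) -> bytes:
    """Build a PROXY protocol v2 header for a proxied TCP-over-IPv6 connection."""
    addresses = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + struct.pack("!HH", sport, dport)
    )
    # First 0x21: version 2 + PROXY command. Second 0x21: AF_INET6 + STREAM.
    return SIGNATURE + struct.pack("!BBH", 0x21, 0x21, len(addresses)) + addresses


def client_ip(data: bytes, peer_ip: str, trusted: frozenset[str]) -> str:
    """Use the header's source address only when the peer is a trusted proxy."""
    if peer_ip not in trusted or not data.startswith(SIGNATURE):
        return peer_ip
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != 0x21 or fam_proto != 0x21 or length < 36:
        return peer_ip
    return str(ipaddress.IPv6Address(data[16:32]))


trusted = frozenset({"10.0.20.1"})  # the pfSense bridge address from this commit
header = build_v2_tcp6("2001:db8::1", "2001:db8::2", 50000, 443)
print(client_ip(header, "10.0.20.1", trusted))
print(client_ip(header, "203.0.113.9", trusted))
```

The addresses under `2001:db8::/32` and `203.0.113.0/24` are documentation ranges chosen for the example.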

Also fixes stale networking.md: Traefik was still documented on the
shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 09:51:23 +00:00
0025511b6a docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).

- architecture/vpn.md — Technitium component cell + AdGuard forwarder
  example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
  example
2026-05-23 08:53:52 +00:00
Viktor Barzin
572d6cd8e0 kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.
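
A stdlib-only metrics endpoint of this shape might look like the sketch below. The metric names, the `render_metrics` helper and the handler are assumptions for illustration; the sidecar's actual code is not shown in this message.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

activations = Counter()    # product -> activations seen
dedup_skips = Counter()    # product -> notifications skipped as duplicates
last_activation_ts = 0.0   # unix timestamp of the latest activation


def render_metrics(acts, skips, last_ts) -> str:
    """Render counter dicts in the Prometheus text exposition format."""
    lines = ["# TYPE kms_activations_total counter"]
    for product, n in sorted(acts.items()):
        lines.append(f'kms_activations_total{{product="{product}"}} {n}')
    lines.append("# TYPE kms_dedup_skips_total counter")
    for product, n in sorted(skips.items()):
        lines.append(f'kms_dedup_skips_total{{product="{product}"}} {n}')
    lines.append("# TYPE kms_last_activation_timestamp_seconds gauge")
    lines.append(f"kms_last_activation_timestamp_seconds {last_ts}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(activations, dedup_skips, last_activation_ts).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9101) -> None:
    """Blocking; call from the sidecar's entrypoint."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` would expose the endpoint on :9101; the module deliberately does not start the server on import.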

Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.

The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
2026-05-09 22:12:46 +00:00
Viktor Barzin
cd96fb64a8 phpipam-pfsense-import: every 5min → hourly
Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the
heaviest single contributor in our hourly fan-out investigation
(11.2 MB/s burst when it fired). Kea DDNS still handles real-time
DNS auto-registration; the phpIPAM inventory now lags by up to 1h,
which is fresh enough for our needs.

Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.
2026-04-26 22:48:43 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
f6685a23a9 [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)
Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.

Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24
  got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark.
- After:
  - 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
  - 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
- 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense
  Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2,
  94.140.14.14] so the end state is consistent across all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
  $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
  services_kea4_configure() implodes the array into "data: a, b" on the
  "domain-name-servers" option-data entry (services.inc L1214).
- Verified:
  - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
    "nameserver 94.140.14.14" after networkd renew.
  - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved
    restart.
  - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`;
    dig @10.0.20.1 google.com -> "no servers could be reached"; dig
    @94.140.14.14 google.com -> 216.58.204.110; system resolver
    (getent hosts) succeeds via the fallback IP. Blackhole route removed.
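
The array-to-option mapping can be sketched like this. It mirrors the "data: a, b" shape described above; the `dns_option` helper and the dictionary layout are illustrative, not pfSense's PHP.

```python
def dns_option(servers: list[str]) -> dict[str, str]:
    """Join a list of resolvers into one Kea option-data entry for DHCP option 6."""
    return {"name": "domain-name-servers", "data": ", ".join(servers)}


# Per-subnet resolver lists as stated in this commit.
subnets = {
    "10.0.10.0/24": ["10.0.10.1", "94.140.14.14"],
    "10.0.20.0/24": ["10.0.20.1", "94.140.14.14"],
}

for subnet, servers in subnets.items():
    print(subnet, dns_option(servers))
```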

Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
  Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
  update vector was latent (DDNS wiring in Kea DHCP4 is actually off
  today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
  anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
  instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
  tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
  and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
  Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
  20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
  - update = UseSpecifiedNetworkACL
  - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
  - updateSecurityPolicies = [{tsigKeyName: kea-ddns,
                               domain: "*.<zone>", allowedTypes: [ANY]}]
  Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
  - Signed A-record update from pfSense -> "successfully processed",
    dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
    hmac-sha256; TSIG Error: NoError; RCODE: NoError").
  - Signed PTR update same zone pattern -> dig -x returns tsig-test
    FQDN.
  - Unsigned update from pfSense IP (in ACL) -> "update failed:
    REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
    Updates Security Policy").
  - Test records cleaned up via signed nsupdate.

Safety
- pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip
  (145898 bytes, pre-change snapshot — keep 30d).
- DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
  pfSense; not committed to git.

Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
  policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
  option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
  and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
  key rotation, emergency bypass, and enforcement-verification.

Closes: code-o6j

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:12:23 +00:00
Viktor Barzin
69474fae96 docs: add comprehensive DNS architecture documentation
Covers Technitium HA (3-instance AXFR replication), CoreDNS config,
Cloudflare external DNS, Split Horizon hairpin NAT fix, DHCP-DNS
auto-registration, 6 automation CronJobs, and troubleshooting guides.
Also fixes stale NFS reference in networking.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 18:10:27 +00:00
Viktor Barzin
c740ed1301 docs: update Technitium DNS docs after cache optimization
- Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md)
- Fix PDB minAvailable: 1 → 2 (networking.md)
- Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs
- Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only
- Update config storage: was generic PVC, now NFS path
2026-04-12 18:29:25 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
98aaba98da docs: add Split Horizon hairpin NAT fix to networking architecture
[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:45:53 +00:00
Viktor Barzin
cfa7a50cb5 docs: update networking architecture for DNS consolidation
- Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200)
- Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging
- Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN)
- Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability
- Update MetalLB table with per-service IP assignments
- Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:49:33 +00:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
6af47c7c89 docs: update networking architecture for single MetalLB IP
Reflect consolidation of all 11 LB services onto 10.0.20.200.
Add service port table, MetalLB v0.15 sharing key requirements,
and ETP matching troubleshooting guidance.
2026-03-24 18:44:47 +02:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00