viktor/infra

Commit graph

Author

SHA1

Message

Date

Viktor Barzin

d85b54d89d

kms: per-connection state in notifier (vlmcsd is multi-threaded)

Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).

Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.
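A minimal sketch of the per-connection bookkeeping described above, not the actual slack-notifier.py code. The `OPEN ip:port` / `CLOSE ip:port` / `Label: value` line shapes are simplified stand-ins for vlmcsd's real log format, and `DETAIL_FIELDS` and the event layout are hypothetical:

```python
from typing import Optional

CURRENT = "__current"  # key holding the ip:port of the most recent OPEN

# Hypothetical mapping from detail-line labels to event fields.
DETAIL_FIELDS = {"Application ID": "app", "Workstation name": "host"}


def process_line(state: dict, line: str) -> Optional[dict]:
    """Feed one log line; return an event dict when a tracked connection closes."""
    line = line.strip()
    if line.startswith("OPEN "):
        key = line[len("OPEN "):]
        # One sub-dict per in-flight connection, keyed by ip:port.
        state[key] = {"ip": key.rsplit(":", 1)[0]}
        state[CURRENT] = key
        return None
    if line.startswith("CLOSE "):
        key = line[len("CLOSE "):]
        conn = state.pop(key, None)
        if state.get(CURRENT) == key:
            del state[CURRENT]
        if conn is None:
            return None  # orphan CLOSE: dropped silently
        # A connection that captured no detail fields is treated as a probe.
        conn["is_probe"] = not any(f in conn for f in DETAIL_FIELDS.values())
        return conn
    # Unkeyed detail line: attribute it to the most recent OPEN.
    label, sep, value = line.partition(":")
    field = DETAIL_FIELDS.get(label.strip())
    current = state.get(CURRENT)
    if sep and field and current in state:
        state[current][field] = value.strip()
    return None
```

With this shape, a probe's OPEN/CLOSE arriving in the middle of an activation only creates and removes its own sub-dict; the activation's fields are still present when its CLOSE arrives.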

Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
  a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.

Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product="Windows 10 Professional",
status="Licensed"} 1. The real WAN client IP was preserved end to end
through the ETP=Local + dedicated MetalLB IP chain.

2026-05-22 14:16:40 +00:00

Viktor Barzin

67b11a964a

kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise

Two coupled fixes for the hourly Slack noise + missing client IPs:

1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
   10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
   WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
   kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
   Sharing 10.0.20.200 is blocked because all 10 services there are
   ETP=Cluster and MetalLB requires consistent ETP per shared IP.

2. Slack notifier now suppresses Slack posts for bare TCP open/close
   pairs (no Application/Activation block) — these are Uptime Kuma's
   port monitor and the new kubelet readiness/liveness probes. Probe
   counts go to a new metric kms_connection_probes_total{source} where
   source classifies the IP as internal_pod / cluster_node / external.
   Real activations are unaffected.
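The probe filtering in item 2 could look roughly like the sketch below. Only the names `classify_source`, `is_probe`, and the three source labels come from this commit message; the CIDR ranges, the `conn` dict layout, and `should_post_to_slack` are assumptions made for illustration:

```python
import ipaddress
from collections import Counter

# Hypothetical ranges; the real pod and node networks are not given here.
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")
NODE_CIDR = ipaddress.ip_network("10.0.20.0/24")

# Stand-in for kms_connection_probes_total{source}.
probes_total = Counter()


def classify_source(ip: str) -> str:
    """Bucket a client IP as internal_pod, cluster_node, or external."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_CIDR:
        return "internal_pod"
    if addr in NODE_CIDR:
        return "cluster_node"
    return "external"


def is_probe(conn: dict) -> bool:
    """A bare open/close pair carries no activation details at all."""
    return not any(conn.get(f) for f in ("app", "product", "host"))


def should_post_to_slack(conn: dict) -> bool:
    """Count probes by source and suppress them; let activations through."""
    if is_probe(conn):
        probes_total[classify_source(conn["ip"])] += 1
        return False
    return True
```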

Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.
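For reference, a TCP probe amounts to opening a connection to the port and closing it again without sending data, which is why vlmcsd logs it as a bare open/close pair. A rough Python equivalent (the function name `tcp_probe` and the timeout are illustrative, not part of the deployment):

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # Connect and immediately close; no payload is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_probe("127.0.0.1", 1688)` from inside the pod would approximate the check that gates Pod Ready.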

pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
  smtps, etc.) untouched

Runbook updated. Tests added for classify_source / is_probe / process_line.

2026-05-22 14:16:40 +00:00

Viktor Barzin

08edd92b22

kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure

Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.

Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.

The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.

2026-05-10 11:12:39 +00:00

3 commits