infra

Author	SHA1	Message	Date
Viktor Barzin	6504911a77	matrix: open (tokenless) registration + bot mitigations + #security alert User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser challenges break native clients). Bot defense is layered instead: - Traefik rate-limit Middleware on a path-scoped /register ingress carve-out, keyed on request Host (GLOBAL /register cap) not source IP — the host is reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min, burst 20, per replica; CrowdSec is the hard backstop on both paths. - Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing #security Slack receiver (matches "registered on this server", never the rejection line). tuwunel's admin bot also posts signups to the admin room. Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert). Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so [ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:27:02 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	16b3969ceb	alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto the alloy container. The block under `controller:` was silently dropped, so the container ran with `resources: {}` and inherited the Kyverno LimitRange `tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the Proxmox host and rippling out to all VMs + NFS. Fix: - Move resources to `alloy.resources` (correct chart key). - Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99% memory-request saturation cluster-wide; a 1Gi request blocks scheduling on node2/node3. - Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS rolling update fits inside the helm timeout. - Bump helm_release.alloy.timeout to 900s (default 300s was too short with occasional runc-stuck-Terminating on k8s-master). Verified: all 4 alloy pods now show 1Gi/512Mi at the container level; helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well under the new limit). Memory ID 2726. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:08:35 +00:00
Viktor Barzin	669ba97078	security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `/` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos. - Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 06:37:54 +00:00
Viktor Barzin	0560d81f3a	monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane ## Loki + Alloy re-enabled (code-146x) - Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify, kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource - Reverses the documented "operational overhead vs benefit after node2 incident" decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc) needs Loki + ruler + alert routing. - SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled pointed at prometheus-alertmanager.monitoring.svc:9093 - Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes to Loki - Loki canaries running (4) - Vault audit-tail sidecar logs now flowing to Loki: queried {namespace="vault",container="audit-tail"} returns live audit JSON ## Wave 1 alert rules deployed (W1.3 partial) Added "Security Wave 1" rule group to loki_alert_rules configmap: - V1: VaultRootTokenCreated — auth/token/create with policies=[root] - V2: VaultAuditDeviceModified — sys/audit/* create/delete/update - V3: VaultSealChanged — sys/seal update - V4: VaultPolicyModified — sys/policies/acl/* create/update/delete - V5: VaultAuthFailureSpike — >10 permission denied/min - V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) - S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined, fires once promtail/Alloy ships sshd journal with job=sshd-pve) Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1 which needs the audit policy + apiserver log shipping codified. ## #security Slack lane (Alertmanager) - New `slack-security` receiver in prometheus_chart_values.tpl, channel #security - Higher-priority route at top of routes list: matchers `lane = security` → slack-security, continue: false (so wave 1 alerts never fall through to #alerts) - Slack message format includes summary + description + runbook link annotation - All wave 1 rules set `lane = "security"` label ## Resource summary - 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource, kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify, + 1 other - 5 changed: helm_release.prometheus (alertmanager config — new receiver + route), 4 deployments (image tag drift from Keel-managed images, unrelated) - 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"] (timestamp-triggered always recreates — not destructive) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-146x	2026-05-18 19:51:57 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
Viktor Barzin	ae36dc253b	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-17 21:34:11 +00:00

7 commits