docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly

Locked design for wave 1 of cluster security hardening. Plan only; implementation lives in beads code-8ywc and follow-up commits.

Captures:
- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7, S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1) and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy, rationale for not adopting canary tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:10:16 +00:00 · 2026-05-18 19:10:16 +00:00 · b3cf75dc61
commit b3cf75dc61
parent b879481d71
4 changed files with 335 additions and 8 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -147,6 +147,17 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).

+## Security Posture (Wave 1 — locked 2026-05-18)
+
+Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
+
+- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
+- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
+- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
+
 ## Storage & Backup Architecture

 ### Storage Class Decision Rule (for new services)
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -176,6 +176,35 @@ The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10

 Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.

+#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
+
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+
+| # | Source | Event | Severity |
+|---|---|---|---|
+| K2 | kube-audit | SA token used from outside cluster | critical |
+| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
+| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
+| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
+| K6 | kube-audit | Audit policy itself modified | critical |
+| K7 | kube-audit | New `*,*` ClusterRole created | warning |
+| K8 | kube-audit | Anonymous binding granted | critical |
+| K9 | kube-audit | `me@viktorbarzin.me` request from non-allowlist sourceIP | critical |
+| V1 | vault-audit | Root token created | critical |
+| V2 | vault-audit | Audit device disabled/modified | critical |
+| V3 | vault-audit | Seal status changed | critical |
+| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
+| V5 | vault-audit | Auth failure spike >10/min | warning |
+| V6 | vault-audit | Token with policies different from parent created | critical |
+| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires `x_forwarded_for_authorized_addrs`) | critical |
+| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |
+
+K1 (cluster-admin grant) intentionally skipped — see security.md.
+
+Allowlist source-IP CIDRs (used by K2, K9, V7, S1): `10.0.20.0/22`, `192.168.1.0/24`, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
+
+Disk-write impact: estimated ~1-2 GB/day of additional writes after custom audit-policy tuning. Retention: 90d for security streams.
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -111,16 +111,20 @@ Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-g

 This prevents resource exhaustion and enforces governance without manual quota management.

-#### Security Policies (ALL in Audit Mode)
+#### Security Policies

-**Why audit mode?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
+**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations first and are then selectively enforced after cleanup.

-| Policy | Purpose | Enforcement |
-|--------|---------|-------------|
-| `deny-privileged-containers` | Block privileged pods | Audit |
-| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit |
-| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit |
-| `require-trusted-registries` | Only allow approved image registries | Audit |
+**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
+
+| Policy | Purpose | Current | Planned (wave 1) |
+|--------|---------|---------|------------------|
+| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
+| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
+| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
+| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
+
+Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.

 #### Operational Policies

@@ -163,6 +167,98 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l

 **Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`

+### Audit Logging & Anomaly Detection (Wave 1 — planned 2026-05-18)
+
+Beads epic: `code-8ywc`. **Status: planned, not yet implemented.** The block below documents the locked design so future sessions don't re-grill.
+
+Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
+
+#### Detection sources
+
+| Source | Mechanism | Ships via | Loki job label |
+|---|---|---|---|
+| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
+| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
+| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
+| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
+
+#### Alert rules (16 total)
+
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
+
+**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
+
+| # | Event | Severity |
+|---|---|---|
+| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
+| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
+| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
+| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
+| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
+| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
+| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
+| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
+
+**Vault audit (V1-V7):**
+
+| # | Event | Severity |
+|---|---|---|
+| V1 | Root token created | critical |
+| V2 | Audit device disabled or modified | critical |
+| V3 | Seal status changed (`sys/seal` write) | critical |
+| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
+| V5 | Authentication failure spike >10/min on any auth method | warning |
+| V6 | Token created with policies different from parent (privilege escalation) | critical |
+| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
+
+**Host (S1):**
+
+| # | Event | Severity |
+|---|---|---|
+| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
+
+#### Allowlist — "expected source IPs" for K2, K9, V7, S1
+
+| CIDR | Source |
+|---|---|
+| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
+| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
+| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
+| K8s service CIDR | Service-to-apiserver traffic |
+| Headscale tailnet | VPN-connected devices |
+
+**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
+
+#### Why no canary tokens
+
+The original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). They were rejected because Viktor routinely greps `secret/viktor` (135 keys) and runs `kubectl get secret -A`, so any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens: same threat model, no fake-token operational burden.
+
+#### Why no K1 (cluster-admin grant detection)
+
+Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
+
+#### IOPS / disk-wear
+
+Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
+
+### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
+
+Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
+
+**Approach (γ): cluster-wide observe-then-enforce.**
+
+1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
+2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
+3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
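The week-1 step of turning observed traffic into a per-namespace allowlist can be sketched as below. This is a minimal sketch, not the planned implementation: the record keys (`src_namespace`, `dest`, `port`, `proto`) and the function name are hypothetical and do not reflect Calico's actual flow-log schema, which would need to be mapped into this shape first.

```python
from collections import defaultdict


def build_egress_allowlist(flows):
    """Group observed egress flows into a per-namespace allowlist.

    `flows` is an iterable of dicts with the hypothetical keys
    src_namespace, dest, port, and proto (not Calico's real schema).
    Returns {namespace: sorted list of unique (dest, port, proto)}.
    """
    seen = defaultdict(set)
    for flow in flows:
        # A set collapses repeated flows to the same destination.
        seen[flow["src_namespace"]].add((flow["dest"], flow["port"], flow["proto"]))
    return {ns: sorted(dests) for ns, dests in seen.items()}
```

Destinations only show up if they were contacted during the observation window, so the output is a starting point for review rather than a finished policy.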
+
+**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
+
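+**DNS handling:** domain-based egress rules (the `domains:` field on a rule's destination) match on IPs learned from observed DNS responses. Confirm at implementation time that the installed Calico edition supports them, since the feature is documented for Calico Enterprise/Cloud. Static IPs reserved for fixed-IP services (Brevo SMTP relay).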
+
+**Known risks:**
+- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
+- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
+
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/runbooks/security-incident.md
+++ b/docs/runbooks/security-incident.md
@@ -0,0 +1,191 @@
+# Security Incident Response
+
+What to do when a wave-1 security alert fires. Each alert section gives concrete remediation steps and, where one has been drafted, a LogQL query for investigation.
+
+**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
+
+## General workflow
+
+1. **Acknowledge in Alertmanager.** Silence only after triage starts.
+2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
+3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
+4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
+5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
+
+## Allowlist CIDRs
+
+All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
+
+- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
+- `192.168.1.0/24` — Proxmox + Sofia LAN
+- K8s pod CIDR (verify at implementation time)
+- K8s service CIDR
+- Headscale tailnet
+
+**Anything outside = alert.** No public-IP exceptions.
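When triaging a source-IP alert by hand, membership in the allowlist can be checked exactly rather than by eye. A minimal sketch using Python's standard `ipaddress` module; only the two literal CIDRs from the list above are included, because the pod, service, and Headscale ranges are not stated here. The function name is hypothetical.

```python
import ipaddress

# Literal CIDRs from the allowlist above. The pod, service, and Headscale
# ranges are omitted because the runbook does not state them yet.
ALLOWLIST = [
    ipaddress.ip_network("10.0.20.0/22"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_allowlisted(ip, networks=ALLOWLIST):
    """Return True if `ip` falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Note that `10.0.20.0/22` spans `10.0.20.0` through `10.0.23.255`, so any regex approximation of it has to cover third octets 20 to 23.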
+
+## Viktor's identity
+
+`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
+
+---
+
+## K-alerts (K8s API audit)
+
+### K2 — ServiceAccount token used from outside cluster
+
+**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
+
+```logql
+{job="kube-audit"} | json | json source_ip="sourceIPs[0]" | user_username =~ "system:serviceaccount:.*" | source_ip !~ "10\\.0\\.2[0-3]\\..*|192\\.168\\.1\\..*|<pod-cidr>|<service-cidr>"
+```
+
+**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
+
+**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
+
+### K3 — Secret read in sensitive namespace by unexpected actor
+
+**Meaning:** A Secret in the `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an actor NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
+
+```logql
+{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin\\.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
+```
+
+**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
+
+### K4 — Exec into sensitive pod
+
+**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
+
+```logql
+{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
+```
+
+**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
+
+**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
+
+### K5 — Mass delete
+
+**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
+
+```logql
+sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
+```
+
+**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
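For replaying exported audit events offline (for example, to see which actor tripped the threshold), the same ">5 deletes in 60s per actor" logic can be sketched in Python. This is an illustrative sketch, not the alert rule itself; the function name and the `(unix_ts, actor)` input shape are assumptions.

```python
from collections import defaultdict, deque


def mass_delete_actors(events, threshold=5, window_s=60):
    """Return actors with more than `threshold` deletes in any trailing window.

    `events` is an iterable of (unix_ts, actor) pairs for delete calls,
    in any order. The window covers the last `window_s` seconds up to and
    including each event.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, actor in sorted(events):
        window = recent[actor]
        window.append(ts)
        # Drop timestamps that have aged out of the trailing window.
        while window and ts - window[0] >= window_s:
            window.popleft()
        if len(window) > threshold:
            flagged.add(actor)
    return flagged
```

The threshold is strict: exactly five deletes in a minute does not flag an actor, matching the `> 5` comparison in the query.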
+
+### K6 — Audit policy modified
+
+**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
+
+**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
+
+### K7 — New ClusterRole with full wildcards
+
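+**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive. The query below reads `requestObject`, which kube-apiserver records only at audit level `Request` or `RequestResponse`, so ClusterRole writes must be logged above `Metadata` for it to match.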
+
+```logql
+{job="kube-audit"} | json | json first_verb="requestObject.rules[0].verbs[0]", first_resource="requestObject.rules[0].resources[0]" | verb = "create" | objectRef_resource = "clusterroles" | first_verb = "*" | first_resource = "*"
+```
+
+**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
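During triage, the wildcard check can also be run against the ClusterRole object itself. A minimal sketch, assuming the role has been loaded into a dict (for example from `kubectl get clusterrole <name> -o json`); the function name is hypothetical. Unlike a check on the first rule only, it scans every rule and every list position.

```python
def has_full_wildcard(cluster_role):
    """Return True if any rule grants "*" verbs on "*" resources.

    `cluster_role` is a ClusterRole as a dict. Missing or null `rules`,
    `verbs`, or `resources` are treated as empty. apiGroups are not
    inspected in this sketch.
    """
    for rule in cluster_role.get("rules") or []:
        verbs = rule.get("verbs") or []
        resources = rule.get("resources") or []
        if "*" in verbs and "*" in resources:
            return True
    return False
```

A role whose wildcard sits in a second rule, or after another verb in the list, is still caught here.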
+
+### K8 — Anonymous binding
+
+**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
+
+**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
+
+### K9 — Viktor's identity from unexpected source IP
+
+**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
+
+```logql
+{job="kube-audit"} | json | json source_ip="sourceIPs[0]" | user_username = "me@viktorbarzin.me" | source_ip !~ "10\\.0\\.2[0-3]\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
+```
+
+**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
+
+**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
+
+---
+
+## V-alerts (Vault audit)
+
+### V1 — Root token created
+
+```logql
+{job="vault-audit"} | json | json first_policy="response.auth.policies[0]" | request_path = "auth/token/create" | first_policy = "root"
+```
+
+**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
+
+### V2 — Audit device disabled/modified
+
+**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
+
+### V3 — Seal status changed
+
+**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
+
+### V4 — Policy modified
+
+**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
+
+### V5 — Auth failure spike
+
+**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
+
+### V6 — Token with policies different from parent
+
+**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
+
+### V7 — Viktor's Vault identity from unexpected source IP
+
+**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
+
+**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
+
+---
+
+## S-alerts (Host)
+
+### S1 — PVE sshd auth success from unexpected IP
+
+```logql
+{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.2[0-3]\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
+```
+
+**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
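The same extraction can be applied to raw journal lines pulled off the host when Loki is unavailable. A minimal sketch that reuses the regex from the query above and checks the address against the two literal allowlist CIDRs (the Headscale range is omitted because it is not stated here); the function name is hypothetical.

```python
import ipaddress
import re

# Same pattern as the LogQL `regexp` stage above.
ACCEPTED = re.compile(r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+)")

# Literal allowlist CIDRs only; the Headscale range is not stated in the runbook.
ALLOWLIST = [ipaddress.ip_network(c) for c in ("10.0.20.0/22", "192.168.1.0/24")]


def unexpected_login(line):
    """Return (user, ip, method) for a successful login from outside the
    allowlist, or None if the line is not a match or the IP is allowlisted."""
    match = ACCEPTED.search(line)
    if match is None:
        return None
    addr = ipaddress.ip_address(match["ip"])
    if any(addr in net for net in ALLOWLIST):
        return None
    return match["user"], match["ip"], match["method"]
```

Failed attempts never match, since only lines containing `Accepted` carry the pattern.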
+
+---
+
+## False-positive triage decision tree
+
+```
+Did the alert fire from a known operational event?
+├─ Terraform apply at the same time?       → likely V4 (policy modified)
+├─ Keel auto-roll?                          → not a security path
+├─ CI/CD pipeline running?                  → check V5 / K5
+└─ Viktor doing recovery work?              → K4, K9, S1 candidates
+                                              Extend allowlist if persistent
+```
+
+## Escalation
+
+For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
+
+1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
+2. Revoke all OIDC sessions in Authentik
+3. Rotate Vault root keys + reseal
+4. Restore from a pre-incident backup if data integrity is questionable
+5. Post-mortem per `incident-response.md`
+
+## Related
+
+- [Security architecture](../architecture/security.md)
+- [Monitoring architecture](../architecture/monitoring.md)
+- [Incident response (general)](../architecture/incident-response.md)
+- Beads epic: `code-8ywc`