security(wave1): Vault audit-tail sidecar (live) + doc reality-check

## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
  `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
  from the chart's auditStorage), emits JSON audit events to stdout. kubelet
  captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
  these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
  cleanly (OnDelete strategy, so each pod was deleted manually one at a time;
  auto-unseal took ~10s per pod).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
  audit lines from ESO token issuance, KV reads, etc.
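
A minimal sketch of the sidecar described above, not the committed code.
Assumptions: the chart values are passed inline via `yamlencode`, the
sidecar list lives under the chart's `server.extraContainers` key, and the
release `name`, `repository`, and `chart` are as shown. Only the image,
command, volume name, readOnly mount, and resource figures come from this
commit message.

```hcl
# Hypothetical wiring: the real stack may use a values file instead.
resource "helm_release" "vault" {
  name       = "vault"                               # assumed release name
  namespace  = "vault"                               # namespace seen in the kubectl command
  repository = "https://helm.releases.hashicorp.com" # assumed chart source
  chart      = "vault"

  values = [yamlencode({
    server = {
      # auditStorage is assumed to be enabled elsewhere in the values;
      # it provides the `audit` volume mounted below.
      extraContainers = [{
        name    = "audit-tail"
        image   = "busybox:1.37"
        # -F keeps following the file by name if it is rotated or recreated.
        command = ["tail", "-F", "/vault/audit/vault-audit.log"]

        resources = {
          requests = { cpu = "5m", memory = "16Mi" }
          limits   = { memory = "32Mi" } # assumed to be a memory-only limit
        }

        volumeMounts = [{
          name      = "audit"        # volume created by the chart's auditStorage
          mountPath = "/vault/audit"
          readOnly  = true           # sidecar only reads the audit log
        }]
      }]
    }
  })]
}
```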

## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.
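
For illustration, the self-reference has roughly the following shape. This
is a reduced sketch: every attribute except the `depends_on` line is a
hypothetical placeholder, not copied from `loki.tf`.

```hcl
# Reduced illustration of the bug; name, namespace, repository and chart
# are placeholders, not taken from loki.tf.
resource "helm_release" "loki" {
  name       = "loki"
  namespace  = "monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  # The resource lists itself as its own dependency, so Terraform cannot
  # order it in the dependency graph and the release is never created.
  depends_on = [helm_release.loki]
}
```

Dropping the self-reference is one plausible fix. Whether a different
dependency was intended is left to the code-146x investigation.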

- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
  code-146x)
- security.md W1.3 row: added note that it is gated on code-146x
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x

## New beads task
- code-146x (P1): Loki + log shipper missing. Lists the `helm_release.loki`
  self-referencing `depends_on` bug, investigation paths, and revised wave 1
  sequencing (Loki/Alloy is prereq 0).

## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-18 19:37:36 +00:00 committed by Viktor Barzin
parent 0a26364e4f
commit c9289192c7
3 changed files with 70 additions and 27 deletions

@@ -57,7 +57,7 @@ graph TB
 |-----------|---------|----------|---------|
 | Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
 | Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
-| Loki | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
+| Loki | **NOT DEPLOYED as of 2026-05-18**: TF code exists in `stacks/monitoring/modules/monitoring/loki.tf` but `helm_release.loki` has a self-referencing `depends_on` that prevented apply. No `loki` Helm release in cluster, no Loki pods or Service. All "log aggregation" claims below are aspirational. Tracked under beads `code-146x`. | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying (planned) |
 | Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
 | Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
 | External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |