From aac807fb3a637fd8c668c67dda892a858630204d Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 10 Jun 2026 19:31:45 +0000
Subject: [PATCH] pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`,
dedicated shared-root key emo-pve-agent@devvm) so he can manage the host —
e.g. the R730 fan daemon — through his agent. To keep an audit trail of what
that agent does, and to feed the long-pending Wave-1 S1 security rule, the
PVE host now ships its systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as
  {job="pve-journal"} (full host journal; filter identifier="snoopy" for the
  command audit), and relabels sshd auth to {job="sshd-pve"} — which
  ACTIVATES S1 (it was PENDING only for lack of this shipper). Config/unit:
  scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist. Loki is reached
via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan); follow-up noted to
register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the
rpi-sofia promtail — these files are the source of truth.

Docs updated: security.md (S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8
---
 docs/architecture/monitoring.md | 26 ++++++++++++++--
 docs/architecture/security.md   |  4 +--
 scripts/pve-promtail.service    | 17 +++++++++++
 scripts/pve-promtail.yaml       | 53 +++++++++++++++++++++++++++++++++
 scripts/pve-snoopy.ini          | 21 +++++++++++++
 5 files changed, 116 insertions(+), 5 deletions(-)
 create mode 100644 scripts/pve-promtail.service
 create mode 100644 scripts/pve-promtail.yaml
 create mode 100644 scripts/pve-snoopy.ini

diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index 28daac25..b7d0619d 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -122,9 +122,12 @@ view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP
 `10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`)
 because `loki.viktorbarzin.lan` has **no Technitium record yet** (the
 `technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins
-`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in
-Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the
-Sofia-Pi promtail can resolve it by name instead of pinning the LB IP.
+`ingress.viktorbarzin.lan`). The **PVE host** promtail (see "External host: pve"
+below) reaches Loki the same way, via an `/etc/hosts` pin
+`10.0.20.203 loki.viktorbarzin.lan`. **Follow-up (now 3 consumers — this sensor,
+rpi-sofia, the PVE host):** register `loki.viktorbarzin.lan` in Technitium as a
+CNAME → `ingress.viktorbarzin.lan` (auto-tracks Traefik LB renumbers) so all
+three resolve it by name instead of pinning the LB IP.
 
 ### External host: rpi-sofia (Sofia Raspberry Pi)
 
@@ -146,6 +149,23 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
 
 > The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.
 
+### External host: pve (Proxmox hypervisor, 192.168.1.127)
+
+`pve` is the Proxmox VE host — the hypervisor running **every** VM (pfSense, the 5 k8s nodes, the devvm, HA, Windows). It is not in the cluster. Since 2026-06-10 its **full systemd journal ships to cluster Loki**, closing a gap (the most critical host previously had no central logging) and giving the Wave-1 **S1** security rule its data source (`docs/architecture/security.md`).
+
+**Why now:** emo's Claude agent was granted **root SSH** to the host (a dedicated shared-root key `emo-pve-agent@devvm`, fingerprint `SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ`, reachable as `ssh pve` from the devvm) so he can manage the host (e.g. the R730 fan daemon) via his agent. To keep an audit trail, **snoopy** (enabled via `/etc/ld.so.preload` → `libsnoopy.so`; config `scripts/pve-snoopy.ini`) logs every `execve()` to journald under identifier `snoopy`, and promtail ships it to Loki.
+
+**Logs** — `promtail` v3.5.1 (amd64) at `/usr/local/bin/promtail`, config `scripts/pve-promtail.yaml`, unit `scripts/pve-promtail.service`. Ships `/var/log/journal` to `https://loki.viktorbarzin.lan/loki/api/v1/push` (`insecure_skip_verify`; LB-IP reached via the `/etc/hosts` pin noted above). Relabels: `unit`, `level`, `identifier`; sshd lines (`identifier=~"sshd.*"`) are re-jobbed to `sshd-pve` so the S1 rule matches. Streams:
+- `{job="pve-journal", host="pve"}` — full host journal (kernel, pvestatd, fan-control, NFS, etc.).
+- `{job="pve-journal", identifier="snoopy"}` — **command audit** (every execve: `uid login tty sid cwd cmdline`).
+- `{job="sshd-pve"}` — sshd auth; an `Accepted publickey ... SHA256:` line ties a session to a key (e.g. emo's fp above). Feeds S1.
+
+**Attribution caveat:** all SSH is shared-root, so snoopy `uid`/`login` are always `root`; attribute a command to a person by correlating its `sid`/timestamp with the matching `{job="sshd-pve"}` Accepted-publickey line (key fingerprint). emo's agent arrives SNAT'd as `192.168.1.2`, which is in the S1 allowlist, so legitimate access does not alert.
+
+Query examples (Grafana → Loki): `{host="pve"}`, `{job="pve-journal", identifier="snoopy"}` (command audit), `{job="sshd-pve"} |= "Accepted publickey"`.
+
+> Hand-managed (not Terraform), like the rpi-sofia and fan-control pieces: the promtail binary/config/unit, the snoopy enable (`/etc/ld.so.preload`), and the `/etc/hosts` Loki pin all live on the host. Source-of-truth files: `scripts/pve-promtail.{yaml,service}` + `scripts/pve-snoopy.ini`; deploy steps are in the `pve-promtail.yaml` header.
+
 ### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)
 
 The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.
diff --git a/docs/architecture/security.md b/docs/architecture/security.md
index 6b3e794b..4a29638d 100644
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -189,7 +189,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
 | W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
 | W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
 | W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
-| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
+| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7, S1). **S1 activated 2026-06-10** — promtail on the PVE host now ships the journal to Loki (`scripts/pve-promtail.yaml`); sshd auth lands as `job=sshd-pve` (the S1 data source). The same shipper carries snoopy `execve()` command audit as `{job="pve-journal", identifier="snoopy"}` (forensic, not alerting). Deployed because emo's agent was given root SSH to the host (shared key) — see `docs/architecture/monitoring.md` → "External host: pve". |
 | W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
 | W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
 | W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
@@ -205,7 +205,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
 |---|---|---|---|
 | K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
 | Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
-| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
+| PVE sshd auth log | journald (`_SYSTEMD_UNIT=ssh.service`, `SYSLOG_IDENTIFIER=sshd-session`); promtail relabels `identifier=~"sshd.*"` → `job=sshd-pve` | promtail systemd unit on Proxmox host (192.168.1.127), `scripts/pve-promtail.yaml` — **LIVE 2026-06-10** | `job=sshd-pve` |
 | Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
 
 #### Alert rules (16 total)
diff --git a/scripts/pve-promtail.service b/scripts/pve-promtail.service
new file mode 100644
index 00000000..0b288bfc
--- /dev/null
+++ b/scripts/pve-promtail.service
@@ -0,0 +1,17 @@
+# systemd unit for promtail on the PVE host (192.168.1.127). Install to
+# /etc/systemd/system/promtail.service. See scripts/pve-promtail.yaml for the full deploy.
+[Unit]
+Description=Promtail (ships PVE host journal -> cluster Loki)
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
+Restart=on-failure
+RestartSec=5
+User=root
+Group=root
+
+[Install]
+WantedBy=multi-user.target
diff --git a/scripts/pve-promtail.yaml b/scripts/pve-promtail.yaml
new file mode 100644
index 00000000..92e311e1
--- /dev/null
+++ b/scripts/pve-promtail.yaml
@@ -0,0 +1,53 @@
+# Promtail config for the PVE host (192.168.1.127) — ships the systemd journal to cluster Loki.
+#
+# NOT Terraform-managed (the PVE host is the hypervisor, outside k8s). Deployed by hand,
+# same pattern as scripts/fan-control.* and the rpi-sofia promtail. This file is source-of-truth.
+#
+# Deploy:
+#   scp scripts/pve-promtail.yaml root@192.168.1.127:/etc/promtail/config.yml
+#   scp scripts/pve-promtail.service root@192.168.1.127:/etc/systemd/system/promtail.service
+#   ssh root@192.168.1.127 'mkdir -p /var/lib/promtail && systemctl daemon-reload && systemctl enable --now promtail'
+#   # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755).
+#   # Loki reach: /etc/hosts pin "10.0.20.203 loki.viktorbarzin.lan" (Traefik LB, ETP-Local).
+#   # FOLLOW-UP: replace the pin with a Technitium CNAME loki.viktorbarzin.lan -> ingress.viktorbarzin.lan
+#   #   so it auto-tracks Traefik LB renumbers (also fixes the rpi-sofia pin — see docs/architecture/monitoring.md).
+#
+# Streams produced:
+#   {job="pve-journal"} — full host journal (filter identifier="snoopy" for the command audit)
+#   {job="sshd-pve"} — sshd auth lines; feeds the Loki S1 security rule (docs/architecture/security.md)
+#   {job="pve-journal", identifier="snoopy"} — snoopy command audit (every execve on the host; see scripts/pve-snoopy.ini)
+server:
+  http_listen_port: 9080
+  grpc_listen_port: 0
+  log_level: warn
+
+positions:
+  filename: /var/lib/promtail/positions.yaml
+
+clients:
+  - url: https://loki.viktorbarzin.lan/loki/api/v1/push
+    tls_config:
+      insecure_skip_verify: true
+
+scrape_configs:
+  - job_name: journal
+    journal:
+      max_age: 12h
+      json: false
+      path: /var/log/journal
+      labels:
+        host: pve
+        job: pve-journal
+    relabel_configs:
+      - source_labels: ['__journal__systemd_unit']
+        target_label: unit
+      - source_labels: ['__journal_priority_keyword']
+        target_label: level
+      - source_labels: ['__journal_syslog_identifier']
+        target_label: identifier
+      # sshd auth lines (identifier sshd / sshd-session) -> job=sshd-pve so the Loki S1
+      # security rule ({job="sshd-pve"}) matches. snoopy command lines stay job=pve-journal.
+      - source_labels: ['__journal_syslog_identifier']
+        regex: 'sshd.*'
+        target_label: job
+        replacement: 'sshd-pve'
diff --git a/scripts/pve-snoopy.ini b/scripts/pve-snoopy.ini
new file mode 100644
index 00000000..931bc29d
--- /dev/null
+++ b/scripts/pve-snoopy.ini
@@ -0,0 +1,21 @@
+; snoopy config for the PVE host (192.168.1.127) — logs every execve() to journald.
+;
+; Install to /etc/snoopy.ini. Enable globally by adding the lib to /etc/ld.so.preload:
+;   apt-get install -y snoopy
+;   echo /usr/lib/x86_64-linux-gnu/libsnoopy.so > /etc/ld.so.preload   # enable (no snoopy-enable in the Debian pkg)
+;   # disable/rollback: truncate -s 0 /etc/ld.so.preload   (or remove the line)
+;
+; output=devlog writes directly to /dev/log -> journald (identifier "snoopy").
+; DO NOT use output=syslog on a systemd host — snoopy's own docs warn it can hang the system on boot.
+;
+; Shipped to Loki by promtail as {job="pve-journal", identifier="snoopy"} (scripts/pve-promtail.yaml).
+; Attribution note: all sessions run as root (shared root key), so uid/login are always root;
+; correlate a command's sid/time with the matching {job="sshd-pve"} "Accepted publickey ... SHA256:"
+; line to attribute it to a person (e.g. emo's agent key fp SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ).
+[snoopy]
+output = devlog
+message_format = "snoopy uid=%{uid} login=%{login} tty=%{tty} sid=%{sid} cwd=%{cwd} : %{cmdline}"
+syslog_ident = snoopy
+syslog_facility = LOG_AUTHPRIV
+syslog_level = LOG_INFO
+filter_chain = ""
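
Reviewer note (not part of the patch): the attribution step described in `scripts/pve-snoopy.ini`, correlating a command's sid/time with the matching "Accepted publickey" line, could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the author's tooling. It assumes the lines have already been fetched from Loki as `(timestamp, message)` pairs. It assumes snoopy messages follow the `message_format` above and sshd lines use OpenSSH's usual `Accepted publickey for USER from IP port N ssh2: TYPE SHA256:...` shape. It attributes each `sid` to the most recent login that precedes its first command, which is a time-based heuristic of my own choosing. The function name, the sample lines and the placeholder fingerprints are all hypothetical.

```python
import re
from bisect import bisect_right

# Matches the message_format from scripts/pve-snoopy.ini. The cwd group is
# non-greedy, so a cwd that itself contains " : " would be split too early.
SNOOPY_RE = re.compile(
    r"^snoopy uid=(?P<uid>\d+) login=(?P<login>\S+) tty=(?P<tty>\S+) "
    r"sid=(?P<sid>\d+) cwd=(?P<cwd>.*?) : (?P<cmdline>.*)$"
)
# Assumed shape of the sshd auth line: user, source IP, key type, fingerprint.
ACCEPT_RE = re.compile(
    r"Accepted publickey for (?P<user>\S+) from (?P<ip>\S+) port \d+ ssh2: "
    r"\S+ (?P<fp>SHA256:\S+)"
)


def attribute_sessions(sshd_lines, snoopy_lines):
    """Map each snoopy sid to the key fingerprint of the latest prior login.

    Both inputs are iterables of (timestamp, message). Timestamps must sort
    chronologically (same-format UTC ISO-8601 strings do).
    """
    logins = []
    for ts, msg in sshd_lines:
        found = ACCEPT_RE.search(msg)
        if found:
            logins.append((ts, found["fp"]))
    logins.sort()
    login_times = [ts for ts, _ in logins]

    attributed = {}
    for ts, msg in sorted(snoopy_lines):
        parsed = SNOOPY_RE.match(msg)
        if parsed is None or parsed["sid"] in attributed:
            continue  # not a snoopy line, or this sid was already attributed
        idx = bisect_right(login_times, ts)  # number of logins at or before ts
        attributed[parsed["sid"]] = logins[idx - 1][1] if idx else None
    return attributed


# Hypothetical sample lines with placeholder fingerprints (not real log output).
SSHD_SAMPLE = [
    ("2026-06-10T19:40:00Z",
     "Accepted publickey for root from 192.168.1.2 port 40000 ssh2: ED25519 SHA256:AAAA"),
    ("2026-06-10T19:55:00Z",
     "Accepted publickey for root from 192.168.1.2 port 40001 ssh2: ED25519 SHA256:BBBB"),
]
SNOOPY_SAMPLE = [
    ("2026-06-10T19:30:00Z",
     "snoopy uid=0 login=root tty=/dev/tty1 sid=900 cwd=/ : uptime"),
    ("2026-06-10T19:40:02Z",
     "snoopy uid=0 login=root tty=/dev/pts/0 sid=4242 cwd=/root : systemctl status fan-control"),
    ("2026-06-10T19:55:07Z",
     "snoopy uid=0 login=root tty=/dev/pts/1 sid=5151 cwd=/root : journalctl -u promtail"),
]

if __name__ == "__main__":
    for sid, fingerprint in attribute_sessions(SSHD_SAMPLE, SNOOPY_SAMPLE).items():
        print(sid, fingerprint)
```

With the sample data, sid 900 (seen before any login) maps to `None`, and sids 4242 and 5151 map to the first and second placeholder fingerprints. Because every session is shared-root, two overlapping SSH sessions would defeat this heuristic, so treat its output as a starting point for a manual check and not as proof of who ran a command.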