pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH

Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host (e.g. the R730
fan daemon) through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
  (full host journal; filter identifier="snoopy" for the command audit), and
  relabels sshd auth to {job="sshd-pve"}, which ACTIVATES S1 (it was PENDING
  only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.

Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail; these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 176a65d3d2
commit aac807fb3a
5 changed files with 116 additions and 5 deletions
@@ -122,9 +122,12 @@ view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP
 `10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`)
 because `loki.viktorbarzin.lan` has **no Technitium record yet** (the
 `technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins
-`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in
-Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the
-Sofia-Pi promtail can resolve it by name instead of pinning the LB IP.
+`ingress.viktorbarzin.lan`). The **PVE host** promtail (see "External host: pve"
+below) reaches Loki the same way, via an `/etc/hosts` pin
+`10.0.20.203 loki.viktorbarzin.lan`. **Follow-up (now 3 consumers — this sensor,
+rpi-sofia, the PVE host):** register `loki.viktorbarzin.lan` in Technitium as a
+CNAME → `ingress.viktorbarzin.lan` (auto-tracks Traefik LB renumbers) so all
+three resolve it by name instead of pinning the LB IP.
 
 ### External host: rpi-sofia (Sofia Raspberry Pi)
 
@@ -146,6 +149,23 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
 
 > The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.
 
+### External host: pve (Proxmox hypervisor, 192.168.1.127)
+
+`pve` is the Proxmox VE host — the hypervisor running **every** VM (pfSense, the 5 k8s nodes, the devvm, HA, Windows). It is not in the cluster. Since 2026-06-10 its **full systemd journal ships to cluster Loki**, closing a gap (the most critical host previously had no central logging) and giving the Wave-1 **S1** security rule its data source (`docs/architecture/security.md`).
+
+**Why now:** emo's Claude agent was granted **root SSH** to the host (a dedicated shared-root key `emo-pve-agent@devvm`, fingerprint `SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ`, reachable as `ssh pve` from the devvm) so he can manage the host (e.g. the R730 fan daemon) via his agent. To keep an audit trail, **snoopy** (enabled via `/etc/ld.so.preload` → `libsnoopy.so`; config `scripts/pve-snoopy.ini`) logs every `execve()` to journald under identifier `snoopy`, and promtail ships it to Loki.
+
+**Logs** — `promtail` v3.5.1 (amd64) at `/usr/local/bin/promtail`, config `scripts/pve-promtail.yaml`, unit `scripts/pve-promtail.service`. Ships `/var/log/journal` to `https://loki.viktorbarzin.lan/loki/api/v1/push` (`insecure_skip_verify`; LB-IP reached via the `/etc/hosts` pin noted above). Relabels: `unit`, `level`, `identifier`; sshd lines (`identifier=~"sshd.*"`) are re-jobbed to `sshd-pve` so the S1 rule matches. Streams:
+- `{job="pve-journal", host="pve"}` — full host journal (kernel, pvestatd, fan-control, NFS, etc.).
+- `{job="pve-journal", identifier="snoopy"}` — **command audit** (every execve: `uid login tty sid cwd cmdline`).
+- `{job="sshd-pve"}` — sshd auth; an `Accepted publickey ... SHA256:<fp>` line ties a session to a key (e.g. emo's fp above). Feeds S1.
+
+**Attribution caveat:** all SSH is shared-root, so snoopy `uid`/`login` are always `root`; attribute a command to a person by correlating its `sid`/timestamp with the matching `{job="sshd-pve"}` Accepted-publickey line (key fingerprint). emo's agent arrives SNAT'd as `192.168.1.2`, which is in the S1 allowlist, so legitimate access does not alert.
+
+Query examples (Grafana → Loki): `{host="pve"}`, `{job="pve-journal", identifier="snoopy"}` (command audit), `{job="sshd-pve"} |= "Accepted publickey"`.
+
+> Hand-managed (not Terraform), like the rpi-sofia and fan-control pieces: the promtail binary/config/unit, the snoopy enable (`/etc/ld.so.preload`), and the `/etc/hosts` Loki pin all live on the host. Source-of-truth files: `scripts/pve-promtail.{yaml,service}` + `scripts/pve-snoopy.ini`; deploy steps are in the `pve-promtail.yaml` header.
+
 ### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)
 
 The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.
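
Editor's sketch for the "External host: pve" query examples in the hunk above: the same LogQL selectors can be run outside Grafana through Loki's HTTP `query_range` endpoint. This is a minimal sketch, not part of the commit. It assumes the read API is served on the same `loki.viktorbarzin.lan` ingress that promtail pushes to, which the source does not confirm. The helper names `query_range_url` and `decode_query` are hypothetical, and the code only builds the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the promtail client URL in scripts/pve-promtail.yaml.
# ASSUMPTION: the query API is reachable on the same ingress as the push API.
LOKI_BASE = "https://loki.viktorbarzin.lan"

# LogQL selectors copied from the "Query examples" line in monitoring.md.
PVE_QUERIES = {
    "full_journal": '{host="pve"}',
    "command_audit": '{job="pve-journal", identifier="snoopy"}',
    "accepted_keys": '{job="sshd-pve"} |= "Accepted publickey"',
}


def query_range_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL; start/end are Unix epoch nanoseconds."""
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_BASE}/loki/api/v1/query_range?{urlencode(params)}"


def decode_query(url: str) -> str:
    """Recover the LogQL expression from a URL built above (round-trip check)."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    for name, logql in PVE_QUERIES.items():
        print(name, query_range_url(logql, 0, 3_600 * 10**9))
```

Because the host uses a self-signed or otherwise unverified certificate (`insecure_skip_verify` on the promtail side), any client that actually sends these requests would need an equivalent TLS setting.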
@@ -189,7 +189,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
 | W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
 | W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
 | W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
-| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
+| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7, S1). **S1 activated 2026-06-10** — promtail on the PVE host now ships the journal to Loki (`scripts/pve-promtail.yaml`); sshd auth lands as `job=sshd-pve` (the S1 data source). The same shipper carries snoopy `execve()` command audit as `{job="pve-journal", identifier="snoopy"}` (forensic, not alerting). Deployed because emo's agent was given root SSH to the host (shared key) — see `docs/architecture/monitoring.md` → "External host: pve". |
 | W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
 | W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
 | W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
@@ -205,7 +205,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
 |---|---|---|---|
 | K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
 | Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
-| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
+| PVE sshd auth log | journald (`_SYSTEMD_UNIT=ssh.service`, `SYSLOG_IDENTIFIER=sshd-session`); promtail relabels `identifier=~"sshd.*"` → `job=sshd-pve` | promtail systemd unit on Proxmox host (192.168.1.127), `scripts/pve-promtail.yaml` — **LIVE 2026-06-10** | `job=sshd-pve` |
 | Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
 
 #### Alert rules (16 total)
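
Editor's sketch for the S1 source-IP rule referenced in the two security.md hunks above. The real S1 rule lives in the Loki ruler and its definition is not shown in this commit, so the following only illustrates the idea under stated assumptions: it parses an OpenSSH `Accepted ...` line as it would appear in `{job="sshd-pve"}` and flags a source address that is not allowlisted. The allowlist here contains only `192.168.1.2`, the one entry the commit confirms; the function name, the sample port and the non-allowlisted address are hypothetical.

```python
import ipaddress
import re

# ASSUMPTION: only 192.168.1.2 is confirmed by the commit; the real S1
# allowlist may contain more entries.
S1_ALLOWLIST = [ipaddress.ip_network("192.168.1.2/32")]

# Shape of an OpenSSH successful-auth line; the key type and fingerprint
# suffix is only present for public-key logins.
ACCEPTED_RE = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+) "
    r"port (?P<port>\d+) ssh2(?:: (?P<keytype>\S+) (?P<fp>SHA256:\S+))?"
)


def classify_login(line: str):
    """Return login details plus an 'anomalous' flag, or None for other lines."""
    match = ACCEPTED_RE.search(line)
    if match is None:
        return None
    source = ipaddress.ip_address(match.group("ip"))
    allowed = any(source in network for network in S1_ALLOWLIST)
    return {
        "user": match.group("user"),
        "ip": str(source),
        "fingerprint": match.group("fp"),
        "anomalous": not allowed,
    }


if __name__ == "__main__":
    sample = (
        "Accepted publickey for root from 192.168.1.2 port 50022 ssh2: "
        "ED25519 SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ"
    )
    print(classify_login(sample))
```

The returned fingerprint is the same value the attribution caveat in monitoring.md relies on to tie a shared-root session to a specific key.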
17
scripts/pve-promtail.service
Normal file
@@ -0,0 +1,17 @@
+# systemd unit for promtail on the PVE host (192.168.1.127). Install to
+# /etc/systemd/system/promtail.service. See scripts/pve-promtail.yaml for the full deploy.
+[Unit]
+Description=Promtail (ships PVE host journal -> cluster Loki)
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
+Restart=on-failure
+RestartSec=5
+User=root
+Group=root
+
+[Install]
+WantedBy=multi-user.target
53
scripts/pve-promtail.yaml
Normal file
@@ -0,0 +1,53 @@
+# Promtail config for the PVE host (192.168.1.127) — ships the systemd journal to cluster Loki.
+#
+# NOT Terraform-managed (the PVE host is the hypervisor, outside k8s). Deployed by hand,
+# same pattern as scripts/fan-control.* and the rpi-sofia promtail. This file is source-of-truth.
+#
+# Deploy:
+# scp scripts/pve-promtail.yaml root@192.168.1.127:/etc/promtail/config.yml
+# scp scripts/pve-promtail.service root@192.168.1.127:/etc/systemd/system/promtail.service
+# ssh root@192.168.1.127 'mkdir -p /var/lib/promtail && systemctl daemon-reload && systemctl enable --now promtail'
+# # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755).
+# # Loki reach: /etc/hosts pin "10.0.20.203 loki.viktorbarzin.lan" (Traefik LB, ETP-Local).
+# # FOLLOW-UP: replace the pin with a Technitium CNAME loki.viktorbarzin.lan -> ingress.viktorbarzin.lan
+# # so it auto-tracks Traefik LB renumbers (also fixes the rpi-sofia pin — see docs/architecture/monitoring.md).
+#
+# Streams produced:
+# {job="pve-journal"} — full host journal (filter identifier="snoopy" for the command audit)
+# {job="sshd-pve"} — sshd auth lines; feeds the Loki S1 security rule (docs/architecture/security.md)
+# {job="pve-journal", identifier="snoopy"} — snoopy command audit (every execve on the host; see scripts/pve-snoopy.ini)
+server:
+  http_listen_port: 9080
+  grpc_listen_port: 0
+  log_level: warn
+
+positions:
+  filename: /var/lib/promtail/positions.yaml
+
+clients:
+  - url: https://loki.viktorbarzin.lan/loki/api/v1/push
+    tls_config:
+      insecure_skip_verify: true
+
+scrape_configs:
+  - job_name: journal
+    journal:
+      max_age: 12h
+      json: false
+      path: /var/log/journal
+      labels:
+        host: pve
+        job: pve-journal
+    relabel_configs:
+      - source_labels: ['__journal__systemd_unit']
+        target_label: unit
+      - source_labels: ['__journal_priority_keyword']
+        target_label: level
+      - source_labels: ['__journal_syslog_identifier']
+        target_label: identifier
+      # sshd auth lines (identifier sshd / sshd-session) -> job=sshd-pve so the Loki S1
+      # security rule ({job="sshd-pve"}) matches. snoopy command lines stay job=pve-journal.
+      - source_labels: ['__journal_syslog_identifier']
+        regex: 'sshd.*'
+        target_label: job
+        replacement: 'sshd-pve'
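
Editor's sketch of what the `relabel_configs` above do to one journal entry. Promtail uses Prometheus-style relabeling, where the default action is `replace` and the regex is anchored at both ends, so `sshd.*` matches `sshd` and `sshd-session` but not an identifier that merely contains `sshd`. The Python below is a simplified model of those four rules for illustration only, not promtail's implementation; the function name and the sample field values are hypothetical.

```python
import re

# Static labels from the `journal.labels` block of pve-promtail.yaml.
STATIC_LABELS = {"host": "pve", "job": "pve-journal"}

# The three plain copy rules: source journal field -> target label.
COPY_RULES = [
    ("__journal__systemd_unit", "unit"),
    ("__journal_priority_keyword", "level"),
    ("__journal_syslog_identifier", "identifier"),
]


def relabel(journal_fields: dict) -> dict:
    """Model the final label set promtail would attach to one journal entry."""
    labels = dict(STATIC_LABELS)
    for source, target in COPY_RULES:
        value = journal_fields.get(source, "")
        if value:  # an empty replacement leaves the target label unset
            labels[target] = value
    # Last rule: fullmatch mirrors the implicit ^...$ anchoring of relabel regexes.
    identifier = journal_fields.get("__journal_syslog_identifier", "")
    if re.fullmatch(r"sshd.*", identifier):
        labels["job"] = "sshd-pve"
    return labels


if __name__ == "__main__":
    print(relabel({"__journal_syslog_identifier": "sshd-session",
                   "__journal__systemd_unit": "ssh.service",
                   "__journal_priority_keyword": "info"}))
    print(relabel({"__journal_syslog_identifier": "snoopy",
                   "__journal_priority_keyword": "info"}))
```

One consequence worth noting: sshd lines leave the `pve-journal` stream entirely, so a query on `{job="pve-journal"}` alone will not return them, while `{host="pve"}` covers both jobs.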
21
scripts/pve-snoopy.ini
Normal file
@@ -0,0 +1,21 @@
+; snoopy config for the PVE host (192.168.1.127) — logs every execve() to journald.
+;
+; Install to /etc/snoopy.ini. Enable globally by adding the lib to /etc/ld.so.preload:
+; apt-get install -y snoopy
+; echo /usr/lib/x86_64-linux-gnu/libsnoopy.so > /etc/ld.so.preload # enable (no snoopy-enable in the Debian pkg)
+; # disable/rollback: truncate -s 0 /etc/ld.so.preload (or remove the line)
+;
+; output=devlog writes directly to /dev/log -> journald (identifier "snoopy").
+; DO NOT use output=syslog on a systemd host — snoopy's own docs warn it can hang the system on boot.
+;
+; Shipped to Loki by promtail as {job="pve-journal", identifier="snoopy"} (scripts/pve-promtail.yaml).
+; Attribution note: all sessions run as root (shared root key), so uid/login are always root;
+; correlate a command's sid/time with the matching {job="sshd-pve"} "Accepted publickey ... SHA256:<fp>"
+; line to attribute it to a person (e.g. emo's agent key fp SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ).
+[snoopy]
+output = devlog
+message_format = "snoopy uid=%{uid} login=%{login} tty=%{tty} sid=%{sid} cwd=%{cwd} : %{cmdline}"
+syslog_ident = snoopy
+syslog_facility = LOG_AUTHPRIV
+syslog_level = LOG_INFO
+filter_chain = ""
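
Editor's sketch for the attribution note in the header above: to correlate commands with an sshd login, the audit lines first have to be split back into fields. The parser below follows the `message_format` in this file and groups commands by `sid`, so one session's commands can then be matched by time against an `Accepted publickey` line from `{job="sshd-pve"}`. It is a minimal sketch; the function names and the sample lines are hypothetical, and it assumes the `cwd` value does not itself contain the ` : ` separator.

```python
import re
from collections import defaultdict

# Mirrors message_format in scripts/pve-snoopy.ini:
#   snoopy uid=%{uid} login=%{login} tty=%{tty} sid=%{sid} cwd=%{cwd} : %{cmdline}
SNOOPY_RE = re.compile(
    r"uid=(?P<uid>\d+) login=(?P<login>\S+) tty=(?P<tty>\S+) "
    r"sid=(?P<sid>\d+) cwd=(?P<cwd>.*?) : (?P<cmdline>.*)"
)


def parse_snoopy(line: str):
    """Split one snoopy audit line into its fields, or return None if it does not match."""
    match = SNOOPY_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["uid"] = int(record["uid"])
    record["sid"] = int(record["sid"])
    return record


def commands_by_session(lines):
    """Group command lines by session id, preserving the order they were logged in."""
    sessions = defaultdict(list)
    for line in lines:
        record = parse_snoopy(line)
        if record is not None:
            sessions[record["sid"]].append(record["cmdline"])
    return dict(sessions)


if __name__ == "__main__":
    # Hypothetical sample lines in the configured format.
    sample = [
        "snoopy uid=0 login=root tty=/dev/pts/0 sid=4242 cwd=/root : systemctl status fan-control",
        "snoopy uid=0 login=root tty=(none) sid=977 cwd=/ : /usr/bin/pvesh get /nodes",
        "snoopy uid=0 login=root tty=/dev/pts/0 sid=4242 cwd=/root : journalctl -u fan-control",
    ]
    print(commands_by_session(sample))
```

Since `uid` and `login` are always `root` on this host, `sid` plus the log timestamp are the only fields here that distinguish one SSH session from another.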