diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 8c7aa45f..602c7fa7 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -245,7 +245,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
-- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA**: whisker-backend has no operator liveness probe, so a transient CNI/DNS blip (e.g. a node reboot/upgrade) can wedge its Goldmane gRPC stream and leave the UI **empty** indefinitely (the aggregator, a separate pod, is unaffected) — the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) auto-restarts it; manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.) 
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.) 
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2). ## Storage & Backup Architecture diff --git a/cli/README.md b/cli/README.md index 186c1ee5..a35d6450 100644 --- a/cli/README.md +++ b/cli/README.md @@ -202,6 +202,21 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` and `docs/adr/0013`. +### v0.9 verbs — edges (east-west "who-talks-to-whom" trail) + +Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014): +filters render to a single safe `SELECT` (namespace values validated to the k8s +name charset) run via the dbaas primary pod — the same exec path as `k8s db`. + +| Command | Tier | What it does | +| --- | --- | --- | +| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) | +| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers | +| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) | +| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date | +| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) | +| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) | + ## Build / install Built from source to `/usr/local/bin/homelab` during devvm provisioning diff --git a/cli/VERSION b/cli/VERSION index 85f7059b..f979adec 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.8.1 +v0.9.0 diff --git a/cli/cmd_edges.go b/cli/cmd_edges.go new file mode 100644 index 00000000..7ee528fd --- /dev/null +++ b/cli/cmd_edges.go @@ -0,0 +1,69 @@ +package main + +import "fmt" + +func edgesCommands() []Command { + return []Command{ + {Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]", + Run: edgesRun}, + } +} + +// edgesRun renders the filter flags to SQL and runs it read-only against the +// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`). +func edgesRun(args []string) error { + for _, a := range args { + if a == "-h" || a == "--help" { + fmt.Print(edgesUsage()) + return nil + } + } + o, err := parseEdgesArgs(args) + if err != nil { + return fmt.Errorf("%w\n\n%s", err, edgesUsage()) + } + sql, err := buildEdgesQuery(o) + if err != nil { + return err + } + // pg-cluster-rw is a Service (not exec-able); resolve the primary POD. + pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary", + "-o", "jsonpath={.items[0].metadata.name}") + if err != nil || pod == "" { + return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err) + } + exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"} + if o.asJSON { + exec = append(exec, "-tAc", sql) // raw tuple → the JSON array + } else { + exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans + } + return kubectlStream("dbaas", exec...) 
+} + +func edgesUsage() string { + return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014) + +Usage: homelab edges [filters] + +Filters (AND-combined; namespace values are validated to the k8s name charset): + --ns NAME edges touching NAME (either direction) + --src NAME edges where source namespace = NAME + --dst NAME edges where destination namespace = NAME + --peers-of NAME distinct peer namespaces of NAME (both directions) + --new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD) + --denied only denied (action='deny') edges — blocked / lateral-movement attempts + --json output a JSON array (for agents/pipelines) + --limit N cap rows (default 200) + +Examples: + homelab edges --ns immich # everything immich talks to / is talked to by + homelab edges --peers-of authentik # authentik's peer namespaces + homelab edges --src recruiter-responder # that namespace's egress peers + homelab edges --new-since 24h # edges first seen in the last day + homelab edges --denied --json # blocked flows, machine-readable + +Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod. +` +} diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go index 94f3a482..7ae11ea0 100644 --- a/cli/cmd_memory.go +++ b/cli/cmd_memory.go @@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } for _, m := range r.Memories { - c := strings.ReplaceAll(m.Content, "\n", " ") - if len(c) > 240 { - c = c[:240] + "…" - } + c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240) fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) if m.Tags != "" { fmt.Printf(" tags: %s\n", m.Tags) @@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } +// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it +// trims. 
Counting runes (not bytes) is load-bearing: a byte slice like s[:240] +// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte +// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict +// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit +// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit +// hook error" for Cyrillic-language users. +func truncatePreview(s string, maxRunes int) string { + r := []rune(s) + if len(r) <= maxRunes { + return s + } + return string(r[:maxRunes]) + "…" +} + func memoryRecall(args []string) error { req := memRecallReq{} jsonOut := false diff --git a/cli/edges.go b/cli/edges.go new file mode 100644 index 00000000..396cc5b9 --- /dev/null +++ b/cli/edges.go @@ -0,0 +1,164 @@ +package main + +import ( + "fmt" + "regexp" + "strconv" + "strings" +) + +// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom +// investigation helper over the goldmane_edges trail; see ADR-0014). +type edgesOpts struct { + ns string // edges touching this namespace (either direction) + src string // edges where src_ns = this + dst string // edges where dst_ns = this + peersOf string // distinct peers of this namespace (both directions) + newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD) + denied bool // action = 'deny' only + asJSON bool // wrap result as a JSON array + limit int // row cap (default 200) +} + +// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a +// typo surfaces instead of silently dumping the whole table. 
+func parseEdgesArgs(args []string) (edgesOpts, error) { + o := edgesOpts{limit: 200} + i := 0 + for i < len(args) { + a := args[i] + key, inline, hasInline := a, "", false + if eq := strings.IndexByte(a, '='); eq >= 0 { + key, inline, hasInline = a[:eq], a[eq+1:], true + } + needVal := func() (string, error) { + if hasInline { + return inline, nil + } + if i+1 < len(args) { + i++ + return args[i], nil + } + return "", fmt.Errorf("flag %s needs a value", key) + } + var err error + switch key { + case "--ns": + o.ns, err = needVal() + case "--src": + o.src, err = needVal() + case "--dst": + o.dst, err = needVal() + case "--peers-of": + o.peersOf, err = needVal() + case "--new-since": + o.newSince, err = needVal() + case "--denied": + o.denied = true + case "--json": + o.asJSON = true + case "--limit": + var v string + if v, err = needVal(); err == nil { + if o.limit, err = strconv.Atoi(v); err != nil { + err = fmt.Errorf("--limit must be an integer: %q", v) + } + } + default: + return o, fmt.Errorf("unknown flag: %s", a) + } + if err != nil { + return o, err + } + i++ + } + return o, nil +} + +// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the +// injection guard — anything else is rejected rather than quoted-and-hoped. +var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`) + +func validateNS(s string) error { + if s == "" || len(s) > 63 || !nsRE.MatchString(s) { + return fmt.Errorf("invalid namespace name: %q", s) + } + return nil +} + +// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS). +func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" } + +var ( + durRE = regexp.MustCompile(`^(\d+)([smhd])$`) + dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`) +) + +// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM]) +// into a first_seen predicate. 
+func newSinceCond(v string) (string, error) { + if m := durRE.FindStringSubmatch(v); m != nil { + unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]] + return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil + } + if dateRE.MatchString(v) { + return "first_seen >= " + sqlStr(v), nil + } + return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v) +} + +// buildEdgesQuery renders the SQL for the given filters against the `edge` table. +func buildEdgesQuery(o edgesOpts) (string, error) { + limit := o.limit + if limit <= 0 { + limit = 200 + } + + // peers-of is a distinct-peer summary, a different shape from the row list. + if o.peersOf != "" { + if err := validateNS(o.peersOf); err != nil { + return "", err + } + p := sqlStr(o.peersOf) + return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+ + "SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+ + "UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+ + ") t ORDER BY peer LIMIT %d", p, p, limit), nil + } + + var conds []string + for _, f := range []struct{ val, tmpl string }{ + {o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"}, + {o.src, "src_ns = %s"}, + {o.dst, "dst_ns = %s"}, + } { + if f.val == "" { + continue + } + if err := validateNS(f.val); err != nil { + return "", err + } + conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val))) + } + if o.denied { + conds = append(conds, "action = 'deny'") + } + if o.newSince != "" { + c, err := newSinceCond(o.newSince) + if err != nil { + return "", err + } + conds = append(conds, c) + } + + q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge" + if len(conds) > 0 { + q += " WHERE " + strings.Join(conds, " AND ") + } + q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit) + + if o.asJSON { + q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t" + } + return q, nil +} diff --git 
a/cli/edges_test.go b/cli/edges_test.go new file mode 100644 index 00000000..c8ead29d --- /dev/null +++ b/cli/edges_test.go @@ -0,0 +1,163 @@ +package main + +import ( + "strings" + "testing" +) + +func TestParseEdgesArgs(t *testing.T) { + cases := []struct { + name string + args []string + want edgesOpts + }{ + {"defaults", nil, edgesOpts{limit: 200}}, + {"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}}, + {"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}}, + {"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}}, + {"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}}, + {"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}}, + {"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}}, + {"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + got, err := parseEdgesArgs(c.args) + if err != nil { + t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err) + } + if got != c.want { + t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want) + } + }) + } +} + +func TestParseEdgesArgsErrors(t *testing.T) { + for _, args := range [][]string{ + {"--limit", "abc"}, + {"--bogus"}, + } { + if _, err := parseEdgesArgs(args); err == nil { + t.Errorf("parseEdgesArgs(%v) expected error, got nil", args) + } + } +} + +func TestBuildEdgesQueryDefaults(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{limit: 200}) + if err != nil { + t.Fatal(err) + } + for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} { + if !strings.Contains(q, want) { + t.Errorf("query %q missing %q", q, want) + } + } + if strings.Contains(q, "WHERE") { + t.Errorf("no-filter query should have no WHERE: %q", q) + } +} + +func TestBuildEdgesQueryFilters(t *testing.T) { + cases := []struct { + name string 
+ o edgesOpts + want string + }{ + {"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"}, + {"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"}, + {"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"}, + {"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + q, err := buildEdgesQuery(c.o) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) { + t.Errorf("query %q missing WHERE/%q", q, c.want) + } + }) + } +} + +func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5}) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") { + t.Errorf("combined filters not AND'd: %q", q) + } +} + +func TestBuildEdgesQueryPeersOf(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100}) + if err != nil { + t.Fatal(err) + } + for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} { + if !strings.Contains(q, want) { + t.Errorf("peers-of query %q missing %q", q, want) + } + } +} + +func TestBuildEdgesQueryJSON(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200}) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") { + t.Errorf("json query missing json_agg wrapper: %q", q) + } +} + +func TestBuildEdgesQueryRejectsInjection(t *testing.T) { + for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} { + if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil { + t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad) + } + } +} + +func TestNewSinceCond(t *testing.T) { + cases := []struct { + in string + want string + }{ + {"24h", 
"first_seen >= now() - interval '24 hours'"}, + {"7d", "first_seen >= now() - interval '7 days'"}, + {"30m", "first_seen >= now() - interval '30 minutes'"}, + {"2026-06-28", "first_seen >= '2026-06-28'"}, + } + for _, c := range cases { + got, err := newSinceCond(c.in) + if err != nil { + t.Fatalf("newSinceCond(%q) error: %v", c.in, err) + } + if got != c.want { + t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want) + } + } + for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} { + if _, err := newSinceCond(bad); err == nil { + t.Errorf("newSinceCond(%q) expected error, got nil", bad) + } + } +} + +func TestValidateNS(t *testing.T) { + for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} { + if err := validateNS(ok); err != nil { + t.Errorf("validateNS(%q) unexpected error: %v", ok, err) + } + } + for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} { + if err := validateNS(bad); err == nil { + t.Errorf("validateNS(%q) expected error, got nil", bad) + } + } +} diff --git a/cli/homelab.go b/cli/homelab.go index 62c0c8aa..14b0afd4 100644 --- a/cli/homelab.go +++ b/cli/homelab.go @@ -20,6 +20,7 @@ func buildRegistry() []Command { reg = append(reg, deployCommands()...) reg = append(reg, netCommands()...) reg = append(reg, obsCommands()...) + reg = append(reg, edgesCommands()...) reg = append(reg, usageCommands()...) reg = append(reg, haCommands()...) reg = append(reg, browserCommands()...) diff --git a/cli/memory_test.go b/cli/memory_test.go index 7b14ef20..1c673c7b 100644 --- a/cli/memory_test.go +++ b/cli/memory_test.go @@ -5,8 +5,31 @@ import ( "os" "strings" "testing" + "unicode/utf8" ) +func TestTruncatePreviewKeepsValidUTF8(t *testing.T) { + // Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits + // invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must + // cut on a rune boundary and always stay valid UTF-8. 
+ long := strings.Repeat("я", 300) // 300 runes / 600 bytes + got := truncatePreview(long, 240) + if !utf8.ValidString(got) { + t.Fatalf("truncatePreview produced invalid UTF-8: %q", got) + } + if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' { + t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r)) + } + // Short multibyte strings pass through untouched (no ellipsis). + if got := truncatePreview("кратко", 240); got != "кратко" { + t.Fatalf("short string altered: %q", got) + } + // ASCII boundary still works. + if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" { + t.Fatalf("ascii truncation wrong: %q", got) + } +} + func TestResolveMemoryBase(t *testing.T) { old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md index f6a93bc3..dbf6f6d4 100644 --- a/docs/runbooks/goldmane-flow-trail.md +++ b/docs/runbooks/goldmane-flow-trail.md @@ -43,9 +43,11 @@ small no matter how much traffic flows. history. - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` (HTTP), both in `calico-system`. -- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) - restarts whisker if its backend's Goldmane stream wedges (the operator gives - whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty". +- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed + by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes + empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty"). + The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts + whisker if its backend ever wedges for another reason. 
### CNPG `goldmane_edges` — durable - Postgres DB `goldmane_edges` on the CNPG cluster @@ -151,8 +153,22 @@ on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` / ## How to query who-talks-to-whom -`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or -exec a CNPG pod). All queries are against the single `edge` table. +**Quickest — the `homelab edges` CLI** (the investigation helper; read-only +SELECT against the DB via the dbaas primary pod, no creds/SQL to remember): + +``` +homelab edges --ns <ns> # edges touching <ns> (either direction) +homelab edges --peers-of <ns> # <ns>'s distinct peer namespaces +homelab edges --src <ns> # <ns>'s egress peers (--dst for ingress) +homelab edges --new-since 24h # edges first seen in the last day (or a date) +homelab edges --denied # blocked / lateral-movement attempts +homelab edges --json [...] # machine-readable, for agents/pipelines +homelab edges --help # full flag list +``` + +For ad-hoc SQL, `psql` into the DB (creds: Vault static role +`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against +the single `edge` table. ```sql -- Everything talking to a namespace (inbound), most-active first @@ -261,23 +277,30 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` (expect a 302 to Authentik — the gate working). -**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's -gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after -a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6 -upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops -the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream -flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
-The operator ships whisker-backend with **no liveness probe**, so nothing -restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min) -auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m -*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a -real Goldmane outage). To heal immediately: -`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates -it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod** -and is unaffected — only the live UI goes blank. Confirm the diagnosis with -`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`; -the node's own DNS is usually fine (test with a throwaway pod pinned there: -`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":""}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`). +**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the +2026-06-28 incident): the operator's own `whisker` NetworkPolicy is +policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns +*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves +`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and +**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**. +Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct +kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine. +whisker-backend resolves goldmane ONCE in the brief startup window before the +policy programs, holds its long-lived gRPC stream, and only re-resolves when that +stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP +DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns +... i/o timeout` forever) and the UI goes blank. 
The durable **aggregator is a +SEPARATE pod in its own (unrestricted) namespace** and is unaffected. + +FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip` +(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns +ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so +the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts +the pod if it ever wedges for another reason. Immediate manual heal: +`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing, +from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local +10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same +query aimed at a kube-dns *pod IP* (always works). **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). diff --git a/scripts/workstation/claude-hooks/homelab-memory-recall.py b/scripts/workstation/claude-hooks/homelab-memory-recall.py index 7315f116..c9e1d1c3 100755 --- a/scripts/workstation/claude-hooks/homelab-memory-recall.py +++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py @@ -45,9 +45,15 @@ def main() -> None: try: res = subprocess.run( [homelab, "memory", "recall", prompt, "--limit", "5"], - capture_output=True, text=True, timeout=4, env=os.environ, + capture_output=True, text=True, errors="replace", timeout=4, + env=os.environ, ) - except (subprocess.TimeoutExpired, OSError): + except Exception: + # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on + # truncated multibyte (Cyrillic) output — must silently skip recall this + # turn, exactly like the MCP being unavailable. errors="replace" above + # also keeps a mid-rune-truncated payload from raising here at all. Never + # let this hook surface a "UserPromptSubmit hook error". 
return out = (res.stdout or "").strip() diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf index 3c411ecb..956534fb 100644 --- a/stacks/calico/main.tf +++ b/stacks/calico/main.tf @@ -275,20 +275,67 @@ resource "kubernetes_network_policy_v1" "whisker_allow_traefik" { } } +# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS. +# +# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own +# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows +# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But +# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP* +# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only +# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout +# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves +# fine). whisker-backend resolves once in the brief startup window before the +# policy programs, establishes its long-lived gRPC stream, and only re-resolves +# when that stream breaks — at which point the blocked ClusterIP DNS wedges its +# Go resolver and the UI goes empty (the durable aggregator, in its own +# unrestricted namespace, is unaffected). k8s egress policies are additive, so +# this ORs in an allow for the ClusterIP; the operator NP is left untouched. +# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to +# 100% ok.) See docs/runbooks/goldmane-flow-trail.md. +resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" { + metadata { + name = "whisker-allow-dns-clusterip" + namespace = "calico-system" + } + spec { + pod_selector { + match_labels = { + "app.kubernetes.io/name" = "whisker" + } + } + policy_types = ["Egress"] + egress { + # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR + # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin). 
+ to { + ip_block { + cidr = "10.96.0.10/32" + } + } + ports { + port = "53" + protocol = "UDP" + } + ports { + port = "53" + protocol = "TCP" + } + } + } +} + # --------------------------------------------------------------------------- # Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident). # -# FAILURE MODE: whisker-backend dials goldmane:7443 over a long-lived gRPC -# stream. When that stream drops during a transient CNI/DNS blip (observed -# 2026-06-28 right after k8s-node5's v1.35.6 upgrade settled — the pod's -# resolver started timing out on the kube-dns ClusterIP), the Go client's -# resolver gets WEDGED: it spams `failed to stream flows` / -# `code = Unavailable: dns ... i/o timeout` forever and never reconnects, so -# the Whisker UI shows EMPTY while the durable aggregator (a separate pod, same -# Goldmane source) is unaffected. The operator ships whisker-backend with NO -# liveness/readiness probe, so nothing restarts it — it sat broken until a -# manual `kubectl delete pod`. Whisker is operator-managed (Whisker CR), so we -# can't inject a probe; this watchdog is the supported-pattern alternative. +# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip +# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as +# defense-in-depth: whisker-backend has NO operator liveness probe, so if its +# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go +# resolver spams `failed to stream flows` / `code = Unavailable` and never +# reconnects -> empty UI, while the durable aggregator in its own namespace is +# unaffected), nothing else would restart it. Whisker is operator-managed +# (Whisker CR) so we can't inject a probe; this is the supported-pattern +# alternative. With the DNS fix in place it should rarely, if ever, fire. # # It restarts the pod ONLY when the wedged signature is present AND Goldmane is # Ready (so a real Goldmane outage doesn't cause restart-thrash). 
A fresh pod