Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful

This commit is contained in:
Viktor Barzin 2026-06-28 09:55:25 +00:00
commit b3c419e108
12 changed files with 564 additions and 41 deletions

View file

@ -245,7 +245,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`). - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA**: whisker-backend has no operator liveness probe, so a transient CNI/DNS blip (e.g. a node reboot/upgrade) can wedge its Goldmane gRPC stream and leave the UI **empty** indefinitely (the aggregator, a separate pod, is unaffected) — the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) auto-restarts it; manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.) - **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2). - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
## Storage & Backup Architecture ## Storage & Backup Architecture

View file

@ -202,6 +202,21 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`. and `docs/adr/0013`.
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
filters render to a single safe `SELECT` (namespace values validated to the k8s
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
| Command | Tier | What it does |
| --- | --- | --- |
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
## Build / install ## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning Built from source to `/usr/local/bin/homelab` during devvm provisioning

View file

@ -1 +1 @@
v0.8.1 v0.9.0

69
cli/cmd_edges.go Normal file
View file

@ -0,0 +1,69 @@
package main
import "fmt"
func edgesCommands() []Command {
return []Command{
{Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
Run: edgesRun},
}
}
// edgesRun renders the filter flags to SQL and runs it read-only against the
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
func edgesRun(args []string) error {
for _, a := range args {
if a == "-h" || a == "--help" {
fmt.Print(edgesUsage())
return nil
}
}
o, err := parseEdgesArgs(args)
if err != nil {
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
}
sql, err := buildEdgesQuery(o)
if err != nil {
return err
}
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
"-o", "jsonpath={.items[0].metadata.name}")
if err != nil || pod == "" {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
}
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
if o.asJSON {
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
} else {
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
}
return kubectlStream("dbaas", exec...)
}
func edgesUsage() string {
return `homelab edges query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
Usage: homelab edges [filters]
Filters (AND-combined; namespace values are validated to the k8s name charset):
--ns NAME edges touching NAME (either direction)
--src NAME edges where source namespace = NAME
--dst NAME edges where destination namespace = NAME
--peers-of NAME distinct peer namespaces of NAME (both directions)
--new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
--denied only denied (action='deny') edges blocked / lateral-movement attempts
--json output a JSON array (for agents/pipelines)
--limit N cap rows (default 200)
Examples:
homelab edges --ns immich # everything immich talks to / is talked to by
homelab edges --peers-of authentik # authentik's peer namespaces
homelab edges --src recruiter-responder # that namespace's egress peers
homelab edges --new-since 24h # edges first seen in the last day
homelab edges --denied --json # blocked flows, machine-readable
Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
`
}

View file

@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error {
return nil return nil
} }
for _, m := range r.Memories { for _, m := range r.Memories {
c := strings.ReplaceAll(m.Content, "\n", " ") c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
if len(c) > 240 {
c = c[:240] + "…"
}
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" { if m.Tags != "" {
fmt.Printf(" tags: %s\n", m.Tags) fmt.Printf(" tags: %s\n", m.Tags)
@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error {
return nil return nil
} }
// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
// hook error" for Cyrillic-language users.
func truncatePreview(s string, maxRunes int) string {
r := []rune(s)
if len(r) <= maxRunes {
return s
}
return string(r[:maxRunes]) + "…"
}
func memoryRecall(args []string) error { func memoryRecall(args []string) error {
req := memRecallReq{} req := memRecallReq{}
jsonOut := false jsonOut := false

164
cli/edges.go Normal file
View file

@ -0,0 +1,164 @@
package main
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
ns string // edges touching this namespace (either direction)
src string // edges where src_ns = this
dst string // edges where dst_ns = this
peersOf string // distinct peers of this namespace (both directions)
newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
denied bool // action = 'deny' only
asJSON bool // wrap result as a JSON array
limit int // row cap (default 200)
}
// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
o := edgesOpts{limit: 200}
i := 0
for i < len(args) {
a := args[i]
key, inline, hasInline := a, "", false
if eq := strings.IndexByte(a, '='); eq >= 0 {
key, inline, hasInline = a[:eq], a[eq+1:], true
}
needVal := func() (string, error) {
if hasInline {
return inline, nil
}
if i+1 < len(args) {
i++
return args[i], nil
}
return "", fmt.Errorf("flag %s needs a value", key)
}
var err error
switch key {
case "--ns":
o.ns, err = needVal()
case "--src":
o.src, err = needVal()
case "--dst":
o.dst, err = needVal()
case "--peers-of":
o.peersOf, err = needVal()
case "--new-since":
o.newSince, err = needVal()
case "--denied":
o.denied = true
case "--json":
o.asJSON = true
case "--limit":
var v string
if v, err = needVal(); err == nil {
if o.limit, err = strconv.Atoi(v); err != nil {
err = fmt.Errorf("--limit must be an integer: %q", v)
}
}
default:
return o, fmt.Errorf("unknown flag: %s", a)
}
if err != nil {
return o, err
}
i++
}
return o, nil
}
// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard — anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
func validateNS(s string) error {
if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
return fmt.Errorf("invalid namespace name: %q", s)
}
return nil
}
// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
var (
durRE = regexp.MustCompile(`^(\d+)([smhd])$`)
dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)
// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
if m := durRE.FindStringSubmatch(v); m != nil {
unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
}
if dateRE.MatchString(v) {
return "first_seen >= " + sqlStr(v), nil
}
return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}
// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
limit := o.limit
if limit <= 0 {
limit = 200
}
// peers-of is a distinct-peer summary, a different shape from the row list.
if o.peersOf != "" {
if err := validateNS(o.peersOf); err != nil {
return "", err
}
p := sqlStr(o.peersOf)
return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
") t ORDER BY peer LIMIT %d", p, p, limit), nil
}
var conds []string
for _, f := range []struct{ val, tmpl string }{
{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
{o.src, "src_ns = %s"},
{o.dst, "dst_ns = %s"},
} {
if f.val == "" {
continue
}
if err := validateNS(f.val); err != nil {
return "", err
}
conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
}
if o.denied {
conds = append(conds, "action = 'deny'")
}
if o.newSince != "" {
c, err := newSinceCond(o.newSince)
if err != nil {
return "", err
}
conds = append(conds, c)
}
q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
if len(conds) > 0 {
q += " WHERE " + strings.Join(conds, " AND ")
}
q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
if o.asJSON {
q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
}
return q, nil
}

163
cli/edges_test.go Normal file
View file

@ -0,0 +1,163 @@
package main
import (
"strings"
"testing"
)
func TestParseEdgesArgs(t *testing.T) {
cases := []struct {
name string
args []string
want edgesOpts
}{
{"defaults", nil, edgesOpts{limit: 200}},
{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, err := parseEdgesArgs(c.args)
if err != nil {
t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
}
if got != c.want {
t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
}
})
}
}
func TestParseEdgesArgsErrors(t *testing.T) {
for _, args := range [][]string{
{"--limit", "abc"},
{"--bogus"},
} {
if _, err := parseEdgesArgs(args); err == nil {
t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
}
}
}
func TestBuildEdgesQueryDefaults(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{limit: 200})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
if !strings.Contains(q, want) {
t.Errorf("query %q missing %q", q, want)
}
}
if strings.Contains(q, "WHERE") {
t.Errorf("no-filter query should have no WHERE: %q", q)
}
}
func TestBuildEdgesQueryFilters(t *testing.T) {
cases := []struct {
name string
o edgesOpts
want string
}{
{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
q, err := buildEdgesQuery(c.o)
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
t.Errorf("query %q missing WHERE/%q", q, c.want)
}
})
}
}
func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
t.Errorf("combined filters not AND'd: %q", q)
}
}
func TestBuildEdgesQueryPeersOf(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
if !strings.Contains(q, want) {
t.Errorf("peers-of query %q missing %q", q, want)
}
}
}
func TestBuildEdgesQueryJSON(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
t.Errorf("json query missing json_agg wrapper: %q", q)
}
}
func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
}
}
}
func TestNewSinceCond(t *testing.T) {
cases := []struct {
in string
want string
}{
{"24h", "first_seen >= now() - interval '24 hours'"},
{"7d", "first_seen >= now() - interval '7 days'"},
{"30m", "first_seen >= now() - interval '30 minutes'"},
{"2026-06-28", "first_seen >= '2026-06-28'"},
}
for _, c := range cases {
got, err := newSinceCond(c.in)
if err != nil {
t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
}
if got != c.want {
t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
}
}
for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
if _, err := newSinceCond(bad); err == nil {
t.Errorf("newSinceCond(%q) expected error, got nil", bad)
}
}
}
func TestValidateNS(t *testing.T) {
for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
if err := validateNS(ok); err != nil {
t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
}
}
for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
if err := validateNS(bad); err == nil {
t.Errorf("validateNS(%q) expected error, got nil", bad)
}
}
}

View file

@ -20,6 +20,7 @@ func buildRegistry() []Command {
reg = append(reg, deployCommands()...) reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...) reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...) reg = append(reg, obsCommands()...)
reg = append(reg, edgesCommands()...)
reg = append(reg, usageCommands()...) reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...) reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...) reg = append(reg, browserCommands()...)

View file

@ -5,8 +5,31 @@ import (
"os" "os"
"strings" "strings"
"testing" "testing"
"unicode/utf8"
) )
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
// cut on a rune boundary and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
if !utf8.ValidString(got) {
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
}
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
}
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
}
func TestResolveMemoryBase(t *testing.T) { func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()

View file

@ -43,9 +43,11 @@ small no matter how much traffic flows.
history. history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`. (HTTP), both in `calico-system`.
- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) - **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
restarts whisker if its backend's Goldmane stream wedges (the operator gives by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty". empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
whisker if its backend ever wedges for another reason.
### CNPG `goldmane_edges` — durable ### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster - Postgres DB `goldmane_edges` on the CNPG cluster
@ -151,8 +153,22 @@ on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
## How to query who-talks-to-whom ## How to query who-talks-to-whom
`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or **Quickest — the `homelab edges` CLI** (the investigation helper; read-only
exec a CNPG pod). All queries are against the single `edge` table. SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
```
homelab edges --ns <ns> # edges touching <ns> (either direction)
homelab edges --peers-of <ns> # <ns>'s distinct peer namespaces
homelab edges --src <ns> # <ns>'s egress peers (--dst <ns> for ingress)
homelab edges --new-since 24h # edges first seen in the last day (or a date)
homelab edges --denied # blocked / lateral-movement attempts
homelab edges --json [...] # machine-readable, for agents/pipelines
homelab edges --help # full flag list
```
For ad-hoc SQL, `psql` into the DB (creds: Vault static role
`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
the single `edge` table.
```sql ```sql
-- Everything talking to a namespace (inbound), most-active first -- Everything talking to a namespace (inbound), most-active first
@ -261,23 +277,30 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working). (expect a 302 to Authentik — the gate working).
**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's **Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after 2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6 policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops *pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream `goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects. **Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
The operator ships whisker-backend with **no liveness probe**, so nothing Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min) kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m whisker-backend resolves goldmane ONCE in the brief startup window before the
*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a policy programs, holds its long-lived gRPC stream, and only re-resolves when that
real Goldmane outage). To heal immediately: stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod** ... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
and is unaffected — only the live UI goes blank. Confirm the diagnosis with SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
the node's own DNS is usually fine (test with a throwaway pod pinned there: FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`). (`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
the pod if it ever wedges for another reason. Immediate manual heal:
`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
query aimed at a kube-dns *pod IP* (always works).
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).

View file

@ -45,9 +45,15 @@ def main() -> None:
try: try:
res = subprocess.run( res = subprocess.run(
[homelab, "memory", "recall", prompt, "--limit", "5"], [homelab, "memory", "recall", prompt, "--limit", "5"],
capture_output=True, text=True, timeout=4, env=os.environ, capture_output=True, text=True, errors="replace", timeout=4,
env=os.environ,
) )
except (subprocess.TimeoutExpired, OSError): except Exception:
# Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
# truncated multibyte (Cyrillic) output — must silently skip recall this
# turn, exactly like the MCP being unavailable. errors="replace" above
# also keeps a mid-rune-truncated payload from raising here at all. Never
# let this hook surface a "UserPromptSubmit hook error".
return return
out = (res.stdout or "").strip() out = (res.stdout or "").strip()

View file

@ -275,20 +275,67 @@ resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
} }
} }
# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
#
# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
# fine). whisker-backend resolves once in the brief startup window before the
# policy programs, establishes its long-lived gRPC stream, and only re-resolves
# when that stream breaks at which point the blocked ClusterIP DNS wedges its
# Go resolver and the UI goes empty (the durable aggregator, in its own
# unrestricted namespace, is unaffected). k8s egress policies are additive, so
# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
metadata {
name = "whisker-allow-dns-clusterip"
namespace = "calico-system"
}
spec {
pod_selector {
match_labels = {
"app.kubernetes.io/name" = "whisker"
}
}
policy_types = ["Egress"]
egress {
# 10.96.0.10 is the kube-dns ClusterIP (cluster invariant service CIDR
# 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
to {
ip_block {
cidr = "10.96.0.10/32"
}
}
ports {
port = "53"
protocol = "UDP"
}
ports {
port = "53"
protocol = "TCP"
}
}
}
}
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident). # Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
# #
# FAILURE MODE: whisker-backend dials goldmane:7443 over a long-lived gRPC # BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
# stream. When that stream drops during a transient CNI/DNS blip (observed # above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
# 2026-06-28 right after k8s-node5's v1.35.6 upgrade settled the pod's # defense-in-depth: whisker-backend has NO operator liveness probe, so if its
# resolver started timing out on the kube-dns ClusterIP), the Go client's # long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
# resolver gets WEDGED: it spams `failed to stream flows` / # resolver spams `failed to stream flows` / `code = Unavailable` and never
# `code = Unavailable: dns ... i/o timeout` forever and never reconnects, so # reconnects -> empty UI, while the durable aggregator in its own namespace is
# the Whisker UI shows EMPTY while the durable aggregator (a separate pod, same # unaffected), nothing else would restart it. Whisker is operator-managed
# Goldmane source) is unaffected. The operator ships whisker-backend with NO # (Whisker CR) so we can't inject a probe; this is the supported-pattern
# liveness/readiness probe, so nothing restarts it it sat broken until a # alternative. With the DNS fix in place it should rarely, if ever, fire.
# manual `kubectl delete pod`. Whisker is operator-managed (Whisker CR), so we
# can't inject a probe; this watchdog is the supported-pattern alternative.
# #
# It restarts the pod ONLY when the wedged signature is present AND Goldmane is # It restarts the pod ONLY when the wedged signature is present AND Goldmane is
# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod # Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod