diff --git a/AGENTS.md b/AGENTS.md
index 012e85dc..5e30bd9e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```
 
 ## Common Operations
-- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check [path]` (external-CF vs internal-LB reachability), `dns lookup ` (Technitium vs public diff), `metrics query ""` / `metrics alerts` (Prometheus via LB), `logs query "" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`.
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check [path]` (external-CF vs internal-LB reachability), `dns lookup ` (Technitium vs public diff), `metrics query ""` / `metrics alerts` (Prometheus via LB), `logs query "" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf.
diff --git a/cli/README.md b/cli/README.md
index ab43f0f6..e21da6d2 100644
--- a/cli/README.md
+++ b/cli/README.md
@@ -131,6 +131,22 @@ Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forwa
 no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
 firing set is reachable via `ALERTS` instead.)
 
+### v0.6 — usage telemetry (`usage top`)
+
+Makes "which verbs are actually used, by everyone" a query instead of a guess —
+so adding the *next* verb is evidence-driven, not shaped by one person's habits.
+
+Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
+labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
+flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
+affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
+the shared Loki, aggregate usage is queryable **without reading anyone's home** —
+the privacy-preserving answer to "what does the team use."
+
+| Command | Tier | What it does |
+|---|---|---|
+| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
+
 ## Build / install
 
 Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -150,4 +166,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.
 
 ## Design
 
-See `infra/docs/adr/0004`–`0010` for the architecture decisions.
+See `infra/docs/adr/0004`–`0011` for the architecture decisions.
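Note on the `usage top` query documented in this README hunk: `usageQuery` in this patch interpolates the `--user` value into the LogQL selector unescaped, so a value containing `"` would produce a malformed (or differently scoped) selector. A minimal sketch of a quoted variant follows. It assumes LogQL string literals accept Go-style escapes, which is what `strconv.Quote` emits; `safeUsageQuery` is a hypothetical name and is not part of this patch.

```go
package main

import (
	"fmt"
	"strconv"
)

// usageJob mirrors the constant defined in cli/telemetry.go in this patch.
const usageJob = "homelab-usage"

// safeUsageQuery is a hypothetical variant of usageQuery. It quotes label
// values with strconv.Quote so that a quote or backslash in the user value
// cannot terminate the label matcher early.
// Assumption: LogQL accepts Go-style escaped string literals.
func safeUsageQuery(since, user string) string {
	sel := "job=" + strconv.Quote(usageJob)
	if user != "" {
		sel += ", user=" + strconv.Quote(user)
	}
	return fmt.Sprintf("sum by (verb) (count_over_time({%s}[%s]))", sel, since)
}

func main() {
	// A plain user name yields the same query text the patch's test expects.
	fmt.Println(safeUsageQuery("7d", "emo"))
	// A hostile value stays inside the quoted label value.
	fmt.Println(safeUsageQuery("30d", `x"} or {job="other`))
}
```

The `since` value is still passed through unvalidated in this sketch; checking it against a duration pattern would be a separate step.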
diff --git a/cli/VERSION b/cli/VERSION
index b043aa64..60f63432 100644
--- a/cli/VERSION
+++ b/cli/VERSION
@@ -1 +1 @@
-v0.5.0
+v0.6.0
diff --git a/cli/cmd_usage.go b/cli/cmd_usage.go
new file mode 100644
index 00000000..e9b7fa8e
--- /dev/null
+++ b/cli/cmd_usage.go
@@ -0,0 +1,77 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+)
+
+func usageCommands() []Command {
+	return []Command{
+		{Path: []string{"usage", "top"}, Tier: TierRead,
+			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
+	}
+}
+
+// usageQuery builds the LogQL metric query that counts invocations per verb.
+func usageQuery(since, user string) string {
+	sel := `job="` + usageJob + `"`
+	if user != "" {
+		sel += `, user="` + user + `"`
+	}
+	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
+}
+
+func usageTop(args []string) error {
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "30d"
+	}
+	v := url.Values{}
+	v.Set("query", usageQuery(since, flagValue(args, "--user")))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	type row struct {
+		verb string
+		n    int
+	}
+	var rows []row
+	for _, s := range r.Data.Result {
+		n := 0
+		if len(s.Value) == 2 {
+			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
+				n = int(f)
+			}
+		}
+		rows = append(rows, row{s.Metric["verb"], n})
+	}
+	if len(rows) == 0 {
+		fmt.Println("(no usage recorded yet)")
+		return nil
+	}
+	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
+	for _, r := range rows {
+		fmt.Printf("%6d %s\n", r.n, r.verb)
+	}
+	return nil
+}
diff --git a/cli/command.go b/cli/command.go
index fd7f4812..55449788 100644
--- a/cli/command.go
+++ b/cli/command.go
@@ -50,7 +50,10 @@ func dispatch(reg []Command, args []string) error {
 	if best < 0 {
 		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
 	}
-	return reg[best].Run(args[bestLen:])
+	matched := reg[best]
+	runErr := matched.Run(args[bestLen:])
+	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
+	return runErr
 }
 
 // name is the space-joined verb path, e.g. "tf plan".
diff --git a/cli/homelab.go b/cli/homelab.go
index 23a7f776..350b081f 100644
--- a/cli/homelab.go
+++ b/cli/homelab.go
@@ -20,6 +20,7 @@ func buildRegistry() []Command {
 	reg = append(reg, deployCommands()...)
 	reg = append(reg, netCommands()...)
 	reg = append(reg, obsCommands()...)
+	reg = append(reg, usageCommands()...)
 	return reg
 }
 
diff --git a/cli/telemetry.go b/cli/telemetry.go
new file mode 100644
index 00000000..b0bb625a
--- /dev/null
+++ b/cli/telemetry.go
@@ -0,0 +1,62 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+)
+
+// usageJob is the Loki stream job label for homelab usage telemetry.
+const usageJob = "homelab-usage"
+
+// emitUsage best-effort records one verb invocation to Loki for cross-user
+// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
+// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
+// never affect the command: all errors are swallowed and a tight timeout bounds
+// the cost. Opt out with HOMELAB_TELEMETRY=0.
+func emitUsage(verb string, runErr error) {
+	switch os.Getenv("HOMELAB_TELEMETRY") {
+	case "0", "off", "false", "no":
+		return
+	}
+	if verb == "" || strings.HasPrefix(verb, "usage") {
+		return // don't self-record the analytics reader
+	}
+	exit := 0
+	if runErr != nil {
+		exit = 1
+	}
+	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
+		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
+		Values: [][2]string{{
+			strconv.FormatInt(time.Now().UnixNano(), 10),
+			"exit=" + strconv.Itoa(exit) + " ver=" + version,
+		}},
+	}}})
+	if err != nil {
+		return
+	}
+	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
+	if err != nil {
+		return
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
+	if err != nil {
+		return
+	}
+	resp.Body.Close()
+}
+
+type lokiPush struct {
+	Streams []lokiStream `json:"streams"`
+}
+
+type lokiStream struct {
+	Stream map[string]string `json:"stream"`
+	Values [][2]string       `json:"values"`
+}
diff --git a/cli/usage_test.go b/cli/usage_test.go
new file mode 100644
index 00000000..052e080c
--- /dev/null
+++ b/cli/usage_test.go
@@ -0,0 +1,18 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestUsageQuery(t *testing.T) {
+	got := usageQuery("30d", "")
+	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
+	if got != want {
+		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
+	}
+	withUser := usageQuery("7d", "emo")
+	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
+		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
+	}
+}
diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md
new file mode 100644
index 00000000..c383211b
--- /dev/null
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@@ -0,0 +1,34 @@
+# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
+
+v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
+exists to answer the question that drove the whole CLI — *which verbs are worth
+adding next* — with data instead of one maintainer's habits (the earlier mining
+covered a single user's ~51k commands, so the surface is shaped to that user).
+
+## Decisions
+
+- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
+  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
+  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
+  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
+  the analytics reader doesn't pollute its own data.
+- **Payload is deliberately minimal: verb path + exit code only.** Labels
+  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
+  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
+  emit sees only the matched verb name, not the arguments. This is what makes
+  cross-user aggregation safe.
+- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
+  CLI writes its own invocations (attributed to its OS user) to the shared Loki
+  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
+  back with a LogQL metric query. This is the privacy-preserving resolution to
+  "what does everyone (e.g. another user) use" — it never touches anyone's
+  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
+  managed-settings; reading another user's home is off-limits even for an owner
+  in-session — a fresh session under changed MDM policy is the only legitimate
+  path, and even then this telemetry is the better answer).
+- **Best-effort, never affects the command.** All errors swallowed; an 800ms
+  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
+  must never slow or break the tool it measures.
+- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
+  path (same host, same LB dial). Presence MySQL was the alternative (queryable
+  SQL) but would add a write dependency and creds; Loki needs neither.