homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).
- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
"exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
auth. ADR docs/adr/0011.
Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent
666fefd22b
commit
3e3fdb34f0
9 changed files with 215 additions and 4 deletions
@@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```
 
 ## Common Operations
-- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`.
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
@@ -131,6 +131,22 @@ Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forwa
 no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
 firing set is reachable via `ALERTS` instead.)
 
+### v0.6 — usage telemetry (`usage top`)
+
+Makes "which verbs are actually used, by everyone" a query instead of a guess —
+so adding the *next* verb is evidence-driven, not shaped by one person's habits.
+
+Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
+labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
+flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
+affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
+the shared Loki, aggregate usage is queryable **without reading anyone's home** —
+the privacy-preserving answer to "what does the team use."
+
+| Command | Tier | What it does |
+|---|---|---|
+| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
+
 ## Build / install
 
 Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -150,4 +166,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.
 
 ## Design
 
-See `infra/docs/adr/0004`–`0010` for the architecture decisions.
+See `infra/docs/adr/0004`–`0011` for the architecture decisions.
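The `usage top` row in the README hunk above reduces to one LogQL metric query. The following is a minimal sketch of how such a query string could be assembled, not the commit's code: `buildUsageQuery` is a hypothetical helper, and quoting the `--user` value with `strconv.Quote` is an assumption added here so that a stray double quote cannot break out of the label matcher (the commit's own `usageQuery` concatenates the value directly).

```go
package main

import (
	"fmt"
	"strconv"
)

// usageJob matches the job label named in the README table.
const usageJob = "homelab-usage"

// buildUsageQuery is a hypothetical helper for illustration. It returns a
// LogQL metric query that counts log lines per verb over the given range,
// optionally narrowed to a single user.
func buildUsageQuery(since, user string) string {
	sel := "job=" + strconv.Quote(usageJob)
	if user != "" {
		// Assumption: Go-style quoting is acceptable for a LogQL label value.
		sel += ", user=" + strconv.Quote(user)
	}
	return fmt.Sprintf("sum by (verb) (count_over_time({%s}[%s]))", sel, since)
}

func main() {
	fmt.Println(buildUsageQuery("30d", ""))
	fmt.Println(buildUsageQuery("7d", "emo"))
}
```

For ordinary user names the result is identical to plain concatenation, so the choice only matters for values containing quotes or backslashes.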
@@ -1 +1 @@
-v0.5.0
+v0.6.0
77
cli/cmd_usage.go
Normal file
@@ -0,0 +1,77 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+)
+
+func usageCommands() []Command {
+	return []Command{
+		{Path: []string{"usage", "top"}, Tier: TierRead,
+			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
+	}
+}
+
+// usageQuery builds the LogQL metric query that counts invocations per verb.
+func usageQuery(since, user string) string {
+	sel := `job="` + usageJob + `"`
+	if user != "" {
+		sel += `, user="` + user + `"`
+	}
+	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
+}
+
+func usageTop(args []string) error {
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "30d"
+	}
+	v := url.Values{}
+	v.Set("query", usageQuery(since, flagValue(args, "--user")))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	type row struct {
+		verb string
+		n    int
+	}
+	var rows []row
+	for _, s := range r.Data.Result {
+		n := 0
+		if len(s.Value) == 2 {
+			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
+				n = int(f)
+			}
+		}
+		rows = append(rows, row{s.Metric["verb"], n})
+	}
+	if len(rows) == 0 {
+		fmt.Println("(no usage recorded yet)")
+		return nil
+	}
+	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
+	for _, r := range rows {
+		fmt.Printf("%6d %s\n", r.n, r.verb)
+	}
+	return nil
+}
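To make the decode-and-rank step in `usageTop` easier to follow in isolation, here is a minimal standalone sketch. It assumes the instant-query response carries a vector whose entries look like `{"metric": {...}, "value": [timestamp, "count"]}`, which is the shape the struct in the file above expects. `rankVerbs`, `verbCount`, and the `sample` payload are hypothetical names and invented data for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// verbCount is one ranked row: a verb path and its invocation count.
type verbCount struct {
	Verb string
	N    int
}

// rankVerbs (hypothetical) decodes a vector result and sorts verbs by
// descending count. Entries with an unparsable value are kept with N == 0.
func rankVerbs(body []byte) ([]verbCount, error) {
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [timestamp, "count"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	rows := make([]verbCount, 0, len(r.Data.Result))
	for _, s := range r.Data.Result {
		n := 0
		if len(s.Value) == 2 {
			if str, ok := s.Value[1].(string); ok {
				if f, err := strconv.ParseFloat(str, 64); err == nil {
					n = int(f)
				}
			}
		}
		rows = append(rows, verbCount{s.Metric["verb"], n})
	}
	// Stable sort keeps the response order for verbs with equal counts.
	sort.SliceStable(rows, func(i, j int) bool { return rows[i].N > rows[j].N })
	return rows, nil
}

// sample is invented data, not output captured from a real Loki.
const sample = `{"status":"success","data":{"resultType":"vector","result":[
 {"metric":{"verb":"tf plan"},"value":[1700000000,"3"]},
 {"metric":{"verb":"metrics query"},"value":[1700000000,"2"]},
 {"metric":{"verb":"k8s status"},"value":[1700000000,"7"]}]}}`

func main() {
	rows, err := rankVerbs([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%6d  %s\n", r.N, r.Verb)
	}
}
```

Unlike the committed code, the sketch returns the decode error to the caller instead of printing the raw body; either behaviour is defensible for a read-only command.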
@@ -50,7 +50,10 @@ func dispatch(reg []Command, args []string) error {
 	if best < 0 {
 		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
 	}
-	return reg[best].Run(args[bestLen:])
+	matched := reg[best]
+	runErr := matched.Run(args[bestLen:])
+	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
+	return runErr
 }
 
 // name is the space-joined verb path, e.g. "tf plan".
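The hunk above shows only the tail of `dispatch`. As a rough illustration of why the telemetry hook cannot see arguments, the sketch below reconstructs the overall pattern under stated assumptions: the `Command` type is simplified, the longest-prefix loop is a guess at what precedes the hunk, and `record` is a hypothetical stand-in for `emitUsage` that stores what it was given instead of sending it anywhere.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Command is a simplified stand-in for the CLI's registry entry.
type Command struct {
	Path []string
	Run  func(args []string) error
}

// recorded collects what the hook received (demo only).
var recorded []string

// record stands in for emitUsage: it is handed the verb path and the run
// error, and nothing else.
func record(verb string, runErr error) {
	exit := 0
	if runErr != nil {
		exit = 1
	}
	recorded = append(recorded, fmt.Sprintf("verb=%q exit=%d", verb, exit))
}

// dispatch picks the command with the longest matching path prefix, runs it
// with the remaining args, then records the verb and outcome.
func dispatch(reg []Command, args []string) error {
	best, bestLen := -1, 0
	for i, c := range reg {
		if len(c.Path) > len(args) || len(c.Path) <= bestLen {
			continue
		}
		match := true
		for j, p := range c.Path {
			if args[j] != p {
				match = false
				break
			}
		}
		if match {
			best, bestLen = i, len(c.Path)
		}
	}
	if best < 0 {
		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
	}
	matched := reg[best]
	runErr := matched.Run(args[bestLen:])
	record(strings.Join(matched.Path, " "), runErr)
	return runErr
}

func main() {
	reg := []Command{
		{Path: []string{"tf", "plan"}, Run: func([]string) error { return nil }},
		{Path: []string{"k8s", "db"}, Run: func([]string) error { return errors.New("query failed") }},
	}
	_ = dispatch(reg, []string{"tf", "plan", "prod", "--token=s3cr3t"})
	_ = dispatch(reg, []string{"k8s", "db", "app", "--", "SELECT 1"})
	for _, line := range recorded {
		fmt.Println(line)
	}
}
```

Because the hook's signature takes only the verb name and the error, leaking an argument would require changing the call site, which is easy to catch in review.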
@@ -20,6 +20,7 @@ func buildRegistry() []Command {
 	reg = append(reg, deployCommands()...)
 	reg = append(reg, netCommands()...)
 	reg = append(reg, obsCommands()...)
+	reg = append(reg, usageCommands()...)
 	return reg
 }
 
62
cli/telemetry.go
Normal file
@@ -0,0 +1,62 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+)
+
+// usageJob is the Loki stream job label for homelab usage telemetry.
+const usageJob = "homelab-usage"
+
+// emitUsage best-effort records one verb invocation to Loki for cross-user
+// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
+// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
+// never affect the command: all errors are swallowed and a tight timeout bounds
+// the cost. Opt out with HOMELAB_TELEMETRY=0.
+func emitUsage(verb string, runErr error) {
+	switch os.Getenv("HOMELAB_TELEMETRY") {
+	case "0", "off", "false", "no":
+		return
+	}
+	if verb == "" || strings.HasPrefix(verb, "usage") {
+		return // don't self-record the analytics reader
+	}
+	exit := 0
+	if runErr != nil {
+		exit = 1
+	}
+	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
+		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
+		Values: [][2]string{{
+			strconv.FormatInt(time.Now().UnixNano(), 10),
+			"exit=" + strconv.Itoa(exit) + " ver=" + version,
+		}},
+	}}})
+	if err != nil {
+		return
+	}
+	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
+	if err != nil {
+		return
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
+	if err != nil {
+		return
+	}
+	resp.Body.Close()
+}
+
+type lokiPush struct {
+	Streams []lokiStream `json:"streams"`
+}
+
+type lokiStream struct {
+	Stream map[string]string `json:"stream"`
+	Values [][2]string       `json:"values"`
+}
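For readers who want to see the wire format that `emitUsage` produces, the sketch below serializes one invocation using the same `lokiPush`/`lokiStream` shapes. `buildPush` and `telemetryEnabled` are hypothetical helpers introduced here to make the two pieces testable on their own; the user name, version string, and timestamp in `main` are placeholder inputs, not values captured from a real run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

// telemetryEnabled mirrors the opt-out values accepted by emitUsage:
// "0", "off", "false", or "no" disable the emit; anything else leaves it on.
func telemetryEnabled(env string) bool {
	switch env {
	case "0", "off", "false", "no":
		return false
	}
	return true
}

// buildPush (hypothetical) returns the JSON body for a single invocation.
// Only the verb path, exit code, and version are included.
func buildPush(user, verb string, exit int, ver string, tsNanos int64) ([]byte, error) {
	return json.Marshal(lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": "homelab-usage", "user": user, "verb": verb},
		Values: [][2]string{{
			strconv.FormatInt(tsNanos, 10), // nanosecond timestamp as a string
			"exit=" + strconv.Itoa(exit) + " ver=" + ver,
		}},
	}}})
}

func main() {
	body, err := buildPush("emo", "tf plan", 1, "v0.6.0", 1700000000000000000)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(body))
	fmt.Println("enabled when HOMELAB_TELEMETRY=0:", telemetryEnabled("0"))
}
```

Passing the timestamp in as a parameter, rather than calling `time.Now()` inside, is a choice made only so the output is deterministic in the example.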
18
cli/usage_test.go
Normal file
@@ -0,0 +1,18 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestUsageQuery(t *testing.T) {
+	got := usageQuery("30d", "")
+	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
+	if got != want {
+		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
+	}
+	withUser := usageQuery("7d", "emo")
+	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
+		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
+	}
+}
34
docs/adr/0011-homelab-usage-telemetry.md
Normal file
@@ -0,0 +1,34 @@
+# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
+
+v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
+exists to answer the question that drove the whole CLI — *which verbs are worth
+adding next* — with data instead of one maintainer's habits (the earlier mining
+covered a single user's ~51k commands, so the surface is shaped to that user).
+
+## Decisions
+
+- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
+  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
+  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
+  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
+  the analytics reader doesn't pollute its own data.
+- **Payload is deliberately minimal: verb path + exit code only.** Labels
+  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
+  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
+  emit sees only the matched verb name, not the arguments. This is what makes
+  cross-user aggregation safe.
+- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
+  CLI writes its own invocations (attributed to its OS user) to the shared Loki
+  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
+  back with a LogQL metric query. This is the privacy-preserving resolution to
+  "what does everyone (e.g. another user) use" — it never touches anyone's
+  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
+  managed-settings; reading another user's home is off-limits even for an owner
+  in-session — a fresh session under changed MDM policy is the only legitimate
+  path, and even then this telemetry is the better answer).
+- **Best-effort, never affects the command.** All errors swallowed; an 800ms
+  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
+  must never slow or break the tool it measures.
+- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
+  path (same host, same LB dial). Presence MySQL was the alternative (queryable
+  SQL) but would add a write dependency and creds; Loki needs neither.
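The "best-effort, never affects the command" decision can be sketched as a POST whose every failure mode is ignored and whose cost is capped by a client timeout. This is an illustration under assumptions, not the commit's implementation: it uses a plain `http.Client` in place of the repo's `clientDialingIP` helper, and `postBestEffort` and `returnsWithin` are hypothetical names. The demo talks to a local `httptest` server that deliberately stalls, so no real Loki is involved.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// postBestEffort sends body to url and discards every outcome. The client
// timeout bounds how long the caller can be held up.
func postBestEffort(url string, body []byte, timeout time.Duration) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return
	}
	req.Header.Set("Content-Type", "application/json")
	client := &http.Client{Timeout: timeout}
	resp, err := client.Do(req)
	if err != nil {
		return // timeout, refused connection, DNS failure: all ignored
	}
	resp.Body.Close()
}

// returnsWithin posts to a local server that stalls for 500ms and reports
// whether postBestEffort came back in under boundMs milliseconds.
func returnsWithin(timeoutMs, boundMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(500 * time.Millisecond) // slower than the client timeout
	}))
	defer srv.Close()

	start := time.Now()
	postBestEffort(srv.URL, []byte(`{}`), time.Duration(timeoutMs)*time.Millisecond)
	return time.Since(start) < time.Duration(boundMs)*time.Millisecond
}

func main() {
	fmt.Println("returned before the slow server answered:", returnsWithin(100, 450))
}
```

Note that the call is still synchronous: with this shape, a slow sink can delay the command by up to the timeout, which is the trade-off the 800ms figure in the ADR bounds.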