homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization

Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-19 22:29:01 +00:00
parent 666fefd22b
commit 3e3fdb34f0
9 changed files with 215 additions and 4 deletions

View file

@@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```
 ## Common Operations
-- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`.
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.

View file

@@ -131,6 +131,22 @@ Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forwa
 no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
 firing set is reachable via `ALERTS` instead.)
+### v0.6 — usage telemetry (`usage top`)
+
+Makes "which verbs are actually used, by everyone" a query instead of a guess,
+so adding the *next* verb is evidence-driven, not shaped by one person's habits.
+Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
+labels + `exit=N ver=X`. **Only the verb path and a 0/1 exit status, never args,
+paths, flags, or secrets.** It's best-effort (800 ms client timeout, errors
+swallowed, never changes the command's result) and opt-out via
+`HOMELAB_TELEMETRY=0`. Because the sink is the shared Loki, aggregate usage is
+queryable **without reading anyone's home**: the privacy-preserving answer to "what does the team use."
+
+| Command | Tier | What it does |
+|---|---|---|
+| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
+
 ## Build / install
 Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -150,4 +166,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.
 ## Design
-See `infra/docs/adr/0004``0010` for the architecture decisions.
+See `infra/docs/adr/0004``0011` for the architecture decisions.

View file

@@ -1 +1 @@
-v0.5.0
+v0.6.0

77
cli/cmd_usage.go Normal file
View file

@@ -0,0 +1,77 @@
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"sort"
	"strconv"
)

func usageCommands() []Command {
	return []Command{
		{Path: []string{"usage", "top"}, Tier: TierRead,
			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
	}
}

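// usageQuery builds the LogQL metric query that counts invocations per verb.
// The optional user filter is quoted with strconv.Quote so that a quote or
// backslash in the value cannot break out of the label matcher.
func usageQuery(since, user string) string {
	sel := `job="` + usageJob + `"`
	if user != "" {
		sel += ", user=" + strconv.Quote(user)
	}
	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
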
func usageTop(args []string) error {
	since := flagValue(args, "--since")
	if since == "" {
		since = "30d"
	}
	v := url.Values{}
	v.Set("query", usageQuery(since, flagValue(args, "--user")))
	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
	if err != nil {
		return err
	}
	if containsArg(args, "--json") {
		fmt.Println(string(body))
		return nil
	}
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &r); err != nil {
		fmt.Println(string(body))
		return nil
	}
	type row struct {
		verb string
		n    int
	}
	var rows []row
	for _, s := range r.Data.Result {
		n := 0
		if len(s.Value) == 2 {
			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
				n = int(f)
			}
		}
		rows = append(rows, row{s.Metric["verb"], n})
	}
	if len(rows) == 0 {
		fmt.Println("(no usage recorded yet)")
		return nil
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
	for _, r := range rows {
		fmt.Printf("%6d %s\n", r.n, r.verb)
	}
	return nil
}

View file

@@ -50,7 +50,10 @@ func dispatch(reg []Command, args []string) error {
 	if best < 0 {
 		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
 	}
-	return reg[best].Run(args[bestLen:])
+	matched := reg[best]
+	runErr := matched.Run(args[bestLen:])
+	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
+	return runErr
 }
 // name is the space-joined verb path, e.g. "tf plan".

View file

@@ -20,6 +20,7 @@ func buildRegistry() []Command {
 	reg = append(reg, deployCommands()...)
 	reg = append(reg, netCommands()...)
 	reg = append(reg, obsCommands()...)
+	reg = append(reg, usageCommands()...)
 	return reg
 }

62
cli/telemetry.go Normal file
View file

@@ -0,0 +1,62 @@
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"

// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
	switch os.Getenv("HOMELAB_TELEMETRY") {
	case "0", "off", "false", "no":
		return
	}
	if verb == "" || strings.HasPrefix(verb, "usage") {
		return // don't self-record the analytics reader
	}
	exit := 0
	if runErr != nil {
		exit = 1
	}
	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
		Values: [][2]string{{
			strconv.FormatInt(time.Now().UnixNano(), 10),
			"exit=" + strconv.Itoa(exit) + " ver=" + version,
		}},
	}}})
	if err != nil {
		return
	}
	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
	if err != nil {
		return
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
	if err != nil {
		return
	}
	resp.Body.Close()
}

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

18
cli/usage_test.go Normal file
View file

@@ -0,0 +1,18 @@
package main

import (
	"strings"
	"testing"
)

func TestUsageQuery(t *testing.T) {
	got := usageQuery("30d", "")
	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
	if got != want {
		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
	}
	withUser := usageQuery("7d", "emo")
	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
	}
}

View file

@@ -0,0 +1,34 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit status only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`,
where `N` is 0 on success and 1 on any error, and `X` is the CLI version.
**No args, paths, flags, hostnames, or secrets** ever leave the process: the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
them back with a LogQL metric query. This is the privacy-preserving resolution
to "what does everyone else use": it never touches anyone's `~/.claude`, which
the org per-user policy bars. (See the per-user red-line in managed-settings:
reading another user's home is off-limits even for an owner in-session. A fresh
session under changed MDM policy is the only legitimate path, and even then
this telemetry is the better answer.)
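- **Best-effort, never affects the command's result.** All errors are swallowed,
and an 800ms client timeout bounds the cost: the emit runs after `Run` returns,
so an unreachable Loki delays exit by at most that timeout. Opt out via
`HOMELAB_TELEMETRY=0` (`off`, `false`, and `no` are also accepted). Telemetry
must never break the tool it measures.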
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.