diff --git a/AGENTS.md b/AGENTS.md
index b97cadb0..012e85dc 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```
 
 ## Common Operations
-- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Full docs: `cli/README.md`.
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check [path]` (external-CF vs internal-LB reachability), `dns lookup ` (Technitium vs public diff), `metrics query ""` / `metrics alerts` (Prometheus via LB), `logs query "" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf.
diff --git a/cli/README.md b/cli/README.md
index def3c26b..ab43f0f6 100644
--- a/cli/README.md
+++ b/cli/README.md
@@ -112,6 +112,25 @@ remote, with retries that ride Woodpecker's intermittent empty responses.
 step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints
 were the least reliable; `status`/`watch` use the list endpoint that works.
 
+### v0.5 verbs — net / dns / metrics / logs
+
+Reachability + observability probes. Their value is *endpoint resolution* — the
+non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
+otherwise re-derive every time — not the HTTP call itself. All reach internal
+ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
+
+| Command | Tier | What it does |
+|---|---|---|
+| `net check [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
+| `dns lookup [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
+| `metrics query ""` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
+| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
+| `logs query "" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
+
+Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
+no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
+firing set is reachable via `ALERTS` instead.)
+
 ## Build / install
 
 Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -131,4 +150,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.
 
 ## Design
 
-See `infra/docs/adr/0004`–`0009` for the architecture decisions.
+See `infra/docs/adr/0004`–`0010` for the architecture decisions.
diff --git a/cli/VERSION b/cli/VERSION
index fb7a04cf..b043aa64 100644
--- a/cli/VERSION
+++ b/cli/VERSION
@@ -1 +1 @@
-v0.4.0
+v0.5.0
diff --git a/cli/cmd_net.go b/cli/cmd_net.go
new file mode 100644
index 00000000..6401755c
--- /dev/null
+++ b/cli/cmd_net.go
@@ -0,0 +1,83 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+	"time"
+)
+
+func netCommands() []Command {
+	return []Command{
+		{Path: []string{"net", "check"}, Tier: TierRead,
+			Summary: "reachability of [/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
+		{Path: []string{"dns", "lookup"}, Tier: TierRead,
+			Summary: "resolve via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
+	}
+}
+
+func fmtProbe(code int, d time.Duration, err error) string {
+	if err != nil {
+		return "ERR " + err.Error()
+	}
+	return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
+}
+
+func netCheck(args []string) error {
+	host, rest := firstPositional(args)
+	if host == "" {
+		return fmt.Errorf("usage: homelab net check [path]")
+	}
+	path := "/"
+	if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
+		path = rest[0]
+		if !strings.HasPrefix(path, "/") {
+			path = "/" + path
+		}
+	}
+	u := "https://" + host + path
+	fmt.Printf("%s\n", u)
+
+	// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
+	pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
+	if pubIP := firstLine(pubOut); pubIP != "" {
+		c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
+		fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
+	} else {
+		fmt.Println(" external (public) no public A record")
+	}
+	// internal leg: dial the Traefik LB directly
+	c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
+	fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
+	return nil
+}
+
+func dnsLookup(args []string) error {
+	name, rest := firstPositional(args)
+	if name == "" {
+		return fmt.Errorf("usage: homelab dns lookup [A|AAAA|TXT|MX|PTR]")
+	}
+	rr := ""
+	if len(rest) > 0 {
+		rr = rest[0]
+	}
+	tech, _ := dig(name, "10.0.20.201", rr)
+	pub, _ := dig(name, "1.1.1.1", rr)
+	fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
+	fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
+	if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
+		fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
+	}
+	return nil
+}
+
+func hostOnly(h string) string { // strip any path accidentally included
+	return strings.SplitN(h, "/", 2)[0]
+}
+
+func oneLineList(s string) string {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		return "(none)"
+	}
+	return strings.ReplaceAll(s, "\n", ", ")
+}
diff --git a/cli/cmd_obs.go b/cli/cmd_obs.go
new file mode 100644
index 00000000..33f16e6c
--- /dev/null
+++ b/cli/cmd_obs.go
@@ -0,0 +1,197 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+	"strings"
+	"time"
+)
+
+const (
+	promHost = "prometheus-query.viktorbarzin.lan"
+	lokiHost = "loki.viktorbarzin.lan"
+)
+
+func obsCommands() []Command {
+	return []Command{
+		{Path: []string{"metrics", "query"}, Tier: TierRead,
+			Summary: `Prometheus instant query: metrics query "" [--json]`, Run: metricsQuery},
+		{Path: []string{"metrics", "alerts"}, Tier: TierRead,
+			Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
+		{Path: []string{"logs", "query"}, Tier: TierRead,
+			Summary: `Loki query (last --since, default 1h): logs query "" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
+	}
+}
+
+// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
+// passed as a single quoted argument; this also tolerates unquoted multi-token).
+func queryArg(args []string, valueFlags map[string]bool) string {
+	var parts []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		if valueFlags[a] {
+			i++
+			continue
+		}
+		if strings.HasPrefix(a, "-") {
+			continue
+		}
+		parts = append(parts, a)
+	}
+	return strings.Join(parts, " ")
+}
+
+func labelStr(m map[string]string) string {
+	name := m["__name__"]
+	var kv []string
+	for k, v := range m {
+		if k != "__name__" {
+			kv = append(kv, k+"="+v)
+		}
+	}
+	sort.Strings(kv)
+	return name + "{" + strings.Join(kv, ",") + "}"
+}
+
+func metricsQuery(args []string) error {
+	q := queryArg(args, nil)
+	if q == "" {
+		return fmt.Errorf(`usage: homelab metrics query "" [--json]`)
+	}
+	v := url.Values{}
+	v.Set("query", q)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no series)")
+		return nil
+	}
+	for _, s := range r.Data.Result {
+		val := ""
+		if len(s.Value) == 2 {
+			val = fmt.Sprint(s.Value[1])
+		}
+		fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
+	}
+	return nil
+}
+
+func metricsAlerts(args []string) error {
+	// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
+	// set is exposed as the synthetic ALERTS series, queryable the normal way.
+	v := url.Values{}
+	v.Set("query", `ALERTS{alertstate="firing"}`)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no firing alerts)")
+		return nil
+	}
+	for _, a := range r.Data.Result {
+		m := a.Metric
+		scope := ""
+		for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
+			if v := m[k]; v != "" {
+				scope = k + "=" + v
+				break
+			}
+		}
+		fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
+	}
+	return nil
+}
+
+func logsQuery(args []string) error {
+	q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
+	if q == "" {
+		return fmt.Errorf(`usage: homelab logs query "" [--since 1h] [--limit N] [--json]`)
+	}
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "1h"
+	}
+	dur, err := time.ParseDuration(since)
+	if err != nil {
+		return fmt.Errorf("bad --since %q: %w", since, err)
+	}
+	limit := flagValue(args, "--limit")
+	if limit == "" {
+		limit = "100"
+	}
+	end := time.Now()
+	v := url.Values{}
+	v.Set("query", q)
+	v.Set("limit", limit)
+	v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
+	v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Values [][]string `json:"values"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	n := 0
+	for _, s := range r.Data.Result {
+		for _, val := range s.Values {
+			if len(val) == 2 {
+				fmt.Println(val[1])
+				n++
+			}
+		}
+	}
+	if n == 0 {
+		fmt.Println("(no log lines)")
+	}
+	return nil
+}
diff --git a/cli/homelab.go b/cli/homelab.go
index 108dfa93..23a7f776 100644
--- a/cli/homelab.go
+++ b/cli/homelab.go
@@ -18,6 +18,8 @@ func buildRegistry() []Command {
 	reg = append(reg, memoryCommands()...)
 	reg = append(reg, ciCommands()...)
 	reg = append(reg, deployCommands()...)
+	reg = append(reg, netCommands()...)
+	reg = append(reg, obsCommands()...)
 	return reg
 }
 
diff --git a/cli/probe.go b/cli/probe.go
new file mode 100644
index 00000000..25d148a0
--- /dev/null
+++ b/cli/probe.go
@@ -0,0 +1,76 @@
+package main
+
+import (
+	"context"
+	"crypto/tls"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"net/url"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
+const internalLBIP = "10.0.20.203"
+
+// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
+// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
+// host:443:ip`. TLS verification is skipped (these are reachability/observability
+// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
+func clientDialingIP(ip string, timeout time.Duration) *http.Client {
+	d := &net.Dialer{Timeout: 8 * time.Second}
+	tr := &http.Transport{
+		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
+			if i := strings.LastIndex(addr, ":"); i >= 0 {
+				addr = ip + addr[i:]
+			}
+			return d.DialContext(ctx, network, addr)
+		},
+		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
+	}
+	return &http.Client{Timeout: timeout, Transport: tr}
+}
+
+// probeURL issues a GET and returns status code + elapsed time.
+func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
+	start := time.Now()
+	resp, err := c.Get(rawurl)
+	dur := time.Since(start)
+	if err != nil {
+		return 0, dur, err
+	}
+	resp.Body.Close()
+	return resp.StatusCode, dur, nil
+}
+
+// lbGetBody GETs https://? through the internal LB and returns the body.
+func lbGetBody(host, path string, q url.Values) ([]byte, error) {
+	u := "https://" + host + path
+	if len(q) > 0 {
+		u += "?" + q.Encode()
+	}
+	resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	body, _ := io.ReadAll(resp.Body)
+	if resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
+	}
+	return body, nil
+}
+
+// dig runs `dig +short` against a resolver, optionally for a record type.
+func dig(name, server, rrtype string) (string, error) {
+	args := []string{"+short", "+time=3", "+tries=1"}
+	if rrtype != "" {
+		args = append(args, rrtype)
+	}
+	args = append(args, name, "@"+server)
+	out, err := exec.Command("dig", args...).Output()
+	return strings.TrimSpace(string(out)), err
+}
diff --git a/cli/probe_test.go b/cli/probe_test.go
new file mode 100644
index 00000000..bec4d132
--- /dev/null
+++ b/cli/probe_test.go
@@ -0,0 +1,49 @@
+package main
+
+import "testing"
+
+func TestQueryArg(t *testing.T) {
+	if got := queryArg([]string{"up"}, nil); got != "up" {
+		t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
+	}
+	if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
+		t.Errorf(`--json should be dropped, got %q`, got)
+	}
+	// single quoted PromQL arrives as one token
+	if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
+		t.Errorf(`quoted query mangled: %q`, got)
+	}
+	// value-flags and their values are skipped, query survives
+	vf := map[string]bool{"--since": true, "--limit": true}
+	if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
+		t.Errorf(`value-flag skipping failed: %q`, got)
+	}
+}
+
+func TestLabelStr(t *testing.T) {
+	got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
+	if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
+		t.Errorf("labelStr = %q", got)
+	}
+	if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
+		t.Errorf("labelStr (no __name__) = %q", got)
+	}
+}
+
+func TestOneLineList(t *testing.T) {
+	if got := oneLineList(" "); got != "(none)" {
+		t.Errorf("empty = %q, want (none)", got)
+	}
+	if got := oneLineList("a\nb"); got != "a, b" {
+		t.Errorf("multi = %q, want 'a, b'", got)
+	}
+}
+
+func TestHostOnly(t *testing.T) {
+	if got := hostOnly("foo.me/path"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+	if got := hostOnly("foo.me"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+}
diff --git a/docs/adr/0010-homelab-net-obs-verbs.md b/docs/adr/0010-homelab-net-obs-verbs.md
new file mode 100644
index 00000000..29a94a46
--- /dev/null
+++ b/docs/adr/0010-homelab-net-obs-verbs.md
@@ -0,0 +1,37 @@
+# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
+
+v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
+test the user posed mid-build: *does the verb save reasoning, or only typing?* A
+wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
+keystrokes but not thought. These four save thought — the reasoning they encode
+is **which endpoint, reached how, with what auth/URL shape** — re-derived every
+time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
+get`, which are thin wrappers; see the session discussion.)
+
+## Decisions
+
+- **Internal ingresses, reached via the LB.** Everything routes through the
+  Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
+  Go form of the house `curl --resolve host:443:10.0.20.203` pattern
+  (`probe.go: clientDialingIP`). Verified live before building: Prometheus
+  (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
+  answer JSON over the LB with **no auth gate and no port-forward** — so these
+  stay clean HTTP clients, not kubectl wrappers.
+- **`net check` is two-legged on purpose.** It resolves the host via public DNS
+  (→ Cloudflare) AND dials the internal LB, reporting both — because the useful
+  question is *where* a break is (CF edge vs the app vs the LB path), which a
+  single curl can't answer. The external leg forces public resolution (the devvm
+  resolver is split-horizon and would otherwise hit the LB for both).
+- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
+  `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
+  Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
+  alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
+  queryable through the working endpoint — so no new dependency.
+- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
+  raw `*.svc` services) that would force port-forward/`kubectl run`. The
+  reasoning-savings there don't beat the added moving parts; kept out of scope.
+- **No `node`/`secret` group.** Same test: their high-volume parts are
+  command-wrappers (low savings); only compound node ops (serial console, VM
+  wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
+  unless a concrete pain surfaces — the high-value deterministic surface
+  (tf/work/ci/k8s/memory + these probes) is now covered.