diff --git a/AGENTS.md b/AGENTS.md index 012e85dc..b97cadb0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' ``` ## Common Operations -- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check [path]` (external-CF vs internal-LB reachability), `dns lookup ` (Technitium vs public diff), `metrics query ""` / `metrics alerts` (Prometheus via LB), `logs query "" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`. +- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Full docs: `cli/README.md`. - **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service. - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf. diff --git a/cli/README.md b/cli/README.md index ab43f0f6..def3c26b 100644 --- a/cli/README.md +++ b/cli/README.md @@ -112,25 +112,6 @@ remote, with retries that ride Woodpecker's intermittent empty responses. step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were the least reliable; `status`/`watch` use the list endpoint that works. -### v0.5 verbs — net / dns / metrics / logs - -Reachability + observability probes. Their value is *endpoint resolution* — the -non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd -otherwise re-derive every time — not the HTTP call itself. All reach internal -ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`). - -| Command | Tier | What it does | -|---|---|---| -| `net check [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) | -| `dns lookup [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps | -| `metrics query ""` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` | -| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) | -| `logs query "" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` | - -Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward, -no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the -firing set is reachable via `ALERTS` instead.) - ## Build / install Built from source to `/usr/local/bin/homelab` during devvm provisioning @@ -150,4 +131,4 @@ original flag-based path unchanged, so the webhook handler is unaffected. ## Design -See `infra/docs/adr/0004`–`0010` for the architecture decisions. +See `infra/docs/adr/0004`–`0009` for the architecture decisions. diff --git a/cli/VERSION b/cli/VERSION index b043aa64..fb7a04cf 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.5.0 +v0.4.0 diff --git a/cli/cmd_net.go b/cli/cmd_net.go deleted file mode 100644 index 6401755c..00000000 --- a/cli/cmd_net.go +++ /dev/null @@ -1,83 +0,0 @@ -package main - -import ( - "fmt" - "strings" - "time" -) - -func netCommands() []Command { - return []Command{ - {Path: []string{"net", "check"}, Tier: TierRead, - Summary: "reachability of [/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck}, - {Path: []string{"dns", "lookup"}, Tier: TierRead, - Summary: "resolve via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup}, - } -} - -func fmtProbe(code int, d time.Duration, err error) string { - if err != nil { - return "ERR " + err.Error() - } - return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds()) -} - -func netCheck(args []string) error { - host, rest := firstPositional(args) - if host == "" { - return fmt.Errorf("usage: homelab net check [path]") - } - path := "/" - if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") { - path = rest[0] - if !strings.HasPrefix(path, "/") { - path = "/" + path - } - } - u := "https://" + host + path - fmt.Printf("%s\n", u) - - // external leg: resolve via public DNS, dial the public IP (tests the real CF path) - pubOut, _ := dig(hostOnly(host), "1.1.1.1", "") - if pubIP := firstLine(pubOut); pubIP != "" { - c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u) - fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e)) - } else { - fmt.Println(" external (public) no public A record") - } - // internal leg: dial the Traefik LB directly - c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u) - fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e)) - return nil -} - -func dnsLookup(args []string) error { - name, rest := firstPositional(args) - if name == "" { - return fmt.Errorf("usage: homelab dns lookup [A|AAAA|TXT|MX|PTR]") - } - rr := "" - if len(rest) > 0 { - rr = rest[0] - } - tech, _ := dig(name, "10.0.20.201", rr) - pub, _ := dig(name, "1.1.1.1", rr) - fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech)) - fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub)) - if strings.TrimSpace(tech) != strings.TrimSpace(pub) { - fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap") - } - return nil -} - -func hostOnly(h string) string { // strip any path accidentally included - return strings.SplitN(h, "/", 2)[0] -} - -func oneLineList(s string) string { - s = strings.TrimSpace(s) - if s == "" { - return "(none)" - } - return strings.ReplaceAll(s, "\n", ", ") -} diff --git a/cli/cmd_obs.go b/cli/cmd_obs.go deleted file mode 100644 index 33f16e6c..00000000 --- a/cli/cmd_obs.go +++ /dev/null @@ -1,197 +0,0 @@ -package main - -import ( - "encoding/json" - "fmt" - "net/url" - "sort" - "strconv" - "strings" - "time" -) - -const ( - promHost = "prometheus-query.viktorbarzin.lan" - lokiHost = "loki.viktorbarzin.lan" -) - -func obsCommands() []Command { - return []Command{ - {Path: []string{"metrics", "query"}, Tier: TierRead, - Summary: `Prometheus instant query: metrics query "" [--json]`, Run: metricsQuery}, - {Path: []string{"metrics", "alerts"}, Tier: TierRead, - Summary: "list currently firing Prometheus alerts", Run: metricsAlerts}, - {Path: []string{"logs", "query"}, Tier: TierRead, - Summary: `Loki query (last --since, default 1h): logs query "" [--since 1h] [--limit N] [--json]`, Run: logsQuery}, - } -} - -// queryArg joins non-flag args into the query (PromQL/LogQL should normally be -// passed as a single quoted argument; this also tolerates unquoted multi-token). -func queryArg(args []string, valueFlags map[string]bool) string { - var parts []string - for i := 0; i < len(args); i++ { - a := args[i] - if valueFlags[a] { - i++ - continue - } - if strings.HasPrefix(a, "-") { - continue - } - parts = append(parts, a) - } - return strings.Join(parts, " ") -} - -func labelStr(m map[string]string) string { - name := m["__name__"] - var kv []string - for k, v := range m { - if k != "__name__" { - kv = append(kv, k+"="+v) - } - } - sort.Strings(kv) - return name + "{" + strings.Join(kv, ",") + "}" -} - -func metricsQuery(args []string) error { - q := queryArg(args, nil) - if q == "" { - return fmt.Errorf(`usage: homelab metrics query "" [--json]`) - } - v := url.Values{} - v.Set("query", q) - body, err := lbGetBody(promHost, "/api/v1/query", v) - if err != nil { - return err - } - if containsArg(args, "--json") { - fmt.Println(string(body)) - return nil - } - var r struct { - Data struct { - Result []struct { - Metric map[string]string `json:"metric"` - Value []interface{} `json:"value"` - } `json:"result"` - } `json:"data"` - } - if err := json.Unmarshal(body, &r); err != nil { - fmt.Println(string(body)) - return nil - } - if len(r.Data.Result) == 0 { - fmt.Println("(no series)") - return nil - } - for _, s := range r.Data.Result { - val := "" - if len(s.Value) == 2 { - val = fmt.Sprint(s.Value[1]) - } - fmt.Printf("%-14s %s\n", val, labelStr(s.Metric)) - } - return nil -} - -func metricsAlerts(args []string) error { - // prometheus-query is a query-only frontend (no /api/v1/alerts); the firing - // set is exposed as the synthetic ALERTS series, queryable the normal way. - v := url.Values{} - v.Set("query", `ALERTS{alertstate="firing"}`) - body, err := lbGetBody(promHost, "/api/v1/query", v) - if err != nil { - return err - } - if containsArg(args, "--json") { - fmt.Println(string(body)) - return nil - } - var r struct { - Data struct { - Result []struct { - Metric map[string]string `json:"metric"` - } `json:"result"` - } `json:"data"` - } - if err := json.Unmarshal(body, &r); err != nil { - fmt.Println(string(body)) - return nil - } - if len(r.Data.Result) == 0 { - fmt.Println("(no firing alerts)") - return nil - } - for _, a := range r.Data.Result { - m := a.Metric - scope := "" - for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} { - if v := m[k]; v != "" { - scope = k + "=" + v - break - } - } - fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope) - } - return nil -} - -func logsQuery(args []string) error { - q := queryArg(args, map[string]bool{"--since": true, "--limit": true}) - if q == "" { - return fmt.Errorf(`usage: homelab logs query "" [--since 1h] [--limit N] [--json]`) - } - since := flagValue(args, "--since") - if since == "" { - since = "1h" - } - dur, err := time.ParseDuration(since) - if err != nil { - return fmt.Errorf("bad --since %q: %w", since, err) - } - limit := flagValue(args, "--limit") - if limit == "" { - limit = "100" - } - end := time.Now() - v := url.Values{} - v.Set("query", q) - v.Set("limit", limit) - v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10)) - v.Set("end", strconv.FormatInt(end.UnixNano(), 10)) - body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v) - if err != nil { - return err - } - if containsArg(args, "--json") { - fmt.Println(string(body)) - return nil - } - var r struct { - Data struct { - Result []struct { - Values [][]string `json:"values"` - } `json:"result"` - } `json:"data"` - } - if err := json.Unmarshal(body, &r); err != nil { - fmt.Println(string(body)) - return nil - } - n := 0 - for _, s := range r.Data.Result { - for _, val := range s.Values { - if len(val) == 2 { - fmt.Println(val[1]) - n++ - } - } - } - if n == 0 { - fmt.Println("(no log lines)") - } - return nil -} diff --git a/cli/homelab.go b/cli/homelab.go index 23a7f776..108dfa93 100644 --- a/cli/homelab.go +++ b/cli/homelab.go @@ -18,8 +18,6 @@ func buildRegistry() []Command { reg = append(reg, memoryCommands()...) reg = append(reg, ciCommands()...) reg = append(reg, deployCommands()...) - reg = append(reg, netCommands()...) - reg = append(reg, obsCommands()...) return reg } diff --git a/cli/probe.go b/cli/probe.go deleted file mode 100644 index 25d148a0..00000000 --- a/cli/probe.go +++ /dev/null @@ -1,76 +0,0 @@ -package main - -import ( - "context" - "crypto/tls" - "fmt" - "io" - "net" - "net/http" - "net/url" - "os/exec" - "strings" - "time" -) - -// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it. -const internalLBIP = "10.0.20.203" - -// clientDialingIP returns an http.Client that dials ip for ANY host while keeping -// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve -// host:443:ip`. TLS verification is skipped (these are reachability/observability -// probes, not security checks; internal .lan vhosts may serve a non-matching cert). -func clientDialingIP(ip string, timeout time.Duration) *http.Client { - d := &net.Dialer{Timeout: 8 * time.Second} - tr := &http.Transport{ - DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) { - if i := strings.LastIndex(addr, ":"); i >= 0 { - addr = ip + addr[i:] - } - return d.DialContext(ctx, network, addr) - }, - TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, - } - return &http.Client{Timeout: timeout, Transport: tr} -} - -// probeURL issues a GET and returns status code + elapsed time. -func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) { - start := time.Now() - resp, err := c.Get(rawurl) - dur := time.Since(start) - if err != nil { - return 0, dur, err - } - resp.Body.Close() - return resp.StatusCode, dur, nil -} - -// lbGetBody GETs https://? through the internal LB and returns the body. -func lbGetBody(host, path string, q url.Values) ([]byte, error) { - u := "https://" + host + path - if len(q) > 0 { - u += "?" + q.Encode() - } - resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u) - if err != nil { - return nil, err - } - defer resp.Body.Close() - body, _ := io.ReadAll(resp.Body) - if resp.StatusCode >= 300 { - return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body))) - } - return body, nil -} - -// dig runs `dig +short` against a resolver, optionally for a record type. -func dig(name, server, rrtype string) (string, error) { - args := []string{"+short", "+time=3", "+tries=1"} - if rrtype != "" { - args = append(args, rrtype) - } - args = append(args, name, "@"+server) - out, err := exec.Command("dig", args...).Output() - return strings.TrimSpace(string(out)), err -} diff --git a/cli/probe_test.go b/cli/probe_test.go deleted file mode 100644 index bec4d132..00000000 --- a/cli/probe_test.go +++ /dev/null @@ -1,49 +0,0 @@ -package main - -import "testing" - -func TestQueryArg(t *testing.T) { - if got := queryArg([]string{"up"}, nil); got != "up" { - t.Errorf(`queryArg(["up"]) = %q, want "up"`, got) - } - if got := queryArg([]string{"up", "--json"}, nil); got != "up" { - t.Errorf(`--json should be dropped, got %q`, got) - } - // single quoted PromQL arrives as one token - if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" { - t.Errorf(`quoted query mangled: %q`, got) - } - // value-flags and their values are skipped, query survives - vf := map[string]bool{"--since": true, "--limit": true} - if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` { - t.Errorf(`value-flag skipping failed: %q`, got) - } -} - -func TestLabelStr(t *testing.T) { - got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"}) - if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted - t.Errorf("labelStr = %q", got) - } - if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" { - t.Errorf("labelStr (no __name__) = %q", got) - } -} - -func TestOneLineList(t *testing.T) { - if got := oneLineList(" "); got != "(none)" { - t.Errorf("empty = %q, want (none)", got) - } - if got := oneLineList("a\nb"); got != "a, b" { - t.Errorf("multi = %q, want 'a, b'", got) - } -} - -func TestHostOnly(t *testing.T) { - if got := hostOnly("foo.me/path"); got != "foo.me" { - t.Errorf("hostOnly = %q", got) - } - if got := hostOnly("foo.me"); got != "foo.me" { - t.Errorf("hostOnly = %q", got) - } -} diff --git a/docs/adr/0010-homelab-net-obs-verbs.md b/docs/adr/0010-homelab-net-obs-verbs.md deleted file mode 100644 index 29a94a46..00000000 --- a/docs/adr/0010-homelab-net-obs-verbs.md +++ /dev/null @@ -1,37 +0,0 @@ -# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value - -v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit -test the user posed mid-build: *does the verb save reasoning, or only typing?* A -wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves -keystrokes but not thought. These four save thought — the reasoning they encode -is **which endpoint, reached how, with what auth/URL shape** — re-derived every -time otherwise. (That same test deprioritized `node ssh` aliasing and `secret -get`, which are thin wrappers; see the session discussion.) - -## Decisions - -- **Internal ingresses, reached via the LB.** Everything routes through the - Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the - Go form of the house `curl --resolve host:443:10.0.20.203` pattern - (`probe.go: clientDialingIP`). Verified live before building: Prometheus - (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both - answer JSON over the LB with **no auth gate and no port-forward** — so these - stay clean HTTP clients, not kubectl wrappers. -- **`net check` is two-legged on purpose.** It resolves the host via public DNS - (→ Cloudflare) AND dials the internal LB, reporting both — because the useful - question is *where* a break is (CF edge vs the app vs the LB path), which a - single curl can't answer. The external leg forces public resolution (the devvm - resolver is split-horizon and would otherwise hit the LB for both). -- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.** - `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and - Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing - alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series, - queryable through the working endpoint — so no new dependency. -- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2, - raw `*.svc` services) that would force port-forward/`kubectl run`. The - reasoning-savings there don't beat the added moving parts; kept out of scope. -- **No `node`/`secret` group.** Same test: their high-volume parts are - command-wrappers (low savings); only compound node ops (serial console, VM - wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt - unless a concrete pain surfaces — the high-value deterministic surface - (tf/work/ci/k8s/memory + these probes) is now covered.