homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)

The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-19 11:27:31 +00:00
parent 9189560ac3
commit e91e1612dd
9 changed files with 466 additions and 3 deletions

View file

@ -289,7 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
```
## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Full docs: `cli/README.md`.
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.