1 commit

Author SHA1 Message Date
Viktor Barzin
fe9364b9c9 portal-tts: DRAFT stack — Piper TTS (CPU, always-on) for portal-assistant
Draft (NOT applied) of a new infra stack deploying Piper as an in-cluster
text-to-speech service for the portal-assistant Gateway (portal-assistant
issue #3, ADR-0003). Bulgarian (bg_BG-dimitar-medium) + English
(en_US-lessac-medium), voice chosen per request.

Why this shape:
- CPU-only, always-on (replicas=1, no GPU): Piper runs in real time on CPU, so
  this keeps TTS off the OOM-prone shared T4 that the two GPU siblings
  (tts/chatterbox, portal-stt) already contend for. Bulgarian isn't on
  chatterbox anyway (its langs exclude bg).
- OpenAI-compatible image (openedai-speech-min, /v1/audio/speech) so the Gateway
  gets raw audio bytes per its tts.synthesize(text, lang) -> bytes contract and
  treats Piper + the future edge-tts fallback identically — same shape
  chatterbox already uses.
- Voices on an NFS-SSD PVC, downloaded from rhasspy/piper-voices by an init
  container on first boot; a ConfigMap maps request voice bg/en -> .onnx model.
- ClusterIP only (audio stays on the LAN; the Gateway is the only externally
  exposed component, ADR-0001).

Mirrors the just-written portal-stt sibling stack's conventions. terraform fmt
clean; terraform validate passes (only the codebase-wide kubernetes_namespace
deprecation warnings). HITL: operator reviews + applies via GitOps; do not apply
from a worktree. Open items flagged in main.tf (image choice on a frozen
upstream; resource sizing to confirm with krr).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:59:42 +00:00
425 changed files with 11535 additions and 43696 deletions
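
The `tts.synthesize(text, lang) -> bytes` contract described in the commit message could look roughly like the sketch below on the Gateway side. It is an illustration under stated assumptions, not the Gateway's actual code: the base URL `http://portal-tts:8000`, the `model` value `tts-1`, the `wav` response format, and the helper names are hypothetical. Only the `/v1/audio/speech` path and the `bg`/`en` voice keys come from the commit message.

```python
import json
import urllib.request

# Hypothetical in-cluster address; the real Service name and port are not stated in the commit.
TTS_BASE_URL = "http://portal-tts:8000"
# Voice keys that the ConfigMap maps to .onnx models, per the commit message.
SUPPORTED_LANGS = {"bg", "en"}


def build_speech_request(text: str, lang: str) -> urllib.request.Request:
    """Build an OpenAI-style POST /v1/audio/speech request for one utterance."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    payload = {
        "model": "tts-1",          # assumed model name for an OpenAI-compatible server
        "input": text,
        "voice": lang,             # voice chosen per request
        "response_format": "wav",  # assumed; any format the server supports would do
    }
    return urllib.request.Request(
        f"{TTS_BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, lang: str) -> bytes:
    """Return raw audio bytes, matching tts.synthesize(text, lang) -> bytes."""
    with urllib.request.urlopen(build_speech_request(text, lang), timeout=30) as resp:
        return resp.read()
```

Keeping request construction separate from the network call means a fallback backend with the same endpoint shape would only need a different base URL.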


@@ -7,7 +7,6 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
import argparse
import json
import os
import subprocess
import sys
from urllib.parse import urljoin
@@ -18,29 +17,13 @@ except ImportError:
    print(" pip install requests")
    sys.exit(1)
# Configuration from environment variables (ha-sofia specific)
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
def _token_from_homelab():
    """Resolve the token via the homelab CLI when the env var isn't set, so the
    script works from any directory / unprovisioned session (see ADR-0012)."""
    try:
        out = subprocess.run(
            ["homelab", "ha", "token", "--instance", "sofia"],
            capture_output=True, text=True, timeout=30)
        if out.returncode == 0 and out.stdout.strip():
            return out.stdout.strip()
    except Exception:
        pass
    return None
# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
if not HA_TOKEN:
    print("ERROR: no ha-sofia API token available.")
    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
if not HA_URL or not HA_TOKEN:
    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
    print("These should be set when activating the Claude venv (~/.venvs/claude)")
    sys.exit(1)
HEADERS = {


@@ -166,8 +166,7 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
@@ -178,13 +177,6 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
## WebAuthn / Passkeys (2026-06-20)
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes`: `tg plan` errors with `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
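
The lesson from the 2026-06-18 passkey wipe (exact-match the demo user, never take `users[0]` of a fuzzy `?search=`) can be sketched as a small guard. This is a hypothetical helper, not the deleted ad-hoc script: it assumes each user record returned by the search is a dict with `username` and `pk` keys, and the account names in the usage example are made up.

```python
def pick_exact_user(users: list[dict], username: str) -> dict:
    """Return the single user whose username equals `username` exactly.

    Raises LookupError instead of guessing when the fuzzy search returned
    zero or several exact matches, so cleanup never touches the wrong account.
    """
    matches = [u for u in users if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Simulated fuzzy search result: the real account sorts first because its
# profile happens to contain the search term.
search_results = [
    {"pk": 7, "username": "real-owner"},
    {"pk": 42, "username": "demo"},
]
target = pick_exact_user(search_results, "demo")
print(target["pk"])  # 42, not the pk of search_results[0]
```

Failing loudly when the demo user is absent is deliberate: a cleanup step that cannot find its target should stop, not fall back to the first search hit.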


@@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.1.0
date: 2026-06-24
version: 2.0.0
date: 2026-02-07
---
# Home Assistant Control
@@ -44,12 +44,6 @@ There are **two** Home Assistant instances:
- Environment variables for each instance:
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
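
The curl one-liner in the **Token** bullet has a direct Python equivalent, sketched below for cases where a script rather than a shell is driving the API. The `homelab ha token --instance …` invocation and both hostnames come from this document; the function names and the `HA_URLS` mapping are illustrative, and the sketch assumes the CLI prints only the token on stdout.

```python
import json
import subprocess
import urllib.request

HA_URLS = {
    "sofia": "https://ha-sofia.viktorbarzin.me",
    "london": "https://ha-london.viktorbarzin.me",
}


def ha_token(instance: str = "sofia") -> str:
    """Resolve the long-lived API token through the homelab CLI."""
    out = subprocess.run(
        ["homelab", "ha", "token", "--instance", instance],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return out.stdout.strip()


def states_request(instance: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET /api/states request for one instance."""
    return urllib.request.Request(
        f"{HA_URLS[instance]}/api/states",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_states(instance: str = "sofia") -> list:
    """Fetch all entity states; needs network access and a working homelab CLI."""
    req = states_request(instance, ha_token(instance))
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```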
## API Control
@@ -395,27 +389,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote**, so `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -437,15 +418,10 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 3. Cowboy E-Bike
- `sensor.bike_state_of_charge`: Battery %
- `sensor.bike_total_distance`: Total km
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@@ -464,17 +440,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -489,8 +460,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
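
Using `sensor.uptime` as the "back up" marker amounts to comparing its value before and after the restart. The sketch below shows only that comparison. It assumes the sensor's state is an ISO 8601 timestamp of the last start and that it reads `unknown` or `unavailable` while HA is still coming up; both are assumptions about the entity, and `has_restarted` is a hypothetical helper name.

```python
from datetime import datetime


def has_restarted(uptime_before: str, uptime_after: str) -> bool:
    """True once sensor.uptime reports a later start time than before the restart."""
    if uptime_after in ("unknown", "unavailable"):
        return False  # HA is not fully back yet
    return datetime.fromisoformat(uptime_after) > datetime.fromisoformat(uptime_before)


# Value read before POST /api/services/homeassistant/restart, then polled afterwards.
before = "2026-06-24T10:00:00+00:00"
print(has_restarted(before, "unavailable"))                # False
print(has_restarted(before, "2026-06-24T10:00:00+00:00"))  # False
print(has_restarted(before, "2026-06-24T10:15:30+00:00"))  # True
```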
### Docker Setup
```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### SSH Access
```bash


@@ -1,203 +0,0 @@
export const meta = {
name: 'memory-overcommit-node-removal',
description: 'Read-only: assess PVE host + k8s memory overcommit, right-size deployment REQUESTS (scheduling) and LIMITS (OOM) separately from 30d usage, then test whether one worker node can be removed while preserving N-1 by BOTH a physical-usage and a scheduling-request model. Emits a gated plan.',
phases: [
{ title: 'Gather' },
{ title: 'Model' },
{ title: 'Verify' },
],
}
// ---------- confirmed read-only access paths ----------
const SSH = "ssh -o BatchMode=yes -o ConnectTimeout=8 root@192.168.1.127";
const PROM = "https://prometheus-query.viktorbarzin.lan/api/v1/query";
const G = (mib) => (mib == null ? "?" : (mib / 1024).toFixed(1) + "Gi");
// ---------- schema helpers ----------
const num = { type: "number" }, str = { type: "string" }, bool = { type: "boolean" };
const arr = (items) => ({ type: "array", items });
const obj = (props) => ({ type: "object", additionalProperties: false, required: Object.keys(props), properties: props });
const HOST = obj({
host_total_mib: num, host_used_mib: num, host_free_mib: num, host_available_mib: num,
swap_total_mib: num, swap_used_mib: num, ksm_saved_mib: num,
vms: arr(obj({ vmid: num, name: str, configured_mib: num, balloon_mib: num, rss_mib: num, is_k8s_node: bool })),
sum_vm_configured_mib: num, sum_vm_rss_mib: num, notes: str,
});
const K8S = obj({
nodes: arr(obj({
name: str, role: str, is_gpu: bool, is_control_plane: bool, gpu_tainted: bool, schedulable: bool,
capacity_mib: num, allocatable_mib: num, requests_mib: num, ds_requests_mib: num, limits_mib: num, usage_now_mib: num, peak_30d_mib: num, pod_count: num,
})),
cluster_allocatable_mib: num, cluster_requests_mib: num, cluster_usage_now_mib: num, cluster_peak_30d_mib: num, notes: str,
});
// NOTE the v2 split: requests are sized for SCHEDULING (cover normal load, can shrink below current),
// limits are sized for OOM SAFETY (cover peak). They are DIFFERENT knobs and must not be conflated.
const USAGE = obj({
totals: obj({
sum_current_requests_mib: num, sum_recommended_requests_mib: num, net_request_reclaim_mib: num,
reschedulable_request_recommended_mib: num, ds_request_recommended_per_node_mib: num, gpu_request_recommended_mib: num,
largest_single_request_mib: num, count_request_shrink: num, count_limit_raise_oom: num,
}),
request_shrinks: arr(obj({ namespace: str, name: str, kind: str, replicas: num, current_request_mib: num, p95_30d_mib: num, recommended_request_mib: num, delta_mib: num, rationale: str })),
limit_raises_oom: arr(obj({ namespace: str, name: str, container: str, current_limit_mib: num, peak_max_30d_mib: num, recommended_limit_mib: num, risk: str })),
spiky_periodic: arr(obj({ namespace: str, name: str, note: str })),
method_notes: str,
});
const TOPO = obj({
nodes: arr(obj({ name: str, sticky_pods: arr(str), local_pv_count: num, volumeattachments: num, cnpg_primary: bool, gpu_workloads: bool, evac_difficulty: str, evac_notes: str })),
spofs: arr(obj({ namespace: str, name: str, replicas: num, has_pdb: bool, issue: str })),
antiaffinity_risks: arr(str),
csi_pinning_note: str,
priority_classes_note: str,
notes: str,
});
const VERDICT = obj({ refuted: bool, confidence: str, reasoning: str, corrections: arr(str) });
// ---------- prompts ----------
const HOST_PROMPT = `Read-only PVE host memory audit. SSH (key-based): ${SSH} '<cmd>' (host 'pve', the Proxmox r730 at 192.168.1.127). Read-only ONLY; NEVER a state-changing qm/pvesh/ha-manager command.
- 'free -m' -> host_total/used/free/available_mib + swap_total/swap_used_mib.
- KSM: cat /sys/kernel/mm/ksm/pages_sharing ; ksm_saved_mib = pages_sharing*4096/1048576.
- 'qm list'; for each running VM 'qm config <vmid>' -> memory (configured_mib), balloon (balloon_mib; if balloon==memory or balloon==0 ballooning is effectively OFF -> host RSS pins near configured = the headroom RATCHET).
- Per-VM host RSS: read /var/run/qemu-server/<vmid>.pid then 'ps -o rss= -p <pid>' (KiB->MiB).
- is_k8s_node = VMs named k8s-*.
Return per-VM rows + sum_vm_configured_mib + sum_vm_rss_mib over ALL RUNNING VMs. notes: overcommit ratio, swap pressure, ballooning state.`;
const K8S_PROMPT = `Read-only Kubernetes node-capacity audit. kubectl read access confirmed. For every node (k8s-master + k8s-node1..6):
- capacity_mib & allocatable_mib from 'kubectl get node <n> -o json' (Ki->MiB).
- is_control_plane (node-role.kubernetes.io/control-plane), is_gpu (k8s-node1; nvidia.com/gpu in capacity), gpu_tainted (a NoSchedule taint general pods would NOT tolerate), schedulable.
- requests_mib, limits_mib, ds_requests_mib (DaemonSet-owned pods only), usage_now_mib, pod_count.
Prefer Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
sum by (node)(kube_pod_container_resource_requests{resource="memory"}) [these metrics HAVE a node label]
usage_now: cAdvisor container_memory_working_set_bytes has NO node label - join: sum by (node)(container_memory_working_set_bytes{container!="",container!="POD"} * on(namespace,pod) group_left(node) kube_pod_info)
- peak_30d_mib per node: max_over_time of that joined per-node sum over [30d:5m] (best effort; if the join is flaky leave 0 and rely on cluster figure).
ALSO return cluster-wide:
- cluster_allocatable_mib, cluster_requests_mib, cluster_usage_now_mib.
- cluster_peak_30d_mib = max_over_time(sum(container_memory_working_set_bytes{container!="",container!="POD"})[30d:5m]) /1024/1024 (this is the PHYSICAL reliability bedrock - the highest the whole cluster ever simultaneously used in 30d).
notes: host-vs-k8s overcommit contrast (requests vs allocatable vs actual usage).`;
const USAGE_PROMPT = `Read-only memory RIGHT-SIZING from 30-day usage. CRITICAL: requests and limits are DIFFERENT knobs - size them separately. Do NOT set requests to peak (that is what a flawed earlier run did; it manufactured a false capacity shortfall).
- REQUEST (scheduling reservation, drives bin-packing & node-removal feasibility): size to cover NORMAL operation = recommended_request_mib = ceil(max(p95_30d * 1.15, 64)). This SHRINKS the many over-provisioned requests toward real usage. requests should sit BELOW limits (Burstable). Be moderately conservative for stateful/db/critical infra (mysql, postgres/CNPG, redis, vault, prometheus, mailserver): use p99 instead of p95.
- LIMIT (OOM ceiling): recommended_limit_mib = ceil(peak_max_30d * 1.25). FLAG any container whose peak_max_30d >= 95% of current limit as an OOM risk (limit_raises_oom) - these are real reliability bugs to fix REGARDLESS of node removal.
Sources: kubectl (current requests/limits/replicas for Deployments/StatefulSets/DaemonSets, all namespaces); Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
p95: quantile_over_time(0.95, container_memory_working_set_bytes{container!="",container!="POD"}[30d])
p99: quantile_over_time(0.99, ...[30d])
peak: max_over_time(...[30d])
Aggregate by (namespace,pod,container), map pod->workload (strip hash suffixes), take MAX across a workload's pods as per-replica value.
Splits for the N-1 model (use the REQUEST recommendation; multiply per-replica by replicas):
- reschedulable_request_recommended_mib = SUM recommended_request of Deployment+StatefulSet pods that are NON-GPU and schedulable on general workers (everything that must reschedule if a worker is removed).
- ds_request_recommended_per_node_mib = SUM recommended_request of DaemonSet containers (one set per node).
- gpu_request_recommended_mib = SUM recommended_request of workloads pinned to GPU node k8s-node1 (REAL value; do not inflate).
- largest_single_request_mib = largest single recommended per-replica request among reschedulable.
Return totals (sum_current_requests_mib, sum_recommended_requests_mib, net_request_reclaim_mib = sum of POSITIVE request deltas i.e. shrinks, the splits, count_request_shrink, count_limit_raise_oom), request_shrinks (top ~30 by delta), limit_raises_oom (every OOM-tight container), spiky_periodic (mailserver/immich-ml/backups/dumps/postiz). NEVER mutate.`;
const TOPO_PROMPT = `Read-only reliability-topology audit: which worker is safest to remove? Candidates: k8s-node2..node6 (NOT master, NOT GPU node1). For each worker (k8s-node1..6): sticky_pods (StatefulSet members; pods with local/hostPath PVCs; single-replica critical), local_pv_count, volumeattachments, cnpg_primary (CNPG 'pg-cluster' PRIMARY here? check pod role labels), gpu_workloads, evac_difficulty (easy|medium|hard)+evac_notes.
Cluster-wide: spofs (1 replica AND no PDB); antiaffinity_risks (hard podAntiAffinity / topologySpread DoNotSchedule that becomes UNSATISFIABLE at one fewer worker - check replica counts vs surviving distinct hosts); csi_pinning_note (do Proxmox-CSI PVs pin to a node, or share one host-level topology so they reattach anywhere? check volumeHandle / topology zone/region on the PVs - this decides whether removal STRANDS data); priority_classes_note. NEVER mutate.`;
// ============================================================
phase('Gather');
log('Gather (read-only): PVE host memory, k8s capacity + cluster 30d peak, request/limit right-sizing, reliability topology');
const [host, k8s, usage, topo] = await parallel([
() => agent(HOST_PROMPT, { label: 'gather:pve-host', phase: 'Gather', schema: HOST }),
() => agent(K8S_PROMPT, { label: 'gather:k8s-capacity', phase: 'Gather', schema: K8S }),
() => agent(USAGE_PROMPT, { label: 'gather:rightsize', phase: 'Gather', schema: USAGE }),
() => agent(TOPO_PROMPT, { label: 'gather:reliability', phase: 'Gather', schema: TOPO }),
]);
if (!k8s || !usage) return { error: 'Critical gather agent failed (k8s/usage).', host, k8s, usage, topo };
// ============================================================
phase('Model');
const T = usage.totals;
const workers = k8s.nodes.filter((n) => !n.is_control_plane);
const generalPool = workers.filter((n) => !n.gpu_tainted); // general pods can land here (incl. GPU node if not tainted)
const candidates = workers.filter((n) => !n.is_gpu && !n.is_control_plane); // node2..node6
const clusterPeak = k8s.cluster_peak_30d_mib || 0;
const freeGeneral = (n) => n.allocatable_mib - (T.ds_request_recommended_per_node_mib || 0) - (n.is_gpu ? (T.gpu_request_recommended_mib || 0) : 0);
function evalRemove(removeName) {
const pool = generalPool.filter((n) => n.name !== removeName);
// --- scheduling N-1 (realistic requests): fit reschedulable load even if the largest survivor then fails ---
const frees = pool.map(freeGeneral);
const schedCap = frees.reduce((a, b) => a + b, 0) - (frees.length ? Math.max(...frees) : 0);
const schedNeed = T.reschedulable_request_recommended_mib;
const schedMargin = schedCap - schedNeed;
// --- physical N-1 (actual peak usage): cluster 30d peak must fit on survivors after losing the largest too ---
const survAlloc = pool.map((n) => n.allocatable_mib);
const physCap = survAlloc.reduce((a, b) => a + b, 0) - (survAlloc.length ? Math.max(...survAlloc) : 0);
const physMargin = physCap - clusterPeak;
const t = topo && topo.nodes ? topo.nodes.find((n) => n.name === removeName) : null;
return {
removeName, pool: pool.map((n) => n.name),
sched_capacityN1_mib: Math.round(schedCap), sched_need_mib: Math.round(schedNeed), sched_margin_mib: Math.round(schedMargin), sched_pass: schedMargin >= 0,
phys_capacityN1_mib: Math.round(physCap), cluster_peak_mib: Math.round(clusterPeak), phys_margin_mib: Math.round(physMargin), phys_pass: physMargin >= 0,
pass: schedMargin >= 0 && physMargin >= 0,
host_freed_mib: hostFreedFor(removeName),
evac_difficulty: t ? t.evac_difficulty : 'unknown', cnpg_primary: t ? t.cnpg_primary : false, sticky_pods: t ? t.sticky_pods : [],
};
}
function hostFreedFor(nodeName) {
if (host && host.vms) {
const s = nodeName.replace('k8s-', '');
const vm = host.vms.find((v) => v.name === nodeName || (v.name && v.name.includes(s)));
if (vm) return vm.configured_mib;
}
const n = k8s.nodes.find((x) => x.name === nodeName);
return n ? n.capacity_mib : 0;
}
const evalCandidates = candidates.map((c) => evalRemove(c.name));
const diffRank = { easy: 0, medium: 1, hard: 2, unknown: 3 };
const passing = evalCandidates.filter((c) => c.pass && !c.cnpg_primary)
.sort((a, b) => (diffRank[a.evac_difficulty] - diffRank[b.evac_difficulty]) || (b.phys_margin_mib - a.phys_margin_mib));
const best = passing[0] || null;
const hostOvercommit = host ? { sum_vm_configured_mib: host.sum_vm_configured_mib, host_total_mib: host.host_total_mib, ratio: +(host.sum_vm_configured_mib / host.host_total_mib).toFixed(3), free_mib: host.host_free_mib, available_mib: host.host_available_mib, swap_used_mib: host.swap_used_mib, swap_total_mib: host.swap_total_mib, ksm_saved_mib: host.ksm_saved_mib } : null;
const k8sOvercommit = { cluster_requests_mib: k8s.cluster_requests_mib, cluster_allocatable_mib: k8s.cluster_allocatable_mib, cluster_usage_now_mib: k8s.cluster_usage_now_mib, cluster_peak_30d_mib: clusterPeak, request_ratio: +(k8s.cluster_requests_mib / k8s.cluster_allocatable_mib).toFixed(3), usage_ratio: +(clusterPeak / k8s.cluster_allocatable_mib).toFixed(3) };
log(`Host overcommit ${hostOvercommit ? hostOvercommit.ratio : '?'}x (${G(hostOvercommit && hostOvercommit.free_mib)} free, swap ${G(hostOvercommit && hostOvercommit.swap_used_mib)}/${G(hostOvercommit && hostOvercommit.swap_total_mib)})`);
log(`K8s: requests ${G(k8s.cluster_requests_mib)} / 30d-peak-usage ${G(clusterPeak)} / allocatable ${G(k8s.cluster_allocatable_mib)} -> requests are ${(k8s.cluster_requests_mib / clusterPeak).toFixed(2)}x real peak`);
log(`Request right-sizing: ${G(T.net_request_reclaim_mib)} of over-provisioned requests can be trimmed (${T.count_request_shrink} workloads); ${T.count_limit_raise_oom} workloads are OOM-tight on LIMITS (raise regardless).`);
for (const c of evalCandidates) log(` remove ${c.removeName}: phys-N1 ${c.phys_pass ? 'PASS' : 'FAIL'} (${G(c.phys_margin_mib)}) | sched-N1 ${c.sched_pass ? 'PASS' : 'FAIL'} (${G(c.sched_margin_mib)}) | frees ~${G(c.host_freed_mib)} host | evac ${c.evac_difficulty}${c.cnpg_primary ? ' CNPG-PRIMARY' : ''}`);
log(best ? `Best candidate: ${best.removeName} (phys margin ${G(best.phys_margin_mib)}, frees ~${G(best.host_freed_mib)})` : 'No candidate passes both N-1 tests.');
// ============================================================
phase('Verify');
const headline = best
? `${best.removeName} can be removed while preserving N-1: cluster 30d peak usage ${G(clusterPeak)} fits on survivors-minus-one (${G(best.phys_capacityN1_mib)}); after trimming over-provisioned requests, scheduling also fits (${G(best.sched_margin_mib)} margin). Frees ~${G(best.host_freed_mib)} to the PVE host.`
: `No worker can be removed while preserving N-1 by BOTH physical-usage and scheduling-request models.`;
const verifyData = JSON.stringify({ hostOvercommit, k8sOvercommit, k8s_nodes: k8s.nodes, usage_totals: T, evalCandidates, best, csi_pinning_note: topo ? topo.csi_pinning_note : null, generalPool: generalPool.map((n) => n.name) }, null, 2);
const lenses = [
{ key: 'math', ask: 'Recompute BOTH N-1 models independently. Physical: cluster 30d peak vs (sum survivor allocatable - largest survivor). Scheduling: reschedulable recommended REQUESTS (not limits, not peak) vs (sum survivor freeGeneral - largest). Verify GPU node reserve uses REAL gpu requests, allocatable not capacity, DaemonSets are per-node fixed load. Are pool selection and numbers right?' },
{ key: 'temporal', ask: 'Challenge the 30-DAY peak window and the request shrinks. Could a monthly/quarterly peak exceed cluster_peak_30d (compare a 90d peak)? Are the shrunk REQUESTS safe given each workload keeps a limit above its peak (Burstable)? Name any shrink or any still-tight limit that is reckless.' },
{ key: 'stateful', ask: 'Check the chosen candidate for STRANDED state and drain blockers: CSI PV pinning (do volumes reattach anywhere?), CNPG primary, VolumeAttachment caps, anti-affinity/topologySpread unsatisfiable at one fewer worker, PDBs that block drain (disruptionsAllowed=0). Is removal actually safe, and what drain ORDERING is required?' },
];
const verdicts = (await parallel(lenses.map((l) => () =>
agent(`Adversarial reviewer. Try to REFUTE:\n"${headline}"\n\nLens: ${l.ask}\n\nData (read-only). Verify LIVE: kubectl, Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=...'), ${SSH} '<cmd>'.\n\n${verifyData}\n\nDefault refuted=true if evidence does not clearly hold. Give concrete corrections.`,
{ label: `verify:${l.key}`, phase: 'Verify', schema: VERDICT }))
)).filter(Boolean);
return {
headline,
hostOvercommit, k8sOvercommit,
rightsizing: T,
request_shrinks: usage.request_shrinks,
limit_raises_oom: usage.limit_raises_oom,
spiky_periodic: usage.spiky_periodic,
candidates: evalCandidates,
recommendation: best,
k8s_nodes: k8s.nodes,
host_vms: host ? host.vms : null,
topo_spofs: topo ? topo.spofs : [],
topo_nodes: topo ? topo.nodes : [],
csi_pinning_note: topo ? topo.csi_pinning_note : null,
antiaffinity_risks: topo ? topo.antiaffinity_risks : [],
verdicts,
verdict_summary: `${verdicts.filter((v) => v.refuted).length}/${verdicts.length} reviewers refuted the headline`,
};
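
The candidate selection above (filter on `pass` and `cnpg_primary`, then sort by evacuation difficulty and physical margin) can be sketched in Python. This is a minimal illustration only: the field names come from the snippet, while the node names and numbers are hypothetical sample data.

```python
# Rank order mirrors diffRank in the snippet above: lower is easier to evacuate.
DIFF_RANK = {"easy": 0, "medium": 1, "hard": 2, "unknown": 3}


def pick_best(candidates):
    """Return the preferred removal candidate, or None if none qualifies."""
    # Keep only candidates that pass and do not host a CNPG primary.
    passing = [c for c in candidates if c["pass"] and not c["cnpg_primary"]]
    # Easiest evacuation first; ties broken by the larger physical margin.
    passing.sort(key=lambda c: (DIFF_RANK[c["evac_difficulty"]], -c["phys_margin_mib"]))
    return passing[0] if passing else None


# Hypothetical sample data, not taken from the real cluster.
candidates = [
    {"removeName": "node-a", "pass": True, "cnpg_primary": False,
     "evac_difficulty": "medium", "phys_margin_mib": 9000},
    {"removeName": "node-b", "pass": True, "cnpg_primary": False,
     "evac_difficulty": "easy", "phys_margin_mib": 2000},
    {"removeName": "node-c", "pass": True, "cnpg_primary": True,
     "evac_difficulty": "easy", "phys_margin_mib": 12000},
    {"removeName": "node-d", "pass": False, "cnpg_primary": False,
     "evac_difficulty": "easy", "phys_margin_mib": 500},
    {"removeName": "node-e", "pass": True, "cnpg_primary": False,
     "evac_difficulty": "easy", "phys_margin_mib": 3500},
]

best = pick_best(candidates)
print(best["removeName"] if best else "No candidate passes both N-1 tests.")
```

Difficulty dominates margin in this ordering, so an `easy` node with a small margin is preferred over a `medium` node with a large one.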

.gitattributes

@ -4,12 +4,3 @@
*.tfvars filter=git-crypt diff=git-crypt
secrets/** filter=git-crypt diff=git-crypt
stacks/**/secrets/** filter=git-crypt diff=git-crypt
# Kubeconfigs / cluster credentials — encrypt at rest so a force-added or renamed
# commit can't push plaintext to the public GitHub mirror. Belt-and-suspenders to
# the .gitignore rules above; `.config` is explicit because that is exactly the
# name an admin kubeconfig once leaked under (GitGuardian, 2026-07-02).
.config filter=git-crypt diff=git-crypt
kubeconfig filter=git-crypt diff=git-crypt
*.kubeconfig filter=git-crypt diff=git-crypt
admin.conf filter=git-crypt diff=git-crypt


@ -1,39 +0,0 @@
name: Build Custom Authentik Image
# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
# Thin SLOW-1a overlay over the official authentik server (narrows the login
# identification stage's select_subclasses() to the login-capable source subtypes;
# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
# in modules/authentik/values.yaml together.
on:
push:
branches: [master]
paths:
- 'stacks/authentik/Dockerfile'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/authentik
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
ghcr.io/viktorbarzin/authentik-server:latest


@ -1,39 +0,0 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/chrome/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/chrome
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-browser:latest
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}


@ -1,42 +0,0 @@
name: Build excalidraw-library
# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
on:
push:
branches: [master]
paths:
- 'stacks/excalidraw/project/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: '1.21'
- run: go test ./...
working-directory: stacks/excalidraw/project
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/excalidraw/project
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/excalidraw-library:latest
ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}


@ -1,39 +0,0 @@
name: Build valia-sites-sync
# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
# Security note: no untrusted event inputs are interpolated anywhere (only
# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
# build-*.yml workflows in this repo).
on:
push:
branches: [master]
paths:
- 'stacks/valia-sites/sync-image/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/valia-sites/sync-image
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/valia-sites-sync:latest
ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}

.gitignore

@ -71,15 +71,8 @@ stacks/*/cloudflare_provider.tf
stacks/*/tiers.tf
stacks/*/terragrunt_rendered.json
# Kubernetes config / cluster credentials (sensitive) — never commit in plaintext.
# `config` alone missed the dotfile form: an admin kubeconfig once leaked to the
# public mirror as `.config` (GitGuardian, 2026-07-02). Cover the common names.
# Kubernetes config (sensitive)
config
.config
kubeconfig
*.kubeconfig
admin.conf
.kube/
# Node.js (not part of infra)
node_modules/
@ -117,9 +110,3 @@ terraform.tfstate.backup
# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
# secrets; created by terraform state ops. The patterns above miss the timestamped form.
terraform.tfstate.*.backup
# Python test artifacts (pytest bytecode cache) — e.g. from
# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
__pycache__/
*.pyc
.pytest_cache/


@ -19,7 +19,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
attempts: 5
backoff: 10s
@ -65,21 +64,6 @@ steps:
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
# guard both run `terragrunt apply` on every push and race each other for
# the per-stack PG state lock — the dominant cause of the "Error acquiring
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
# env var set) still applies, preserving prior behaviour.
- |
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
exit 0
fi
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -228,40 +212,23 @@ steps:
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
# (so the app-stack detector still excludes it) but skipped here.
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
fi
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
# ("Error acquiring the state lock" / "already locked"). The PG case
# was previously counted as a failure — the #1 source of false reds.
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Transient: provider-registry download timeout / Vault 5xx → bounded
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
done
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
@ -274,27 +241,22 @@ steps:
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
fi
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
done
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state
@ -324,8 +286,13 @@ steps:
fi
GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master
# (No Slack post on success — Viktor 2026-07-02: CI notifies on FAILED
# runs only; the notify-failure step below covers those.)
# ── Slack notification ──
- |
PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
"$SLACK_WEBHOOK" || true
# Slack on failure (runs even if apply step fails)
- name: notify-failure
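
The outcome handling in the retrying variant of the apply loops above (lock contention is skipped, a short list of transient errors is retried up to three attempts, everything else fails) can be sketched as a pure function. This is a Python illustration of the decision order only, not the pipeline's own code: the regular expressions are copied from the `grep -qE` patterns in the diff, and the function name and return labels are hypothetical.

```python
import re

# Patterns copied from the grep -qE expressions in the pipeline script.
LOCK_RE = re.compile(r"is locked by|Error acquiring the state lock|already locked")
TRANSIENT_RE = re.compile(
    r"Failed to install provider"
    r"|Client\.Timeout exceeded while awaiting headers"
    r"|error reading from Vault.*Code: 5[0-9][0-9]"
)
MAX_ATTEMPTS = 3  # matches the "attempt N/3" message in the script


def classify(exit_code, output, attempt):
    """Map one apply attempt to 'ok', 'skipped', 'retry' or 'failed'."""
    if exit_code == 0:
        return "ok"
    # Lock contention is checked first: another run holds the state, so skip.
    if LOCK_RE.search(output):
        return "skipped"
    # Transient errors are retried only while attempts remain.
    if attempt < MAX_ATTEMPTS and TRANSIENT_RE.search(output):
        return "retry"
    # Config errors, helm timeouts and exhausted retries all fail fast.
    return "failed"


print(classify(1, "Error: Error acquiring the state lock", 1))
print(classify(1, "Failed to install provider", 3))
```

Checking the lock patterns before the transient patterns means a locked stack is never retried, and only a `failed` result adds the stack to the failed-stack list.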


@ -9,7 +9,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -85,13 +84,6 @@ steps:
stack=$(basename "$stack_dir")
[ -f "$stack_dir/terragrunt.hcl" ] || continue
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
# (2026-06-27)
[ "$stack" = "vault" ] && continue
echo -n "[$stack] planning... "
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
EXIT=$?
@ -147,30 +139,13 @@ steps:
echo "Drift: ${DRIFTED:-none}"
echo "Errors: ${ERRORS:-none}"
# ── Slack only when something is WRONG (drift or errors) ──
# All-clean runs are silent (Viktor 2026-07-02: CI notifies on
# failed/actionable runs only; clean is the daily normal).
# ── Slack alert if drift found ──
if [ -n "$DRIFTED" ]; then
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":warning: Drift detected in:${DRIFTED}\nClean: ${CLEAN} stacks. Errors:${ERRORS:-none}\"}" \
"$SLACK_WEBHOOK" || true
elif [ -n "$ERRORS" ]; then
else
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Drift detection had errors: ${ERRORS} (clean: ${CLEAN})\"}" \
--data "{\"channel\":\"general\",\"text\":\":white_check_mark: Drift detection: all ${CLEAN} stacks clean${ERRORS:+. Errors: $ERRORS}\"}" \
"$SLACK_WEBHOOK" || true
fi
# Hard-failure catch: the in-script posts above never run if the step
# itself crashes early — this step is the only signal for that case.
- name: notify-failure
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Drift-detection pipeline FAILED (crashed before reporting)\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [failure]
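
The quiet-on-clean notification rule in the drift-detection hunk above (drift posts a warning, errors without drift post a red alert, an all-clean run posts nothing) can be sketched as follows. This is a hedged Python illustration: the message texts follow the diff, while the function name and the use of `None` for "stay silent" are assumptions made for the example.

```python
def drift_message(drifted, errors, clean):
    """Return the Slack text for a drift run, or None when nothing is wrong."""
    if drifted:
        # Drift takes precedence; errors are appended, defaulting to "none".
        return (f":warning: Drift detected in:{drifted}\n"
                f"Clean: {clean} stacks. Errors:{errors or 'none'}")
    if errors:
        return f":red_circle: Drift detection had errors: {errors} (clean: {clean})"
    # All clean: no message at all.
    return None


print(drift_message("", "", 42))
print(drift_message("", " vault", 41))
```

The separate `notify-failure` step covers the remaining case, where the step crashes before this decision is ever reached.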


@ -5,7 +5,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
steps:


@ -11,7 +11,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 5
steps:
@ -28,7 +27,6 @@ steps:
from_secret: slack_webhook
commands:
- apk add --no-cache curl
- "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \":red_circle: Post-mortem TODO pipeline FAILED\"}' || true"
- "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \"Post-mortem TODO pipeline completed\"}' || true"
when:
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
- status: [failure]
- status: [success, failure]


@ -5,7 +5,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s


@ -23,7 +23,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -58,8 +57,7 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: PVE /etc/exports sync FAILED\"}" \
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
status: [failure]
status: [success, failure]


@ -38,7 +38,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -151,8 +150,7 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Registry config sync on 10.0.20.10 FAILED\"}" \
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
status: [failure]
status: [success, failure]


@ -6,7 +6,6 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s
@ -71,11 +70,10 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: TLS certificate renewal FAILED\"}" \
--data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: TLS certificate renewal ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
# Failure-only (Viktor 2026-07-02): successful renewals are routine.
status: [failure]
status: [success, failure]


@ -9,7 +9,7 @@
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
## Key Paths
- `stacks/<service>/main.tf` — service definition
- `stacks/platform/modules/<service>/` — core infra modules
- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`)
- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`)
- `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
@ -273,11 +273,8 @@ To land a finished change from such a clone:
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI on push — or, with apply access, applied locally yourself
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
fine, but the change must always be committed here, never applied
uncommitted. Verify the live result with the user's read-only kubectl before
saying "it's live".
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a
@ -292,7 +289,6 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
```
## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. 
Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.


@ -56,28 +56,6 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
_Avoid_: bare "user", "tenant".
### GPU sharing
**GPU slice**:
One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
**GPU memory budget**:
The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
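
As a rough illustration of the schedule-time accounting described in the **GPU memory budget** entry, the following Python sketch models a non-overcommittable extended resource. It is not the Kubernetes scheduler's code: the 14000 MiB total is the approximate figure from the entry, and the tenant limits are hypothetical.

```python
GPUMEM = "viktorbarzin.me/gpumem"  # extended resource name from the entry
NODE_TOTAL_MIB = 14000             # approximate advertised total from the entry


def admit(running_limits_mib, new_limit_mib, node_total_mib=NODE_TOTAL_MIB):
    """Return 'Scheduled' if the declared limits still fit, else 'Pending'."""
    # Extended resources cannot be overcommitted: declared limits must sum
    # to no more than what the node advertises.
    if sum(running_limits_mib) + new_limit_mib <= node_total_mib:
        return "Scheduled"
    return "Pending"


# Hypothetical tenants already on the GPU node, with declared limits in MiB.
running = [6000, 4000]
print(admit(running, 4000))  # exactly fills the advertised total
print(admit(running, 4001))  # one MiB over the advertised total
```

The check only ever sees declared limits, never actual VRAM usage, which is why the entry calls it a schedule-time reservation rather than a runtime cap.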
**GPU watchdog**:
The `gpu-vram-watchdog` CronJob (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (`gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
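
The **GPU watchdog** rule can be sketched in Python under stated assumptions. The entry says the watchdog acts only when free VRAM is below a floor and recycles the biggest tenant that is over its declared budget; here "biggest" is assumed to mean highest current usage, and the tenant names, numbers and floor are hypothetical.

```python
def pick_victim(free_vram_mib, floor_mib, tenants):
    """Return the name of the tenant to recycle, or None if no action is due."""
    # No pressure: bursting into genuine slack is allowed.
    if free_vram_mib >= floor_mib:
        return None
    # Only tenants using more than their declared budget are eligible.
    over_budget = [t for t in tenants if t["used_mib"] > t["budget_mib"]]
    if not over_budget:
        return None
    # Assumption: "biggest" is read as the highest current usage.
    return max(over_budget, key=lambda t: t["used_mib"])["name"]


# Hypothetical tenants with actual usage and declared budget in MiB.
tenants = [
    {"name": "tenant-a", "used_mib": 9000, "budget_mib": 4000},
    {"name": "tenant-b", "used_mib": 3000, "budget_mib": 2500},
    {"name": "tenant-c", "used_mib": 2000, "budget_mib": 6000},
]

print(pick_victim(free_vram_mib=300, floor_mib=1000, tenants=tenants))
print(pick_victim(free_vram_mib=5000, floor_mib=1000, tenants=tenants))
```

A tenant within its budget is never selected, even under pressure, which matches the entry's framing of the budget as a contract.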
**GPU demand-gate**:
The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
**gpu-workload priority**:
The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy — the exclude list (`tts`) drops to `tier-2-gpu` (600,000) so it loses node-pressure eviction first. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
### Workstation (multi-user devvm)
**devvm**:
@ -118,14 +96,6 @@ _Avoid_: "external", "outside".
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Segment**:
One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
**CCTV segment**:
The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
_Avoid_: "camera VLAN", "CCTV LAN".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -147,17 +117,9 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
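
A minimal Python sketch of the attribution rule in the **Service identity** entry: the namespace is the identity, refined by a label only in the multi-Service namespaces the entry names. The label key `service-identity` is the example given in the entry; the `namespace/label` output format and the function name are assumptions made for illustration.

```python
# Multi-Service namespaces named in the entry.
MULTI_SERVICE_NAMESPACES = {"monitoring", "kube-system", "dbaas"}
IDENTITY_LABEL = "service-identity"  # example label key from the entry


def service_identity(namespace, labels):
    """Return the attribution key for a flow endpoint."""
    if namespace in MULTI_SERVICE_NAMESPACES:
        refined = labels.get(IDENTITY_LABEL)
        if refined:
            # Assumed format: namespace plus the refining label value.
            return f"{namespace}/{refined}"
    # Default: one Service per namespace, so the namespace is the identity.
    return namespace


print(service_identity("immich", {"app": "immich-server"}))
print(service_identity("monitoring", {"service-identity": "prometheus"}))
```

The ServiceAccount is deliberately not consulted, in line with the entry's note that it is a deferred enforcement principal rather than the attribution key.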
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
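
To make the "upsert the namespace-pair edge set" idea concrete, here is an in-memory sketch of the upsert semantics only. It assumes an edge is keyed by source namespace, destination namespace and action, and that first-seen/last-seen timestamps are tracked; the type and field names are hypothetical, and the real aggregator writes to the `goldmane_edges` database rather than a Go map.

```go
package main

import "fmt"

// edgeKey identifies one directed namespace-pair edge (assumed key shape).
type edgeKey struct {
	Src, Dst, Action string
}

// edge holds first/last-seen Unix timestamps and an observation count.
type edge struct {
	FirstSeen, LastSeen int64
	Flows               int
}

// edgeSet is the deduplicated set of edges, one entry per key.
type edgeSet map[edgeKey]*edge

// upsert inserts a new edge or widens the seen-window of an existing one.
func (s edgeSet) upsert(src, dst, action string, seen int64) {
	k := edgeKey{Src: src, Dst: dst, Action: action}
	e, ok := s[k]
	if !ok {
		s[k] = &edge{FirstSeen: seen, LastSeen: seen, Flows: 1}
		return
	}
	if seen < e.FirstSeen {
		e.FirstSeen = seen
	}
	if seen > e.LastSeen {
		e.LastSeen = seen
	}
	e.Flows++
}

func main() {
	s := edgeSet{}
	s.upsert("frontend", "dbaas", "allow", 100)
	s.upsert("frontend", "dbaas", "allow", 160)
	s.upsert("frontend", "monitoring", "deny", 170)
	fmt.Println(len(s), s[edgeKey{Src: "frontend", Dst: "dbaas", Action: "allow"}].Flows)
}
```

Because repeated flows collapse into one row per edge, the durable trail stays small even though the ring buffer it is fed from turns over roughly hourly.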
### Storage
**proxmox-lvm-encrypted**:
@ -237,20 +199,6 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
### Externally-authored sites
**Valia site**:
A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
**Content folder**:
The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
**Entry file**:
The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
_Avoid_: asking Valia to rename her files to fit hosting conventions.
## Relationships
- A **Service** is defined by exactly one **Stack** (**flat** or wrapping a **Stack-local module**), which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -262,7 +210,6 @@ _Avoid_: asking Valia to rename her files to fit hosting conventions.
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.
## Example dialogue


@ -1,287 +1,2 @@
# homelab
`homelab` is the unified, agent-facing CLI for operating this homelab — one
composable, JSON-capable surface for the operations agents run over and over,
discovered progressively at runtime. It is grown **in place** from this
directory (the former `infra-cli`), and the legacy webhook use-cases still work
(see below).
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
## Usage
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
### v0.1 verbs — the infra inner-loop
| Command | Tier | What it does |
|---|---|---|
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
| `release <kind>:<name>` | write | release a presence claim |
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
| `tf validate <stack>` | read | `scripts/tg validate` |
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
### v0.2 verbs — Kubernetes
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
ambient kubeconfig.
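
The resolver defaults described above could look roughly like the following. This is a sketch under stated assumptions: the `overrides` type and the `resolve`/`logsArgs` helpers are hypothetical names, only the `-n` and `--pod` overrides are modelled, and `grafana` is just an example app.

```go
package main

import "fmt"

// overrides mirrors the -n / --pod flags; an empty field means "use the default".
type overrides struct {
	Namespace string
	Pod       string
}

// resolve maps an app name to a namespace and a kubectl target. The namespace
// defaults to the app name and the target to deploy/<app>, leaving pod
// selection to kubectl.
func resolve(app string, o overrides) (namespace, target string) {
	namespace = app
	if o.Namespace != "" {
		namespace = o.Namespace
	}
	target = "deploy/" + app
	if o.Pod != "" {
		target = "pod/" + o.Pod
	}
	return namespace, target
}

// logsArgs builds the kubectl argv for the logs verb with its default tail of 200.
func logsArgs(app string, o overrides) []string {
	ns, target := resolve(app, o)
	return []string{"-n", ns, "logs", target, "--tail=200"}
}

func main() {
	fmt.Println(logsArgs("grafana", overrides{}))
	fmt.Println(logsArgs("prometheus", overrides{Namespace: "monitoring"}))
}
```
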
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
`tf` resolves the stack dir by walking up from cwd to the infra root and
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
operations in the encrypted infra repo.
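
A minimal sketch of the "walk up from cwd" resolution, assuming stacks live at `<infra root>/stacks/<name>` (the layout implied by paths such as `stacks/openclaw/ha_tokens.tf`). The helper `stackFromPath` is hypothetical and works on the path text alone; the real CLI may detect the root from a marker file instead.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// stackFromPath walks up from cwd until it finds a directory whose parent is
// named "stacks", then reports the infra root and the stack name.
func stackFromPath(cwd string) (root, stack string, ok bool) {
	dir := filepath.Clean(cwd)
	for {
		parent := filepath.Dir(dir)
		if parent == dir {
			// Reached the top of the path without finding a stacks/ parent.
			return "", "", false
		}
		if filepath.Base(parent) == "stacks" {
			return filepath.Dir(parent), filepath.Base(dir), true
		}
		dir = parent
	}
}

func main() {
	root, stack, ok := stackFromPath("/home/user/infra/stacks/openclaw/files")
	fmt.Println(root, stack, ok)
}
```

Resolving from any depth inside a stack is what lets `homelab tf plan` run without naming the stack explicitly.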
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
reads / prompt writes; v0.1 allows everything and relies on existing gates
(permission mode, presence claims, plan approval).
### v0.3 verbs — memory
A thin HTTP client over the **claude-memory** service (the same backend the
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
ingress). Because it hits the HTTP API directly, it **works even when the MCP
frontend is down**.
| Command | Tier | What it does |
|---|---|---|
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
| `memory list [--category --tag --limit]` | read | recent memories |
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
| `memory secret <id>` | read | reveal a sensitive memory's content |
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
| `memory delete <id>` | write | delete a memory |
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up**;
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
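
The `deploy wait` ordering (image first, rollout status second) can be sketched as a small polling loop. This is an illustration only: `waitForImage` and `imageMatches` are hypothetical names, the image getter is injected so the sketch runs without a cluster, and "the image reference contains the sha" is an assumed matching rule.

```go
package main

import (
	"fmt"
	"strings"
)

// imageMatches reports whether the deployment's image reference carries the
// expected commit sha (assumed convention: the sha appears in the tag).
func imageMatches(image, sha string) bool {
	return sha != "" && strings.Contains(image, sha)
}

// waitForImage polls getImage until the image matches sha or attempts run out.
// Only after it returns nil is a rollout-status check meaningful; before that,
// rollout status would still describe the old ReplicaSet.
func waitForImage(getImage func() (string, error), sha string, attempts int, pause func()) error {
	for i := 0; i < attempts; i++ {
		img, err := getImage()
		if err == nil && imageMatches(img, sha) {
			return nil
		}
		pause()
	}
	return fmt.Errorf("image never matched %q after %d attempts", sha, attempts)
}

func main() {
	// Fake getter standing in for a kubectl/API lookup of the deployment image.
	images := []string{"ghcr.io/example/app:abc1234", "ghcr.io/example/app:def5678"}
	i := 0
	get := func() (string, error) {
		img := images[i]
		if i < len(images)-1 {
			i++
		}
		return img, nil
	}
	err := waitForImage(get, "def5678", 5, func() {})
	fmt.Println("matched:", err == nil)
}
```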
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
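
For readers unfamiliar with what "the Go form of `curl --resolve`" means, one common way to get that behaviour is a custom `DialContext` that ignores the address derived from the URL and always dials a fixed one. The sketch below assumes that approach; `pinnedClient`, `fetchVia` and `demo` are hypothetical names, and a local test server stands in for the load balancer so the example runs anywhere.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinnedClient returns a client that sends every connection to lbAddr
// (host:port). The URL's host is still used for the Host header and TLS SNI.
func pinnedClient(lbAddr string) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, lbAddr)
			},
		},
	}
}

// fetchVia requests url through a client pinned to lbAddr.
func fetchVia(lbAddr, url string) (int, string, error) {
	resp, err := pinnedClient(lbAddr).Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// demo uses a local server that echoes the Host header in place of the LB.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()
	_, body, err := fetchVia(srv.Listener.Addr().String(), "http://grafana.example.lan/status")
	return body, err
}

func main() {
	host, err := demo()
	fmt.Println(host, err)
}
```

Against the real cluster the pinned address would be the Traefik LB (`10.0.20.203:443` in the `curl` example above) and the URL an `https://` ingress host.
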
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home**:
the privacy-preserving answer to "what does the team use."
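
As an illustration of how little each telemetry line carries, here is a sketch that builds the JSON body for Loki's push endpoint (`/loki/api/v1/push`). The `job` value `homelab-usage` comes from the `usage top` query; the function names, the example user and verb, and the choice to return the bytes rather than send them are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// lokiStream and lokiPush follow the JSON shape of Loki's push API.
type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

// telemetryEnabled honours the HOMELAB_TELEMETRY=0 opt-out.
func telemetryEnabled(envValue string) bool {
	return envValue != "0"
}

// usagePayload builds one log line: labels carry job/user/verb, the line
// carries only the exit code and CLI version. No args, paths, or flags.
func usagePayload(user, verb string, exit int, version string, nowNanos int64) ([]byte, error) {
	line := fmt.Sprintf("exit=%d ver=%s", exit, version)
	return json.Marshal(lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": "homelab-usage", "user": user, "verb": verb},
		Values: [][2]string{{strconv.FormatInt(nowNanos, 10), line}},
	}}})
}

func main() {
	if !telemetryEnabled("1") {
		return
	}
	b, err := usagePayload("alice", "k8s logs", 0, "v0.12.0", 1700000000000000000)
	fmt.Println(string(b), err)
}
```

A real sender would POST this body with a short timeout and discard any error, so a slow or absent Loki never affects the command.
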
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha` **MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
filters render to a single safe `SELECT` (namespace values validated to the k8s
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
| Command | Tier | What it does |
| --- | --- | --- |
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
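
The "single safe `SELECT`" idea can be sketched as validate-then-render. Everything schema-related here is an assumption: the table name `edges` and the columns `src_ns`/`dst_ns` are hypothetical (only `action='deny'` and the default limit of 200 come from the table above), and the regular expression is the standard Kubernetes DNS-1123 label pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsPattern is the DNS-1123 label shape Kubernetes requires of namespace names.
var nsPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace rejects anything outside the k8s name charset, which also
// rules out quotes, spaces, and semicolons before a value reaches SQL.
func validNamespace(ns string) bool {
	return len(ns) <= 63 && nsPattern.MatchString(ns)
}

// edgesQuery renders the filters into one SELECT. Table and column names are
// illustrative only.
func edgesQuery(src, dst string, deniedOnly bool, limit int) (string, error) {
	var where []string
	for _, f := range []struct{ col, val string }{{"src_ns", src}, {"dst_ns", dst}} {
		if f.val == "" {
			continue
		}
		if !validNamespace(f.val) {
			return "", fmt.Errorf("invalid namespace %q", f.val)
		}
		where = append(where, fmt.Sprintf("%s = '%s'", f.col, f.val))
	}
	if deniedOnly {
		where = append(where, "action = 'deny'")
	}
	q := "SELECT src_ns, dst_ns, action FROM edges"
	if len(where) > 0 {
		q += " WHERE " + strings.Join(where, " AND ")
	}
	if limit <= 0 {
		limit = 200
	}
	return q + fmt.Sprintf(" LIMIT %d", limit), nil
}

func main() {
	fmt.Println(edgesQuery("monitoring", "", true, 0))
	fmt.Println(edgesQuery("x'; DROP TABLE edges; --", "", false, 0))
}
```

Validating against the name charset is what makes inlining the value acceptable here, since the query is handed to an exec'd client rather than sent with bind parameters.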
### v0.10 — `vault get --all` (browse every field)
`vault get <name> --all` returns the **whole item** as a normalized JSON object,
so an agent can discover and read fields the single-field `--field` allowlist
can't reach — notably arbitrary **custom fields**.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
Shape notes: present standard fields only (empty ones omitted); `fields` is a
custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
The TOTP **seed is never emitted**: `totp` is a presence flag (`true`), so the
only seed-derived path stays the specially-audited `vault code`. Like
`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
it (`homelab vault get <name> --all | jq`).
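
The shape notes can be read as a small normalization function. The sketch below is only that: the `item` and `customField` input types are hypothetical stand-ins (the real Bitwarden item JSON is richer), while the output rules (omit empty standard fields, last-wins custom fields, skip `linked`, `totp` as a presence flag) follow the notes above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// customField and item are simplified, hypothetical inputs.
type customField struct {
	Name, Value string
	Linked      bool
}

type item struct {
	Name, Username, Password, TOTPSeed, Notes string
	URIs                                      []string
	Fields                                    []customField
}

// normalize builds the --all object: present standard fields only, custom
// fields as a name→value map, and totp reduced to a boolean presence flag.
func normalize(it item) map[string]any {
	out := map[string]any{"name": it.Name}
	if it.Username != "" {
		out["username"] = it.Username
	}
	if it.Password != "" {
		out["password"] = it.Password
	}
	if len(it.URIs) > 0 {
		out["uris"] = it.URIs
	}
	if it.TOTPSeed != "" {
		out["totp"] = true // the seed itself is never copied into the output
	}
	if it.Notes != "" {
		out["notes"] = it.Notes
	}
	custom := map[string]string{}
	for _, f := range it.Fields {
		if f.Linked {
			continue
		}
		custom[f.Name] = f.Value // duplicate names: the last one wins
	}
	if len(custom) > 0 {
		out["fields"] = custom
	}
	return out
}

func main() {
	b, _ := json.Marshal(normalize(item{
		Name:     "router",
		Password: "hunter2",
		TOTPSeed: "JBSWY3DPEHPK3PXP",
		Fields:   []customField{{Name: "token", Value: "old"}, {Name: "token", Value: "new"}},
	}))
	fmt.Println(string(b))
}
```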
### v0.10.1 — reads `bw sync` first (always fresh)
Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
sync` when opening its session, so it reflects the latest server-side values.
`bw unlock` only decrypts the *local* cache, so without this a persisted
(already-logged-in) session served stale data — a password changed in the web
vault wouldn't show up until the next login. The sync is **best-effort**: a
transient failure warns on stderr and falls back to the cached vault rather than
failing the read.
### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
its own path). Access is whatever your policy grants. Writes are merge-only:
a replacing `put` and `delete` are out of scope, so use the raw `vault` CLI for those.
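
The merge-only write semantics amount to read, set one key, write back. A minimal sketch of the middle step, with a hypothetical helper name and example keys:

```go
package main

import "fmt"

// mergeKey returns a copy of the existing secret data with one key set.
// Sibling keys are preserved; a nil map (path does not exist yet) yields a
// new single-key secret.
func mergeKey(existing map[string]string, key, value string) map[string]string {
	merged := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		merged[k] = v
	}
	merged[key] = value
	return merged
}

func main() {
	current := map[string]string{"api_key": "abc", "webhook": "https://example.invalid/hook"}
	fmt.Println(mergeKey(current, "api_key", "rotated"))
	fmt.Println(mergeKey(nil, "api_key", "first"))
}
```

Copying rather than mutating keeps the previously read data intact, which is convenient if the write has to be retried.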
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
## Design
See `infra/docs/adr/0004` through `0013` for the architecture decisions.
# What is this?
This is a CLI that manipulates files in the Terraform repo, then commits and pushes them.


@ -1 +0,0 @@
v0.12.0


@ -1,388 +0,0 @@
package main
import (
_ "embed"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"os/signal"
"path/filepath"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60
const (
chromeServiceNamespace = "chrome-service"
chromeServiceName = "chrome-service"
chromeServiceCDPPort = 9222
)
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string
// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
mode string // "run" | "open"
script string // path to the user Playwright script (run mode)
url string // initial URL (run: optional; open: required positional)
sharedCtx bool // use the warmed persistent profile instead of a fresh context
keepOpen bool // leave the created context/pages open on exit
port int // explicit local port for the forward (0 = auto)
timeout int // CDP readiness timeout, seconds
help bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
// A space-separated flag with no following value is an error rather than
// being silently ignored.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
	var positionals []string
	atoi := func(s, flag string) (int, error) {
		n, err := strconv.Atoi(s)
		if err != nil {
			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
		}
		return n, nil
	}
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "-h" || a == "--help":
			o.help = true
		case a == "--shared-context":
			o.sharedCtx = true
		case a == "--keep-open":
			o.keepOpen = true
		case a == "--url":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--url expects a value")
			}
			o.url = args[i+1]
			i++
		case strings.HasPrefix(a, "--url="):
			o.url = strings.TrimPrefix(a, "--url=")
		case a == "--port":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--port expects a value")
			}
			n, err := atoi(args[i+1], "--port")
			if err != nil {
				return o, err
			}
			o.port = n
			i++
		case strings.HasPrefix(a, "--port="):
			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
			if err != nil {
				return o, err
			}
			o.port = n
		case a == "--timeout":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--timeout expects a value")
			}
			n, err := atoi(args[i+1], "--timeout")
			if err != nil {
				return o, err
			}
			o.timeout = n
			i++
		case strings.HasPrefix(a, "--timeout="):
			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
			if err != nil {
				return o, err
			}
			o.timeout = n
		case strings.HasPrefix(a, "-"):
			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
		default:
			positionals = append(positionals, a)
		}
	}
	if o.help {
		return o, nil
	}
	switch mode {
	case "run":
		if len(positionals) == 0 {
			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
		}
		o.script = positionals[0]
	case "open":
		if len(positionals) == 0 {
			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
		}
		o.url = positionals[0]
	}
	return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, ~a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	client := &http.Client{Timeout: 3 * time.Second}
	var lastErr error
	for time.Now().Before(deadline) {
		resp, err := client.Get(cdpURL + "/json/version")
		if err != nil {
			lastErr = err
			time.Sleep(300 * time.Millisecond)
			continue
		}
		body, rerr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if rerr != nil {
			// A truncated body is treated like any other transient failure.
			lastErr = rerr
			time.Sleep(300 * time.Millisecond)
			continue
		}
		browser, healthy, herr := cdpHealthy(body)
		if herr != nil {
			lastErr = herr
			time.Sleep(300 * time.Millisecond)
			continue
		}
		if !healthy {
			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
		}
		return browser, nil
	}
	// Always report the timeout itself, keeping the last underlying error for context.
	if lastErr == nil {
		return "", fmt.Errorf("timed out after %s", timeout)
	}
	return "", fmt.Errorf("timed out after %s: %w", timeout, lastErr)
}
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}


@ -1,106 +0,0 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});


@ -1,54 +0,0 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const perms = window.navigator.permissions;
const origQuery = perms && perms.query;
if (origQuery) {
// Invoke the original with `this` bound to the Permissions object; an unbound
// call to the native method fails with "Illegal invocation".
perms.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery.call(perms, parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();


@ -1,117 +0,0 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
return `homelab browser: drive the cluster's HEADFUL Chrome (anti-bot) over CDP

The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
injects the same stealth.js the in-cluster callers use, and runs your script.

USAGE
  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
  homelab browser open <url> [--shared-context] [--timeout S]

WHEN TO USE THIS: escalation only; DEFAULT to the headless/MCP browser
  Default to the Playwright MCP / headless browser for ALL routine browsing and
  automation: it's interactive (snapshot per step), fast to start, isolated.
  Reach for THIS command ONLY when headless is demonstrably blocked: a site
  LOADS fine but a gated action FAILS or HANGS (a submit/login/checkout spins
  forever), or ONE request errors while its siblings 200. That is the signature
  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
  disable-devtool traps). This command presents as a real Chrome and usually
  succeeds first try, but it's the shared cluster browser (slower startup, one
  batch run, no per-step feedback), so it's the escalation path, never the default.

ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
  ERR_FILE_NOT_FOUND (-6)     request intercepted/resolved locally by the
                              automation layer, NOT a network/egress problem.
                              (This is what silently broke the headless submit.)
  ERR_CONNECTION_REFUSED /    real egress failure (DNS/route/firewall). These also
  ERR_TIMED_OUT /             break the initial page load; if the page loaded,
  ERR_NAME_NOT_RESOLVED       egress is fine and the cause is elsewhere.
  one endpoint 500s while     server-side bot rejection of the automation, not
  its siblings 200            your payload.

HABITS
  - Inspect the network panel BEFORE retrying a deterministic form; a blind
    retry just repeats the same silent failure.
  - Don't park a half-filled multi-step form across a user pause: the session
    can expire; re-run the whole flow from this command in one shot.
  - Uploads stream over CDP via setInputFiles from THIS host, so no chmod/staging
    of $HOME is needed; just point setInputFiles at a local path.

CONTEXT
  Default: a FRESH incognito context, closed on exit; safe for the shared
  browser and concurrent callers (e.g. tripit). Your script does its own login.
  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.

SCRIPT CONTRACT (run mode)
  Your file's body runs with page, context, browser and log() already in scope
  (top-level await allowed). Return a value to print it. Example flow.js:

    await page.goto('https://portal.example.com/login');
    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
    await page.click('button[type=submit]');
    await page.waitForURL('**/dashboard');
    return 'logged in: ' + page.url();

  Run it: homelab browser run flow.js

NOTES
  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
  - The port-forward is always torn down, on success and on error.
`
}
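
The command table above keys each verb by a `Path` slice, and `browser` coexists with `browser run` and `browser open`. The dispatcher itself is not part of this diff. The sketch below shows one way such a table could be resolved, by longest matching path prefix. The `Tier` definition and the `match` and `hasPrefix` helpers are assumptions made for illustration, not the CLI's actual code.

```go
package main

import "fmt"

// Tier and Command mirror the fields used in the table above; the concrete
// definitions here are assumed, since the originals are not shown.
type Tier int

const (
	TierRead Tier = iota
	TierWrite
)

type Command struct {
	Path    []string
	Tier    Tier
	Summary string
	Run     func(args []string) error
}

// hasPrefix reports whether argv starts with every element of path.
func hasPrefix(argv, path []string) bool {
	if len(path) > len(argv) {
		return false
	}
	for i := range path {
		if argv[i] != path[i] {
			return false
		}
	}
	return true
}

// match (hypothetical) returns the command with the longest Path that
// prefixes argv, together with the arguments left over for its Run func.
func match(cmds []Command, argv []string) (*Command, []string) {
	var best *Command
	for i := range cmds {
		if hasPrefix(argv, cmds[i].Path) && (best == nil || len(cmds[i].Path) > len(best.Path)) {
			best = &cmds[i]
		}
	}
	if best == nil {
		return nil, argv
	}
	return best, argv[len(best.Path):]
}

func main() {
	cmds := []Command{
		{Path: []string{"browser"}, Tier: TierRead, Summary: "top-level help"},
		{Path: []string{"browser", "run"}, Tier: TierWrite, Summary: "run a script"},
	}
	c, rest := match(cmds, []string{"browser", "run", "flow.js", "--url", "https://example.com"})
	fmt.Println(c.Summary, rest)
}
```

Preferring the longest prefix is what lets a bare `browser` fall through to the help entry while `browser run flow.js` reaches the more specific verb.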


@ -1,172 +0,0 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}
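
The tests above pin down the contracts of `cdpHealthy` and `buildPortForwardArgs` without showing their bodies. The following is a minimal sketch that would satisfy those assertions, assuming the health check only inspects the `Browser` field of the `/json/version` response. The real implementations are not in this diff and may check more.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cdpHealthy parses a /json/version body and reports whether the endpoint is
// a real (headed) Chrome. A "HeadlessChrome/..." product string is unhealthy.
func cdpHealthy(body []byte) (browser string, ok bool, err error) {
	var v struct {
		Browser string `json:"Browser"`
	}
	if err := json.Unmarshal(body, &v); err != nil {
		return "", false, fmt.Errorf("parse /json/version: %w", err)
	}
	return v.Browser, strings.HasPrefix(v.Browser, "Chrome/"), nil
}

// buildPortForwardArgs returns the kubectl arguments that forward a local
// port to the chrome-service CDP port (9222), as asserted by the test above.
func buildPortForwardArgs(localPort int) []string {
	return []string{
		"-n", "chrome-service", "port-forward", "svc/chrome-service",
		fmt.Sprintf("%d:9222", localPort),
	}
}

func main() {
	browser, ok, err := cdpHealthy([]byte(`{"Browser":"Chrome/130.0.6723.31"}`))
	fmt.Println(browser, ok, err)
	fmt.Println(buildPortForwardArgs(18080))
}
```

Keeping both functions free of I/O is what makes them testable with plain byte slices and integers, as the test file does.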


@ -1,99 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func ciCommands() []Command {
return []Command{
{Path: []string{"ci", "status"}, Tier: TierRead,
Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
{Path: []string{"ci", "watch"}, Tier: TierRead,
Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
}
}
func short(s string) string {
if len(s) > 8 {
return s[:8]
}
return s
}
func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return ""
}
sha, _ := gitOutput(root, "rev-parse", "HEAD")
return sha
}
func ciStatus(args []string) error {
commit, _ := firstPositional(args)
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
p, err := c.findPipeline(id, commit)
if err != nil {
return err
}
fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
return nil
}
func ciWatch(args []string) error {
commit, _ := firstPositional(args)
if commit == "" {
commit = currentHEAD()
}
if commit == "" {
return fmt.Errorf("no commit given and not in a git repo")
}
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
timeout := 20 * time.Minute
deadline := time.Now().Add(timeout)
last := ""
for time.Now().Before(deadline) {
p, err := c.findPipeline(id, commit)
if err != nil {
if last != "waiting" {
fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
last = "waiting"
}
} else {
if p.Status != last {
fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
last = p.Status
}
if isTerminalStatus(p.Status) {
fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
if isFailureStatus(p.Status) {
return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
}
return nil
}
}
time.Sleep(15 * time.Second)
}
return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}
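
`ciWatch` above and `deployWait` later in this diff both hand-roll the same loop: check, sleep, repeat until a deadline. A minimal sketch of that loop factored into a helper follows. `pollUntil` is a hypothetical name, not something the CLI defines, and unlike the loops in the source it gives up as soon as the next sleep would overshoot the deadline.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil (hypothetical) calls check every interval until it reports done,
// returns an error, or the timeout would be exceeded by the next sleep.
func pollUntil(timeout, interval time.Duration, check func() (done bool, err error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("timed out after %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	err := pollUntil(time.Second, 10*time.Millisecond, func() (bool, error) {
		attempts++
		return attempts >= 3, nil // pretend the pipeline turns terminal on the 3rd poll
	})
	fmt.Println("attempts:", attempts, "err:", err)

	err = pollUntil(30*time.Millisecond, 10*time.Millisecond, func() (bool, error) {
		return false, nil // never terminal
	})
	fmt.Println("timed out:", err != nil)

	err = pollUntil(time.Second, 10*time.Millisecond, func() (bool, error) {
		return false, errors.New("pipeline failed")
	})
	fmt.Println("failure surfaced:", err)
}
```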


@ -1,56 +0,0 @@
package main
import (
"fmt"
"strings"
)
func claimCommands() []Command {
return []Command{
{Path: []string{"claim"}, Tier: TierWrite,
Summary: "claim a shared infra resource on the presence board",
Run: runClaim},
{Path: []string{"release"}, Tier: TierWrite,
Summary: "release a presence claim",
Run: runRelease},
}
}
// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
var label, purpose string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--purpose" || a == "-purpose":
if i+1 < len(args) {
purpose = args[i+1]
i++
}
case strings.HasPrefix(a, "--purpose="):
purpose = strings.TrimPrefix(a, "--purpose=")
case !strings.HasPrefix(a, "-") && label == "":
label = a
}
}
if label == "" {
return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
}
return presenceClaim(label, purpose)
}
func runRelease(args []string) error {
var label string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
label = a
break
}
}
if label == "" {
return fmt.Errorf("usage: homelab release <kind>:<name>")
}
return presenceRelease(label)
}
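
The comment on `runClaim` notes that Go's `flag` package stops at the first positional argument, which is why the label-first form is parsed by hand. A small sketch demonstrating that behaviour follows. `parseWithFlagPkg` is a hypothetical helper written only for this demonstration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseWithFlagPkg (hypothetical) parses args with the standard flag package
// and returns the --purpose value plus whatever flag left unparsed.
func parseWithFlagPkg(args []string) (purpose string, rest []string, err error) {
	fs := flag.NewFlagSet("claim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage noise out of the demo
	p := fs.String("purpose", "", "what + why")
	if err := fs.Parse(args); err != nil {
		return "", nil, err
	}
	return *p, fs.Args(), nil
}

func main() {
	// Label first: parsing stops at "vm:devvm", so --purpose is never seen.
	purpose, rest, _ := parseWithFlagPkg([]string{"vm:devvm", "--purpose", "kernel upgrade"})
	fmt.Printf("label first: purpose=%q rest=%q\n", purpose, rest)

	// Flag first: the flag is consumed and only the label remains.
	purpose, rest, _ = parseWithFlagPkg([]string{"--purpose", "kernel upgrade", "vm:devvm"})
	fmt.Printf("flag first:  purpose=%q rest=%q\n", purpose, rest)
}
```

In the label-first case the purpose comes back empty and all three tokens are left in `rest`, which is the failure the hand-written loop avoids.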


@ -1,51 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func deployCommands() []Command {
return []Command{
{Path: []string{"deploy", "wait"}, Tier: TierRead,
Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
}
}
// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
target, _ := firstPositional(args)
if target == "" || !strings.Contains(target, "/") {
return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
}
parts := strings.SplitN(target, "/", 2)
ns, deploy := parts[0], parts[1]
sha := flagValue(args, "--sha")
if sha == "" {
sha = short(currentHEAD())
}
deadline := time.Now().Add(10 * time.Minute)
if sha != "" {
fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
matched := false
for time.Now().Before(deadline) {
img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
if strings.Contains(img, sha) {
matched = true
break
}
time.Sleep(10 * time.Second)
}
if !matched {
return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
}
}
fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}
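
`deployWait` reads `--sha` through a `flagValue` helper whose body is not part of this diff. A minimal sketch of how such a lookup could work follows, assuming it accepts both `--sha VALUE` and `--sha=VALUE` and returns an empty string when the flag is absent. The real helper may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// flagValue (assumed behaviour) returns the value of a "--name VALUE" or
// "--name=VALUE" argument, or "" if the flag is not present.
func flagValue(args []string, name string) string {
	for i, a := range args {
		if a == name && i+1 < len(args) {
			return args[i+1]
		}
		if strings.HasPrefix(a, name+"=") {
			return strings.TrimPrefix(a, name+"=")
		}
	}
	return ""
}

func main() {
	fmt.Println(flagValue([]string{"immich/server", "--sha", "fe9364b9"}, "--sha"))
	fmt.Println(flagValue([]string{"immich/server", "--sha=fe9364b9"}, "--sha"))
	fmt.Printf("%q\n", flagValue([]string{"immich/server"}, "--sha"))
}
```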


@ -1,69 +0,0 @@
package main
import "fmt"
func edgesCommands() []Command {
return []Command{
{Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
Run: edgesRun},
}
}
// edgesRun renders the filter flags to SQL and runs it read-only against the
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
func edgesRun(args []string) error {
for _, a := range args {
if a == "-h" || a == "--help" {
fmt.Print(edgesUsage())
return nil
}
}
o, err := parseEdgesArgs(args)
if err != nil {
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
}
sql, err := buildEdgesQuery(o)
if err != nil {
return err
}
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
"-o", "jsonpath={.items[0].metadata.name}")
if err != nil || pod == "" {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
}
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
if o.asJSON {
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
} else {
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
}
return kubectlStream("dbaas", exec...)
}
func edgesUsage() string {
return `homelab edges query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
Usage: homelab edges [filters]
Filters (AND-combined; namespace values are validated to the k8s name charset):
--ns NAME edges touching NAME (either direction)
--src NAME edges where source namespace = NAME
--dst NAME edges where destination namespace = NAME
--peers-of NAME distinct peer namespaces of NAME (both directions)
--new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
--denied only denied (action='deny') edges blocked / lateral-movement attempts
--json output a JSON array (for agents/pipelines)
--limit N cap rows (default 200)
Examples:
homelab edges --ns immich # everything immich talks to / is talked to by
homelab edges --peers-of authentik # authentik's peer namespaces
homelab edges --src recruiter-responder # that namespace's egress peers
homelab edges --new-since 24h # edges first seen in the last day
homelab edges --denied --json # blocked flows, machine-readable
Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
`
}
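
The usage text says namespace values are validated to the k8s name charset and that `--new-since` accepts durations such as `7d`, a unit Go's `time.ParseDuration` does not understand. `parseEdgesArgs` and `buildEdgesQuery` are not shown in this diff, so the following is only a sketch of those two checks under stated assumptions. `validNamespace` and `parseLookback` are hypothetical names, the namespace rule is the RFC 1123 label form, and the date form (YYYY-MM-DD) is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// RFC 1123 label: lowercase alphanumerics and '-', starting and ending with
// an alphanumeric, at most 63 characters.
var nsRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace (hypothetical) rejects anything that could not be a
// namespace name, which also keeps quotes and spaces out of generated SQL.
func validNamespace(s string) bool {
	return len(s) <= 63 && nsRe.MatchString(s)
}

// parseLookback (hypothetical) accepts Go durations (24h, 30m, 90s) plus a
// whole-day suffix (7d).
func parseLookback(spec string) (time.Duration, error) {
	if strings.HasSuffix(spec, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(spec, "d"))
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid --new-since %q", spec)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	d, err := time.ParseDuration(spec)
	if err != nil || d < 0 {
		return 0, fmt.Errorf("invalid --new-since %q", spec)
	}
	return d, nil
}

func main() {
	fmt.Println(validNamespace("recruiter-responder"), validNamespace("x'; DROP TABLE edges"))
	for _, spec := range []string{"24h", "7d", "90s", "soon"} {
		d, err := parseLookback(spec)
		fmt.Println(spec, d, err)
	}
}
```

Validating the namespace against a strict charset before it reaches the query builder is a simple way to make string-built SQL safe for this one kind of input.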


@ -1,172 +0,0 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strings"
)
// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
"sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
return []Command{
{Path: []string{"ha", "token"}, Tier: TierRead,
Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
{Path: []string{"ha", "ssh"}, Tier: TierWrite,
Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
}
}
// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
if name == "" {
name = haDefaultInstance
}
inst, ok := haInstances[name]
if !ok {
return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
}
return inst, nil
}
// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
return "", fmt.Errorf("base64-decode secret value: %w", err)
}
return string(raw), nil
}
func haToken(args []string) error {
name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
for i := 0; i < len(args); i++ {
if args[i] == "--instance" && i+1 < len(args) {
name = args[i+1]
} else if strings.HasPrefix(args[i], "--instance=") {
name = strings.TrimPrefix(args[i], "--instance=")
}
}
inst, err := resolveHAInstance(name)
if err != nil {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
"-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
tok, err := decodeSecretValue(b64)
if err != nil {
return err
}
fmt.Println(tok)
return nil
}
// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
if home, err := os.UserHomeDir(); err == nil && home != "" {
return filepath.Join(home, ".ssh", "id_ed25519")
}
return filepath.Join("~", ".ssh", "id_ed25519")
}
// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
name := haDefaultInstance
keyPath = defaultHAKeyPath()
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
remote = append(remote, args[i+1:]...)
i = len(args)
case a == "--instance":
if i+1 < len(args) {
name = args[i+1]
i++
}
case strings.HasPrefix(a, "--instance="):
name = strings.TrimPrefix(a, "--instance=")
case a == "--key" || a == "-i":
if i+1 < len(args) {
keyPath = args[i+1]
i++
}
case strings.HasPrefix(a, "--key="):
keyPath = strings.TrimPrefix(a, "--key=")
default:
remote = append(remote, a)
}
}
inst, err = resolveHAInstance(name)
return inst, keyPath, remote, err
}
// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
args := []string{
"-F", "/dev/null",
"-o", "IdentityFile=" + keyPath,
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
inst.sshUser + "@" + inst.sshHost,
}
return append(args, remote...)
}
func haSSH(args []string) error {
inst, keyPath, remote, err := parseHASSH(args)
if err != nil {
return err
}
if len(remote) == 0 {
return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
}
return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}


@ -1,92 +0,0 @@
package main
import (
"encoding/base64"
"reflect"
"strings"
"testing"
)
func TestResolveHAInstance(t *testing.T) {
// empty defaults to sofia (the devvm sits on the Sofia LAN)
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
}
}
func TestDecodeSecretValue(t *testing.T) {
// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
// returns that base64, which decodeSecretValue turns back into the raw token.
enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
// trailing whitespace/newline from jsonpath output must be tolerated
if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
if _, err := decodeSecretValue("not-base64!!"); err == nil {
t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
func TestBuildHASSHArgs(t *testing.T) {
inst, _ := resolveHAInstance("sofia")
got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
want := []string{
"-F", "/dev/null",
"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
"vbarzin@192.168.1.8",
"cat", "/config/configuration.yaml",
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
}
}
func TestParseHASSH(t *testing.T) {
// instance flag + everything after `--` is the verbatim remote command
inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if inst.name != "sofia" {
t.Errorf("instance = %q, want sofia", inst.name)
}
if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
}
if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
t.Errorf("remote = %v, want [docker ps -a]", remote)
}
// bare args (no `--`) are also taken as the remote command; -i overrides the key
_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if key2 != "/tmp/k" {
t.Errorf("key = %q, want /tmp/k", key2)
}
if !reflect.DeepEqual(remote2, []string{"uptime"}) {
t.Errorf("remote = %v, want [uptime]", remote2)
}
// unknown instance surfaces as an error
if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
t.Errorf("parseHASSH should error on unknown instance")
}
}

@ -1,288 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
)
func k8sCommands() []Command {
return []Command{
{Path: []string{"k8s", "status"}, Tier: TierRead,
Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
{Path: []string{"k8s", "get"}, Tier: TierRead,
Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
{Path: []string{"k8s", "logs"}, Tier: TierRead,
Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
{Path: []string{"k8s", "describe"}, Tier: TierRead,
Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
{Path: []string{"k8s", "debug"}, Tier: TierRead,
Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
{Path: []string{"k8s", "pf"}, Tier: TierRead,
Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
{Path: []string{"k8s", "db"}, Tier: TierWrite,
Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
{Path: []string{"k8s", "exec"}, Tier: TierWrite,
Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
{Path: []string{"k8s", "restart"}, Tier: TierWrite,
Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
{Path: []string{"k8s", "probe"}, Tier: TierRead,
Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
}
}
func k8sStatus(args []string) error {
t := parseK8sTarget(args)
ns := t.namespace() // "" when no app/ns given → cluster-wide
get := []string{"get", "pods", "-o", "wide"}
ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
if ns == "" {
get = append(get, "-A")
ev = append(ev, "-A")
}
if err := kubectlStream(ns, get...); err != nil {
return err
}
fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
_ = kubectlStream(ns, ev...) // best-effort
return nil
}
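// k8sGet runs `kubectl get` in a namespace. Unlike the other k8s verbs, the
// first positional (parsed into t.app) is the namespace itself, not an app name.
func k8sGet(args []string) error {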
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
}
return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}
func k8sLogs(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
}
a := []string{"logs"}
if t.selector != "" {
a = append(a, "-l", t.selector)
} else {
a = append(a, t.objectRef())
}
if t.container != "" {
a = append(a, "-c", t.container)
}
if !containsPrefix(t.rest, "--tail") {
a = append(a, "--tail=200")
}
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sDescribe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
}
if len(t.rest) > 0 {
return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
}
return kubectlStream(t.namespace(), "describe", t.objectRef())
}
func k8sDebug(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s debug <app>")
}
ns := t.namespace()
sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
sec("pods")
_ = kubectlStream(ns, "get", "pods", "-o", "wide")
sec("workloads")
_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
sec("describe "+t.objectRef())
_ = kubectlStream(ns, "describe", t.objectRef())
sec("recent logs (--tail=50)")
_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
sec("events (type!=Normal)")
_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
return nil
}
func k8sPortForward(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
}
ports := t.rest[0]
target := "svc/" + t.app
if len(t.rest) > 1 {
target = t.rest[1]
}
return kubectlStream(t.namespace(), "port-forward", target, ports)
}
func k8sDB(args []string) error {
var app, dbName, sql string
mysql := false
for i := 0; i < len(args); i++ {
a := args[i]
if a == "--" {
sql = strings.Join(args[i+1:], " ")
break
}
switch {
case a == "--mysql":
mysql = true
case a == "--db":
if i+1 < len(args) {
dbName = args[i+1]
i++
}
case strings.HasPrefix(a, "--db="):
dbName = strings.TrimPrefix(a, "--db=")
case !strings.HasPrefix(a, "-") && app == "":
app = a
}
}
if app == "" {
return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
}
p := planDBExec(app, dbName, sql, mysql)
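pod := p.pod
if pod == "" && p.selector != "" {
resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
if err != nil || resolved == "" {
return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
}
pod = resolved
}
if pod == "" {
// Neither a fixed pod nor a selector was planned; fail here rather than
// handing kubectl an empty pod name.
return fmt.Errorf("no db pod or selector planned for %q in %s", app, p.ns)
}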
exec := []string{"exec"}
if sql == "" {
exec = append(exec, "-it") // interactive client when no SQL given
}
exec = append(exec, pod)
if p.container != "" {
exec = append(exec, "-c", p.container)
}
exec = append(exec, "--")
exec = append(exec, p.argv...)
return kubectlStream(p.ns, exec...)
}
func k8sExec(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
}
if len(t.rest) == 0 {
return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
}
a := []string{"exec"}
if t.tty {
a = append(a, "-it")
}
a = append(a, t.objectRef())
if t.container != "" {
a = append(a, "-c", t.container)
}
a = append(a, "--")
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sRmPod(args []string) error {
var pod, ns, grace string
force, job := false, false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-n" || a == "--namespace":
if i+1 < len(args) {
ns = args[i+1]
i++
}
case a == "--force":
force = true
case a == "--job":
job = true
case a == "--grace":
if i+1 < len(args) {
grace = args[i+1]
i++
}
case !strings.HasPrefix(a, "-") && pod == "":
pod = a
}
}
if pod == "" || ns == "" {
return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
}
kind := "pod"
if job {
kind = "job"
}
a := []string{"delete", kind, pod}
if grace != "" {
a = append(a, "--grace-period="+grace)
}
if force {
a = append(a, "--force")
}
return kubectlStream(ns, a...)
}
func k8sRolloutStatus(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s rollout-status <app>")
}
return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}
func k8sRestart(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s restart <app>")
}
ns := t.namespace()
if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
return err
}
return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}
func k8sProbe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
}
ns := t.namespace()
url := "http://" + t.app + "." + ns + ".svc.cluster.local"
if port := flagValue(args, "--port"); port != "" {
url += ":" + port
}
if len(t.rest) > 0 {
p := t.rest[0]
if !strings.HasPrefix(p, "/") {
p = "/" + p
}
url += p
}
return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
"--image=curlimages/curl:latest", "--",
"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}
// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
for _, a := range args {
if strings.HasPrefix(a, prefix) {
return true
}
}
return false
}
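
The URL assembly in `k8sProbe` can be read in isolation. The sketch below pulls it into a standalone helper; `buildProbeURL` and the sample app and namespace names are hypothetical and only illustrate the string logic, not the kubectl invocation.

```go
package main

import (
	"fmt"
	"strings"
)

// buildProbeURL (hypothetical name) assembles the in-cluster service URL the
// same way k8sProbe does: service DNS name, optional port, optional path.
func buildProbeURL(app, ns, port, path string) string {
	u := "http://" + app + "." + ns + ".svc.cluster.local"
	if port != "" {
		u += ":" + port
	}
	if path != "" {
		// Accept both "healthz" and "/healthz".
		if !strings.HasPrefix(path, "/") {
			path = "/" + path
		}
		u += path
	}
	return u
}

func main() {
	fmt.Println(buildProbeURL("grafana", "monitoring", "3000", "api/health"))
	fmt.Println(buildProbeURL("grafana", "monitoring", "", ""))
}
```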

@ -1,308 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"strings"
)
func memoryCommands() []Command {
return []Command{
{Path: []string{"memory", "recall"}, Tier: TierRead,
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
{Path: []string{"memory", "list"}, Tier: TierRead,
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
{Path: []string{"memory", "categories"}, Tier: TierRead,
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
{Path: []string{"memory", "tags"}, Tier: TierRead,
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
{Path: []string{"memory", "stats"}, Tier: TierRead,
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
{Path: []string{"memory", "secret"}, Tier: TierRead,
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
{Path: []string{"memory", "store"}, Tier: TierWrite,
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
{Path: []string{"memory", "update"}, Tier: TierWrite,
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
{Path: []string{"memory", "delete"}, Tier: TierWrite,
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
}
}
// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
fmt.Print(renderMemories(raw, jsonOut))
return nil
}
// renderMemories formats each memory as a single line with its FULL content
// (newlines flattened to spaces). Content is deliberately never truncated: the
// old 240-rune preview cut memories mid-sentence, misled agents into believing
// no full-content read-back existed, and made blind `update --content` from
// the preview silently destroy the stored tail. Full passthrough also can't
// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook).
func renderMemories(raw []byte, jsonOut bool) string {
if jsonOut {
return string(raw) + "\n"
}
var r struct {
Memories []struct {
ID int `json:"id"`
Content string `json:"content"`
Category string `json:"category"`
Tags string `json:"tags"`
Importance float64 `json:"importance"`
} `json:"memories"`
}
if err := json.Unmarshal(raw, &r); err != nil {
return string(raw) + "\n"
}
if len(r.Memories) == 0 {
return "(no memories)\n"
}
var b strings.Builder
for _, m := range r.Memories {
c := strings.ReplaceAll(m.Content, "\n", " ")
fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Fprintf(&b, " tags: %s\n", m.Tags)
}
}
return b.String()
}
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--query":
if i+1 < len(args) {
req.ExpandedQuery = args[i+1]
i++
}
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--sort":
if i+1 < len(args) {
req.SortBy = args[i+1]
i++
}
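case a == "--limit":
if i+1 < len(args) {
// Reject a non-numeric limit instead of silently sending the zero value.
if _, err := fmt.Sscanf(args[i+1], "%d", &req.Limit); err != nil {
return fmt.Errorf("bad --limit %q: %w", args[i+1], err)
}
i++
}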
case a == "--json":
jsonOut = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Context = strings.Join(pos, " ")
if req.Context == "" {
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/recall", req)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memoryList(args []string) error {
q := url.Values{}
jsonOut := false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
q.Set("category", args[i+1])
i++
}
case a == "--tag":
if i+1 < len(args) {
q.Set("tag", args[i+1])
i++
}
case a == "--limit":
if i+1 < len(args) {
q.Set("limit", args[i+1])
i++
}
case a == "--json":
jsonOut = true
}
}
c, err := newMemoryClient()
if err != nil {
return err
}
path := "/api/memories"
if len(q) > 0 {
path += "?" + q.Encode()
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memorySimpleGet(path string) func([]string) error {
return func(args []string) error {
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
}
func memorySecret(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory secret <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryStore(args []string) error {
req := memStoreReq{Category: "facts", Importance: 0.5}
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--tags":
if i+1 < len(args) {
req.Tags = args[i+1]
i++
}
case a == "--keywords":
if i+1 < len(args) {
req.ExpandedKeywords = args[i+1]
i++
}
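case a == "--importance":
if i+1 < len(args) {
// Reject a non-numeric importance instead of silently keeping the default.
if _, err := fmt.Sscanf(args[i+1], "%f", &req.Importance); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
i++
}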
case a == "--sensitive":
req.ForceSensitive = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Content = strings.Join(pos, " ")
if req.Content == "" {
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories", req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryUpdate(args []string) error {
var id string
req := memUpdateReq{}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--content":
if i+1 < len(args) {
v := args[i+1]
req.Content = &v
i++
}
case a == "--tags":
if i+1 < len(args) {
v := args[i+1]
req.Tags = &v
i++
}
case a == "--keywords":
if i+1 < len(args) {
v := args[i+1]
req.ExpandedKeywords = &v
i++
}
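case a == "--importance":
if i+1 < len(args) {
var f float64
// A bad value must not be sent as importance 0, which would overwrite
// the stored score.
if _, err := fmt.Sscanf(args[i+1], "%f", &f); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
req.Importance = &f
i++
}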
case !strings.HasPrefix(a, "-") && id == "":
id = a
}
}
if id == "" {
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryDelete(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory delete <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
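
`memoryUpdate` fills pointer fields on `memUpdateReq`, whose definition is not shown in this diff. A plausible reason is partial updates: a nil pointer means "leave unchanged", while a pointer to a zero value means "set to zero". The sketch below assumes `omitempty` JSON tags and uses the hypothetical names `updateReq` and `encode`; the real struct may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// updateReq is a hypothetical stand-in for memUpdateReq. Pointer fields with
// omitempty are dropped from the JSON only when they are nil.
type updateReq struct {
	Content    *string  `json:"content,omitempty"`
	Tags       *string  `json:"tags,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals the request, returning "" if marshalling fails.
func encode(req updateReq) string {
	b, err := json.Marshal(req)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	tags := "infra,dns"
	fmt.Println(encode(updateReq{Tags: &tags})) // only tags is sent

	zero := 0.0
	fmt.Println(encode(updateReq{Importance: &zero})) // an explicit 0 is still sent
}
```

Under this assumption an unparsed `--importance` that falls back to 0 would be transmitted as a real update, which is why validating the flag value matters.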

@ -1,83 +0,0 @@
package main
import (
"fmt"
"strings"
"time"
)
func netCommands() []Command {
return []Command{
{Path: []string{"net", "check"}, Tier: TierRead,
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
{Path: []string{"dns", "lookup"}, Tier: TierRead,
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
}
}
func fmtProbe(code int, d time.Duration, err error) string {
if err != nil {
return "ERR " + err.Error()
}
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
}
func netCheck(args []string) error {
host, rest := firstPositional(args)
if host == "" {
return fmt.Errorf("usage: homelab net check <host> [path]")
}
path := "/"
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
path = rest[0]
if !strings.HasPrefix(path, "/") {
path = "/" + path
}
}
u := "https://" + host + path
fmt.Printf("%s\n", u)
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
if pubIP := firstLine(pubOut); pubIP != "" {
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
} else {
fmt.Println(" external (public) no public A record")
}
// internal leg: dial the Traefik LB directly
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
return nil
}
func dnsLookup(args []string) error {
name, rest := firstPositional(args)
if name == "" {
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
}
rr := ""
if len(rest) > 0 {
rr = rest[0]
}
tech, _ := dig(name, "10.0.20.201", rr)
pub, _ := dig(name, "1.1.1.1", rr)
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
}
return nil
}
func hostOnly(h string) string { // strip any path accidentally included
return strings.SplitN(h, "/", 2)[0]
}
func oneLineList(s string) string {
s = strings.TrimSpace(s)
if s == "" {
return "(none)"
}
return strings.ReplaceAll(s, "\n", ", ")
}
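
`dnsLookup` compares the two resolver answers as trimmed strings, so the same record set returned in a different order would be reported as a mismatch. A sketch of an order-insensitive comparison follows; `sameAnswers` and `normalizeAnswers` are hypothetical helpers, and the sketch assumes `dig` returns one record per line.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalizeAnswers splits resolver output into trimmed, non-empty lines and
// sorts them so that record order does not matter.
func normalizeAnswers(s string) []string {
	var out []string
	for _, line := range strings.Split(s, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			out = append(out, line)
		}
	}
	sort.Strings(out)
	return out
}

// sameAnswers reports whether two resolver outputs hold the same record set.
func sameAnswers(a, b string) bool {
	return strings.Join(normalizeAnswers(a), "\n") == strings.Join(normalizeAnswers(b), "\n")
}

func main() {
	fmt.Println(sameAnswers("1.1.1.1\n2.2.2.2\n", "2.2.2.2\n1.1.1.1"))
	fmt.Println(sameAnswers("10.0.20.200", "203.0.113.7"))
}
```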

@ -1,197 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
"strings"
"time"
)
const (
promHost = "prometheus-query.viktorbarzin.lan"
lokiHost = "loki.viktorbarzin.lan"
)
func obsCommands() []Command {
return []Command{
{Path: []string{"metrics", "query"}, Tier: TierRead,
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
{Path: []string{"logs", "query"}, Tier: TierRead,
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
}
}
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
// passed as a single quoted argument; this also tolerates unquoted multi-token).
func queryArg(args []string, valueFlags map[string]bool) string {
var parts []string
for i := 0; i < len(args); i++ {
a := args[i]
if valueFlags[a] {
i++
continue
}
if strings.HasPrefix(a, "-") {
continue
}
parts = append(parts, a)
}
return strings.Join(parts, " ")
}
func labelStr(m map[string]string) string {
name := m["__name__"]
var kv []string
for k, v := range m {
if k != "__name__" {
kv = append(kv, k+"="+v)
}
}
sort.Strings(kv)
return name + "{" + strings.Join(kv, ",") + "}"
}
func metricsQuery(args []string) error {
q := queryArg(args, nil)
if q == "" {
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
}
v := url.Values{}
v.Set("query", q)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no series)")
return nil
}
for _, s := range r.Data.Result {
val := ""
if len(s.Value) == 2 {
val = fmt.Sprint(s.Value[1])
}
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
}
return nil
}
func metricsAlerts(args []string) error {
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
// set is exposed as the synthetic ALERTS series, queryable the normal way.
v := url.Values{}
v.Set("query", `ALERTS{alertstate="firing"}`)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no firing alerts)")
return nil
}
for _, a := range r.Data.Result {
m := a.Metric
scope := ""
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
if v := m[k]; v != "" {
scope = k + "=" + v
break
}
}
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
}
return nil
}
func logsQuery(args []string) error {
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
if q == "" {
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
}
since := flagValue(args, "--since")
if since == "" {
since = "1h"
}
dur, err := time.ParseDuration(since)
if err != nil {
return fmt.Errorf("bad --since %q: %w", since, err)
}
limit := flagValue(args, "--limit")
if limit == "" {
limit = "100"
}
end := time.Now()
v := url.Values{}
v.Set("query", q)
v.Set("limit", limit)
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Values [][]string `json:"values"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
n := 0
for _, s := range r.Data.Result {
for _, val := range s.Values {
if len(val) == 2 {
fmt.Println(val[1])
n++
}
}
}
if n == 0 {
fmt.Println("(no log lines)")
}
return nil
}

@ -1,122 +0,0 @@
package main
import (
"fmt"
"os"
"os/signal"
"path/filepath"
"strings"
"sync"
"syscall"
)
func tfCommands() []Command {
return []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead,
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
{Path: []string{"tf", "validate"}, Tier: TierRead,
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
{Path: []string{"tf", "fmt"}, Tier: TierRead,
Summary: "terraform fmt a stack's files", Run: tfFmt},
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
{Path: []string{"tf", "apply"}, Tier: TierWrite,
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
}
}
// firstPositional returns the first non-flag arg and the remaining args with it removed.
func firstPositional(args []string) (string, []string) {
for i, a := range args {
if !strings.HasPrefix(a, "-") {
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
return a, rest
}
}
return "", args
}
// resolveTfStack finds the infra root (from cwd) and the stack directory named
// by the first positional arg, returning the remaining args.
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
stackName, rest = firstPositional(args)
if stackName == "" {
err = fmt.Errorf("missing <stack> argument")
return
}
cwd, e := os.Getwd()
if e != nil {
err = e
return
}
infraRoot, err = findInfraRoot(cwd)
if err != nil {
return
}
stackDir, err = resolveStack(infraRoot, stackName)
return
}
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
func tfPassthrough(verb string) func([]string) error {
return func(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
}
}
func tfFmt(args []string) error {
_, _, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
}
func tfForceUnlock(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
if len(rest) < 1 {
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
}
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
}
// tfApply applies a stack out-of-band: claim the stack on the presence board,
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
// and warn that CI applies canonically on push.
func tfApply(args []string) error {
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
label := "stack:" + stackName
fmt.Fprintf(os.Stderr,
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
}
// Release exactly once, whether we exit normally, on error, or on signal —
// sync.Once makes the defer and the signal goroutine safe to both call it.
var once sync.Once
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
defer release()
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
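go func() {
s := <-sig
release()
// Conventional exit status is 128 + signal number: 130 for SIGINT, 143 for SIGTERM.
code := 130
if s == syscall.SIGTERM {
code = 143
}
os.Exit(code)
}()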
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
}
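
The release-exactly-once pattern in `tfApply` can be shown without the presence board. In this sketch `newReleaser` is a hypothetical helper that wraps any cleanup function in a `sync.Once`, so a deferred call and a signal handler can both invoke it safely.

```go
package main

import (
	"fmt"
	"sync"
)

// newReleaser (hypothetical) returns a function that runs release at most
// once, no matter how many callers or goroutines invoke it.
func newReleaser(release func()) func() {
	var once sync.Once
	return func() { once.Do(release) }
}

func main() {
	calls := 0
	release := newReleaser(func() { calls++ })

	defer release() // normal and error exit path
	release()       // e.g. called earlier from a signal handler

	fmt.Println("release ran", calls, "time(s) so far")
}
```

`sync.Once` also blocks concurrent callers until the first call has finished, so no caller returns while the release is still in flight.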

@ -1,27 +0,0 @@
package main
import (
"reflect"
"testing"
)
func TestFirstPositional(t *testing.T) {
cases := []struct {
args []string
wantName string
wantRest []string
}{
{[]string{"vault"}, "vault", []string{}},
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
}
for _, c := range cases {
gotName, gotRest := firstPositional(c.args)
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
c.args, gotName, gotRest, c.wantName, c.wantRest)
}
}
}

@ -1,77 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
)
func usageCommands() []Command {
return []Command{
{Path: []string{"usage", "top"}, Tier: TierRead,
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
}
}
// usageQuery builds the LogQL metric query that counts invocations per verb.
// The user value is quoted with strconv.Quote so a stray quote or backslash in
// --user cannot break out of the label matcher.
func usageQuery(since, user string) string {
sel := `job="` + usageJob + `"`
if user != "" {
sel += `, user=` + strconv.Quote(user)
}
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
func usageTop(args []string) error {
since := flagValue(args, "--since")
if since == "" {
since = "30d"
}
v := url.Values{}
v.Set("query", usageQuery(since, flagValue(args, "--user")))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
type row struct {
verb string
n int
}
var rows []row
for _, s := range r.Data.Result {
n := 0
if len(s.Value) == 2 {
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
n = int(f)
}
}
rows = append(rows, row{s.Metric["verb"], n})
}
if len(rows) == 0 {
fmt.Println("(no usage recorded yet)")
return nil
}
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
for _, r := range rows {
fmt.Printf("%6d %s\n", r.n, r.verb)
}
return nil
}
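
The decode-and-rank step of `usageTop` relies on the Prometheus-style instant-vector shape, where each sample is a `[timestamp, "value-as-string"]` pair. The sketch below isolates that step. `parseUsage` and `usageRow` are hypothetical names and the sample payload used with them is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// usageRow (hypothetical) is one verb with its invocation count.
type usageRow struct {
	Verb string
	N    int
}

// parseUsage decodes an instant-vector response and returns rows sorted by
// count, highest first. Samples that are not a [ts, "value"] pair count as 0.
func parseUsage(body []byte) ([]usageRow, error) {
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	rows := make([]usageRow, 0, len(r.Data.Result))
	for _, s := range r.Data.Result {
		n := 0
		if len(s.Value) == 2 {
			if str, ok := s.Value[1].(string); ok {
				if f, err := strconv.ParseFloat(str, 64); err == nil {
					n = int(f)
				}
			}
		}
		rows = append(rows, usageRow{Verb: s.Metric["verb"], N: n})
	}
	sort.SliceStable(rows, func(i, j int) bool { return rows[i].N > rows[j].N })
	return rows, nil
}

func main() {
	// Made-up sample payload.
	body := []byte(`{"data":{"result":[
		{"metric":{"verb":"k8s logs"},"value":[1718650000,"12"]},
		{"metric":{"verb":"tf plan"},"value":[1718650000,"40"]}]}}`)
	rows, err := parseUsage(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, row := range rows {
		fmt.Printf("%6d %s\n", row.N, row.Verb)
	}
}
```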

@ -1,944 +0,0 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/runbooks/homelab-vault-onboarding.md.
func vaultCommands() []Command {
cmds := []Command{
// Vaultwarden — your personal password manager (logins/passwords/TOTP).
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
return append(cmds, vaultKVCommands()...)
}
// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
// between the two unrelated "vaults" this command fronts, because the name
// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
// infra secrets store).
func vaultHelp() string {
return `homelab vault two different secret stores under one command:
Vaultwarden your personal PASSWORD MANAGER (logins / passwords / TOTP)
HashiCorp Vault / OpenBao homelab INFRA secrets (the secret/ KV store) 'vault kv '
Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup)
homelab vault setup one-time: store your master password + API key in your Vault path
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] list your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
TTY clipboard (auto-clears); piped stdout
homelab vault get <name> --all all fields (incl. custom) as JSON; piped only.
TOTP shown as presence flag use 'vault code' for a code.
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token)
homelab vault kv get <path> [--field K] read an infra KV secret
homelab vault kv list <path> list sub-paths
homelab vault kv put <path> <key> write one key (value via stdin)
Vaultwarden creds live only in your own Vault path; the admin never sees them.
Security model: docs/runbooks/homelab-vault-onboarding.md
(note: anything running as your user can decrypt your vault; that is the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
// write the actionable message there — "connection refused", "permission
// denied" — which the caller would otherwise never see behind a bare
// "exit status N".
func exitStderr(err error) []byte {
var ee *exec.ExitError
if errors.As(err, &ee) {
return ee.Stderr
}
return nil
}
// augmentErr appends captured stderr to an error so failures are diagnosable
// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
// when there's no stderr; preserves the wrapped error for errors.Is/As.
func augmentErr(err error, stderr []byte) error {
if err == nil {
return nil
}
if s := strings.TrimSpace(string(stderr)); s != "" {
return fmt.Errorf("%w: %s", err, s)
}
return err
}
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
func scopedTokenPath(home string) string {
return home + "/.config/claude-auth-sync/vault-token"
}
// vaultTokenSource decides which Vault token the `vault` child processes should
// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
// (policy workstation-claude-<user>, which grants exactly the create/read/update
// this tool needs on the user's own path), then a native ~/.vault-token.
//
// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
// caller's own secret/workstation/claude-users/<user> path, and a power-user who
// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
// capability on that path is `deny` — letting it win shadows the scoped token
// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
// right credential when there is no scoped token (admins). Returns the token to
// export — "" when the vault CLI should read the ambient/native credential —
// plus a source tag for tests/logging.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
switch {
case envToken != "":
return "", "env"
case strings.TrimSpace(scopedToken) != "":
return strings.TrimSpace(scopedToken), "scoped"
case haveVaultTokenFile:
return "", "file"
default:
return "", "none"
}
}
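// The precedence rules above can be sketched as a small table. This is an
// illustration only: tokenSource restates the switch under a hypothetical name,
// and the token strings are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSource restates the documented precedence: explicit env token, then the
// scoped token, then the native token file, else nothing.
func tokenSource(envToken string, haveTokenFile bool, scopedToken string) (token, source string) {
	switch {
	case envToken != "":
		return "", "env" // child inherits $VAULT_TOKEN; nothing to export
	case strings.TrimSpace(scopedToken) != "":
		return strings.TrimSpace(scopedToken), "scoped"
	case haveTokenFile:
		return "", "file"
	default:
		return "", "none"
	}
}

func main() {
	cases := []struct {
		env    string
		file   bool
		scoped string
	}{
		{"s.explicit", true, "s.scoped\n"}, // explicit override wins
		{"", true, "s.scoped\n"},           // scoped beats ~/.vault-token
		{"", true, "   "},                  // blank scoped file falls through
		{"", false, ""},                    // nothing available
	}
	for _, c := range cases {
		tok, src := tokenSource(c.env, c.file, c.scoped)
		fmt.Printf("env=%q file=%v scoped=%q -> token=%q source=%s\n",
			c.env, c.file, c.scoped, tok, src)
	}
}
```

// The second row is the case the comment calls out: a read-only ~/.vault-token
// must not shadow the scoped token.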
// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
// is likewise hardcoded (openSession), so a sane default here is consistent.
const vaultAddrDefault = "https://vault.viktorbarzin.me"
// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
// doesn't already set one, else "". homelab vault is invoked by AFK agent
// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
func vaultAddrToSet(envAddr string) string {
if strings.TrimSpace(envAddr) == "" {
return vaultAddrDefault
}
return ""
}
// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
// child processes reach the cluster Vault regardless of the caller's shell. An
// explicit VAULT_ADDR (admins, CI) is left untouched.
func ensureVaultAddr() {
if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
os.Setenv("VAULT_ADDR", a)
}
}
// fileNonEmpty reports whether path exists and has content.
func fileNonEmpty(path string) bool {
fi, err := os.Stat(path)
return err == nil && fi.Size() > 0
}
// ensureVaultToken wires vaultTokenSource to the real environment: unless an
// explicit $VAULT_TOKEN is set, it exports the claude-auth-sync scoped token
// (when one exists) so the `vault` child processes authenticate as
// workstation-claude-<user>. It is idempotent. An explicit $VAULT_TOKEN always
// wins and is left untouched; ~/.vault-token is used only when there is no
// scoped token (admins), as vaultTokenSource explains.
func ensureVaultToken() {
// Every vault verb funnels through here, so this is the one place that also
// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
// assumed from the caller's shell).
ensureVaultAddr()
home := os.Getenv("HOME")
scoped, _ := os.ReadFile(scopedTokenPath(home))
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
if src == "scoped" {
os.Setenv("VAULT_TOKEN", tok)
}
}
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwItemArgs(name string) []string { return []string{"get", "item", name} }
func bwStatusArgs() []string { return []string{"status"} }
func bwSyncArgs() []string { return []string{"sync"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
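// A short sketch of how the status check behaves on typical inputs, assuming
// the `bw status` JSON carries a top-level "status" string. needsLogin is a
// hypothetical restatement for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// needsLogin reports true when the status is unauthenticated, missing, or the
// payload cannot be parsed (attempting a login is the safe default).
func needsLogin(statusJSON string) bool {
	var s struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
		return true
	}
	return s.Status == "unauthenticated" || s.Status == ""
}

func main() {
	inputs := []string{
		`{"status":"unauthenticated"}`,
		`{"status":"locked"}`,
		`{"status":"unlocked"}`,
		`{}`,
		``,
	}
	for _, in := range inputs {
		fmt.Printf("%-32q needsLogin=%v\n", in, needsLogin(in))
	}
}
```

// Only "locked" and "unlocked" skip the login step; everything else, including
// empty output from a failed `bw status`, triggers one.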
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// A bare TERM=xterm* is not enough on its own: it is allowed only when
// TERM_PROGRAM names a known-good emulator (matched above).
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
// never blocks or fails the command). Goes to syslog so it ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
sessEnv := bwSecretEnv(appdata, creds, sess)
// Pull the latest server-side state so reads reflect current values. `bw
// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
// session would otherwise serve stale data until the next login. Best-effort:
// a transient sync failure must not break a read — fall back to the cached
// vault and warn (status reports reachability separately).
if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
}
return session{env: sessEnv}, nil
}
type getOpts struct {
name string
field string
json bool
all bool // dump every field (incl. custom) as normalized JSON
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
case a == "--all":
o.all = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
}
// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
if !o.all && !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
// getValue opens a session and fetches one field. Every external command goes
// through the runner, so it is unit-tested with a fake runner (openSession still
// creates the appdata dir and may warn on stderr).
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// getItem opens a session and returns the whole item as raw `bw get item` JSON.
// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
func getItem(run cmdRunner, user, uid, name string) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return run("bw", bwItemArgs(name), s.env)
}
// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
// standard login fields that are present, notes, and a flat map of custom field
// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
// stays the specially-audited `vault code` (see the design §10/§16).
type normalizedItem struct {
Name string `json:"name"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
URIs []string `json:"uris,omitempty"`
TOTP bool `json:"totp,omitempty"` // presence only, never the seed
Notes string `json:"notes,omitempty"`
Fields map[string]string `json:"fields,omitempty"` // custom field name→value
}
// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
// references another field and carries a null value, so it is not real data.
const bwFieldLinked = 3
// normalizeItem parses a `bw get item` payload into the browse projection. It is
// pure (no I/O), so it is the unit-tested heart of `get --all`.
func normalizeItem(raw string) (normalizedItem, error) {
var it struct {
Name string `json:"name"`
Notes string `json:"notes"`
Login *struct {
Username string `json:"username"`
Password string `json:"password"`
Totp string `json:"totp"`
URIs []struct {
URI string `json:"uri"`
} `json:"uris"`
} `json:"login"`
Fields []struct {
Name string `json:"name"`
Value string `json:"value"`
Type int `json:"type"`
} `json:"fields"`
}
if err := json.Unmarshal([]byte(raw), &it); err != nil {
return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
}
n := normalizedItem{Name: it.Name, Notes: it.Notes}
if it.Login != nil {
n.Username = it.Login.Username
n.Password = it.Login.Password
n.TOTP = it.Login.Totp != ""
for _, u := range it.Login.URIs {
if u.URI != "" {
n.URIs = append(n.URIs, u.URI)
}
}
}
for _, f := range it.Fields {
if f.Type == bwFieldLinked {
continue // references another field, no value of its own
}
if n.Fields == nil {
n.Fields = map[string]string{}
}
n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
}
return n, nil
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
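// The decision above can be sketched as a truth table. This is an illustration
// under stated assumptions: allowed and decide are hypothetical restatements of
// terminalAllowed and clipboardDecision, and the TERM / TERM_PROGRAM values are
// sample inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether TERM or TERM_PROGRAM names a terminal on the
// OSC 52 allowlist.
func allowed(term, termProgram string) bool {
	t, p := strings.ToLower(term), strings.ToLower(termProgram)
	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
		if strings.Contains(t, ok) || strings.Contains(p, ok) {
			return true
		}
	}
	return false
}

// decide returns "stdout" for pipes, "clipboard" for allowlisted terminals
// whose stderr is also a terminal, and "refuse" otherwise.
func decide(stdoutTTY, stderrTTY bool, term, termProgram string) string {
	if !stdoutTTY {
		return "stdout"
	}
	if allowed(term, termProgram) && stderrTTY {
		return "clipboard"
	}
	return "refuse"
}

func main() {
	fmt.Println(decide(false, true, "xterm-256color", ""))       // piped
	fmt.Println(decide(true, true, "xterm-kitty", ""))           // allowlisted TERM
	fmt.Println(decide(true, false, "xterm-kitty", ""))          // stderr redirected
	fmt.Println(decide(true, true, "xterm-256color", ""))        // unknown terminal
	fmt.Println(decide(true, true, "xterm-256color", "WezTerm")) // allowlisted TERM_PROGRAM
}
```

// Note that a pipe always gets the value on stdout regardless of the terminal,
// which is the intended machine path.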
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
// secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
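// clearClipboardAfter spawns a background clear so the secret doesn't linger in
// the clipboard. Best-effort. The child inherits our stderr (the terminal the
// OSC 52 copy was written to); with Stdout/Stderr left nil, os/exec attaches
// the null device and the clear escape would never reach the terminal.
func clearClipboardAfter(seconds int) {
cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s' >&2", seconds, osc52clear()))
cmd.Stderr = os.Stderr
_ = cmd.Start()
}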
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
ensureVaultToken()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
ensureVaultToken()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
// openSession already did a best-effort sync; status re-runs it explicitly so
// a reachability failure surfaces in this report rather than only on stderr.
if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
ensureVaultToken()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
// (read-modify-write: needs only read+update, NOT the `patch` capability the
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
// (creates the path on first use, before any sibling keys exist).
func kvWriteVerb(merge bool) []string {
if merge {
return []string{"kv", "patch", "-method=rw"}
}
return []string{"kv", "put"}
}
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user),
"vaultwarden_email="+email,
"vaultwarden_client_id="+clientID,
)
}
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
// realRunnerStdin.
func vaultWriteSecretArgs(merge bool, user, key string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
}
// credsPathExists reports whether the user's KV path already holds data. Used to
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
// user could run `homelab vault setup` before that ever happens.
func credsPathExists(run cmdRunner, user string) bool {
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
return err == nil
}
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
// writeCreds stores all four fields in the user's Vault path using only the
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
// first (public) write creates the path when absent; the two real secrets then
// merge in via read-modify-write so the public keys — and any claude-auth-sync
// keys already present — survive. Secret values travel on stdin, never argv.
func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
merge := credsPathExists(run, user)
if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
return err
}
// The path now exists regardless of the branch above → merge the secrets in.
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
func promptNoEcho(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
// stty acts on the terminal attached to its stdin. os/exec connects a nil
// Stdin to the null device, so pass ours through or echo is never disabled.
stty := func(arg string) {
c := exec.Command("stty", arg)
c.Stdin = os.Stdin
_ = c.Run()
}
stty("-echo")
defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator: a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
ensureVaultToken()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
ensureVaultToken()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
if o.all {
return getAllFields(user, uid, o.name)
}
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
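// Marshal instead of %q: Go-quoted strings are not always valid JSON
// (e.g. \x and \a escapes for control characters).
b, err := json.Marshal(map[string]string{o.field: val})
if err != nil {
return err
}
fmt.Println(string(b))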
return nil
}
emitSecret(val)
return nil
}
// getAllFields prints every field of one item as normalized JSON. Like
// `get --json`, the payload is all secret values, so it refuses a terminal
// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
// distinguishable from a single-field get (the item name is still never logged).
func getAllFields(user, uid, name string) error {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
}
raw, err := getItem(realRunner, user, uid, name)
if err != nil {
return err
}
item, err := normalizeItem(raw)
if err != nil {
return err
}
out, err := json.Marshal(item)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
fmt.Println(string(out))
return nil
}

@ -1,248 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"io"
"os"
"strings"
)
// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
//
// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
// token (bound only to secret/workstation/claude-users/<user>). A general kv read
// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
// injects the scoped token). Access is then whatever the caller's policy grants.
func vaultKVCommands() []Command {
return []Command{
{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
{Path: []string{"vault", "kv"}, Tier: TierRead,
Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
}
}
func vaultKVHelp() string {
return `homelab vault kv: HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/ KV store)
  homelab vault kv get <path> [--field K]   read a secret
      --field K    one value (TTY: clipboard; piped: stdout)
      no --field   all fields as JSON (piped only)
  homelab vault kv list <path>              list sub-paths under <path> (no values)
  homelab vault kv put <path> <key>         write one key; value read from stdin
      (piped, or no-echo prompt); merges, never clobbers siblings
Uses YOUR Vault token (vault login -method=oidc, stored in ~/.vault-token); access is
whatever your policy grants. This is NOT Vaultwarden: for your personal logins,
use 'homelab vault get' (see 'homelab vault').
`
}
// --- arg builders (pure; values never travel via argv) --------------------
func vaultKVGetFieldArgs(path, field string) []string {
return []string{"kv", "get", "-field=" + field, path}
}
func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} }
// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
// (read-modify-write: merges, needs only read+update — not the `patch` capability
// — and preserves sibling keys); merge=false → `kv put` (creates the path on
// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
func vaultKVPutArgs(merge bool, path, key string) []string {
return append(kvWriteVerb(merge), path, key+"=-")
}
// --- pure parsers ----------------------------------------------------------
// extractKVData returns the inner secret object from a `vault kv get -format=json`
// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
// wrapper so only the secret's own key→value data is emitted.
func extractKVData(jsonOut string) (string, error) {
var env struct {
Data struct {
Data json.RawMessage `json:"data"`
} `json:"data"`
}
if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
return "", fmt.Errorf("parse vault kv json: %w", err)
}
if len(env.Data.Data) == 0 {
return "", fmt.Errorf("no secret data at that path")
}
return string(env.Data.Data), nil
}
// parseKVList parses the JSON array `vault kv list -format=json` prints.
func parseKVList(jsonOut string) ([]string, error) {
var keys []string
if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
return nil, fmt.Errorf("parse vault kv list json: %w", err)
}
return keys, nil
}
// --- testable cores (injected cmdRunner) -----------------------------------
func kvGetField(run cmdRunner, path, field string) (string, error) {
return run("vault", vaultKVGetFieldArgs(path, field), nil)
}
func kvGetJSON(run cmdRunner, path string) (string, error) {
out, err := run("vault", vaultKVGetJSONArgs(path), nil)
if err != nil {
return "", err
}
return extractKVData(out)
}
func kvList(run cmdRunner, path string) ([]string, error) {
out, err := run("vault", vaultKVListArgs(path), nil)
if err != nil {
return nil, err
}
return parseKVList(out)
}
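// kvPathExists reports whether the KV path already holds data, to pick create
// (`kv put`) vs merge (`kv patch -method=rw`) so that a write never clobbers
// sibling keys on an existing path. Any read error (including a permission or
// network failure) is treated as "absent", so the caller then uses `kv put`.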
func kvPathExists(run cmdRunner, path string) bool {
_, err := run("vault", vaultKVGetJSONArgs(path), nil)
return err == nil
}
// kvPut writes one key, creating the path when absent and merging when present.
// The value travels on stdin only (never argv).
func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
merge := kvPathExists(run, path)
_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
return err
}
// --- handlers --------------------------------------------------------------
func vaultKVGet(args []string) error {
hardenProcess()
ensureVaultAddr() // own token, NOT the scoped one (see file header)
var path, field string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--field" && i+1 < len(args):
field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && path == "":
path = a
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
}
if field != "" {
val, err := kvGetField(realRunner, path, field)
if err != nil {
return err
}
emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
return nil
}
// No --field → the whole secret. All values, so refuse a bare TTY (like
// `vault get --json`): pick a --field for the clipboard path, or pipe it.
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
}
out, err := kvGetJSON(realRunner, path)
if err != nil {
return err
}
fmt.Println(out)
return nil
}
func vaultKVList(args []string) error {
ensureVaultAddr()
var path string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
path = a
break
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv list <path>")
}
keys, err := kvList(realRunner, path)
if err != nil {
return err
}
for _, k := range keys {
fmt.Println(k)
}
return nil
}
func vaultKVPut(args []string) error {
hardenProcess()
ensureVaultAddr()
var path, key string
for _, a := range args {
if strings.HasPrefix(a, "-") {
continue
}
switch {
case path == "":
path = a
case key == "":
key = a
}
}
if path == "" || key == "" {
return fmt.Errorf("usage: homelab vault kv put <path> <key> (value read from stdin)")
}
value, err := readSecretValue("Value for " + key + ": ")
if err != nil {
return err
}
if value == "" {
return fmt.Errorf("empty value; aborting (nothing written)")
}
if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
}
fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
return nil
}
// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
// is read verbatim (all trailing CR/LF characters trimmed, internal newlines
// preserved so multi-line values like PEM keys survive); an interactive TTY is
// prompted without echo.
func readSecretValue(prompt string) (string, error) {
fi, err := os.Stdin.Stat()
if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
b, rerr := io.ReadAll(os.Stdin)
if rerr != nil {
return "", rerr
}
return strings.TrimRight(string(b), "\r\n"), nil
}
return promptNoEcho(prompt)
}
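`vaultKVPutArgs` depends on `kvWriteVerb`, which is not part of this diff. As a rough sketch, assuming it simply returns the verb tokens described in the comment on `vaultKVPutArgs` (`kv patch -method=rw` when merging, `kv put` otherwise), the argv construction could look like the following. The names `kvWriteVerbSketch` and `putArgsSketch` are hypothetical.

```go
package main

import "fmt"

// kvWriteVerbSketch is a hypothetical stand-in for the unseen kvWriteVerb.
// merge=true selects a read-modify-write patch; merge=false a plain put.
func kvWriteVerbSketch(merge bool) []string {
	if merge {
		return []string{"kv", "patch", "-method=rw"}
	}
	return []string{"kv", "put"}
}

// putArgsSketch mirrors vaultKVPutArgs: the value is referenced as "<key>=-",
// which tells the vault CLI to read it from stdin, so it never appears in argv.
func putArgsSketch(merge bool, path, key string) []string {
	return append(kvWriteVerbSketch(merge), path, key+"=-")
}

func main() {
	fmt.Println(putArgsSketch(false, "secret/demo", "api_key")) // first write to a path
	fmt.Println(putArgsSketch(true, "secret/demo", "api_key"))  // path already holds data
}
```

The path `secret/demo` and key `api_key` are placeholders for illustration only.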

File diff suppressed because it is too large


@ -1,212 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
func workCommands() []Command {
return []Command{
{Path: []string{"work", "start"}, Tier: TierWrite,
Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
{Path: []string{"work", "land"}, Tier: TierWrite,
Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
{Path: []string{"work", "clean"}, Tier: TierWrite,
Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
}
}
// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
for i, a := range args {
if a == name && i+1 < len(args) {
return args[i+1]
}
if strings.HasPrefix(a, name+"=") {
return strings.TrimPrefix(a, name+"=")
}
}
return ""
}
func remotesOrEmpty(repoRoot string) []string {
r, _ := gitRemotes(repoRoot)
return r
}
// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work start <topic>")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
branch := currentUser() + "/" + topic
wtRel := filepath.Join(".worktrees", topic)
ensureWorktreesIgnored(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch %s failed: %w", remote, err)
}
if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
return fmt.Errorf("worktree add failed: %w", err)
}
wtPath := filepath.Join(repoRoot, wtRel)
fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
return nil
}
// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
verifyCmd := flagValue(args, "--verify-cmd")
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
if err != nil {
return err
}
if branch == "master" || branch == "main" {
return fmt.Errorf("refusing to land: already on %s", branch)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch failed: %w", err)
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
}
if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
return fmt.Errorf("not landing: %w", err)
}
if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
return landFallback(repoRoot, flags, remote, branch, err)
}
fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
if containsArg(args, "--no-ci-watch") {
fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
return nil
}
landed, err := gitOutput(repoRoot, "rev-parse", "HEAD")
if err != nil {
return fmt.Errorf("landed, but could not resolve HEAD to watch CI: %w", err)
}
fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
if err := ciWatch([]string{landed}); err != nil {
return fmt.Errorf("landed, but CI did not go green: %w", err)
}
return nil
}
// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
if verifyCmd != "" {
fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
}
if isFile(filepath.Join(repoRoot, "go.mod")) {
fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
return runStreamingIn(repoRoot, "go", "test", "./...")
}
if allowSkip {
fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
return nil
}
return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}
// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
var lastErr error
for i := 0; i < attempts; i++ {
err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
if err == nil {
return nil
}
lastErr = err
if i < attempts-1 {
fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return err
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return err
}
}
}
return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
return fmt.Errorf("fallback branch push also failed: %w", err)
}
fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
return nil
}
// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
flags := cryptFlagsFor(repoRoot)
wtRel := filepath.Join(".worktrees", topic)
branch := currentUser() + "/" + topic
if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
}
if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
fmt.Printf("homelab: removed worktree %s; branch %s was kept\n", wtRel, branch)
return nil
}
fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
return nil
}
// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
return
}
gi := filepath.Join(repoRoot, ".gitignore")
f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
if err != nil {
return
}
defer f.Close()
if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
}
}


@ -1,32 +0,0 @@
package main
import "testing"
func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
dir := t.TempDir() // no go.mod, no verify cmd
if err := runVerify(dir, "", false); err == nil {
t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
}
if err := runVerify(dir, "", true); err != nil {
t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
}
}
func TestFlagValue(t *testing.T) {
cases := []struct {
args []string
name string
want string
}{
{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
{[]string{"topic"}, "--verify-cmd", ""},
{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
}
for _, c := range cases {
if got := flagValue(c.args, c.name); got != c.want {
t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
}
}
}


@ -1,104 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"sort"
"strings"
)
// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string
const (
TierRead Tier = "read"
TierWrite Tier = "write"
)
// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
Path []string
Tier Tier
Summary string
Run func(args []string) error
}
// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
best := -1
bestLen := 0
for i, c := range reg {
if len(c.Path) > len(args) {
continue
}
match := true
for j, p := range c.Path {
if args[j] != p {
match = false
break
}
}
if match && len(c.Path) >= bestLen {
best = i
bestLen = len(c.Path)
}
}
if best < 0 {
return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
}
matched := reg[best]
runErr := matched.Run(args[bestLen:])
emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
return runErr
}
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }
// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
out := make([]Command, len(reg))
copy(out, reg)
sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
return out
}
// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
cmds := sortedByName(reg)
width := 0
for _, c := range cmds {
if n := len(c.name()); n > width {
width = n
}
}
var b strings.Builder
for _, c := range cmds {
fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
}
return b.String()
}
// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
type entry struct {
Command string `json:"command"`
Tier string `json:"tier"`
Summary string `json:"summary"`
}
entries := make([]entry, 0, len(reg))
for _, c := range sortedByName(reg) {
entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
}
b, err := json.MarshalIndent(entries, "", " ")
if err != nil {
return "", err
}
return string(b), nil
}
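Longest-prefix routing matters when one verb path is a prefix of another, as with `vault kv` (help) and `vault kv get`. The sketch below shows the selection rule on a reduced type. `route` and `pick` are hypothetical simplifications of `Command` and `dispatch`, and ties here go to the first match rather than the last.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a reduced, hypothetical version of Command: a verb path and a handler.
type route struct {
	path []string
	run  func(args []string) string
}

// pick returns the index of the route whose path is the longest prefix of args,
// or -1 when no route matches.
func pick(routes []route, args []string) int {
	best, bestLen := -1, -1
	for i, r := range routes {
		if len(r.path) > len(args) || len(r.path) <= bestLen {
			continue
		}
		match := true
		for j, p := range r.path {
			if args[j] != p {
				match = false
				break
			}
		}
		if match {
			best, bestLen = i, len(r.path)
		}
	}
	return best
}

func main() {
	routes := []route{
		{[]string{"vault", "kv"}, func(a []string) string { return "help" }},
		{[]string{"vault", "kv", "get"}, func(a []string) string { return "get " + strings.Join(a, " ") }},
	}
	args := []string{"vault", "kv", "get", "secret/demo"}
	i := pick(routes, args)
	// The handler receives only the tokens after its own path.
	fmt.Println(routes[i].run(args[len(routes[i].path):]))
}
```

With the shorter `vault kv` route registered first, the call still reaches the `get` handler because its three-token path is the longer match.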


@ -1,73 +0,0 @@
package main
import (
"encoding/json"
"reflect"
"strings"
"testing"
)
// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
var gotArgs []string
ran := ""
reg := []Command{
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
}
if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
t.Fatalf("dispatch returned error: %v", err)
}
if ran != "tf plan" {
t.Fatalf("routed to %q, want %q", ran, "tf plan")
}
if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
t.Fatalf("command got args %v, want %v", gotArgs, want)
}
}
func TestDispatchUnknownCommandErrors(t *testing.T) {
reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
if err := dispatch(reg, []string{"bogus"}); err == nil {
t.Fatal("expected error for unknown command, got nil")
}
}
// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
reg := []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
}
out := manifestText(reg)
for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
if !strings.Contains(out, want) {
t.Errorf("manifest text missing %q\n---\n%s", want, out)
}
}
// sorted: claim (c) must appear before tf plan (t)
if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
t.Errorf("manifest not sorted by path:\n%s", out)
}
}
func TestManifestJSONIsParsableAndTagged(t *testing.T) {
reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
out, err := manifestJSON(reg)
if err != nil {
t.Fatalf("manifestJSON error: %v", err)
}
var got []map[string]string
if err := json.Unmarshal([]byte(out), &got); err != nil {
t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
}
if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
t.Fatalf("unexpected manifest JSON: %v", got)
}
}


@ -1,164 +0,0 @@
package main
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
ns string // edges touching this namespace (either direction)
src string // edges where src_ns = this
dst string // edges where dst_ns = this
peersOf string // distinct peers of this namespace (both directions)
newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
denied bool // action = 'deny' only
asJSON bool // wrap result as a JSON array
limit int // row cap (default 200)
}
// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
o := edgesOpts{limit: 200}
i := 0
for i < len(args) {
a := args[i]
key, inline, hasInline := a, "", false
if eq := strings.IndexByte(a, '='); eq >= 0 {
key, inline, hasInline = a[:eq], a[eq+1:], true
}
needVal := func() (string, error) {
if hasInline {
return inline, nil
}
if i+1 < len(args) {
i++
return args[i], nil
}
return "", fmt.Errorf("flag %s needs a value", key)
}
var err error
switch key {
case "--ns":
o.ns, err = needVal()
case "--src":
o.src, err = needVal()
case "--dst":
o.dst, err = needVal()
case "--peers-of":
o.peersOf, err = needVal()
case "--new-since":
o.newSince, err = needVal()
case "--denied":
o.denied = true
case "--json":
o.asJSON = true
case "--limit":
var v string
if v, err = needVal(); err == nil {
if o.limit, err = strconv.Atoi(v); err != nil {
err = fmt.Errorf("--limit must be an integer: %q", v)
}
}
default:
return o, fmt.Errorf("unknown flag: %s", a)
}
if err != nil {
return o, err
}
i++
}
return o, nil
}
// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard — anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
func validateNS(s string) error {
if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
return fmt.Errorf("invalid namespace name: %q", s)
}
return nil
}
// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
var (
durRE = regexp.MustCompile(`^(\d+)([smhd])$`)
dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)
// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
if m := durRE.FindStringSubmatch(v); m != nil {
unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
}
if dateRE.MatchString(v) {
return "first_seen >= " + sqlStr(v), nil
}
return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}
// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
limit := o.limit
if limit <= 0 {
limit = 200
}
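// peers-of is a distinct-peer summary, a different shape from the row list.
// The other filters (--ns, --src, --dst, --denied, --new-since) and --json are
// not applied on this path; only --limit is.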
if o.peersOf != "" {
if err := validateNS(o.peersOf); err != nil {
return "", err
}
p := sqlStr(o.peersOf)
return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
") t ORDER BY peer LIMIT %d", p, p, limit), nil
}
var conds []string
for _, f := range []struct{ val, tmpl string }{
{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
{o.src, "src_ns = %s"},
{o.dst, "dst_ns = %s"},
} {
if f.val == "" {
continue
}
if err := validateNS(f.val); err != nil {
return "", err
}
conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
}
if o.denied {
conds = append(conds, "action = 'deny'")
}
if o.newSince != "" {
c, err := newSinceCond(o.newSince)
if err != nil {
return "", err
}
conds = append(conds, c)
}
q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
if len(conds) > 0 {
q += " WHERE " + strings.Join(conds, " AND ")
}
q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
if o.asJSON {
q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
}
return q, nil
}
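The injection guard above combines an allow-list check (`validateNS`) with literal quoting (`sqlStr`). A compact sketch of that validate-then-quote pattern follows. `safeLiteral` is a hypothetical helper that merges the two steps for illustration; the regular expression and length cap are copied from the code above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nameRE is the same allow-list used by nsRE above: it must start with an
// alphanumeric and continue with alphanumerics, '_', '.', or '-'.
var nameRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

// safeLiteral (hypothetical) rejects anything outside the allow-list, then
// renders a SQL string literal with single quotes doubled as a second layer.
func safeLiteral(s string) (string, error) {
	if s == "" || len(s) > 63 || !nameRE.MatchString(s) {
		return "", fmt.Errorf("invalid namespace name: %q", s)
	}
	return "'" + strings.ReplaceAll(s, "'", "''") + "'", nil
}

func main() {
	for _, in := range []string{"immich", "kube-system", "a'; DROP TABLE edge;--"} {
		lit, err := safeLiteral(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("src_ns = " + lit)
	}
}
```

Rejecting unexpected input outright is the primary defense here; the quote doubling only matters if the allow-list is ever loosened.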


@ -1,163 +0,0 @@
package main
import (
"strings"
"testing"
)
func TestParseEdgesArgs(t *testing.T) {
cases := []struct {
name string
args []string
want edgesOpts
}{
{"defaults", nil, edgesOpts{limit: 200}},
{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, err := parseEdgesArgs(c.args)
if err != nil {
t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
}
if got != c.want {
t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
}
})
}
}
func TestParseEdgesArgsErrors(t *testing.T) {
for _, args := range [][]string{
{"--limit", "abc"},
{"--bogus"},
} {
if _, err := parseEdgesArgs(args); err == nil {
t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
}
}
}
func TestBuildEdgesQueryDefaults(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{limit: 200})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
if !strings.Contains(q, want) {
t.Errorf("query %q missing %q", q, want)
}
}
if strings.Contains(q, "WHERE") {
t.Errorf("no-filter query should have no WHERE: %q", q)
}
}
func TestBuildEdgesQueryFilters(t *testing.T) {
cases := []struct {
name string
o edgesOpts
want string
}{
{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
q, err := buildEdgesQuery(c.o)
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
t.Errorf("query %q missing WHERE/%q", q, c.want)
}
})
}
}
func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
t.Errorf("combined filters not AND'd: %q", q)
}
}
func TestBuildEdgesQueryPeersOf(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
if !strings.Contains(q, want) {
t.Errorf("peers-of query %q missing %q", q, want)
}
}
}
func TestBuildEdgesQueryJSON(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
t.Errorf("json query missing json_agg wrapper: %q", q)
}
}
func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
}
}
}
func TestNewSinceCond(t *testing.T) {
cases := []struct {
in string
want string
}{
{"24h", "first_seen >= now() - interval '24 hours'"},
{"7d", "first_seen >= now() - interval '7 days'"},
{"30m", "first_seen >= now() - interval '30 minutes'"},
{"2026-06-28", "first_seen >= '2026-06-28'"},
}
for _, c := range cases {
got, err := newSinceCond(c.in)
if err != nil {
t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
}
if got != c.want {
t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
}
}
for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
if _, err := newSinceCond(bad); err == nil {
t.Errorf("newSinceCond(%q) expected error, got nil", bad)
}
}
}
func TestValidateNS(t *testing.T) {
for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
if err := validateNS(ok); err != nil {
t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
}
}
for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
if err := validateNS(bad); err == nil {
t.Errorf("validateNS(%q) expected error, got nil", bad)
}
}
}


@ -1,99 +0,0 @@
package main
import (
"fmt"
"strings"
)
// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"
// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
var reg []Command
reg = append(reg, claimCommands()...)
reg = append(reg, tfCommands()...)
reg = append(reg, workCommands()...)
reg = append(reg, k8sCommands()...)
reg = append(reg, memoryCommands()...)
reg = append(reg, ciCommands()...)
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, edgesCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}
// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
if len(args) == 0 {
fmt.Print(usage())
return true, nil
}
switch args[0] {
case "help", "-h", "--help":
fmt.Print(usage())
return true, nil
case "version", "--version":
fmt.Println("homelab " + version)
return true, nil
case "manifest":
reg := buildRegistry()
if containsArg(args[1:], "--json") {
out, err := manifestJSON(reg)
if err != nil {
return true, err
}
fmt.Println(out)
return true, nil
}
fmt.Print(manifestText(reg))
return true, nil
}
if strings.HasPrefix(args[0], "-") {
return false, nil
}
reg := buildRegistry()
if !isCommandGroup(reg, args[0]) {
return false, nil
}
return true, dispatch(reg, args)
}
func isCommandGroup(reg []Command, group string) bool {
for _, c := range reg {
if len(c.Path) > 0 && c.Path[0] == group {
return true
}
}
return false
}
func containsArg(args []string, want string) bool {
for _, a := range args {
if a == want {
return true
}
}
return false
}
func usage() string {
var b strings.Builder
fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
if line != "" {
b.WriteString(" " + line + "\n")
}
}
b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
b.WriteString(" version print version\n")
b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
return b.String()
}
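The handled/err contract of dispatchTop can be sketched in isolation. This is a minimal illustration, not the deleted file's code: the `command` type and the `route` function are hypothetical stand-ins, and the built-in `help`, `version` and `manifest` cases are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// command is a hypothetical stand-in for the registry's Command type;
// only the Path field matters for routing.
type command struct {
	Path []string
}

// route reports whether args name a known verb group. A false result tells
// the caller to fall back to its legacy flag-based path.
func route(reg []command, args []string) (handled bool) {
	if len(args) == 0 {
		return true // the dispatcher in the diff prints usage in this case
	}
	if strings.HasPrefix(args[0], "-") {
		return false // flags such as -use-case=... belong to the legacy path
	}
	for _, c := range reg {
		if len(c.Path) > 0 && c.Path[0] == args[0] {
			return true
		}
	}
	return false
}

func main() {
	reg := []command{{Path: []string{"k8s", "logs"}}, {Path: []string{"tf", "plan"}}}
	for _, args := range [][]string{{"k8s", "logs", "immich"}, {"-use-case=ip"}, {"bogus"}} {
		fmt.Printf("%v handled=%v\n", args, route(reg, args))
	}
}
```

Returning a separate handled flag keeps "this is not one of my verbs" distinct from "the verb ran and failed", so the caller can choose between falling back and exiting non-zero.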

@ -1,138 +0,0 @@
package main
import (
"fmt"
"os/exec"
"strings"
)
// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
func kubectlBase(ns string, args ...string) []string {
var full []string
if ns != "" {
full = append(full, "-n", ns)
}
return append(full, args...)
}
func kubectlStream(ns string, args ...string) error {
return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}
// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
return strings.TrimSpace(string(out)), err
}
// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
app string
ns string
pod string
container string
selector string
tty bool
rest []string // passthrough flags and, after `--`, the exec command
}
// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
t := k8sTarget{}
i := 0
take := func() string {
if i+1 < len(args) {
i++
return args[i]
}
return ""
}
for i = 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
t.rest = append(t.rest, args[i+1:]...)
return t
case a == "-n" || a == "--namespace":
t.ns = take()
case strings.HasPrefix(a, "--namespace="):
t.ns = strings.TrimPrefix(a, "--namespace=")
case a == "--pod":
t.pod = take()
case strings.HasPrefix(a, "--pod="):
t.pod = strings.TrimPrefix(a, "--pod=")
case a == "-c" || a == "--container":
t.container = take()
case strings.HasPrefix(a, "--container="):
t.container = strings.TrimPrefix(a, "--container=")
case a == "-l" || a == "--selector":
t.selector = take()
case strings.HasPrefix(a, "--selector="):
t.selector = strings.TrimPrefix(a, "--selector=")
case a == "--tty" || a == "-it" || a == "-ti":
t.tty = true
case !strings.HasPrefix(a, "-") && t.app == "":
t.app = a
default:
t.rest = append(t.rest, a)
}
}
return t
}
// namespace defaults to the app name (most namespaces hold exactly one app).
func (t k8sTarget) namespace() string {
if t.ns != "" {
return t.ns
}
return t.app
}
// objectRef is the kubectl object for logs/exec: an explicit pod, else
// deploy/<app> (kubectl resolves a pod from the Deployment).
func (t k8sTarget) objectRef() string {
if t.pod != "" {
return "pod/" + t.pod
}
return "deploy/" + t.app
}
// --- database access (the dbaas exec pattern) ---
type dbPlan struct {
ns string
pod string // explicit pod (e.g. mysql-standalone-0)
selector string // resolve the pod by this label when pod == "" (CNPG primary)
container string // "" = default container
argv []string // command + args to run inside the pod
}
// planDBExec builds the in-pod command to run sql against app's database.
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
// Service, not an exec target), psql -U postgres -d <db>.
// MySQL: mysql-standalone-0, password from env (never on the command line).
// dbName defaults to app. sql empty => interactive client.
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
if dbName == "" {
dbName = app
}
if mysql {
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
if sql != "" {
inner += " -e " + shellQuote(sql)
}
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
}
argv := []string{"psql", "-U", "postgres", "-d", dbName}
if sql != "" {
argv = append(argv, "-tAc", sql)
}
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
}
// shellQuote single-quotes s for safe embedding in a bash -c string.
func shellQuote(s string) string {
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
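The quoting used by planDBExec for the MySQL path can be tried on its own. The sketch below copies shellQuote from the file above; `mysqlInner` is a hypothetical helper that repeats only the string-building part of planDBExec, and the database name and SQL are made-up inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote single-quotes s for embedding in a bash -c string: each embedded
// single quote closes the string, adds an escaped quote, and reopens it.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// mysqlInner (hypothetical) builds the in-pod command line. The password is
// referenced as an environment variable, so it is expanded inside the pod and
// never appears in the caller's argv.
func mysqlInner(dbName, sql string) string {
	inner := `mysql -u root -p"$MYSQL_ROOT_PASSWORD" ` + shellQuote(dbName)
	if sql != "" {
		inner += " -e " + shellQuote(sql)
	}
	return inner
}

func main() {
	fmt.Println(mysqlInner("shop", "SELECT 'x'"))
}
```

Because the database name and the SQL are both single-quoted, bash performs no variable or command substitution on them; only the password reference is left in double quotes so that it does expand.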

@ -1,65 +0,0 @@
package main
import (
"reflect"
"strings"
"testing"
)
func TestParseK8sTarget(t *testing.T) {
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
}
}
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
t.Errorf("namespace() = %q, want immich", ns)
}
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
t.Errorf("namespace() = %q, want dbaas", ns)
}
}
func TestK8sTargetObjectRef(t *testing.T) {
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
t.Errorf("objectRef() = %q, want deploy/tripit", r)
}
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
}
}
func TestPlanDBExecPostgresDefault(t *testing.T) {
p := planDBExec("fire-planner", "", "SELECT 1", false)
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
// label rather than naming an (un-exec-able) Service.
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
t.Fatalf("unexpected pg target: %+v", p)
}
// db name defaults to the app; SQL passed via -tAc
joined := strings.Join(p.argv, " ")
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
t.Fatalf("pg argv missing db/sql: %v", p.argv)
}
}
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
if p.pod != "mysql-standalone-0" {
t.Fatalf("unexpected mysql pod: %+v", p)
}
inner := strings.Join(p.argv, " ")
// password must come from the env var, never inline
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
}
}
func TestShellQuoteEscapes(t *testing.T) {
if got := shellQuote("a'b"); got != `'a'\''b'` {
t.Fatalf("shellQuote = %q", got)
}
}

@ -26,16 +26,8 @@ var (
 )
 func main() {
-// homelab verb surface (work/tf/claim/...) is tried first; if the args are
-// not a homelab verb, fall through to the legacy webhook -use-case path.
-if handled, err := dispatchTop(os.Args[1:]); handled {
+err := run()
 if err != nil {
-fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
-os.Exit(1)
-}
-return
-}
-if err := run(); err != nil {
 glog.Errorf("run failed: %s", err.Error())
 os.Exit(255)
 }

@ -1,103 +0,0 @@
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
)
// defaultMemoryURL is used when no env override is present (agents normally have
// CLAUDE_MEMORY_API_URL set by the memory hooks).
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
type memoryClient struct {
base string
key string
http *http.Client
}
func firstEnv(keys ...string) string {
for _, k := range keys {
if v := os.Getenv(k); v != "" {
return v
}
}
return ""
}
func resolveMemoryBase() string {
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
return strings.TrimRight(b, "/")
}
return defaultMemoryURL
}
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
// the MCP wraps), so it works even when the MCP frontend is down.
func newMemoryClient() (*memoryClient, error) {
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
if key == "" {
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
}
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
}
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
var r io.Reader
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return nil, err
}
r = bytes.NewReader(b)
}
req, err := http.NewRequest(method, c.base+path, r)
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+c.key)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
out, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
}
return out, nil
}
// Request bodies mirror src/claude_memory/api/models.py.
type memRecallReq struct {
Context string `json:"context"`
ExpandedQuery string `json:"expanded_query,omitempty"`
Category string `json:"category,omitempty"`
SortBy string `json:"sort_by,omitempty"`
Limit int `json:"limit,omitempty"`
}
type memStoreReq struct {
Content string `json:"content"`
Category string `json:"category,omitempty"`
Tags string `json:"tags,omitempty"`
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
Importance float64 `json:"importance"`
ForceSensitive bool `json:"force_sensitive,omitempty"`
}
type memUpdateReq struct {
Content *string `json:"content,omitempty"`
Tags *string `json:"tags,omitempty"`
Importance *float64 `json:"importance,omitempty"`
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
}
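The request types above differ in how they treat zero values: memStoreReq sends importance unconditionally, while memUpdateReq uses pointer fields with omitempty. A small sketch of why that distinction matters, using hypothetical trimmed-down structs (`valueReq`, `pointerReq`) rather than the real ones:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// valueReq (hypothetical) uses a plain float with omitempty: a zero value is
// indistinguishable from "not set" and is dropped from the JSON.
type valueReq struct {
	Importance float64 `json:"importance,omitempty"`
}

// pointerReq (hypothetical) uses a pointer: nil means "leave unchanged",
// while a pointer to 0 is an explicit update and is kept.
type pointerReq struct {
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals v and returns the JSON text (errors are not expected for
// these simple structs, so they are surfaced as a panic).
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encode(valueReq{Importance: 0}))       // {}
	fmt.Println(encode(pointerReq{}))                  // {}
	fmt.Println(encode(pointerReq{Importance: &zero})) // {"importance":0}
}
```

For a partial-update endpoint the pointer form lets a caller set importance to 0 without also having to resend content or tags.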

@ -1,102 +0,0 @@
package main
import (
"encoding/json"
"os"
"strings"
"testing"
"unicode/utf8"
)
func TestRenderMemoriesFullContent(t *testing.T) {
// The pretty view must NOT truncate content: the old 240-rune preview cut
// memories mid-sentence, misled agents into thinking no full-content
// read-back existed, and made blind `update --content` from the preview
// destroy the stored tail. Full passthrough also removes the mid-rune-cut
// invalid-UTF-8 class by construction — nothing is ever sliced.
long := strings.Repeat("я", 300) + strings.Repeat("a", 300)
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, long) {
t.Fatalf("content was truncated: %q", got)
}
if strings.Contains(got, "…") {
t.Fatalf("ellipsis in output — truncation still active: %q", got)
}
if !utf8.ValidString(got) {
t.Fatalf("invalid UTF-8 in output: %q", got)
}
if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") {
t.Fatalf("line format broken: %q", got)
}
}
func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) {
// Consumers (the recall hook, terminal skims) rely on one memory per line;
// multi-line content is flattened, never split across lines.
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, "line one line two line three") {
t.Fatalf("newlines not flattened: %q", got)
}
}
func TestRenderMemoriesEdgeCases(t *testing.T) {
if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" {
t.Fatalf("empty list: %q", got)
}
// --json and unparseable responses pass through raw.
if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" {
t.Fatalf("json passthrough: %q", got)
}
if got := renderMemories([]byte(`not json`), false); got != "not json\n" {
t.Fatalf("unparseable passthrough: %q", got)
}
}
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
os.Unsetenv("CLAUDE_MEMORY_API_URL")
os.Unsetenv("MEMORY_API_URL")
if got := resolveMemoryBase(); got != defaultMemoryURL {
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
}
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
if got := resolveMemoryBase(); got != "https://m.example" {
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
}
}
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
s := string(b)
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
t.Fatalf("memStoreReq JSON missing fields: %s", s)
}
}
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
tags := "a,b"
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
s := string(b)
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
t.Fatalf("unset update fields must be omitted: %s", s)
}
if !strings.Contains(s, `"tags":"a,b"`) {
t.Fatalf("set field missing: %s", s)
}
}
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
b, _ := json.Marshal(memRecallReq{Context: "hi"})
s := string(b)
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
t.Fatalf("empty optionals must be omitted: %s", s)
}
}

@ -1,58 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
func presenceScript() string {
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
return p
}
home, err := os.UserHomeDir()
if err != nil {
return "presence"
}
return filepath.Join(home, "code", "scripts", "presence")
}
// validateLabel checks a presence label is <kind>:<name> with a known kind.
func validateLabel(label string) error {
parts := strings.SplitN(label, ":", 2)
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
}
for _, k := range validPresenceKinds {
if parts[0] == k {
return nil
}
}
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
}
// presenceClaim claims label on the board with a purpose note.
func presenceClaim(label, purpose string) error {
if err := validateLabel(label); err != nil {
return err
}
args := []string{"claim", label}
if purpose != "" {
args = append(args, "--purpose", purpose)
}
return runStreaming(presenceScript(), args...)
}
// presenceRelease releases a prior claim on label.
func presenceRelease(label string) error {
if err := validateLabel(label); err != nil {
return err
}
return runStreaming(presenceScript(), "release", label)
}

@ -1,24 +0,0 @@
package main
import "testing"
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
good := []string{
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
}
for _, l := range good {
if err := validateLabel(l); err != nil {
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
}
}
}
func TestValidateLabelRejectsBadLabels(t *testing.T) {
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
for _, l := range bad {
if err := validateLabel(l); err == nil {
t.Errorf("validateLabel(%q) = nil, want error", l)
}
}
}

@ -1,76 +0,0 @@
package main
import (
"context"
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os/exec"
"strings"
"time"
)
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
const internalLBIP = "10.0.20.203"
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
// host:443:ip`. TLS verification is skipped (these are reachability/observability
// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
d := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if i := strings.LastIndex(addr, ":"); i >= 0 {
addr = ip + addr[i:]
}
return d.DialContext(ctx, network, addr)
},
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
return &http.Client{Timeout: timeout, Transport: tr}
}
// probeURL issues a GET and returns status code + elapsed time.
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
start := time.Now()
resp, err := c.Get(rawurl)
dur := time.Since(start)
if err != nil {
return 0, dur, err
}
resp.Body.Close()
return resp.StatusCode, dur, nil
}
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
u := "https://" + host + path
if len(q) > 0 {
u += "?" + q.Encode()
}
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return body, nil
}
// dig runs `dig +short` against a resolver, optionally for a record type.
func dig(name, server, rrtype string) (string, error) {
args := []string{"+short", "+time=3", "+tries=1"}
if rrtype != "" {
args = append(args, rrtype)
}
args = append(args, name, "@"+server)
out, err := exec.Command("dig", args...).Output()
return strings.TrimSpace(string(out)), err
}
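The dial override behind clientDialingIP can be exercised against a local test server. The following is a sketch under stated assumptions: it uses plain HTTP, so SNI and certificate handling are not exercised; it splits the address with net.SplitHostPort instead of strings.LastIndex; and `clientDialing`, `hostSeenByServer` and the hostname `service.invalid` are illustrative names that do not appear in the file above.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// clientDialing returns a client that connects to ip for every request while
// keeping the port from the URL. The URL host still ends up in the Host header.
func clientDialing(ip string, timeout time.Duration) *http.Client {
	d := &net.Dialer{Timeout: 2 * time.Second}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			_, port, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			return d.DialContext(ctx, network, net.JoinHostPort(ip, port))
		},
	}
	return &http.Client{Timeout: timeout, Transport: tr}
}

// hostSeenByServer starts a loopback server, requests fakeHost through the
// override, and returns the hostname the server saw in the Host header.
func hostSeenByServer(fakeHost string) (string, error) {
	var seen string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Host
	}))
	defer srv.Close()

	ip, port, err := net.SplitHostPort(srv.Listener.Addr().String())
	if err != nil {
		return "", err
	}
	resp, err := clientDialing(ip, 5*time.Second).Get("http://" + net.JoinHostPort(fakeHost, port) + "/")
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	host, _, err := net.SplitHostPort(seen)
	return host, err
}

func main() {
	host, err := hostSeenByServer("service.invalid")
	fmt.Println(host, err)
}
```

The request succeeds even though the hostname does not resolve, because name resolution is bypassed at dial time. This is the same effect `curl --resolve` gives on the command line.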

@ -1,49 +0,0 @@
package main
import "testing"
func TestQueryArg(t *testing.T) {
if got := queryArg([]string{"up"}, nil); got != "up" {
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
}
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
t.Errorf(`--json should be dropped, got %q`, got)
}
// single quoted PromQL arrives as one token
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
t.Errorf(`quoted query mangled: %q`, got)
}
// value-flags and their values are skipped, query survives
vf := map[string]bool{"--since": true, "--limit": true}
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
t.Errorf(`value-flag skipping failed: %q`, got)
}
}
func TestLabelStr(t *testing.T) {
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
t.Errorf("labelStr = %q", got)
}
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
t.Errorf("labelStr (no __name__) = %q", got)
}
}
func TestOneLineList(t *testing.T) {
if got := oneLineList(" "); got != "(none)" {
t.Errorf("empty = %q, want (none)", got)
}
if got := oneLineList("a\nb"); got != "a, b" {
t.Errorf("multi = %q, want 'a, b'", got)
}
}
func TestHostOnly(t *testing.T) {
if got := hostOnly("foo.me/path"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
if got := hostOnly("foo.me"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
}

@ -1,101 +0,0 @@
package main
import (
"os"
"os/exec"
"os/user"
"path/filepath"
"strings"
)
// preferRemote picks the canonical remote: forgejo if present, else origin,
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
func preferRemote(remotes []string) string {
has := map[string]bool{}
for _, r := range remotes {
has[r] = true
}
switch {
case has["forgejo"]:
return "forgejo"
case has["origin"]:
return "origin"
case len(remotes) > 0:
return remotes[0]
default:
return ""
}
}
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
func hasGitCryptAttr(gitattributes string) bool {
return strings.Contains(gitattributes, "filter=git-crypt")
}
// gitCryptFlags are the per-command flags that disable smudge/clean so git
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
func gitCryptFlags() []string {
return []string{
"-c", "filter.git-crypt.smudge=cat",
"-c", "filter.git-crypt.clean=cat",
"-c", "filter.git-crypt.required=false",
}
}
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
func gitOutput(dir string, args ...string) (string, error) {
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
out, err := cmd.Output()
return strings.TrimSpace(string(out)), err
}
func gitRepoRoot(dir string) (string, error) {
return gitOutput(dir, "rev-parse", "--show-toplevel")
}
// gitRemotes lists configured remote names for the repo at dir.
func gitRemotes(dir string) ([]string, error) {
out, err := gitOutput(dir, "remote")
if err != nil {
return nil, err
}
if out == "" {
return nil, nil
}
return strings.Split(out, "\n"), nil
}
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
func isGitCryptRepo(repoRoot string) bool {
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
if err != nil {
return false
}
return hasGitCryptAttr(string(b))
}
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
// else nil. These are injected per-command and never persisted.
func cryptFlagsFor(repoRoot string) []string {
if isGitCryptRepo(repoRoot) {
return gitCryptFlags()
}
return nil
}
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
return runStreamingIn("", "git", full...)
}
// currentUser returns the OS username for branch naming (<user>/<topic>).
func currentUser() string {
if u := os.Getenv("USER"); u != "" {
return u
}
if u, err := user.Current(); err == nil && u.Username != "" {
return u.Username
}
return "user"
}

@ -1,37 +0,0 @@
package main
import "testing"
func TestPreferRemote(t *testing.T) {
cases := []struct {
in []string
want string
}{
{[]string{"origin", "forgejo"}, "forgejo"},
{[]string{"forgejo"}, "forgejo"},
{[]string{"origin"}, "origin"},
{[]string{"upstream"}, "upstream"},
{nil, ""},
}
for _, c := range cases {
if got := preferRemote(c.in); got != c.want {
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
}
}
}
func TestHasGitCryptAttr(t *testing.T) {
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
t.Error("expected git-crypt detected")
}
if hasGitCryptAttr("*.md text\n*.png binary") {
t.Error("expected no git-crypt")
}
}
func TestGitCryptFlagsShape(t *testing.T) {
f := gitCryptFlags()
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
t.Fatalf("unexpected git-crypt flags: %v", f)
}
}

@ -1,23 +0,0 @@
package main
import (
"os"
"os/exec"
)
// runStreaming executes name with args, wiring std streams to this process so
// the caller sees live output, and returns the command's error (non-nil on
// non-zero exit — preserved so homelab's own exit code reflects the child's).
func runStreaming(name string, args ...string) error {
return runStreamingIn("", name, args...)
}
// runStreamingIn is runStreaming with a working directory (empty = inherit).
func runStreamingIn(dir, name string, args ...string) error {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

@ -1,54 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// findInfraRoot walks up from start to the infra repo root — the directory
// holding both terragrunt.hcl and a stacks/ directory.
func findInfraRoot(start string) (string, error) {
dir := start
for {
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
return dir, nil
}
parent := filepath.Dir(dir)
if parent == dir {
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
}
dir = parent
}
}
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
func resolveStack(infraRoot, name string) (string, error) {
dir := filepath.Join(infraRoot, "stacks", name)
if isDir(dir) {
return dir, nil
}
avail := listStacks(infraRoot)
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
}
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
func listStacks(infraRoot string) []string {
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
if err != nil {
return nil
}
var out []string
for _, e := range entries {
if e.IsDir() {
out = append(out, e.Name())
}
}
sort.Strings(out)
return out
}
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }

@ -1,52 +0,0 @@
package main
import (
"os"
"path/filepath"
"testing"
)
func newInfraTree(t *testing.T, stacks ...string) string {
t.Helper()
root := t.TempDir()
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
t.Fatal(err)
}
for _, s := range stacks {
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
t.Fatal(err)
}
}
return root
}
func TestFindInfraRootWalksUp(t *testing.T) {
root := newInfraTree(t, "vault")
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
if err != nil {
t.Fatalf("findInfraRoot error: %v", err)
}
if got != root {
t.Fatalf("findInfraRoot = %q, want %q", got, root)
}
}
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
if _, err := findInfraRoot(t.TempDir()); err == nil {
t.Fatal("expected error outside an infra checkout")
}
}
func TestResolveStack(t *testing.T) {
root := newInfraTree(t, "vault", "monitoring")
dir, err := resolveStack(root, "vault")
if err != nil {
t.Fatalf("resolveStack error: %v", err)
}
if want := filepath.Join(root, "stacks", "vault"); dir != want {
t.Fatalf("resolveStack = %q, want %q", dir, want)
}
if _, err := resolveStack(root, "nonesuch"); err == nil {
t.Fatal("expected error for unknown stack")
}
}

@ -1,62 +0,0 @@
package main
import (
"bytes"
"encoding/json"
"net/http"
"os"
"strconv"
"strings"
"time"
)
// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"
// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
switch os.Getenv("HOMELAB_TELEMETRY") {
case "0", "off", "false", "no":
return
}
if verb == "" || strings.HasPrefix(verb, "usage") {
return // don't self-record the analytics reader
}
exit := 0
if runErr != nil {
exit = 1
}
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
Values: [][2]string{{
strconv.FormatInt(time.Now().UnixNano(), 10),
"exit=" + strconv.Itoa(exit) + " ver=" + version,
}},
}}})
if err != nil {
return
}
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
if err != nil {
return
}
req.Header.Set("Content-Type", "application/json")
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
if err != nil {
return
}
resp.Body.Close()
}
type lokiPush struct {
Streams []lokiStream `json:"streams"`
}
type lokiStream struct {
Stream map[string]string `json:"stream"`
Values [][2]string `json:"values"`
}

@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
 if err != nil {
 return errors.Wrapf(err, "Error reading response")
 }
-glog.Infof("Response: %s", string(responseBody))
+glog.Infof("Response:", string(responseBody))
 return nil
 }

@ -1,18 +0,0 @@
package main
import (
"strings"
"testing"
)
func TestUsageQuery(t *testing.T) {
got := usageQuery("30d", "")
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
if got != want {
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
}
withUser := usageQuery("7d", "emo")
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
}
}

@ -1,191 +0,0 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
)
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
const (
wpHost = "ci.viktorbarzin.me"
wpLBIP = "10.0.20.203"
)
type wpClient struct {
base string
token string
http *http.Client
}
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
func wpToken() string {
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
return t
}
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
if err != nil {
return ""
}
return strings.TrimSpace(string(out))
}
func newWPClient() (*wpClient, error) {
tok := wpToken()
if tok == "" {
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
}
ip := firstEnv("HOMELAB_WP_IP")
if ip == "" {
ip = wpLBIP
}
dialer := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if strings.HasPrefix(addr, wpHost+":") {
addr = ip + addr[strings.LastIndex(addr, ":"):]
}
return dialer.DialContext(ctx, network, addr)
},
}
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
}
// getJSON GETs path into v, retrying the transient empty/5xx responses the
// Woodpecker API intermittently returns under load.
func (c *wpClient) getJSON(path string, v interface{}) error {
var lastErr error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(2 * time.Second)
}
req, _ := http.NewRequest("GET", c.base+path, nil)
req.Header.Set("Authorization", "Bearer "+c.token)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
continue
}
if resp.StatusCode >= 300 {
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return json.Unmarshal(body, v)
}
return lastErr
}
type wpPipeline struct {
Number int `json:"number"`
Status string `json:"status"`
Event string `json:"event"`
Commit string `json:"commit"`
Message string `json:"message"`
}
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
var ps []wpPipeline
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
return ps, err
}
// findPipeline returns the pipeline for commit (prefix match), or the latest when
// commit is empty.
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
	ps, err := c.recentPipelines(repoID, 25)
	if err != nil {
		return wpPipeline{}, err
	}
	if len(ps) == 0 {
		return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
	}
	if commit == "" {
		return ps[0], nil
	}
	for _, p := range ps {
		if strings.HasPrefix(p.Commit, commit) {
			return p, nil
		}
	}
	return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
}
func (c *wpClient) repoID() (int, error) {
	owner, repo, err := repoOwnerName()
	if err != nil {
		return 0, err
	}
	var r struct {
		ID int `json:"id"`
	}
	if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
		return 0, err
	}
	if r.ID == 0 {
		return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
	}
	return r.ID, nil
}

// repoOwnerName derives <owner>/<repo> from the cwd git remote.
func repoOwnerName() (string, string, error) {
	cwd, _ := os.Getwd()
	root, err := gitRepoRoot(cwd)
	if err != nil {
		return "", "", fmt.Errorf("not in a git repository: %w", err)
	}
	remote := preferRemote(remotesOrEmpty(root))
	url, err := gitOutput(root, "remote", "get-url", remote)
	if err != nil {
		return "", "", err
	}
	return parseOwnerRepo(url)
}
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
func parseOwnerRepo(url string) (string, string, error) {
	u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
	u = strings.TrimSuffix(u, "/")
	if i := strings.Index(u, "://"); i >= 0 {
		u = u[i+3:]
	}
	u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
	parts := strings.Split(u, "/")
	if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
		return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
	}
	return parts[len(parts)-2], parts[len(parts)-1], nil
}
func isTerminalStatus(s string) bool {
	switch s {
	case "success", "failure", "error", "killed", "declined", "blocked":
		return true
	}
	return false
}

func isFailureStatus(s string) bool {
	return s == "failure" || s == "error" || s == "killed" || s == "declined"
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

@ -1,40 +0,0 @@
package main

import "testing"

func TestParseOwnerRepo(t *testing.T) {
	cases := []struct{ in, owner, repo string }{
		{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
		{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
		{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
		{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
	}
	for _, c := range cases {
		o, r, err := parseOwnerRepo(c.in)
		if err != nil || o != c.owner || r != c.repo {
			t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
		}
	}
	if _, _, err := parseOwnerRepo("nonsense"); err == nil {
		t.Error("expected error for unparseable remote")
	}
}

func TestStatusClassification(t *testing.T) {
	for _, s := range []string{"success", "failure", "error", "killed"} {
		if !isTerminalStatus(s) {
			t.Errorf("%q should be terminal", s)
		}
	}
	for _, s := range []string{"running", "pending"} {
		if isTerminalStatus(s) {
			t.Errorf("%q should not be terminal", s)
		}
	}
	if !isFailureStatus("failure") || !isFailureStatus("error") {
		t.Error("failure/error should classify as failure")
	}
	if isFailureStatus("success") {
		t.Error("success must not classify as failure")
	}
}

Binary file not shown.

@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

@ -1,30 +0,0 @@
# homelab: a unified infra-ops CLI grown in place from infra/cli
Agents re-derive the same operational command boilerplate every session — mining
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
the deterministic, repeated **actions** (not judgment) agents run — composable in
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
file (the infra repo deploys continuously and does not cut semver tags).
## Considered options
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
GitOps continuous-deploy.
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
webhook use-cases.
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
recurring action surface (methodology skills; third-party/owned MCP such as
phpIPAM, which homelab does NOT duplicate).
## Consequences
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
and falls through to the legacy `-use-case` path verbatim.
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.

@ -1,23 +0,0 @@
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
commands and where agents lose the most time and leak the most presence claims.
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
relying on existing gates (harness permission mode, presence claims, plan
approval). But every verb records a `read|write` tier (visible in `manifest`), so
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
later with zero restructuring.
## Considered options
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
value, but defers the toil that motivated the project.
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
coupling, branch-protection PR fallback) for the biggest immediate toil
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.
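The recorded tier and the classifier it would later enable can be pictured with the sketch below. It is illustrative only: the `verb` type, the `tierOf` lookup, and the sample tier assignments are assumptions, not the CLI's actual manifest schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// verb is a hypothetical manifest entry: a verb path plus its recorded tier.
type verb struct {
	Path string `json:"path"`
	Tier string `json:"tier"` // "read" or "write"
}

// registry holds sample entries; the tier assignments are assumptions made
// for illustration, not taken from the real manifest.
var registry = []verb{
	{Path: "tf validate", Tier: "read"},
	{Path: "tf apply", Tier: "write"},
	{Path: "work start", Tier: "write"},
	{Path: "claim", Tier: "write"},
}

// tierOf returns the tier of the longest registered verb path that prefixes
// the command line, or "" when nothing matches.
func tierOf(args []string) string {
	line := strings.Join(args, " ")
	best := verb{}
	for _, v := range registry {
		matches := line == v.Path || strings.HasPrefix(line, v.Path+" ")
		if matches && len(v.Path) > len(best.Path) {
			best = v
		}
	}
	return best.Tier
}

// autoAllow is the question a later PreToolUse classifier could ask: reads
// pass silently, writes and unknown verbs fall through to a prompt.
func autoAllow(args []string) bool {
	return tierOf(args) == "read"
}

func main() {
	out, _ := json.MarshalIndent(registry, "", "  ")
	fmt.Println(string(out))
	fmt.Println("tf validate auto-allowed:", autoAllow([]string{"tf", "validate", "some-stack"}))
	fmt.Println("tf apply auto-allowed:", autoAllow([]string{"tf", "apply", "some-stack"}))
}
```

Because v0.1 gates nothing, `autoAllow` would go unused at first; the point is that recording the tier now makes adding such a hook a lookup rather than a restructuring.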

@ -1,29 +0,0 @@
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
Four behaviours of the infra-loop verbs are surprising enough to record:
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
native harness worktree tool.** A CLI is a child process and cannot change the
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
prints the path — the agent enters it with native `EnterWorktree({path})`.
2. **`work land` is auto-land, but gated on verification.** It merges master in →
runs verification → pushes `HEAD:master` (fetch+merge+retry on
non-fast-forward) → falls back to pushing the feature branch for a PR when the
direct push is rejected (branch protection). It **refuses to push when it
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
`--no-verify` is passed — added after an accidental smoke-test land pushed
unverified WIP to master (benign: the infra CI applied 0 stacks because the
diff was `cli/`-only, but an unverified land must be deliberate, not default).
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
Local applies are out-of-band (CI applies canonically on push) but happen
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
delegates to `scripts/tg apply --non-interactive`, and **always releases on
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
documented ~200-claim leak — and prints an out-of-band reminder.
4. **Known v0.1 limitation:** `work land` does not yet block until CI is green; that
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
the pipeline manually.
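The always-release guarantee in item 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CLI's code: `claimGuard`, `newClaimGuard`, and the `release` callback are hypothetical names, and a counter stands in for the real presence release.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// claimGuard is a hypothetical helper: it wraps a release callback so the
// claim is released exactly once, whichever exit path fires first.
type claimGuard struct {
	once    sync.Once
	release func()
}

func newClaimGuard(release func()) *claimGuard {
	return &claimGuard{release: release}
}

// Release may be called from the normal path, an error path, and a signal
// handler; sync.Once guarantees the callback runs a single time.
func (g *claimGuard) Release() {
	g.once.Do(g.release)
}

// watchSignals releases the claim when SIGINT or SIGTERM arrives, then exits.
// The returned stop function detaches the handler on a normal return.
func (g *claimGuard) watchSignals() (stop func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	done := make(chan struct{})
	go func() {
		select {
		case <-ch:
			g.Release()
			os.Exit(130)
		case <-done:
		}
	}()
	return func() {
		signal.Stop(ch)
		close(done)
	}
}

func main() {
	released := 0
	g := newClaimGuard(func() { released++ })
	stop := g.watchSignals()
	defer stop()

	defer g.Release() // covers normal and error returns
	g.Release()       // an explicit early release is harmless
	fmt.Println("release callbacks run so far:", released)
}
```

The design choice worth noting is that every exit path calls the same `Release`, so no path has to know whether another one already ran.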

@ -1,30 +0,0 @@
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
than every other domain combined).
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
one app, so `<app>` defaults to the namespace, and the target defaults to
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
## Decisions worth recording
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
`scale`/`create`). They stay raw `kubectl`, by design, per the repo's
Terraform-only policy — the corpus confirms they're low-frequency, and a
friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
`psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
`bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.
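The resolver defaults described above can be sketched as a small pure function. This is an assumption-laden illustration: `target`, `resolve`, and `kubectlArgs` are hypothetical names, `myapp` is a placeholder app, and only the `-n` and `--pod` overrides are modelled.

```go
package main

import "fmt"

// target is a hypothetical result type: where a kubectl call should be aimed.
type target struct {
	Namespace string
	Resource  string // "deploy/<app>" by default, "pod/<name>" when overridden
}

// resolve applies the defaults: namespace = app unless overridden, and
// resource = deploy/<app> unless a specific pod is named.
func resolve(app, nsOverride, podOverride string) (target, error) {
	if app == "" {
		return target{}, fmt.Errorf("app name is required")
	}
	t := target{Namespace: app, Resource: "deploy/" + app}
	if nsOverride != "" {
		t.Namespace = nsOverride
	}
	if podOverride != "" {
		t.Resource = "pod/" + podOverride
	}
	return t, nil
}

// kubectlArgs renders the argv for a read verb such as logs or describe.
func kubectlArgs(verb string, t target) []string {
	return []string{verb, "-n", t.Namespace, t.Resource}
}

func main() {
	def, _ := resolve("myapp", "", "")
	fmt.Println(kubectlArgs("logs", def))

	// A multi-pod namespace needs the caller to be specific.
	specific, _ := resolve("myapp", "dbaas", "some-pod-0")
	fmt.Println(kubectlArgs("logs", specific))
}
```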

@ -1,30 +0,0 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).
Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.
## Deprecation path (deliberate follow-up — NOT done in v0.3)
The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:
1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.
Done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.

@ -1,29 +0,0 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).
## Decisions
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
not its Postgres schema (which drifts across upgrades — column renames bit us
mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
equivalent of the house `curl --resolve` pattern). Token from
`WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
under load (it flapped through the whole build session); `getJSON` retries
empty and 5xx responses (up to 5 attempts, 2 s apart) so `ci watch` is reliable
exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
on the landed commit and fails if the pipeline does — closing the gap ADR-0005
deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
the deployment image to reference the expected sha, *then* blocks on rollout
status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
endpoints were the least reliable this session (often empty); `status`/`watch`
rely on the list endpoint that works. A DB-backed `ci logs` is a possible
follow-up if the API path stays flaky.
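The two-phase ordering behind `deploy wait` might look like the sketch below. It is not the real implementation: `deployWait`, `waitForImage`, and the injected `getImage`/`rollout` callbacks are hypothetical, and matching the sha by substring is an assumption, since the image tag format is not given here.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// imageReferencesSHA reports whether an image reference mentions the expected
// commit sha. Assumption: the sha appears somewhere in the tag.
func imageReferencesSHA(image, sha string) bool {
	return sha != "" && strings.Contains(image, sha)
}

// waitForImage polls getImage until it references sha or attempts run out.
func waitForImage(getImage func() (string, error), sha string, attempts int, delay time.Duration) error {
	last := ""
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay)
		}
		img, err := getImage()
		if err != nil {
			continue // treat lookup errors as "not yet"
		}
		last = img
		if imageReferencesSHA(img, sha) {
			return nil
		}
	}
	return fmt.Errorf("deployment image %q never referenced %s", last, sha)
}

// deployWait runs phase 1 (image check) and only then phase 2 (rollout
// status). Checking rollout first could report an old, healthy rollout.
func deployWait(getImage func() (string, error), rollout func() error, sha string, attempts int, delay time.Duration) error {
	if err := waitForImage(getImage, sha, attempts, delay); err != nil {
		return err
	}
	return rollout()
}

func main() {
	images := []string{"registry.example/app:old", "registry.example/app:abc1234"}
	i := 0
	getImage := func() (string, error) {
		img := images[i]
		if i < len(images)-1 {
			i++
		}
		return img, nil
	}
	rollout := func() error { fmt.Println("rollout status: blocking until ready"); return nil }
	fmt.Println("result:", deployWait(getImage, rollout, "abc1234", 5, 10*time.Millisecond))
}
```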

@ -1,37 +0,0 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)
## Decisions
- **Internal ingresses, reached via the LB.** Everything routes through the
Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
Go form of the house `curl --resolve host:443:10.0.20.203` pattern
(`probe.go: clientDialingIP`). Verified live before building: Prometheus
(`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
answer JSON over the LB with **no auth gate and no port-forward** — so these
stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
(→ Cloudflare) AND dials the internal LB, reporting both — because the useful
question is *where* a break is (CF edge vs the app vs the LB path), which a
single curl can't answer. The external leg forces public resolution (the devvm
resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
`prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
raw `*.svc` services) that would force port-forward/`kubectl run`. The
reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
command-wrappers (low savings); only compound node ops (serial console, VM
wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
unless a concrete pain surfaces — the high-value deterministic surface
(tf/work/ci/k8s/memory + these probes) is now covered.

@ -1,42 +0,0 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
> owner in-session") no longer holds: the managed-settings policy now **defers
> to OS/sudo authorization**. The `usage top` telemetry design itself is
> unchanged and still current — only the "never read homes" framing in the
> third decision below is overtaken.
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
**No args, paths, flags, hostnames, or secrets** ever leave the process — the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
back with a LogQL metric query. This is the privacy-preserving resolution to
"what does everyone (e.g. another user) use" — it never touches anyone's
`~/.claude`, which the org per-user policy bars (see the per-user red-line in
managed-settings; reading another user's home is off-limits even for an owner
in-session — a fresh session under changed MDM policy is the only legitimate
path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.

@ -1,54 +0,0 @@
# homelab Home Assistant verbs: token resolution + host SSH, not entity control
v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
operator's sessions: across ~1,900 shell commands the single most-repeated line
(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
a shell function ~30× — both re-derived from scratch every session. The existing
`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
gap for every user in every directory.
## Decisions
- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
does entity state and control (`get_state`, `call_service`, history, logs).
Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
— we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
*resolution* and host *SSH*, neither of which an API-only MCP can provide. The
value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
`london`) via the ambient kubeconfig. This is robust to env drift — the precise
failure that made agents re-derive the pipeline. Read-tier, prints the bare
token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob
also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
admins can read — so the verb hung/failed for the non-admin operator it was
built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
— this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
`UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
whoever first wrote the workflow; that user's key must be enrolled on the HA
host. Write-tier (runs an arbitrary remote command).
- **sofia is the default; london is structural.** The devvm sits on the Sofia
LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
(`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
works (a pure secret read), but `ha ssh --instance london` generally won't
connect from here — london is remote. We model it correctly rather than
pretend it's reachable.
- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
`check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
`usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
still hand-rolled often.

@ -1,75 +0,0 @@
# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
capability that already existed but was undiscoverable: driving the cluster's
**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
`svc/chrome-service:9222`) from the devvm, for sites that detect and block
headless automation.
## Motivating incident (2026-06-22)
Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
portal: the headless `@playwright/mcp` browser loaded the site and filled the
entire multi-step form, but the **final submit silently failed** — Fixflo's
pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
spinner hung, no issue was created. Root cause = headless-Chrome detection. The
fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
submitted first try (Fixflo ref IS22657587). That capability was documented
(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
it took ~40 min, three redundant full form re-runs, and a user hint. The agent
also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
of inspecting the network panel.
## Decisions
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
rejected: the CLI is run every session (so the verb is *discoverable*), is
versioned, multi-user, and test-covered. A private, untested skill is none of
those. The command owns only the deterministic *mechanics* (port-forward,
stealth injection, lifecycle) — the agent supplies the Playwright script, so
*judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
payload in `browser --help`: the *when-to-use* signature (a site loads but a
gated action fails/hangs, or one request 500s/aborts while siblings 200 →
suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
= request resolved/intercepted by the automation layer, **not** egress;
egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
and would break the page load too). A command the agent doesn't think to run is
useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
label. Readiness is asserted against `/json/version`: the endpoint must report
a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
**always** torn down (process-group kill + signal handler), on success and on
error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
profile.** chrome-service is a single shared browser with a persistent profile.
A fresh, always-closed context is safe for concurrent callers (tripit's fare
scrape connects per-quote) and is what production already does. The warmed
persistent profile (cookies from a manual noVNC login) is opt-in for flows that
need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
changes between Playwright minors; the devvm's ambient Python Playwright was
1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
regardless of local drift. `playwright-core` (not `playwright`) because no
browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
guarded) on first use, alongside the embedded runner + stealth files. node is
already fleet-wide; this avoids coupling the feature to a provisioner change
and keeps it self-contained and self-healing. The client runs on the devvm, so
`setInputFiles` streams local files to the remote browser over CDP — no
`chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
`go:embed` can't reach outside the package dir, hence the vendored copy rather
than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
via `usage top` (ADR-0011) before adding more.
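The readiness assertion against `/json/version` could look roughly like this. The `checkHeadful` name is hypothetical and the sketch only inspects the `Browser` field of the CDP version document; fetching it through the port-forward is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cdpVersion models the one field of the /json/version document used here.
type cdpVersion struct {
	Browser string `json:"Browser"`
}

// checkHeadful accepts only a real "Chrome/<version>" browser string and
// rejects "HeadlessChrome/<version>", since a headless browser would defeat
// the purpose of the verb.
func checkHeadful(body []byte) error {
	var v cdpVersion
	if err := json.Unmarshal(body, &v); err != nil {
		return fmt.Errorf("parse /json/version: %w", err)
	}
	switch {
	case strings.HasPrefix(v.Browser, "HeadlessChrome/"):
		return fmt.Errorf("endpoint reports headless browser %q", v.Browser)
	case strings.HasPrefix(v.Browser, "Chrome/"):
		return nil
	default:
		return fmt.Errorf("unexpected browser string %q", v.Browser)
	}
}

func main() {
	// Sample bodies with placeholder version numbers.
	fmt.Println(checkHeadful([]byte(`{"Browser":"Chrome/130.0.0.0"}`)))
	fmt.Println(checkHeadful([]byte(`{"Browser":"HeadlessChrome/130.0.0.0"}`)))
}
```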

@ -1,35 +0,0 @@
---
status: accepted
date: 2026-06-24
---
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30); it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 because the shared webhook's Slack app isn't a member of it; see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity); re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
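
The edge-set upsert described in the as-built note can be sketched in a few lines. This is an illustrative in-memory model, not the aggregator's code: the real service writes to the `goldmane_edges` database, and both the choice of `(src_ns, dst_ns, action)` as the uniqueness key and the `"Allow"`/`"Deny"` action strings are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Edge:
    first_seen: datetime
    last_seen: datetime
    flow_count: int = 0


def upsert_edge(edges, src_ns, dst_ns, action, seen_at):
    """Fold one flow into the namespace-pair edge set; return True if kept."""
    # Dropped per the as-built note: empty-namespace flows and self-edges.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    key = (src_ns, dst_ns, action)  # assumed uniqueness key
    edge = edges.get(key)
    if edge is None:
        edges[key] = Edge(first_seen=seen_at, last_seen=seen_at, flow_count=1)
        return True
    # Existing edge: widen the time window and bump the counter.
    edge.first_seen = min(edge.first_seen, seen_at)
    edge.last_seen = max(edge.last_seen, seen_at)
    edge.flow_count += 1
    return True


edges = {}
t0 = datetime(2026, 6, 25, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2026, 6, 25, 12, 5, tzinfo=timezone.utc)
upsert_edge(edges, "immich", "dbaas", "Allow", t0)
upsert_edge(edges, "immich", "dbaas", "Allow", t1)
upsert_edge(edges, "dbaas", "dbaas", "Allow", t1)  # self-edge, dropped
upsert_edge(edges, "", "monitoring", "Deny", t1)   # empty namespace, dropped
```

Because only the unique edge is stored, the table grows with the number of distinct namespace pairs rather than with flow volume.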


@ -1,57 +0,0 @@
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
carried and that ADR-0011 leaned on ("never read another user's home /
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
subject — `usage top` telemetry and its emit design — is unchanged and still
current; only the privacy prohibition it referenced is superseded here.
## Context
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
"you are not the admin, do not escalate privileges" and "never read another
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
The kernel had already granted total read access; the policy was layering an
artificial refusal on top of an authorization the OS already permits, and the
"not the admin" framing was factually wrong for a NOPASSWD-root user.
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
for analytics/debugging across the shared box.
## Decision
- **Authorization follows the OS, not this policy.** Agents may access whatever
their OS user can access — directly or via `sudo` where they hold sudo rights
— and must not impose restrictions stricter than the OS. On this box that
includes other users' home directories and `~/.claude` for users who hold
broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
permission model + sudoers is the single source of truth for who may read
what. Other homes are `0750`-owned, so a cross-home read necessarily transits
`sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
managed-settings, so every user's agents defer to that user's own sudo grant.
Any user with broad sudo gets the same cross-home read capability over other
users' files. Accepted by the owner with that understanding; emo's and
ancamilea's `~/.claude` is now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
the session that made the change keeps running under the old policy.
## Consequences
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
"cross-user analytics without reading homes" answer) remains useful but is no
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
prompt-injected or otherwise compromised, it can now read every user's secrets
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
session.


@ -1,107 +0,0 @@
# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
turn, not memory* — the scheduler is blind to VRAM, so the tenants can
collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
qwen3-8b, and silently broke recruiter-responder triage for ~5 h
(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
## Context
- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
- **The card is busy but not steadily oversubscribed.** Measured steady residents
(2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
free. **The failure mode is a single tenant's runtime runaway, not a
scheduling-time pile-on.**
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
was missing is anything that bounds a tenant's VRAM *during active use*.
### Alternatives considered and rejected
- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
large rounding waste on a card that has none to spare. Rejected.
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
the *correct* hard-cap primitive and T4-supported, but it **replaces the
operator's device plugin** (the operator owns/reconciles it), enforces via an
`LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
(open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
regularly), and its reservation-based scheduling would **supersede the working
demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
behavioral change for the single proven failure mode. Rejected for now; this
ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
trade-offs already on the table.
## Decision
Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
pieces, **no device-plugin/driver change, time-slicing untouched**:
1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
`viktorbarzin.me/gpumem`** on the GPU node (= ~14000 MiB; ~15.4 GB physical
minus ~1.4 GB driver/CUDA-context/exporter slack), via a reconcile Job +
CronJob that run `kubectl patch node --subresource=status` (dynamic over
`nvidia.com/gpu.present=true` nodes; re-asserts after node re-register).
Every GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
≤ advertised). Extended resources are **non-overcommittable** (request==limit,
integer), so the scheduler refuses to co-schedule past the card → overflow
`Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, nvidia ns)
reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
recycles the biggest **over-budget** offender (used > declared). Contract
enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
priority can't distinguish them). Acting only under pressure lets a tenant burst
into genuine slack; the recycle clears its arena (exactly what the TTL=600
Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
`Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
This is **soft enforcement**: the scheduler reserves on paper and the watchdog
corrects at runtime with a detection lag (seconds to a minute), so a brief physical
overshoot is possible before a recycle. Accepted, given the failure mode is a
slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
disproportionate risk for this hardware.
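
The watchdog's selection rule (act only under pressure, and only against an over-budget tenant) can be sketched as follows. This is a minimal model under stated assumptions, not the CronJob's actual script: free VRAM is approximated as card size minus summed per-pod usage, "biggest offender" is read as the largest overage above the declared budget, and the usage figures are invented for illustration.

```python
FLOOR_MIB = 1536   # free-VRAM floor from the Decision
CARD_MIB = 15360   # usable VRAM on the T4

# Declared gpumem budgets (MiB) from the Decision.
DECLARED = {"immich-ml": 3000, "llama-swap": 5000, "frigate": 2000,
            "immich-server": 1800, "portal-stt": 1500}


def pick_victim(used_mib, declared_mib=DECLARED,
                card_mib=CARD_MIB, floor_mib=FLOOR_MIB):
    """Return the pod to recycle, or None if no action is warranted."""
    free = card_mib - sum(used_mib.values())  # every pod counts toward free
    if free >= floor_mib:
        return None  # no pressure: bursting into real slack is allowed
    # Only budgeted pods that exceed their declared gpumem are candidates.
    over = {pod: used - declared_mib[pod]
            for pod, used in used_mib.items()
            if pod in declared_mib and used > declared_mib[pod]}
    if not over:
        return None
    return max(over, key=over.get)  # assumed ranking: largest overage


# Hypothetical runaway resembling 2026-06-02 (numbers are illustrative).
runaway = {"immich-ml": 9000, "llama-swap": 2000, "frigate": 1900,
           "immich-server": 1200, "portal-stt": 1000, "android-emulator": 150}
steady = {"immich-ml": 2150, "llama-swap": 4350, "frigate": 1900,
          "immich-server": 1200, "portal-stt": 1500, "android-emulator": 150}
victim = pick_victim(runaway)
```

Note that an unbudgeted tenant is never a candidate in this sketch even though its usage reduces free VRAM, which mirrors the v1 scope in which only the five residents carry budgets.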
## Consequences
- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
operator, or time-slicing. immich-ml can no longer silently grow into
llama-swap's VRAM: it either schedules within its budget or, on a true runaway
under pressure, gets recycled (its heavy library job is the intended loser).
- **The card has a seating chart now.** Sum of declared budgets ≤ ~14 GB, so a new
always-on GPU tenant requires re-budgeting; an over-budget on-demand tenant sits
`Pending`. This is the intended, legible back-pressure.
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
seat (tts via its existing free-VRAM demand-gate), and are covered by the
~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
the watchdog protects the budgeted five and counts everyone's usage toward free.
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
existing cluster-admin tooling (woodpecker-agent).
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
consumer stacks declare it, or a pod requesting an unadvertised extended
resource is unschedulable. The reconcile runs as a Job (immediate) for this.
- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
`kubectl patch node --subresource=status` to remove the capacity key. Nothing
structural; no driver/operator state to unwind.
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
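
The seating-chart arithmetic above can be checked directly. This sketch only mirrors the integer fit test the scheduler applies to an extended resource; the `fits` helper is hypothetical and the numbers are the first-estimate budgets from this ADR.

```python
ADVERTISED_MIB = 14000  # viktorbarzin.me/gpumem advertised on the GPU node

budgets = {"immich-ml": 3000, "llama-swap": 5000, "frigate": 2000,
           "immich-server": 1800, "portal-stt": 1500}


def fits(current, new_request_mib, advertised_mib=ADVERTISED_MIB):
    """True if a new tenant's gpumem request still fits on the card."""
    return sum(current.values()) + new_request_mib <= advertised_mib


total = sum(budgets.values())
headroom = ADVERTISED_MIB - total
print(f"declared {total} MiB, headroom {headroom} MiB")
```

With these estimates the five residents leave 700 MiB of schedulable headroom, so any new always-on tenant asking for more than that stays `Pending` until the budgets are rebalanced.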


@ -1,126 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
grays + blue for copper runs (reference dataviz palette text tokens). -->
<defs>
<marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
<circle cx="4" cy="4" r="3" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="820" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
<text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
<!-- ═════════ APARTMENT ═════════ -->
<rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
<text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
<path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
<rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
<text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
<rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
<text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
<path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
<text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
<path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<!-- in-wall run apartment -> garage -->
<path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
<!-- ═════════ GARAGE — RACK ═════════ -->
<rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
<!-- switch -->
<rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
<text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
<g font-size="11.5" text-anchor="middle">
<rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
<text x="664" y="242" fill="#52514e">← apartment</text>
<rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
<text x="770" y="242" fill="#52514e">← 4G router</text>
<rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
<text x="876" y="242" fill="#52514e">← UPS mgmt</text>
<rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
<text x="982" y="242" fill="#52514e">← camera</text>
<rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
<text x="1088" y="242" fill="#52514e">← R730 eno1</text>
</g>
<text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
<!-- 4G router -->
<rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
<text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
<path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
<path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
<!-- UPS -->
<rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
<text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
<path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
<!-- R730 -->
<rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
<g font-size="11.5">
<rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
<text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
<rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
<text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
<rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
<text x="908" y="613" fill="#8a8984">free, uncabled</text>
<rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
<text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
</g>
<text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
<text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
<text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
<path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
<!-- ═════════ GARAGE ENTRANCE ═════════ -->
<rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
<text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
<text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
<text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
<path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
<!-- legend -->
<g transform="translate(40,780)" font-size="12.5">
<line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
<text x="52" y="0" fill="#0b0b0b">copper, in place</text>
<line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
<path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
<text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
</g>
</svg>


@ -1,99 +0,0 @@
# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
Status: accepted (2026-07-02, rev 3 — single-switch)
![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
physically exposed outside the apartment, so anything plugged into that cable
must land in a segment that can reach nothing. The original design doc
(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
to pfSense" — but nothing in this network terminates dot1q on pfSense; the
site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
untagged pfSense interface per segment.
**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
garage TL-SG105E (Viktor prefers not running two switches; retired unit
becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
net3 back to vmbr2 restores pure physical isolation in one `qm set`).
This narrows the earlier 802.1Q objection rather than contradicting it: the
rejection assumed *unmanaged* switches, where any LAN device could inject
tagged frames; with the managed PE as the only device on eno1, VLAN-30
membership is {camera port, trunk port} only, so tag-30 ingress from every
other port — and from the exposed camera cable — is dropped or contained.
Cameras are untrusted: default-deny on dCCTV with a single
NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
10.0.20.0/22 trusted source-IP allowlist.
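
The claim that the CCTV subnet sits outside the trusted allowlist can be verified with Python's standard `ipaddress` module. The camera address 10.0.30.70 is taken from the topology diagram; the variable names are illustrative only.

```python
import ipaddress

cctv = ipaddress.ip_network("10.0.30.0/24")
trusted = ipaddress.ip_network("10.0.20.0/22")
camera = ipaddress.ip_address("10.0.30.70")

# The /22 spans 10.0.20.0 through 10.0.23.255, so 10.0.30.x falls outside it.
print("trusted range:", trusted[0], "to", trusted[-1])
print("cctv overlaps trusted:", cctv.overlaps(trusted))
print("camera in trusted:", camera in trusted)
```

Keeping the two ranges disjoint means any rule keyed on the trusted source range cannot match camera traffic by accident.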
## Traffic on the trunk — how one cable carries two networks
The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
pfSense:
- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
between the trunk, the host's own IP (192.168.1.127) and pfSense `net0`
where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
LAN's gateway is and remains the AX6000; home-LAN traffic never transits
pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
4G router survives the whole rack being down.
- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
is impossible by construction, not merely by firewall rule.
- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
out of its WAN toward the AX6000. Load-wise the trunk gained only the
camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
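
The port and VLAN membership described above can be modelled as a small lookup to see where a frame may go. This is a simplified sketch that assumes strict ingress filtering (a tagged frame for a VLAN the ingress port does not belong to is dropped); the real switch's behaviour should be confirmed on the device, and the `PORTS` table and `egress_ports` helper are illustrative names.

```python
# port -> (PVID applied to untagged ingress, VLANs the port is a member of)
PORTS = {
    "P1": (1, {1}),       # apartment uplink
    "P2": (1, {1}),       # 4G router
    "P3": (1, {1}),       # UPS mgmt
    "P4": (30, {30}),     # camera, untagged VLAN 30
    "P5": (1, {1, 30}),   # trunk to R730 eno1: VLAN 1 untagged, VLAN 30 tagged
}


def egress_ports(ingress, tag=None):
    """Ports a frame may leave on, given its ingress port and optional tag."""
    pvid, member = PORTS[ingress]
    vlan = pvid if tag is None else tag
    if vlan not in member:
        return set()  # assumed ingress filtering: foreign-VLAN frame dropped
    return {p for p, (_, m) in PORTS.items() if vlan in m and p != ingress}


print(sorted(egress_ports("P4")))          # camera traffic
print(sorted(egress_ports("P4", tag=1)))   # camera cable injecting VLAN 1
print(sorted(egress_ports("P1", tag=30)))  # LAN device injecting VLAN 30
print(sorted(egress_ports("P1")))          # ordinary home-LAN frame
```

Under these assumptions the camera port can only ever reach the trunk, and nothing on the home-LAN ports can place a frame into VLAN 30.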
![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
## Considered options
- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
read this way) — rejected: any LAN device could inject tagged frames into
vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
polices VLAN-30 membership at the single entry point to eno1; no bridge
reconfiguration was needed (vmbr0 was already vlan-aware).
- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
(rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
(6 connections vs 5 ports once the PE also replaced the old switch) or new
hardware. Strongest isolation of all options; kept dormant as the fallback.
- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
router, no inter-VLAN firewall).
## Consequences
- The switch is now single-point and load-bearing for everything in the rack
(apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN
table + mgmt password are part of the isolation boundary — the Easy Smart
mgmt UI answers on every port, so the password is the gate between a
compromised camera and the switch config. All 5 ports are consumed: the
next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
leg); eno3/eno4 remain free.
- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
(Kea reservation by MAC).
- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
port-VLAN split (conflated the two devices); rev 2 split into two switches
after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
consolidated back to one switch — the PE replacing the SG105E — per
Viktor's preference, moving CCTV onto a managed tagged trunk.
- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
NVDEC stream.


@ -1,178 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
<defs>
<marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
</marker>
<marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
</marker>
<marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="880" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
<text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
<!-- camera -> everything else (denied) -->
<path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<g transform="translate(560,111)">
<circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
<path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
</g>
<text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
<!-- GARAGE ENTRANCE -->
<rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
<text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
<text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
<text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
<text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
<text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
<path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
<text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
<!-- RACK zone: single switch -->
<rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
<rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
<g font-size="11.5" text-anchor="middle">
<rect x="80" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
<text x="124" y="470" fill="#52514e">apartment</text>
<text x="124" y="484" fill="#52514e">uplink</text>
<rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
<text x="222" y="470" fill="#52514e">4G router</text>
<text x="222" y="484" fill="#52514e">192.168.1.7</text>
<rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
<text x="320" y="470" fill="#52514e">UPS mgmt</text>
<rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
<text x="418" y="470" fill="#52514e">camera</text>
<text x="418" y="484" fill="#52514e">PoE ON</text>
<rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
<text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
<text x="516" y="470" fill="#52514e">V1 untagged</text>
<text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
</g>
<text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
<text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
<text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
<!-- trunk: two parallel lines to eno1 -->
<path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
<text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
<!-- R730 / PVE zone -->
<rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
<g font-size="12">
<rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
<text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
<rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
<text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
<rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
<text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
<text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
</g>
<!-- pfSense VM -->
<rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
<text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
<g font-size="12">
<rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
<rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
<rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
<rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
</g>
<path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
<path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
<path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<!-- k8s VMs -->
<rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
<text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
<text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
<text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
<rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
<text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
<text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
<rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
<text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
<!-- HOME LAN zone -->
<rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
<text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
<rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
<text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
<rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
<text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
<rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
<text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
<rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
<text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
<text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
<path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
<text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
<!-- FLOWS -->
<path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
<path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
<text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
<path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
<text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
<!-- LEGEND -->
<g transform="translate(40,800)" font-size="12.5">
<rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
<rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
<rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
<rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
<line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="870" y="14" fill="#0b0b0b">allowed flow</text>
<line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<text x="1030" y="14" fill="#0b0b0b">denied</text>
<line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
<text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
<text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
</g>
</svg>


File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


View file

@@ -1,47 +0,0 @@
# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
migrates onto this and is retired.
Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
site down). With Pages, a homelab outage degrades to "content frozen until we're back",
never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
accident.
## Considered options
- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
Cloudflare Pages dependency — but her sites share the homelab's fate and each site
spends cluster resources to serve static files a free CDN serves better.
- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
## Consequences
- Registration is one entry in the `sites` map (name, Content folder, optional Entry
file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
together. Names are English, picked by Viktor (Bulgarian "most" → `bridge` set the precedent).
- The internal split-horizon zone learns Valia sites from a ConfigMap the
`technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
(the previous static-CNAME approach was add-only; a retired site left a stale record).
- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
deployed.
- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
update" reports, consistent with the alert-noise-reduction posture. Revisit if a
silent stall actually bites.
- If the homelab is down, content updates pause; the sites keep serving last-deployed
content. Accepted degradation.
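
The deploy-on-change consequence above can be sketched as a content-digest comparison. This is a minimal illustration, not the CronJob's actual script: `tree_digest`, `should_deploy` and `runs_per_month` are hypothetical names, and how the real syncer detects a change is not stated in this ADR.

```python
import hashlib
from pathlib import Path
from typing import Optional


def tree_digest(root: Path) -> str:
    """Hash every file's relative path and bytes into one stable digest."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def should_deploy(current: str, last_deployed: Optional[str]) -> bool:
    """Deploy only when the mirrored content differs from the last deploy."""
    return current != last_deployed


def runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of sync runs a fixed cadence produces in one month."""
    return (24 * 60 // interval_minutes) * days
```

With a 10-minute cadence, `runs_per_month(10)` gives 4320 runs in a 30-day month, which is the "~4,300/month" the ADR cites as the cost of deploying on every run.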

View file

@@ -1,97 +0,0 @@
# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
Routing proved pass-through-only. Viktor now wants inbound mail to survive
homelab outages **without loss** (2026-07-04): delayed delivery is fine,
mid-outage reading is not required, and the budget is **$0** — a hard
constraint that eliminated every managed option (see below).
We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain (catch-all — every RCPT is valid; reputation may only ever
4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
deliver a DSN, its only egress is the drain), and drains to the primary over
**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
mid-outage break-glass since headscale itself lives in the cluster); TLS via
certbot HTTP-01 (port 80 permanently open — LE validation is
multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
On the primary, the drain stream (one /32) is enabled at the layers that
actually bite — `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
reachability (recurring probe — Oracle publishes no commitment), drain
end-to-end, and a live failover test that includes a high-spam-score and a
>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
final form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
## Considered options
- **Roller Network free Secondary MX** — v1 of this decision, killed at the
validation gates the same day: free tier caps at 200 relayed messages or
10 MB per rolling 7 days, and overage suspends the domain for 48 h
answering **SMTP 5xx** (permanent bounces) — since spammers target backup
MXes even while the primary is up, background spam alone can hold it
suspended, making it *worse than no backup MX*. Free accounts are also
being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
12-24 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
plan is a 6-month credit; Azure has no always-free VM and blocks 25;
Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging) — does not address
multi-day outages or short-retry senders; deferred as a complementary
track.
## Consequences
- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
Terraform + cloud-init, patched by unattended-upgrades, scraped by the
cluster's Prometheus (exporters on the reserved public IP, allowlisted to
the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
free allowance in June 2026 and terminated over-limit instances, and
publishes no commitment that inbound 25 stays open. Mitigations:
**Pay-As-You-Go conversion is a required prerequisite** (exempts idle
reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
the queue being empty outside outages (a surprise reclamation loses
coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
the original IP via `external_relay`), and content scoring stay on — spam
arriving via the backup is tagged and folded to Junk, never bounced. The VM
is deliberately NOT in the primary's `mynetworks` (a compromised VM must
not relay through us).
- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
host found dangling during design — inert today; must list `mx2` when
fixed) needs 12 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
`vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
failure semantics change (a "failing" probe may now mean "delayed via mx2,
drains shortly" — noted in alert description).
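
The recurring inbound-25 reachability check could look roughly like the sketch below: connect, wait for the SMTP greeting, and treat anything other than a `220` banner as a failure. `smtp_banner_ok` is a hypothetical helper written for illustration; the ADR itself only names a blackbox TCP:25 probe, so the banner check is an added assumption.

```python
import socket


def smtp_banner_ok(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if the server sends an SMTP 220 greeting within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    return banner.startswith(b"220")


if __name__ == "__main__":
    # Host name taken from the ADR; run this from outside the homelab network.
    print(smtp_banner_ok("mx2.viktorbarzin.me"))
```

Because postscreen's pregreet test deliberately delays the full greeting, the timeout should stay generous rather than sub-second.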

View file

@@ -86,56 +86,10 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
**and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
so those clients get the *real* authentik login (password + MFA + reputation —
no auth downgrade). The SFE can't render Identification-stage **sources**
(authentik limitation), so the patch also injects static social-login `<a>`
links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
required for password-less accounts (e.g. Google-only users). A Traefik
basic-auth fallback was rejected: it would have put a single spoofable-UA
password in front of `vbarzin→wizard` (passwordless root on the devvm). See
`stacks/authentik/patch-compat-sfe.py`.
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
**cannot render WebAuthn** (enrol *or* validate), so that user gets
`unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
downgrade**: (1) **social login** — sources run `default-source-authentication`
(UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
runtime data (not Terraform): enrol via `ak shell`
(`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
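
The old-browser (SFE) rule described in the list above can be illustrated with a small user-agent check. This is a sketch under stated assumptions, not the contents of `stacks/authentik/patch-compat-sfe.py`: `needs_sfe`, both regexes and the `MAX_LEGACY` constant are hypothetical, and real user-agent strings vary more than these patterns cover.

```python
import re

# Any browser on iOS/iPadOS reports the OS version as "OS 15_7" inside the
# first parenthesised group; all of them share the system WebKit.
IOS_RE = re.compile(r"\((?:iPhone|iPad|iPod)[^)]*?\bOS (\d+)_(\d+)")
# Desktop Safari reports its own version as "Version/16.3 ... Safari/...".
SAFARI_RE = re.compile(r"Version/(\d+)\.(\d+)[^ ]* .*Safari/")
MAX_LEGACY = (16, 3)  # newest version that still gets the simplified executor


def needs_sfe(user_agent: str) -> bool:
    """Return True when the client should get the ES5 Simplified Flow Executor."""
    ios = IOS_RE.search(user_agent)
    if ios:
        return (int(ios[1]), int(ios[2])) <= MAX_LEGACY
    if "Chrome/" in user_agent or "Chromium/" in user_agent:
        return False  # Chromium-based desktop browsers also say "Safari/"
    safari = SAFARI_RE.search(user_agent)
    if safari:
        return (int(safari[1]), int(safari[2])) <= MAX_LEGACY
    return False
```

Checking the iOS version before the browser name is what makes Chrome and Firefox on an old iPad fall into the same branch as Safari.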
@@ -154,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
### TripIt External self-signup (open enrollment, fenced)
Unlike every other app, **TripIt allows open public self-signup** for people
outside the homelab (ADR-0020 in the tripit repo; runbook
`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
flow (email + passkey, no password) creates the account and stamps it into the
parentless **`TripIt External`** group. Containment is two-layered:
- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
denies every other `auth="required"` host.
- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
External users are contained because every sensitive OIDC app already requires a
trusted group they do not hold — audited 2026-06-15:
Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
`Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
`Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
`default`-policy token) and is bound to **`Allow Login Users`** as part of this
change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
**Invariants**: keep `TripIt External` parentless (never under `Allow Login
Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
trusted/internal user; the `tripit-enrollment` user_write "Create users group"
setting is the keystone that tags every signup.
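
The forward-auth containment and the co-assignment invariant can be summarised as two small predicates. This is a plain-Python sketch of the decision logic only: `external_branch` and `breaks_invariant` are hypothetical names, and the code does not use authentik's policy API.

```python
from typing import Optional, Set

EXTERNAL_GROUP = "TripIt External"
TRIPIT_HOST = "tripit.viktorbarzin.me"


def external_branch(groups: Set[str], host: str) -> Optional[bool]:
    """First branch of the catch-all: decide only for external users.

    Returns True/False for members of the external group, and None to let
    the rest of the catch-all policy decide for everyone else.
    """
    if EXTERNAL_GROUP not in groups:
        return None
    return host == TRIPIT_HOST


def breaks_invariant(groups: Set[str], trusted: Set[str]) -> bool:
    """True when an external user also holds a trusted/internal group."""
    return EXTERNAL_GROUP in groups and bool(groups & trusted)
```

Returning `None` for non-external users is why the branch has to stay first: it must short-circuit external users before any broader rule can admit them.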
### OIDC Applications
Authentik provides OIDC for 10 applications:

View file

@@ -128,7 +128,7 @@ The agent handles all three version patterns in Terraform:
- **Slack**: All upgrade events reported (start, success, failure, rollback)
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
- **DIUN Slack**: REMOVED 2026-07-02 (per-tag @channel pings in #image-updates; human cadence is the weekly upgrade report). The n8n webhook feed to the upgrade agent is unchanged.
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
## Bulk Upgrades
@@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled`: `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed`: `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric), which makes it invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- `K8sUpgradeChainJobFailed`: `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric), which makes it invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
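
The alert expressions above reduce to simple predicates over the Pushgateway metrics. The sketch below restates them in Python for readability only; the function names are hypothetical, the `for:` hold durations are not modelled, and Prometheus evaluates the real rules.

```python
def snapshot_missing(in_flight: int, snapshot_taken: int) -> bool:
    """EtcdPreUpgradeSnapshotMissing: upgrade running, no etcd snapshot yet."""
    return in_flight == 1 and snapshot_taken == 0


def upgrade_stalled(in_flight: int, started_ts: float, now: float,
                    limit_seconds: int = 5400) -> bool:
    """K8sUpgradeStalled: upgrade has been in flight for more than 90 minutes."""
    return in_flight == 1 and (now - started_ts) > limit_seconds


def chain_job_failed(terminal_failures: int, blocked: int) -> bool:
    """K8sUpgradeChainJobFailed with the `unless on() (blocked == 1)` clause.

    `unless on()` matches on an empty label set, so a single
    `k8s_upgrade_blocked == 1` sample suppresses every failed-Job series.
    """
    return terminal_failures > 0 and blocked != 1
```

The third predicate shows why the first two cannot cover a preflight failure: both depend on `in_flight == 1`, which is never set if preflight exits early.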

View file

@@ -112,32 +112,17 @@ External caller (dev box):
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
## Browser binary — real Google Chrome (for proprietary codecs)
The chrome-service container runs **real Google Chrome**, not the bundled
Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
the lib stripped) and Chrome-for-Testing is also codec-less — only
`google-chrome-stable` carries them.
## Image pin
The Playwright base + the Python client (`playwright==1.48.0` in callers'
`requirements.txt`) and the snapshot sidecars
(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
minor-versions. The chrome-service browser is now real Google Chrome (a newer
milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
version-tolerant — verified working against this Chrome. If a future Chrome
milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The harvester + snapshot-server sidecar use
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
minor, with Python-side bindings pre-installed.
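
A lockstep check between the image tag and the client pin could be as small as the sketch below. The helper names are hypothetical and the parsing assumes the tag and pin formats shown in this section (`:v1.48.0-noble` and `playwright==1.48.0`); it is an illustration, not an existing CI step.

```python
def image_tag_version(image: str) -> str:
    """'mcr.microsoft.com/playwright:v1.48.0-noble' -> '1.48.0'."""
    tag = image.rsplit(":", 1)[1]
    return tag.lstrip("v").split("-", 1)[0]


def pin_version(requirement: str) -> str:
    """'playwright==1.48.0' -> '1.48.0'."""
    return requirement.split("==", 1)[1].strip()


def same_minor(a: str, b: str) -> bool:
    """Compare only the major.minor components of two versions."""
    return a.split(".")[:2] == b.split(".")[:2]
```

For example, `same_minor(image_tag_version(server_image), pin_version(client_pin))` would be the condition to assert before bumping either side.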
## Storage
@@ -182,66 +167,7 @@ milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. The bare host serves `vnc.html` (image symlinks
`index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
to skip the Connect button. The view is **black when no browser window is
open** (idle) — that is normal, not a failed connection. Chrome is launched
with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
(no window manager runs, so without it Chrome opens at its profile-persisted
size and the rest of the framebuffer shows as a black cut-off).
### noVNC fd-sweep gotcha (stuck "Connecting")
If the noVNC client hangs on **"Connecting" forever then times out**, the cause
is almost always x11vnc's fd-table sweep: containerd grants pods
`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
every client connection, so the RFB handshake never completes (websockify
accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"`:
healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
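The handshake probe above can be wrapped in a small helper that separates the failure modes. This is a minimal sketch, assuming plain TCP to x11vnc on `127.0.0.1:5900`; the function name `probe_rfb_banner` and the status labels are hypothetical, not part of the stack.

```python
import socket
import time


def probe_rfb_banner(host="127.0.0.1", port=5900, timeout=2.0):
    """Connect and read the 12-byte RFB banner.

    Returns (status, banner, seconds). The status labels are illustrative:
      healthy    - banner starts with b"RFB " (e.g. b"RFB 003.008\n")
      stalled    - connected but no banner before the timeout (fd-sweep signature)
      refused    - nothing is listening on the port (x11vnc is down)
      unexpected - something answered, but not with an RFB banner
    """
    start = time.monotonic()
    banner = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # recv may return fewer bytes than asked for, so read until
            # 12 bytes have arrived or the peer closes the connection.
            while len(banner) < 12:
                chunk = sock.recv(12 - len(banner))
                if not chunk:
                    break
                banner += chunk
    except ConnectionRefusedError:
        return "refused", b"", time.monotonic() - start
    except socket.timeout:
        return "stalled", banner, time.monotonic() - start
    elapsed = time.monotonic() - start
    if banner.startswith(b"RFB "):
        return "healthy", banner, elapsed
    return "unexpected", banner, elapsed
```

Run from a sibling container, a `stalled` result points at the fd sweep, while `refused` means x11vnc is not listening at all.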
### noVNC black after a browser-container restart (x11vnc supervision)
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
`<defunct>` zombie — and the view black until a manual pod restart. Same
supervision pattern as the android-emulator stack's entrypoint.)
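The supervision pattern described above (start both children, exit non-zero as soon as either dies, let the kubelet restart the container) can be sketched in Python. The real `entrypoint.sh` is a shell script using `wait -n`; the `supervise` function below is a hypothetical illustration of the same idea, not a copy of it.

```python
import subprocess
import time


def supervise(commands, poll_interval=0.2):
    """Start every command, then return 1 as soon as any child exits.

    Mirrors the "wait -n, then exit non-zero" pattern: the caller is
    expected to exit with the returned code so the container restarts.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            for child in children:
                if child.poll() is not None:
                    # One child is gone (clean exit or crash): give up so
                    # the whole container is restarted with a fresh pair.
                    return 1
            time.sleep(poll_interval)
    finally:
        # Stop the surviving children and reap all of them, so no
        # <defunct> entries are left behind.
        for child in children:
            if child.poll() is None:
                child.terminate()
        for child in children:
            child.wait()
```

A caller would pass the x11vnc and websockify command lines and finish with `sys.exit(supervise([...]))`.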
**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
Authentik-gated.
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -254,87 +180,6 @@ and relaunches x11vnc **without** touching the browser session/in-flight CDP job
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Driving from OUTSIDE the cluster (`homelab browser`)
Agents on the devvm reach this browser through the **`homelab browser`** CLI
(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
`connect_over_cdp` recipe. It is the **escalation path, not the default**:
agents default to the Playwright MCP / headless browser for all routine
automation, and reach for `homelab browser` ONLY when headless is blocked — a
site loads but a gated action (submit/login) silently fails or hangs, the
signature of headless / anti-bot detection. (Same tiered rule lives in
`~/code/CLAUDE.md` and `homelab browser --help`.)
```text
devvm: homelab browser run flow.js
│ kubectl port-forward svc/chrome-service :9222 (random local port)
http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP)
│ assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
│ node + playwright-core@1.48.2 → connectOverCDP
│ context.addInitScript(stealth.js) ← same vendored file as in-cluster
│ run the user's Playwright script with page/context/browser in scope
└─ port-forward always torn down (success or error)
```
Key facts:
- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
label — unlike in-cluster callers.
- **Client pinned to the image minor.** The node client is
`playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
server image bumps (same rule as the in-cluster Python clients — see "Image
pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
## Multi-user access (sharing the browser)
There is ONE chrome-service browser with ONE persistent profile, warmed with
**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
sessions. Access is gated accordingly, per user.
**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
Viktor's browser for form-filling + captcha solving, rather than getting an
isolated instance. The session-exposure trade-off above was explicitly accepted.
Two independent grants make up "browser access" for a user:
1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
`admin-services-restriction` policy: the `CHROME_ALLOWED` set
(`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
username OR email. Add the user there. No kubeconfig/RBAC needed.
2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
session). Provided by a per-user **ServiceAccount** with a long-lived token
(`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
resolve the Service and doesn't regress the user's normal read). The devvm
provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
reads that token and installs it as the user's DEFAULT kubeconfig context
(`<user>-browser@homelab`), keeping their personal OIDC login as the
`oidc@homelab` named context. The SA's existence is the source of truth for who
gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
Because the SA is the user's DEFAULT kubectl credential, other per-namespace
port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM


@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -115,66 +115,8 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, insta2spotify,
audiobook-search) now also land on ghcr.
**plotting-book** is a special case (a GitHub-first repo owned by Anca,
ADR-0003): the build runs in *her* GitHub repo
(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
unchanged. Flow:
```text
DEVELOP ───────────────────────────────────────────────────────────────────────
Anca (Codex / t3 web agent)
│ git push → main
┌──────────────────────────────────────────────────────────────┐
│ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
│ .github/workflows/build-and-deploy.yml on: push → main │
└───────────────────────────┬──────────────────────────────────┘
│ GitHub Actions runner (off-infra build · ADR-0002)
┌────────────────────┴─────────────────────────────────┐
▼ ▼
┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
│ build job │ push ║ GHCR · PRIVATE package ║
│ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
│ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
│ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
│ • delete-package-versions (keep newest 10) │ │
└───────────────────────┬─────────────────────┘ │ pull (private,
▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
▼ │
┌─────────────────────────────────────────────────────────────┐ │
│ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
│ kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │ │
│ kubectl rollout status │ │
└───────────────────────────┬─────────────────────────────────┘ │
▼ │
═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
┌─────────────────────────────────────────────────────────────┐ │
│ Deployment plotting-book (Recreate · image = ignore_changes)│ │
│ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
│ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
└─────────────────────────────────────────────────────────────┘
guards / supporting:
• Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
• Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
• ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
═══════════════ Serving path (unchanged) ══════════════════════════════════
Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
```
Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search, council-complaints) now also land on ghcr.
### Infra-owned images (issues #29 / #30)
@ -188,8 +130,6 @@ reconciled — the workflows were added to the GitHub lineage via PR):
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
@ -223,9 +163,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -236,38 +176,6 @@ Woodpecker is **deploy + cluster-touching steps only**:
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock — historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
skip. Fail-open (unknown forge still applies). The mirror keeps running the
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
are NOT retried — they fail fast.
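The three rules above (forge guard, lock-skip, bounded transient retry) can be read as one small decision procedure. The pipeline itself is shell in `default.yml`; the Python below is only a sketch of the logic. The function names are hypothetical, and the Vault 5xx pattern (`Code: 5xx`) is an assumption about how those errors appear in the output.

```python
import re

# Messages quoted in the notes above.
LOCK_PATTERNS = ("is locked by", "Error acquiring the state lock", "already locked")
TRANSIENT_PATTERNS = ("Failed to install provider", "Client.Timeout")
# Assumption: Vault server errors show up as "Code: 5xx" in the apply output.
VAULT_5XX = re.compile(r"Code:\s*5\d\d")


def is_github_mirror(repo_url, forge_url):
    """Forge guard: skip only when a URL names github.com (fail-open otherwise)."""
    return "github.com" in (repo_url or "") or "github.com" in (forge_url or "")


def classify_apply_output(output):
    """Map a failed apply's output to 'skip', 'retry' or 'fail'."""
    if any(pattern in output for pattern in LOCK_PATTERNS):
        return "skip"   # lock contention: the stack is skipped, not failed
    if any(pattern in output for pattern in TRANSIENT_PATTERNS) or VAULT_5XX.search(output):
        return "retry"  # transient: worth another bounded attempt
    return "fail"       # config errors, helm atomic timeouts, ...: fail fast


def apply_stack(run_apply, max_attempts=3):
    """run_apply() -> (exit_code, output). Returns 'ok', 'skip' or 'fail'."""
    for _ in range(max_attempts):
        exit_code, output = run_apply()
        if exit_code == 0:
            return "ok"
        verdict = classify_apply_output(output)
        if verdict != "retry":
            return verdict
    return "fail"  # still transient after the last attempt
```

Checking the lock patterns before the transient ones keeps a locked stack from burning retry attempts.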
A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention — all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
phase without mutating on config errors, so a separate in-pipeline plan-gate was
also dropped as redundant.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
@ -295,9 +203,7 @@ The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
Slack audit step. **Slack policy (2026-07-02): every infra pipeline posts only
on FAILURE** (plus the non-admin audit post and drift/error findings) — routine
successful runs are silent. Operational facts (2026-06-10):
Slack audit step. Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**:
`http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
@ -379,8 +285,7 @@ steps:
notify:
image: plugins/slack
when:
# Failure-only (2026-07-02 policy): CI notifies about failed runs only.
status: [failure]
status: [success, failure]
```
### CI/CD secrets sync


@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
## NodeLocal DNSCache
@ -368,7 +368,6 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |
### Proxied vs Non-Proxied
@ -514,7 +513,6 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
## Incident History


@ -161,17 +161,6 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
DB: MySQL (mysql.dbaas.svc.cluster.local)
```
### Paperless ingest mailbox (docs@)
`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
paperless-ngx polls over IMAP; family members forward document emails to it
and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
discards mail from non-allowlisted senders at delivery. Full flow, sender map,
and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
## DNS Records
All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -311,21 +300,6 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
## Troubleshooting
### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
tempfails every message (inbound AND submission); senders retry so nothing is
lost, and the roundtrip probe alerts within the hour.
Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
process spins again (it did once), `kubectl -n mailserver delete pod` for a
full re-init — that healed it. Root cause not pinned down (one-off bad init;
postsrsd 1.10).
### Inbound mail not arriving
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside


@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
**Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,20 +318,9 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 because the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
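The probe's pass/fail rule can be modelled in a few lines. This is a sketch of the decision only, not the blackbox-exporter implementation: the regex is the one quoted above, while `VALID_STATUS` is an assumed stand-in for the module's real `valid_status_codes` list and `probe_result` is a hypothetical name.

```python
import re

# Regex quoted from the http_no_authentik_redirect module description.
AUTHENTIK_REDIRECT = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)
# Assumption: an illustrative subset of the expected non-Authentik responses.
VALID_STATUS = {200, 204, 301, 302, 401, 403, 404}


def probe_result(status, location):
    """Return (probe_success, failed_due_to_regex) for one unfollowed response."""
    failed_due_to_regex = bool(location and AUTHENTIK_REDIRECT.search(location))
    probe_success = status in VALID_STATUS and not failed_due_to_regex
    return probe_success, failed_due_to_regex
```

Only the second value drives the alert, which is why a 5xx from the backend fails the probe without firing the wall-off alert.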
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup

Some files were not shown because too many files have changed in this diff.