Compare commits: master...broker-syn (6 commits)

| SHA1 |
|---|
| 731de63150 |
| 9ce9a9a7f7 |
| 277babc696 |
| d91fbd4a60 |
| e81e836d3a |
| d3be9b50af |

995 changed files with 24654 additions and 117374 deletions

File diff suppressed because one or more lines are too long

@@ -1,543 +0,0 @@
---
name: k8s-version-upgrade-DEPRECATED
description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---

# DEPRECATED — Do NOT invoke this agent

Retired **2026-05-11** after a self-preemption incident: this agent ran inside
the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
workers at v1.34.2).

## Replaced by

A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
preempt itself because each Job's pod and its target node are always
different.
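
The pinning rule can be illustrated with a minimal sketch. It assumes the caller knows the full node list; `pick_host_node` is a hypothetical helper name, not something taken from `upgrade-step.sh`, and the real Job chain may choose its host node differently.

```shell
# Hypothetical helper: return the first node that is NOT the drain target,
# so the Job doing the drain can never be evicted by its own drain.
pick_host_node() {
  local target="$1"
  shift
  local n
  for n in "$@"; do
    if [ "$n" != "$target" ]; then
      printf '%s\n' "$n"
      return 0
    fi
  done
  echo "no eligible host node for target $target" >&2
  return 1
}

# Node names as used in this document; word splitting of $NODES is intentional.
NODES="k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4"
host=$(pick_host_node k8s-node4 $NODES)
echo "drain target=k8s-node4, job host=$host"
```

The returned name would then be the value of the `kubernetes.io/hostname` selector in the rendered Job.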

| Old | New |
|-----|-----|
| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |

## Where the logic lives now

- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
  phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
  rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
  every Job pod.
- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
  unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
  stuck Job, skip a phase, manually re-trigger from a specific phase).

## Why kept (not deleted)

Documents the prompted-agent design and is useful as historical reference when
reading post-mortem discussions or comparing approaches. The `name` field has
been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
`claude-agent-service`.

---

# Original prompt — DO NOT EXECUTE (reference only)

You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).

## Your Job

Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.

The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.

## Inputs

The user prompt contains a JSON object with these fields:

```json
{
  "target_version": "1.34.5",
  "kind": "patch",
  "dry_run": false,
  "stages": "all"
}
```

| Field | Required | Description |
|---|---|---|
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |

Parse the prompt's first JSON block to extract these. If a required field (`target_version` or `kind`) is missing, abort with a Slack notification ("malformed payload").
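
As a rough sketch of the "malformed payload" check, the two required fields could be validated as below once they have been extracted from the JSON. The helper names (`is_valid_version`, `is_valid_kind`, `minor_of`) are hypothetical and not part of the original prompt or of `update_k8s.sh`.

```shell
# Hypothetical validators for the two required payload fields.
is_valid_version() {
  # Accept only an exact X.Y.Z, with no leading "v" and no pre-release suffix.
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

is_valid_kind() {
  case "$1" in
    patch|minor) return 0 ;;
    *) return 1 ;;
  esac
}

minor_of() {
  # 1.34.5 -> 1.34 (strip the last dot-separated component)
  printf '%s\n' "${1%.*}"
}

target_version="1.34.5"
kind="patch"
if is_valid_version "$target_version" && is_valid_kind "$kind"; then
  echo "payload ok, target_minor=$(minor_of "$target_version")"
else
  echo "malformed payload" >&2
fi
```

In the agent itself, the failure branch would call the `slack` helper and exit instead of only printing.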

## Environment

- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh -- --role <role> --release <X.Y.Z>`. The `--` is required: without it, the remote `bash -s` treats `--role` as one of its own options instead of passing it to the script.

### Credentials — fetched at startup

The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:

```bash
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"

# SSH private key — mode 0400 required by openssh
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
  -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key

# Slack webhook (URL string)
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
  -o jsonpath='{.data.slack_webhook}' | base64 -d)
```

The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:

```bash
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
```

Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.

## NEVER do

- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
- Never skip the etcd snapshot — even for patch
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
- Never run two stages in parallel — sequential only
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively

## Slack + Pushgateway helpers

Every transition posts to Slack:

```bash
slack() {
  local msg="$1"
  local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
    "$hook"
}
```

Start every message with `[k8s-upgrade]` so it's grep-able.

Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:

```bash
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'

push_metric() {
  # push_metric <name> <value>
  local name="$1" val="$2"
  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
    | curl -sS --data-binary @- "$PG"
}
```

Pushes you must make at specific stages (skipped in dry_run):

| When | Metric | Value |
|---|---|---|
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |

If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.

## Stage 0: Parse inputs + announce

1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
2. Derive `target_minor` from `target_version` (split on `.` and keep the first two components, e.g. `1.34.5` → `1.34`).
3. Mark the in-flight annotation on the namespace AND push the Pushgateway gauges listed for "Stage 0 start":
   ```bash
   if [ "$dry_run" = "false" ]; then
     kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
       viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
       viktorbarzin.me/k8s-upgrade-target="$target_version" \
       --overwrite

     push_metric k8s_upgrade_in_flight 1
     push_metric k8s_upgrade_target_minor "$target_minor"
     push_metric k8s_upgrade_snapshot_taken 0
   fi
   ```
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.

## Stage 1: Pre-flight (`stages` includes `preflight`)

Skip if `stages` excludes `preflight`.
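
The per-stage skip rule could be implemented with a small membership test, sketched below. `stage_enabled` is a hypothetical helper name; it assumes `stages` is either the literal `all` or a comma-separated list without spaces, as described in the Inputs table.

```shell
# Hypothetical helper: succeed when stage $1 should run for the given
# "stages" value $2 ("all" or a comma-separated list such as "preflight,snapshot").
stage_enabled() {
  [ "$2" = "all" ] && return 0
  # Wrap both sides in commas so "flight" does not match "preflight".
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

stages="preflight,snapshot"
for s in preflight snapshot containerd repo master workers postflight; do
  if stage_enabled "$s" "$stages"; then
    echo "run  $s"
  else
    echo "skip $s"
  fi
done
```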

### Check 1.1 — All nodes Ready, no pressure

```bash
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
```

Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.

### Check 1.2 — Halt-on-alert (same query kured uses)

```bash
# Fail closed: an unreachable Prometheus must not look like "no alerts firing".
RESP=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts') \
  || { slack "ABORT preflight — Prometheus unreachable, cannot evaluate alerts"; exit 1; }

ALERTS=$(printf '%s' "$RESP" \
  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
  | sort -u)

if [ -n "$ALERTS" ]; then
  # $'\n' is a real newline; "\n" inside double quotes would reach Slack as a literal backslash-n
  slack "ABORT preflight — firing alerts:"$'\n'"$ALERTS"
  exit 1
fi
```

### Check 1.3 — 24h-quiet baseline

Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.

```bash
RECENT_REBOOT=0
while IFS= read -r ts; do
  [ -z "$ts" ] && continue
  diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
  [ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')

if [ "$RECENT_REBOOT" -eq 1 ]; then
  slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
  exit 1
fi
```

### Check 1.4 — kubeadm upgrade plan reports our target

```bash
# kubeadm prints the suggested command on its own line ("kubeadm upgrade apply vX.Y.Z"),
# so match that line rather than the sentence that introduces it.
PLAN_TARGET=$($SSH \
  wizard@k8s-master 'sudo kubeadm upgrade plan' \
  | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
  | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
```

If `$PLAN_TARGET` does not match the requested `target_version`, slack-abort:
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."

Slack: `Pre-flight clean. Proceeding to etcd snapshot.`

## Stage 2: Etcd snapshot (`stages` includes `snapshot`)

Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.

```bash
JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"

if [ "$dry_run" = "false" ]; then
  $KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"

  # Wait up to 10 min for snapshot Job to complete
  $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
    slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
    $KUBECTL -n default describe "job/$JOB_NAME" | tail -30
    exit 1
  }

  # Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
  LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
  echo "$LOG"
  SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
  SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
  SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')

  if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
    slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
    exit 1
  fi

  TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
  $KUBECTL annotate ns k8s-upgrade \
    viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite

  push_metric k8s_upgrade_snapshot_taken 1
else
  TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
  SIZE="dry-run"
fi

slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
```

## Stage 3: Master containerd skew fix (`stages` includes `containerd`)

Only run if master containerd version < highest worker containerd version.

```bash
get_ctr_version() {
  $SSH \
    "wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
}

MASTER_CTR=$(get_ctr_version k8s-master)
WORKER_MAX="0.0.0"
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
  v=$(get_ctr_version "$n")
  # Compare semver-ish
  if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
    WORKER_MAX="$v"
  fi
done

if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
   && [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
  # Master is behind — bump
  slack "Master containerd $MASTER_CTR < workers $WORKER_MAX — bumping master"

  if [ "$dry_run" = "false" ]; then
    $SSH \
      wizard@k8s-master "sudo apt-mark unhold containerd.io \
        && sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
        && sudo apt-mark hold containerd.io \
        && sudo systemctl restart containerd"

    # Wait until kubelet on master is Ready again
    for i in $(seq 1 60); do
      STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
      [ "$STATUS" = "True" ] && break
      sleep 10
    done
    [ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
  fi

  slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
else
  echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
fi
```
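
The two inline `sort -V` comparisons above follow one pattern, which can be factored into a helper. The sketch below is illustrative only: `version_lt` is a hypothetical name, and it assumes a `sort` that supports `-V` (GNU coreutils or BusyBox).

```shell
# Hypothetical helper: succeed when version $1 is strictly older than $2.
# Relies on `sort -V` (version sort), which is not required by POSIX.
version_lt() {
  [ "$1" != "$2" ] \
    && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example values only; real values would come from get_ctr_version.
MASTER_CTR="1.7.9"
WORKER_MAX="1.7.27"
if version_lt "$MASTER_CTR" "$WORKER_MAX"; then
  echo "master containerd $MASTER_CTR is behind workers $WORKER_MAX"
else
  echo "no skew fix needed"
fi
```

A plain string comparison would rank `1.7.9` above `1.7.27`, which is why version sort is used here.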

## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)

Only run if `kind=minor`.

For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:

```bash
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"

# $target_minor is expanded locally before the command string is sent to the node
for node in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
  if [ "$dry_run" = "false" ]; then
    $SSH \
      "wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
        && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
        && sudo apt-get update"
  fi
done
```

Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`

## Stage 5: Master upgrade (`stages` includes `master`)

```bash
# 5.1 Drain
if [ "$dry_run" = "false" ]; then
  kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
    --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
fi

# 5.2 Run the library script via SSH pipe
if [ "$dry_run" = "false" ]; then
  $SSH \
    wizard@k8s-master 'bash -s' \
    < $WORKSPACE_DIR/scripts/update_k8s.sh \
    -- --role master --release "$target_version"
fi

# 5.3 Uncordon + wait Ready
# Kept inside the dry_run guard: without the upgrade the kubelet version never
# matches, so a dry run would otherwise wait 15 min and then abort.
if [ "$dry_run" = "false" ]; then
  kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master

  for i in $(seq 1 60); do
    STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
      -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
    [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
    sleep 15
  done

  [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
    || { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
fi

# 5.4 All control-plane pods Running
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
  -l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }

# 5.5 Re-check halt-on-alert
# (re-run the Check 1.2 query, abort if anything new fires)

slack "Master upgrade complete. Cluster on v$target_version. Healthy."
```

## Stage 6: Workers sequentially (`stages` includes `workers`)

Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).

For each worker `$node`:

1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
4. `kubectl uncordon $node`
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
7. Slack: `Worker $node complete ($i/4)`.

```bash
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
i=0
for node in $WORKERS; do
  i=$((i+1))

  # Halt-on-alert recheck with retry
  for attempt in $(seq 1 30); do
    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
      | sort -u)
    [ -z "$ALERTS" ] && break
    echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
    sleep 60
  done
  [ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }

  if [ "$dry_run" = "false" ]; then
    kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
      --ignore-daemonsets --delete-emptydir-data --force --grace-period=300

    $SSH \
      "wizard@$node" 'bash -s' \
      < $WORKSPACE_DIR/scripts/update_k8s.sh \
      -- --role worker --release "$target_version"

    kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"

    # Wait Ready + version match (inside the dry_run guard: a dry run never changes the version)
    for w in $(seq 1 60); do
      STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
      KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
        -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
      [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
      sleep 15
    done
    [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
      || { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
  fi

  # 10-min soak with halt-on-alert (10 polls, 60 s apart)
  echo "Soaking $node for 10 min..."
  for minute in $(seq 1 10); do
    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
      | sort -u)
    [ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
    sleep 60
  done

  slack "Worker $node upgrade complete ($i/4). Soaked clean."
done
```

Note: during the soak we add `RecentNodeReboot` to the ignore-list because we know the upgrade just restarted the kubelet on that node, and a kubelet restart counts as a reboot for that alert.

## Stage 7: Post-flight (`stages` includes `postflight`)

```bash
# All 5 nodes at target
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
echo "$VERSIONS"
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
# $'\n' is a real newline; "\n" inside double quotes would reach Slack as a literal backslash-n
[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:"$'\n'"$VERSIONS"; exit 1; }

# Upgrade Gates all inactive
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
  | sort -u)
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):"$'\n'"$FIRING"

# pod-ready ratio >= 0.9
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
  | jq -r '.data.result[0].value[1] // "0"')
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"

# Clear the in-flight annotation + Pushgateway gauges
if [ "$dry_run" = "false" ]; then
  kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
    viktorbarzin.me/k8s-upgrade-in-flight- \
    viktorbarzin.me/k8s-upgrade-target- \
    viktorbarzin.me/k8s-upgrade-snapshot-path- || true

  push_metric k8s_upgrade_in_flight 0
  push_metric k8s_upgrade_snapshot_taken 0
fi

slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
```

## Rollback

This agent does NOT auto-rollback. If anything aborts mid-flight:

1. Slack the failure with the last known stage + node.
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.

The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.

## Notes for tests

- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
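
The "WOULD: <cmd>" behaviour described for Test 2 could be centralised in one wrapper instead of repeating `if [ "$dry_run" = "false" ]` around every mutation. This is a sketch under that assumption; `run` is a hypothetical helper that the original prompt does not define.

```shell
# Hypothetical wrapper: execute the command normally, or only print it
# prefixed with "WOULD:" when dry_run=true.
run() {
  if [ "${dry_run:-false}" = "true" ]; then
    echo "WOULD: $*"
  else
    "$@"
  fi
}

dry_run=true
# With dry_run=true nothing is executed, so this is safe to run anywhere.
run kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data
```

Read-only commands would be called directly, without the wrapper, so they still run during a dry run.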

## Edge cases

- **Slack down**: Don't block the upgrade — continue, log to stderr.
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
- **kubectl drain hangs on a PDB-violating pod**: the 5-minute grace period is a hard limit. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here; slack-abort and let the operator investigate.
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
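
The 10 GiB free-space rule for the snapshot directory could be checked roughly as follows. Both helper names are hypothetical, and the sketch assumes the check runs somewhere the snapshot directory is actually mounted (in this setup the NFS export is mounted inside the backup Job, so the path used below is only a stand-in).

```shell
# Hypothetical helper: available KiB on the filesystem holding path $1.
# `df -P` gives the portable single-line format; column 4 is "Available".
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed when $1 KiB is at least $2 GiB.
has_min_free_gib() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

SNAPSHOT_DIR="/"   # stand-in path; the real directory is the mounted etcd-backup export
avail=$(free_kib "$SNAPSHOT_DIR")
if has_min_free_gib "$avail" 10; then
  echo "snapshot dir has enough space (${avail} KiB free)"
else
  echo "ABORT: less than 10 GiB free in $SNAPSHOT_DIR (${avail} KiB)" >&2
fi
```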

## Verification claims you must make

When you `slack` a SUCCESS message, you must have actually verified:

- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
- No alerts firing outside the ignore-list
- pod-ready ratio computed from Prometheus

Do not declare success without those three confirmations.
|
|
||||||
|
|
@ -1,194 +0,0 @@
|
||||||
---
|
|
||||||
name: payslip-extractor
|
|
||||||
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
|
|
||||||
model: haiku
|
|
||||||
allowedTools:
|
|
||||||
- Bash
|
|
||||||
- Read
|
|
||||||
---
|
|
||||||
|
|
||||||
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
|
|
||||||
|
|
||||||
## Your single job
|
|
||||||
|
|
||||||
Given a prompt that contains EITHER:
|
|
||||||
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
|
|
||||||
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
|
|
||||||
|
|
||||||
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
|
|
||||||
|
|
||||||
## RSU handling (important — Meta UK payslips)
|
|
||||||
|
|
||||||
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
|
|
||||||
|
|
||||||
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
|
|
||||||
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
|
|
||||||
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
|
|
||||||
|
|
||||||
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
|
|
||||||
|
|
||||||
If the payslip has no stock component, leave both as 0.
|
|
||||||
|
|
||||||
## Earnings decomposition (v2)
|
|
||||||
|
|
||||||
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
|
|
||||||
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
|
|
||||||
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
|
|
||||||
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
|
|
||||||
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
|
|
||||||
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
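
The decomposition rules above can be sketched in Python as follows. This is a minimal illustration, not the extractor's actual implementation: the function name `decompose_earnings`, the `(label, amount)` input shape, and the grouping of labels into sets are assumptions made for the example. Only the labels and the output field names come from the rules above.

```python
# Hypothetical helper: maps Payments-block lines (This Period column) to schema fields.
SALARY_LABELS = {"Salary", "Basic Pay"}
BONUS_LABELS = {"Perform Bonus", "Bonus", "Performance Bonus"}
RSU_LABELS = {"RSU Tax Offset", "RSU Excs Refund"}  # Meta UK template labels


def decompose_earnings(payments):
    """payments: list of (label, amount) tuples taken from the Payments block."""
    out = {"salary": 0, "bonus": 0, "pension_sacrifice": 0, "rsu_vest": 0}
    for label, amount in payments:
        if label in SALARY_LABELS and out["salary"] == 0:
            out["salary"] = amount  # the first salary line wins
        elif label in BONUS_LABELS:
            out["bonus"] = amount
        elif "Pension" in label and amount < 0:
            # A negative pension line in Payments is salary sacrifice: store the magnitude.
            out["pension_sacrifice"] = abs(amount)
        elif label in RSU_LABELS:
            # Both RSU earnings lines are summed into rsu_vest.
            out["rsu_vest"] = round(out["rsu_vest"] + amount, 2)
    return out
```

A missing bonus line simply leaves `bonus` at 0, and a positive pension line is deliberately ignored here because it belongs in `pension_employee` on the Deductions side.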
|
|
||||||
|
|
||||||
## Fast path: PAYSLIP_TEXT is present
|
|
||||||
|
|
||||||
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
|
|
||||||
|
|
||||||
## Processing steps
|
|
||||||
|
|
||||||
### Step 1. Extract and decode the base64 PDF
|
|
||||||
|
|
||||||
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
|
|
||||||
|
|
||||||
Preferred method (handles whitespace and very long blobs robustly):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 - <<'PY'
|
|
||||||
import base64, re, pathlib, sys, os
|
|
||||||
prompt = os.environ.get("PAYSLIP_PROMPT", "")
|
|
||||||
# If the orchestrator didn't set an env var, fall back to reading the transcript via CWD stdin mechanism.
|
|
||||||
# In practice the agent receives the prompt in its conversation — you extract the PDF_BASE64 value
|
|
||||||
# from the prompt text you were given, strip whitespace, and base64-decode.
|
|
||||||
PY
|
|
||||||
```
|
|
||||||
|
|
||||||
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 -c "
|
|
||||||
import base64, sys
|
|
||||||
data = sys.stdin.read().strip()
|
|
||||||
open('/tmp/payslip.pdf','wb').write(base64.b64decode(data))
|
|
||||||
print('decoded bytes:', len(base64.b64decode(data)))
|
|
||||||
" <<'B64'
|
|
||||||
<paste-the-base64-here>
|
|
||||||
B64
|
|
||||||
```
|
|
||||||
|
|
||||||
Or pipe via shell `base64 -d`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
|
|
||||||
```
|
|
||||||
|
|
||||||
Verify the file looks like a PDF:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
head -c 8 /tmp/payslip.pdf | xxd
|
|
||||||
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
|
|
||||||
```
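
The decode-then-verify sequence in this step can also be expressed as one Python function. This is a sketch under assumptions: the name `decode_pdf` is hypothetical, and the caller is assumed to write the returned bytes to `/tmp/payslip.pdf` itself.

```python
import base64


def decode_pdf(b64_text: str) -> bytes:
    """Decode a base64 blob and confirm it starts with the PDF magic bytes."""
    # Remove all whitespace (newlines, spaces) that wrapped blobs often contain.
    compact = "".join(b64_text.split())
    # validate=True rejects non-base64 characters instead of silently skipping them.
    data = base64.b64decode(compact, validate=True)
    if not data.startswith(b"%PDF-"):
        raise ValueError("base64 did not decode to a valid PDF")
    return data
```

Both failure cases (invalid base64 and a payload that is not a PDF) surface as `ValueError`, which maps naturally onto the error object described under "Failure mode".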
|
|
||||||
|
|
||||||
### Step 2. Extract text from the PDF
|
|
||||||
|
|
||||||
Try tools in this order. Use the first one that works; do not chain all of them.
|
|
||||||
|
|
||||||
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
|
|
||||||
```bash
|
|
||||||
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Python `pypdf` fallback:
|
|
||||||
```bash
python3 -c "
from pypdf import PdfReader
r = PdfReader('/tmp/payslip.pdf')
for p in r.pages:
    print(p.extract_text() or '')
"
```
|
|
||||||
|
|
||||||
3. Python `pdfplumber` fallback:
|
|
||||||
```bash
python3 -c "
import pdfplumber
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text() or '')
"
```
|
|
||||||
|
|
||||||
4. If none of those are installed, check what IS available:
|
|
||||||
```bash
|
|
||||||
which pdftotext pdf2txt.py mutool
|
|
||||||
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
|
|
||||||
```
|
|
||||||
and use whatever you find (e.g. `mutool draw -F txt`).
|
|
||||||
|
|
||||||
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
|
|
||||||
|
|
||||||
### Step 3. Parse the extracted text
|
|
||||||
|
|
||||||
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
|
|
||||||
|
|
||||||
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
|
|
||||||
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
|
|
||||||
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
|
|
||||||
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
|
|
||||||
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
|
|
||||||
- "Gross Pay" / "Total Gross" — sum of payments.
|
|
||||||
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
|
|
||||||
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
|
|
||||||
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
|
|
||||||
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
|
|
||||||
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
|
|
||||||
|
|
||||||
### Step 4. Map to the schema and emit JSON
|
|
||||||
|
|
||||||
Rules that apply regardless of the caller's exact schema:
|
|
||||||
|
|
||||||
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
|
|
||||||
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
|
|
||||||
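- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`. The only exceptions are the summary-block fields that the "Earnings decomposition" section says to set to `null` when absent (`taxable_pay`, `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`).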
|
|
||||||
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
|
|
||||||
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
|
|
||||||
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
|
|
||||||
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
|
|
||||||
|
|
||||||
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
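
The date and money rules above can be illustrated with two small Python helpers. This is a sketch only: the names `normalise_pay_date` and `parse_money` are hypothetical, and the date helper assumes the payslip prints dates with slashes in `DD/MM/YYYY` form.

```python
from datetime import datetime


def normalise_pay_date(raw: str) -> str:
    """Interpret a printed date as UK DD/MM/YYYY and return YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").strftime("%Y-%m-%d")


def parse_money(raw: str) -> float:
    """Strip the pound sign, thousands separators and spaces; keep the sign."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    # An empty cell is treated as a missing numeric field, which defaults to 0.
    return float(cleaned) if cleaned else 0.0
```

Because the format string is fixed to day-first ordering, an ambiguous date such as `01/02/2026` resolves to 1 February, which is the UK preference stated above.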
|
|
||||||
|
|
||||||
## Failure mode
|
|
||||||
|
|
||||||
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{"error": "<short human reason>"}
|
|
||||||
```
|
|
||||||
|
|
||||||
Examples of acceptable error reasons:
|
|
||||||
- `"base64 did not decode to a valid PDF"`
|
|
||||||
- `"pdf has no extractable text layer (image-only scan)"`
|
|
||||||
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
|
|
||||||
- `"document does not appear to be a UK payslip"`
|
|
||||||
- `"pay_date not found on document"`
|
|
||||||
|
|
||||||
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
|
|
||||||
|
|
||||||
## Hard constraints — things you MUST NOT do
|
|
||||||
|
|
||||||
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
|
|
||||||
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
|
|
||||||
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
|
|
||||||
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
|
|
||||||
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
|
|
||||||
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
|
|
||||||
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
|
|
||||||
|
|
||||||
## Output discipline — summary
|
|
||||||
|
|
||||||
- Exactly one JSON object, UTF-8, no BOM.
|
|
||||||
- Keys match the schema the caller gave you.
|
|
||||||
- Numeric fields are JSON numbers, not strings.
|
|
||||||
- `pay_date` is `YYYY-MM-DD`.
|
|
||||||
- `other_deductions` is always present and is an object (possibly `{}`).
|
|
||||||
- Missing money → `0`, missing string → `""`, missing object → `{}`.
|
|
||||||
- On unrecoverable failure, one JSON object with a single `error` key.
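
A caller could check a final message against this summary with something like the following. It is a hedged sketch, not part of the agent: `check_output` is a hypothetical name, and it covers only the rules that can be verified without knowing the caller's schema.

```python
import json
import re


def check_output(text: str) -> list:
    """Return a list of rule violations for a final message; empty means OK."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        # Markdown fences, preambles and trailing notes all end up here.
        return ["not a single valid JSON object"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    if "error" in obj:
        # Error objects must carry the error key and nothing else.
        return [] if set(obj) == {"error"} else ["error object has extra keys"]
    problems = []
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(obj.get("pay_date", ""))):
        problems.append("pay_date is not YYYY-MM-DD")
    if not isinstance(obj.get("other_deductions"), dict):
        problems.append("other_deductions missing or not an object")
    return problems
```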
|
|
||||||
|
|
||||||
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.
|
|
||||||
|
|
@ -34,11 +34,7 @@ You receive these parameters in your invocation:
|
||||||
- **Infra repo**: `/home/wizard/code/infra`
|
- **Infra repo**: `/home/wizard/code/infra`
|
||||||
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
|
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
|
||||||
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
- **Kubeconfig**: `/home/wizard/code/infra/config`
|
||||||
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`:
|
- **Vault**: Authenticate with `vault login -method=oidc` if needed. Secrets at `secret/viktor` and `secret/platform`.
|
||||||
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
|
|
||||||
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
|
|
||||||
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
|
|
||||||
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
|
|
||||||
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
|
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
|
||||||
|
|
||||||
## NEVER Do
|
## NEVER Do
|
||||||
|
|
@ -122,6 +118,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
|
||||||
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
|
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
|
||||||
4. If auto-detect fails, verify the repo exists:
|
4. If auto-detect fails, verify the repo exists:
|
||||||
```bash
|
```bash
|
||||||
|
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||||
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
|
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
|
||||||
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
|
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
|
||||||
```
|
```
|
||||||
|
|
@ -131,6 +128,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
|
||||||
## Step 3: Fetch Changelogs via GitHub API
|
## Step 3: Fetch Changelogs via GitHub API
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||||
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||||
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
|
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
|
||||||
```
|
```
|
||||||
|
|
@ -173,9 +171,11 @@ Scan all intermediate release notes for breaking change indicators from the conf
|
||||||
## Step 5: Slack Notification — Starting
|
## Step 5: Slack Notification — Starting
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
SLACK_WEBHOOK=$(vault kv get -field=alertmanager_slack_api_url secret/platform)
|
||||||
|
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
curl -s -X POST -H 'Content-type: application/json' \
|
||||||
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
|
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
|
||||||
"$SLACK_WEBHOOK_URL"
|
"$SLACK_WEBHOOK"
|
||||||
```
|
```
|
||||||
|
|
||||||
For CAUTION risk, include breaking change excerpts in the Slack message.
|
For CAUTION risk, include breaking change excerpts in the Slack message.
|
||||||
|
|
@ -266,28 +266,23 @@ UPGRADE_SHA=$(git rev-parse HEAD)
|
||||||
|
|
||||||
## Step 9: Wait for Woodpecker CI
|
## Step 9: Wait for Woodpecker CI
|
||||||
|
|
||||||
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19).
|
The commit triggers the `app-stacks.yml` pipeline (or `default.yml` for platform stacks).
|
||||||
|
|
||||||
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Find the pipeline for our commit
|
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_token secret/viktor)
|
||||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
|
||||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
|
|
||||||
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
|
|
||||||
# → $PIPELINE_NUMBER
|
|
||||||
|
|
||||||
# Fetch detail (includes workflows[])
|
|
||||||
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
|
|
||||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
|
|
||||||
| jq '.workflows[] | select(.name=="default") | .state'
|
|
||||||
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes.
|
Poll for the pipeline triggered by our commit:
|
||||||
|
```bash
|
||||||
|
# Get latest pipeline
|
||||||
|
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
|
||||||
|
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=5"
|
||||||
|
```
|
||||||
|
|
||||||
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state.
|
Find the pipeline matching our commit SHA. Poll every 30 seconds until status is `success`, `failure`, `error`, or `killed`. Timeout after 15 minutes.
|
||||||
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
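
The decision rule for the `default` workflow can be sketched as two pure Python functions. This is an illustration under assumptions: the function names are hypothetical, and the pipeline dictionary is assumed to have the `workflows[].name` and `workflows[].state` shape that the `jq` filter above reads.

```python
TERMINAL_STATES = {"success", "failure", "error", "killed"}


def default_workflow_state(pipeline: dict):
    """Return the state of the workflow named 'default', or None if absent."""
    for workflow in pipeline.get("workflows") or []:
        if workflow.get("name") == "default":
            return workflow.get("state")
    return None


def next_step(state, timed_out: bool) -> str:
    """Map the polled state to the step to run next."""
    if state == "success":
        return "Step 10"
    if timed_out or state in TERMINAL_STATES:
        return "Step 10b"
    return "keep polling"
```

The overall pipeline `status` is never consulted, so a failing `build-cli` workflow cannot trigger a rollback on its own.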
|
|
||||||
|
**If CI fails** → proceed to Step 10 (rollback).
|
||||||
|
**If CI succeeds** → proceed to verification.
|
||||||
|
|
||||||
## Step 10: Verify
|
## Step 10: Verify
|
||||||
|
|
||||||
|
|
@ -346,7 +341,7 @@ Re-run verification checks to confirm rollback succeeded. If rollback verificati
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
curl -s -X POST -H 'Content-type: application/json' \
|
||||||
--data "{\"text\":\"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required.\"}" \
|
--data "{\"text\":\"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required.\"}" \
|
||||||
"$SLACK_WEBHOOK_URL"
|
"$SLACK_WEBHOOK"
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 11: Report Results
|
## Step 11: Report Results
|
||||||
|
|
@ -355,14 +350,14 @@ curl -s -X POST -H 'Content-type: application/json' \
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
curl -s -X POST -H 'Content-type: application/json' \
|
||||||
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
|
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
|
||||||
"$SLACK_WEBHOOK_URL"
|
"$SLACK_WEBHOOK"
|
||||||
```
|
```
|
||||||
|
|
||||||
### On failure + rollback
|
### On failure + rollback
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
curl -s -X POST -H 'Content-type: application/json' \
|
||||||
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
|
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
|
||||||
"$SLACK_WEBHOOK_URL"
|
"$SLACK_WEBHOOK"
|
||||||
```
|
```
|
||||||
|
|
||||||
## Edge Cases
|
## Edge Cases
|
||||||
|
|
|
||||||
1728
.claude/cluster-health.sh
Executable file
File diff suppressed because it is too large
|
|
@ -7,7 +7,6 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
|
||||||
import argparse
|
import argparse
|
||||||
import json
|
import json
|
||||||
import os
|
import os
|
||||||
import subprocess
|
|
||||||
import sys
|
import sys
|
||||||
from urllib.parse import urljoin
|
from urllib.parse import urljoin
|
||||||
|
|
||||||
|
|
@ -18,29 +17,13 @@ except ImportError:
|
||||||
print(" pip install requests")
|
print(" pip install requests")
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Configuration from environment variables (ha-sofia specific)
|
||||||
|
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
|
||||||
|
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
|
||||||
|
|
||||||
def _token_from_homelab():
|
if not HA_URL or not HA_TOKEN:
|
||||||
"""Resolve the token via the homelab CLI when the env var isn't set, so the
|
print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
|
||||||
script works from any directory / unprovisioned session (see ADR-0012)."""
|
print("These should be set when activating the Claude venv (~/.venvs/claude)")
|
||||||
try:
|
|
||||||
out = subprocess.run(
|
|
||||||
["homelab", "ha", "token", "--instance", "sofia"],
|
|
||||||
capture_output=True, text=True, timeout=30)
|
|
||||||
if out.returncode == 0 and out.stdout.strip():
|
|
||||||
return out.stdout.strip()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
|
|
||||||
# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
|
|
||||||
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
|
|
||||||
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
|
|
||||||
|
|
||||||
if not HA_TOKEN:
|
|
||||||
print("ERROR: no ha-sofia API token available.")
|
|
||||||
print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
|
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
HEADERS = {
|
HEADERS = {
|
||||||
|
|
|
||||||
|
|
@ -2,41 +2,20 @@
|
||||||
|
|
||||||
> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.
|
> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.
|
||||||
|
|
||||||
## Applications (11)
|
## Applications (10)
|
||||||
| Application | Provider Type | Auth Flow |
|
| Application | Provider Type | Auth Flow |
|
||||||
|-------------|--------------|-----------|
|
|-------------|--------------|-----------|
|
||||||
| Cloudflare Access | OAuth2/OIDC | implicit consent |
|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
|
||||||
| Domain wide catch all | Proxy (forward auth) | implicit consent |
|
| Domain wide catch all | Proxy (forward auth) | implicit consent |
|
||||||
| Forgejo | OAuth2/OIDC | implicit consent |
|
| Forgejo | OAuth2/OIDC | explicit consent |
|
||||||
| Grafana | OAuth2/OIDC | implicit consent |
|
| Grafana | OAuth2/OIDC | implicit consent |
|
||||||
| Headscale | OAuth2/OIDC | implicit consent |
|
| Headscale | OAuth2/OIDC | explicit consent |
|
||||||
| Immich | OAuth2/OIDC | implicit consent |
|
| Immich | OAuth2/OIDC | explicit consent |
|
||||||
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
|
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
|
||||||
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
|
| linkwarden | OAuth2/OIDC | explicit consent |
|
||||||
| linkwarden | OAuth2/OIDC | implicit consent |
|
| Matrix | OAuth2/OIDC | implicit consent |
|
||||||
| Vault | OAuth2/OIDC | implicit consent |
|
|
||||||
| wrongmove | OAuth2/OIDC | implicit consent |
|
| wrongmove | OAuth2/OIDC | implicit consent |
|
||||||
|
|
||||||
> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
|
|
||||||
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
|
|
||||||
> and Vault (53) were switched from
|
|
||||||
> `default-provider-authorization-explicit-consent` via the API (these
|
|
||||||
> providers are UI-managed, not in TF). All are first-party apps; the
|
|
||||||
> expiring consent screen (re-shown every 4 weeks per app) only slowed
|
|
||||||
> first-time signin.
|
|
||||||
|
|
||||||
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
|
|
||||||
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
|
|
||||||
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
|
|
||||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12), so the dashboard runs
|
|
||||||
> on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a
|
|
||||||
> future SSO retry once apiserver OIDC is fixed.
|
|
||||||
>
|
|
||||||
> **admin-services-restriction** policy (TF-managed in
|
|
||||||
> `stacks/authentik/admin-services-restriction.tf`, adopted 2026-06-04): gates the
|
|
||||||
> 15 admin-only hostnames to `Home Server Admins`, with a carve-out admitting the
|
|
||||||
> `kubernetes-*` RBAC groups to `k8s.viktorbarzin.me` (dashboard login page).
|
|
||||||
|
|
||||||
## Groups (9)
|
## Groups (9)
|
||||||
| Group | Parent | Superuser | Purpose |
|
| Group | Parent | Superuser | Purpose |
|
||||||
|-------|--------|-----------|---------|
|
|-------|--------|-----------|---------|
|
||||||
|
|
@ -57,7 +36,7 @@
|
||||||
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
|
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
|
||||||
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
|
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
|
||||||
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
|
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
|
||||||
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users, kubernetes-namespace-owners, sops-vabbit81 |
|
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
|
||||||
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
|
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
|
||||||
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
|
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
|
||||||
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
|
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
|
||||||
|
|
@ -69,27 +48,8 @@
|
||||||
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
|
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
|
||||||
|
|
||||||
## Authorization Flows
|
## Authorization Flows
|
||||||
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
|
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
|
||||||
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
|
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
|
||||||
|
|
||||||
## Authentication Flow (single-screen login, 2026-06-10)
|
|
||||||
|
|
||||||
`default-authentication-flow` bindings: identification (order 10) →
|
|
||||||
mfa-validation (order 30) → user-login (order 100). The identification
|
|
||||||
stage (`default-authentication-identification`, pk
|
|
||||||
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
|
|
||||||
`default-authentication-password`, so username + password render on ONE
|
|
||||||
screen (one round trip instead of two). The previously separate
|
|
||||||
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
|
|
||||||
was DELETED via the API — authentik requires removing it when the
|
|
||||||
identification stage embeds the password field. `password_stage` is pinned in
|
|
||||||
Terraform (`authentik_stage_identification.default_identification` in
|
|
||||||
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
|
|
||||||
UI-managed via `ignore_changes`. Social-login buttons remain on the same
|
|
||||||
screen and bypass the password field, so Google/GitHub/Facebook users are
|
|
||||||
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
|
|
||||||
binding, users would briefly see a second password prompt — delete the
|
|
||||||
binding again.
|
|
||||||
|
|
||||||
## Invitation Enrollment Flow
|
## Invitation Enrollment Flow
|
||||||
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
|
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
|
||||||
|
|
@ -159,87 +119,3 @@ Removed bindings from:
|
||||||
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
|
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
|
||||||
|
|
||||||
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
|
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
|
||||||
|
|
||||||
## Session Duration (2026-05-01)
|
|
||||||
|
|
||||||
Pinned via Terraform in `stacks/authentik/`:
|
|
||||||
|
|
||||||
| Knob | Value | Surface | Effect |
|
|
||||||
|------|-------|---------|--------|
|
|
||||||
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
|
|
||||||
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
|
|
||||||
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
|
|
||||||
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
|
|
||||||
|
|
||||||
Notes:
|
|
||||||
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
|
|
||||||
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
|
|
||||||
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
|
|
||||||
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster. They are codified via a `kubernetes_json_patches.deployment` patch that adds an `envFrom` referencing the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch, so the pods match the `component:server` half of the Service selector that the controller adds for embedded outposts.
|
|
||||||
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
|
|
||||||
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantics: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Re-verify this assumption after every provider upgrade: if the schema starts exposing `managed`, an apply could reset it.
|
|
||||||
|
|
||||||
## WebAuthn / Passkeys (2026-06-20)
|
|
||||||
|
|
||||||
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
|
|
||||||
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
|
|
||||||
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`) was not changed; the break was purely the missing device records.
|
|
||||||
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
|
|
||||||
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
|
|
||||||
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
|
|
||||||
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
|
|
||||||
- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
|
|
||||||
- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
|
|
||||||
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
|
|
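The cleanup lesson from the 2026-06-18 wipe can be sketched as a small guard. This is a minimal illustration, not the original ad-hoc script: it assumes the user search results have already been flattened to `pk<TAB>username` lines (for example with `jq`), and both the helper name `exact_user_pk` and the username `demo-user` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the pk of the user whose username matches exactly.
# Input on stdin: one "pk<TAB>username" line per search result.
# Returns non-zero unless exactly one exact match exists, so a fuzzy
# ?search= hit can never be used just because it happened to come first.
exact_user_pk() {
  local want="$1" pk name
  local -a matches=()
  # The "|| [[ -n $pk ]]" clause keeps a final line without a trailing newline.
  while IFS=$'\t' read -r pk name || [[ -n "$pk" ]]; do
    [[ "$name" == "$want" ]] && matches+=("$pk")
  done
  if (( ${#matches[@]} != 1 )); then
    echo "refusing cleanup: expected exactly 1 user named '$want', got ${#matches[@]}" >&2
    return 1
  fi
  printf '%s\n' "${matches[0]}"
}
```

A cleanup script would call this before any `DELETE` and abort on a non-zero return, rather than indexing into the search results.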
||||||
|
|
||||||
## Upgrade Validation Checklist
|
|
||||||
|
|
||||||
Run after **any** of these:
|
|
||||||
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
|
|
||||||
- `goauthentik/authentik` Terraform provider version bump.
|
|
||||||
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
|
|
||||||
|
|
||||||
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Service routes to the outpost pods (NOT the server pods).
|
|
||||||
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
|
|
||||||
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
|
|
||||||
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
|
|
||||||
|
|
||||||
# 2. Service selector still excludes the server pods. Expected: includes
|
|
||||||
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
|
|
||||||
# `name: authentik`, the goauthentik upstream bug came back or our
|
|
||||||
# JSON patch was unset.
|
|
||||||
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
|
|
||||||
|
|
||||||
# 3. Outpost mode + session backend. Expected log lines on startup:
|
|
||||||
# {"embedded":true,"event":"Outpost mode",...}
|
|
||||||
# {"event":"using PostgreSQL session backend",...}
|
|
||||||
# If embedded=false or `using filesystem session backend`, the postgres
|
|
||||||
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
|
|
||||||
# schema started exposing `managed` and TF reset it.
|
|
||||||
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
|
|
||||||
|
|
||||||
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
|
|
||||||
# A file count above a few dozen indicates filesystem fallback is firing.
|
|
||||||
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
|
|
||||||
|
|
||||||
# 5. Postgres session table is growing with traffic. Expected: rows with
|
|
||||||
# `expires` ~28 days out (matches access_token_validity = weeks=4).
|
|
||||||
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
|
|
||||||
from django.db import connection; c = connection.cursor()
|
|
||||||
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
|
|
||||||
print(c.fetchone())"
|
|
||||||
|
|
||||||
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
|
|
||||||
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
|
|
||||||
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
|
|
||||||
|
|
||||||
# 7. Terraform plan-to-zero on the whole authentik stack.
|
|
||||||
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
|
|
||||||
```
|
|
||||||
|
|
||||||
Steps 1, 3, and 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback), where the alerts don't fire but the system loses its Postgres-backed session persistence on the next pod restart.
|
|
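Step 6 can be turned into a pass/fail check by classifying the response headers instead of reading them by eye. The following is a sketch under assumptions: `classify_edge_auth` is a hypothetical helper (not part of the repo), it expects the header dump produced by `curl -D -` on stdin, and it looks only at the status code without validating the `Location` target.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read HTTP response headers on stdin and print a verdict.
#   ok        302 redirect (forward-auth is sending the browser to log in)
#   fallback  401 with WWW-Authenticate (the Basic Auth fallback answered)
#   unknown   anything else; inspect the headers manually
classify_edge_auth() {
  local line lower status="" www_auth=0
  while IFS= read -r line || [[ -n "$line" ]]; do
    line="${line%$'\r'}"   # strip the CR from CRLF-terminated header lines
    lower="$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')"
    case "$lower" in
      http/*)             read -r _ status _ <<<"$line" ;;  # e.g. "HTTP/2 302"
      www-authenticate:*) www_auth=1 ;;
    esac
  done
  if [[ "$status" == "302" ]]; then
    echo "ok"
  elif [[ "$status" == "401" && "$www_auth" -eq 1 ]]; then
    echo "fallback"; return 1
  else
    echo "unknown"; return 1
  fi
}
```

Usage would look like `curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' | classify_edge_auth`.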
||||||
|
|
||||||
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch the goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52`; an upstream patch might let us drop our `kubernetes_json_patches.service` workaround.
|
|
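The step 2 expectation can also be asserted mechanically. This is a rough sketch: `selector_targets_outpost` is a hypothetical helper, and it assumes `kubectl` prints the selector map as compact JSON, which may differ between kubectl versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the selector string names the outpost
# pods. Assumes the selector was printed as compact JSON (no spaces).
selector_targets_outpost() {
  local selector="$1"
  [[ "$selector" == *'"app.kubernetes.io/name":"authentik-outpost-proxy"'* ]]
}

# Usage (not run here):
#   sel="$(kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost \
#            -o jsonpath='{.spec.selector}')"
#   selector_targets_outpost "$sel" || echo "selector regressed: $sel" >&2
```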
||||||
|
|
|
||||||
|
|
@ -23,28 +23,14 @@ module "nfs_data" {
|
||||||
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
|
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
|
||||||
4. Verify: `showmount -e 192.168.1.127`
|
4. Verify: `showmount -e 192.168.1.127`
|
||||||
|
|
||||||
## Static Site Hosting
|
|
||||||
Two patterns for serving a folder of static files (HTML/CSS/JS/media):
|
|
||||||
|
|
||||||
1. **Image-baked** (default for git-native content): bake files into an `nginx:*-alpine` image at build time, deploy like any owned app (CI builds + pushes, Keel/Woodpecker rolls out). Reference: `stacks/blog` (Hugo → nginx, `Website/Dockerfile`). Use when content lives in git and changes via commits.
|
|
||||||
|
|
||||||
2. **NFS-backed** (for externally-authored / large / non-git content): a stock `nginx:1.28-alpine` Deployment mounts an `nfs_volume` PVC **read-only** at `/usr/share/nginx/html`; a tiny ConfigMap supplies `/etc/nginx/conf.d/default.conf` (just `root` + `index <entry>.html`). Files are dropped on `/srv/nfs/<site>` out-of-band (Nextcloud "PVE NFS Pool" or rsync) — no rebuild, auto-backed-up by `nfs-mirror`. Reference: `stacks/stem95su` (established 2026-06-07). Use when content is authored outside git (e.g. exported tools), is large (avoids git/image bloat), or a non-dev updates it. **The export subdir on the PVE host must exist before the pod mounts** — the `nfs_volume` module does NOT create it (see "Adding NFS Exports"; a subdir under the already-exported `/srv/nfs` needs no new `/etc/exports` line).
|
|
||||||
|
|
||||||
Both patterns are fronted by `ingress_factory` (`auth="none"` for open public content → CrowdSec + ai-bot-block still apply; or chain `anubis_instance` for a PoW gate, as `blog` does).
|
|
||||||
|
|
||||||
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
|
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
|
||||||
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
|
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
|
||||||
|
|
||||||
## Anti-AI Scraping (4 Active Layers) (Updated 2026-05-10)
|
## Anti-AI Scraping (5-Layer Defense)
|
||||||
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
|
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
|
||||||
1. **Anubis PoW challenge** (per-site reverse proxy) — `modules/kubernetes/anubis_instance/`. Latest: `ghcr.io/techarohq/anubis:v1.25.0`. Difficulty 2 (~250 ms desktop / ~700 ms mobile), 30-day JWT cookie scoped to `viktorbarzin.me` so a single solve covers every Anubis-fronted subdomain. Active on: `viktorbarzin.me`, `kms.viktorbarzin.me`, `travel.viktorbarzin.me`. Add to a stack: `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<svc>.<ns>.svc.cluster.local" }`, then point ingress_factory at `module.anubis.service_name` + `port = module.anubis.service_port` and set `anti_ai_scraping = false`. Shared ed25519 signing key in Vault `secret/viktor` -> `anubis_ed25519_key`. **Avoid putting Anubis in front of CLI/API/Git endpoints (Forgejo, APIs, WebDAV)** — clients without JS can't solve PoW.
|
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Trap links before `</body>`
|
||||||
2. **Bot blocking forwardAuth** (ForwardAuth → bot-block-proxy → poison-fountain) — global default for non-Anubis sites. `bot-block-proxy` (OpenResty in `traefik` ns) is fail-open with 100 ms connect / 200 ms read timeouts so a downed poison-fountain costs ≤200 ms per request. Source: `stacks/traefik/modules/traefik/main.tf`.
|
4. Tarpit (~100 bytes/sec) 5. Poison content (CronJob every 6h, `--http1.1` required)
|
||||||
3. **X-Robots-Tag noai** — set by `traefik-anti-ai-headers` middleware. Anubis additionally serves a comprehensive `/robots.txt` (`SERVE_ROBOTS_TXT=true`) to well-behaved bots.
|
Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`
|
||||||
4. **Tarpit/poison content** (standalone at poison.viktorbarzin.me, `stacks/poison-fountain/`). Currently scaled to `replicas = 0` — fail-open path means no live traffic, no penalty.
|
|
||||||
|
|
||||||
Trap links (formerly a layer) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
|
|
||||||
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
|
|
||||||
Key files: `modules/kubernetes/anubis_instance/`, `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/traefik/modules/traefik/main.tf`
|
|
||||||
|
|
||||||
## Terragrunt Architecture
|
## Terragrunt Architecture
|
||||||
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
|
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
|
||||||
|
|
|
||||||
|
|
@ -92,21 +92,19 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
|
||||||
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|
||||||
|------|------|--------|------|-----|---------|------|-------|
|
|------|------|--------|------|-----|---------|------|-------|
|
||||||
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
|
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
|
||||||
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
|
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
|
||||||
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
|
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
|
||||||
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
|
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
|
||||||
| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
|
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
|
||||||
| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
|
| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
|
||||||
| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
|
| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||||
| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
|
| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||||
| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
|
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
|
||||||
| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
|
|
||||||
| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
|
|
||||||
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
|
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
|
||||||
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
|
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
|
||||||
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
|
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
|
||||||
|
|
||||||
**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).
|
**Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs
|
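The ~288 GB nominal figure can be re-derived from the RAM column. The sketch below sums the updated per-VM values for running VMs only (the stopped `pbs` VM is left out); the variable names are illustrative.

```bash
#!/usr/bin/env bash
# RAM (GB) of the running VMs, copied from the updated table rows:
# pfsense, devvm, home-assistant, k8s-master, k8s-node1,
# k8s-node2..6 (5 x 32), docker-registry, Windows10.
# pbs (8 GB) is stopped and therefore excluded.
ram_gb=(4 24 8 32 48 32 32 32 32 32 4 8)
physical_gb=272

total=0
for gb in "${ram_gb[@]}"; do
  total=$(( total + gb ))
done

# Prints: allocated=288GB physical=272GB overcommit=16GB
echo "allocated=${total}GB physical=${physical_gb}GB overcommit=$(( total - physical_gb ))GB"
```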
||||||
|
|
||||||
## VM Templates
|
## VM Templates
|
||||||
| VMID | Name | Purpose |
|
| VMID | Name | Purpose |
|
||||||
|
|
@ -124,9 +122,8 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
|
||||||
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
|
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
|
||||||
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
|
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
|
||||||
|
|
||||||
## GPU Node (currently k8s-node1)
|
## GPU Node (k8s-node1)
|
||||||
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
|
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
|
||||||
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
|
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
|
||||||
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
|
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
|
||||||
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
|
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
|
||||||
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
|
|
||||||
|
|
|
||||||
File diff suppressed because one or more lines are too long
|
|
@ -7,6 +7,7 @@
|
||||||
"docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
"docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
||||||
"mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
"mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
|
||||||
"docker.n8n.io/n8nio/n8n": "n8n-io/n8n",
|
"docker.n8n.io/n8nio/n8n": "n8n-io/n8n",
|
||||||
|
"matrixdotorg/synapse": "element-hq/synapse",
|
||||||
"headscale/headscale": "juanfont/headscale",
|
"headscale/headscale": "juanfont/headscale",
|
||||||
"technitium/dns-server": "TechnitiumSoftware/DnsServer",
|
"technitium/dns-server": "TechnitiumSoftware/DnsServer",
|
||||||
"ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx",
|
"ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx",
|
||||||
|
|
@ -81,6 +82,7 @@
|
||||||
"dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true },
|
"dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true },
|
||||||
"health": { "type": "postgresql", "db_name": "health", "shared": true },
|
"health": { "type": "postgresql", "db_name": "health", "shared": true },
|
||||||
"linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true },
|
"linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true },
|
||||||
|
"matrix": { "type": "postgresql", "db_name": "matrix", "shared": true },
|
||||||
"n8n": { "type": "postgresql", "db_name": "n8n", "shared": true },
|
"n8n": { "type": "postgresql", "db_name": "n8n", "shared": true },
|
||||||
"netbox": { "type": "postgresql", "db_name": "netbox", "shared": true },
|
"netbox": { "type": "postgresql", "db_name": "netbox", "shared": true },
|
||||||
"rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true },
|
"rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true },
|
||||||
|
|
|
||||||
|
|
@ -177,33 +177,6 @@ Tell the user to share these onboarding instructions with the new user:
|
||||||
- K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
|
- K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
|
||||||
- README: `https://github.com/ViktorBarzin/infra#new-user-onboarding`
|
- README: `https://github.com/ViktorBarzin/infra#new-user-onboarding`
|
||||||
|
|
||||||
**Web dashboard access** (auto-login, no token paste): the `rbac` stack
|
|
||||||
auto-creates a `dashboard-<user>` SA + token for every namespace-owner
|
|
||||||
(`dashboard-sa.tf`), and the **k8s-dashboard** stack's token-injector maps the
|
|
||||||
user's Authentik identity → that token (`dashboard_injector.tf`, auto-derived
|
|
||||||
from `k8s_users`). The new user just logs into `https://k8s.viktorbarzin.me` and
|
|
||||||
lands in the dashboard scoped to their namespace (`admin` on their namespace +
|
|
||||||
read-only on the namespace list & nodes for nav — no cross-tenant resource reads).
|
|
||||||
|
|
||||||
> **Apply order for a new namespace-owner:** after the vault/rbac/woodpecker
|
|
||||||
> applies above, ALSO `cd stacks/k8s-dashboard && ../../scripts/tg apply` so the
|
|
||||||
> injector map picks up the new user. (Manual token fallback:
|
|
||||||
> `kubectl -n NAMESPACE get secret dashboard-USERNAME-token -o jsonpath='{.data.token}' | base64 -d`.)
|
|
||||||
> Seamless OIDC SSO is built but blocked — see
|
|
||||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12.
|
|
||||||
|
|
||||||
> **Auto-login works only for the user's `k8s_users` HOME namespace.** The
|
|
||||||
> dashboard injects the user's `dashboard-<user>` SA token, which the `rbac`
|
|
||||||
> stack binds to `admin` on their home namespace only. If their workload lives
|
|
||||||
> in a DIFFERENT / pre-existing namespace (e.g. gheorghe's app is in `novelapp`,
|
|
||||||
> not his home `vabbit81`), that namespace's stack must ALSO grant their
|
|
||||||
> **dashboard SA** — `kind: ServiceAccount, name: dashboard-<user>, namespace:
|
|
||||||
> <home-ns>` — not just their OIDC `User` email (the dashboard uses the SA, and
|
|
||||||
> apiserver OIDC is blocked). See `stacks/novelapp/main.tf` `novelapp_owner_vabbit81`
|
|
||||||
> for the pattern (two subjects: User + SA). Best practice: set the user's
|
|
||||||
> `k8s_users` namespace to where their workload actually runs, so the home-ns
|
|
||||||
> auto-path covers them with no extra binding.
|
|
||||||
|
|
||||||
The user can decrypt their stack's state with:
|
The user can decrypt their stack's state with:
|
||||||
```bash
|
```bash
|
||||||
vault login -method=oidc # authenticates via Authentik SSO
|
vault login -method=oidc # authenticates via Authentik SSO
|
||||||
|
|
|
||||||
102
.claude/skills/archived/setup-remote-executor.md
Normal file
|
|
@ -0,0 +1,102 @@
|
||||||
|
# Setup Shared Remote Executor
|
||||||
|
|
||||||
|
Skill for setting up Claude Code's shared remote executor in new projects.
|
||||||
|
|
||||||
|
## When to Use
|
||||||
|
- When adding Claude Code support to a new project
|
||||||
|
- When the user says "set up remote executor for this project"
|
||||||
|
- When working on a new project that needs remote command execution
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
- Shared executor already deployed at `~/.claude/` on wizard@10.0.10.10
|
||||||
|
- Project accessible via NFS from both macOS and the remote VM
|
||||||
|
|
||||||
|
## Setup Steps
|
||||||
|
|
||||||
|
### 1. Create .claude Directory
|
||||||
|
```bash
|
||||||
|
mkdir -p .claude/sessions
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Create session-exec.sh Wrapper
|
||||||
|
Create `.claude/session-exec.sh` with the following content (adjust PROJECT_ROOT):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Project-Local Session Helper - Wrapper for shared executor
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
SHARED_SESSION_EXEC="/home/wizard/.claude/session-exec.sh"
|
||||||
|
PROJECT_ROOT="/home/wizard/path/to/project" # UPDATE THIS
|
||||||
|
|
||||||
|
if [ -f "$SHARED_SESSION_EXEC" ]; then
|
||||||
|
if [ "${1:-}" = "create" ] || [ -z "${1:-}" ]; then
|
||||||
|
"$SHARED_SESSION_EXEC" create "$PROJECT_ROOT"
|
||||||
|
else
|
||||||
|
"$SHARED_SESSION_EXEC" "$@"
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
SESSIONS_DIR="$SCRIPT_DIR/sessions"
|
||||||
|
SESSION_ID="${1:-$(date +%s)-$$-$RANDOM}"
|
||||||
|
ACTION="${2:-create}"
|
||||||
|
SESSION_DIR="$SESSIONS_DIR/$SESSION_ID"
|
||||||
|
|
||||||
|
case "$ACTION" in
|
||||||
|
create|init|"")
|
||||||
|
mkdir -p "$SESSION_DIR"
|
||||||
|
echo "ready" > "$SESSION_DIR/cmd_status.txt"
|
||||||
|
echo "$PROJECT_ROOT" > "$SESSION_DIR/workdir.txt"
|
||||||
|
> "$SESSION_DIR/cmd_input.txt"
|
||||||
|
> "$SESSION_DIR/cmd_output.txt"
|
||||||
|
echo "$SESSION_ID"
|
||||||
|
;;
|
||||||
|
cleanup|remove|delete)
|
||||||
|
[ -d "$SESSION_DIR" ] && rm -rf "$SESSION_DIR"
|
||||||
|
;;
|
||||||
|
status)
|
||||||
|
[ -d "$SESSION_DIR" ] && cat "$SESSION_DIR/cmd_status.txt"
|
||||||
|
;;
|
||||||
|
list)
|
||||||
|
[ -d "$SESSIONS_DIR" ] && ls -1 "$SESSIONS_DIR" 2>/dev/null
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
fi
|
||||||
|
```
|
||||||
|
|
||||||
|
Make executable: `chmod +x .claude/session-exec.sh`
|
||||||
|
|
||||||
|
### 3. Link Sessions Directory (on remote VM)
|
||||||
|
Run on the remote VM to add project sessions to the shared executor:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Option A: Symlink project sessions (if using project-local sessions)
|
||||||
|
ln -sfn /path/to/project/.claude/sessions ~/.claude/sessions
|
||||||
|
|
||||||
|
# Option B: Use shared sessions (all projects share one directory)
|
||||||
|
# Just ensure ~/.claude/sessions exists
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Create CLAUDE.md
|
||||||
|
Add execution instructions to `.claude/CLAUDE.md`:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Remote Command Execution
|
||||||
|
Uses shared executor at `~/.claude/` on wizard@10.0.10.10.
|
||||||
|
|
||||||
|
### Usage
|
||||||
|
\```bash
|
||||||
|
SESSION_ID=$(.claude/session-exec.sh)
|
||||||
|
echo "command" > .claude/sessions/$SESSION_ID/cmd_input.txt
|
||||||
|
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt
|
||||||
|
cat .claude/sessions/$SESSION_ID/cmd_output.txt
|
||||||
|
\```
|
||||||
|
|
||||||
|
Start executor: `~/.claude/remote-executor.sh` (on remote VM)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Shared Executor Location
|
||||||
|
- Scripts: `~/.claude/remote-executor.sh`, `~/.claude/session-exec.sh`
|
||||||
|
- Sessions: `~/.claude/sessions/`
|
||||||
|
- Remote VM: wizard@10.0.10.10
|
||||||
|
|
@ -7,448 +7,339 @@ description: |
|
||||||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||||
(4) User mentions "health check", "cluster status", "cluster health",
|
(4) User mentions "health check", "cluster status", "cluster health",
|
||||||
(5) User asks "is everything running" or "any problems".
|
(5) User asks "is everything running" or "any problems".
|
||||||
Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs,
|
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
|
||||||
backups, external reachability, PVE host thermals + load, HA Sofia
|
and stuck CrashLoopBackOff pods.
|
||||||
status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
|
|
||||||
with safe auto-fix for evicted pods.
|
|
||||||
author: Claude Code
|
author: Claude Code
|
||||||
version: 2.0.0
|
version: 1.0.0
|
||||||
date: 2026-04-19
|
date: 2026-02-21
|
||||||
---
|
---
|
||||||
|
|
||||||
# Cluster Health Check
|
# Cluster Health Check
|
||||||
|
|
||||||
## MANDATORY: Run the script first
|
## Overview
|
||||||
|
|
||||||
When this skill is invoked, your **first action** must be to run the
|
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
|
||||||
cluster health check script and reason over its output before doing
|
- **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
|
||||||
anything else. Do not improvise individual `kubectl` calls — the
|
- **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
|
||||||
script is the authoritative surface.
|
- **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
|
||||||
|
- **Exit code**: 0 = healthy, 1 = issues found
|
||||||
|
|
||||||
|
## Quick Check
|
||||||
|
|
||||||
|
Run the health check interactively:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd /home/wizard/code
|
# Report only, no Slack notification
|
||||||
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
|
bash /workspace/infra/.claude/cluster-health.sh --no-slack
|
||||||
|
|
||||||
|
# Full run with Slack notification
|
||||||
|
bash /workspace/infra/.claude/cluster-health.sh
|
||||||
|
|
||||||
|
# Report only, no auto-fix and no Slack
|
||||||
|
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
|
||||||
```
|
```
|
||||||
|
|
||||||
If the session is rooted elsewhere, fall back to the absolute path:
|
## What It Checks
|
||||||
|
|
||||||
```bash
|
| # | Check | Auto-Fix | Alerts |
|
||||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
|
|---|-------|----------|--------|
|
||||||
```
|
| 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
|
||||||
|
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
|
||||||
Then:
|
| 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
|
||||||
|
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
|
||||||
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
|
| 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
|
||||||
2. Iterate every FAIL and WARN check, describe what tripped, and propose
|
| 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
|
||||||
the remediation path (use the recipes below).
|
| 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
|
||||||
3. Only reach for ad-hoc `kubectl` commands when investigating a
|
| 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |
|
||||||
specific failure beyond what the script reported.
|
|
||||||
|
|
||||||
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
|
|
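A minimal sketch of how a wrapper might branch on those exit codes. The `verdict_for_exit_code` helper is hypothetical (it is not part of `cluster_healthcheck.sh`), and treating any other code as "unknown" is an assumption:

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the health-check exit code to a verdict label.
verdict_for_exit_code() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    *) echo "unknown" ;;  # assumption: any other code means the script itself broke
  esac
}

# Usage (commented out so the sketch runs without a cluster):
# bash infra/scripts/cluster_healthcheck.sh --json > /tmp/cluster-health.json
# verdict_for_exit_code "$?"
```

Capturing `$?` immediately after the script call matters; any command in between would overwrite it.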
||||||
|
|
||||||
## Quick flags
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Human-readable report (default), no auto-fix
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh
|
|
||||||
|
|
||||||
# Machine-readable JSON summary
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh --json
|
|
||||||
|
|
||||||
# Only show WARN + FAIL (suppress PASS noise)
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh --quiet
|
|
||||||
|
|
||||||
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh --fix
|
|
||||||
|
|
||||||
# Combined: quiet JSON without auto-fix
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
|
||||||
|
|
||||||
# Custom kubeconfig
|
|
||||||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
|
||||||
```
|
|
||||||
|
|
||||||
## What It Checks (47 checks)
|
|
||||||
|
|
||||||
| # | Check | Notes |
|
|
||||||
|---|-------|-------|
|
|
||||||
| 1 | Node Status | NotReady nodes, version drift |
|
|
||||||
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
|
|
||||||
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
|
|
||||||
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
|
|
||||||
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
|
|
||||||
| 6 | DaemonSets | desired == ready |
|
|
||||||
| 7 | Deployments | ready == desired replicas |
|
|
||||||
| 8 | PVC Status | all Bound |
|
|
||||||
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
|
|
||||||
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
|
|
||||||
| 11 | CrowdSec Agents | all pods Running |
|
|
||||||
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
|
|
||||||
| 13 | Prometheus Alerts | count of firing alerts |
|
|
||||||
| 14 | Uptime Kuma Monitors | internal + external monitors up |
|
|
||||||
| 15 | ResourceQuota Pressure | any quota >80% used |
|
|
||||||
| 16 | StatefulSets | ready == desired |
|
|
||||||
| 17 | Node Disk Usage | ephemeral-storage <80% |
|
|
||||||
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
|
|
||||||
| 19 | Kyverno Policy Engine | all pods Running |
|
|
||||||
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
|
|
||||||
| 21 | DNS Resolution | Technitium resolves internal + external |
|
|
||||||
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
|
|
||||||
| 23 | GPU Health | nvidia namespace + device-plugin Running |
|
|
||||||
| 24 | Cloudflare Tunnel | pods Running |
|
|
||||||
| 25 | Resource Usage | node CPU/mem headroom |
|
|
||||||
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
|
|
||||||
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
|
|
||||||
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
|
|
||||||
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
|
|
||||||
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
|
|
||||||
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
|
|
||||||
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
|
|
||||||
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
|
|
||||||
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
|
|
||||||
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
|
|
||||||
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
|
|
||||||
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
|
|
||||||
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
|
|
||||||
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
|
|
||||||
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
|
|
||||||
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
|
|
||||||
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
|
|
||||||
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
|
|
||||||
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5-min load <30, WARN 30-37, FAIL ≥38 (host has 44 threads) |
|
|
||||||
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
|
|
||||||
| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache → check `clip-index-prewarm` CronJob |
|
|
||||||
| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares the real virtio-scsi CSI disks in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments that k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS when reconciled; WARN when drift >0 or 20-24 real disks; FAIL at ≥25 real disks (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set <vmid> --delete scsiN` (frees slot, retains LV) |
|
|
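The PASS/WARN/FAIL bands for checks 43 and 44 can be sketched as two small classifiers. Both function names are hypothetical, the inputs are assumed to be whole degrees Celsius and the 5-minute load average, and truncating the load's fractional part is an assumption about how values between 37 and 38 are treated:

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the thresholds in rows 43 and 44.

classify_temp() {
  local t=$1                       # package temperature in whole degrees C
  if   [ "$t" -ge 83 ]; then echo FAIL
  elif [ "$t" -ge 65 ]; then echo WARN
  else echo PASS
  fi
}

classify_load() {
  # Assumption: drop the fractional part, so 37.90 still counts as WARN.
  local l=${1%%.*}
  l=${l:-0}
  if   [ "$l" -ge 38 ]; then echo FAIL
  elif [ "$l" -ge 30 ]; then echo WARN
  else echo PASS
  fi
}

classify_temp 64      # PASS
classify_temp 70      # WARN
classify_load 38.20   # FAIL
```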
||||||
|
|
||||||
## Safe Auto-Fix Rules
|
## Safe Auto-Fix Rules
|
||||||
|
|
||||||
`--fix` only performs operations that are genuinely reversible and
|
### Safe to auto-fix (the script does these automatically)
|
||||||
observable. Nothing here rewrites Terraform state or mutates the cluster
|
|
||||||
beyond "delete pod".
|
|
||||||
|
|
||||||
### Done automatically by `--fix`
|
1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:
|
||||||
|
```bash
|
||||||
|
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||||
|
```
|
||||||
|
|
||||||
- **Evicted / Failed pods** — delete them; they are already terminated, and the owning controller creates any replacement.
|
2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:
|
||||||
```bash
|
```bash
|
||||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
kubectl delete pod -n <namespace> <pod-name> --grace-period=0
|
||||||
```
|
```
|
||||||
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
|
|
||||||
backoff timer.
|
|
||||||
|
|
||||||
### NEVER auto-fix (requires human investigation)
|
### NEVER auto-fix (requires human investigation)
|
||||||
|
|
||||||
- NotReady nodes
|
- **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
|
||||||
- MemoryPressure / DiskPressure / PIDPressure
|
- **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
|
||||||
- ImagePullBackOff (usually a bad tag / registry credential)
|
- **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
|
||||||
- Deployment ready-replica mismatch
|
- **Failed deployments** — Could be resource limits, bad config, missing secrets
|
||||||
- Pending PVCs
|
- **Pending PVCs** — Usually NFS export missing or storage class issue
|
||||||
- Node CPU/memory >90%
|
- **Resource pressure >90%** — Need to identify which pods are consuming resources
|
||||||
- CronJob failures
|
- **CronJob failures** — Need to check job logs to understand why it failed
|
||||||
- DaemonSet desired != ready
|
- **DaemonSet issues** — Could be node taints, resource limits, or image issues
|
||||||
- Vault sealed
|
|
||||||
- ClusterSecretStore not Ready
|
|
||||||
- cert-manager Certificate failures
|
|
||||||
- Backup freshness regressions
|
|
||||||
- Any external-reachability failure
|
|
||||||
|
|
||||||
## Deep-investigation recipes per failure mode
|
## Deep Investigation
|
||||||
|
|
||||||
### Node Issues (checks 1, 3, 17, 25)
|
When the health check reports issues, use these commands to investigate further.
|
||||||
|
|
||||||
|
### Node Issues
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl describe node <node>
|
# Describe the problematic node (events, conditions, capacity)
|
||||||
|
kubectl describe node <node-name>
|
||||||
|
|
||||||
|
# Check resource usage across all nodes
|
||||||
kubectl top nodes
|
kubectl top nodes
|
||||||
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
|
|
||||||
# SSH to the node
|
# Check recent events on a specific node
|
||||||
ssh root@10.0.20.10X
|
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
|
||||||
|
|
||||||
|
# SSH to the node for direct inspection
|
||||||
|
ssh root@<node-ip>
|
||||||
systemctl status kubelet
|
systemctl status kubelet
|
||||||
journalctl -u kubelet --since "30 minutes ago" | tail -100
|
journalctl -u kubelet --since "30 minutes ago" | tail -100
|
||||||
df -h ; free -h
|
df -h
|
||||||
|
free -h
|
||||||
```
|
```
|
||||||
|
|
||||||
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
|
### Pod Issues
|
||||||
`.103` node3, `.104` node4.
|
|
||||||
|
|
||||||
### Pod Issues (checks 4, 5, 11, 19)
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl describe pod -n <ns> <pod>
|
# Describe the pod (events, conditions, container statuses)
|
||||||
kubectl logs -n <ns> <pod> --tail=200
|
kubectl describe pod -n <namespace> <pod-name>
|
||||||
kubectl logs -n <ns> <pod> --previous --tail=200
|
|
||||||
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
|
# Check current logs
|
||||||
|
kubectl logs -n <namespace> <pod-name> --tail=100
|
||||||
|
|
||||||
|
# Check logs from the previous crashed container
|
||||||
|
kubectl logs -n <namespace> <pod-name> --previous --tail=100
|
||||||
|
|
||||||
|
# Check events in the namespace
|
||||||
|
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
|
||||||
|
|
||||||
|
# Check all pods in a namespace
|
||||||
|
kubectl get pods -n <namespace> -o wide
|
||||||
```
|
```
|
||||||
|
|
||||||
Common failure causes: OOMKilled (raise mem limit in Terraform), bad
|
### Deployment Issues
|
||||||
config / missing env var, DB connection failure (check `dbaas` pods),
|
|
||||||
NFS mount failure (`showmount -e 192.168.1.127`), stale
|
|
||||||
imagePullSecret.
|
|
||||||
|
|
||||||
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl describe deployment -n <ns> <name>
|
# Describe the deployment (strategy, conditions, events)
|
||||||
kubectl rollout status deployment -n <ns> <name>
|
kubectl describe deployment -n <namespace> <deployment-name>
|
||||||
kubectl rollout history deployment -n <ns> <name>
|
|
||||||
kubectl get rs -n <ns> -l app=<app>
|
# Check rollout status
|
||||||
|
kubectl rollout status deployment -n <namespace> <deployment-name>
|
||||||
|
|
||||||
|
# Check rollout history
|
||||||
|
kubectl rollout history deployment -n <namespace> <deployment-name>
|
||||||
|
|
||||||
|
# Check the replicaset
|
||||||
|
kubectl get rs -n <namespace> -l app=<app-label>
|
||||||
```
|
```
|
||||||
|
|
||||||
### PVC (check 8)
|
### PVC Issues
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl describe pvc -n <ns> <pvc>
|
# Describe the PVC (events, status, storage class)
|
||||||
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
|
kubectl describe pvc -n <namespace> <pvc-name>
|
||||||
kubectl get pv | grep <pvc>
|
|
||||||
showmount -e 192.168.1.127
|
# Check PVs
|
||||||
|
kubectl get pv
|
||||||
|
|
||||||
|
# Check events related to PVCs
|
||||||
|
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
|
||||||
|
|
||||||
|
# Verify NFS export exists
|
||||||
|
showmount -e 10.0.10.15 | grep <service-name>
|
||||||
```
|
```
|
||||||
|
|
||||||
### cert-manager (checks 31, 32, 33)
|
### Resource Pressure
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl get certificate -A
|
# Top nodes (CPU and memory usage)
|
||||||
kubectl describe certificate -n <ns> <name>
|
|
||||||
kubectl get certificaterequest -A
|
|
||||||
kubectl describe certificaterequest -n <ns> <name>
|
|
||||||
kubectl logs -n cert-manager deploy/cert-manager | tail -50
|
|
||||||
```
|
|
||||||
|
|
||||||
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
|
|
||||||
DNS provider secret, rate-limit from Let's Encrypt.
|
|
||||||
|
|
||||||
### Backups (checks 34, 35, 36)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Per-DB dumps (inside the DB pod)
|
|
||||||
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
|
|
||||||
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
|
|
||||||
|
|
||||||
# Pushgateway metrics
|
|
||||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
|
||||||
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
|
|
||||||
grep backup_last_success_timestamp
|
|
||||||
|
|
||||||
# LVM snapshots on PVE host
|
|
||||||
ssh -o BatchMode=yes root@192.168.1.127 \
|
|
||||||
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
|
|
||||||
```
|
|
||||||
|
|
||||||
If offsite sync is stale, the common cause is the
|
|
||||||
`offsite-sync-backup.service` systemd unit on the PVE host failing.
|
|
||||||
`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
|
|
||||||
|
|
||||||
### Monitoring stack (checks 37, 38, 39)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Prometheus
|
|
||||||
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
|
|
||||||
kubectl logs -n monitoring deploy/prometheus-server --tail=100
|
|
||||||
|
|
||||||
# Alertmanager
|
|
||||||
kubectl get pods -n monitoring | grep alertmanager
|
|
||||||
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
|
|
||||||
|
|
||||||
# Vault
|
|
||||||
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
|
|
||||||
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
|
|
||||||
|
|
||||||
# ClusterSecretStore
|
|
||||||
kubectl get clustersecretstore
|
|
||||||
kubectl describe clustersecretstore vault-kv vault-database
|
|
||||||
kubectl logs -n external-secrets deploy/external-secrets --tail=100
|
|
||||||
```
|
|
||||||
|
|
||||||
### External reachability (checks 40, 41, 42)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Cloudflared
|
|
||||||
kubectl get pods -n cloudflared
|
|
||||||
kubectl logs -n cloudflared -l app=cloudflared --tail=100
|
|
||||||
|
|
||||||
# Authentik (Helm chart names the deployment goauthentik-server)
|
|
||||||
kubectl get deployment -n authentik goauthentik-server
|
|
||||||
kubectl logs -n authentik deploy/goauthentik-server --tail=100
|
|
||||||
|
|
||||||
# ExternalAccessDivergence alert
|
|
||||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
|
||||||
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
|
|
||||||
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
|
|
||||||
|
|
||||||
# Traefik 5xx — find the hot service
|
|
||||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
|
||||||
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
|
|
||||||
| python3 -m json.tool
|
|
||||||
```
|
|
||||||
|
|
||||||
### OOMKilled remediation
|
|
||||||
|
|
||||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
|
|
||||||
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
|
|
||||||
`resources.limits.memory`.
|
|
||||||
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
|
|
||||||
`terraform apply -target=module.<service>` as appropriate.
|
|
||||||
|
|
||||||
### ImagePullBackOff remediation
|
|
||||||
|
|
||||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
|
|
||||||
2. Verify tag exists on the source registry.
|
|
||||||
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
|
|
||||||
4. Update the image tag in Terraform + re-apply.
|
|
||||||
|
|
||||||
### Persistent CrashLoopBackOff after auto-fix
|
|
||||||
|
|
||||||
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
|
|
||||||
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
|
|
||||||
- `OOMKilled` → raise memory limit
|
|
||||||
- Exit code 137 → SIGKILL (128 + 9): OOM killer or failed liveness probe
|
|
||||||
- Exit code 143 → SIGTERM (128 + 15): graceful shutdown failed
|
|
||||||
3. Cross-check dbaas + NFS + secrets are healthy.
|
|
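Exit codes above 128 conventionally encode `128 + signal number`, which is where 137 and 143 come from. A small sketch of that arithmetic (the helper name is hypothetical):

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode an exit code above 128 into its signal number.
signal_for_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0   # not a signal-induced exit
  fi
}

signal_for_exit_code 137   # prints 9  (SIGKILL)
signal_for_exit_code 143   # prints 15 (SIGTERM)
```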
||||||
|
|
||||||
## Performance forensics — top consumers + optimization hints
|
|
||||||
|
|
||||||
When the cluster is healthy (script returns 0) but the host is hot or load
|
|
||||||
is elevated, switch from "what broke?" to "what's expensive?". Run these
|
|
||||||
in order; stop as soon as the root cause is obvious.
|
|
||||||
|
|
||||||
### Step 1 — Snapshot top consumers cluster-wide
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Top 15 pods by current CPU
|
|
||||||
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
|
|
||||||
|
|
||||||
# Per-node CPU + memory usage (all nodes, unsorted)
|
|
||||||
kubectl top nodes
|
kubectl top nodes
|
||||||
|
|
||||||
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
|
# Top pods sorted by memory (cluster-wide)
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
kubectl top pods -A --sort-by=memory | head -20
|
||||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
|
|
||||||
| python3 -m json.tool | head -80
|
# Top pods sorted by CPU (cluster-wide)
|
||||||
|
kubectl top pods -A --sort-by=cpu | head -20
|
||||||
|
|
||||||
|
# Check resource requests/limits in a namespace
|
||||||
|
kubectl describe resourcequota -n <namespace>
|
||||||
|
kubectl describe limitrange -n <namespace>
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 2 — For each suspect pod, get the WHY
|
## Common Remediation
|
||||||
|
|
||||||
For every pod in the top-N, gather these BEFORE proposing a fix:
|
### Persistent CrashLoopBackOff
|
||||||
|
|
||||||
```bash
|
A pod keeps crashing even after the auto-fix deletes it.
|
||||||
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
|
|
||||||
|
|
||||||
# What it does (image + args)
|
1. **Check logs from the crashed container**:
|
||||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
|
```bash
|
||||||
|
kubectl logs -n <namespace> <pod-name> --previous --tail=200
|
||||||
|
```
|
||||||
|
|
||||||
# Resource limits + current usage
|
2. **Check the pod description for clues**:
|
||||||
kubectl -n $NS top pod $POD --containers
|
```bash
|
||||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
|
kubectl describe pod -n <namespace> <pod-name>
|
||||||
|
```
|
||||||
|
Look for:
|
||||||
|
- `OOMKilled` in Last State — the container ran out of memory
|
||||||
|
- `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
|
||||||
|
- `Error` with exit code 137 — killed by OOM killer or liveness probe
|
||||||
|
- `Error` with exit code 143 — SIGTERM (graceful shutdown failure)
|
||||||
|
|
||||||
# Recent logs filtered for reconcile loops, watch storms, slow queries
|
3. **Common causes**:
|
||||||
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
|
- **OOMKilled**: Increase memory limits in Terraform (see below)
|
||||||
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
|
- **Bad config**: Check environment variables, secrets, config maps
|
||||||
|
- **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
|
||||||
|
- **NFS mount failure**: Verify NFS export exists (`showmount -e 10.0.10.15`)
|
||||||
|
- **Missing secret**: Check if TLS secret or other secrets exist in the namespace
|
||||||
|
|
||||||
# Restart count + recent OOM
|
### OOMKilled
|
||||||
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
|
|
||||||
|
|
||||||
# Self-exported metrics (for apps that publish on /metrics)
|
The container was killed because it exceeded its memory limit.
|
||||||
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
|
1. **Check current limits**:
|
||||||
|
```bash
|
||||||
|
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
|
||||||
|
```
|
||||||
|
|
||||||
```bash
|
2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:
|
||||||
# Top request producers by verb+resource (last 30 min)
|
```hcl
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
resources {
|
||||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
limits = {
|
||||||
| python3 -m json.tool
|
memory = "2Gi" # Increase from current value
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
# Top user agents (which clients are hammering)
|
3. **Apply the change**:
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
```bash
|
||||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
cd /workspace/infra
|
||||||
| python3 -m json.tool
|
terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
|
||||||
|
```
|
||||||
|
|
||||||
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
|
### ImagePullBackOff
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
|
||||||
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
|
|
||||||
| python3 -m json.tool
|
|
||||||
|
|
||||||
# etcd WAL fsync rate (a proxy for write rate; this query does not return DB size)
|
The container image cannot be pulled.
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
|
||||||
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
|
|
||||||
| python3 -m json.tool
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4 — PVE host specific deep-dive (when temp / load is high)
|
1. **Check the exact error**:
|
||||||
|
```bash
|
||||||
|
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
|
||||||
|
```
|
||||||
|
|
||||||
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
|
2. **Common causes**:
|
||||||
thresholds — that's the first stop. When those WARN or FAIL, the
|
- **Wrong image tag**: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
|
||||||
follow-up commands below trace which VM / process is the source:
|
- **Private registry without credentials**: Check if imagePullSecrets are configured
|
||||||
|
- **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry
|
||||||
|
```bash
|
||||||
|
# Check pull-through cache ports:
|
||||||
|
# 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
|
||||||
|
curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
|
||||||
|
```
|
||||||
|
- **Registry rate limit**: Docker Hub free tier has pull limits; pull-through cache helps avoid this
|
||||||
|
|
||||||
```bash
|
3. **Fix**: Update the image tag in the service's Terraform module and re-apply.
|
||||||
# Per-core temps (broader than the package summary in check 43)
|
|
||||||
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
|
|
||||||
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
|
|
||||||
val=$(cat "$f"); echo " $label: $((val/1000))°C"
|
|
||||||
done'
|
|
||||||
|
|
||||||
# Per-VM CPU (each VM = one kvm process)
|
### Node NotReady
|
||||||
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
|
|
||||||
|
|
||||||
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
|
A node has gone NotReady.
|
||||||
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
|
|
||||||
|
|
||||||
# Stale snapshots (any '_pre-*' that survived past their rollback window)
|
1. **Check node conditions**:
|
||||||
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
|
```bash
|
||||||
```
|
kubectl describe node <node-name> | grep -A 20 "Conditions"
|
||||||
|
```
|
||||||
|
|
||||||
### Step 5 — Optimization decision
|
2. **SSH to the node and check kubelet**:
|
||||||
|
```bash
|
||||||
|
ssh root@<node-ip>
|
||||||
|
systemctl status kubelet
|
||||||
|
journalctl -u kubelet --since "10 minutes ago" | tail -50
|
||||||
|
```
|
||||||
|
|
||||||
For each consumer in the top-N, fill in a row:
|
3. **Check resources**:
|
||||||
|
```bash
|
||||||
|
# On the node
|
||||||
|
df -h # Disk space
|
||||||
|
free -h # Memory
|
||||||
|
top -bn1 # CPU/processes
|
||||||
|
```
|
||||||
|
|
||||||
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
|
4. **Node IPs** (for SSH):
|
||||||
|---|---|---|---|---|---|---|
|
- `10.0.20.100` — k8s-master
|
||||||
|
- `10.0.20.101` — k8s-node1 (GPU)
|
||||||
|
- `10.0.20.102` — k8s-node2
|
||||||
|
- `10.0.20.103` — k8s-node3
|
||||||
|
- `10.0.20.104` — k8s-node4
|
||||||
|
|
||||||
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
|
## Slack Webhook
|
||||||
|
|
||||||
### Common causes + tunables (catalogue)
|
The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:
|
||||||
|
- All clear: green checkmark with node/pod count
|
||||||
|
- Warnings only: warning icon with details
|
||||||
|
- Issues found: red alert icon with auto-fixes applied and remaining issues
|
||||||
|
|
||||||
| Symptom | Likely cause | Tunable |
|
The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.
|
||||||
|---|---|---|
|
|
||||||
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
|
|
||||||
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
|
|
||||||
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
|
|
||||||
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
|
|
||||||
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
|
|
||||||
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
|
|
||||||
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
|
|
||||||
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
|
|
||||||
|
|
||||||
### What NOT to touch
|
## Infrastructure
|
||||||
|
|
||||||
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
|
| Component | Path / Location |
|
||||||
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
|
|-----------|----------------|
|
||||||
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
|
||||||
|
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
|
||||||
|
| CronJob definition | Defined in the OpenClaw Terraform module |
|
||||||
|
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
|
||||||
|
| Infra repo (in pod) | `/workspace/infra` |
|
||||||
|
| kubectl (in pod) | `/tools/kubectl` |
|
||||||
|
| terraform (in pod) | `/tools/terraform` |
|
||||||
|
|
||||||
### Source-of-truth notes
|
## Auto-File Incidents for SEV1/SEV2
|
||||||
|
|
||||||
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
|
After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:
|
||||||
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
|
|
||||||
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
|
|
||||||
|
|
||||||
## Notes on the canonical / hardlink setup
|
### Severity Classification
|
||||||
|
- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
|
||||||
|
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
|
||||||
|
- **SEV3**: Warnings only, resource pressure <90%, cosmetic — do NOT auto-file
|
||||||
|
|
||||||
The authoritative copy of this SKILL.md lives at
|
### Workflow
|
||||||
`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
|
1. **Dedup check**: Before filing, query open incidents:
|
||||||
at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
|
```bash
|
||||||
points to the same inode so infra-rooted sessions also discover the
|
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
|
||||||
skill.
|
curl -s -H "Authorization: token $GITHUB_TOKEN" \
|
||||||
|
"https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
|
||||||
|
```
|
||||||
|
If an open issue already covers the same service/namespace, **skip filing**.
|
||||||
|
|
||||||
To verify the hardlink is intact:
|
2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required`:
|
||||||
|
- Title: `[AUTO] <Service/Namespace> — <brief symptom>`
|
||||||
|
- Body: full diagnostic dump (pod status, events, alerts, node state)
|
||||||
|
- The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
|
||||||
|
|
||||||
```bash
|
3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
|
||||||
stat -c '%i %n' \
|
```bash
|
||||||
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
# Comment and close
|
||||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
|
||||||
```
|
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
|
||||||
|
-d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
|
||||||
|
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
|
||||||
|
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
|
||||||
|
-d '{"state": "closed"}'
|
||||||
|
```
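The filing rule in the workflow above can be sketched as a small shell helper. This is an illustration under stated assumptions, not part of the health-check script: the function name `incident_labels` is hypothetical, and only the label names (`incident`, `sev1`/`sev2`, `postmortem-required`) and the "SEV3 is never auto-filed" rule come from the text.

```shell
# Sketch only: map a severity to the labels for an auto-filed issue.
# The function name is hypothetical; the labels are the ones listed in step 2.
incident_labels() {
  case "$1" in
    SEV1) echo "incident,sev1,postmortem-required" ;;
    SEV2) echo "incident,sev2,postmortem-required" ;;
    *)    return 1 ;;  # SEV3 and anything else: do not auto-file
  esac
}

# Example: decide whether a finding should be filed at all.
if labels=$(incident_labels "SEV2"); then
  echo "would file with labels: $labels"
fi
```

Returning a non-zero status for SEV3 lets the caller skip both the dedup query and the filing step with a single `if`.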
|
||||||
|
|
||||||
Both should print the same inode number. If they diverge (e.g. `git
|
## Post-Mortem Auto-Suggest
|
||||||
checkout` replaced the file rather than updating it), re-link:
|
|
||||||
|
|
||||||
```bash
|
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
|
||||||
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
|
||||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
|
||||||
```
|
|
||||||
|
This ensures incidents are documented while context is fresh.
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
|
||||||
|
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
|
||||||
|
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
|
||||||
|
4. All persistent changes must go through Terraform (the `.tf` files); never use `kubectl apply/edit/patch` for them
|
||||||
|
|
|
||||||
|
|
@ -11,8 +11,8 @@ description: |
|
||||||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||||
Always use Home Assistant for smart home control.
|
Always use Home Assistant for smart home control.
|
||||||
author: Claude Code
|
author: Claude Code
|
||||||
version: 2.1.0
|
version: 2.0.0
|
||||||
date: 2026-06-24
|
date: 2026-02-07
|
||||||
---
|
---
|
||||||
|
|
||||||
# Home Assistant Control
|
# Home Assistant Control
|
||||||
|
|
@ -44,12 +44,6 @@ There are **two** Home Assistant instances:
|
||||||
- Environment variables for each instance:
|
- Environment variables for each instance:
|
||||||
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
|
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
|
||||||
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
|
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
|
||||||
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
|
|
||||||
|
|
||||||
## homelab CLI (preferred — works from any directory)
|
|
||||||
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
|
|
||||||
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
|
|
||||||
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
|
|
||||||
|
|
||||||
## API Control
|
## API Control
|
||||||
|
|
||||||
|
|
@ -395,27 +389,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
|
||||||
## ha-london Knowledge Map
|
## ha-london Knowledge Map
|
||||||
|
|
||||||
### Overview
|
### Overview
|
||||||
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
|
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
||||||
- **Location**: London, UK
|
- **Location**: London, UK
|
||||||
- **Platform**: Raspberry Pi 4, HA OS
|
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
||||||
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
|
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||||
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
- **Config path**: `/config/` (requires `sudo` for file access)
|
||||||
- **Config path**: `/config/`
|
|
||||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||||
- **Zone**: London (home)
|
- **Zone**: London (home)
|
||||||
|
|
||||||
### Dashboards (redesigned 2026-06-24)
|
|
||||||
**Glossary** (HA terms — keep distinct):
|
|
||||||
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
|
|
||||||
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
|
|
||||||
- **Card** = a widget inside a view.
|
|
||||||
|
|
||||||
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
|
|
||||||
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
|
|
||||||
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
|
|
||||||
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
|
|
||||||
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
|
|
||||||
|
|
||||||
### Key Systems
|
### Key Systems
|
||||||
|
|
||||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||||
|
|
@ -437,15 +418,10 @@ Named plugs with power/energy tracking:
|
||||||
- PM1.0/2.5/4.0/10 particulate sensors
|
- PM1.0/2.5/4.0/10 particulate sensors
|
||||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||||
|
|
||||||
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
|
#### 3. Cowboy E-Bike
|
||||||
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
|
- `sensor.bike_state_of_charge`: Battery %
|
||||||
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
|
- `sensor.bike_total_distance`: Total km
|
||||||
- `sensor.classic_performance_remaining_range`: Range km
|
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
||||||
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
|
|
||||||
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
|
|
||||||
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
|
|
||||||
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
|
|
||||||
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
|
|
||||||
|
|
||||||
#### 4. Uptime Monitoring (UptimeRobot)
|
#### 4. Uptime Monitoring (UptimeRobot)
|
||||||
- `sensor.blog`: blog uptime
|
- `sensor.blog`: blog uptime
|
||||||
|
|
@ -464,17 +440,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
|
||||||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||||
|
|
||||||
### Custom Components (HACS integrations)
|
### Custom Components
|
||||||
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
|
- **cowboy**: Cowboy e-bike integration (HACS)
|
||||||
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
|
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
||||||
|
|
||||||
### HACS frontend cards (plugins)
|
|
||||||
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
|
|
||||||
|
|
||||||
### Integrations
|
### Integrations
|
||||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
|
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
||||||
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
|
|
||||||
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
|
|
||||||
|
|
||||||
### AI / Voice Assistants
|
### AI / Voice Assistants
|
||||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||||
|
|
@ -489,8 +460,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
|
||||||
- Anca arrival/departure notifications
|
- Anca arrival/departure notifications
|
||||||
- Night scene: turns off Livia + Michelle
|
- Night scene: turns off Livia + Michelle
|
||||||
|
|
||||||
### Platform (HAOS — ignore any legacy `docker run` snippet)
|
### Docker Setup
|
||||||
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
|
```bash
|
||||||
|
docker run -d --name homeassistant --privileged \
|
||||||
|
-e TZ=Europe/London \
|
||||||
|
-v /home/pi/docker/homeAssistant:/config \
|
||||||
|
-v /run/dbus:/run/dbus:ro \
|
||||||
|
--network=host --restart=unless-stopped \
|
||||||
|
homeassistant/home-assistant:2025.9
|
||||||
|
```
|
||||||
|
|
||||||
### SSH Access
|
### SSH Access
|
||||||
```bash
|
```bash
|
||||||
|
|
|
||||||
|
|
@ -1,227 +0,0 @@
|
||||||
---
|
|
||||||
name: upgrade-state
|
|
||||||
description: |
|
|
||||||
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
|
|
||||||
unattended-upgrades+kured, K8s components via the version-check chain).
|
|
||||||
Use when:
|
|
||||||
(1) User asks "/upgrade-state" or "are we current",
|
|
||||||
(2) User asks "what's pending upgrade" or "what's the upgrade state",
|
|
||||||
(3) User asks if Keel / kured / k8s-version-check is healthy,
|
|
||||||
(4) User asks about kept-back / held packages or pending reboots,
|
|
||||||
(5) Periodic survey before the next `k8s-version-check` daily run.
|
|
||||||
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
|
|
||||||
author: Claude Code
|
|
||||||
version: 1.0.0
|
|
||||||
date: 2026-05-18
|
|
||||||
---
|
|
||||||
|
|
||||||
# Upgrade-state
|
|
||||||
|
|
||||||
## MANDATORY: Run the script first
|
|
||||||
|
|
||||||
When this skill is invoked, your **first action** must be to run
|
|
||||||
`upgrade_state.sh` and reason over its output before doing anything
|
|
||||||
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
|
|
||||||
is the authoritative surface.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
For programmatic use:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
|
|
||||||
```
|
|
||||||
|
|
||||||
Then:
|
|
||||||
|
|
||||||
1. Report the rendered table verbatim — it answers the user's
|
|
||||||
"are we current" question in three lines.
|
|
||||||
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
|
|
||||||
underneath and propose a next action (links in the table below).
|
|
||||||
3. Only reach for ad-hoc commands when investigating beyond what the
|
|
||||||
script reported.
|
|
||||||
|
|
||||||
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
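A wrapper that branches on these exit codes might look like the sketch below. The helper name `describe_upgrade_state` is hypothetical; only the three codes and their meanings come from the line above, and any other code is treated as unknown.

```shell
# Sketch only: translate upgrade_state.sh's exit code into a short label.
# describe_upgrade_state is a hypothetical helper, not part of the script.
describe_upgrade_state() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "attention" ;;
    2) echo "stalled" ;;
    *) echo "unknown" ;;  # any code the skill does not document
  esac
}

# Usage (script path as given above):
#   bash /home/wizard/code/infra/scripts/upgrade_state.sh
#   describe_upgrade_state "$?"
```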
|
|
||||||
|
|
||||||
## What it covers (3 pipelines)
|
|
||||||
|
|
||||||
| Layer | What runs | Cadence | Data sources |
|
|
||||||
|---|---|---|---|
|
|
||||||
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
|
|
||||||
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
|
|
||||||
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
|
|
||||||
|
|
||||||
The K8s pipeline pushes a small set of gauges to the Prometheus
|
|
||||||
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
|
|
||||||
|
|
||||||
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
|
|
||||||
- `k8s_version_check_last_run_timestamp` — when detection last ran
|
|
||||||
- `k8s_upgrade_in_flight` — 0/1
|
|
||||||
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
|
|
||||||
|
|
||||||
`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running
|
|
||||||
>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally
|
|
||||||
failed — including a **preflight that aborted before `in_flight` was set**
|
|
||||||
(the gates exit pre-metric). The script raises `✗` for either, and reads the
|
|
||||||
Jobs directly, so it also catches a Failed preflight that left no metric.
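The stall condition described above (in flight, and running for more than 90 minutes) can be restated as a shell predicate. This is a sketch of the logic only: the real check is an alert rule plus the script, and the function name and argument order here are assumptions.

```shell
# Sketch only: the K8sUpgradeStalled condition as a shell predicate.
# Arguments (hypothetical order): in_flight (0/1), started timestamp, now,
# both timestamps in epoch seconds.
is_chain_stalled() {
  in_flight=$1
  started=$2
  now=$3
  # 90 minutes = 5400 seconds; stalled only while a chain is in flight.
  [ "$in_flight" -eq 1 ] && [ $(( now - started )) -gt 5400 ]
}

# Usage, with values read from the Pushgateway gauges listed above:
#   is_chain_stalled "$in_flight" "$started_ts" "$(date +%s)" && echo "stalled"
```

Note that this predicate, like the alert, cannot see a preflight that failed before `in_flight` was set, which is why the Jobs are also read directly.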
|
|
||||||
|
|
||||||
## Status-icon legend
|
|
||||||
|
|
||||||
| Icon | Meaning |
|
|
||||||
|---|---|
|
|
||||||
| `✓` | Healthy, fully current |
|
|
||||||
| `→` | Update available, not yet applied (K8s patch/minor) |
|
|
||||||
| `…` | In flight — chain currently running |
|
|
||||||
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
|
|
||||||
| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed |
|
|
||||||
|
|
||||||
## Drill-down — when a row trips, what to do
|
|
||||||
|
|
||||||
### Apps `⚠` — pending approvals or errors
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Read recent Keel log lines
|
|
||||||
kubectl -n keel logs deploy/keel --since=24h --tail=200
|
|
||||||
|
|
||||||
# What is Keel currently tracking?
|
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
|
||||||
wget -qO- 'http://localhost:9090/api/v1/query?query=count%20by%20(image)%20(registries_scanned_total)'
|
|
||||||
|
|
||||||
# Is the scrape live?
|
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
|
||||||
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
Common Keel errors:
|
|
||||||
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
|
|
||||||
- `registry authentication required` — bad imagePullSecret on the watched Deployment
|
|
||||||
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
|
|
||||||
|
|
||||||
### OS `⚠` — held packages with bumps
|
|
||||||
|
|
||||||
The script flags any package held via `apt-mark hold` that ALSO appears
|
|
||||||
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
|
|
||||||
owns those) and the kernel (kured handles the reboot half).
|
|
||||||
|
|
||||||
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
|
|
||||||
runc 1.1 → 1.4). These are held because they need cluster-wide
|
|
||||||
coordination, not silent in-release patching.
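The "held and also upgradable" check described above amounts to a filtered set intersection. The sketch below shows one way to compute it from the two command outputs. It is not the code in `upgrade_state.sh`: the function name is hypothetical, and the exclusion patterns (`kubeadm`, `kubelet`, `kubectl`, `linux-*`) are assumptions about which package names count as K8s components and kernel.

```shell
# Sketch only: list held packages that also have an upgrade available.
#   $1: output of `apt-mark showhold` (one package name per line)
#   $2: output of `apt list --upgradable` (lines like "pkg/suite version ...")
held_with_bumps() {
  held=$1
  # Keep only the package name before the first "/"; header lines have none.
  upgradable=$(printf '%s\n' "$2" | grep '/' | cut -d/ -f1 || true)
  for pkg in $held; do
    case "$pkg" in
      # Assumed exclusions: K8s components and kernel packages.
      kubeadm|kubelet|kubectl|linux-*) continue ;;
    esac
    if printf '%s\n' "$upgradable" | grep -Fxq -- "$pkg"; then
      echo "$pkg"
    fi
  done
}

# Usage on a node:
#   held_with_bumps "$(apt-mark showhold)" "$(apt list --upgradable 2>/dev/null)"
```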
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Inspect the situation on the flagged node
|
|
||||||
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
|
|
||||||
|
|
||||||
# Unhold + upgrade a specific package
|
|
||||||
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
|
|
||||||
```
|
|
||||||
|
|
||||||
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
|
|
||||||
|
|
||||||
### OS `⚠` — pending reboot
|
|
||||||
|
|
||||||
A node has `/var/run/reboot-required`. Kured will reboot it inside the
|
|
||||||
next 02:00-06:00 London window (any day of the week).
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Force a manual reboot inside the window (rare)
|
|
||||||
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
|
|
||||||
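ssh wizard@10.0.20.10X sudo systemctl reboot
# drain leaves the node cordoned; re-enable scheduling once it is back and Ready
kubectl uncordon k8s-nodeX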
|
|
||||||
```
|
|
||||||
|
|
||||||
### OS `✗` — kured not Running
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n kured get pods
|
|
||||||
kubectl -n kured logs daemonset/kured --tail=100
|
|
||||||
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
|
|
||||||
kubectl -n kured get pods -l name=kured-sentinel-gate
|
|
||||||
```
|
|
||||||
|
|
||||||
### K8s `→` — patch/minor available
|
|
||||||
|
|
||||||
Detection ran, target identified, chain NOT started. The chain spawns
|
|
||||||
on the same daily detection cycle — typically within ~24h of the
|
|
||||||
target first being detected.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Inspect Pushgateway state
|
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
|
||||||
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
|
|
||||||
|
|
||||||
# Trigger a manual run of the detection CronJob
|
|
||||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
|
|
||||||
```
|
|
||||||
|
|
||||||
### K8s `…` — in flight
|
|
||||||
|
|
||||||
The Job chain is running. Watch its progress:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
|
|
||||||
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
|
|
||||||
```
|
|
||||||
|
|
||||||
### K8s `✗ stalled` — `K8sUpgradeStalled` would fire
|
|
||||||
|
|
||||||
Chain in-flight >90m. The Job is most likely stuck on drain or a
|
|
||||||
pre-flight check.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n k8s-upgrade get jobs
|
|
||||||
kubectl -n k8s-upgrade describe job <stuck-job>
|
|
||||||
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
|
|
||||||
|
|
||||||
# If you need to clear the in-flight flag (after diagnosing):
|
|
||||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
|
|
||||||
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
|
|
||||||
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
|
|
||||||
--header='Content-Type: text/plain'"
|
|
||||||
```
|
|
||||||
|
|
||||||
### K8s `✗ chain failed` — a phase Job terminally failed
|
|
||||||
|
|
||||||
`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted
|
|
||||||
on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) —
|
|
||||||
these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and
|
|
||||||
the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge).
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n k8s-upgrade get jobs
|
|
||||||
kubectl -n k8s-upgrade describe job <failed-job> # check the Failed reason
|
|
||||||
# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
|
|
||||||
# them. Replay the gate instead — which critical alerts were firing at the
|
|
||||||
# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)
|
|
||||||
```
|
|
||||||
|
|
||||||
Recovery is now mostly automatic: the detection CronJob and `spawn_next`
|
|
||||||
re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a
|
|
||||||
transient gate clears within ~24h. To expedite, delete the Failed Job and
|
|
||||||
trigger detection:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n k8s-upgrade delete job <failed-job>
|
|
||||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
|
|
||||||
```
|
|
||||||
|
|
||||||
### K8s `✗ detection stale` — last detection >9 days
|
|
||||||
|
|
||||||
```bash
|
|
||||||
kubectl -n k8s-upgrade get cronjob k8s-version-check
|
|
||||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
|
|
||||||
```
|
|
||||||
|
|
||||||
If the CronJob hasn't fired on time, suspect:
|
|
||||||
- `suspend=true` on the CronJob (`var.enabled=false` in the
|
|
||||||
`k8s-version-upgrade` Terraform stack)
|
|
||||||
- Image-pull failure on the version-check pod
|
|
||||||
- Pushgateway scrape gone stale
|
|
||||||
|
|
||||||
## Companion command-line flags
|
|
||||||
|
|
||||||
```bash
|
|
||||||
bash infra/scripts/upgrade_state.sh # rendered table (default)
|
|
||||||
bash infra/scripts/upgrade_state.sh --json # machine output
|
|
||||||
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
|
|
||||||
```
|
|
||||||
|
|
@ -155,19 +155,3 @@ Common port is 80. Exceptions:
|
||||||
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
|
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
|
||||||
4. Homepage dashboard widget slug: `cluster-internal`
|
4. Homepage dashboard widget slug: `cluster-internal`
|
||||||
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
|
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
|
||||||
|
|
||||||
## Terraform-Managed Monitors
|
|
||||||
|
|
||||||
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
|
|
||||||
declarative monitor management in this stack:
|
|
||||||
|
|
||||||
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
|
|
||||||
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
|
|
||||||
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
|
|
||||||
- **Internal monitors (DBs, non-HTTP)** — declared in the
|
|
||||||
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
|
|
||||||
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
|
|
||||||
list (provide `name`, `type`, `database_connection_string`,
|
|
||||||
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
|
|
||||||
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
|
|
||||||
if missing, patches if drifted. Existing monitors keep their id and history.
|
|
||||||
|
|
|
||||||
36
.github/workflows/build-android-emulator.yml
vendored
|
|
@ -1,36 +0,0 @@
|
||||||
name: Build android-emulator
|
|
||||||
|
|
||||||
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
|
|
||||||
# Large image (Android SDK + emulator); on-demand workload (scaled 0). Rebuilds
|
|
||||||
# rare → dispatch + path trigger.
|
|
||||||
on:
|
|
||||||
push:
|
|
||||||
branches: [master]
|
|
||||||
paths:
|
|
||||||
- 'stacks/android-emulator/docker/**'
|
|
||||||
workflow_dispatch: {}
|
|
||||||
|
|
||||||
permissions:
|
|
||||||
contents: read
|
|
||||||
packages: write
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
runs-on: ubuntu-latest
|
|
||||||
steps:
|
|
||||||
- uses: actions/checkout@v4
|
|
||||||
- uses: docker/setup-buildx-action@v3
|
|
||||||
- uses: docker/login-action@v3
|
|
||||||
with:
|
|
||||||
registry: ghcr.io
|
|
||||||
username: ${{ github.actor }}
|
|
||||||
password: ${{ secrets.GITHUB_TOKEN }}
|
|
||||||
- uses: docker/build-push-action@v6
|
|
||||||
with:
|
|
||||||
context: stacks/android-emulator/docker
|
|
||||||
platforms: linux/amd64
|
|
||||||
provenance: false
|
|
||||||
push: true
|
|
||||||
tags: |
|
|
||||||
ghcr.io/viktorbarzin/android-emulator:latest
|
|
||||||
ghcr.io/viktorbarzin/android-emulator:${{ github.sha }}
|
|
||||||
|
|
@ -1,39 +0,0 @@
|
||||||
name: Build chrome-service-browser
|
|
||||||
|
|
||||||
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
|
|
||||||
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
|
|
||||||
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
|
|
||||||
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
|
|
||||||
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
|
|
||||||
# the pod pulls it without credentials.
|
|
||||||
on:
|
|
||||||
push:
|
|
||||||
branches: [master]
|
|
||||||
paths:
|
|
||||||
- 'stacks/chrome-service/files/chrome/**'
|
|
||||||
workflow_dispatch: {}
|
|
||||||
|
|
||||||
permissions:
|
|
||||||
contents: read
|
|
||||||
packages: write
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
runs-on: ubuntu-latest
|
|
||||||
steps:
|
|
||||||
- uses: actions/checkout@v4
|
|
||||||
- uses: docker/setup-buildx-action@v3
|
|
||||||
- uses: docker/login-action@v3
|
|
||||||
with:
|
|
||||||
registry: ghcr.io
|
|
||||||
username: ${{ github.actor }}
|
|
||||||
password: ${{ secrets.GITHUB_TOKEN }}
|
|
||||||
- uses: docker/build-push-action@v6
|
|
||||||
with:
|
|
||||||
context: stacks/chrome-service/files/chrome
|
|
||||||
platforms: linux/amd64
|
|
||||||
provenance: false
|
|
||||||
push: true
|
|
||||||
tags: |
|
|
||||||
ghcr.io/viktorbarzin/chrome-service-browser:latest
|
|
||||||
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
|
|
||||||
36
.github/workflows/build-chrome-service-novnc.yml
vendored
|
|
@ -1,36 +0,0 @@
|
||||||
name: Build chrome-service-novnc
|
|
||||||
|
|
||||||
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
|
|
||||||
# Source Dockerfile identical on both git remotes, so the github checkout builds
|
|
||||||
# the current image. Rebuilds are rare (stable noVNC proxy) → dispatch + path.
|
|
||||||
on:
|
|
||||||
push:
|
|
||||||
branches: [master]
|
|
||||||
paths:
|
|
||||||
- 'stacks/chrome-service/files/novnc/**'
|
|
||||||
workflow_dispatch: {}
|
|
||||||
|
|
||||||
permissions:
|
|
||||||
contents: read
|
|
||||||
packages: write
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
runs-on: ubuntu-latest
|
|
||||||
steps:
|
|
||||||
- uses: actions/checkout@v4
|
|
||||||
- uses: docker/setup-buildx-action@v3
|
|
||||||
- uses: docker/login-action@v3
|
|
||||||
with:
|
|
||||||
registry: ghcr.io
|
|
||||||
username: ${{ github.actor }}
|
|
||||||
password: ${{ secrets.GITHUB_TOKEN }}
|
|
||||||
- uses: docker/build-push-action@v6
|
|
||||||
with:
|
|
||||||
context: stacks/chrome-service/files/novnc
|
|
||||||
platforms: linux/amd64
|
|
||||||
provenance: false
|
|
||||||
push: true
|
|
||||||
tags: |
|
|
||||||
ghcr.io/viktorbarzin/chrome-service-novnc:latest
|
|
||||||
ghcr.io/viktorbarzin/chrome-service-novnc:${{ github.sha }}
|
|
||||||
41
.github/workflows/build-cli.yml
vendored
@@ -1,41 +0,0 @@
-name: Build infra CLI
-
-# ADR-0002: infra CLI built off-infra on GHA. Replaces the Woodpecker
-# build-cli.yml. Pushes to DockerHub (public distribution, kept) + ghcr.
-# Not a cluster workload — a distributed tool image.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'cli/**'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/login-action@v3
-        with:
-          username: ${{ secrets.DOCKERHUB_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: cli
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            viktorbarzin/infra:latest
-            ghcr.io/viktorbarzin/infra-cli:latest
-            ghcr.io/viktorbarzin/infra-cli:${{ github.sha }}
37
.github/workflows/build-infra-ci.yml
vendored
@@ -1,37 +0,0 @@
-name: Build infra-ci
-
-# ADR-0002: the infra CI toolbox image (terraform/terragrunt/sops/kubectl/vault)
-# built off-infra on GHA → ghcr (public). BOOTSTRAP-CRITICAL: .woodpecker/default.yml's
-# apply step runs in this image. The Woodpecker build-ci-image.yml is kept until a
-# ghcr-based apply is proven, then removed.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'ci/Dockerfile'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: ci
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/infra-ci:latest
-            ghcr.io/viktorbarzin/infra-ci:${{ github.sha }}
36
.github/workflows/build-k8s-portal.yml
vendored
@@ -1,36 +0,0 @@
-name: Build k8s-portal
-
-# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
-# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
-# the in-cluster .woodpecker/k8s-portal.yml build.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/k8s-portal/modules/k8s-portal/files/**'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/k8s-portal/modules/k8s-portal/files
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/k8s-portal:latest
-            ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}
18
.gitignore
vendored
@@ -65,11 +65,6 @@ state/infra/
 backend.tf
 providers.tf
 .terraform.lock.hcl
-cloudflare_provider.tf
-tiers.tf
-stacks/*/cloudflare_provider.tf
-stacks/*/tiers.tf
-stacks/*/terragrunt_rendered.json
 
 # Kubernetes config (sensitive)
 config
@@ -103,16 +98,3 @@ stacks/terminal/clipboard-upload/clipboard-upload
 # Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only)
 terraform.tfstate
 terraform.tfstate.backup
-
-# Per-feature git worktrees (worktree-first workflow — execution.md)
-.worktrees/
-
-# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
-# secrets; created by terraform state ops. The patterns above miss the timestamped form.
-terraform.tfstate.*.backup
-
-# Python test artifacts (pytest bytecode cache) — e.g. from
-# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
-__pycache__/
-*.pyc
-.pytest_cache/
@@ -1,11 +0,0 @@
-# git-crypt encrypts these at rest; the working-tree plaintext is local-only.
-# gitleaks scans the staged working-tree copy and can't see that they're
-# encrypted on disk in git, so allowlist by fingerprint.
-stacks/recruiter-responder/secrets/privkey.pem:private-key:1
-
-# False positives: the `curl-auth-user` rule flags `-u "admin:..."` in the
-# nextcloud-todos webhook-register provisioner, but the password is a shell
-# variable ($NC_ADMIN_APP_PW) resolved at apply time from Vault — no literal
-# secret is committed.
-stacks/nextcloud-todos/main.tf:curl-auth-user:383
-stacks/nextcloud-todos/main.tf:curl-auth-user:400
@@ -1,8 +0,0 @@
-{
-  "mcpServers": {
-    "ha": {
-      "type": "http",
-      "url": "${HA_MCP_URL}"
-    }
-  }
-}
@@ -1,31 +0,0 @@
-# Break-glass: save the ghcr infra-ci image to a tarball on the registry VM
-# (10.0.20.10) so it can be `docker load`-ed onto a node if ghcr is ever
-# unreachable during a recovery. infra-ci now builds on GHA → ghcr (ADR-0002),
-# which is external + node-cached, so this is a belt-and-braces DR artifact —
-# run MANUALLY after an infra-ci rebuild (or periodically). Pulls from ghcr
-# (public, no login). Recovery: docs/runbooks/forgejo-registry-breakglass.md.
-when:
-  - event: manual
-
-steps:
-  - name: breakglass-tarball
-    image: alpine:3.20
-    failure: ignore
-    environment:
-      REGISTRY_SSH_KEY:
-        from_secret: registry_ssh_key
-    commands:
-      - apk add --no-cache openssh-client
-      - mkdir -p ~/.ssh && chmod 700 ~/.ssh
-      - printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
-      - chmod 600 ~/.ssh/id_ed25519
-      - ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
-      - |
-        ssh -n -o BatchMode=yes root@10.0.20.10 "
-          set -e
-          mkdir -p /opt/registry/data/private/_breakglass
-          IMAGE=ghcr.io/viktorbarzin/infra-ci:latest
-          docker pull \$IMAGE
-          docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
-          ls -lh /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
-        "
41
.woodpecker/build-ci-image.yml
Normal file
@@ -0,0 +1,41 @@
+# Build the CI tools Docker image used by all infra pipelines.
+# Triggers on changes to ci/Dockerfile only (push to master).
+
+when:
+  event: push
+  branch: master
+  path:
+    include:
+      - 'ci/Dockerfile'
+
+steps:
+  - name: build-and-push
+    image: woodpeckerci/plugin-docker-buildx
+    settings:
+      repo: registry.viktorbarzin.me:5050/infra-ci
+      dockerfile: ci/Dockerfile
+      context: ci/
+      tags:
+        - latest
+        - "${CI_COMMIT_SHA:0:8}"
+      platforms: linux/amd64
+      registry: registry.viktorbarzin.me:5050
+      logins:
+        - registry: registry.viktorbarzin.me:5050
+          username:
+            from_secret: registry_user
+          password:
+            from_secret: registry_password
+
+  - name: slack
+    image: curlimages/curl
+    commands:
+      - |
+        curl -s -X POST -H 'Content-type: application/json' \
+          --data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
+          "$SLACK_WEBHOOK" || true
+    environment:
+      SLACK_WEBHOOK:
+        from_secret: slack_webhook
+    when:
+      status: [success]
33
.woodpecker/build-cli.yml
Normal file
@@ -0,0 +1,33 @@
+when:
+  event: push
+
+clone:
+  git:
+    image: woodpeckerci/plugin-git
+    settings:
+      attempts: 5
+      backoff: 10s
+
+steps:
+  - name: build-image
+    image: woodpeckerci/plugin-docker-buildx
+    settings:
+      username: "viktorbarzin"
+      password:
+        from_secret: dockerhub-pat
+      repo:
+        - viktorbarzin/infra
+        - registry.viktorbarzin.me:5050/infra
+      logins:
+        - registry: https://index.docker.io/v1/
+          username: viktorbarzin
+          password:
+            from_secret: dockerhub-pat
+      dockerfile: cli/Dockerfile
+      context: cli
+      auto_tag: true
+      # cache_from/cache_to removed: registry cache corruption causes
+      # "short read: expected 32 bytes" BuildKit errors. Inline cache
+      # will be re-populated once a clean image is pushed.
+      # cache_from: "registry.viktorbarzin.me:5050/infra:latest"
+      # cache_to: "type=inline"
@@ -19,34 +19,13 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
-      partial: false
       depth: 2
       attempts: 5
       backoff: 10s
 
 steps:
-  # Audit feed for the allow-then-audit contribution model: any master push by
-  # a NON-admin author is surfaced in Slack (Viktor's own pushes are not).
-  # Runs before apply and never blocks it. Note: [ci skip] commits never reach
-  # this step (Woodpecker skips the whole pipeline) — hence the rule that
-  # non-admins must not use [ci skip].
-  - name: notify-nonadmin-push
-    image: curlimages/curl
-    environment:
-      SLACK_WEBHOOK:
-        from_secret: slack_webhook
-    commands:
-      - |
-        case "$CI_COMMIT_AUTHOR" in
-          viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;;
-        esac
-        SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\')
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \
-          "$SLACK_WEBHOOK" || true
-
   - name: apply
-    image: ghcr.io/viktorbarzin/infra-ci:latest
+    image: registry.viktorbarzin.me:5050/infra-ci:latest
     pull: true
     backend_options:
       kubernetes:
@@ -58,12 +37,6 @@ steps:
     environment:
       SLACK_WEBHOOK:
         from_secret: slack_webhook
-      # Each `- |` command runs in a fresh shell, so we can't rely on an
-      # `export VAULT_ADDR=...` in the auth command persisting — pin it at
-      # step level. VAULT_TOKEN is still per-command; we persist it to
-      # ~/.vault-token (auto-read by `vault` CLI) so downstream commands
-      # don't need explicit token propagation.
-      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
     commands:
       # ── Skip CI commits ──
       - |
@@ -82,49 +55,9 @@ steps:
       # ── Vault auth ──
       - |
         SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
-        VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
+        export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
+        export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
           -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
-        if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
-          echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
-          exit 1
-        fi
-        # Persist for downstream `- |` blocks (each runs in a fresh shell,
-        # so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
-        # and `scripts/state-sync` all fall through to ~/.vault-token when
-        # the env var is unset.
-        umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
-
-      # ── Generate kubeconfig from projected SA token ──
-      # terragrunt.hcl injects `-var kube_config_path=<repo>/config` for every
-      # terraform invocation, so we need a kubeconfig file at that path. The
-      # `default` SA in the woodpecker namespace is cluster-admin (via the
-      # `woodpecker-default` ClusterRoleBinding), so the projected token is
-      # sufficient to apply any stack. Using `tokenFile` (not an inline token)
-      # so the provider re-reads it if kubelet rotates the projected token
-      # mid-pipeline.
-      - |
-        cat > config <<'EOF'
-        apiVersion: v1
-        kind: Config
-        clusters:
-        - name: kubernetes
-          cluster:
-            server: https://10.0.20.100:6443
-            certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
-        contexts:
-        - name: ci
-          context:
-            cluster: kubernetes
-            user: ci
-        current-context: ci
-        users:
-        - name: ci
-          user:
-            tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
-        EOF
-        chmod 600 config
-        # Sanity check: kubeconfig works
-        kubectl --kubeconfig=config get ns kube-system -o name >/dev/null
 
       # ── Detect changed stacks ──
       - |
@@ -136,25 +69,6 @@ steps:
           git fetch --deepen=1 origin master 2>/dev/null || true
         fi
 
-        # Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA).
-        # HEAD~1 is WRONG for merge commits — it is the first parent (the
-        # feature-branch side), so the diff shows the OTHER lineage's files
-        # and silently skips the stacks this push actually changed
-        # (bit ci-pipeline-health on 2026-06-12, pipeline 128).
-        DIFF_BASE="HEAD~1"
-        if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then
-          git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true
-          # Restarted pipelines after master moved produce REVERSE diffs
-          # (CI_PREV ahead of the checked-out HEAD re-applied stale trees and
-          # reverted a sibling apply on 2026-06-12, pipeline 148). Only use
-          # CI_PREV when it is an ancestor of HEAD.
-          if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \
-            && git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then
-            DIFF_BASE="$CI_PREV_COMMIT_SHA"
-          fi
-        fi
-        echo "Diff base: $DIFF_BASE"
-
         # If still no parent, apply all platform stacks as a safe fallback
         if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
           echo "Cannot determine changed files — applying ALL platform stacks"
@@ -162,14 +76,14 @@ steps:
           > .app_apply
         else
           # Check if global files changed (triggers full platform apply)
-          GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
+          GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
 
           if [ -n "$GLOBAL_CHANGED" ]; then
             echo "Global files changed — applying ALL platform stacks"
             echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
           else
             # Detect platform stacks that changed
-            git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
+            git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
             > .platform_apply
             while read -r stack; do
               if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
@@ -180,7 +94,7 @@ steps:
 
             # Detect app stacks that changed
             > .app_apply
-            git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
+            git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
               if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
                 continue  # Skip platform stacks
               fi
@@ -200,7 +114,7 @@ steps:
       # ── Pre-warm provider cache ──
       - |
         if [ -s .platform_apply ] || [ -s .app_apply ]; then
-          FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1)
+          FIRST_STACK=$(head -1 .platform_apply .app_apply 2>/dev/null | head -1)
           if [ -n "$FIRST_STACK" ]; then
             echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
             cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
@@ -209,7 +123,6 @@ steps:
 
       # ── Apply platform stacks (serial, with Vault advisory locks) ──
       - |
-        FAILED_PLATFORM_STACKS=""
         if [ -s .platform_apply ]; then
           echo "=== Applying platform stacks (serial, locked) ==="
           while read -r stack; do
@@ -222,9 +135,8 @@ steps:
               if echo "$OUTPUT" | grep -q "is locked by"; then
                 echo "[$stack] SKIPPED (locked by another session)"
               else
-                echo "$OUTPUT" | tail -50
+                echo "$OUTPUT" | tail -5
                 echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
               fi
             else
               echo "$OUTPUT" | tail -3
@@ -232,12 +144,9 @@ steps:
             fi
           done < .platform_apply
         fi
-        # Deferred until after app stacks so both lists get a chance to run.
-        echo "$FAILED_PLATFORM_STACKS" > .platform_failed
 
       # ── Apply app stacks (serial, with Vault advisory locks) ──
       - |
-        FAILED_APP_STACKS=""
         if [ -s .app_apply ]; then
           echo "=== Applying app stacks (serial, locked) ==="
           while read -r stack; do
@@ -250,9 +159,8 @@ steps:
               if echo "$OUTPUT" | grep -q "is locked by"; then
                 echo "[$stack] SKIPPED (locked by another session)"
               else
-                echo "$OUTPUT" | tail -50
+                echo "$OUTPUT" | tail -5
                 echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
               fi
             else
               echo "$OUTPUT" | tail -3
@@ -260,15 +168,6 @@ steps:
             fi
           done < .app_apply
         fi
-        # Fail the step loudly so the pipeline `default` workflow state
-        # reflects reality — the service-upgrade agent and CI alert cascade
-        # both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
-        # counted as failures.
-        FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
-        if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
-          echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
-          exit 1
-        fi
 
       # ── Commit and push state changes ──
       - |
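The diff-base comments in the hunks above describe a selection rule: use the CI-reported previous commit only when it differs from the current commit, exists locally, and is an ancestor of HEAD; otherwise fall back to `HEAD~1`. A minimal sketch of that rule as a standalone function follows. The name `pick_diff_base` and its argument order are assumptions made for illustration, and the sketch leaves out the `git fetch --depth=50` retry from the pipeline.

```shell
#!/bin/sh
# Sketch only: pick_diff_base is a hypothetical helper, not part of the pipeline.
#   $1 = previous commit SHA reported by CI (may be empty)
#   $2 = SHA of the commit being built
# Prints the revision to use as the left side of `git diff`.
pick_diff_base() {
    prev=$1
    current=$2
    base="HEAD~1"   # fallback when the CI-provided SHA is unusable

    if [ -n "$prev" ] && [ "$prev" != "$current" ]; then
        # Accept prev only if the commit object exists locally AND it is an
        # ancestor of HEAD; a non-ancestor would yield a reverse or unrelated diff.
        if git cat-file -e "${prev}^{commit}" 2>/dev/null \
            && git merge-base --is-ancestor "$prev" HEAD 2>/dev/null; then
            base=$prev
        fi
    fi
    printf '%s\n' "$base"
}
```

Under these assumptions, a caller could write `git diff --name-only "$(pick_diff_base "$CI_PREV_COMMIT_SHA" "$CI_COMMIT_SHA")" HEAD`. In a shallow clone the ancestry check may fail simply because history is missing, which degrades to the `HEAD~1` fallback rather than to a wrong base.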
@@ -9,13 +9,12 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
-      partial: false
       depth: 1
       attempts: 3
 
 steps:
   - name: detect-drift
-    image: ghcr.io/viktorbarzin/infra-ci:latest
+    image: registry.viktorbarzin.me:5050/infra-ci:latest
     pull: true
     backend_options:
       kubernetes:
@@ -42,44 +41,11 @@ steps:
         export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
           -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
 
-      # ── Generate kubeconfig from projected SA token ──
-      # See default.yml for rationale. terragrunt.hcl injects
-      # `-var kube_config_path=<repo>/config` for every terraform invocation,
-      # so we need a kubeconfig file at that path. The woodpecker default SA
-      # is cluster-admin, so the projected token is sufficient.
-      - |
-        cat > config <<'EOF'
-        apiVersion: v1
-        kind: Config
-        clusters:
-        - name: kubernetes
-          cluster:
-            server: https://10.0.20.100:6443
-            certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
-        contexts:
-        - name: ci
-          context:
-            cluster: kubernetes
-            user: ci
-        current-context: ci
-        users:
-        - name: ci
-          user:
-            tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
-        EOF
-        chmod 600 config
-        kubectl --kubeconfig=config get ns kube-system -o name >/dev/null
-
       # ── Run terraform plan on all stacks ──
-      # Emits two timestamps per drifted stack so the Pushgateway/Prometheus
-      # side can compute drift-age-hours via `time() - drift_stack_first_seen`.
       - |
         DRIFTED=""
         CLEAN=0
         ERRORS=""
-        NOW=$(date +%s)
-        # Metrics accumulator — written once per stack, then pushed as a batch.
-        METRICS=""
 
         for stack_dir in stacks/*/; do
           stack=$(basename "$stack_dir")
@@ -90,50 +56,12 @@ steps:
           EXIT=$?
 
           case $EXIT in
-            0)
-              echo "OK (no changes)"
-              CLEAN=$((CLEAN + 1))
-              # drift_stack_state=0 means clean; age-hours irrelevant so we
-              # still push 0 so per-stack gauges don't go stale.
-              METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 0\n"
-              METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} 0\n"
-              ;;
-            1)
-              echo "ERROR"
-              ERRORS="$ERRORS $stack"
-              METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 2\n"
-              ;;
-            2)
-              echo "DRIFT DETECTED"
-              DRIFTED="$DRIFTED $stack"
-              # Fetch first-seen timestamp from Pushgateway (preserve across runs).
-              FIRST_SEEN=$(curl -s "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics" \
-                | awk -v s="$stack" '$1 == "drift_stack_first_seen{stack=\""s"\"}" {print $2; exit}')
-              if [ -z "$FIRST_SEEN" ] || [ "$FIRST_SEEN" = "0" ]; then
-                FIRST_SEEN="$NOW"
-              fi
-              AGE_HOURS=$(( (NOW - FIRST_SEEN) / 3600 ))
-              METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 1\n"
-              METRICS="${METRICS}drift_stack_first_seen{stack=\"$stack\"} $FIRST_SEEN\n"
-              METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} $AGE_HOURS\n"
-              ;;
+            0) echo "OK (no changes)"; CLEAN=$((CLEAN + 1)) ;;
+            1) echo "ERROR"; ERRORS="$ERRORS $stack" ;;
+            2) echo "DRIFT DETECTED"; DRIFTED="$DRIFTED $stack" ;;
           esac
         done
-
-        # Summary counters — single gauge per run.
-        DRIFT_COUNT=$(echo "$DRIFTED" | wc -w)
-        ERROR_COUNT=$(echo "$ERRORS" | wc -w)
-        METRICS="${METRICS}drift_stack_count $DRIFT_COUNT\n"
-        METRICS="${METRICS}drift_error_count $ERROR_COUNT\n"
-        METRICS="${METRICS}drift_clean_count $CLEAN\n"
-        METRICS="${METRICS}drift_detection_last_run_timestamp $NOW\n"
-
-        # ── Push to Pushgateway ──
-        # One batched push keeps the run atomic: either all metrics land or none.
-        printf "%b" "$METRICS" | curl -s --data-binary @- \
-          http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drift-detection \
-          || echo "(pushgateway unavailable, metrics lost for this run)"
 
         echo ""
         echo "=== Drift Detection Summary ==="
         echo "Clean: $CLEAN stacks"
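The removed drift-metrics code above keeps a per-stack first-seen timestamp across runs and derives an age in whole hours from it. The two steps can be sketched as small functions. The names `first_seen_for` and `drift_age_hours` are hypothetical. Unlike the pipeline's exact `$1 ==` comparison, this sketch matches on the `stack` label alone and normalises the value with `%.0f`, on the assumption that scraped lines may carry extra labels or values in exponent notation.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not taken from the pipeline.

# first_seen_for STACK
# Reads Prometheus text exposition on stdin and prints the value of
# drift_stack_first_seen for the given stack as an integer, or nothing.
first_seen_for() {
    awk -v s="$1" '
        index($0, "drift_stack_first_seen{") == 1 && index($0, "stack=\"" s "\"") > 0 {
            printf "%.0f\n", $NF   # last field is the sample value
            exit
        }'
}

# drift_age_hours FIRST_SEEN NOW
# Whole hours between two epoch timestamps. An empty or zero FIRST_SEEN
# means the drift was first observed in this run, so the age is 0.
drift_age_hours() {
    first=$1
    now=$2
    if [ -z "$first" ] || [ "$first" = "0" ]; then
        first=$now
    fi
    echo $(( ($now - $first) / 3600 ))
}
```

A caller would pipe the gateway's `/metrics` output into `first_seen_for "$stack"` and pass the result with `$(date +%s)` to `drift_age_hours`. The label match is a substring test, so a different label whose name ends in `stack` with the same value would also match; a stricter parser would be needed if that can occur.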
@@ -5,75 +5,56 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
-      partial: false
       depth: 2
 
 steps:
   - name: run-issue-responder
-    image: alpine:3.20
+    image: python:3.12-alpine
     commands:
-      - apk add --no-cache curl jq
+      - apk add --no-cache openssh-client curl jq
       # Authenticate to Vault via K8s SA JWT
       - |
         SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
         VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
-          -d "{\"role\":\"ci\",\"jwt\":\"$$SA_TOKEN\"}")
-        VAULT_TOKEN=$(echo "$$VAULT_RESP" | jq -r .auth.client_token)
-        if [ -z "$$VAULT_TOKEN" ] || [ "$$VAULT_TOKEN" = "null" ]; then
+          -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
+        VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
+        if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
           echo "ERROR: Vault authentication failed"
           exit 1
         fi
         echo "Vault authenticated"
-      # Fetch API token for claude-agent-service
+      # Fetch DevVM SSH key
       - |
-        AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $$VAULT_TOKEN" \
-          http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
-          jq -r '.data.data.api_bearer_token')
-        if [ -z "$$AGENT_TOKEN" ] || [ "$$AGENT_TOKEN" = "null" ]; then
-          echo "ERROR: Failed to fetch agent API token"
+        curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
+          http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
+          jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
+        chmod 600 /tmp/devvm-key
+        if [ ! -s /tmp/devvm-key ]; then
+          echo "ERROR: Failed to fetch DevVM SSH key"
           exit 1
         fi
-        echo "Agent token fetched"
-      # Submit job to claude-agent-service
+        echo "SSH key fetched"
+      # SSH to DevVM and run issue-responder agent
       - |
         ISSUE_NUM="${ISSUE_NUMBER:-}"
         ISSUE_TITLE="${ISSUE_TITLE:-}"
         ISSUE_LABELS="${ISSUE_LABELS:-}"
         ISSUE_URL="${ISSUE_URL:-}"
 
-        if [ -z "$$ISSUE_NUM" ]; then
+        if [ -z "$ISSUE_NUM" ]; then
           echo "ERROR: No issue number provided"
           exit 1
         fi
 
-        echo "Processing issue #$$ISSUE_NUM: $$ISSUE_TITLE"
+        echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
+        echo "Labels: $ISSUE_LABELS"
 
-        PAYLOAD=$(jq -n \
-          --arg prompt "Process GitHub Issue #$$ISSUE_NUM: $$ISSUE_TITLE. Labels: $$ISSUE_LABELS. URL: $$ISSUE_URL. Read the issue body via GitHub API, investigate, and take appropriate action." \
-          --arg agent ".claude/agents/issue-responder" \
-          '{prompt: $prompt, agent: $agent, max_budget_usd: 10, timeout_seconds: 1800}')
-
-        RESP=$(curl -sf -X POST \
-          -H "Authorization: Bearer $$AGENT_TOKEN" \
-          -H "Content-Type: application/json" \
-          -d "$$PAYLOAD" \
-          http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
-
-        JOB_ID=$(echo "$$RESP" | jq -r '.job_id')
-        echo "Job submitted: $$JOB_ID"
-      # Poll for completion (30min max)
-      - |
-        for i in $(seq 1 120); do
-          sleep 15
-          RESULT=$(curl -sf \
-            -H "Authorization: Bearer $$AGENT_TOKEN" \
-            http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$$JOB_ID)
-          STATUS=$(echo "$$RESULT" | jq -r '.status')
-          echo "[$$i/120] Status: $$STATUS"
-          if [ "$$STATUS" != "running" ]; then
-            echo "$$RESULT" | jq .
-            if [ "$$STATUS" = "completed" ]; then exit 0; else exit 1; fi
-          fi
-        done
-        echo "ERROR: Job timed out after 30 minutes"
-        exit 1
+        ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
+          "cd ~/code && git -C infra stash && git -C infra pull --rebase && git -C infra stash pop 2>/dev/null; \
+          ~/.local/bin/claude -p \
+          --agent infra/.claude/agents/issue-responder \
+          --dangerously-skip-permissions \
+          --max-budget-usd 10 \
+          'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
+      # Cleanup
+      - rm -f /tmp/devvm-key
49
.woodpecker/k8s-portal.yml
Normal file
|
|
@ -0,0 +1,49 @@
|
||||||
|
when:
|
||||||
|
event: push
|
||||||
|
branch: master
|
||||||
|
path:
|
||||||
|
include:
|
||||||
|
- "stacks/platform/modules/k8s-portal/files/**"
|
||||||
|
|
||||||
|
clone:
|
||||||
|
git:
|
||||||
|
image: woodpeckerci/plugin-git
|
||||||
|
settings:
|
||||||
|
attempts: 5
|
||||||
|
backoff: 10s
|
||||||
|
|
||||||
|
steps:
|
||||||
|
- name: build-and-push
|
||||||
|
image: woodpeckerci/plugin-docker-buildx
|
||||||
|
settings:
|
||||||
|
username: "viktorbarzin"
|
||||||
|
password:
|
||||||
|
from_secret: dockerhub-pat
|
||||||
|
repo: viktorbarzin/k8s-portal
|
||||||
|
dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
|
||||||
|
context: stacks/platform/modules/k8s-portal/files
|
||||||
|
platforms:
|
||||||
|
- linux/amd64
|
||||||
|
tag: ["${CI_PIPELINE_NUMBER}", "latest"]
|
||||||
|
cache_from: "viktorbarzin/k8s-portal:latest"
|
||||||
|
cache_to: "type=inline"
|
||||||
|
|
||||||
|
- name: deploy
|
||||||
|
image: bitnami/kubectl:latest
|
||||||
|
commands:
|
||||||
|
- "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
|
||||||
|
- "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
|
||||||
|
- "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
|
||||||
|
|
||||||
|
- name: slack
|
||||||
|
image: curlimages/curl
|
||||||
|
commands:
|
||||||
|
- |
|
||||||
|
curl -s -X POST -H 'Content-type: application/json' \
|
||||||
|
--data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
|
||||||
|
"$SLACK_WEBHOOK" || true
|
||||||
|
environment:
|
||||||
|
SLACK_WEBHOOK:
|
||||||
|
from_secret: slack_webhook
|
||||||
|
when:
|
||||||
|
status: [success, failure]
|
||||||
|
|
@ -11,14 +11,13 @@ clone:
|
||||||
git:
|
git:
|
||||||
image: woodpeckerci/plugin-git
|
image: woodpeckerci/plugin-git
|
||||||
settings:
|
settings:
|
||||||
partial: false
|
|
||||||
depth: 5
|
depth: 5
|
||||||
|
|
||||||
steps:
|
steps:
|
||||||
- name: parse-and-implement
|
- name: parse-and-implement
|
||||||
image: python:3.12-alpine
|
image: python:3.12-alpine
|
||||||
commands:
|
commands:
|
||||||
- apk add --no-cache jq curl git
|
- apk add --no-cache jq curl git openssh-client
|
||||||
- sh scripts/postmortem-pipeline.sh
|
- sh scripts/postmortem-pipeline.sh
|
||||||
|
|
||||||
- name: notify-slack
|
- name: notify-slack
|
||||||
|
|
|
||||||
|
|
@ -5,7 +5,6 @@ clone:
|
||||||
git:
|
git:
|
||||||
image: woodpeckerci/plugin-git
|
image: woodpeckerci/plugin-git
|
||||||
settings:
|
settings:
|
||||||
partial: false
|
|
||||||
attempts: 5
|
attempts: 5
|
||||||
backoff: 10s
|
backoff: 10s
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -1,64 +0,0 @@
|
||||||
# Sync infra/scripts/pve-nfs-exports → PVE host /etc/exports on change.
|
|
||||||
#
|
|
||||||
# Wave 6b of the state-drift consolidation plan: move the "scp + exportfs -ra"
|
|
||||||
# deploy step out of runbook-human-hands and into CI so the Proxmox NFS export
|
|
||||||
# table tracks git.
|
|
||||||
#
|
|
||||||
# Trigger: push to master that touches `scripts/pve-nfs-exports`. The file
|
|
||||||
# header documents the deploy invocation; this pipeline codifies it.
|
|
||||||
#
|
|
||||||
# Credentials:
|
|
||||||
# - pve_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
|
|
||||||
# 2026-04-18 as `woodpecker-pve-nfs-exports-sync`). Public key lives in
|
|
||||||
# /root/.ssh/authorized_keys on the PVE host. Private key mirrored in
|
|
||||||
# Vault `secret/woodpecker/pve_ssh_key` for recovery.
|
|
||||||
|
|
||||||
when:
|
|
||||||
- event: push
|
|
||||||
branch: master
|
|
||||||
path: scripts/pve-nfs-exports
|
|
||||||
- event: manual
|
|
||||||
|
|
||||||
clone:
|
|
||||||
git:
|
|
||||||
image: woodpeckerci/plugin-git
|
|
||||||
settings:
|
|
||||||
partial: false
|
|
||||||
depth: 1
|
|
||||||
attempts: 3
|
|
||||||
|
|
||||||
steps:
|
|
||||||
- name: deploy
|
|
||||||
image: alpine:3.20
|
|
||||||
environment:
|
|
||||||
PVE_SSH_KEY:
|
|
||||||
from_secret: pve_ssh_key
|
|
||||||
SLACK_WEBHOOK:
|
|
||||||
from_secret: slack_webhook
|
|
||||||
commands:
|
|
||||||
- apk add --no-cache openssh-client curl
|
|
||||||
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
|
|
||||||
- printf '%s\n' "$PVE_SSH_KEY" > ~/.ssh/id_ed25519
|
|
||||||
- chmod 600 ~/.ssh/id_ed25519
|
|
||||||
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
|
|
||||||
- ssh-keyscan -t ed25519 192.168.1.127 >> ~/.ssh/known_hosts 2>/dev/null
|
|
||||||
# Diff what we'd ship, so pipeline logs show the intended change.
|
|
||||||
- echo '---diff---' && ssh -o BatchMode=yes root@192.168.1.127 "cat /etc/exports" > /tmp/remote.exports || true
|
|
||||||
- diff -u /tmp/remote.exports scripts/pve-nfs-exports || true
|
|
||||||
- echo '---applying---'
|
|
||||||
- scp -o BatchMode=yes scripts/pve-nfs-exports root@192.168.1.127:/etc/exports
|
|
||||||
- ssh -o BatchMode=yes root@192.168.1.127 "exportfs -ra && exportfs -s | head -5"
|
|
||||||
- echo '---done---'
|
|
||||||
|
|
||||||
- name: slack
|
|
||||||
image: curlimages/curl:8.11.0
|
|
||||||
environment:
|
|
||||||
SLACK_WEBHOOK:
|
|
||||||
from_secret: slack_webhook
|
|
||||||
commands:
|
|
||||||
- |
|
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
|
||||||
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
|
|
||||||
"$SLACK_WEBHOOK" || true
|
|
||||||
when:
|
|
||||||
status: [success, failure]
|
|
||||||
|
|
@ -1,157 +0,0 @@
|
||||||
# Sync modules/docker-registry/* → /opt/registry/ on docker-registry VM
|
|
||||||
# (10.0.20.10) on change, and bounce containers + nginx when needed.
|
|
||||||
#
|
|
||||||
# Replaces the manual "ssh + scp + docker compose up -d" that was required
|
|
||||||
# after the 2026-04-19 `registry:2 → registry:2.8.3` pin landed. The deploy
|
|
||||||
# flow is now: edit a file in modules/docker-registry/ → git push → this
|
|
||||||
# pipeline runs → registry VM picks up the change.
|
|
||||||
#
|
|
||||||
# Trigger: push to master that touches any managed file (see `when.path`),
|
|
||||||
# or a manual run via Woodpecker UI / API.
|
|
||||||
#
|
|
||||||
# Credentials:
|
|
||||||
# - registry_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
|
|
||||||
# 2026-04-19 as `woodpecker-registry-config-sync`). Public key lives in
|
|
||||||
# /root/.ssh/authorized_keys on 10.0.20.10. Private key mirrored in
|
|
||||||
# Vault `secret/woodpecker/registry_ssh_key` (subkeys private_key /
|
|
||||||
# public_key / known_hosts_entry) for recovery.
|
|
||||||
#
|
|
||||||
# Why bounce nginx every time: nginx caches upstream DNS at startup, so if
|
|
||||||
# any registry-* container gets recreated (new IP on the docker bridge),
|
|
||||||
# nginx keeps forwarding to a stale address. Always restart nginx as the
|
|
||||||
# last step — see docs/runbooks/registry-vm.md § "Bouncing registry
|
|
||||||
# containers — the nginx DNS trap".
|
|
||||||
|
|
||||||
when:
|
|
||||||
- event: push
|
|
||||||
branch: master
|
|
||||||
path:
|
|
||||||
include:
|
|
||||||
- 'modules/docker-registry/docker-compose.yml'
|
|
||||||
- 'modules/docker-registry/fix-broken-blobs.sh'
|
|
||||||
- 'modules/docker-registry/cleanup-tags.sh'
|
|
||||||
- 'modules/docker-registry/nginx_registry.conf'
|
|
||||||
- 'modules/docker-registry/config-private.yml'
|
|
||||||
- event: manual
|
|
||||||
|
|
||||||
clone:
|
|
||||||
git:
|
|
||||||
image: woodpeckerci/plugin-git
|
|
||||||
settings:
|
|
||||||
partial: false
|
|
||||||
depth: 1
|
|
||||||
attempts: 3
|
|
||||||
|
|
||||||
steps:
|
|
||||||
- name: deploy
|
|
||||||
image: alpine:3.20
|
|
||||||
environment:
|
|
||||||
REGISTRY_SSH_KEY:
|
|
||||||
from_secret: registry_ssh_key
|
|
||||||
commands:
|
|
||||||
- apk add --no-cache openssh-client rsync
|
|
||||||
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
|
|
||||||
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
|
|
||||||
- chmod 600 ~/.ssh/id_ed25519
|
|
||||||
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
|
|
||||||
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
|
|
||||||
- echo '---detecting changed files---'
|
|
||||||
- |
|
|
||||||
# Mirror the remote state of each file so we can diff and decide what bounces.
|
|
||||||
CHANGED=""
|
|
||||||
for f in docker-compose.yml fix-broken-blobs.sh cleanup-tags.sh nginx_registry.conf config-private.yml; do
|
|
||||||
LOCAL="modules/docker-registry/$f"
|
|
||||||
REMOTE="/opt/registry/$f"
|
|
||||||
if [ ! -f "$LOCAL" ]; then
|
|
||||||
echo "skip $f (not in repo)"
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
# Pull the remote copy into /tmp for a diff. ssh -n avoids stdin-hogging.
|
|
||||||
REMOTE_CONTENT=$(ssh -n -o BatchMode=yes root@10.0.20.10 "cat $REMOTE 2>/dev/null || true")
|
|
||||||
LOCAL_CONTENT=$(cat "$LOCAL")
|
|
||||||
if [ "$LOCAL_CONTENT" = "$REMOTE_CONTENT" ]; then
|
|
||||||
echo "unchanged: $f"
|
|
||||||
else
|
|
||||||
echo "---diff: $f ---"
|
|
||||||
echo "$REMOTE_CONTENT" > /tmp/remote.txt
|
|
||||||
diff -u /tmp/remote.txt "$LOCAL" | head -40 || true
|
|
||||||
CHANGED="$CHANGED $f"
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
echo "CHANGED_FILES=$CHANGED"
|
|
||||||
printf '%s' "$CHANGED" > /tmp/changed
|
|
||||||
- echo '---applying---'
|
|
||||||
- |
|
|
||||||
CHANGED=$(cat /tmp/changed)
|
|
||||||
if [ -z "$CHANGED" ]; then
|
|
||||||
echo "No files changed — exiting cleanly (manual run with no drift)."
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
# Ship every managed file unconditionally — scp is cheap, idempotency is safe.
|
|
||||||
scp -o BatchMode=yes \
|
|
||||||
modules/docker-registry/docker-compose.yml \
|
|
||||||
modules/docker-registry/fix-broken-blobs.sh \
|
|
||||||
modules/docker-registry/cleanup-tags.sh \
|
|
||||||
modules/docker-registry/nginx_registry.conf \
|
|
||||||
modules/docker-registry/config-private.yml \
|
|
||||||
root@10.0.20.10:/opt/registry/
|
|
||||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
|
||||||
chmod +x /opt/registry/fix-broken-blobs.sh /opt/registry/cleanup-tags.sh
|
|
||||||
'
|
|
||||||
- echo '---bouncing containers + nginx---'
|
|
||||||
- |
|
|
||||||
CHANGED=$(cat /tmp/changed)
|
|
||||||
# Compose-visible files: docker-compose.yml (image tag, mounts) and
|
|
||||||
# config-private.yml (registry config → needs registry-private reload).
|
|
||||||
BOUNCE_COMPOSE=0
|
|
||||||
BOUNCE_NGINX=0
|
|
||||||
echo "$CHANGED" | grep -q "docker-compose.yml" && BOUNCE_COMPOSE=1
|
|
||||||
echo "$CHANGED" | grep -q "config-private.yml" && BOUNCE_COMPOSE=1
|
|
||||||
echo "$CHANGED" | grep -q "nginx_registry.conf" && BOUNCE_NGINX=1
|
|
||||||
|
|
||||||
if [ "$BOUNCE_COMPOSE" = "1" ]; then
|
|
||||||
echo "compose-visible change → pull + up -d"
|
|
||||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
|
||||||
cd /opt/registry
|
|
||||||
docker compose pull 2>&1 | tail -5
|
|
||||||
docker compose up -d 2>&1 | tail -20
|
|
||||||
'
|
|
||||||
# Any compose recreate requires nginx DNS refresh too.
|
|
||||||
BOUNCE_NGINX=1
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ "$BOUNCE_NGINX" = "1" ]; then
|
|
||||||
echo "bouncing nginx to flush upstream DNS cache"
|
|
||||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
|
||||||
docker restart registry-nginx
|
|
||||||
sleep 3
|
|
||||||
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" | grep -E "registry-"
|
|
||||||
'
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ "$BOUNCE_COMPOSE" = "0" ] && [ "$BOUNCE_NGINX" = "0" ]; then
|
|
||||||
echo "only script files changed (cron-picks-up semantics) — no bounce needed"
|
|
||||||
fi
|
|
||||||
- echo '---verify---'
|
|
||||||
- |
|
|
||||||
ssh -n -o BatchMode=yes root@10.0.20.10 '
|
|
||||||
echo "=== catalog ==="
|
|
||||||
# Prove auth + routing survived.
|
|
||||||
curl -sk -o /dev/null -w "catalog (unauth → 401 expected): HTTP %{http_code}\n" \
|
|
||||||
https://127.0.0.1:5050/v2/
|
|
||||||
echo "=== integrity scan (dry-run) ==="
|
|
||||||
python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | tail -5
|
|
||||||
'
|
|
||||||
|
|
||||||
- name: slack
|
|
||||||
image: curlimages/curl:8.11.0
|
|
||||||
environment:
|
|
||||||
SLACK_WEBHOOK:
|
|
||||||
from_secret: slack_webhook
|
|
||||||
commands:
|
|
||||||
- |
|
|
||||||
curl -s -X POST -H 'Content-type: application/json' \
|
|
||||||
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
|
|
||||||
"$SLACK_WEBHOOK" || true
|
|
||||||
when:
|
|
||||||
status: [success, failure]
|
|
||||||
|
|
@ -6,7 +6,6 @@ clone:
|
||||||
git:
|
git:
|
||||||
image: woodpeckerci/plugin-git
|
image: woodpeckerci/plugin-git
|
||||||
settings:
|
settings:
|
||||||
partial: false
|
|
||||||
attempts: 5
|
attempts: 5
|
||||||
backoff: 10s
|
backoff: 10s
|
||||||
|
|
||||||
|
|
|
||||||
208
AGENTS.md
|
|
@ -9,55 +9,12 @@
|
||||||
- **Ask before `git push`** — always confirm with the user first
|
- **Ask before `git push`** — always confirm with the user first
|
||||||
|
|
||||||
## Execution
|
## Execution
|
||||||
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
|
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
|
||||||
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
|
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
|
||||||
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
|
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
|
||||||
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
||||||
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
|
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
|
||||||
|
|
||||||
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
|
|
||||||
|
|
||||||
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do **NOT** use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
|
|
||||||
|
|
||||||
**Canonical workflow:**
|
|
||||||
|
|
||||||
1. Write the `resource` block that matches the live object.
|
|
||||||
2. In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:
|
|
||||||
```hcl
|
|
||||||
import {
|
|
||||||
to = helm_release.kured
|
|
||||||
id = "kured/kured" # Helm ID format: <namespace>/<release-name>
|
|
||||||
}
|
|
||||||
|
|
||||||
resource "helm_release" "kured" {
|
|
||||||
name = "kured"
|
|
||||||
namespace = "kured"
|
|
||||||
repository = "https://kubereboot.github.io/charts/"
|
|
||||||
chart = "kured"
|
|
||||||
version = "5.7.0"
|
|
||||||
# ... values matching the live release
|
|
||||||
}
|
|
||||||
```
|
|
||||||
3. `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is **0 changes**.
|
|
||||||
4. `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
|
|
||||||
5. After the apply lands cleanly, **delete the `import {}` block** in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
|
|
||||||
|
|
||||||
**Why `import {}` and not `terraform import`:**
|
|
||||||
|
|
||||||
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
|
|
||||||
- Plan-safe: the `import` plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
|
|
||||||
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
|
|
||||||
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
|
|
||||||
|
|
||||||
**Finding the provider-specific ID:** each provider has its own convention.
|
|
||||||
| Resource | ID format | Example |
|
|
||||||
|---|---|---|
|
|
||||||
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
|
|
||||||
| `kubernetes_manifest` | `{"apiVersion":"...","kind":"...","metadata":{"namespace":"...","name":"..."}}` | (pass as HCL object literal) |
|
|
||||||
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
|
|
||||||
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
|
|
||||||
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
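
Step 5 of the workflow above (deleting the `import {}` block once the apply has landed) is easy to forget. A minimal sketch of a check for leftovers, assuming a hypothetical helper named `leftover_import_blocks` that is not part of this repo and that only looks for a top-level `import {` line in `.tf` files:

```shell
#!/bin/sh
# Hypothetical audit helper (not part of this repo): list .tf files under a
# directory that still contain a top-level `import {` stanza.
# Usage: leftover_import_blocks [dir]   (dir defaults to "stacks")
leftover_import_blocks() {
  # grep exits 1 when nothing matches; treat that as an empty result.
  find "${1:-stacks}" -type f -name '*.tf' \
    -exec grep -lE '^import[[:space:]]*\{' {} + || true
}
```

Running `leftover_import_blocks stacks` prints one path per file that still has a stanza. It cannot tell an in-flight adoption from a stale block, so treat the output as a review list rather than a failure condition.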
|
|
||||||
|
|
||||||
## Secrets Management (SOPS)
|
## Secrets Management (SOPS)
|
||||||
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
|
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
|
||||||
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
|
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
|
||||||
|
|
@ -90,7 +47,6 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
||||||
- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
|
- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
|
||||||
- **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
|
- **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
|
||||||
- **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
|
- **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
|
||||||
- **CI compute is external (ADR-0002, 2026-06-12)**: builds, tests, lint, and release jobs run on GitHub Actions hosted runners via each repo's GitHub mirror — never on cluster nodes. In-cluster pipelines exist only for steps that need cluster access (Woodpecker `kubectl set image` deploys, terragrunt applies, certbot). Never add an in-cluster build or test pipeline to any repo; the fallback-build pattern was deliberately removed. After pushing anything that fires a build chain, watch it end-to-end (GHA run → Woodpecker deploy → rollout) before calling the change done — verify live state, not the checkmark.
|
|
||||||
|
|
||||||
## Key Paths
|
## Key Paths
|
||||||
- `stacks/<service>/main.tf` — service definition
|
- `stacks/<service>/main.tf` — service definition
|
||||||
|
|
@ -100,111 +56,25 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
||||||
- `config.tfvars` — non-secret configuration (plaintext)
|
- `config.tfvars` — non-secret configuration (plaintext)
|
||||||
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
|
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
|
||||||
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
|
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
|
||||||
- `scripts/cluster_healthcheck.sh` — 42-check cluster health script (nodes, workloads, monitoring, certs, backups, external reachability)
|
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script
|
||||||
|
|
||||||
## Storage
|
## Storage
|
||||||
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
|
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
|
||||||
- **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
|
- **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
|
||||||
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
|
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
|
||||||
- **NFS server**: Proxmox host at 192.168.1.127 (sole NFS). HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (VM 9000, 10.0.10.15) decommissioned 2026-04-13. Legacy `nfs-truenas` StorageClass name retained (48 PVs bind it; SC names are immutable on PVs) but now points to the Proxmox host, identical to `nfs-proxmox`.
|
- **NFS server**: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
|
||||||
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
|
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
|
||||||
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
|
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
|
||||||
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
|
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
|
||||||
- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config, **VM images via `vzdump-vms`**). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
|
- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
|
||||||
- **vzdump-vms** (Daily 01:00): live `vzdump --mode snapshot` of hand-managed VMs (NOT in TF) → `/mnt/backup/vzdump/`, keep 3/VMID. `VZDUMP_VMIDS` default `102` (devvm) — the only VM imaged today; before this (2026-06-09) no VM was ever imaged. NOT in the incremental offsite manifest; monthly full pass mirrors it. See `docs/architecture/backup-dr.md`.
|
|
||||||
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
|
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
|
||||||
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
|
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
|
||||||
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
|
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
|
||||||
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`).
|
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). `truenas/` renamed to `nfs/`, `pve-backup/nfs-mirror/` removed.
|
||||||
|
|
||||||
## Shared Variables (never hardcode)
|
## Shared Variables (never hardcode)
|
||||||
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
|
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
|
||||||
|
|
||||||
## Redis Service Naming (read before wiring a new consumer)
|
|
||||||
|
|
||||||
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
|
|
||||||
|
|
||||||
| Endpoint | Port(s) | Use for | Backed by |
|
|
||||||
|----------|---------|---------|-----------|
|
|
||||||
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
|
|
||||||
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
|
|
||||||
| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
|
|
||||||
|
|
||||||
**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
|
|
||||||
|
|
||||||
**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
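
The endpoint choice above can be summarised as a small lookup. This is a sketch only: `redis_endpoint_for` and its `short-lived` / `long-lived` arguments are hypothetical names for illustration, while the hostnames and ports are the ones listed in the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): map a client's connection
# pattern to the endpoint(s) from the Redis service table.
redis_endpoint_for() {
  case "$1" in
    short-lived)
      # Default path: HAProxy routes to the current master.
      echo "redis-master.redis.svc.cluster.local:6379"
      ;;
    long-lived)
      # Sentinel path for PUBSUB/BLPOP-style clients; master name is "mymaster".
      for n in 0 1 2; do
        echo "redis-node-$n.redis-headless.redis.svc.cluster.local:26379"
      done
      ;;
    *)
      echo "usage: redis_endpoint_for short-lived|long-lived" >&2
      return 2
      ;;
  esac
}
```

For example, `redis_endpoint_for long-lived` prints the three sentinel addresses, which a sentinel-aware client would take as its sentinel list.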
|
|
||||||
|
|
||||||
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
|
|
||||||
|
|
||||||
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
|
|
||||||
|
|
||||||
**Rule**: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following `lifecycle` block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
|
|
||||||
|
|
||||||
```hcl
|
|
||||||
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
|
|
||||||
lifecycle {
|
|
||||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
|
||||||
}
|
|
||||||
|
|
||||||
# kubernetes_cron_job_v1 (extra job_template nesting)
|
|
||||||
lifecycle {
|
|
||||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Why not a shared module?** Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).
|
|
||||||
|
|
||||||
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
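
The count-based audit shows growth but not which files are missing the marker. A minimal sketch of a per-file check, assuming a hypothetical helper `missing_kyverno_marker` (not part of this repo). It works at file granularity, so a file with two pod-owning resources and a single marker would still pass:

```shell
#!/bin/sh
# Hypothetical audit helper (not part of this repo): list .tf files that
# declare a pod-owning resource but never mention KYVERNO_LIFECYCLE_V1.
# Usage: missing_kyverno_marker [dir]   (dir defaults to "stacks")
missing_kyverno_marker() {
  find "${1:-stacks}" -type f -name '*.tf' -exec grep -lE \
    'resource "kubernetes_(deployment|stateful_set|daemon_set|cron_job)(_v1)?"' {} + |
  while IFS= read -r f; do
    # Print the file only when the marker is absent.
    grep -q 'KYVERNO_LIFECYCLE_V1' "$f" || printf '%s\n' "$f"
  done
}
```

An empty result from `missing_kyverno_marker stacks` means every file that declares such a resource mentions the marker at least once.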
|
|
||||||
|
|
||||||
### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
|
|
||||||
|
|
||||||
When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects these annotations on every Deployment / StatefulSet / DaemonSet:
|
|
||||||
|
|
||||||
```
|
|
||||||
keel.sh/policy: patch
|
|
||||||
keel.sh/trigger: poll
|
|
||||||
keel.sh/pollSchedule: "@every 1h"
|
|
||||||
```
|
|
||||||
|
|
||||||
**`keel.sh/match-tag` is NO LONGER injected — it is actively STRIPPED.** It was the pre-2026-05-26 default (`force + match-tag`), proven unreliable: under `force` it let Keel rewrite tag strings and cross-assign images between containers in multi-image pods. The `blog` deployment was a casualty — its `nginx` ⇄ `nginx-exporter` images got swapped and the site was down 2026-05-26 → 2026-06-01. The policy now sets the annotation to `null` (strips on admission); the 194 pre-existing workloads still carrying it were swept once via `kubectl annotate … keel.sh/match-tag-` on 2026-06-01. The `ignore_changes` line for it (below) is retained as a harmless no-op. See `docs/post-mortems/2026-06-01-keel-match-tag-image-swap.md`.
|
|
||||||
|
|
||||||
To suppress the resulting Terraform drift, **enrolled workloads** must carry the complete `ignore_changes` block below. This is the canonical form — it folds together every marker (see the legend after it):
|
|
||||||
|
|
||||||
```hcl
|
|
||||||
lifecycle {
|
|
||||||
ignore_changes = [
|
|
||||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
|
||||||
metadata[0].annotations["keel.sh/policy"],
|
|
||||||
metadata[0].annotations["keel.sh/trigger"],
|
|
||||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
|
||||||
metadata[0].annotations["keel.sh/match-tag"],
|
|
||||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
|
||||||
metadata[0].annotations["kubernetes.io/change-cause"],
|
|
||||||
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
|
||||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
|
||||||
]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Marker legend** (the names are historical; grep each to audit coverage):
|
|
||||||
|
|
||||||
| Marker | Ignores | Why |
|
|
||||||
|---|---|---|
|
|
||||||
| `# KYVERNO_LIFECYCLE_V1` | `dns_config` | Kyverno injects pod DNS `ndots` config |
|
|
||||||
| `# KYVERNO_LIFECYCLE_V2` | `keel.sh/policy`, `/trigger`, `/pollSchedule` | Kyverno-injected Keel control annotations |
|
|
||||||
| `# KEEL_IGNORE_IMAGE` | `container[N].image` (one line **per container index**, incl. `init_container[N]`) | Keel rewrites the image tag on `policy=patch`; without this, `apply` reverts the bump (a **downgrade**) |
|
|
||||||
| `# KEEL_LIFECYCLE_V1` | `keel.sh/match-tag`, `keel.sh/update-time` (pod template), `kubernetes.io/change-cause`, `deployment.kubernetes.io/revision` | every Keel digest-update restamps these; without ignoring them `apply` strips them → forces a rollout → Keel re-stamps → fight loop |
|
|
||||||
|
|
||||||
**Multi-container caveat**: `container[0].image` only covers the first container. Add one `container[N].image` line for **every** container index, plus `init_container[N].image` for init containers — otherwise the un-ignored container's image still drifts/downgrades.
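
As a rough aid for the caveat above, the following shell sketch prints one `ignore_changes` line per container index. The function name `keel_image_ignores` and its two count arguments are hypothetical (nothing like it is claimed to exist in the repo). The attribute paths copy the `container[0].image` form from the canonical block, and `init_container[N]` is assumed to sit at the same level of the pod spec.

```bash
# Hypothetical helper (not in the repo): emit KEEL_IGNORE_IMAGE lines for a
# workload with $1 containers and $2 init containers.
keel_image_ignores() {
  local containers=${1:-1} init_containers=${2:-0} i

  i=0
  while [ "$i" -lt "$containers" ]; do
    printf '  spec[0].template[0].spec[0].container[%d].image, # KEEL_IGNORE_IMAGE\n' "$i"
    i=$((i + 1))
  done

  # Assumption: init containers live under the same pod spec path.
  i=0
  while [ "$i" -lt "$init_containers" ]; do
    printf '  spec[0].template[0].spec[0].init_container[%d].image, # KEEL_IGNORE_IMAGE\n' "$i"
    i=$((i + 1))
  done
}
```

For example, `keel_image_ignores 2 1` prints three lines (two containers, one init container) that can be pasted into the `ignore_changes` list.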
|
|
||||||
|
|
||||||
The `KEEL_LIFECYCLE_V1` + per-container `KEEL_IGNORE_IMAGE` lines were swept across all enrolled workloads on **2026-05-28** (previously only `llama-cpp` had them; the rest fought on every apply). New enrolled workloads must include the full block. Workloads in un-enrolled namespaces don't receive the annotations and don't need the block.
|
|
||||||
|
|
||||||
Per-workload opt-out: add the label `keel.sh/policy: never` to the Deployment metadata (not the pod template). The policy's `exclude` clause respects it, so no annotations are injected and no `ignore_changes` block is needed.
|
|
||||||
|
|
||||||
**Audit**: the match count of `rg "KYVERNO_LIFECYCLE_V2" stacks/` should equal the number of enrolled workloads. `rg "KEEL_LIFECYCLE_V1" stacks/` should return the same count, since every enrolled workload also carries the `KEEL_LIFECYCLE_V1` lines.
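
The audit can be turned into a pass/fail check. The sketch below is an assumption-laden illustration, not an existing repo script: the function names `count_marker` and `audit_keel_markers` are hypothetical, and it uses plain `grep -rF` instead of `rg` so it runs where ripgrep is not installed. It counts matching lines, which equals the workload count only if each workload carries each marker exactly once, as in the canonical block.

```bash
# Hypothetical helper: count lines under DIR that contain MARKER (fixed string).
count_marker() {
  local marker=$1 dir=$2
  # grep exits 1 when nothing matches; "|| true" keeps that from being an error.
  { grep -rF -- "$marker" "$dir" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# Hypothetical check: succeed only when both marker counts agree.
audit_keel_markers() {
  local dir=${1:-stacks} kyverno_v2 keel_v1
  kyverno_v2=$(count_marker 'KYVERNO_LIFECYCLE_V2' "$dir")
  keel_v1=$(count_marker 'KEEL_LIFECYCLE_V1' "$dir")
  echo "KYVERNO_LIFECYCLE_V2=$kyverno_v2 KEEL_LIFECYCLE_V1=$keel_v1"
  [ "$kyverno_v2" -eq "$keel_v1" ]
}
```

Running `audit_keel_markers stacks` prints both counts and returns non-zero on a mismatch. Comparing either count against the actual number of enrolled workloads still has to be done by hand.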
|
|
||||||
|
|
||||||
**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
|
|
||||||
|
|
||||||
## Tier System
|
## Tier System
|
||||||
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
|
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
|
||||||
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
|
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
|
||||||
|
|
@ -214,10 +84,10 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
|
||||||
## Infrastructure
|
## Infrastructure
|
||||||
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
|
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
|
||||||
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
|
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
|
||||||
- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
|
- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
|
||||||
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
|
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
|
||||||
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
|
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
|
||||||
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
|
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
|
||||||
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
|
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
|
||||||
|
|
||||||
## Contributor Onboarding
|
## Contributor Onboarding
|
||||||
|
|
@ -227,69 +97,7 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
|
||||||
4. Viktor reviews → CI applies → Slack notification
|
4. Viktor reviews → CI applies → Slack notification
|
||||||
5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide
|
5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide
|
||||||
|
|
||||||
### Non-admin workstation users — the AGENT does the git work
|
|
||||||
|
|
||||||
Non-admin devvm users (power-user / namespace-owner tiers) may not know git at
|
|
||||||
all. Their agent handles every version-control step silently — never ask them
|
|
||||||
to commit, push, pull, or open a PR, and never surface git jargon at them.
|
|
||||||
Their infra clone arrives preconfigured: git identity, a `forgejo` remote
|
|
||||||
authenticated via `~/.git-credentials`, and `master` tracking `forgejo/master`
|
|
||||||
(auto-freshened hourly and at session launch, fast-forward only).
|
|
||||||
|
|
||||||
Two per-user layouts exist (`code_layout` in
|
|
||||||
`scripts/workstation/roster.yaml`): `single` (the default) — `~/code` IS the
|
|
||||||
locked infra clone — and `workspace` — `~/code` is a plain directory of
|
|
||||||
per-project clones: the infra clone at `~/code/infra`, plus each roster
|
|
||||||
`repos` entry (e.g. `~/code/tripit`) cloned from Forgejo `viktor/<name>` with
|
|
||||||
the user's own PAT. The reconcile auto-migrates a single-layout `~/code` when
|
|
||||||
a user is flipped to `workspace`, and keeps every clone fresh either way.
|
|
||||||
|
|
||||||
The model is **allow-then-audit** (Viktor, 2026-06-10): whitelisted users (emo)
|
|
||||||
push straight to `master` — no PR gate — and the record of *what changed and
|
|
||||||
why* is what matters. Force-push is disabled for everyone, so master history
|
|
||||||
is append-only.
|
|
||||||
|
|
||||||
**Feature-sized work is worktree-first** (org rule, 2026-06-10): develop in an
|
|
||||||
isolated worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>` off
|
|
||||||
`forgejo/master`) so concurrent agent sessions never collide in the clone, then
|
|
||||||
land by merging latest master into the branch and pushing it
|
|
||||||
(`git push forgejo HEAD:master`, or the PR fallback below if not whitelisted) —
|
|
||||||
the audit-trail rules below apply to the branch's commit messages all the same.
|
|
||||||
Locked (git-crypt) clones can use plain `git worktree add`. Trivial
|
|
||||||
single-commit fixes may be committed directly on a clean `master`. Full
|
|
||||||
lifecycle: `~/.claude/rules/execution.md` §3.
|
|
||||||
|
|
||||||
To land a finished change from such a clone:
|
|
||||||
|
|
||||||
1. Commit on `master`. **The commit message is the audit trail** — this matters
|
|
||||||
more than the change itself:
|
|
||||||
- subject: what changed, specific ("ha-sofia: lower fan curve bias to -5")
|
|
||||||
- body: WHY, in plain words — paraphrase the user's actual request and any
|
|
||||||
reasoning ("Emil asked for quieter fans in the evening; curve was
|
|
||||||
overshooting after the 2026-06-08 redesign")
|
|
||||||
2. `git push forgejo master`. If rejected non-fast-forward: `git pull --rebase
|
|
||||||
forgejo master` and push again.
|
|
||||||
3. **Never use `[ci skip]`** as a non-admin — it hides the change from the
|
|
||||||
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
|
|
||||||
4. Leave the clone on clean `master` so auto-refresh keeps working.
|
|
||||||
5. Tell the user in plain language what happened. Stack changes are
|
|
||||||
auto-applied by CI — verify the live result with the user's read-only
|
|
||||||
kubectl before saying "it's live".
|
|
||||||
|
|
||||||
If a push to `master` is rejected by branch protection (user not on the
|
|
||||||
whitelist — e.g. new users before Viktor grants it), fall back to a
|
|
||||||
`<os-user>/<short-topic>` branch + PR with the user's own PAT
|
|
||||||
(`write:repository` suffices — verified 2026-06-10):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TOK=$(sed -E 's#https://[^:]+:([^@]+)@.*#\1#' ~/.git-credentials)
|
|
||||||
curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' \
|
|
||||||
https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/pulls \
|
|
||||||
-d '{"title":"<title>","head":"<os-user>/<short-topic>","base":"master","body":"<what + why>"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
## Common Operations
|
## Common Operations
|
||||||
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs. Run `homelab manifest` to discover the surface (each verb tagged read/write). Full docs: `cli/README.md`.
  - **Infra loop**: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`).
  - **Kubernetes (v0.2)**: `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only). Config-mutation kubectl verbs are intentionally absent (Terraform-only).
  - **Memory (v0.3)**: `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete`. A direct HTTP client to claude-memory that works even when the memory MCP is down.
  - **CI/deploy (v0.4)**: `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout). `work land` now auto-watches CI to green.
  - **Net/obs (v0.5)**: `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB). Endpoint resolution is baked in, no port-forward.
  - **Usage telemetry (v0.6)**: every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users. This is evidence for what to build next, queryable without reading anyone's home.
  - **Home Assistant (v0.7)**: `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets`; use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default). Entity state/control stays with the `ha` MCP; these cover only what an API-only MCP can't (token + host shell).
|
|
||||||
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
|
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
|
||||||
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
|
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
|
||||||
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
|
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
|
||||||
|
|
@ -297,7 +105,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
|
||||||
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
|
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
|
||||||
|
|
||||||
## Automated Service Upgrades
|
## Automated Service Upgrades
|
||||||
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s) → `claude -p` (upgrade agent)
|
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
|
||||||
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
|
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
|
||||||
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
|
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
|
||||||
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
|
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
|
||||||
|
|
|
||||||
239
CONTEXT.md
239
CONTEXT.md
|
|
@ -1,239 +0,0 @@
|
||||||
# Infra
|
|
||||||
|
|
||||||
Terragrunt-managed homelab declaring a 7-node Kubernetes cluster (1 control plane + 6 workers) on a single Proxmox host. Vault is the secrets source of truth; everything else flows from this repo via `scripts/tg apply`.
|
|
||||||
|
|
||||||
## Language
|
|
||||||
|
|
||||||
### Code organization
|
|
||||||
|
|
||||||
**Service**:
|
|
||||||
The deployed app as a domain concept — one logical thing that runs in the cluster (e.g. immich, technitium, freshrss). Defined by exactly one **Stack**.
|
|
||||||
_Avoid_: bare "app" without the Service definition; "deployment" (collides with K8s `Deployment`).
|
|
||||||
|
|
||||||
**Stack**:
|
|
||||||
The HCL directory under `stacks/<name>/` that defines a Service, applied independently with `scripts/tg apply`. A Stack is the unit of Terraform organisation; a Service is the running thing. They are 1:1 but not synonyms. A Stack is either **flat** (resources declared directly in its own `.tf` files — the majority, ~94, e.g. immich) or wraps a **Stack-local module** (~31, the larger/older ones).
|
|
||||||
_Avoid_: using "Stack" when you mean the running Service.
|
|
||||||
|
|
||||||
**Module**:
|
|
||||||
A unit of HCL consumed via `source =`. Two homes, two purposes: **shared** modules under the top-level `modules/` tree (reused across many Stacks) and **Stack-local** modules nested under `stacks/<name>/modules/` (one Stack only). Bare "Module" means the shared kind.
|
|
||||||
_Avoid_: "library", "package".
|
|
||||||
|
|
||||||
**Factory module**:
|
|
||||||
A shared **Module** that hides convention (defaults, drift handling, secret wiring) behind a small input surface. `modules/kubernetes/` holds exactly four, all factories: `ingress_factory` (103 Stacks), `setup_tls_secret` (93), `nfs_volume` (41), `anubis_instance` (8).
|
|
||||||
_Avoid_: "wrapper"; citing `k8s_app` / `helm_app` / `postgres_app` (these never existed in the repo).
|
|
||||||
|
|
||||||
**Stack-local module**:
|
|
||||||
A single Stack's implementation factored into a nested `stacks/<name>/modules/<name>/`, sourced by that one Stack only — organisation, not reuse. ~31 Stacks (authentik, kyverno, dbaas, mailserver, metallb, cloudflared, technitium, …). The alternative to a **flat** Stack.
|
|
||||||
_Avoid_: calling it a "Module" unqualified (it isn't reusable); "submodule".
|
|
||||||
|
|
||||||
**State tier**:
|
|
||||||
Terraform state-backend partition. **Tier 0** = bootstrap Stacks (`infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`) on local SOPS-encrypted state. **Tier 1** = every other Stack, on PG-backed state.
|
|
||||||
_Avoid_: "phase", "bootstrap stack" — say Tier 0 explicitly.
|
|
||||||
|
|
||||||
### Cluster
|
|
||||||
|
|
||||||
**Node**:
|
|
||||||
A K8s cluster VM — `k8s-master` (control plane) plus `k8s-node1..6` (workers). Default reading of the bare word "node" in this repo.
|
|
||||||
_Avoid_: "k8s node" (redundant), "host" (ambiguous).
|
|
||||||
|
|
||||||
**PVE node** / **PVE host**:
|
|
||||||
The single physical Dell R730 running Proxmox; sole hypervisor and sole NFS server. There is exactly one.
|
|
||||||
_Avoid_: "server", "hypervisor", "Proxmox" alone when you mean the host.
|
|
||||||
|
|
||||||
**Namespace tier**:
|
|
||||||
A namespace-prefix partition (`0-core-*`, `1-cluster-*`, `2-gpu-*`, `3-edge-*`, `4-aux-*`) driving PriorityClass, default resources, and ResourceQuota — generated by **Kyverno policy** from the namespace name. Orthogonal to **State tier**.
|
|
||||||
_Avoid_: "Service tier" (the partition is on the namespace, not the Service); collapsing Namespace tier with State tier — they are different axes.
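
To make the prefix partition concrete, here is a minimal sketch that maps a namespace name to its Namespace tier. It assumes the tier is read purely from the name prefix, as described above. The function name `namespace_tier` and the namespace `3-edge-blog` in the usage note are hypothetical, and the real Kyverno policy may match differently.

```bash
# Hypothetical helper: print the Namespace tier implied by a namespace name,
# or fail when the name carries none of the five tier prefixes.
namespace_tier() {
  case $1 in
    0-core-*)    echo 0-core ;;
    1-cluster-*) echo 1-cluster ;;
    2-gpu-*)     echo 2-gpu ;;
    3-edge-*)    echo 3-edge ;;
    4-aux-*)     echo 4-aux ;;
    *)           return 1 ;;
  esac
}
```

With this sketch, `namespace_tier 3-edge-blog` prints `3-edge`, while a name without a tier prefix returns a non-zero status and prints nothing.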
|
|
||||||
|
|
||||||
**Kyverno policy**:
|
|
||||||
The convention engine of the cluster — a ClusterPolicy or Policy resource that mutates/generates/validates on admission. Owns Namespace tier limits/quotas, `dns_config` injection on every pod-owning workload, Forgejo pull-credential sync across namespaces, TLS-secret replication. When the repo says "this happens automatically", a Kyverno policy is usually the actor.
|
|
||||||
_Avoid_: bare "policy" (overloaded with Vault, RBAC, NetworkPolicy).
|
|
||||||
|
|
||||||
**Critical-path Service**:
|
|
||||||
One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas ≥3, PDB enforced, monitored independently.
|
|
||||||
_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
|
|
||||||
|
|
||||||
**Namespace-owner**:
|
|
||||||
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
|
|
||||||
_Avoid_: bare "user", "tenant".
|
|
||||||
|
|
||||||
### Workstation (multi-user devvm)
|
|
||||||
|
|
||||||
**devvm**:
|
|
||||||
The dev VM (`10.0.10.10`), a non-cluster VM on the **PVE host** that hosts each person's Claude Code coding environment (the `t3-serve@<user>` and terminal-lobby sessions). Not a **Node** (it isn't in the cluster).
|
|
||||||
_Avoid_: calling it a "Node"; "host" (reserved for the PVE host).
|
|
||||||
|
|
||||||
**Workstation**:
|
|
||||||
A person's identity-scoped Claude Code environment on the **devvm** — one OS account, their session runs as that uid. The same human may also be a **Namespace-owner**; the cluster identity and the Workstation are two facets of one person.
|
|
||||||
_Avoid_: "t3 instance" (only one surface of a Workstation); bare "user".
|
|
||||||
|
|
||||||
**RBAC tier**:
|
|
||||||
The role band that governs a person everywhere — `kubernetes-admins` (Viktor; cluster-admin, secrets, apply), `kubernetes-power-users` (infra-aware, broad read, no destructive change), `kubernetes-namespace-owners` (own-namespace app dev). The single axis that keys both cluster RBAC **and** the **Workstation profile**.
|
|
||||||
_Avoid_: inventing per-service roles; conflating with **Namespace tier** / **State tier** (those are not identity).
|
|
||||||
|
|
||||||
**Workstation profile**:
|
|
||||||
The **RBAC tier**-keyed bundle a **Workstation** receives: **Config inheritance** (identical for everyone) plus the person's **Infra visibility** and cluster scope (varies by tier). Never hand-tuned per person — one identity decision (Authentik group + `k8s_users`) provisions the cluster facet and the Workstation together.
|
|
||||||
_Avoid_: per-person bespoke setup (the rejected "stitched-together" status quo).
|
|
||||||
|
|
||||||
**Config inheritance**:
|
|
||||||
The universal half of every **Workstation profile** — Viktor's *static* Claude config (skills, rules, agents, commands, `CLAUDE.md`, hooks) **live-extends** from a **Config base**, it is NOT copied: each person's `~/.claude` draws these from the shared base, so an edit Viktor makes appears in every Workstation immediately, with no seed/copy/sync step. Users may layer their own items on top (rarely do). **RBAC tier**-independent. Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, sessions) is never shared — local only.
|
|
||||||
_Avoid_: a periodic copy/seed/sync of `~/.claude` (rejected — inheritance must be live); sharing `~/.claude.json` / `.credentials.json` (per-user, secret-bearing, corrupts under concurrent writes — see emo's multi-session profile).
|
|
||||||
|
|
||||||
**Config base**:
|
|
||||||
The shared, secret-free, version-controlled source of truth for the *static* Claude config that every **Workstation** live-extends (see **Config inheritance**). Viktor's authoring surface — when he edits a skill/rule, he edits the base; the chezmoi dotfiles repo is its versioned form (commit = audit/rollback, NOT a push to users). Holds only skills/rules/agents/commands/`CLAUDE.md`/hooks — never secrets or per-user mutable state.
|
|
||||||
_Avoid_: treating it as a per-user seed target (it is a live shared source, not a copy); putting secrets in it.
|
|
||||||
|
|
||||||
**Infra visibility**:
|
|
||||||
What a non-admin **Workstation** may SEE of the infra: the public repo **code** and the person's own **RBAC**-scoped view of the live cluster (kubectl / dashboard within their namespaces). Explicitly excludes the **git-crypt** secrets (`terraform.tfvars`, `secrets/`) and any out-of-scope mutation. The boundary that "respect their permissions" enforces — violated today because `~/code` is one git-crypt-*unlocked* tree shared via the `code-shared` group.
|
|
||||||
_Avoid_: reading "see the infra" as access to secrets or apply rights.
|
|
||||||
|
|
||||||
### Networking
|
|
||||||
|
|
||||||
**Public domain**:
|
|
||||||
`viktorbarzin.me`, served through Cloudflare. DNS records are either **proxied** (Cloudflare CDN/WAF in front) or **non-proxied** (direct A/AAAA reachable via Cloudflared Tunnel).
|
|
||||||
_Avoid_: "external", "outside".
|
|
||||||
|
|
||||||
**Internal domain**:
|
|
||||||
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
|
|
||||||
_Avoid_: bare "lan", "private", "intranet".
|
|
||||||
|
|
||||||
**Ingress auth**:
|
|
||||||
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
|
|
||||||
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
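
The four `auth` values and the fail-closed default can be sketched as a small validation helper. This is an illustration only: `ingress_auth_mode` is a hypothetical shell function, not part of `ingress_factory`, which is an HCL module whose own validation is not shown in this document.

```bash
# Hypothetical lint helper: print the effective auth value, defaulting to
# "required" (fail-closed) when none is given, and reject anything else.
ingress_auth_mode() {
  local mode=${1:-required}
  case $mode in
    required|app|public|none)
      printf '%s\n' "$mode"
      ;;
    *)
      printf 'unknown auth value: %s\n' "$mode" >&2
      return 1
      ;;
  esac
}
```

Calling it with no argument yields `required`, which mirrors the documented default. Any value outside the four listed ones is rejected instead of silently falling back to a weaker setting.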
|
|
||||||
|
|
||||||
**Authentik outpost**:
|
|
||||||
A standalone Authentik deployment that terminates the proxy/auth flow for a specific binding model. The repo runs two distinct ones: the default outpost (used by `auth = "required"`) and the `public` outpost (anonymous binding, used by `auth = "public"`).
|
|
||||||
_Avoid_: conflating outpost with Authentik core; "Authentik instance".
|
|
||||||
|
|
||||||
**Cloudflared Tunnel**:
|
|
||||||
The channel by which non-proxied **public domain** traffic reaches the cluster, terminating at Traefik. Backs every `dns_type = "non-proxied"` record and is the fallback path for the wildcard `*.viktorbarzin.me`.
|
|
||||||
_Avoid_: "the tunnel" without "Cloudflared" (could mean Headscale).
|
|
||||||
|
|
||||||
**Ingress chain**:
|
|
||||||
The opinionated stack of Traefik middlewares that `ingress_factory` layers onto every Ingress. Slots, in order: forward-auth (per **Ingress auth**) → anti-AI scraping (default-on when no Authentik is in the path) → CrowdSec bouncer (fail-open) → retry (2× / 100ms) → rate-limit (429, not 503). Adding or removing a middleware is a Stack-level choice, but the chain order is convention.
|
|
||||||
_Avoid_: "middleware list", "Traefik chain". The Anubis PoW gate is upstream of this chain, not inside it.
|
|
||||||
|
|
||||||
**MetalLB / LB IP**:
|
|
||||||
The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Services. Two IPs matter: the **shared LB IP** `10.0.20.200` (~10 services — PG state-backend, headscale, wireguard, coturn, xray… — all `externalTrafficPolicy: Cluster`) and **Traefik's dedicated LB IP** `10.0.20.203` (`externalTrafficPolicy: Local`). Traefik runs on its own IP because ETP:Local preserves the **real client IP** (for CrowdSec) and enables QUIC, and MetalLB forbids mixed ETP on one shared IP.
|
|
||||||
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
|
|
||||||
|
|
||||||
**Calico**:
|
|
||||||
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
|
|
||||||
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
|
|
||||||
|
|
||||||
**Service identity**:
|
|
||||||
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
|
|
||||||
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
|
|
||||||
|
|
||||||
**Goldmane / Whisker**:
|
|
||||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
|
|
||||||
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
|
|
||||||
|
|
||||||
### Storage
|
|
||||||
|
|
||||||
**proxmox-lvm-encrypted**:
|
|
||||||
Default StorageClass for any workload holding sensitive data (databases, auth, password managers, email, financial data). LUKS2 over a Proxmox LVM-thin LV.
|
|
||||||
_Avoid_: bare "encrypted PVC" — name the StorageClass.
|
|
||||||
|
|
||||||
**proxmox-lvm**:
|
|
||||||
Block StorageClass for non-sensitive workloads (caches, monitoring data, indexes, app state without secrets).
|
|
||||||
|
|
||||||
**NFS volume**:
|
|
||||||
RWX file storage for shared media libraries, large datasets, or anything that needs to be inspected from outside K8s. Provisioned via the `nfs_volume` Module.
|
|
||||||
_Avoid_: "shared storage" (ambiguous).
|
|
||||||
|
|
||||||
**nfs-truenas StorageClass**:
|
|
||||||
A historical SC name retained only because StorageClass strings are immutable on bound PVs. The underlying server is the **PVE host**, not TrueNAS; TrueNAS is decommissioned.
|
|
||||||
_Avoid_: assuming this means TrueNAS.
|
|
||||||
|
|
||||||
**local-path**:
|
|
||||||
The cluster's Kubernetes default StorageClass (`rancher.io/local-path`) — node-local hostpath, **non-replicated**, no CSI snapshots, outside the backup pipeline. A PVC that omits `storageClassName` silently binds here, pinned to one Node's disk. Always set an explicit `storageClassName`; reach for local-path only for genuinely throwaway, node-pinned data.
|
|
||||||
_Avoid_: relying on the default. Note the two senses of "default": local-path is the *cluster default SC* (what an unspecified PVC gets); proxmox-lvm-encrypted is the *default choice* for sensitive data. Different things.
|
|
||||||
|
|
||||||
**3-2-1 backup**:
|
|
||||||
The named posture of where data lives: **Copy 1** = live on the PVE thin pool (sdc), **Copy 2** = sda backup disk (`/mnt/backup`), **Copy 3** = offsite Synology NAS. Per-PVC file-level rsync from LVM thin snapshots; databases additionally dump to NFS for per-DB restore.
|
|
||||||
_Avoid_: bare "backup" without saying which copy you mean (a service is "backed up" only once it's on Copy 2; Copy 3 is the disaster floor).
|
|
||||||
|
|
||||||
### Data
|
|
||||||
|
|
||||||
**CNPG** / **pg-cluster**:
|
|
||||||
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
|
|
||||||
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
|
|
||||||
|
|
||||||
### Secrets
|
|
||||||
|
|
||||||
**Vault path**:
|
|
||||||
Convention: `secret/<service>` for Service-owned secrets, `secret/viktor` for personal/global, `secret/platform` for cluster-wide maps (`k8s_users`, `homepage_credentials`).
|
|
||||||
_Avoid_: conflating Vault path (e.g. `secret/viktor`) with Vault field (e.g. `forgejo_pull_token`).
|
|
||||||
|
|
||||||
**ExternalSecret** / **ESO**:
|
|
||||||
A K8s manifest, reconciled by the External Secrets Operator (ESO), that materialises a Vault value as a K8s Secret. Two ClusterSecretStores: `vault-kv` (KV engine) and `vault-database` (rotating DB creds).
|
|
||||||
|
|
||||||
**Plan-time secret**:
|
|
||||||
A secret value read in Terraform via `data "kubernetes_secret"` (i.e. via the ESO-created K8s Secret) at plan time, with no Vault provider call. Distinct from a **vault data source** read (`data "vault_kv_secret_v2"`), which still goes through the Vault provider. A few Stacks remain hybrid (plan-time for env vars, vault data source for module inputs).
|
|
||||||
|
|
||||||
**Sealed Secret**:
|
|
||||||
A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinct from ExternalSecret — Sealed Secrets carry their own bytes, ExternalSecrets reference Vault.
|
|
||||||
|
|
||||||
### CI/CD
|
|
||||||
|
|
||||||
**GHA build + Woodpecker deploy**:
|
|
||||||
The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images.
|
|
||||||
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
|
|
||||||
|
|
||||||
**Canonical repo**:
|
|
||||||
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
|
|
||||||
_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
|
|
||||||
|
|
||||||
**GitHub mirror**:
|
|
||||||
The GitHub repo that a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it. Anything committed on the mirror is silently overwritten by the next sync. Enabling the mirror also **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on, or it is lost.
|
|
||||||
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
|
|
||||||
|
|
||||||
**GitHub-first repo**:
|
|
||||||
The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
|
|
||||||
_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
|
|
||||||
|
|
||||||
**Forgejo registry**:
|
|
||||||
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
|
|
||||||
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
|
|
||||||
|
|
||||||
**Keel**:
|
|
||||||
The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**).
|
|
||||||
_Avoid_: conflating with **Woodpecker deploy** (push-driven, fires on commit) or **Diun** (watches but only notifies). Never point Keel / `set image` at operator-managed StatefulSets.
|
|
||||||
|
|
||||||
**Diun**:
|
|
||||||
**Notify-only** image-update monitoring — reports that a newer image exists, never rolls anything (contrast **Keel**, which acts). Disabled on pinned images (MySQL, PostgreSQL, Redis) so version pins aren't nagged.
|
|
||||||
_Avoid_: expecting Diun to deploy; conflating with **Keel**.
|
|
||||||
|
|
||||||
**Anubis**:
|
|
||||||
A proof-of-work (PoW) reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (non-browser clients can't solve the PoW challenge).
|
|
||||||
|
|
||||||
## Relationships
|
|
||||||
|
|
||||||
- A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
|
|
||||||
- A **Namespace-owner** owns one or more namespaces and one or more public subdomains.
|
|
||||||
- A **Service** owns its **Vault path** at `secret/<service>`, surfaces values through **ExternalSecrets**, and reads them at plan time via **plan-time secrets**.
|
|
||||||
- An **Ingress** picks exactly one **Ingress auth** mode; the choice defines how strangers reach the backend.
|
|
||||||
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
|
|
||||||
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
|
|
||||||
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
|
|
||||||
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
|
|
||||||
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
|
|
||||||
|
|
||||||
## Example dialogue
|
|
||||||
|
|
||||||
> **Dev:** "I'm adding a new **Service** — FastAPI backend with its own JWT login. Do I need Authentik?"
|
|
||||||
> **Domain expert:** "If the FastAPI login is the gate, set `auth = "app"` on the ingress. That records the intent that you _chose_ not to layer Authentik — leave a one-line comment above stating what gates the Service, or `scripts/tg` will refuse the apply."
|
|
||||||
> **Dev:** "And storage?"
|
|
||||||
> **Domain expert:** "Does it hold user data? If yes, `proxmox-lvm-encrypted` — that's the default for anything sensitive. Add a backup CronJob writing to `/mnt/main/<service>-backup/`. If the data is just caches, plain `proxmox-lvm` is fine."
|
|
||||||
> **Dev:** "What about a Secret with the JWT signing key?"
|
|
||||||
> **Domain expert:** "Put the key in `secret/<service>` in Vault, then declare an **ExternalSecret** to materialise it as a K8s Secret. Read it at plan time with `data "kubernetes_secret"` — that keeps Vault out of the plan path."
|
|
||||||
|
|
||||||
## Flagged ambiguities
|
|
||||||
|
|
||||||
- **"tier"** has exactly two senses — always qualify which: *State tier* (Tier 0 / Tier 1, Terraform backend partition) and *Namespace tier* (`0-core`..`4-aux`, scheduling priority/quota). They are orthogonal axes. Do **not** coin new "tier"s: **Ingress auth** is a *mode* (not a tier), and storage speed (SSD vs HDD) is *not* a "tier" either.
|
|
||||||
- **"node"** can mean a K8s Node (default) or a PVE node. For Proxmox-level statements, say **PVE node** explicitly.
|
|
||||||
- **"service"** spans two distinct concepts: the deployed app (capitalised **Service**, this repo's domain noun) and the K8s `Service` object (in backticks or qualified "K8s Service"). Lowercase "service" in prose is fine when context disambiguates; flag it when it doesn't.
|
|
||||||
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
|
|
||||||
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
|
|
||||||
- **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine.
|
|
||||||
- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one.
|
|
||||||
|
|
@ -5,13 +5,10 @@ ARG TERRAFORM_VERSION=1.5.7
|
||||||
ARG TERRAGRUNT_VERSION=0.99.4
|
ARG TERRAGRUNT_VERSION=0.99.4
|
||||||
ARG SOPS_VERSION=3.9.4
|
ARG SOPS_VERSION=3.9.4
|
||||||
ARG KUBECTL_VERSION=1.34.0
|
ARG KUBECTL_VERSION=1.34.0
|
||||||
ARG VAULT_VERSION=1.18.1
|
|
||||||
|
|
||||||
# Install system packages (single layer).
|
# Install system packages (single layer)
|
||||||
# python3: required by scripts/check-ingress-auth-comments.py, invoked
|
|
||||||
# by scripts/tg before every plan/apply.
|
|
||||||
RUN apk add --no-cache \
|
RUN apk add --no-cache \
|
||||||
bash curl git git-crypt jq openssh-client openssl python3 unzip \
|
bash curl git git-crypt jq openssh-client openssl unzip \
|
||||||
&& rm -rf /var/cache/apk/*
|
&& rm -rf /var/cache/apk/*
|
||||||
|
|
||||||
# Terraform
|
# Terraform
|
||||||
|
|
@ -37,16 +34,6 @@ RUN curl -fsSL "https://dl.k8s.io/release/v${KUBECTL_VERSION}/bin/linux/amd64/ku
|
||||||
-o /usr/local/bin/kubectl \
|
-o /usr/local/bin/kubectl \
|
||||||
&& chmod +x /usr/local/bin/kubectl
|
&& chmod +x /usr/local/bin/kubectl
|
||||||
|
|
||||||
# Vault CLI — required by scripts/tg for Tier 1 stack PG credential reads
|
|
||||||
# and Tier 0 advisory locks. Pinned to server version (1.18.1). Without this
|
|
||||||
# the CI pipeline surfaces the misleading "Cannot read PG credentials" error
|
|
||||||
# because scripts/tg swallows stderr ("vault: not found").
|
|
||||||
RUN curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
|
|
||||||
-o /tmp/vault.zip \
|
|
||||||
&& unzip /tmp/vault.zip -d /usr/local/bin/ \
|
|
||||||
&& rm /tmp/vault.zip \
|
|
||||||
&& vault version
|
|
||||||
|
|
||||||
# Provider cache directory (shared across stacks)
|
# Provider cache directory (shared across stacks)
|
||||||
ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache
|
ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache
|
||||||
ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
|
ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
|
||||||
|
|
|
||||||
226
cli/README.md
|
|
@ -1,224 +1,2 @@
|
||||||
# homelab
|
# What is this?
|
||||||
|
This is a CLI that manipulates files in the Terraform repo, then commits and pushes them.
|
||||||
`homelab` is the unified, agent-facing CLI for operating this homelab — one
|
|
||||||
composable, JSON-capable surface for the operations agents run over and over,
|
|
||||||
discovered progressively at runtime. It is grown **in place** from this
|
|
||||||
directory (the former `infra-cli`), and the legacy webhook use-cases still work
|
|
||||||
(see below).
|
|
||||||
|
|
||||||
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
|
|
||||||
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
|
|
||||||
|
|
||||||
### v0.1 verbs — the infra inner-loop
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|
|
||||||
|---|---|---|
|
|
||||||
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
|
|
||||||
| `release <kind>:<name>` | write | release a presence claim |
|
|
||||||
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
|
|
||||||
| `tf validate <stack>` | read | `scripts/tg validate` |
|
|
||||||
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
|
|
||||||
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
|
|
||||||
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
|
|
||||||
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
|
|
||||||
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
|
|
||||||
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
|
|
||||||
|
|
||||||
### v0.2 verbs — Kubernetes
|
|
||||||
|
|
||||||
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
|
|
||||||
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
|
|
||||||
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
|
|
||||||
ambient kubeconfig.
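
The resolver defaults described above can be sketched in Go roughly as follows. This is a minimal illustration, not the CLI's actual code: `resolveTarget`, `logsArgs`, and the `target` type are hypothetical names, `grafana`/`monitoring` are placeholder values, and only the two documented defaults (namespace = app, target = `deploy/<app>`) plus the `-n` and `--pod` overrides are modelled.

```go
package main

import "fmt"

// target is a hypothetical holder for the resolved namespace and the
// kubectl resource reference (e.g. "deploy/grafana" or "pod/grafana-0").
type target struct {
	Namespace string
	Resource  string
}

// resolveTarget applies the documented defaults: the namespace defaults to
// the app name and the resource defaults to deploy/<app>. Non-empty
// overrides (the -n and --pod flags) replace the corresponding default.
func resolveTarget(app, nsOverride, podOverride string) target {
	resolved := target{Namespace: app, Resource: "deploy/" + app}
	if nsOverride != "" {
		resolved.Namespace = nsOverride
	}
	if podOverride != "" {
		resolved.Resource = "pod/" + podOverride
	}
	return resolved
}

// logsArgs builds the kubectl argument list for a logs call against the
// resolved target; kubectl itself picks a pod behind deploy/<app>.
func logsArgs(resolved target, tail int) []string {
	return []string{
		"-n", resolved.Namespace,
		"logs", resolved.Resource,
		fmt.Sprintf("--tail=%d", tail),
	}
}

func main() {
	// Placeholder app and namespace names, for illustration only.
	resolved := resolveTarget("grafana", "monitoring", "")
	fmt.Println(logsArgs(resolved, 200))
}
```

Keeping resolution separate from argument building means each verb (`logs`, `describe`, `exec`) can share one resolver and differ only in the kubectl subcommand it appends.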
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
|
|
||||||
|
|
||||||
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
|
|
||||||
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
|
|
||||||
|
|
||||||
`tf` resolves the stack dir by walking up from cwd to the infra root and
|
|
||||||
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
|
|
||||||
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
|
|
||||||
operations in the encrypted infra repo.
|
|
||||||
|
|
||||||
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
|
|
||||||
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
|
|
||||||
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
|
|
||||||
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
|
|
||||||
|
|
||||||
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
|
|
||||||
reads / prompt writes; v0.1 allows everything and relies on existing gates
|
|
||||||
(permission mode, presence claims, plan approval).
|
|
||||||
|
|
||||||
### v0.3 verbs — memory
|
|
||||||
|
|
||||||
A thin HTTP client over the **claude-memory** service (the same backend the
|
|
||||||
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
|
|
||||||
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
|
|
||||||
ingress). Because it hits the HTTP API directly, it **works even when the MCP
|
|
||||||
frontend is down**.
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|
|
||||||
|---|---|---|
|
|
||||||
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
|
|
||||||
| `memory list [--category --tag --limit]` | read | recent memories |
|
|
||||||
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
|
|
||||||
| `memory secret <id>` | read | reveal a sensitive memory's content |
|
|
||||||
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
|
|
||||||
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
|
|
||||||
| `memory delete <id>` | write | delete a memory |
|
|
||||||
|
|
||||||
All read/write paths are validated against the live API (incl. a
|
|
||||||
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
|
|
||||||
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
|
|
||||||
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
|
|
||||||
see `docs/adr/0008`.
|
|
||||||
|
|
||||||
### v0.4 verbs — ci / deploy
|
|
||||||
|
|
||||||
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
|
|
||||||
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
|
|
||||||
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
|
|
||||||
remote, with retries that ride out Woodpecker's intermittent empty responses.
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
|
|
||||||
|
|
||||||
`work land` now calls `ci watch` on the landed commit automatically (skip with
|
|
||||||
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
|
|
||||||
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
|
|
||||||
the least reliable; `status`/`watch` use the list endpoint that works.
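
The "image matches the sha" gate that `deploy wait` applies before checking rollout status might look something like the sketch below. It assumes, purely for illustration, that the image tag is a (possibly abbreviated) commit SHA; the real matching rule is not stated here, and `imageTag`, `imageMatchesSHA`, and the `example.test/app` image name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag extracts the tag from an image reference, ignoring any digest
// suffix. It returns "" when the reference carries no tag (a colon that
// belongs to a registry port, as in "registry:5000/app", is not a tag).
func imageTag(image string) string {
	if at := strings.Index(image, "@"); at >= 0 {
		image = image[:at]
	}
	colon := strings.LastIndex(image, ":")
	if colon < 0 || strings.Contains(image[colon+1:], "/") {
		return ""
	}
	return image[colon+1:]
}

// imageMatchesSHA reports whether the image tag and the commit SHA agree,
// allowing either one to be an abbreviation of the other (assumption).
func imageMatchesSHA(image, sha string) bool {
	tag := imageTag(image)
	if tag == "" || sha == "" {
		return false
	}
	return strings.HasPrefix(sha, tag) || strings.HasPrefix(tag, sha)
}

func main() {
	// A caller would poll the live Deployment image until this returns
	// true, and only then run `rollout status`.
	fmt.Println(imageMatchesSHA("example.test/app:abc1234", "abc1234def0"))
	fmt.Println(imageMatchesSHA("example.test/app:old", "abc1234def0"))
}
```

Checking the image first avoids the failure mode named above: `rollout status` can report success for the previous ReplicaSet if the new image has not been set yet.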
|
|
||||||
|
|
||||||
### v0.5 verbs — net / dns / metrics / logs
|
|
||||||
|
|
||||||
Reachability + observability probes. Their value is *endpoint resolution* — the
|
|
||||||
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
|
|
||||||
otherwise re-derive every time — not the HTTP call itself. All reach internal
|
|
||||||
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|
|
||||||
|---|---|---|
|
|
||||||
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
|
|
||||||
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
|
|
||||||
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
|
|
||||||
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
|
|
||||||
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
|
|
||||||
|
|
||||||
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
|
|
||||||
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
|
|
||||||
firing set is reachable via `ALERTS` instead.)
|
|
||||||
|
|
||||||
### v0.6 — usage telemetry (`usage top`)
|
|
||||||
|
|
||||||
Makes "which verbs are actually used, by everyone" a query instead of a guess —
|
|
||||||
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
|
|
||||||
|
|
||||||
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
|
|
||||||
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
|
|
||||||
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
|
|
||||||
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
|
|
||||||
the shared Loki, aggregate usage is queryable **without reading anyone's home** —
|
|
||||||
the privacy-preserving answer to "what does the team use."
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
|
|
||||||
|
|
||||||
### v0.7 verbs — Home Assistant
|
|
||||||
|
|
||||||
Cover exactly the two things the `ha` **MCP server can't**: resolving the
|
|
||||||
long-lived API token out of the cluster, and SSH to the HA host for host-level
|
|
||||||
work (config files, docker, add-ons). Entity state and control (`turn_on`,
|
|
||||||
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
|
|
||||||
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
|
|
||||||
the non-obvious *which secret, which host, which key, which flags* you'd
|
|
||||||
otherwise re-derive every session — agents were hand-rolling a
|
|
||||||
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
|
|
||||||
every run because the existing `home-assistant-sofia.py` needs an env var set
|
|
||||||
and a cwd-relative path, neither of which holds in an arbitrary session.
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|
|
||||||
|---|---|---|
|
|
||||||
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
|
|
||||||
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
|
|
||||||
|
|
||||||
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
|
|
||||||
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
|
|
||||||
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
|
|
||||||
not tied to whoever first wrote the workflow (the user's key must be enrolled on
|
|
||||||
the HA host).
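
One way to get the "deterministic non-interactive flags" behaviour is sketched below. The exact options `ha ssh` passes are not listed here, so the set chosen (`-F /dev/null`, `IdentitiesOnly`, `BatchMode`, `StrictHostKeyChecking=accept-new`) is an assumption built from standard OpenSSH options; `defaultKeyPath` and `buildSSHArgs` are hypothetical names.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// defaultKeyPath returns the invoking user's ed25519 key under their home
// directory, matching the per-user key resolution described above.
func defaultKeyPath(home string) string {
	return filepath.Join(home, ".ssh", "id_ed25519")
}

// buildSSHArgs assembles an ssh argument list that does not depend on the
// caller's ssh config and never stops to prompt.
func buildSSHArgs(keyPath, userHost string, cmd []string) []string {
	args := []string{
		"-i", keyPath, // explicit key
		"-F", "/dev/null", // ignore the user's ssh config
		"-o", "IdentitiesOnly=yes", // offer only the explicit key
		"-o", "BatchMode=yes", // never prompt for passwords
		"-o", "StrictHostKeyChecking=accept-new", // no first-contact prompt (assumed choice)
		userHost,
	}
	return append(args, cmd...)
}

func main() {
	// "/home/alice" is a placeholder; a real caller would use os.UserHomeDir.
	key := defaultKeyPath("/home/alice")
	fmt.Println(buildSSHArgs(key, "vbarzin@192.168.1.8", []string{"uptime"}))
}
```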
|
|
||||||
|
|
||||||
### v0.8 verbs — browser (headful anti-bot automation)
|
|
||||||
|
|
||||||
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
|
|
||||||
from the devvm over CDP, for sites that detect and block headless automation. The
|
|
||||||
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
|
|
||||||
the gated action (submit/login) silently fails — the motivating case was the
|
|
||||||
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
|
|
||||||
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
|
|
||||||
injects the same `stealth.js` the in-cluster callers use, and submits first try.
|
|
||||||
|
|
||||||
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
|
|
||||||
agent supplies the Playwright script — judgment stays out of the CLI.
|
|
||||||
|
|
||||||
| Command | Tier | What it does |
|
|
||||||
|---|---|---|
|
|
||||||
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
|
|
||||||
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
|
|
||||||
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
|
|
||||||
|
|
||||||
Default context is a **fresh incognito** one (closed on exit) — safe for the
|
|
||||||
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
|
|
||||||
reuses the warmed persistent profile when a pre-logged-in session is needed.
|
|
||||||
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
|
|
||||||
that gates in-cluster callers — no namespace label needed. The node CDP client is
|
|
||||||
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
|
|
||||||
(Chromium 130; protocol changes between minors) and is installed once, lazily,
|
|
||||||
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
|
|
||||||
runs on the devvm, `setInputFiles` streams local files to the remote browser over
|
|
||||||
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
|
|
||||||
and `docs/adr/0013`.
|
|
||||||
|
|
||||||
## Build / install
|
|
||||||
|
|
||||||
Built from source to `/usr/local/bin/homelab` during devvm provisioning
|
|
||||||
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
|
|
||||||
stamped from `cli/VERSION` via ldflags. Manual build:
|
|
||||||
|
|
||||||
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
|
|
||||||
|
|
||||||
## Legacy webhook use-cases (preserved)
|
|
||||||
|
|
||||||
This binary is also the in-cluster `infra-cli` image. Invocations starting with
|
|
||||||
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
|
|
||||||
original flag-based path unchanged, so the webhook handler is unaffected.
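
The fall-through described above amounts to a prefix check on the first argument before verb dispatch. A minimal sketch, with `isLegacyInvocation` as a hypothetical name and the two branches reduced to print statements:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isLegacyInvocation reports whether the command line starts with the
// legacy -use-case=<name> flag and should bypass the verb dispatcher.
func isLegacyInvocation(args []string) bool {
	return len(args) > 0 && strings.HasPrefix(args[0], "-use-case=")
}

func main() {
	if isLegacyInvocation(os.Args[1:]) {
		// Placeholder for the original flag-based code path.
		fmt.Println("legacy flag-based path")
		return
	}
	// Placeholder for the verb dispatcher (tf, k8s, memory, ...).
	fmt.Println("verb dispatcher")
}
```

Checking only the first argument keeps the two surfaces disjoint: no verb begins with a dash, so existing webhook invocations cannot be misread as verbs.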
|
|
||||||
|
|
||||||
## Design
|
|
||||||
|
|
||||||
See `infra/docs/adr/0004`–`0013` for the architecture decisions.
|
|
||||||
|
|
|
||||||
|
|
@ -1 +0,0 @@
|
||||||
v0.8.1
|
|
||||||
388
cli/browser.go
|
|
@ -1,388 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
_ "embed"
|
|
||||||
"encoding/json"
|
|
||||||
"fmt"
|
|
||||||
"io"
|
|
||||||
"net"
|
|
||||||
"net/http"
|
|
||||||
"os"
|
|
||||||
"os/exec"
|
|
||||||
"os/signal"
|
|
||||||
"path/filepath"
|
|
||||||
"strconv"
|
|
||||||
"strings"
|
|
||||||
"sync"
|
|
||||||
"syscall"
|
|
||||||
"time"
|
|
||||||
)
|
|
||||||
|
|
||||||
// playwrightVersion pins the node CDP client to the chrome-service image minor
|
|
||||||
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
|
|
||||||
// speaks the browser's CDP, so the client minor must track the server minor;
|
|
||||||
// see docs/architecture/chrome-service.md "Image pin".
|
|
||||||
const playwrightVersion = "1.48.2"
|
|
||||||
|
|
||||||
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
|
|
||||||
// endpoint to become ready before giving up.
|
|
||||||
const defaultBrowserTimeout = 60
|
|
||||||
|
|
||||||
const (
|
|
||||||
chromeServiceNamespace = "chrome-service"
|
|
||||||
chromeServiceName = "chrome-service"
|
|
||||||
chromeServiceCDPPort = 9222
|
|
||||||
)
|
|
||||||
|
|
||||||
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
|
|
||||||
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
|
|
||||||
// guards against drift.
|
|
||||||
//
|
|
||||||
//go:embed browser_stealth.js
|
|
||||||
var stealthJS string
|
|
||||||
|
|
||||||
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
|
|
||||||
// installs the stealth init script, and runs the user's Playwright script.
|
|
||||||
//
|
|
||||||
//go:embed browser_runner.js
|
|
||||||
var runnerJS string
|
|
||||||
|
|
||||||
// browserOpts is the parsed form of `homelab browser run|open` arguments.
|
|
||||||
type browserOpts struct {
|
|
||||||
mode string // "run" | "open"
|
|
||||||
script string // path to the user Playwright script (run mode)
|
|
||||||
url string // initial URL (run: optional; open: required positional)
|
|
||||||
sharedCtx bool // use the warmed persistent profile instead of a fresh context
|
|
||||||
keepOpen bool // leave the created context/pages open on exit
|
|
||||||
port int // explicit local port for the forward (0 = auto)
|
|
||||||
timeout int // CDP readiness timeout, seconds
|
|
||||||
help bool
|
|
||||||
}
|
|
||||||
|
|
||||||
// parseBrowserArgs parses the args after `browser run` / `browser open`.
|
|
||||||
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
|
|
||||||
o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
|
|
||||||
var positionals []string
|
|
||||||
atoi := func(s, flag string) (int, error) {
|
|
||||||
n, err := strconv.Atoi(s)
|
|
||||||
if err != nil {
|
|
||||||
return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
|
|
||||||
}
|
|
||||||
return n, nil
|
|
||||||
}
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "-h" || a == "--help":
|
|
||||||
o.help = true
|
|
||||||
case a == "--shared-context":
|
|
||||||
o.sharedCtx = true
|
|
||||||
case a == "--keep-open":
|
|
||||||
o.keepOpen = true
|
|
||||||
case a == "--url":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
o.url = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case strings.HasPrefix(a, "--url="):
|
|
||||||
o.url = strings.TrimPrefix(a, "--url=")
|
|
||||||
case a == "--port":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
n, err := atoi(args[i+1], "--port")
|
|
||||||
if err != nil {
|
|
||||||
return o, err
|
|
||||||
}
|
|
||||||
o.port = n
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case strings.HasPrefix(a, "--port="):
|
|
||||||
n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
|
|
||||||
if err != nil {
|
|
||||||
return o, err
|
|
||||||
}
|
|
||||||
o.port = n
|
|
||||||
case a == "--timeout":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
n, err := atoi(args[i+1], "--timeout")
|
|
||||||
if err != nil {
|
|
||||||
return o, err
|
|
||||||
}
|
|
||||||
o.timeout = n
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case strings.HasPrefix(a, "--timeout="):
|
|
||||||
n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
|
|
||||||
if err != nil {
|
|
||||||
return o, err
|
|
||||||
}
|
|
||||||
o.timeout = n
|
|
||||||
case strings.HasPrefix(a, "-"):
|
|
||||||
return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
|
|
||||||
default:
|
|
||||||
positionals = append(positionals, a)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if o.help {
|
|
||||||
return o, nil
|
|
||||||
}
|
|
||||||
switch mode {
|
|
||||||
case "run":
|
|
||||||
if len(positionals) == 0 {
|
|
||||||
return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
|
|
||||||
}
|
|
||||||
o.script = positionals[0]
|
|
||||||
case "open":
|
|
||||||
if len(positionals) == 0 {
|
|
||||||
return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
|
|
||||||
}
|
|
||||||
o.url = positionals[0]
|
|
||||||
}
|
|
||||||
return o, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
|
|
||||||
// a real (non-headless) Chrome — the entire reason chrome-service exists.
|
|
||||||
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
|
|
||||||
var v struct {
|
|
||||||
Browser string `json:"Browser"`
|
|
||||||
UserAgent string `json:"User-Agent"`
|
|
||||||
}
|
|
||||||
if e := json.Unmarshal(jsonBody, &v); e != nil {
|
|
||||||
return "", false, fmt.Errorf("parse /json/version: %w", e)
|
|
||||||
}
|
|
||||||
if v.Browser == "" {
|
|
||||||
return "", false, fmt.Errorf("/json/version had no Browser field")
|
|
||||||
}
|
|
||||||
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
|
|
||||||
!strings.Contains(v.Browser, "Headless") &&
|
|
||||||
!strings.Contains(v.UserAgent, "Headless")
|
|
||||||
return v.Browser, healthy, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
|
|
||||||
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
|
|
||||||
// NetworkPolicy that gates in-cluster callers.
|
|
||||||
func buildPortForwardArgs(localPort int) []string {
|
|
||||||
return []string{"-n", chromeServiceNamespace, "port-forward",
|
|
||||||
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
|
|
||||||
}
|
|
||||||
|
|
||||||
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
|
|
||||||
// client kept under the user cache dir.
|
|
||||||
func browserClientPackageJSON() string {
|
|
||||||
return fmt.Sprintf(`{
|
|
||||||
"name": "homelab-browser-client",
|
|
||||||
"private": true,
|
|
||||||
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
|
|
||||||
"dependencies": {
|
|
||||||
"playwright-core": "%s"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
`, playwrightVersion)
|
|
||||||
}
|
|
||||||
|
|
||||||
// freePort asks the kernel for an unused ephemeral TCP port.
|
|
||||||
func freePort() (int, error) {
|
|
||||||
l, err := net.Listen("tcp", "127.0.0.1:0")
|
|
||||||
if err != nil {
|
|
||||||
return 0, err
|
|
||||||
}
|
|
||||||
defer l.Close()
|
|
||||||
return l.Addr().(*net.TCPAddr).Port, nil
|
|
||||||
}
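The port-0 technique used by `freePort` has one caveat worth seeing in code: the port is released as soon as the listener closes, so another process could claim it before it is reused. The sketch below uses a hypothetical name, `pickFreePort`, and immediately rebinds the port to show where a bind failure would surface.

```go
package main

import (
	"fmt"
	"net"
)

// pickFreePort (hypothetical name) binds to port 0 so the kernel assigns an
// unused port, reads the assigned number back, then releases the listener.
func pickFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p, err := pickFreePort()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The port is free again here, so this rebind can race with other
	// processes; a caller should treat a failure as retryable.
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", p))
	if err != nil {
		fmt.Println("port was taken in the meantime:", err)
		return
	}
	defer l.Close()
	fmt.Println("rebound port", p)
}
```

In the original flow the port number is handed to `kubectl port-forward`, so a lost race would presumably show up as a port-forward startup failure in the captured log.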
|
|
||||||
|
|
||||||
// browserClientDir is where the pinned node client + managed runner files live.
|
|
||||||
func browserClientDir() (string, error) {
|
|
||||||
cache, err := os.UserCacheDir()
|
|
||||||
if err != nil || cache == "" {
|
|
||||||
home, herr := os.UserHomeDir()
|
|
||||||
if herr != nil {
|
|
||||||
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
|
|
||||||
}
|
|
||||||
cache = filepath.Join(home, ".cache")
|
|
||||||
}
|
|
||||||
return filepath.Join(cache, "homelab", "browser-client"), nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// installedPlaywrightVersion reads the version of the playwright-core already
|
|
||||||
// installed in dir, or "" if absent/unreadable.
|
|
||||||
func installedPlaywrightVersion(dir string) string {
|
|
||||||
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
|
|
||||||
if err != nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
var v struct {
|
|
||||||
Version string `json:"version"`
|
|
||||||
}
|
|
||||||
if json.Unmarshal(b, &v) != nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
return v.Version
|
|
||||||
}
|
|
||||||
|
|
||||||
// ensureBrowserClient writes the managed runner/stealth/package files into dir
|
|
||||||
// and lazily installs the pinned playwright-core (only when missing/mismatched),
|
|
||||||
// so no per-user setup is needed and the client tracks the binary version.
|
|
||||||
func ensureBrowserClient(dir string) error {
|
|
||||||
if err := os.MkdirAll(dir, 0o755); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
files := map[string]string{
|
|
||||||
"package.json": browserClientPackageJSON(),
|
|
||||||
"browser_runner.js": runnerJS,
|
|
||||||
"stealth.js": stealthJS,
|
|
||||||
}
|
|
||||||
for name, content := range files {
|
|
||||||
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if installedPlaywrightVersion(dir) == playwrightVersion {
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time; may take a few seconds)…\n", playwrightVersion)
|
|
||||||
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
|
|
||||||
cmd.Dir = dir
|
|
||||||
cmd.Stdout = os.Stderr
|
|
||||||
cmd.Stderr = os.Stderr
|
|
||||||
if err := cmd.Run(); err != nil {
|
|
||||||
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
|
|
||||||
}
|
|
||||||
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
|
|
||||||
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// waitForCDP polls the local CDP endpoint until it answers as a healthy
|
|
||||||
// (non-headless) Chrome, or the timeout elapses.
|
|
||||||
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
|
|
||||||
deadline := time.Now().Add(timeout)
|
|
||||||
client := &http.Client{Timeout: 3 * time.Second}
|
|
||||||
var lastErr error
|
|
||||||
for time.Now().Before(deadline) {
|
|
||||||
resp, err := client.Get(cdpURL + "/json/version")
|
|
||||||
if err != nil {
|
|
||||||
lastErr = err
|
|
||||||
time.Sleep(300 * time.Millisecond)
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
body, _ := io.ReadAll(resp.Body)
|
|
||||||
resp.Body.Close()
|
|
||||||
browser, healthy, herr := cdpHealthy(body)
|
|
||||||
if herr != nil {
|
|
||||||
lastErr = herr
|
|
||||||
time.Sleep(300 * time.Millisecond)
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
if !healthy {
|
|
||||||
return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
|
|
||||||
}
|
|
||||||
return browser, nil
|
|
||||||
}
|
|
||||||
if lastErr == nil {
|
|
||||||
lastErr = fmt.Errorf("timed out after %s", timeout)
|
|
||||||
}
|
|
||||||
return "", lastErr
|
|
||||||
}
|
|
||||||
|
|
||||||
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
|
|
||||||
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
|
|
||||||
func runBrowser(o browserOpts) error {
|
|
||||||
port := o.port
|
|
||||||
if port == 0 {
|
|
||||||
p, err := freePort()
|
|
||||||
if err != nil {
|
|
||||||
return fmt.Errorf("pick local port: %w", err)
|
|
||||||
}
|
|
||||||
port = p
|
|
||||||
}
|
|
||||||
|
|
||||||
dir, err := browserClientDir()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if err := ensureBrowserClient(dir); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
|
|
||||||
// Start the forward in its own process group so the whole tree dies on cleanup.
|
|
||||||
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
|
|
||||||
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
|
|
||||||
var pfLog strings.Builder
|
|
||||||
pf.Stdout = &pfLog
|
|
||||||
pf.Stderr = &pfLog
|
|
||||||
if err := pf.Start(); err != nil {
|
|
||||||
return fmt.Errorf("start kubectl port-forward (is kubectl installed and on PATH?): %w", err)
|
|
||||||
}
|
|
||||||
|
|
||||||
var once sync.Once
|
|
||||||
teardown := func() {
|
|
||||||
once.Do(func() {
|
|
||||||
if pf.Process != nil {
|
|
||||||
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
|
|
||||||
}
|
|
||||||
_ = pf.Wait()
|
|
||||||
})
|
|
||||||
}
|
|
||||||
defer teardown()
|
|
||||||
|
|
||||||
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
|
|
||||||
sigCh := make(chan os.Signal, 1)
|
|
||||||
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
|
|
||||||
defer signal.Stop(sigCh)
|
|
||||||
go func() {
|
|
||||||
if _, ok := <-sigCh; ok {
|
|
||||||
teardown()
|
|
||||||
os.Exit(130)
|
|
||||||
}
|
|
||||||
}()
|
|
||||||
|
|
||||||
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
|
|
||||||
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
|
|
||||||
if err != nil {
|
|
||||||
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
|
|
||||||
}
|
|
||||||
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
|
|
||||||
|
|
||||||
return runBrowserNode(dir, cdpURL, o)
|
|
||||||
}
|
|
||||||
|
|
||||||
// runBrowserNode invokes the managed node runner with inputs passed via env.
|
|
||||||
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
|
|
||||||
env := append(os.Environ(),
|
|
||||||
"HOMELAB_CDP_URL="+cdpURL,
|
|
||||||
"HOMELAB_BROWSER_MODE="+o.mode,
|
|
||||||
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
|
|
||||||
"NODE_PATH="+filepath.Join(dir, "node_modules"),
|
|
||||||
)
|
|
||||||
if o.url != "" {
|
|
||||||
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
|
|
||||||
}
|
|
||||||
if o.script != "" {
|
|
||||||
abs, err := filepath.Abs(o.script)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if _, err := os.Stat(abs); err != nil {
|
|
||||||
return fmt.Errorf("script %s: %w", o.script, err)
|
|
||||||
}
|
|
||||||
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
|
|
||||||
}
|
|
||||||
if o.sharedCtx {
|
|
||||||
env = append(env, "HOMELAB_BROWSER_SHARED=1")
|
|
||||||
}
|
|
||||||
if o.keepOpen {
|
|
||||||
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
|
|
||||||
}
|
|
||||||
if o.mode == "open" {
|
|
||||||
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
|
|
||||||
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
|
|
||||||
}
|
|
||||||
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
|
|
||||||
cmd.Env = env
|
|
||||||
cmd.Stdout = os.Stdout
|
|
||||||
cmd.Stderr = os.Stderr
|
|
||||||
cmd.Stdin = os.Stdin
|
|
||||||
return cmd.Run()
|
|
||||||
}
|
|
||||||
|
|
@ -1,106 +0,0 @@
|
||||||
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
|
|
||||||
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
|
|
||||||
// chrome-service CDP endpoint, installs the stealth init script, then runs the
|
|
||||||
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
|
|
||||||
// arrive via HOMELAB_* env vars set by the Go CLI.
|
|
||||||
'use strict';
|
|
||||||
const fs = require('fs');
|
|
||||||
const { chromium } = require('playwright-core');
|
|
||||||
|
|
||||||
async function main() {
|
|
||||||
const cdpURL = process.env.HOMELAB_CDP_URL;
|
|
||||||
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
|
|
||||||
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
|
|
||||||
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
|
|
||||||
const initURL = process.env.HOMELAB_BROWSER_URL || '';
|
|
||||||
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
|
|
||||||
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
|
|
||||||
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
|
|
||||||
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
|
|
||||||
|
|
||||||
const browser = await chromium.connectOverCDP(cdpURL);
|
|
||||||
|
|
||||||
// Fresh isolated context by default (safe for the shared browser + concurrent
|
|
||||||
// callers); --shared-context reuses the warmed persistent profile.
|
|
||||||
let context;
|
|
||||||
let createdContext = false;
|
|
||||||
if (shared) {
|
|
||||||
const existing = browser.contexts();
|
|
||||||
if (existing.length) {
|
|
||||||
context = existing[0];
|
|
||||||
} else {
|
|
||||||
context = await browser.newContext();
|
|
||||||
createdContext = true;
|
|
||||||
}
|
|
||||||
} else {
|
|
||||||
context = await browser.newContext();
|
|
||||||
createdContext = true;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (stealthPath) {
|
|
||||||
const stealth = fs.readFileSync(stealthPath, 'utf8');
|
|
||||||
if (stealth.trim()) await context.addInitScript(stealth);
|
|
||||||
}
|
|
||||||
|
|
||||||
const page = await context.newPage();
|
|
||||||
const log = (...a) => console.error('[browser]', ...a);
|
|
||||||
|
|
||||||
let exitCode = 0;
|
|
||||||
try {
|
|
||||||
if (initURL) {
|
|
||||||
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
|
|
||||||
}
|
|
||||||
if (mode === 'open') {
|
|
||||||
console.log('url: ' + page.url());
|
|
||||||
console.log('title: ' + (await page.title()));
|
|
||||||
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
|
|
||||||
console.log('--- visible text (truncated to 4000 chars) ---');
|
|
||||||
console.log(text.slice(0, 4000));
|
|
||||||
if (screenshotPath) {
|
|
||||||
await page.screenshot({ path: screenshotPath, fullPage: true });
|
|
||||||
console.log('screenshot: ' + screenshotPath);
|
|
||||||
}
|
|
||||||
} else {
|
|
||||||
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
|
|
||||||
const src = fs.readFileSync(scriptPath, 'utf8');
|
|
||||||
// Run the user's source with page/context/browser/log in lexical scope.
|
|
||||||
// AsyncFunction body permits top-level await.
|
|
||||||
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
|
|
||||||
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
|
|
||||||
const result = await fn(page, context, browser, log);
|
|
||||||
if (result !== undefined) {
|
|
||||||
let out;
|
|
||||||
try {
|
|
||||||
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
|
|
||||||
} catch (_) {
|
|
||||||
out = String(result);
|
|
||||||
}
|
|
||||||
console.log(out);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
} catch (e) {
|
|
||||||
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
|
|
||||||
exitCode = 1;
|
|
||||||
} finally {
|
|
||||||
if (!keepOpen) {
|
|
||||||
try {
|
|
||||||
// Close only what we created; never tear down the shared persistent context.
|
|
||||||
if (createdContext) {
|
|
||||||
await context.close();
|
|
||||||
} else {
|
|
||||||
await page.close();
|
|
||||||
}
|
|
||||||
} catch (_) { /* ignore */ }
|
|
||||||
}
|
|
||||||
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
|
|
||||||
try {
|
|
||||||
await browser.close();
|
|
||||||
} catch (_) { /* ignore */ }
|
|
||||||
}
|
|
||||||
process.exit(exitCode);
|
|
||||||
}
|
|
||||||
|
|
||||||
main().catch((e) => {
|
|
||||||
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
|
|
||||||
process.exit(1);
|
|
||||||
});
|
|
||||||
|
|
@ -1,54 +0,0 @@
|
||||||
// Minimal stealth init script for Playwright-driven Chromium.
|
|
||||||
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
|
|
||||||
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
|
|
||||||
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
|
|
||||||
// Run via context.add_init_script() so it executes before any page script.
|
|
||||||
(() => {
|
|
||||||
// navigator.webdriver — most common detection, removed entirely.
|
|
||||||
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
|
|
||||||
|
|
||||||
// window.chrome.runtime — many sites check that real Chrome exposes this.
|
|
||||||
if (!window.chrome) window.chrome = {};
|
|
||||||
window.chrome.runtime = window.chrome.runtime || {};
|
|
||||||
|
|
||||||
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
|
|
||||||
Object.defineProperty(navigator, 'plugins', {
|
|
||||||
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
|
|
||||||
});
|
|
||||||
|
|
||||||
// navigator.languages — headless returns empty array.
|
|
||||||
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
|
|
||||||
|
|
||||||
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
|
|
||||||
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
|
|
||||||
if (origQuery) {
|
|
||||||
window.navigator.permissions.query = (parameters) =>
|
|
||||||
parameters && parameters.name === 'notifications'
|
|
||||||
? Promise.resolve({ state: Notification.permission })
|
|
||||||
: origQuery(parameters);
|
|
||||||
}
|
|
||||||
|
|
||||||
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
|
|
||||||
const spoofGl = (proto) => {
|
|
||||||
if (!proto) return;
|
|
||||||
const orig = proto.getParameter;
|
|
||||||
proto.getParameter = function (parameter) {
|
|
||||||
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
|
|
||||||
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
|
|
||||||
return orig.apply(this, arguments);
|
|
||||||
};
|
|
||||||
};
|
|
||||||
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
|
|
||||||
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
|
|
||||||
|
|
||||||
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
|
|
||||||
// tag with `disable-devtool-auto`. Its Performance detector trips under
|
|
||||||
// Playwright (CDP adds console.log latency vs console.table) and the
|
|
||||||
// redirect URL is hard-coded — for hmembeds that's google.com.
|
|
||||||
// Hide the auto-init marker so the library's IIFE exits early.
|
|
||||||
const origQS = Document.prototype.querySelector;
|
|
||||||
Document.prototype.querySelector = function (sel) {
|
|
||||||
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
|
|
||||||
return origQS.apply(this, arguments);
|
|
||||||
};
|
|
||||||
})();
|
|
||||||
|
|
@ -1,117 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import "fmt"
|
|
||||||
|
|
||||||
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
|
|
||||||
// from outside the cluster, for sites that detect/block headless automation.
|
|
||||||
// The headless @playwright/mcp browser can load such sites but their gated
|
|
||||||
// actions (submit/login) silently fail; this path submits first try. Mechanics
|
|
||||||
// only — the agent supplies the Playwright script. See docs/adr/0013.
|
|
||||||
|
|
||||||
func browserCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"browser"}, Tier: TierRead,
|
|
||||||
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
|
|
||||||
{Path: []string{"browser", "run"}, Tier: TierWrite,
|
|
||||||
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
|
|
||||||
{Path: []string{"browser", "open"}, Tier: TierWrite,
|
|
||||||
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func browserTopHelp([]string) error {
|
|
||||||
fmt.Print(browserHelp())
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func browserRun(args []string) error {
|
|
||||||
o, err := parseBrowserArgs("run", args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if o.help {
|
|
||||||
fmt.Print(browserHelp())
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
return runBrowser(o)
|
|
||||||
}
|
|
||||||
|
|
||||||
func browserOpen(args []string) error {
|
|
||||||
o, err := parseBrowserArgs("open", args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if o.help {
|
|
||||||
fmt.Print(browserHelp())
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
return runBrowser(o)
|
|
||||||
}
|
|
||||||
|
|
||||||
// browserHelp carries the discoverability payload: WHEN to reach for this, and
|
|
||||||
// the diagnostic cheat-sheet that lets the agent self-correct instead of
|
|
||||||
// retrying a deterministic form blind (the failure mode that motivated this).
|
|
||||||
func browserHelp() string {
|
|
||||||
return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP
|
|
||||||
|
|
||||||
The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
|
|
||||||
Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
|
|
||||||
injects the same stealth.js the in-cluster callers use, and runs your script.
|
|
||||||
|
|
||||||
USAGE
|
|
||||||
homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
|
|
||||||
homelab browser open <url> [--shared-context] [--timeout S]
|
|
||||||
|
|
||||||
WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser
|
|
||||||
Default to the Playwright MCP / headless browser for ALL routine browsing and
|
|
||||||
automation — it's interactive (snapshot per step), fast to start, isolated.
|
|
||||||
Reach for THIS command ONLY when headless is demonstrably blocked: a site
|
|
||||||
LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins
|
|
||||||
forever, or ONE request errors while its siblings 200. That is the signature
|
|
||||||
of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
|
|
||||||
disable-devtool traps). It presents as a real Chrome and usually succeeds
|
|
||||||
first try — but it's the shared cluster browser (slower startup, one batch
|
|
||||||
run, no per-step feedback), so it's the escalation path, never the default.
|
|
||||||
|
|
||||||
ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
|
|
||||||
ERR_FILE_NOT_FOUND (-6) request intercepted/resolved locally by the
|
|
||||||
automation layer — NOT a network/egress problem.
|
|
||||||
(This is what silently broke the headless submit.)
|
|
||||||
ERR_CONNECTION_REFUSED / real egress failure (DNS/route/firewall). These also
|
|
||||||
ERR_TIMED_OUT / break the initial page load — if the page loaded,
|
|
||||||
ERR_NAME_NOT_RESOLVED egress is fine and the cause is elsewhere.
|
|
||||||
one endpoint 500s while server-side bot rejection of the automation, not
|
|
||||||
its siblings 200 your payload.
|
|
||||||
|
|
||||||
HABITS
|
|
||||||
- Inspect the network panel BEFORE retrying a deterministic form; a blind
|
|
||||||
retry just repeats the same silent failure.
|
|
||||||
- Don't park a half-filled multi-step form across a user pause — the session
|
|
||||||
can expire; re-run the whole flow from this command in one shot.
|
|
||||||
- Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging
|
|
||||||
of $HOME needed; just point setInputFiles at a local path.
|
|
||||||
|
|
||||||
CONTEXT
|
|
||||||
Default: a FRESH incognito context, closed on exit — safe for the shared
|
|
||||||
browser and concurrent callers (e.g. tripit). Your script does its own login.
|
|
||||||
--shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
|
|
||||||
noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
|
|
||||||
|
|
||||||
SCRIPT CONTRACT (run mode)
|
|
||||||
Your file's body runs with page, context, browser and log() already in scope
|
|
||||||
(top-level await allowed). Return a value to print it. Example flow.js:
|
|
||||||
|
|
||||||
await page.goto('https://portal.example.com/login');
|
|
||||||
await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
|
|
||||||
await page.click('button[type=submit]');
|
|
||||||
await page.waitForURL('**/dashboard');
|
|
||||||
return 'logged in: ' + page.url();
|
|
||||||
|
|
||||||
Run it: homelab browser run flow.js
|
|
||||||
|
|
||||||
NOTES
|
|
||||||
- The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
|
|
||||||
chrome-service image (Chrome 130); installed once under the user cache dir (e.g. ~/.cache/homelab/browser-client/).
|
|
||||||
- The port-forward is always torn down, on success and on error.
|
|
||||||
`
|
|
||||||
}
|
|
||||||
|
|
@ -1,172 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"os"
|
|
||||||
"reflect"
|
|
||||||
"strings"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func TestParseBrowserArgsRun(t *testing.T) {
|
|
||||||
got, err := parseBrowserArgs("run", []string{
|
|
||||||
"flow.js", "--url", "https://example.com", "--shared-context",
|
|
||||||
"--port", "19999", "--timeout", "45", "--keep-open",
|
|
||||||
})
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
|
|
||||||
}
|
|
||||||
want := browserOpts{
|
|
||||||
mode: "run", script: "flow.js", url: "https://example.com",
|
|
||||||
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
|
|
||||||
}
|
|
||||||
if !reflect.DeepEqual(got, want) {
|
|
||||||
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseBrowserArgsRunDefaults(t *testing.T) {
|
|
||||||
got, err := parseBrowserArgs("run", []string{"flow.js"})
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("unexpected err: %v", err)
|
|
||||||
}
|
|
||||||
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
|
|
||||||
t.Fatalf("defaults wrong: %+v", got)
|
|
||||||
}
|
|
||||||
if got.timeout != defaultBrowserTimeout {
|
|
||||||
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
|
|
||||||
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
|
|
||||||
t.Fatalf("run without a script path should error")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
|
|
||||||
got, err := parseBrowserArgs("open", []string{"https://example.com"})
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("unexpected err: %v", err)
|
|
||||||
}
|
|
||||||
if got.url != "https://example.com" || got.mode != "open" {
|
|
||||||
t.Fatalf("open parse wrong: %+v", got)
|
|
||||||
}
|
|
||||||
if _, err := parseBrowserArgs("open", []string{}); err == nil {
|
|
||||||
t.Fatalf("open without a URL should error")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseBrowserArgsHelp(t *testing.T) {
|
|
||||||
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
|
|
||||||
got, err := parseBrowserArgs("run", a)
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("help parse %v: %v", a, err)
|
|
||||||
}
|
|
||||||
if !got.help {
|
|
||||||
t.Fatalf("args %v should set help", a)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseBrowserArgsEqualsForm(t *testing.T) {
|
|
||||||
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("unexpected err: %v", err)
|
|
||||||
}
|
|
||||||
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
|
|
||||||
t.Fatalf("--flag=value form not parsed: %+v", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestCDPHealthy(t *testing.T) {
|
|
||||||
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
|
|
||||||
browser, ok, err := cdpHealthy(real)
|
|
||||||
if err != nil || !ok {
|
|
||||||
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
|
|
||||||
}
|
|
||||||
if !strings.HasPrefix(browser, "Chrome/") {
|
|
||||||
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
|
|
||||||
}
|
|
||||||
|
|
||||||
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
|
|
||||||
if _, ok, _ := cdpHealthy(headless); ok {
|
|
||||||
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
|
|
||||||
}
|
|
||||||
|
|
||||||
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
|
|
||||||
t.Fatalf("malformed /json/version body should error")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBuildPortForwardArgs(t *testing.T) {
|
|
||||||
got := buildPortForwardArgs(18080)
|
|
||||||
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
|
|
||||||
if !reflect.DeepEqual(got, want) {
|
|
||||||
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
|
|
||||||
pj := browserClientPackageJSON()
|
|
||||||
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
|
|
||||||
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
|
|
||||||
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
|
|
||||||
// client minor MUST match (protocol changes between minors).
|
|
||||||
if !strings.HasPrefix(playwrightVersion, "1.48.") {
|
|
||||||
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
|
|
||||||
h := browserHelp()
|
|
||||||
for _, want := range []string{
|
|
||||||
"homelab browser run",
|
|
||||||
"ERR_FILE_NOT_FOUND",
|
|
||||||
"ERR_CONNECTION_REFUSED",
|
|
||||||
"network panel",
|
|
||||||
"headless",
|
|
||||||
"--shared-context",
|
|
||||||
} {
|
|
||||||
if !strings.Contains(h, want) {
|
|
||||||
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBrowserHelpIsTiered(t *testing.T) {
|
|
||||||
// --help must frame this as the ESCALATION path (default to headless first),
|
|
||||||
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
|
|
||||||
// instructions. Guard against a regression to "co-equal choice" wording.
|
|
||||||
h := browserHelp()
|
|
||||||
for _, want := range []string{"Default to the", "escalation"} {
|
|
||||||
if !strings.Contains(h, want) {
|
|
||||||
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
|
|
||||||
// The embedded copy must never drift from the source of truth that the
|
|
||||||
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
|
|
||||||
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("read canonical stealth.js: %v", err)
|
|
||||||
}
|
|
||||||
if stealthJS != string(canonical) {
|
|
||||||
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestFreePortReturnsUsablePort(t *testing.T) {
|
|
||||||
p, err := freePort()
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("freePort: %v", err)
|
|
||||||
}
|
|
||||||
if p <= 1024 || p > 65535 {
|
|
||||||
t.Fatalf("freePort returned %d, want an ephemeral port", p)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@ -1,99 +0,0 @@
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func ciCommands() []Command {
	return []Command{
		{Path: []string{"ci", "status"}, Tier: TierRead,
			Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
		{Path: []string{"ci", "watch"}, Tier: TierRead,
			Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
	}
}

func short(s string) string {
	if len(s) > 8 {
		return s[:8]
	}
	return s
}

func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }

// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
	cwd, _ := os.Getwd()
	root, err := gitRepoRoot(cwd)
	if err != nil {
		return ""
	}
	sha, _ := gitOutput(root, "rev-parse", "HEAD")
	return sha
}

func ciStatus(args []string) error {
	commit, _ := firstPositional(args)
	c, err := newWPClient()
	if err != nil {
		return err
	}
	id, err := c.repoID()
	if err != nil {
		return err
	}
	p, err := c.findPipeline(id, commit)
	if err != nil {
		return err
	}
	fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
	return nil
}

func ciWatch(args []string) error {
	commit, _ := firstPositional(args)
	if commit == "" {
		commit = currentHEAD()
	}
	if commit == "" {
		return fmt.Errorf("no commit given and not in a git repo")
	}
	c, err := newWPClient()
	if err != nil {
		return err
	}
	id, err := c.repoID()
	if err != nil {
		return err
	}
	timeout := 20 * time.Minute
	deadline := time.Now().Add(timeout)
	last := ""
	for time.Now().Before(deadline) {
		p, err := c.findPipeline(id, commit)
		if err != nil {
			if last != "waiting" {
				fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
				last = "waiting"
			}
		} else {
			if p.Status != last {
				fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
				last = p.Status
			}
			if isTerminalStatus(p.Status) {
				fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
				if isFailureStatus(p.Status) {
					return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
				}
				return nil
			}
		}
		time.Sleep(15 * time.Second)
	}
	return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}
@ -1,56 +0,0 @@
package main

import (
	"fmt"
	"strings"
)

func claimCommands() []Command {
	return []Command{
		{Path: []string{"claim"}, Tier: TierWrite,
			Summary: "claim a shared infra resource on the presence board",
			Run:     runClaim},
		{Path: []string{"release"}, Tier: TierWrite,
			Summary: "release a presence claim",
			Run:     runRelease},
	}
}

// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
	var label, purpose string
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--purpose" || a == "-purpose":
			if i+1 < len(args) {
				purpose = args[i+1]
				i++
			}
		case strings.HasPrefix(a, "--purpose="):
			purpose = strings.TrimPrefix(a, "--purpose=")
		case !strings.HasPrefix(a, "-") && label == "":
			label = a
		}
	}
	if label == "" {
		return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
	}
	return presenceClaim(label, purpose)
}

func runRelease(args []string) error {
	var label string
	for _, a := range args {
		if !strings.HasPrefix(a, "-") {
			label = a
			break
		}
	}
	if label == "" {
		return fmt.Errorf("usage: homelab release <kind>:<name>")
	}
	return presenceRelease(label)
}
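The comment on `runClaim` notes that Go's standard `flag` package stops parsing at the first non-flag argument, which is why the label and `--purpose` are scanned by hand. A minimal sketch of that order-independent scan, with a hypothetical function name and made-up label values:

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabelAndPurpose scans args once and accepts the label and the purpose
// flag in either order. The first non-flag token is taken as the label.
func parseLabelAndPurpose(args []string) (label, purpose string) {
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--purpose":
			if i+1 < len(args) {
				purpose = args[i+1]
				i++ // skip the value so it is not mistaken for the label
			}
		case strings.HasPrefix(a, "--purpose="):
			purpose = strings.TrimPrefix(a, "--purpose=")
		case !strings.HasPrefix(a, "-") && label == "":
			label = a
		}
	}
	return label, purpose
}

func main() {
	// Hypothetical invocations: label first, then flag first.
	for _, args := range [][]string{
		{"node:example", "--purpose", "drain for upgrade"},
		{"--purpose=drain for upgrade", "node:example"},
	} {
		label, purpose := parseLabelAndPurpose(args)
		fmt.Printf("label=%q purpose=%q\n", label, purpose)
	}
}
```

Both invocations yield the same label and purpose. A purpose value that itself starts with `-` is still consumed correctly in the two-token form, because the value is taken by position rather than by shape.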
@ -1,51 +0,0 @@
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func deployCommands() []Command {
	return []Command{
		{Path: []string{"deploy", "wait"}, Tier: TierRead,
			Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
	}
}

// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
	target, _ := firstPositional(args)
	if target == "" || !strings.Contains(target, "/") {
		return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA] [--timeout 10m]")
	}
	parts := strings.SplitN(target, "/", 2)
	ns, deploy := parts[0], parts[1]

	sha := flagValue(args, "--sha")
	if sha == "" {
		sha = short(currentHEAD())
	}
	deadline := time.Now().Add(10 * time.Minute)

	if sha != "" {
		fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
		matched := false
		for time.Now().Before(deadline) {
			img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
			if strings.Contains(img, sha) {
				matched = true
				break
			}
			time.Sleep(10 * time.Second)
		}
		if !matched {
			return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
		}
	}
	fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
	return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}
172
cli/cmd_ha.go
@ -1,172 +0,0 @@
package main

import (
	"encoding/base64"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).

type haInstance struct {
	name      string // sofia | london
	sshUser   string // SSH login on the HA host
	sshHost   string // host reachable from the devvm (Sofia LAN)
	secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}

const (
	haDefaultInstance = "sofia"
	haSecretNamespace = "openclaw"
	haSecretName      = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)

// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
	"sofia":  {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
	"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}

func haCommands() []Command {
	return []Command{
		{Path: []string{"ha", "token"}, Tier: TierRead,
			Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
		{Path: []string{"ha", "ssh"}, Tier: TierWrite,
			Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
	}
}

// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
	if name == "" {
		name = haDefaultInstance
	}
	inst, ok := haInstances[name]
	if !ok {
		return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
	}
	return inst, nil
}

// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
	if err != nil {
		return "", fmt.Errorf("base64-decode secret value: %w", err)
	}
	return string(raw), nil
}

func haToken(args []string) error {
	name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
	for i := 0; i < len(args); i++ {
		if args[i] == "--instance" && i+1 < len(args) {
			name = args[i+1]
		} else if strings.HasPrefix(args[i], "--instance=") {
			name = strings.TrimPrefix(args[i], "--instance=")
		}
	}
	inst, err := resolveHAInstance(name)
	if err != nil {
		return err
	}
	b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
		"-o", "jsonpath={.data."+inst.secretKey+"}")
	if err != nil {
		return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
	}
	if b64 == "" {
		return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
	}
	tok, err := decodeSecretValue(b64)
	if err != nil {
		return err
	}
	fmt.Println(tok)
	return nil
}

// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
	if home, err := os.UserHomeDir(); err == nil && home != "" {
		return filepath.Join(home, ".ssh", "id_ed25519")
	}
	return filepath.Join("~", ".ssh", "id_ed25519")
}

// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
	name := haDefaultInstance
	keyPath = defaultHAKeyPath()
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--":
			remote = append(remote, args[i+1:]...)
			i = len(args)
		case a == "--instance":
			if i+1 < len(args) {
				name = args[i+1]
				i++
			}
		case strings.HasPrefix(a, "--instance="):
			name = strings.TrimPrefix(a, "--instance=")
		case a == "--key" || a == "-i":
			if i+1 < len(args) {
				keyPath = args[i+1]
				i++
			}
		case strings.HasPrefix(a, "--key="):
			keyPath = strings.TrimPrefix(a, "--key=")
		default:
			remote = append(remote, a)
		}
	}
	inst, err = resolveHAInstance(name)
	return inst, keyPath, remote, err
}

// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
	args := []string{
		"-F", "/dev/null",
		"-o", "IdentityFile=" + keyPath,
		"-o", "StrictHostKeyChecking=no",
		"-o", "UserKnownHostsFile=/dev/null",
		"-o", "ConnectTimeout=10",
		"-o", "BatchMode=yes",
		inst.sshUser + "@" + inst.sshHost,
	}
	return append(args, remote...)
}

func haSSH(args []string) error {
	inst, keyPath, remote, err := parseHASSH(args)
	if err != nil {
		return err
	}
	if len(remote) == 0 {
		return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
	}
	return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}
@ -1,92 +0,0 @@
package main

import (
	"encoding/base64"
	"reflect"
	"strings"
	"testing"
)

func TestResolveHAInstance(t *testing.T) {
	// empty defaults to sofia (the devvm sits on the Sofia LAN)
	if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
		t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
	}
	if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
		t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
	}
	if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
		t.Fatalf("london = %+v, %v", got, err)
	}
	if _, err := resolveHAInstance("paris"); err == nil {
		t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
	}
}

func TestDecodeSecretValue(t *testing.T) {
	// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
	// returns that base64, which decodeSecretValue turns back into the raw token.
	enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
	if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
		t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
	}
	// trailing whitespace/newline from jsonpath output must be tolerated
	if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
		t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
	}
	if _, err := decodeSecretValue("not-base64!!"); err == nil {
		t.Fatalf("decodeSecretValue should error on undecodable base64")
	}
}

func TestBuildHASSHArgs(t *testing.T) {
	inst, _ := resolveHAInstance("sofia")
	got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
	want := []string{
		"-F", "/dev/null",
		"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
		"-o", "StrictHostKeyChecking=no",
		"-o", "UserKnownHostsFile=/dev/null",
		"-o", "ConnectTimeout=10",
		"-o", "BatchMode=yes",
		"vbarzin@192.168.1.8",
		"cat", "/config/configuration.yaml",
	}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
	}
}

func TestParseHASSH(t *testing.T) {
	// instance flag + everything after `--` is the verbatim remote command
	inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
	if err != nil {
		t.Fatalf("parseHASSH err: %v", err)
	}
	if inst.name != "sofia" {
		t.Errorf("instance = %q, want sofia", inst.name)
	}
	if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
		t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
	}
	if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
		t.Errorf("remote = %v, want [docker ps -a]", remote)
	}

	// bare args (no `--`) are also taken as the remote command; -i overrides the key
	_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
	if err != nil {
		t.Fatalf("parseHASSH err: %v", err)
	}
	if key2 != "/tmp/k" {
		t.Errorf("key = %q, want /tmp/k", key2)
	}
	if !reflect.DeepEqual(remote2, []string{"uptime"}) {
		t.Errorf("remote = %v, want [uptime]", remote2)
	}

	// unknown instance surfaces as an error
	if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
		t.Errorf("parseHASSH should error on unknown instance")
	}
}
288
cli/cmd_k8s.go
@ -1,288 +0,0 @@
package main

import (
	"fmt"
	"os"
	"strings"
)

func k8sCommands() []Command {
	return []Command{
		{Path: []string{"k8s", "status"}, Tier: TierRead,
			Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
		{Path: []string{"k8s", "get"}, Tier: TierRead,
			Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
		{Path: []string{"k8s", "logs"}, Tier: TierRead,
			Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
		{Path: []string{"k8s", "describe"}, Tier: TierRead,
			Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
		{Path: []string{"k8s", "debug"}, Tier: TierRead,
			Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
		{Path: []string{"k8s", "pf"}, Tier: TierRead,
			Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
		{Path: []string{"k8s", "db"}, Tier: TierWrite,
			Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
		{Path: []string{"k8s", "exec"}, Tier: TierWrite,
			Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
		{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
			Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
		{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
			Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
		{Path: []string{"k8s", "restart"}, Tier: TierWrite,
			Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
		{Path: []string{"k8s", "probe"}, Tier: TierRead,
			Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
	}
}

func k8sStatus(args []string) error {
	t := parseK8sTarget(args)
	ns := t.namespace() // "" when no app/ns given → cluster-wide
	get := []string{"get", "pods", "-o", "wide"}
	ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
	if ns == "" {
		get = append(get, "-A")
		ev = append(ev, "-A")
	}
	if err := kubectlStream(ns, get...); err != nil {
		return err
	}
	fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
	_ = kubectlStream(ns, ev...) // best-effort
	return nil
}

func k8sGet(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" || len(t.rest) == 0 {
		return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
	}
	return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}

func k8sLogs(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
	}
	a := []string{"logs"}
	if t.selector != "" {
		a = append(a, "-l", t.selector)
	} else {
		a = append(a, t.objectRef())
	}
	if t.container != "" {
		a = append(a, "-c", t.container)
	}
	if !containsPrefix(t.rest, "--tail") {
		a = append(a, "--tail=200")
	}
	a = append(a, t.rest...)
	return kubectlStream(t.namespace(), a...)
}

func k8sDescribe(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
	}
	if len(t.rest) > 0 {
		return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
	}
	return kubectlStream(t.namespace(), "describe", t.objectRef())
}

func k8sDebug(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s debug <app>")
	}
	ns := t.namespace()
	sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
	sec("pods")
	_ = kubectlStream(ns, "get", "pods", "-o", "wide")
	sec("workloads")
	_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
	sec("describe "+t.objectRef())
	_ = kubectlStream(ns, "describe", t.objectRef())
	sec("recent logs (--tail=50)")
	_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
	sec("events (type!=Normal)")
	_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
	return nil
}

func k8sPortForward(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" || len(t.rest) == 0 {
		return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
	}
	ports := t.rest[0]
	target := "svc/" + t.app
	if len(t.rest) > 1 {
		target = t.rest[1]
	}
	return kubectlStream(t.namespace(), "port-forward", target, ports)
}

func k8sDB(args []string) error {
	var app, dbName, sql string
	mysql := false
	for i := 0; i < len(args); i++ {
		a := args[i]
		if a == "--" {
			sql = strings.Join(args[i+1:], " ")
			break
		}
		switch {
		case a == "--mysql":
			mysql = true
		case a == "--db":
			if i+1 < len(args) {
				dbName = args[i+1]
				i++
			}
		case strings.HasPrefix(a, "--db="):
			dbName = strings.TrimPrefix(a, "--db=")
		case !strings.HasPrefix(a, "-") && app == "":
			app = a
		}
	}
	if app == "" {
		return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
	}
	p := planDBExec(app, dbName, sql, mysql)
	pod := p.pod
	if pod == "" && p.selector != "" {
		resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
		if err != nil || resolved == "" {
			return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
		}
		pod = resolved
	}
	exec := []string{"exec"}
	if sql == "" {
		exec = append(exec, "-it") // interactive client when no SQL given
	}
	exec = append(exec, pod)
	if p.container != "" {
		exec = append(exec, "-c", p.container)
	}
	exec = append(exec, "--")
	exec = append(exec, p.argv...)
	return kubectlStream(p.ns, exec...)
}

func k8sExec(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
	}
	if len(t.rest) == 0 {
		return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
	}
	a := []string{"exec"}
	if t.tty {
		a = append(a, "-it")
	}
	a = append(a, t.objectRef())
	if t.container != "" {
		a = append(a, "-c", t.container)
	}
	a = append(a, "--")
	a = append(a, t.rest...)
	return kubectlStream(t.namespace(), a...)
}

func k8sRmPod(args []string) error {
	var pod, ns, grace string
	force, job := false, false
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "-n" || a == "--namespace":
			if i+1 < len(args) {
				ns = args[i+1]
				i++
			}
		case a == "--force":
			force = true
		case a == "--job":
			job = true
		case a == "--grace":
			if i+1 < len(args) {
				grace = args[i+1]
				i++
			}
		case !strings.HasPrefix(a, "-") && pod == "":
			pod = a
		}
	}
	if pod == "" || ns == "" {
		return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
	}
	kind := "pod"
	if job {
		kind = "job"
	}
	a := []string{"delete", kind, pod}
	if grace != "" {
		a = append(a, "--grace-period="+grace)
	}
	if force {
		a = append(a, "--force")
	}
	return kubectlStream(ns, a...)
}

func k8sRolloutStatus(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s rollout-status <app>")
	}
	return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}

func k8sRestart(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s restart <app>")
	}
	ns := t.namespace()
	if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
		return err
	}
	return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}

func k8sProbe(args []string) error {
	t := parseK8sTarget(args)
	if t.app == "" {
		return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
	}
	ns := t.namespace()
	url := "http://" + t.app + "." + ns + ".svc.cluster.local"
	if port := flagValue(args, "--port"); port != "" {
		url += ":" + port
	}
	if len(t.rest) > 0 {
		p := t.rest[0]
		if !strings.HasPrefix(p, "/") {
			p = "/" + p
		}
		url += p
	}
	return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
		"--image=curlimages/curl:latest", "--",
		"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}

// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
	for _, a := range args {
		if strings.HasPrefix(a, prefix) {
			return true
		}
	}
	return false
}
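`k8sProbe` above assembles an in-cluster Service URL from the app name, namespace, optional port and optional path before handing it to an ephemeral curl pod. The URL assembly can be sketched on its own as a pure function. `probeURL` and the example names are hypothetical, and it assumes the cluster uses the default `cluster.local` DNS domain, as the original string does.

```go
package main

import (
	"fmt"
	"strings"
)

// probeURL builds the in-cluster URL for a Service. port and path are
// optional; a path without a leading slash gets one added.
func probeURL(app, ns, port, path string) string {
	u := "http://" + app + "." + ns + ".svc.cluster.local"
	if port != "" {
		u += ":" + port
	}
	if path != "" {
		if !strings.HasPrefix(path, "/") {
			path = "/" + path
		}
		u += path
	}
	return u
}

func main() {
	// Hypothetical app and namespace names.
	fmt.Println(probeURL("grafana", "monitoring", "", ""))
	fmt.Println(probeURL("grafana", "monitoring", "3000", "api/health"))
}
```

Keeping the string assembly separate from the `kubectl run` call makes the slash and port handling testable without a cluster.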
|
|
@ -1,302 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"encoding/json"
|
|
||||||
"fmt"
|
|
||||||
"net/url"
|
|
||||||
"strings"
|
|
||||||
)
|
|
||||||
|
|
||||||
func memoryCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"memory", "recall"}, Tier: TierRead,
|
|
||||||
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
|
|
||||||
{Path: []string{"memory", "list"}, Tier: TierRead,
|
|
||||||
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
|
|
||||||
{Path: []string{"memory", "categories"}, Tier: TierRead,
|
|
||||||
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
|
|
||||||
{Path: []string{"memory", "tags"}, Tier: TierRead,
|
|
||||||
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
|
|
||||||
{Path: []string{"memory", "stats"}, Tier: TierRead,
|
|
||||||
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
|
|
||||||
{Path: []string{"memory", "secret"}, Tier: TierRead,
|
|
||||||
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
|
|
||||||
{Path: []string{"memory", "store"}, Tier: TierWrite,
|
|
||||||
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
|
|
||||||
{Path: []string{"memory", "update"}, Tier: TierWrite,
|
|
||||||
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
|
|
||||||
{Path: []string{"memory", "delete"}, Tier: TierWrite,
|
|
||||||
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
|
|
||||||
func printMemories(raw []byte, jsonOut bool) error {
|
|
||||||
if jsonOut {
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
Memories []struct {
|
|
||||||
ID int `json:"id"`
|
|
||||||
Content string `json:"content"`
|
|
||||||
Category string `json:"category"`
|
|
||||||
Tags string `json:"tags"`
|
|
||||||
Importance float64 `json:"importance"`
|
|
||||||
} `json:"memories"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal(raw, &r); err != nil {
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
if len(r.Memories) == 0 {
|
|
||||||
fmt.Println("(no memories)")
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
for _, m := range r.Memories {
|
|
||||||
c := strings.ReplaceAll(m.Content, "\n", " ")
|
|
||||||
// Truncate on rune boundaries so a multi-byte UTF-8 character is never split.
if r := []rune(c); len(r) > 240 {
c = string(r[:240]) + "…"
}
|
|
||||||
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
|
|
||||||
if m.Tags != "" {
|
|
||||||
fmt.Printf(" tags: %s\n", m.Tags)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
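Capping display text is safest on rune boundaries, because slicing a Go string by byte index can cut a multi-byte UTF-8 character in half. A minimal sketch of that idea follows; `truncateRunes` is a hypothetical helper name and does not appear in the original file.

```go
package main

import "fmt"

// truncateRunes is a hypothetical helper (not part of the original file).
// It caps s at max runes and appends an ellipsis only when text was cut.
func truncateRunes(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max]) + "…"
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 5)) // héllo…
	fmt.Println(truncateRunes("abc", 5))         // abc
}
```

Converting to `[]rune` costs one allocation per memory, which is negligible for a CLI that prints a handful of lines.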
|
|
||||||
|
|
||||||
func memoryRecall(args []string) error {
|
|
||||||
req := memRecallReq{}
|
|
||||||
jsonOut := false
|
|
||||||
var pos []string
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "--query":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.ExpandedQuery = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--category":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.Category = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--sort":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.SortBy = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--limit":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
if _, err := fmt.Sscanf(args[i+1], "%d", &req.Limit); err != nil {
return fmt.Errorf("bad --limit %q: %w", args[i+1], err)
}
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--json":
|
|
||||||
jsonOut = true
|
|
||||||
case !strings.HasPrefix(a, "-"):
|
|
||||||
pos = append(pos, a)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
req.Context = strings.Join(pos, " ")
|
|
||||||
if req.Context == "" {
|
|
||||||
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("POST", "/api/memories/recall", req)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return printMemories(raw, jsonOut)
|
|
||||||
}
|
|
||||||
|
|
||||||
func memoryList(args []string) error {
|
|
||||||
q := url.Values{}
|
|
||||||
jsonOut := false
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "--category":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
q.Set("category", args[i+1])
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--tag":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
q.Set("tag", args[i+1])
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--limit":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
q.Set("limit", args[i+1])
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--json":
|
|
||||||
jsonOut = true
|
|
||||||
}
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
path := "/api/memories"
|
|
||||||
if len(q) > 0 {
|
|
||||||
path += "?" + q.Encode()
|
|
||||||
}
|
|
||||||
raw, err := c.do("GET", path, nil)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return printMemories(raw, jsonOut)
|
|
||||||
}
|
|
||||||
|
|
||||||
func memorySimpleGet(path string) func([]string) error {
|
|
||||||
return func(args []string) error {
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("GET", path, nil)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func memorySecret(args []string) error {
|
|
||||||
id, _ := firstPositional(args)
|
|
||||||
if id == "" {
|
|
||||||
return fmt.Errorf("usage: homelab memory secret <id>")
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func memoryStore(args []string) error {
|
|
||||||
req := memStoreReq{Category: "facts", Importance: 0.5}
|
|
||||||
var pos []string
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "--category":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.Category = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--tags":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.Tags = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--keywords":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
req.ExpandedKeywords = args[i+1]
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--importance":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
if _, err := fmt.Sscanf(args[i+1], "%f", &req.Importance); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--sensitive":
|
|
||||||
req.ForceSensitive = true
|
|
||||||
case !strings.HasPrefix(a, "-"):
|
|
||||||
pos = append(pos, a)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
req.Content = strings.Join(pos, " ")
|
|
||||||
if req.Content == "" {
|
|
||||||
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("POST", "/api/memories", req)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func memoryUpdate(args []string) error {
|
|
||||||
var id string
|
|
||||||
req := memUpdateReq{}
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "--content":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
v := args[i+1]
|
|
||||||
req.Content = &v
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--tags":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
v := args[i+1]
|
|
||||||
req.Tags = &v
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--keywords":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
v := args[i+1]
|
|
||||||
req.ExpandedKeywords = &v
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case a == "--importance":
|
|
||||||
if i+1 < len(args) {
|
|
||||||
var f float64
|
|
||||||
if _, err := fmt.Sscanf(args[i+1], "%f", &f); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
|
|
||||||
req.Importance = &f
|
|
||||||
i++
|
|
||||||
}
|
|
||||||
case !strings.HasPrefix(a, "-") && id == "":
|
|
||||||
id = a
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if id == "" {
|
|
||||||
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func memoryDelete(args []string) error {
|
|
||||||
id, _ := firstPositional(args)
|
|
||||||
if id == "" {
|
|
||||||
return fmt.Errorf("usage: homelab memory delete <id>")
|
|
||||||
}
|
|
||||||
c, err := newMemoryClient()
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
fmt.Println(string(raw))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
@ -1,83 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"fmt"
	"strings"
	"time"
)
|
|
||||||
|
|
||||||
func netCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"net", "check"}, Tier: TierRead,
|
|
||||||
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
|
|
||||||
{Path: []string{"dns", "lookup"}, Tier: TierRead,
|
|
||||||
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func fmtProbe(code int, d time.Duration, err error) string {
|
|
||||||
if err != nil {
|
|
||||||
return "ERR " + err.Error()
|
|
||||||
}
|
|
||||||
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
|
|
||||||
}
|
|
||||||
|
|
||||||
func netCheck(args []string) error {
|
|
||||||
host, rest := firstPositional(args)
|
|
||||||
if host == "" {
|
|
||||||
return fmt.Errorf("usage: homelab net check <host> [path]")
|
|
||||||
}
|
|
||||||
path := "/"
|
|
||||||
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
|
|
||||||
path = rest[0]
|
|
||||||
if !strings.HasPrefix(path, "/") {
|
|
||||||
path = "/" + path
|
|
||||||
}
|
|
||||||
}
|
|
||||||
u := "https://" + host + path
|
|
||||||
fmt.Printf("%s\n", u)
|
|
||||||
|
|
||||||
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
|
|
||||||
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
|
|
||||||
if pubIP := firstLine(pubOut); pubIP != "" {
|
|
||||||
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
|
|
||||||
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
|
|
||||||
} else {
|
|
||||||
fmt.Println(" external (public) no public A record")
|
|
||||||
}
|
|
||||||
// internal leg: dial the Traefik LB directly
|
|
||||||
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
|
|
||||||
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func dnsLookup(args []string) error {
|
|
||||||
name, rest := firstPositional(args)
|
|
||||||
if name == "" {
|
|
||||||
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
|
|
||||||
}
|
|
||||||
rr := ""
|
|
||||||
if len(rest) > 0 {
|
|
||||||
rr = rest[0]
|
|
||||||
}
|
|
||||||
tech, _ := dig(name, "10.0.20.201", rr)
|
|
||||||
pub, _ := dig(name, "1.1.1.1", rr)
|
|
||||||
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
|
|
||||||
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
|
|
||||||
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
|
|
||||||
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func hostOnly(h string) string { // strip any path accidentally included
|
|
||||||
return strings.SplitN(h, "/", 2)[0]
|
|
||||||
}
|
|
||||||
|
|
||||||
func oneLineList(s string) string {
|
|
||||||
s = strings.TrimSpace(s)
|
|
||||||
if s == "" {
|
|
||||||
return "(none)"
|
|
||||||
}
|
|
||||||
return strings.ReplaceAll(s, "\n", ", ")
|
|
||||||
}
|
|
||||||
197
cli/cmd_obs.go
|
|
@ -1,197 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"encoding/json"
	"fmt"
	"net/url"
	"sort"
	"strconv"
	"strings"
	"time"
)
|
|
||||||
|
|
||||||
const (
	promHost = "prometheus-query.viktorbarzin.lan"
	lokiHost = "loki.viktorbarzin.lan"
)
|
|
||||||
|
|
||||||
func obsCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"metrics", "query"}, Tier: TierRead,
|
|
||||||
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
|
|
||||||
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
|
|
||||||
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
|
|
||||||
{Path: []string{"logs", "query"}, Tier: TierRead,
|
|
||||||
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
|
|
||||||
// passed as a single quoted argument; this also tolerates unquoted multi-token).
|
|
||||||
func queryArg(args []string, valueFlags map[string]bool) string {
|
|
||||||
var parts []string
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
if valueFlags[a] {
|
|
||||||
i++
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
if strings.HasPrefix(a, "-") {
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
parts = append(parts, a)
|
|
||||||
}
|
|
||||||
return strings.Join(parts, " ")
|
|
||||||
}
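The effect of the `valueFlags` set is easiest to see on a concrete argument list. The sketch below copies `queryArg` as shown above and runs it with and without the set; the sample LogQL selector is illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// queryArg mirrors the function above: it joins non-flag args into one query
// string and skips the value that follows any flag listed in valueFlags.
func queryArg(args []string, valueFlags map[string]bool) string {
	var parts []string
	for i := 0; i < len(args); i++ {
		a := args[i]
		if valueFlags[a] {
			i++ // skip the flag's value as well
			continue
		}
		if strings.HasPrefix(a, "-") {
			continue
		}
		parts = append(parts, a)
	}
	return strings.Join(parts, " ")
}

func main() {
	args := []string{`{app="traefik"}`, "--since", "2h", "--json"}
	// With the value-flag set, "2h" is consumed together with --since.
	fmt.Println(queryArg(args, map[string]bool{"--since": true, "--limit": true}))
	// Without it, "2h" is treated as part of the query text.
	fmt.Println(queryArg(args, nil))
}
```

One consequence of the prefix check is that any token starting with `-` is dropped, so a query that itself begins with a minus sign would need to be wrapped in parentheses or otherwise rewritten by the caller.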
|
|
||||||
|
|
||||||
func labelStr(m map[string]string) string {
|
|
||||||
name := m["__name__"]
|
|
||||||
var kv []string
|
|
||||||
for k, v := range m {
|
|
||||||
if k != "__name__" {
|
|
||||||
kv = append(kv, k+"="+v)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
sort.Strings(kv)
|
|
||||||
return name + "{" + strings.Join(kv, ",") + "}"
|
|
||||||
}
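Go randomises map iteration order, which is why `labelStr` sorts the pairs before joining them. A small sketch of the resulting output, using a copy of the function and made-up label values:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelStr mirrors the function above: metric name first, then the remaining
// labels as sorted key=value pairs so the output is stable across runs.
func labelStr(m map[string]string) string {
	name := m["__name__"]
	var kv []string
	for k, v := range m {
		if k != "__name__" {
			kv = append(kv, k+"="+v)
		}
	}
	sort.Strings(kv)
	return name + "{" + strings.Join(kv, ",") + "}"
}

func main() {
	// The label values here are illustrative, not taken from the cluster.
	fmt.Println(labelStr(map[string]string{
		"__name__": "up",
		"job":      "node",
		"instance": "10.0.20.5:9100",
	}))
	// up{instance=10.0.20.5:9100,job=node}
}
```

Note that values are printed unquoted, so the result is a compact display form and not valid PromQL selector syntax.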
|
|
||||||
|
|
||||||
func metricsQuery(args []string) error {
|
|
||||||
q := queryArg(args, nil)
|
|
||||||
if q == "" {
|
|
||||||
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
|
|
||||||
}
|
|
||||||
v := url.Values{}
|
|
||||||
v.Set("query", q)
|
|
||||||
body, err := lbGetBody(promHost, "/api/v1/query", v)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if containsArg(args, "--json") {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
Data struct {
|
|
||||||
Result []struct {
|
|
||||||
Metric map[string]string `json:"metric"`
|
|
||||||
Value []interface{} `json:"value"`
|
|
||||||
} `json:"result"`
|
|
||||||
} `json:"data"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal(body, &r); err != nil {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
if len(r.Data.Result) == 0 {
|
|
||||||
fmt.Println("(no series)")
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
for _, s := range r.Data.Result {
|
|
||||||
val := ""
|
|
||||||
if len(s.Value) == 2 {
|
|
||||||
val = fmt.Sprint(s.Value[1])
|
|
||||||
}
|
|
||||||
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func metricsAlerts(args []string) error {
|
|
||||||
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
|
|
||||||
// set is exposed as the synthetic ALERTS series, queryable the normal way.
|
|
||||||
v := url.Values{}
|
|
||||||
v.Set("query", `ALERTS{alertstate="firing"}`)
|
|
||||||
body, err := lbGetBody(promHost, "/api/v1/query", v)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if containsArg(args, "--json") {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
Data struct {
|
|
||||||
Result []struct {
|
|
||||||
Metric map[string]string `json:"metric"`
|
|
||||||
} `json:"result"`
|
|
||||||
} `json:"data"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal(body, &r); err != nil {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
if len(r.Data.Result) == 0 {
|
|
||||||
fmt.Println("(no firing alerts)")
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
for _, a := range r.Data.Result {
|
|
||||||
m := a.Metric
|
|
||||||
scope := ""
|
|
||||||
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
|
|
||||||
if v := m[k]; v != "" {
|
|
||||||
scope = k + "=" + v
|
|
||||||
break
|
|
||||||
}
|
|
||||||
}
|
|
||||||
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func logsQuery(args []string) error {
|
|
||||||
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
|
|
||||||
if q == "" {
|
|
||||||
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
|
|
||||||
}
|
|
||||||
since := flagValue(args, "--since")
|
|
||||||
if since == "" {
|
|
||||||
since = "1h"
|
|
||||||
}
|
|
||||||
dur, err := time.ParseDuration(since)
|
|
||||||
if err != nil {
|
|
||||||
return fmt.Errorf("bad --since %q: %w", since, err)
|
|
||||||
}
|
|
||||||
limit := flagValue(args, "--limit")
|
|
||||||
if limit == "" {
|
|
||||||
limit = "100"
|
|
||||||
}
|
|
||||||
end := time.Now()
|
|
||||||
v := url.Values{}
|
|
||||||
v.Set("query", q)
|
|
||||||
v.Set("limit", limit)
|
|
||||||
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
|
|
||||||
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
|
|
||||||
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if containsArg(args, "--json") {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
Data struct {
|
|
||||||
Result []struct {
|
|
||||||
Values [][]string `json:"values"`
|
|
||||||
} `json:"result"`
|
|
||||||
} `json:"data"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal(body, &r); err != nil {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
n := 0
|
|
||||||
for _, s := range r.Data.Result {
|
|
||||||
for _, val := range s.Values {
|
|
||||||
if len(val) == 2 {
|
|
||||||
fmt.Println(val[1])
|
|
||||||
n++
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if n == 0 {
|
|
||||||
fmt.Println("(no log lines)")
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
122
cli/cmd_tf.go
|
|
@ -1,122 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
	"strings"
	"sync"
	"syscall"
)
|
|
||||||
|
|
||||||
func tfCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"tf", "plan"}, Tier: TierRead,
|
|
||||||
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
|
|
||||||
{Path: []string{"tf", "validate"}, Tier: TierRead,
|
|
||||||
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
|
|
||||||
{Path: []string{"tf", "fmt"}, Tier: TierRead,
|
|
||||||
Summary: "terraform fmt a stack's files", Run: tfFmt},
|
|
||||||
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
|
|
||||||
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
|
|
||||||
{Path: []string{"tf", "apply"}, Tier: TierWrite,
|
|
||||||
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// firstPositional returns the first non-flag arg and the remaining args with it removed.
|
|
||||||
func firstPositional(args []string) (string, []string) {
|
|
||||||
for i, a := range args {
|
|
||||||
if !strings.HasPrefix(a, "-") {
|
|
||||||
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
|
|
||||||
return a, rest
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return "", args
|
|
||||||
}
|
|
||||||
|
|
||||||
// resolveTfStack finds the infra root (from cwd) and the stack directory named
|
|
||||||
// by the first positional arg, returning the remaining args.
|
|
||||||
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
|
|
||||||
stackName, rest = firstPositional(args)
|
|
||||||
if stackName == "" {
|
|
||||||
err = fmt.Errorf("missing <stack> argument")
|
|
||||||
return
|
|
||||||
}
|
|
||||||
cwd, e := os.Getwd()
|
|
||||||
if e != nil {
|
|
||||||
err = e
|
|
||||||
return
|
|
||||||
}
|
|
||||||
infraRoot, err = findInfraRoot(cwd)
|
|
||||||
if err != nil {
|
|
||||||
return
|
|
||||||
}
|
|
||||||
stackDir, err = resolveStack(infraRoot, stackName)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
|
|
||||||
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
|
|
||||||
|
|
||||||
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
|
|
||||||
func tfPassthrough(verb string) func([]string) error {
|
|
||||||
return func(args []string) error {
|
|
||||||
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func tfFmt(args []string) error {
|
|
||||||
_, _, stackDir, _, err := resolveTfStack(args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
|
|
||||||
}
|
|
||||||
|
|
||||||
func tfForceUnlock(args []string) error {
|
|
||||||
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if len(rest) < 1 {
|
|
||||||
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
|
|
||||||
}
|
|
||||||
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
|
|
||||||
}
|
|
||||||
|
|
||||||
// tfApply applies a stack out-of-band: claim the stack on the presence board,
|
|
||||||
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
|
|
||||||
// and warn that CI applies canonically on push.
|
|
||||||
func tfApply(args []string) error {
|
|
||||||
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
label := "stack:" + stackName
|
|
||||||
fmt.Fprintf(os.Stderr,
|
|
||||||
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
|
|
||||||
|
|
||||||
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
|
|
||||||
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
|
|
||||||
}
|
|
||||||
// Release exactly once, whether we exit normally, on error, or on signal —
|
|
||||||
// sync.Once makes the defer and the signal goroutine safe to both call it.
|
|
||||||
var once sync.Once
|
|
||||||
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
|
|
||||||
defer release()
|
|
||||||
|
|
||||||
sig := make(chan os.Signal, 1)
|
|
||||||
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
|
|
||||||
go func() {
|
|
||||||
<-sig
|
|
||||||
release()
|
|
||||||
os.Exit(130)
|
|
||||||
}()
|
|
||||||
|
|
||||||
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
|
|
||||||
}
|
|
||||||
|
|
@ -1,27 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"reflect"
	"testing"
)
|
|
||||||
|
|
||||||
func TestFirstPositional(t *testing.T) {
|
|
||||||
cases := []struct {
|
|
||||||
args []string
|
|
||||||
wantName string
|
|
||||||
wantRest []string
|
|
||||||
}{
|
|
||||||
{[]string{"vault"}, "vault", []string{}},
|
|
||||||
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
|
|
||||||
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
|
|
||||||
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
|
|
||||||
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
|
|
||||||
}
|
|
||||||
for _, c := range cases {
|
|
||||||
gotName, gotRest := firstPositional(c.args)
|
|
||||||
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
|
|
||||||
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
|
|
||||||
c.args, gotName, gotRest, c.wantName, c.wantRest)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,77 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"encoding/json"
	"fmt"
	"net/url"
	"sort"
	"strconv"
)
|
|
||||||
|
|
||||||
func usageCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"usage", "top"}, Tier: TierRead,
|
|
||||||
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// usageQuery builds the LogQL metric query that counts invocations per verb.
|
|
||||||
func usageQuery(since, user string) string {
|
|
||||||
sel := `job="` + usageJob + `"`
|
|
||||||
if user != "" {
|
|
||||||
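// Quote the value so a user name containing a double quote cannot break out of the selector.
sel += ", user=" + strconv.Quote(user)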
|
|
||||||
}
|
|
||||||
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
|
|
||||||
}
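To make the generated LogQL concrete, here is a sketch of the query builder with the label values quoted through `strconv.Quote`, which escapes embedded double quotes. The `usageJob` constant is defined elsewhere in the CLI; the value used below is a placeholder assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// usageJob lives elsewhere in the real CLI; this value is an assumed placeholder.
const usageJob = "homelab-usage"

// usageQuery builds a LogQL metric query counting invocations per verb,
// optionally narrowed to one user.
func usageQuery(since, user string) string {
	sel := "job=" + strconv.Quote(usageJob)
	if user != "" {
		sel += ", user=" + strconv.Quote(user)
	}
	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}

func main() {
	fmt.Println(usageQuery("30d", ""))
	fmt.Println(usageQuery("7d", "alice"))
}
```

For ordinary user names the output is identical to plain string concatenation; the quoting only matters when the value contains characters such as `"` or `\`.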
|
|
||||||
|
|
||||||
func usageTop(args []string) error {
|
|
||||||
since := flagValue(args, "--since")
|
|
||||||
if since == "" {
|
|
||||||
since = "30d"
|
|
||||||
}
|
|
||||||
v := url.Values{}
|
|
||||||
v.Set("query", usageQuery(since, flagValue(args, "--user")))
|
|
||||||
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if containsArg(args, "--json") {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
Data struct {
|
|
||||||
Result []struct {
|
|
||||||
Metric map[string]string `json:"metric"`
|
|
||||||
Value []interface{} `json:"value"`
|
|
||||||
} `json:"result"`
|
|
||||||
} `json:"data"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal(body, &r); err != nil {
|
|
||||||
fmt.Println(string(body))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
type row struct {
|
|
||||||
verb string
|
|
||||||
n int
|
|
||||||
}
|
|
||||||
var rows []row
|
|
||||||
for _, s := range r.Data.Result {
|
|
||||||
n := 0
|
|
||||||
if len(s.Value) == 2 {
|
|
||||||
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
|
|
||||||
n = int(f)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
rows = append(rows, row{s.Metric["verb"], n})
|
|
||||||
}
|
|
||||||
if len(rows) == 0 {
|
|
||||||
fmt.Println("(no usage recorded yet)")
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
|
|
||||||
for _, r := range rows {
|
|
||||||
fmt.Printf("%6d %s\n", r.n, r.verb)
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
663
cli/cmd_vault.go
|
|
@ -1,663 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
	"bufio"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
)
|
|
||||||
|
|
||||||
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
|
|
||||||
func vaultCommands() []Command {
|
|
||||||
return []Command{
|
|
||||||
{Path: []string{"vault", "setup"}, Tier: TierWrite,
|
|
||||||
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
|
|
||||||
{Path: []string{"vault", "status"}, Tier: TierRead,
|
|
||||||
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
|
|
||||||
{Path: []string{"vault", "list"}, Tier: TierRead,
|
|
||||||
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
|
|
||||||
{Path: []string{"vault", "get"}, Tier: TierRead,
|
|
||||||
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
|
|
||||||
{Path: []string{"vault", "search"}, Tier: TierRead,
|
|
||||||
Summary: "search your item names: vault search <query>", Run: vaultSearch},
|
|
||||||
{Path: []string{"vault", "code"}, Tier: TierRead,
|
|
||||||
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
|
|
||||||
{Path: []string{"vault", "lock"}, Tier: TierWrite,
|
|
||||||
Summary: "lock/log out the local bw session", Run: vaultLock},
|
|
||||||
{Path: []string{"vault"}, Tier: TierRead,
|
|
||||||
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
|
|
||||||
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// vaultHelp is shown for bare `homelab vault`.
|
|
||||||
func vaultHelp() string {
|
|
||||||
return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
|
|
||||||
|
|
||||||
homelab vault setup one-time: store your master password + API key in your Vault path
|
|
||||||
homelab vault status configured / unlocked / reachable (no secrets)
|
|
||||||
homelab vault list [--search Q] list your item names (no secrets)
homelab vault search <query> search your item names
|
|
||||||
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
|
|
||||||
TTY → clipboard (auto-clears); piped → stdout
|
|
||||||
homelab vault code <name> current TOTP code
|
|
||||||
homelab vault lock lock / log out the local bw session
|
|
||||||
|
|
||||||
Creds live only in your own Vault path; the admin never sees them. Identity is
|
|
||||||
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
|
|
||||||
(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
|
|
||||||
`
|
|
||||||
}
|
|
||||||
|
|
||||||
const vwUserPathPrefix = "secret/workstation/claude-users/"
|
|
||||||
|
|
||||||
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
|
|
||||||
type vwCreds struct {
	Email          string
	MasterPassword string
	ClientID       string
	ClientSecret   string
}
|
|
||||||
|
|
||||||
// cmdRunner shells out to an external command with an explicit environment and
|
|
||||||
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
|
|
||||||
// a fake; realRunner is the production implementation.
|
|
||||||
type cmdRunner func(name string, argv, envv []string) (string, error)
|
|
||||||
|
|
||||||
func realRunner(name string, argv, envv []string) (string, error) {
|
|
||||||
cmd := exec.Command(name, argv...)
|
|
||||||
if envv != nil {
|
|
||||||
cmd.Env = envv
|
|
||||||
}
|
|
||||||
out, err := cmd.Output()
|
|
||||||
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
|
|
||||||
// fetched secret with significant leading/trailing spaces is preserved.
|
|
||||||
return strings.TrimRight(string(out), "\r\n"), err
|
|
||||||
}
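The difference between trimming line endings and trimming all whitespace matters for secrets with significant spaces. A minimal sketch of the trimming step in isolation; `trimToolOutput` is a hypothetical name for what `realRunner` does inline.

```go
package main

import (
	"fmt"
	"strings"
)

// trimToolOutput is a hypothetical name for the trimming step in realRunner:
// it removes trailing CR/LF characters and leaves every other byte alone.
func trimToolOutput(out string) string {
	return strings.TrimRight(out, "\r\n")
}

func main() {
	raw := "  pass word  \n"
	fmt.Printf("%q\n", trimToolOutput(raw))    // "  pass word  "
	fmt.Printf("%q\n", strings.TrimSpace(raw)) // "pass word"
}
```

`strings.TrimRight` treats its second argument as a set of characters, so it strips every trailing `\r` and `\n`, not only the single newline the tool appends. A secret that legitimately ends in a newline would lose it.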
|
|
||||||
|
|
||||||
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
|
|
||||||
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
|
|
||||||
// processes). Used by setup to write the master password / client_secret.
|
|
||||||
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
|
|
||||||
cmd := exec.Command(name, argv...)
|
|
||||||
if envv != nil {
|
|
||||||
cmd.Env = envv
|
|
||||||
}
|
|
||||||
cmd.Stdin = strings.NewReader(stdin)
|
|
||||||
out, err := cmd.Output()
|
|
||||||
return strings.TrimRight(string(out), "\r\n"), err
|
|
||||||
}
|
|
||||||
|
|
||||||
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
|
|
||||||
|
|
||||||
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
|
|
||||||
|
|
||||||
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
|
|
||||||
func readVaultField(run cmdRunner, field, path string) string {
|
|
||||||
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
|
|
||||||
if err != nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
return out
|
|
||||||
}
|
|
||||||
|
|
||||||
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
|
|
||||||
// A missing master password means the user hasn't onboarded.
|
|
||||||
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
|
|
||||||
p := vwCredsPath(user)
|
|
||||||
c := vwCreds{
|
|
||||||
Email: readVaultField(run, "vaultwarden_email", p),
|
|
||||||
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
|
|
||||||
ClientID: readVaultField(run, "vaultwarden_client_id", p),
|
|
||||||
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
|
|
||||||
}
|
|
||||||
if c.MasterPassword == "" {
|
|
||||||
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
|
|
||||||
}
|
|
||||||
return c, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
|
|
||||||
var vaultCurrentUser = func() string { return os.Getenv("USER") }
|
|
||||||
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
|
|
||||||
|
|
||||||
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
|
|
||||||
// do NOT inherit the full parent env (keeps stray secrets out of the child).
|
|
||||||
func bwBaseEnv(appdata string) []string {
|
|
||||||
path := os.Getenv("PATH")
|
|
||||||
if path == "" {
|
|
||||||
path = "/usr/local/bin:/usr/bin:/bin"
|
|
||||||
}
|
|
||||||
return []string{
|
|
||||||
"PATH=" + path,
|
|
||||||
"HOME=" + os.Getenv("HOME"),
|
|
||||||
"BITWARDENCLI_APPDATA_DIR=" + appdata,
|
|
||||||
"BW_NOINTERACTION=true",
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
|
|
||||||
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
|
|
||||||
env := bwBaseEnv(appdata)
|
|
||||||
env = append(env,
|
|
||||||
"BW_CLIENTID="+c.ClientID,
|
|
||||||
"BW_CLIENTSECRET="+c.ClientSecret,
|
|
||||||
"BW_PASSWORD="+c.MasterPassword,
|
|
||||||
)
|
|
||||||
if session != "" {
|
|
||||||
env = append(env, "BW_SESSION="+session)
|
|
||||||
}
|
|
||||||
return env
|
|
||||||
}
|
|
||||||
|
|
||||||
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
|
|
||||||
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
|
|
||||||
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
|
|
||||||
func bwStatusArgs() []string { return []string{"status"} }
|
|
||||||
|
|
||||||
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
|
|
||||||
// required. Unparseable/empty output → true (safer to attempt login).
|
|
||||||
func bwNeedsLogin(statusJSON string) bool {
|
|
||||||
var s struct {
|
|
||||||
Status string `json:"status"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
|
|
||||||
return true
|
|
||||||
}
|
|
||||||
return s.Status == "unauthenticated" || s.Status == ""
|
|
||||||
}
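As a rough illustration of the fail-safe behaviour described above, the sketch below repeats the same parsing logic under the hypothetical name `needsLogin` and feeds it a few sample payloads. The payloads are trimmed-down assumptions; real `bw status` output carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// needsLogin repeats the logic of bwNeedsLogin: anything that cannot be
// parsed, or that lacks a status, is treated as "login required".
func needsLogin(statusJSON string) bool {
	var s struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
		return true
	}
	return s.Status == "unauthenticated" || s.Status == ""
}

func main() {
	samples := []string{
		`{"status":"unauthenticated"}`,
		`{"status":"locked"}`,
		`{"status":"unlocked"}`,
		`{}`,
		``,
		`not json`,
	}
	for _, in := range samples {
		fmt.Printf("%q -> needsLogin=%v\n", in, needsLogin(in))
	}
}
```

Only `locked` and `unlocked` skip the login step; every other input falls through to a login attempt.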
|
|
||||||
|
|
||||||
func bwListArgs(search string) []string {
|
|
||||||
a := []string{"list", "items"}
|
|
||||||
if search != "" {
|
|
||||||
a = append(a, "--search", search)
|
|
||||||
}
|
|
||||||
return a
|
|
||||||
}
|
|
||||||
|
|
||||||
// bwUnlock runs `bw unlock` and returns the raw session key.
|
|
||||||
func bwUnlock(run cmdRunner, env []string) (string, error) {
|
|
||||||
out, err := run("bw", bwUnlockArgs(), env)
|
|
||||||
if err != nil {
|
|
||||||
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
|
|
||||||
}
|
|
||||||
return out, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// bwGet fetches one field of one item; session must be present in env.
|
|
||||||
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
|
|
||||||
return run("bw", bwGetArgs(field, name), env)
|
|
||||||
}
|
|
||||||
|
|
||||||
func returnMode(isTTY bool) string {
|
|
||||||
if isTTY {
|
|
||||||
return "clipboard"
|
|
||||||
}
|
|
||||||
return "stdout"
|
|
||||||
}
|
|
||||||
|
|
||||||
// stdoutIsTTY reports whether stdout is a character device (a terminal).
|
|
||||||
func stdoutIsTTY() bool {
|
|
||||||
fi, err := os.Stdout.Stat()
|
|
||||||
if err != nil {
|
|
||||||
return false
|
|
||||||
}
|
|
||||||
return fi.Mode()&os.ModeCharDevice != 0
|
|
||||||
}
|
|
||||||
|
|
||||||
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
|
|
||||||
// to stderr, so the clipboard path is only viable when stderr is a terminal).
|
|
||||||
func stderrIsTTY() bool {
|
|
||||||
fi, err := os.Stderr.Stat()
|
|
||||||
if err != nil {
|
|
||||||
return false
|
|
||||||
}
|
|
||||||
return fi.Mode()&os.ModeCharDevice != 0
|
|
||||||
}
|
|
||||||
|
|
||||||
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
|
|
||||||
// the system clipboard (works over SSH; no X11). osc52clear sends an empty payload to blank the clipboard.
|
|
||||||
func osc52(payload string) string {
|
|
||||||
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
|
|
||||||
}
|
|
||||||
func osc52clear() string { return "\x1b]52;c;\a" }
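To make the escape format concrete, here is a small sketch that builds the sequence for a short payload and prints it in quoted form so the control bytes are visible rather than interpreted. It copies `osc52` as shown above; the payload `"hi"` is just an example.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// osc52 wraps the base64 of payload in ESC ] 52 ; c ; ... BEL.
func osc52(payload string) string {
	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}

func main() {
	// %q keeps the terminal from acting on the escape while inspecting it.
	fmt.Printf("%q\n", osc52("hi"))
}
```

The `c` selector targets the clipboard selection; the payload between the second `;` and the terminating BEL is plain standard base64.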
|
|
||||||
|
|
||||||
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
|
|
||||||
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
|
|
||||||
func terminalAllowed(term, termProgram string) bool {
|
|
||||||
t := strings.ToLower(term)
|
|
||||||
p := strings.ToLower(termProgram)
|
|
||||||
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
|
|
||||||
if strings.Contains(t, ok) || strings.Contains(p, ok) {
|
|
||||||
return true
|
|
||||||
}
|
|
||||||
}
|
|
||||||
// A bare xterm-* TERM is allowed only when TERM_PROGRAM matched a known-good emulator above.
|
|
||||||
return false
|
|
||||||
}
|
|
||||||
|
|
||||||
// opRecord is one CLI operation. ItemName is accepted for the caller's
|
|
||||||
// convenience but is INTENTIONALLY never rendered into the log line — auditing
|
|
||||||
// which of your own logins you opened is itself sensitive, and per-item reads
|
|
||||||
// are invisible server-side anyway (spec §9a).
|
|
||||||
type opRecord struct {
|
|
||||||
User string
|
|
||||||
Verb string
|
|
||||||
PID int
|
|
||||||
PPID int
|
|
||||||
ParentComm string
|
|
||||||
ItemName string // never logged
|
|
||||||
}
|
|
||||||
|
|
||||||
func opLogLine(r opRecord) string {
|
|
||||||
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
|
|
||||||
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
|
|
||||||
}
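A short sketch of the privacy property described in the `opRecord` comment: the item name travels in the struct but never reaches the rendered line. The types are copied from above in condensed form; the sample values are illustrative.

```go
package main

import "fmt"

type opRecord struct {
	User, Verb string
	PID, PPID  int
	ParentComm string
	ItemName   string // carried by the caller, never rendered
}

// opLogLine formats every field except ItemName.
func opLogLine(r opRecord) string {
	return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
		r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}

func main() {
	r := opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9,
		ParentComm: "claude", ItemName: "Chase Bank"}
	fmt.Println(opLogLine(r)) // user=emo verb=get pid=10 ppid=9 parent=claude
}
```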
|
|
||||||
|
|
||||||
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
|
|
||||||
func parentComm(ppid int) string {
|
|
||||||
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
|
|
||||||
if err != nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
return strings.TrimSpace(string(b))
|
|
||||||
}
|
|
||||||
|
|
||||||
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
|
|
||||||
// never blocks or fails the command). Goes to syslog so it ships to Loki.
|
|
||||||
func writeOpLog(r opRecord) {
|
|
||||||
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
|
|
||||||
|
|
||||||
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
|
|
||||||
// password to a core file. Best-effort.
|
|
||||||
func hardenProcess() {
|
|
||||||
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
|
|
||||||
}
|
|
||||||
|
|
||||||
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
|
|
||||||
// as the same user otherwise race bw's appdata). Returns an unlock func.
|
|
||||||
func withUserLock(uid string) (func(), error) {
|
|
||||||
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
|
|
||||||
f.Close()
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
|
|
||||||
type session struct {
|
|
||||||
env []string
|
|
||||||
}
|
|
||||||
|
|
||||||
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
|
|
||||||
// Caller must hold the user lock. appdata is created on tmpfs (0700).
|
|
||||||
func openSession(run cmdRunner, user, uid string) (session, error) {
|
|
||||||
creds, err := loadCreds(run, user)
|
|
||||||
if err != nil {
|
|
||||||
return session{}, err
|
|
||||||
}
|
|
||||||
appdata := bwAppDataDir(uid)
|
|
||||||
if err := os.MkdirAll(appdata, 0700); err != nil {
|
|
||||||
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
|
|
||||||
}
|
|
||||||
loginEnv := bwSecretEnv(appdata, creds, "")
|
|
||||||
// Ensure server is set and we're logged in (idempotent; ignore "already").
|
|
||||||
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
|
|
||||||
st, _ := run("bw", bwStatusArgs(), loginEnv)
|
|
||||||
if bwNeedsLogin(st) {
|
|
||||||
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
|
|
||||||
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
sess, err := bwUnlock(run, loginEnv)
|
|
||||||
if err != nil {
|
|
||||||
return session{}, err
|
|
||||||
}
|
|
||||||
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
type getOpts struct {
|
|
||||||
name string
|
|
||||||
field string
|
|
||||||
json bool
|
|
||||||
}
|
|
||||||
|
|
||||||
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
|
|
||||||
|
|
||||||
func parseGetArgs(args []string) (getOpts, error) {
|
|
||||||
o := getOpts{field: "password"}
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
a := args[i]
|
|
||||||
switch {
|
|
||||||
case a == "--json":
|
|
||||||
o.json = true
|
|
||||||
case a == "--field" && i+1 < len(args):
|
|
||||||
o.field = args[i+1]
|
|
||||||
i++
|
|
||||||
case strings.HasPrefix(a, "--field="):
|
|
||||||
o.field = strings.TrimPrefix(a, "--field=")
|
|
||||||
case !strings.HasPrefix(a, "-") && o.name == "":
|
|
||||||
o.name = a
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if o.name == "" {
|
|
||||||
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
|
|
||||||
}
|
|
||||||
if !validGetFields[o.field] {
|
|
||||||
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
|
|
||||||
}
|
|
||||||
return o, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// getValue opens a session and fetches one field. Apart from creating the
|
|
||||||
// appdata dir, all I/O goes through the runner, so it can be unit-tested with a fake runner.
|
|
||||||
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
|
|
||||||
s, err := openSession(run, user, uid)
|
|
||||||
if err != nil {
|
|
||||||
return "", err
|
|
||||||
}
|
|
||||||
return bwGet(run, s.env, o.field, o.name)
|
|
||||||
}
|
|
||||||
|
|
||||||
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
|
|
||||||
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
|
|
||||||
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
|
|
||||||
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
|
|
||||||
// non-terminal stderr).
|
|
||||||
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
|
|
||||||
if !stdoutTTY {
|
|
||||||
return "stdout"
|
|
||||||
}
|
|
||||||
if terminalAllowed(term, termProgram) && stderrTTY {
|
|
||||||
return "clipboard"
|
|
||||||
}
|
|
||||||
return "refuse"
|
|
||||||
}
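The three outcomes are easier to see as a small decision table. The sketch below copies `terminalAllowed` and `clipboardDecision` from above and evaluates a few illustrative combinations of TTY state and `TERM` / `TERM_PROGRAM` values.

```go
package main

import (
	"fmt"
	"strings"
)

func terminalAllowed(term, termProgram string) bool {
	t := strings.ToLower(term)
	p := strings.ToLower(termProgram)
	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
		if strings.Contains(t, ok) || strings.Contains(p, ok) {
			return true
		}
	}
	return false
}

func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
	if !stdoutTTY {
		return "stdout" // piped: the machine path
	}
	if terminalAllowed(term, termProgram) && stderrTTY {
		return "clipboard"
	}
	return "refuse"
}

func main() {
	cases := []struct {
		out, err   bool
		term, prog string
	}{
		{false, false, "dumb", ""},          // piped to another process
		{true, true, "xterm-kitty", ""},     // allowlisted terminal
		{true, false, "xterm-kitty", ""},    // stderr redirected away
		{true, true, "xterm-256color", ""},  // unknown emulator
		{true, true, "xterm-256color", "WezTerm"},
	}
	for _, c := range cases {
		fmt.Printf("stdoutTTY=%-5v stderrTTY=%-5v TERM=%-15q TERM_PROGRAM=%-9q -> %s\n",
			c.out, c.err, c.term, c.prog, clipboardDecision(c.out, c.err, c.term, c.prog))
	}
}
```

The ordering matters: the pipe check comes first, so a non-terminal stdout always gets the value regardless of which emulator is in use.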
|
|
||||||
|
|
||||||
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
|
|
||||||
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
|
|
||||||
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
|
|
||||||
|
|
||||||
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
|
|
||||||
// secret to a terminal's stdout/scrollback.
|
|
||||||
func emitSecret(value string) {
|
|
||||||
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
|
|
||||||
case "stdout":
|
|
||||||
fmt.Println(value)
|
|
||||||
case "clipboard":
|
|
||||||
fmt.Fprint(os.Stderr, osc52(value))
|
|
||||||
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
|
|
||||||
clearClipboardAfter(30)
|
|
||||||
default: // refuse
|
|
||||||
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// clearClipboardAfter spawns a detached background clear so the secret doesn't
|
|
||||||
// linger in the clipboard. Best-effort.
|
|
||||||
func clearClipboardAfter(seconds int) {
|
|
||||||
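cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear()))
|
|
||||||
cmd.Stdout = os.Stderr // a nil Stdout is wired to /dev/null, so the clear escape would never reach the terminal
|
|
||||||
_ = cmd.Start() // best-effort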
|
|
||||||
}
|
|
||||||
|
|
||||||
// listNames extracts "name (id)" from `bw list items` JSON; never values.
|
|
||||||
func listNames(jsonOut string) []string {
|
|
||||||
var items []struct {
|
|
||||||
ID string `json:"id"`
|
|
||||||
Name string `json:"name"`
|
|
||||||
}
|
|
||||||
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
out := make([]string, 0, len(items))
|
|
||||||
for _, it := range items {
|
|
||||||
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
|
|
||||||
}
|
|
||||||
return out
|
|
||||||
}
|
|
||||||
|
|
||||||
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
|
|
||||||
s, err := openSession(run, user, uid)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
out, err := run("bw", bwListArgs(search), s.env)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
return listNames(out), nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultList(args []string) error {
|
|
||||||
hardenProcess()
|
|
||||||
search := ""
|
|
||||||
for i := 0; i < len(args); i++ {
|
|
||||||
if args[i] == "--search" && i+1 < len(args) {
|
|
||||||
search = args[i+1]
|
|
||||||
i++
|
|
||||||
} else if strings.HasPrefix(args[i], "--search=") {
|
|
||||||
search = strings.TrimPrefix(args[i], "--search=")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
for _, n := range names {
|
|
||||||
fmt.Println(n)
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultSearch(args []string) error {
|
|
||||||
if len(args) == 0 {
|
|
||||||
return fmt.Errorf("usage: homelab vault search <query>")
|
|
||||||
}
|
|
||||||
return vaultList([]string{"--search", strings.Join(args, " ")})
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultCode(args []string) error {
|
|
||||||
hardenProcess()
|
|
||||||
if len(args) == 0 {
|
|
||||||
return fmt.Errorf("usage: homelab vault code <name>")
|
|
||||||
}
|
|
||||||
name := args[0]
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
user := vaultCurrentUser()
|
|
||||||
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
|
|
||||||
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
|
|
||||||
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
|
|
||||||
emitSecret(val)
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// statusSummary reports config/reachability without revealing secrets.
|
|
||||||
func statusSummary(run cmdRunner, user, uid string) string {
|
|
||||||
if _, err := loadCreds(run, user); err != nil {
|
|
||||||
return "vault: not configured — run `homelab vault setup`"
|
|
||||||
}
|
|
||||||
s, err := openSession(run, user, uid)
|
|
||||||
if err != nil {
|
|
||||||
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
|
|
||||||
}
|
|
||||||
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
|
|
||||||
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
|
|
||||||
}
|
|
||||||
return "vault: configured, unlocked, reachable ✓"
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultStatus(args []string) error {
|
|
||||||
hardenProcess()
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultLock(args []string) error {
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
appdata := bwAppDataDir(uid)
|
|
||||||
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
|
|
||||||
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
|
|
||||||
if logoutErr == nil {
|
|
||||||
fmt.Println("locked")
|
|
||||||
}
|
|
||||||
return nil // lock/logout best-effort; never error the caller
|
|
||||||
}
|
|
||||||
|
|
||||||
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
|
|
||||||
// email nor the API client_id is a usable credential on its own.
|
|
||||||
func vaultPatchPublicArgs(user, email, clientID string) []string {
|
|
||||||
return []string{"kv", "patch", vwCredsPath(user),
|
|
||||||
"vaultwarden_email=" + email,
|
|
||||||
"vaultwarden_client_id=" + clientID,
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
|
|
||||||
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
|
|
||||||
// on stdin by realRunnerStdin.
|
|
||||||
func vaultPatchSecretArgs(user, key string) []string {
|
|
||||||
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
|
|
||||||
}
|
|
||||||
|
|
||||||
// writeCreds stores all four fields in the user's Vault path. The two real
|
|
||||||
// secrets (master password, API client_secret) go via stdin — never argv.
|
|
||||||
func writeCreds(user string, c vwCreds) error {
|
|
||||||
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// promptNoEcho reads one line without terminal echo (for the master password).
|
|
||||||
func promptNoEcho(prompt string) (string, error) {
|
|
||||||
fmt.Fprint(os.Stderr, prompt)
|
|
||||||
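stty := func(arg string) { c := exec.Command("stty", arg); c.Stdin = os.Stdin; _ = c.Run() } // stty acts on its stdin; a nil Stdin would be /dev/null
|
|
||||||
stty("-echo")
|
|
||||||
defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()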
|
|
||||||
r := bufio.NewReader(os.Stdin)
|
|
||||||
line, err := r.ReadString('\n')
|
|
||||||
// Trim only the line terminator — a master password / API secret may
|
|
||||||
// legitimately contain leading/trailing spaces.
|
|
||||||
return strings.TrimRight(line, "\r\n"), err
|
|
||||||
}
|
|
||||||
|
|
||||||
func promptLine(prompt string) (string, error) {
|
|
||||||
fmt.Fprint(os.Stderr, prompt)
|
|
||||||
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
|
|
||||||
return strings.TrimSpace(line), err
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultSetup(args []string) error {
|
|
||||||
hardenProcess()
|
|
||||||
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
|
|
||||||
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
|
|
||||||
email, err := promptLine("Vaultwarden email: ")
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
clientID, err := promptLine("API key client_id (user.xxxx): ")
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
clientSecret, err := promptNoEcho("API key client_secret: ")
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
master, err := promptNoEcho("Master password: ")
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
if master == "" || clientID == "" || clientSecret == "" {
|
|
||||||
return fmt.Errorf("all fields are required")
|
|
||||||
}
|
|
||||||
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
|
|
||||||
if err := writeCreds(vaultCurrentUser(), c); err != nil {
|
|
||||||
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
|
|
||||||
}
|
|
||||||
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
|
|
||||||
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
|
|
||||||
}
|
|
||||||
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func vaultGet(args []string) error {
|
|
||||||
hardenProcess()
|
|
||||||
o, err := parseGetArgs(args)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
uid := vaultCurrentUID()
|
|
||||||
unlock, err := withUserLock(uid)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
defer unlock()
|
|
||||||
user := vaultCurrentUser()
|
|
||||||
val, err := getValue(realRunner, user, uid, o)
|
|
||||||
if err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
|
|
||||||
if o.json {
|
|
||||||
if !jsonToStdoutOK(stdoutIsTTY()) {
|
|
||||||
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
|
|
||||||
}
|
|
||||||
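b, _ := json.Marshal(map[string]string{o.field: val}) // Go's %q can emit escapes (\x1b, \a) that are not valid JSON
|
|
||||||
fmt.Println(string(b))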
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
emitSecret(val)
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
@ -1,368 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"encoding/base64"
|
|
||||||
"fmt"
|
|
||||||
"os"
|
|
||||||
"reflect"
|
|
||||||
"strings"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func TestVaultCommandsRegistered(t *testing.T) {
|
|
||||||
want := map[string]Tier{
|
|
||||||
"vault setup": TierWrite,
|
|
||||||
"vault status": TierRead,
|
|
||||||
"vault list": TierRead,
|
|
||||||
"vault get": TierRead,
|
|
||||||
"vault search": TierRead,
|
|
||||||
"vault code": TierRead,
|
|
||||||
"vault lock": TierWrite,
|
|
||||||
}
|
|
||||||
got := map[string]Tier{}
|
|
||||||
for _, c := range vaultCommands() {
|
|
||||||
got[c.name()] = c.Tier
|
|
||||||
}
|
|
||||||
for name, tier := range want {
|
|
||||||
if got[name] != tier {
|
|
||||||
t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestVaultGroupInRegistry(t *testing.T) {
|
|
||||||
if !isCommandGroup(buildRegistry(), "vault") {
|
|
||||||
t.Fatal("`vault` group not wired into buildRegistry()")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestVaultCredsPath(t *testing.T) {
|
|
||||||
if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
|
|
||||||
t.Fatalf("vwCredsPath = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBwAppDataDir(t *testing.T) {
|
|
||||||
if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
|
|
||||||
t.Fatalf("bwAppDataDir = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// fakeRunner records calls and returns canned stdout/err for a key that is a prefix of the full command line (name + argv).
|
|
||||||
type fakeRunner struct {
|
|
||||||
calls [][]string
|
|
||||||
out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched
|
|
||||||
err map[string]error
|
|
||||||
lastEnv []string
|
|
||||||
}
|
|
||||||
|
|
||||||
func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
|
|
||||||
f.calls = append(f.calls, append([]string{name}, argv...))
|
|
||||||
f.lastEnv = envv
|
|
||||||
key := name + " " + strings.Join(argv, " ")
|
|
||||||
for k, v := range f.out {
|
|
||||||
if strings.HasPrefix(key, k) {
|
|
||||||
return v, f.err[k]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return "", f.err[key]
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestLoadCredsReadsFourFields(t *testing.T) {
|
|
||||||
f := &fakeRunner{out: map[string]string{
|
|
||||||
"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
|
|
||||||
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
|
|
||||||
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc",
|
|
||||||
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek",
|
|
||||||
}}
|
|
||||||
c, err := loadCreds(f.run, "emo")
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("loadCreds: %v", err)
|
|
||||||
}
|
|
||||||
want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
|
|
||||||
if !reflect.DeepEqual(c, want) {
|
|
||||||
t.Fatalf("loadCreds = %+v want %+v", c, want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestLoadCredsUnconfigured(t *testing.T) {
|
|
||||||
f := &fakeRunner{out: map[string]string{}} // every field empty
|
|
||||||
if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
|
|
||||||
t.Fatalf("want 'not configured' error, got %v", err)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
|
|
||||||
c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
|
|
||||||
env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
|
|
||||||
joined := strings.Join(env, "\n")
|
|
||||||
for _, want := range []string{
|
|
||||||
"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
|
|
||||||
"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
|
|
||||||
} {
|
|
||||||
if !strings.Contains(joined, want) {
|
|
||||||
t.Errorf("bwSecretEnv missing %q", want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if !strings.Contains(joined, "PATH=") {
|
|
||||||
t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
|
|
||||||
argv := bwGetArgs("password", "github")
|
|
||||||
for _, a := range argv {
|
|
||||||
if strings.Contains(a, "SESSION") || a == "--session" {
|
|
||||||
t.Fatalf("session must travel via env, not argv: %v", argv)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
|
|
||||||
t.Fatalf("bwGetArgs = %v", argv)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBwListArgs(t *testing.T) {
|
|
||||||
if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
|
|
||||||
t.Fatalf("bwListArgs('') = %v", got)
|
|
||||||
}
|
|
||||||
if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
|
|
||||||
t.Fatalf("bwListArgs('git') = %v", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestBwUnlockReturnsSession(t *testing.T) {
|
|
||||||
f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
|
|
||||||
env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
|
|
||||||
sess, err := bwUnlock(f.run, env)
|
|
||||||
if err != nil || sess != "THE-SESSION-KEY" {
|
|
||||||
t.Fatalf("bwUnlock = %q, %v", sess, err)
|
|
||||||
}
|
|
||||||
// argv must use --passwordenv + --raw, never the password literal
|
|
||||||
last := f.calls[len(f.calls)-1]
|
|
||||||
if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
|
|
||||||
t.Fatalf("unlock argv = %v", last)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestReturnMode(t *testing.T) {
|
|
||||||
if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
|
|
||||||
t.Fatal("returnMode wrong")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestOSC52Encode(t *testing.T) {
|
|
||||||
got := osc52("secret")
|
|
||||||
want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
|
|
||||||
if got != want {
|
|
||||||
t.Fatalf("osc52 = %q want %q", got, want)
|
|
||||||
}
|
|
||||||
if osc52clear() != "\x1b]52;c;\a" {
|
|
||||||
t.Fatalf("osc52clear wrong: %q", osc52clear())
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestTerminalAllowed(t *testing.T) {
|
|
||||||
allow := []struct{ term, prog string }{
|
|
||||||
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
|
|
||||||
{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
|
|
||||||
}
|
|
||||||
for _, c := range allow {
|
|
||||||
if !terminalAllowed(c.term, c.prog) {
|
|
||||||
t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
|
|
||||||
for _, c := range deny {
|
|
||||||
if terminalAllowed(c.term, c.prog) {
|
|
||||||
t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
|
|
||||||
line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
|
|
||||||
for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
|
|
||||||
if !strings.Contains(line, must) {
|
|
||||||
t.Errorf("op-log missing %q: %s", must, line)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
for _, mustNot := range []string{"Chase", "password", "secret"} {
|
|
||||||
if strings.Contains(line, mustNot) {
|
|
||||||
t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestLockPath(t *testing.T) {
|
|
||||||
if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
|
|
||||||
t.Fatalf("vaultLockPath = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestParseGetArgs(t *testing.T) {
	o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
	if err != nil || o.name != "github" || o.field != "username" || !o.json {
		t.Fatalf("parseGetArgs = %+v err=%v", o, err)
	}
	d, _ := parseGetArgs([]string{"github"})
	if d.field != "password" || d.json {
		t.Fatalf("defaults wrong: %+v", d)
	}
	if _, err := parseGetArgs([]string{}); err == nil {
		t.Fatal("get with no name must error")
	}
	if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
		t.Fatal("invalid --field must error")
	}
}

func TestListNamesParsing(t *testing.T) {
	// bw list items returns JSON; listNames extracts name + id only.
	js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
	names := listNames(js)
	if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
		t.Fatalf("listNames = %v", names)
	}
}

func TestStatusSummaryUnconfigured(t *testing.T) {
	f := &fakeRunner{out: map[string]string{}} // no creds
	s := statusSummary(f.run, "emo", "1001")
	if !strings.Contains(s, "not configured") {
		t.Fatalf("status = %q", s)
	}
}

func TestVaultPatchPublicArgs(t *testing.T) {
	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("vaultPatchPublicArgs = %v", got)
	}
	for _, a := range got {
		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
			t.Fatalf("secret key leaked into public argv: %v", got)
		}
	}
}

func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
		got := vaultPatchSecretArgs("emo", key)
		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
		if !reflect.DeepEqual(got, want) {
			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
		}
		if got[len(got)-1] != key+"=-" {
			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
		}
	}
}

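The two tests above pin down a convention: the argv only ever carries `key=-`, and the secret value itself is expected to arrive on stdin. A minimal sketch of how a caller might wire that up with `os/exec` follows. `secretCmd` and `argvContains` are hypothetical names introduced for illustration and are not part of the code in this diff; the command is built but never run.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// secretCmd (hypothetical) builds, but does not run, a command whose secret
// is supplied on stdin, so it never appears in the process argument list.
func secretCmd(name string, argv []string, secret string) *exec.Cmd {
	cmd := exec.Command(name, argv...)
	cmd.Stdin = strings.NewReader(secret)
	return cmd
}

// argvContains (hypothetical) reports whether any argument contains needle.
func argvContains(argv []string, needle string) bool {
	for _, a := range argv {
		if strings.Contains(a, needle) {
			return true
		}
	}
	return false
}

func main() {
	argv := []string{"kv", "patch", "secret/workstation/claude-users/emo", "vaultwarden_client_secret=-"}
	cmd := secretCmd("vault", argv, "CLIENTSEKRET")
	// The secret lives only in cmd.Stdin, not in cmd.Args.
	fmt.Println("secret in argv:", argvContains(cmd.Args, "CLIENTSEKRET"))
}
```

Keeping the value out of argv matters because process arguments are typically visible to other local users (for example via `ps`), whereas stdin is not.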
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
// value may appear in any command's argv — secrets travel via env/stdin only.
func TestNoSecretInArgvAcrossFlow(t *testing.T) {
	uid := fmt.Sprintf("%d", os.Getuid())
	f := &fakeRunner{out: map[string]string{
		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
		"bw status": `{"status":"locked"}`,
		"bw unlock": "SESSIONXYZ",
		"bw get password github": "p@ss",
	}}
	if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
		t.Fatalf("getValue: %v", err)
	}
	for _, call := range f.calls {
		for _, arg := range call {
			for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
				if strings.Contains(arg, s) {
					t.Errorf("secret %q leaked into argv: %v", s, call)
				}
			}
		}
	}
	if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
		t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
	}
}

func TestClipboardDecision(t *testing.T) {
	cases := []struct {
		stdoutTTY, stderrTTY bool
		term, prog, want string
	}{
		{false, true, "xterm-kitty", "", "stdout"},
		{true, true, "xterm-kitty", "", "clipboard"},
		{true, true, "dumb", "", "refuse"},
		{true, false, "xterm-kitty", "", "refuse"},
	}
	for _, c := range cases {
		if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
			t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want)
		}
	}
}

func TestJSONToStdoutOK(t *testing.T) {
	if jsonToStdoutOK(true) {
		t.Error("must refuse JSON secret on a terminal")
	}
	if !jsonToStdoutOK(false) {
		t.Error("must allow JSON when piped")
	}
}

func TestBwNeedsLogin(t *testing.T) {
	if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
		t.Error("unauthenticated → needs login")
	}
	if bwNeedsLogin(`{"status":"locked"}`) {
		t.Error("locked → no login (just unlock)")
	}
	if bwNeedsLogin(`{"status":"unlocked"}`) {
		t.Error("unlocked → no login")
	}
	if !bwNeedsLogin(`not json`) {
		t.Error("unparseable → attempt login")
	}
}

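The four cases in `TestBwNeedsLogin` amount to a small decision rule: log in when the status is `unauthenticated` or when the status output cannot be parsed, otherwise do not. The real `bwNeedsLogin` is not shown in this diff, so the following is only one possible implementation that satisfies those cases; `needsLogin` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// needsLogin is a hypothetical stand-in for bwNeedsLogin. It treats
// unparseable status output as "attempt a login" and otherwise logs in only
// when the reported status is "unauthenticated".
func needsLogin(statusJSON string) bool {
	var s struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
		return true // unparseable: fall back to attempting a login
	}
	return s.Status == "unauthenticated"
}

func main() {
	inputs := []string{
		`{"status":"unauthenticated"}`,
		`{"status":"locked"}`,
		`{"status":"unlocked"}`,
		`not json`,
	}
	for _, in := range inputs {
		fmt.Printf("%s -> %v\n", in, needsLogin(in))
	}
}
```

Note that under this sketch a valid JSON object with a missing or unknown status is treated as "no login"; whether the real function behaves that way is not determined by the tests shown.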
func TestVaultHelpMentionsSecurity(t *testing.T) {
	h := vaultHelp()
	for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
		if !strings.Contains(h, want) {
			t.Errorf("vault help missing %q", want)
		}
	}
}

func TestVaultBareGroupRegistered(t *testing.T) {
	for _, c := range vaultCommands() {
		if len(c.Path) == 1 && c.Path[0] == "vault" {
			return
		}
	}
	t.Fatal("bare `vault` help command not registered")
}

// getValue is the testable core: given a runner + opts, returns the secret value.
func TestGetValueFlow(t *testing.T) {
	f := &fakeRunner{out: map[string]string{
		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
		"bw status": `{"status":"locked"}`,
		"bw unlock": "SESS",
		"bw get password github": "p@ss",
	}}
	// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
	uid := fmt.Sprintf("%d", os.Getuid())
	val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
	if err != nil || val != "p@ss" {
		t.Fatalf("getValue = %q, %v", val, err)
	}
}
212
cli/cmd_work.go
@ -1,212 +0,0 @@
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func workCommands() []Command {
	return []Command{
		{Path: []string{"work", "start"}, Tier: TierWrite,
			Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
		{Path: []string{"work", "land"}, Tier: TierWrite,
			Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
		{Path: []string{"work", "clean"}, Tier: TierWrite,
			Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
	}
}

// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
	for i, a := range args {
		if a == name && i+1 < len(args) {
			return args[i+1]
		}
		if strings.HasPrefix(a, name+"=") {
			return strings.TrimPrefix(a, name+"=")
		}
	}
	return ""
}

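One edge case of `flagValue` is worth knowing: in the two-token form it returns whatever token follows the flag, so `--verify-cmd --no-verify` yields `--no-verify` as the value. The sketch below shows a stricter variant that refuses a following token that itself looks like a flag. `flagValueStrict` is a hypothetical helper, not something this diff contains.

```go
package main

import (
	"fmt"
	"strings"
)

// flagValueStrict is a hypothetical variant of flagValue. In the
// `--name value` form it ignores a following token that starts with "-",
// so another flag is never mistaken for the value.
func flagValueStrict(args []string, name string) string {
	for i, a := range args {
		if a == name && i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
			return args[i+1]
		}
		if strings.HasPrefix(a, name+"=") {
			return strings.TrimPrefix(a, name+"=")
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", flagValueStrict([]string{"--verify-cmd", "--no-verify"}, "--verify-cmd"))
	fmt.Printf("%q\n", flagValueStrict([]string{"--verify-cmd=make test"}, "--verify-cmd"))
}
```

The trade-off is that a value which legitimately begins with `-` would have to be passed in the `--name=value` form.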
func remotesOrEmpty(repoRoot string) []string {
	r, _ := gitRemotes(repoRoot)
	return r
}

// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
	topic, _ := firstPositional(args)
	if topic == "" {
		return fmt.Errorf("usage: homelab work start <topic>")
	}
	cwd, _ := os.Getwd()
	repoRoot, err := gitRepoRoot(cwd)
	if err != nil {
		return fmt.Errorf("not in a git repository: %w", err)
	}
	remote := preferRemote(remotesOrEmpty(repoRoot))
	if remote == "" {
		return fmt.Errorf("no git remote configured in %s", repoRoot)
	}
	flags := cryptFlagsFor(repoRoot)
	branch := currentUser() + "/" + topic
	wtRel := filepath.Join(".worktrees", topic)

	ensureWorktreesIgnored(repoRoot)
	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
		return fmt.Errorf("fetch %s failed: %w", remote, err)
	}
	if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
		return fmt.Errorf("worktree add failed: %w", err)
	}
	wtPath := filepath.Join(repoRoot, wtRel)
	fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
	fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
	return nil
}

// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
	verifyCmd := flagValue(args, "--verify-cmd")
	cwd, _ := os.Getwd()
	repoRoot, err := gitRepoRoot(cwd)
	if err != nil {
		return fmt.Errorf("not in a git repository: %w", err)
	}
	branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
	if err != nil {
		return err
	}
	if branch == "master" || branch == "main" {
		return fmt.Errorf("refusing to land: already on %s", branch)
	}
	remote := preferRemote(remotesOrEmpty(repoRoot))
	if remote == "" {
		return fmt.Errorf("no git remote configured in %s", repoRoot)
	}
	flags := cryptFlagsFor(repoRoot)

	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
		return fmt.Errorf("fetch failed: %w", err)
	}
	if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
		return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
	}
	if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
		return fmt.Errorf("not landing: %w", err)
	}
	if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
		return landFallback(repoRoot, flags, remote, branch, err)
	}
	fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
	if containsArg(args, "--no-ci-watch") {
		fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
		return nil
	}
	landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
	fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
	if err := ciWatch([]string{landed}); err != nil {
		return fmt.Errorf("landed, but CI did not go green: %w", err)
	}
	return nil
}

// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
	if verifyCmd != "" {
		fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
		return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
	}
	if isFile(filepath.Join(repoRoot, "go.mod")) {
		fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
		return runStreamingIn(repoRoot, "go", "test", "./...")
	}
	if allowSkip {
		fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
		return nil
	}
	return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}

// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if err := gitStream(repoRoot, flags, "push", remote, "HEAD:master"); err == nil {
			return nil
		} else {
			lastErr = err
		}
		if i < attempts-1 {
			fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
			if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
				return err
			}
			if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
				return err
			}
		}
	}
	return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}

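The control flow of `pushWithRetry` can be studied without git by injecting the two steps as callbacks. The sketch below is a simplified model under that assumption: `retryWithRecover`, `failNTimes`, and `errRejected` are hypothetical names, with `try` standing in for the push and `recoverFn` for the fetch + merge.

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected is a hypothetical stand-in for a rejected push.
var errRejected = errors.New("push rejected")

// retryWithRecover calls try up to attempts times. After every failed attempt
// except the last, it runs recoverFn before trying again; a recoverFn error
// aborts immediately, as a failed fetch or merge does in pushWithRetry.
func retryWithRecover(attempts int, try, recoverFn func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		err := try()
		if err == nil {
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			if err := recoverFn(); err != nil {
				return err
			}
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", attempts, lastErr)
}

// failNTimes returns a try function that fails n times and then succeeds.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errRejected
		}
		return nil
	}
}

func main() {
	recovers := 0
	err := retryWithRecover(3, failNTimes(2), func() error { recovers++; return nil })
	fmt.Println("err:", err, "recovers:", recovers)
}
```

With three attempts and two initial failures, the recovery step runs twice and the third attempt succeeds; recovery is skipped after the final attempt because there is nothing left to retry.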
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
	fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
	fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
	if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
		return fmt.Errorf("fallback branch push also failed: %w", err)
	}
	fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
	return nil
}

// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
	topic, _ := firstPositional(args)
	if topic == "" {
		return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
	}
	cwd, _ := os.Getwd()
	repoRoot, err := gitRepoRoot(cwd)
	if err != nil {
		return fmt.Errorf("not in a git repository: %w", err)
	}
	flags := cryptFlagsFor(repoRoot)
	wtRel := filepath.Join(".worktrees", topic)
	branch := currentUser() + "/" + topic

	if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
		return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
	}
	if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
		fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
	}
	fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
	return nil
}

// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
	if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
		return
	}
	gi := filepath.Join(repoRoot, ".gitignore")
	f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return
	}
	defer f.Close()
	if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
		fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
	}
}
@ -1,32 +0,0 @@
package main

import "testing"

func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
	dir := t.TempDir() // no go.mod, no verify cmd
	if err := runVerify(dir, "", false); err == nil {
		t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
	}
	if err := runVerify(dir, "", true); err != nil {
		t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
	}
}

func TestFlagValue(t *testing.T) {
	cases := []struct {
		args []string
		name string
		want string
	}{
		{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
		{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
		{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
		{[]string{"topic"}, "--verify-cmd", ""},
		{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
	}
	for _, c := range cases {
		if got := flagValue(c.args, c.name); got != c.want {
			t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
		}
	}
}
104
cli/command.go
@ -1,104 +0,0 @@
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string

const (
	TierRead Tier = "read"
	TierWrite Tier = "write"
)

// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
	Path []string
	Tier Tier
	Summary string
	Run func(args []string) error
}

// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
	best := -1
	bestLen := 0
	for i, c := range reg {
		if len(c.Path) > len(args) {
			continue
		}
		match := true
		for j, p := range c.Path {
			if args[j] != p {
				match = false
				break
			}
		}
		if match && len(c.Path) >= bestLen {
			best = i
			bestLen = len(c.Path)
		}
	}
	if best < 0 {
		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
	}
	matched := reg[best]
	runErr := matched.Run(args[bestLen:])
	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
	return runErr
}

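The matching rule in `dispatch` is what lets a bare group such as `vault` coexist with a longer path such as `vault get`: the longer matching path wins, and the unmatched tail becomes the command's arguments. The sketch below isolates just that selection step. `longestPrefix` is a hypothetical helper written for illustration, and the `vault get` path is assumed from the help text checked in the tests rather than taken from a registry shown here.

```go
package main

import "fmt"

// longestPrefix (hypothetical) returns the index of the path that is the
// longest prefix of args, plus that path's length. idx is -1 when nothing
// matches. On equal lengths the later entry wins, mirroring the >= in dispatch.
func longestPrefix(paths [][]string, args []string) (idx, n int) {
	idx = -1
	for i, p := range paths {
		if len(p) > len(args) || len(p) < n {
			continue
		}
		match := true
		for j := range p {
			if args[j] != p[j] {
				match = false
				break
			}
		}
		if match {
			idx, n = i, len(p)
		}
	}
	return idx, n
}

func main() {
	paths := [][]string{{"vault"}, {"vault", "get"}}
	args := []string{"vault", "get", "github"}
	idx, n := longestPrefix(paths, args)
	fmt.Println("selected:", paths[idx], "remaining args:", args[n:])
}
```

Here `vault get github` selects the two-token path and passes `github` on, while a lone `vault` would fall through to the bare group entry.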
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }

// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
	out := make([]Command, len(reg))
	copy(out, reg)
	sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
	return out
}

// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
	cmds := sortedByName(reg)
	width := 0
	for _, c := range cmds {
		if n := len(c.name()); n > width {
			width = n
		}
	}
	var b strings.Builder
	for _, c := range cmds {
		fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
	}
	return b.String()
}

// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
	type entry struct {
		Command string `json:"command"`
		Tier string `json:"tier"`
		Summary string `json:"summary"`
	}
	entries := make([]entry, 0, len(reg))
	for _, c := range sortedByName(reg) {
		entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
	}
	b, err := json.MarshalIndent(entries, "", " ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}
@ -1,73 +0,0 @@
package main

import (
	"encoding/json"
	"reflect"
	"strings"
	"testing"
)

// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
	var gotArgs []string
	ran := ""
	reg := []Command{
		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
			Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
			Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
	}

	if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
		t.Fatalf("dispatch returned error: %v", err)
	}
	if ran != "tf plan" {
		t.Fatalf("routed to %q, want %q", ran, "tf plan")
	}
	if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
		t.Fatalf("command got args %v, want %v", gotArgs, want)
	}
}

func TestDispatchUnknownCommandErrors(t *testing.T) {
	reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
	if err := dispatch(reg, []string{"bogus"}); err == nil {
		t.Fatal("expected error for unknown command, got nil")
	}
}

// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
	reg := []Command{
		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
	}
	out := manifestText(reg)
	for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
		if !strings.Contains(out, want) {
			t.Errorf("manifest text missing %q\n---\n%s", want, out)
		}
	}
	// sorted: claim (c) must appear before tf plan (t)
	if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
		t.Errorf("manifest not sorted by path:\n%s", out)
	}
}

func TestManifestJSONIsParsableAndTagged(t *testing.T) {
	reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
	out, err := manifestJSON(reg)
	if err != nil {
		t.Fatalf("manifestJSON error: %v", err)
	}
	var got []map[string]string
	if err := json.Unmarshal([]byte(out), &got); err != nil {
		t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
	}
	if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
		t.Fatalf("unexpected manifest JSON: %v", got)
	}
}
@ -1,98 +0,0 @@
package main

import (
	"fmt"
	"strings"
)

// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"

// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
	var reg []Command
	reg = append(reg, claimCommands()...)
	reg = append(reg, tfCommands()...)
	reg = append(reg, workCommands()...)
	reg = append(reg, k8sCommands()...)
	reg = append(reg, memoryCommands()...)
	reg = append(reg, ciCommands()...)
	reg = append(reg, deployCommands()...)
	reg = append(reg, netCommands()...)
	reg = append(reg, obsCommands()...)
	reg = append(reg, usageCommands()...)
	reg = append(reg, haCommands()...)
	reg = append(reg, browserCommands()...)
	reg = append(reg, vaultCommands()...)
	return reg
}

// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
	if len(args) == 0 {
		fmt.Print(usage())
		return true, nil
	}
	switch args[0] {
	case "help", "-h", "--help":
		fmt.Print(usage())
		return true, nil
	case "version", "--version":
		fmt.Println("homelab " + version)
		return true, nil
	case "manifest":
		reg := buildRegistry()
		if containsArg(args[1:], "--json") {
			out, err := manifestJSON(reg)
			if err != nil {
				return true, err
			}
			fmt.Println(out)
			return true, nil
		}
		fmt.Print(manifestText(reg))
		return true, nil
	}
	if strings.HasPrefix(args[0], "-") {
		return false, nil
	}
	reg := buildRegistry()
	if !isCommandGroup(reg, args[0]) {
		return false, nil
	}
	return true, dispatch(reg, args)
}

func isCommandGroup(reg []Command, group string) bool {
	for _, c := range reg {
		if len(c.Path) > 0 && c.Path[0] == group {
			return true
		}
	}
	return false
}

func containsArg(args []string, want string) bool {
	for _, a := range args {
		if a == want {
			return true
		}
	}
	return false
}

func usage() string {
	var b strings.Builder
	fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
	b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
	for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
		if line != "" {
			b.WriteString(" " + line + "\n")
		}
	}
	b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
	b.WriteString(" version print version\n")
	b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
	return b.String()
}
138
cli/k8s.go
@ -1,138 +0,0 @@
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// kubectl helpers use the ambient kubeconfig (no per-call auth flags).

func kubectlBase(ns string, args ...string) []string {
	var full []string
	if ns != "" {
		full = append(full, "-n", ns)
	}
	return append(full, args...)
}

func kubectlStream(ns string, args ...string) error {
	return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}

// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
	out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
	return strings.TrimSpace(string(out)), err
}

// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
	app string
	ns string
	pod string
	container string
	selector string
	tty bool
	rest []string // passthrough flags and, after `--`, the exec command
}

// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
	t := k8sTarget{}
	i := 0
	take := func() string {
		if i+1 < len(args) {
			i++
			return args[i]
		}
		return ""
	}
	for i = 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--":
			t.rest = append(t.rest, args[i+1:]...)
			return t
		case a == "-n" || a == "--namespace":
			t.ns = take()
		case strings.HasPrefix(a, "--namespace="):
			t.ns = strings.TrimPrefix(a, "--namespace=")
		case a == "--pod":
			t.pod = take()
		case strings.HasPrefix(a, "--pod="):
			t.pod = strings.TrimPrefix(a, "--pod=")
		case a == "-c" || a == "--container":
			t.container = take()
		case strings.HasPrefix(a, "--container="):
			t.container = strings.TrimPrefix(a, "--container=")
		case a == "-l" || a == "--selector":
			t.selector = take()
		case strings.HasPrefix(a, "--selector="):
			t.selector = strings.TrimPrefix(a, "--selector=")
		case a == "--tty" || a == "-it" || a == "-ti":
			t.tty = true
		case !strings.HasPrefix(a, "-") && t.app == "":
			t.app = a
		default:
			t.rest = append(t.rest, a)
		}
	}
	return t
}

// namespace defaults to the app name (most namespaces hold exactly one app).
|
|
||||||
func (t k8sTarget) namespace() string {
|
|
||||||
if t.ns != "" {
|
|
||||||
return t.ns
|
|
||||||
}
|
|
||||||
return t.app
|
|
||||||
}
|
|
||||||
|
|
||||||
// objectRef is the kubectl object for logs/exec: an explicit pod, else
|
|
||||||
// deploy/<app> (kubectl resolves a pod from the Deployment).
|
|
||||||
func (t k8sTarget) objectRef() string {
|
|
||||||
if t.pod != "" {
|
|
||||||
return "pod/" + t.pod
|
|
||||||
}
|
|
||||||
return "deploy/" + t.app
|
|
||||||
}
|
|
||||||
|
|
||||||
// --- database access (the dbaas exec pattern) ---
|
|
||||||
|
|
||||||
type dbPlan struct {
|
|
||||||
ns string
|
|
||||||
pod string // explicit pod (e.g. mysql-standalone-0)
|
|
||||||
selector string // resolve the pod by this label when pod == "" (CNPG primary)
|
|
||||||
container string // "" = default container
|
|
||||||
argv []string // command + args to run inside the pod
|
|
||||||
}
|
|
||||||
|
|
||||||
// planDBExec builds the in-pod command to run sql against app's database.
|
|
||||||
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
|
|
||||||
// Service, not an exec target), psql -U postgres -d <db>.
|
|
||||||
// MySQL: mysql-standalone-0, password from env (never on the command line).
|
|
||||||
// dbName defaults to app. sql empty => interactive client.
|
|
||||||
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
|
|
||||||
if dbName == "" {
|
|
||||||
dbName = app
|
|
||||||
}
|
|
||||||
if mysql {
|
|
||||||
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
|
|
||||||
if sql != "" {
|
|
||||||
inner += " -e " + shellQuote(sql)
|
|
||||||
}
|
|
||||||
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
|
|
||||||
}
|
|
||||||
argv := []string{"psql", "-U", "postgres", "-d", dbName}
|
|
||||||
if sql != "" {
|
|
||||||
argv = append(argv, "-tAc", sql)
|
|
||||||
}
|
|
||||||
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
|
|
||||||
}
|
|
||||||
|
|
||||||
// shellQuote single-quotes s for safe embedding in a bash -c string.
|
|
||||||
func shellQuote(s string) string {
|
|
||||||
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
|
|
||||||
}
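The quoting rule in `shellQuote` can be exercised on its own. The sketch below copies the function and shows how an embedded single quote is neutralized before the value is placed in a `bash -c` string. The `buildInner` helper and the sample inputs are hypothetical and exist only for illustration; they are not part of the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote mirrors the helper in the diff: wrap s in single quotes and
// rewrite each embedded ' as '\'' (close quote, escaped quote, reopen quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildInner is a hypothetical helper (an assumption, not original code) that
// embeds two quoted values in a command string meant for `bash -c`.
func buildInner(db, sql string) string {
	return "mysql " + shellQuote(db) + " -e " + shellQuote(sql)
}

func main() {
	fmt.Println(shellQuote("a'b"))
	fmt.Println(buildInner("wrongmove", "SELECT 'x'"))
}
```

Because every value passes through `shellQuote`, a quote inside the SQL text cannot terminate the surrounding shell string early.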
|
|
||||||
|
|
@ -1,65 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"reflect"
|
|
||||||
"strings"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func TestParseK8sTarget(t *testing.T) {
|
|
||||||
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
|
|
||||||
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
|
|
||||||
if !reflect.DeepEqual(got, want) {
|
|
||||||
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
|
|
||||||
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
|
|
||||||
t.Errorf("namespace() = %q, want immich", ns)
|
|
||||||
}
|
|
||||||
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
|
|
||||||
t.Errorf("namespace() = %q, want dbaas", ns)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestK8sTargetObjectRef(t *testing.T) {
|
|
||||||
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
|
|
||||||
t.Errorf("objectRef() = %q, want deploy/tripit", r)
|
|
||||||
}
|
|
||||||
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
|
|
||||||
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestPlanDBExecPostgresDefault(t *testing.T) {
|
|
||||||
p := planDBExec("fire-planner", "", "SELECT 1", false)
|
|
||||||
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
|
|
||||||
// label rather than naming an (un-exec-able) Service.
|
|
||||||
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
|
|
||||||
t.Fatalf("unexpected pg target: %+v", p)
|
|
||||||
}
|
|
||||||
// db name defaults to the app; SQL passed via -tAc
|
|
||||||
joined := strings.Join(p.argv, " ")
|
|
||||||
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
|
|
||||||
t.Fatalf("pg argv missing db/sql: %v", p.argv)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
|
|
||||||
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
|
|
||||||
if p.pod != "mysql-standalone-0" {
|
|
||||||
t.Fatalf("unexpected mysql pod: %+v", p)
|
|
||||||
}
|
|
||||||
inner := strings.Join(p.argv, " ")
|
|
||||||
// password must come from the env var, never inline
|
|
||||||
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
|
|
||||||
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestShellQuoteEscapes(t *testing.T) {
|
|
||||||
if got := shellQuote("a'b"); got != `'a'\''b'` {
|
|
||||||
t.Fatalf("shellQuote = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
12
cli/main.go
|
|
@ -26,16 +26,8 @@ var (
|
||||||
)
|
)
|
||||||
|
|
||||||
func main() {
|
func main() {
|
||||||
// homelab verb surface (work/tf/claim/...) is tried first; if the args are
|
err := run()
|
||||||
// not a homelab verb, fall through to the legacy webhook -use-case path.
|
if err != nil {
|
||||||
if handled, err := dispatchTop(os.Args[1:]); handled {
|
|
||||||
if err != nil {
|
|
||||||
fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
|
|
||||||
os.Exit(1)
|
|
||||||
}
|
|
||||||
return
|
|
||||||
}
|
|
||||||
if err := run(); err != nil {
|
|
||||||
glog.Errorf("run failed: %s", err.Error())
|
glog.Errorf("run failed: %s", err.Error())
|
||||||
os.Exit(255)
|
os.Exit(255)
|
||||||
}
|
}
|
||||||
|
|
|
||||||
103
cli/memory.go
|
|
@ -1,103 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"bytes"
|
|
||||||
"encoding/json"
|
|
||||||
"fmt"
|
|
||||||
"io"
|
|
||||||
"net/http"
|
|
||||||
"os"
|
|
||||||
"strings"
|
|
||||||
"time"
|
|
||||||
)
|
|
||||||
|
|
||||||
// defaultMemoryURL is used when no env override is present (agents normally have
|
|
||||||
// CLAUDE_MEMORY_API_URL set by the memory hooks).
|
|
||||||
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
|
|
||||||
|
|
||||||
type memoryClient struct {
|
|
||||||
base string
|
|
||||||
key string
|
|
||||||
http *http.Client
|
|
||||||
}
|
|
||||||
|
|
||||||
func firstEnv(keys ...string) string {
|
|
||||||
for _, k := range keys {
|
|
||||||
if v := os.Getenv(k); v != "" {
|
|
||||||
return v
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
|
|
||||||
func resolveMemoryBase() string {
|
|
||||||
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
|
|
||||||
return strings.TrimRight(b, "/")
|
|
||||||
}
|
|
||||||
return defaultMemoryURL
|
|
||||||
}
|
|
||||||
|
|
||||||
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
|
|
||||||
// the MCP wraps), so it works even when the MCP frontend is down.
|
|
||||||
func newMemoryClient() (*memoryClient, error) {
|
|
||||||
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
|
|
||||||
if key == "" {
|
|
||||||
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
|
|
||||||
}
|
|
||||||
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
|
|
||||||
var r io.Reader
|
|
||||||
if body != nil {
|
|
||||||
b, err := json.Marshal(body)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
r = bytes.NewReader(b)
|
|
||||||
}
|
|
||||||
req, err := http.NewRequest(method, c.base+path, r)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
req.Header.Set("Authorization", "Bearer "+c.key)
|
|
||||||
if body != nil {
|
|
||||||
req.Header.Set("Content-Type", "application/json")
|
|
||||||
}
|
|
||||||
resp, err := c.http.Do(req)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
defer resp.Body.Close()
|
|
||||||
out, _ := io.ReadAll(resp.Body)
|
|
||||||
if resp.StatusCode >= 300 {
|
|
||||||
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
|
|
||||||
}
|
|
||||||
return out, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// Request bodies mirror src/claude_memory/api/models.py.
|
|
||||||
|
|
||||||
type memRecallReq struct {
|
|
||||||
Context string `json:"context"`
|
|
||||||
ExpandedQuery string `json:"expanded_query,omitempty"`
|
|
||||||
Category string `json:"category,omitempty"`
|
|
||||||
SortBy string `json:"sort_by,omitempty"`
|
|
||||||
Limit int `json:"limit,omitempty"`
|
|
||||||
}
|
|
||||||
|
|
||||||
type memStoreReq struct {
|
|
||||||
Content string `json:"content"`
|
|
||||||
Category string `json:"category,omitempty"`
|
|
||||||
Tags string `json:"tags,omitempty"`
|
|
||||||
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
|
|
||||||
Importance float64 `json:"importance"`
|
|
||||||
ForceSensitive bool `json:"force_sensitive,omitempty"`
|
|
||||||
}
|
|
||||||
|
|
||||||
type memUpdateReq struct {
|
|
||||||
Content *string `json:"content,omitempty"`
|
|
||||||
Tags *string `json:"tags,omitempty"`
|
|
||||||
Importance *float64 `json:"importance,omitempty"`
|
|
||||||
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
|
|
||||||
}
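The pointer fields in `memUpdateReq` let a partial update distinguish "field not supplied" from "field explicitly set to its zero value". A minimal sketch of that behaviour, assuming a cut-down struct named `patch` (hypothetical, standing in for `memUpdateReq`) and a hypothetical `encode` helper:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patch is a hypothetical two-field stand-in for memUpdateReq.
type patch struct {
	Tags       *string  `json:"tags,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals p and returns the JSON text ("" on error).
func encode(p patch) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	zero := 0.0
	// nil pointers are omitted entirely.
	fmt.Println(encode(patch{}))
	// A non-nil pointer is kept even when it points at a zero value.
	fmt.Println(encode(patch{Importance: &zero}))
}
```

With a plain `float64` and `omitempty`, an importance of 0 would be dropped from the body; the pointer form keeps it.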
|
|
||||||
|
|
@ -1,51 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"encoding/json"
|
|
||||||
"os"
|
|
||||||
"strings"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func TestResolveMemoryBase(t *testing.T) {
|
|
||||||
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
|
|
||||||
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
|
|
||||||
|
|
||||||
os.Unsetenv("CLAUDE_MEMORY_API_URL")
|
|
||||||
os.Unsetenv("MEMORY_API_URL")
|
|
||||||
if got := resolveMemoryBase(); got != defaultMemoryURL {
|
|
||||||
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
|
|
||||||
}
|
|
||||||
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
|
|
||||||
if got := resolveMemoryBase(); got != "https://m.example" {
|
|
||||||
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
|
|
||||||
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
|
|
||||||
s := string(b)
|
|
||||||
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
|
|
||||||
t.Fatalf("memStoreReq JSON missing fields: %s", s)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
|
|
||||||
tags := "a,b"
|
|
||||||
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
|
|
||||||
s := string(b)
|
|
||||||
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
|
|
||||||
t.Fatalf("unset update fields must be omitted: %s", s)
|
|
||||||
}
|
|
||||||
if !strings.Contains(s, `"tags":"a,b"`) {
|
|
||||||
t.Fatalf("set field missing: %s", s)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
|
|
||||||
b, _ := json.Marshal(memRecallReq{Context: "hi"})
|
|
||||||
s := string(b)
|
|
||||||
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
|
|
||||||
t.Fatalf("empty optionals must be omitted: %s", s)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,58 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"fmt"
|
|
||||||
"os"
|
|
||||||
"path/filepath"
|
|
||||||
"strings"
|
|
||||||
)
|
|
||||||
|
|
||||||
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
|
|
||||||
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
|
|
||||||
|
|
||||||
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
|
|
||||||
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
|
|
||||||
func presenceScript() string {
|
|
||||||
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
|
|
||||||
return p
|
|
||||||
}
|
|
||||||
home, err := os.UserHomeDir()
|
|
||||||
if err != nil {
|
|
||||||
return "presence"
|
|
||||||
}
|
|
||||||
return filepath.Join(home, "code", "scripts", "presence")
|
|
||||||
}
|
|
||||||
|
|
||||||
// validateLabel checks a presence label is <kind>:<name> with a known kind.
|
|
||||||
func validateLabel(label string) error {
|
|
||||||
parts := strings.SplitN(label, ":", 2)
|
|
||||||
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
|
|
||||||
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
|
|
||||||
}
|
|
||||||
for _, k := range validPresenceKinds {
|
|
||||||
if parts[0] == k {
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
|
|
||||||
}
|
|
||||||
|
|
||||||
// presenceClaim claims label on the board with a purpose note.
|
|
||||||
func presenceClaim(label, purpose string) error {
|
|
||||||
if err := validateLabel(label); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
args := []string{"claim", label}
|
|
||||||
if purpose != "" {
|
|
||||||
args = append(args, "--purpose", purpose)
|
|
||||||
}
|
|
||||||
return runStreaming(presenceScript(), args...)
|
|
||||||
}
|
|
||||||
|
|
||||||
// presenceRelease releases a prior claim on label.
|
|
||||||
func presenceRelease(label string) error {
|
|
||||||
if err := validateLabel(label); err != nil {
|
|
||||||
return err
|
|
||||||
}
|
|
||||||
return runStreaming(presenceScript(), "release", label)
|
|
||||||
}
|
|
||||||
|
|
@ -1,24 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import "testing"
|
|
||||||
|
|
||||||
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
|
|
||||||
good := []string{
|
|
||||||
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
|
|
||||||
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
|
|
||||||
}
|
|
||||||
for _, l := range good {
|
|
||||||
if err := validateLabel(l); err != nil {
|
|
||||||
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestValidateLabelRejectsBadLabels(t *testing.T) {
|
|
||||||
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
|
|
||||||
for _, l := range bad {
|
|
||||||
if err := validateLabel(l); err == nil {
|
|
||||||
t.Errorf("validateLabel(%q) = nil, want error", l)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
76
cli/probe.go
|
|
@ -1,76 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"context"
|
|
||||||
"crypto/tls"
|
|
||||||
"fmt"
|
|
||||||
"io"
|
|
||||||
"net"
|
|
||||||
"net/http"
|
|
||||||
"net/url"
|
|
||||||
"os/exec"
|
|
||||||
"strings"
|
|
||||||
"time"
|
|
||||||
)
|
|
||||||
|
|
||||||
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
|
|
||||||
const internalLBIP = "10.0.20.203"
|
|
||||||
|
|
||||||
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
|
|
||||||
// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
|
|
||||||
// host:443:ip`. TLS verification is skipped (these are reachability/observability
|
|
||||||
// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
|
|
||||||
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
|
|
||||||
d := &net.Dialer{Timeout: 8 * time.Second}
|
|
||||||
tr := &http.Transport{
|
|
||||||
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
|
|
||||||
if i := strings.LastIndex(addr, ":"); i >= 0 {
|
|
||||||
addr = ip + addr[i:]
|
|
||||||
}
|
|
||||||
return d.DialContext(ctx, network, addr)
|
|
||||||
},
|
|
||||||
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
|
|
||||||
}
|
|
||||||
return &http.Client{Timeout: timeout, Transport: tr}
|
|
||||||
}
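The dial-override idea behind `clientDialingIP` can be tried against a local test server. The sketch below is an illustration under assumptions, not the code from the diff: `dialFixed` and `probeHost` are hypothetical names, the host `example.internal` is made up, and the port is extracted with `net.SplitHostPort` instead of the `strings.LastIndex` approach used in the original.

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// dialFixed returns a client that connects to ip whatever the URL host is,
// keeping the port from the URL. Certificate verification is skipped, as in
// the original probe client.
func dialFixed(ip string, timeout time.Duration) *http.Client {
	d := &net.Dialer{Timeout: timeout}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			_, port, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			return d.DialContext(ctx, network, net.JoinHostPort(ip, port))
		},
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}
	return &http.Client{Timeout: timeout, Transport: tr}
}

// probeHost starts a loopback TLS server, requests it under the given host
// name, and reports the status code plus the host the server saw.
func probeHost(host string) (int, string, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Seen-Host", r.Host) // echo the Host header back
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	ip, port, err := net.SplitHostPort(strings.TrimPrefix(srv.URL, "https://"))
	if err != nil {
		return 0, "", err
	}
	resp, err := dialFixed(ip, 5*time.Second).Get("https://" + net.JoinHostPort(host, port) + "/")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	seen, _, err := net.SplitHostPort(resp.Header.Get("X-Seen-Host"))
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, seen, nil
}

func main() {
	code, seen, err := probeHost("example.internal")
	fmt.Println(code, seen, err)
}
```

The request never resolves `example.internal` through DNS; the name only travels in the TLS handshake and the `Host` header, which is the same effect as `curl --resolve`.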
|
|
||||||
|
|
||||||
// probeURL issues a GET and returns status code + elapsed time.
|
|
||||||
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
|
|
||||||
start := time.Now()
|
|
||||||
resp, err := c.Get(rawurl)
|
|
||||||
dur := time.Since(start)
|
|
||||||
if err != nil {
|
|
||||||
return 0, dur, err
|
|
||||||
}
|
|
||||||
resp.Body.Close()
|
|
||||||
return resp.StatusCode, dur, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
|
|
||||||
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
|
|
||||||
u := "https://" + host + path
|
|
||||||
if len(q) > 0 {
|
|
||||||
u += "?" + q.Encode()
|
|
||||||
}
|
|
||||||
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
defer resp.Body.Close()
|
|
||||||
body, _ := io.ReadAll(resp.Body)
|
|
||||||
if resp.StatusCode >= 300 {
|
|
||||||
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
|
|
||||||
}
|
|
||||||
return body, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// dig runs `dig +short` against a resolver, optionally for a record type.
|
|
||||||
func dig(name, server, rrtype string) (string, error) {
|
|
||||||
args := []string{"+short", "+time=3", "+tries=1"}
|
|
||||||
if rrtype != "" {
|
|
||||||
args = append(args, rrtype)
|
|
||||||
}
|
|
||||||
args = append(args, name, "@"+server)
|
|
||||||
out, err := exec.Command("dig", args...).Output()
|
|
||||||
return strings.TrimSpace(string(out)), err
|
|
||||||
}
|
|
||||||
|
|
@ -1,49 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import "testing"
|
|
||||||
|
|
||||||
func TestQueryArg(t *testing.T) {
|
|
||||||
if got := queryArg([]string{"up"}, nil); got != "up" {
|
|
||||||
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
|
|
||||||
}
|
|
||||||
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
|
|
||||||
t.Errorf(`--json should be dropped, got %q`, got)
|
|
||||||
}
|
|
||||||
// single quoted PromQL arrives as one token
|
|
||||||
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
|
|
||||||
t.Errorf(`quoted query mangled: %q`, got)
|
|
||||||
}
|
|
||||||
// value-flags and their values are skipped, query survives
|
|
||||||
vf := map[string]bool{"--since": true, "--limit": true}
|
|
||||||
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
|
|
||||||
t.Errorf(`value-flag skipping failed: %q`, got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestLabelStr(t *testing.T) {
|
|
||||||
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
|
|
||||||
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
|
|
||||||
t.Errorf("labelStr = %q", got)
|
|
||||||
}
|
|
||||||
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
|
|
||||||
t.Errorf("labelStr (no __name__) = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestOneLineList(t *testing.T) {
|
|
||||||
if got := oneLineList(" "); got != "(none)" {
|
|
||||||
t.Errorf("empty = %q, want (none)", got)
|
|
||||||
}
|
|
||||||
if got := oneLineList("a\nb"); got != "a, b" {
|
|
||||||
t.Errorf("multi = %q, want 'a, b'", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestHostOnly(t *testing.T) {
|
|
||||||
if got := hostOnly("foo.me/path"); got != "foo.me" {
|
|
||||||
t.Errorf("hostOnly = %q", got)
|
|
||||||
}
|
|
||||||
if got := hostOnly("foo.me"); got != "foo.me" {
|
|
||||||
t.Errorf("hostOnly = %q", got)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
101
cli/repo.go
|
|
@ -1,101 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"os"
|
|
||||||
"os/exec"
|
|
||||||
"os/user"
|
|
||||||
"path/filepath"
|
|
||||||
"strings"
|
|
||||||
)
|
|
||||||
|
|
||||||
// preferRemote picks the canonical remote: forgejo if present, else origin,
|
|
||||||
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
|
|
||||||
func preferRemote(remotes []string) string {
|
|
||||||
has := map[string]bool{}
|
|
||||||
for _, r := range remotes {
|
|
||||||
has[r] = true
|
|
||||||
}
|
|
||||||
switch {
|
|
||||||
case has["forgejo"]:
|
|
||||||
return "forgejo"
|
|
||||||
case has["origin"]:
|
|
||||||
return "origin"
|
|
||||||
case len(remotes) > 0:
|
|
||||||
return remotes[0]
|
|
||||||
default:
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
|
|
||||||
func hasGitCryptAttr(gitattributes string) bool {
|
|
||||||
return strings.Contains(gitattributes, "filter=git-crypt")
|
|
||||||
}
|
|
||||||
|
|
||||||
// gitCryptFlags are the per-command flags that disable smudge/clean so git
|
|
||||||
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
|
|
||||||
func gitCryptFlags() []string {
|
|
||||||
return []string{
|
|
||||||
"-c", "filter.git-crypt.smudge=cat",
|
|
||||||
"-c", "filter.git-crypt.clean=cat",
|
|
||||||
"-c", "filter.git-crypt.required=false",
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
|
|
||||||
func gitOutput(dir string, args ...string) (string, error) {
|
|
||||||
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
|
|
||||||
out, err := cmd.Output()
|
|
||||||
return strings.TrimSpace(string(out)), err
|
|
||||||
}
|
|
||||||
|
|
||||||
func gitRepoRoot(dir string) (string, error) {
|
|
||||||
return gitOutput(dir, "rev-parse", "--show-toplevel")
|
|
||||||
}
|
|
||||||
|
|
||||||
// gitRemotes lists configured remote names for the repo at dir.
|
|
||||||
func gitRemotes(dir string) ([]string, error) {
|
|
||||||
out, err := gitOutput(dir, "remote")
|
|
||||||
if err != nil {
|
|
||||||
return nil, err
|
|
||||||
}
|
|
||||||
if out == "" {
|
|
||||||
return nil, nil
|
|
||||||
}
|
|
||||||
return strings.Split(out, "\n"), nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
|
|
||||||
func isGitCryptRepo(repoRoot string) bool {
|
|
||||||
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
|
|
||||||
if err != nil {
|
|
||||||
return false
|
|
||||||
}
|
|
||||||
return hasGitCryptAttr(string(b))
|
|
||||||
}
|
|
||||||
|
|
||||||
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
|
|
||||||
// else nil. These are injected per-command and never persisted.
|
|
||||||
func cryptFlagsFor(repoRoot string) []string {
|
|
||||||
if isGitCryptRepo(repoRoot) {
|
|
||||||
return gitCryptFlags()
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
|
|
||||||
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
|
|
||||||
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
|
|
||||||
return runStreamingIn("", "git", full...)
|
|
||||||
}
|
|
||||||
|
|
||||||
// currentUser returns the OS username for branch naming (<user>/<topic>).
|
|
||||||
func currentUser() string {
|
|
||||||
if u := os.Getenv("USER"); u != "" {
|
|
||||||
return u
|
|
||||||
}
|
|
||||||
if u, err := user.Current(); err == nil && u.Username != "" {
|
|
||||||
return u.Username
|
|
||||||
}
|
|
||||||
return "user"
|
|
||||||
}
|
|
||||||
|
|
@ -1,37 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import "testing"
|
|
||||||
|
|
||||||
func TestPreferRemote(t *testing.T) {
|
|
||||||
cases := []struct {
|
|
||||||
in []string
|
|
||||||
want string
|
|
||||||
}{
|
|
||||||
{[]string{"origin", "forgejo"}, "forgejo"},
|
|
||||||
{[]string{"forgejo"}, "forgejo"},
|
|
||||||
{[]string{"origin"}, "origin"},
|
|
||||||
{[]string{"upstream"}, "upstream"},
|
|
||||||
{nil, ""},
|
|
||||||
}
|
|
||||||
for _, c := range cases {
|
|
||||||
if got := preferRemote(c.in); got != c.want {
|
|
||||||
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestHasGitCryptAttr(t *testing.T) {
|
|
||||||
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
|
|
||||||
t.Error("expected git-crypt detected")
|
|
||||||
}
|
|
||||||
if hasGitCryptAttr("*.md text\n*.png binary") {
|
|
||||||
t.Error("expected no git-crypt")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestGitCryptFlagsShape(t *testing.T) {
|
|
||||||
f := gitCryptFlags()
|
|
||||||
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
|
|
||||||
t.Fatalf("unexpected git-crypt flags: %v", f)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
23
cli/run.go
|
|
@ -1,23 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"os"
|
|
||||||
"os/exec"
|
|
||||||
)
|
|
||||||
|
|
||||||
// runStreaming executes name with args, wiring std streams to this process so
|
|
||||||
// the caller sees live output, and returns the command's error (non-nil on
|
|
||||||
// non-zero exit — preserved so homelab's own exit code reflects the child's).
|
|
||||||
func runStreaming(name string, args ...string) error {
|
|
||||||
return runStreamingIn("", name, args...)
|
|
||||||
}
|
|
||||||
|
|
||||||
// runStreamingIn is runStreaming with a working directory (empty = inherit).
|
|
||||||
func runStreamingIn(dir, name string, args ...string) error {
|
|
||||||
cmd := exec.Command(name, args...)
|
|
||||||
cmd.Dir = dir
|
|
||||||
cmd.Stdout = os.Stdout
|
|
||||||
cmd.Stderr = os.Stderr
|
|
||||||
cmd.Stdin = os.Stdin
|
|
||||||
return cmd.Run()
|
|
||||||
}
|
|
||||||
54
cli/stack.go
|
|
@ -1,54 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"fmt"
|
|
||||||
"os"
|
|
||||||
"path/filepath"
|
|
||||||
"sort"
|
|
||||||
"strings"
|
|
||||||
)
|
|
||||||
|
|
||||||
// findInfraRoot walks up from start to the infra repo root — the directory
|
|
||||||
// holding both terragrunt.hcl and a stacks/ directory.
|
|
||||||
func findInfraRoot(start string) (string, error) {
|
|
||||||
dir := start
|
|
||||||
for {
|
|
||||||
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
|
|
||||||
return dir, nil
|
|
||||||
}
|
|
||||||
parent := filepath.Dir(dir)
|
|
||||||
if parent == dir {
|
|
||||||
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
|
|
||||||
}
|
|
||||||
dir = parent
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
|
|
||||||
func resolveStack(infraRoot, name string) (string, error) {
|
|
||||||
dir := filepath.Join(infraRoot, "stacks", name)
|
|
||||||
if isDir(dir) {
|
|
||||||
return dir, nil
|
|
||||||
}
|
|
||||||
avail := listStacks(infraRoot)
|
|
||||||
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
|
|
||||||
}
|
|
||||||
|
|
||||||
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
|
|
||||||
func listStacks(infraRoot string) []string {
|
|
||||||
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
|
|
||||||
if err != nil {
|
|
||||||
return nil
|
|
||||||
}
|
|
||||||
var out []string
|
|
||||||
for _, e := range entries {
|
|
||||||
if e.IsDir() {
|
|
||||||
out = append(out, e.Name())
|
|
||||||
}
|
|
||||||
}
|
|
||||||
sort.Strings(out)
|
|
||||||
return out
|
|
||||||
}
|
|
||||||
|
|
||||||
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
|
|
||||||
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }
|
|
||||||
|
|
@ -1,52 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"os"
|
|
||||||
"path/filepath"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func newInfraTree(t *testing.T, stacks ...string) string {
|
|
||||||
t.Helper()
|
|
||||||
root := t.TempDir()
|
|
||||||
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
|
|
||||||
t.Fatal(err)
|
|
||||||
}
|
|
||||||
for _, s := range stacks {
|
|
||||||
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
|
|
||||||
t.Fatal(err)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return root
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestFindInfraRootWalksUp(t *testing.T) {
|
|
||||||
root := newInfraTree(t, "vault")
|
|
||||||
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("findInfraRoot error: %v", err)
|
|
||||||
}
|
|
||||||
if got != root {
|
|
||||||
t.Fatalf("findInfraRoot = %q, want %q", got, root)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
|
|
||||||
if _, err := findInfraRoot(t.TempDir()); err == nil {
|
|
||||||
t.Fatal("expected error outside an infra checkout")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestResolveStack(t *testing.T) {
|
|
||||||
root := newInfraTree(t, "vault", "monitoring")
|
|
||||||
dir, err := resolveStack(root, "vault")
|
|
||||||
if err != nil {
|
|
||||||
t.Fatalf("resolveStack error: %v", err)
|
|
||||||
}
|
|
||||||
if want := filepath.Join(root, "stacks", "vault"); dir != want {
|
|
||||||
t.Fatalf("resolveStack = %q, want %q", dir, want)
|
|
||||||
}
|
|
||||||
if _, err := resolveStack(root, "nonesuch"); err == nil {
|
|
||||||
t.Fatal("expected error for unknown stack")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,62 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"bytes"
|
|
||||||
"encoding/json"
|
|
||||||
"net/http"
|
|
||||||
"os"
|
|
||||||
"strconv"
|
|
||||||
"strings"
|
|
||||||
"time"
|
|
||||||
)
|
|
||||||
|
|
||||||
// usageJob is the Loki stream job label for homelab usage telemetry.
|
|
||||||
const usageJob = "homelab-usage"
|
|
||||||
|
|
||||||
// emitUsage best-effort records one verb invocation to Loki for cross-user
|
|
||||||
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
|
|
||||||
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
|
|
||||||
// never affect the command: all errors are swallowed and a tight timeout bounds
|
|
||||||
// the cost. Opt out with HOMELAB_TELEMETRY=0.
|
|
||||||
func emitUsage(verb string, runErr error) {
|
|
||||||
switch os.Getenv("HOMELAB_TELEMETRY") {
|
|
||||||
case "0", "off", "false", "no":
|
|
||||||
return
|
|
||||||
}
|
|
||||||
if verb == "" || strings.HasPrefix(verb, "usage") {
|
|
||||||
return // don't self-record the analytics reader
|
|
||||||
}
|
|
||||||
exit := 0
|
|
||||||
if runErr != nil {
|
|
||||||
exit = 1
|
|
||||||
}
|
|
||||||
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
|
|
||||||
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
|
|
||||||
Values: [][2]string{{
|
|
||||||
strconv.FormatInt(time.Now().UnixNano(), 10),
|
|
||||||
"exit=" + strconv.Itoa(exit) + " ver=" + version,
|
|
||||||
}},
|
|
||||||
}}})
|
|
||||||
if err != nil {
|
|
||||||
return
|
|
||||||
}
|
|
||||||
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
|
|
||||||
if err != nil {
|
|
||||||
return
|
|
||||||
}
|
|
||||||
req.Header.Set("Content-Type", "application/json")
|
|
||||||
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
|
|
||||||
if err != nil {
|
|
||||||
return
|
|
||||||
}
|
|
||||||
resp.Body.Close()
|
|
||||||
}
|
|
||||||
|
|
||||||
type lokiPush struct {
|
|
||||||
Streams []lokiStream `json:"streams"`
|
|
||||||
}
|
|
||||||
|
|
||||||
type lokiStream struct {
|
|
||||||
Stream map[string]string `json:"stream"`
|
|
||||||
Values [][2]string `json:"values"`
|
|
||||||
}
|
|
||||||
|
|
@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
|
||||||
if err != nil {
|
if err != nil {
|
||||||
return errors.Wrapf(err, "Error reading response")
|
return errors.Wrapf(err, "Error reading response")
|
||||||
}
|
}
|
||||||
glog.Infof("Response: %s", string(responseBody))
|
glog.Infof("Response:", string(responseBody))
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -1,18 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"strings"
|
|
||||||
"testing"
|
|
||||||
)
|
|
||||||
|
|
||||||
func TestUsageQuery(t *testing.T) {
|
|
||||||
got := usageQuery("30d", "")
|
|
||||||
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
|
|
||||||
if got != want {
|
|
||||||
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
|
|
||||||
}
|
|
||||||
withUser := usageQuery("7d", "emo")
|
|
||||||
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
|
|
||||||
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,191 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import (
|
|
||||||
"context"
|
|
||||||
"encoding/json"
|
|
||||||
"fmt"
|
|
||||||
"io"
|
|
||||||
"net"
|
|
||||||
"net/http"
|
|
||||||
"os"
|
|
||||||
"os/exec"
|
|
||||||
"strings"
|
|
||||||
"time"
|
|
||||||
)
|
|
||||||
|
|
||||||
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
|
|
||||||
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
|
|
||||||
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
|
|
||||||
const (
|
|
||||||
wpHost = "ci.viktorbarzin.me"
|
|
||||||
wpLBIP = "10.0.20.203"
|
|
||||||
)
|
|
||||||
|
|
||||||
type wpClient struct {
|
|
||||||
base string
|
|
||||||
token string
|
|
||||||
http *http.Client
|
|
||||||
}
|
|
||||||
|
|
||||||
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
|
|
||||||
func wpToken() string {
|
|
||||||
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
|
|
||||||
return t
|
|
||||||
}
|
|
||||||
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
|
|
||||||
if err != nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
return strings.TrimSpace(string(out))
|
|
||||||
}
|
|
||||||
|
|
||||||
func newWPClient() (*wpClient, error) {
|
|
||||||
tok := wpToken()
|
|
||||||
if tok == "" {
|
|
||||||
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
|
|
||||||
}
|
|
||||||
ip := firstEnv("HOMELAB_WP_IP")
|
|
||||||
if ip == "" {
|
|
||||||
ip = wpLBIP
|
|
||||||
}
|
|
||||||
dialer := &net.Dialer{Timeout: 8 * time.Second}
|
|
||||||
tr := &http.Transport{
|
|
||||||
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
|
|
||||||
if strings.HasPrefix(addr, wpHost+":") {
|
|
||||||
addr = ip + addr[strings.LastIndex(addr, ":"):]
|
|
||||||
}
|
|
||||||
return dialer.DialContext(ctx, network, addr)
|
|
||||||
},
|
|
||||||
}
|
|
||||||
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// getJSON GETs path into v, retrying the transient empty/5xx responses the
|
|
||||||
// Woodpecker API intermittently returns under load.
|
|
||||||
func (c *wpClient) getJSON(path string, v interface{}) error {
|
|
||||||
var lastErr error
|
|
||||||
for attempt := 0; attempt < 5; attempt++ {
|
|
||||||
if attempt > 0 {
|
|
||||||
time.Sleep(2 * time.Second)
|
|
||||||
}
|
|
||||||
req, _ := http.NewRequest("GET", c.base+path, nil)
|
|
||||||
req.Header.Set("Authorization", "Bearer "+c.token)
|
|
||||||
resp, err := c.http.Do(req)
|
|
||||||
if err != nil {
|
|
||||||
lastErr = err
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
body, _ := io.ReadAll(resp.Body)
|
|
||||||
resp.Body.Close()
|
|
||||||
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
|
|
||||||
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
if resp.StatusCode >= 300 {
|
|
||||||
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
|
|
||||||
}
|
|
||||||
return json.Unmarshal(body, v)
|
|
||||||
}
|
|
||||||
return lastErr
|
|
||||||
}
|
|
||||||
|
|
||||||
type wpPipeline struct {
|
|
||||||
Number int `json:"number"`
|
|
||||||
Status string `json:"status"`
|
|
||||||
Event string `json:"event"`
|
|
||||||
Commit string `json:"commit"`
|
|
||||||
Message string `json:"message"`
|
|
||||||
}
|
|
||||||
|
|
||||||
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
|
|
||||||
var ps []wpPipeline
|
|
||||||
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
|
|
||||||
return ps, err
|
|
||||||
}
|
|
||||||
|
|
||||||
// findPipeline returns the pipeline for commit (prefix match), or the latest when
|
|
||||||
// commit is empty.
|
|
||||||
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
|
|
||||||
ps, err := c.recentPipelines(repoID, 25)
|
|
||||||
if err != nil {
|
|
||||||
return wpPipeline{}, err
|
|
||||||
}
|
|
||||||
if len(ps) == 0 {
|
|
||||||
return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
|
|
||||||
}
|
|
||||||
if commit == "" {
|
|
||||||
return ps[0], nil
|
|
||||||
}
|
|
||||||
for _, p := range ps {
|
|
||||||
if strings.HasPrefix(p.Commit, commit) {
|
|
||||||
return p, nil
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
|
|
||||||
}
|
|
||||||
|
|
||||||
func (c *wpClient) repoID() (int, error) {
|
|
||||||
owner, repo, err := repoOwnerName()
|
|
||||||
if err != nil {
|
|
||||||
return 0, err
|
|
||||||
}
|
|
||||||
var r struct {
|
|
||||||
ID int `json:"id"`
|
|
||||||
}
|
|
||||||
if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
|
|
||||||
return 0, err
|
|
||||||
}
|
|
||||||
if r.ID == 0 {
|
|
||||||
return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
|
|
||||||
}
|
|
||||||
return r.ID, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// repoOwnerName derives <owner>/<repo> from the cwd git remote.
|
|
||||||
func repoOwnerName() (string, string, error) {
|
|
||||||
cwd, _ := os.Getwd()
|
|
||||||
root, err := gitRepoRoot(cwd)
|
|
||||||
if err != nil {
|
|
||||||
return "", "", fmt.Errorf("not in a git repository: %w", err)
|
|
||||||
}
|
|
||||||
remote := preferRemote(remotesOrEmpty(root))
|
|
||||||
url, err := gitOutput(root, "remote", "get-url", remote)
|
|
||||||
if err != nil {
|
|
||||||
return "", "", err
|
|
||||||
}
|
|
||||||
return parseOwnerRepo(url)
|
|
||||||
}
|
|
||||||
|
|
||||||
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
|
|
||||||
func parseOwnerRepo(url string) (string, string, error) {
|
|
||||||
u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
|
|
||||||
u = strings.TrimSuffix(u, "/")
|
|
||||||
if i := strings.Index(u, "://"); i >= 0 {
|
|
||||||
u = u[i+3:]
|
|
||||||
}
|
|
||||||
u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
|
|
||||||
parts := strings.Split(u, "/")
|
|
||||||
if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
|
|
||||||
return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
|
|
||||||
}
|
|
||||||
return parts[len(parts)-2], parts[len(parts)-1], nil
|
|
||||||
}
|
|
||||||
|
|
||||||
func isTerminalStatus(s string) bool {
|
|
||||||
switch s {
|
|
||||||
case "success", "failure", "error", "killed", "declined", "blocked":
|
|
||||||
return true
|
|
||||||
}
|
|
||||||
return false
|
|
||||||
}
|
|
||||||
|
|
||||||
func isFailureStatus(s string) bool {
|
|
||||||
return s == "failure" || s == "error" || s == "killed" || s == "declined"
|
|
||||||
}
|
|
||||||
|
|
||||||
func min(a, b int) int {
|
|
||||||
if a < b {
|
|
||||||
return a
|
|
||||||
}
|
|
||||||
return b
|
|
||||||
}
|
|
||||||
|
|
@ -1,40 +0,0 @@
|
||||||
package main
|
|
||||||
|
|
||||||
import "testing"
|
|
||||||
|
|
||||||
func TestParseOwnerRepo(t *testing.T) {
|
|
||||||
cases := []struct{ in, owner, repo string }{
|
|
||||||
{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
|
|
||||||
{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
|
|
||||||
{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
|
|
||||||
{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
|
|
||||||
}
|
|
||||||
for _, c := range cases {
|
|
||||||
o, r, err := parseOwnerRepo(c.in)
|
|
||||||
if err != nil || o != c.owner || r != c.repo {
|
|
||||||
t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if _, _, err := parseOwnerRepo("nonsense"); err == nil {
|
|
||||||
t.Error("expected error for unparseable remote")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
func TestStatusClassification(t *testing.T) {
|
|
||||||
for _, s := range []string{"success", "failure", "error", "killed"} {
|
|
||||||
if !isTerminalStatus(s) {
|
|
||||||
t.Errorf("%q should be terminal", s)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
for _, s := range []string{"running", "pending"} {
|
|
||||||
if isTerminalStatus(s) {
|
|
||||||
t.Errorf("%q should not be terminal", s)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if !isFailureStatus("failure") || !isFailureStatus("error") {
|
|
||||||
t.Error("failure/error should classify as failure")
|
|
||||||
}
|
|
||||||
if isFailureStatus("success") {
|
|
||||||
t.Error("success must not classify as failure")
|
|
||||||
}
|
|
||||||
}
|
|
||||||
BIN
config.tfvars
BIN
config.tfvars
Binary file not shown.
|
|
@ -80,6 +80,8 @@ def sofia():
|
||||||
pfsense >> k8s_switch
|
pfsense >> k8s_switch
|
||||||
with Cluster('Management Network'):
|
with Cluster('Management Network'):
|
||||||
mgt_switch = Switch()
|
mgt_switch = Switch()
|
||||||
|
# Truenas
|
||||||
|
truenas = Storage("Truenas")
|
||||||
# pxe server
|
# pxe server
|
||||||
pxe_server = Rack("PXE Server")
|
pxe_server = Rack("PXE Server")
|
||||||
# HA
|
# HA
|
||||||
|
|
@ -89,6 +91,7 @@ def sofia():
|
||||||
devvm_vpn_client = VPN("Tailscale Client")
|
devvm_vpn_client = VPN("Tailscale Client")
|
||||||
vpn_clients["devvm"] = devvm_vpn_client
|
vpn_clients["devvm"] = devvm_vpn_client
|
||||||
|
|
||||||
|
mgt_switch >> truenas
|
||||||
mgt_switch >> pxe_server
|
mgt_switch >> pxe_server
|
||||||
mgt_switch >> home_assistant
|
mgt_switch >> home_assistant
|
||||||
mgt_switch >> devvm
|
mgt_switch >> devvm
|
||||||
|
|
|
||||||
|
|
@ -20,7 +20,7 @@ This repository contains the configuration and documentation for a homelab Kuber
|
||||||
| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog |
|
| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog |
|
||||||
| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules |
|
| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules |
|
||||||
| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration |
|
| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration |
|
||||||
| [Storage](architecture/storage.md) | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
|
| [Storage](architecture/storage.md) | TrueNAS NFS, democratic-csi, and persistent volume management |
|
||||||
| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration |
|
| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration |
|
||||||
| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls |
|
| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls |
|
||||||
| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack |
|
| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack |
|
||||||
|
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
---
|
|
||||||
status: accepted
|
|
||||||
---
|
|
||||||
|
|
||||||
# The Android testing environment is a privileged KVM emulator pod in-cluster
|
|
||||||
|
|
||||||
Viktor's apps are growing Android clients (first: tripit's Capacitor shell —
|
|
||||||
see tripit ADR-0013/0014), and agents need a native Android instance to test
|
|
||||||
changes against before shipping. All K8s nodes already run with CPU type
|
|
||||||
`host`, so `/dev/kvm` works inside the cluster.
|
|
||||||
|
|
||||||
Decision (2026-06-11): one shared **Android 16 (API 36) Google-emulator
|
|
||||||
instance** runs as a privileged pod in namespace `android-emulator`
|
|
||||||
(stack `stacks/android-emulator`), with `/dev/kvm` via hostPath, adb exposed
|
|
||||||
LAN-only on the shared MetalLB IP (10.0.20.200:5555), and a noVNC screen view
|
|
||||||
at android-emulator.viktorbarzin.lan. The SDK/system-image/AVD live on a PVC;
|
|
||||||
the image is a slim manually-built shell.
|
|
||||||
|
|
||||||
## Considered options
|
|
||||||
|
|
||||||
- **devvm-local docker emulator** — rejected as the durable home: shared
|
|
||||||
24GB workstation, ~13GB free disk, per-machine, not shared across agents.
|
|
||||||
- **Dedicated Proxmox VM** — rejected: burns scarce PVE host headroom 24/7
|
|
||||||
and adds a whole VM lifecycle for one emulator.
|
|
||||||
- **redroid (container-native Android)** — rejected: requires binder kernel
|
|
||||||
modules on every node (documented binderfs incompatibilities), max
|
|
||||||
Android 15; most invasive for the least version coverage.
|
|
||||||
- **budtmo/docker-android** — rejected: turnkey but capped at Android 14;
|
|
||||||
the native features driving the Android work (Live Updates, background
|
|
||||||
GPS) are Android 16 behaviors, matching the real target device.
|
|
||||||
- **/dev/kvm device plugin instead of privileged** — deferred: a new
|
|
||||||
cluster component to avoid one namespace-scoped exclude-list entry; the
|
|
||||||
exclude pattern (kured/woodpecker/frigate/changedetection) already exists.
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
- `android-emulator` joins the Kyverno `security_policy_exclude_namespaces`
|
|
||||||
list (privileged allowed; registry policy also bypassed in-namespace).
|
|
||||||
- adb is unauthenticated by design — the LB IP must remain LAN-only.
|
|
||||||
- Single shared instance: concurrent agent sessions share Android state;
|
|
||||||
long destructive work should presence-claim `service:android-emulator`.
|
|
||||||
- Rendering is swiftshader (CPU) — the contended T4 stays out of the path.
|
|
||||||
|
|
@ -1,24 +0,0 @@
|
||||||
---
|
|
||||||
status: accepted
|
|
||||||
date: 2026-06-12
|
|
||||||
---
|
|
||||||
|
|
||||||
# All owned images build off-infra on GitHub Actions and live on ghcr.io
|
|
||||||
|
|
||||||
In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it.
|
|
||||||
|
|
||||||
## Considered options
|
|
||||||
|
|
||||||
- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected.
|
|
||||||
- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes.
|
|
||||||
- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected.
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster").
|
|
||||||
- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally.
|
|
||||||
- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source).
|
|
||||||
- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro).
|
|
||||||
- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it.
|
|
||||||
- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern.
|
|
||||||
- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup.
|
|
||||||
|
|
@ -1,30 +0,0 @@
|
||||||
# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
|
|
||||||
|
|
||||||
Status: accepted (extends ADR-0002)
|
|
||||||
|
|
||||||
## Context
|
|
||||||
|
|
||||||
Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
|
|
||||||
|
|
||||||
The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
|
|
||||||
|
|
||||||
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
|
|
||||||
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
|
|
||||||
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
|
|
||||||
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
|
|
||||||
|
|
||||||
## Considered options
|
|
||||||
|
|
||||||
- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
|
|
||||||
- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
- Divergence becomes structurally impossible — one push target per repo.
|
|
||||||
- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
|
|
||||||
- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
|
|
||||||
- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.
|
|
||||||
|
|
@ -1,30 +0,0 @@
|
||||||
# homelab: a unified infra-ops CLI grown in place from infra/cli
|
|
||||||
|
|
||||||
Agents re-derive the same operational command boilerplate every session — mining
|
|
||||||
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
|
|
||||||
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
|
|
||||||
the deterministic, repeated **actions** (not judgment) agents run — composable in
|
|
||||||
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
|
|
||||||
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
|
|
||||||
alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
|
|
||||||
file (the infra repo deploys continuously and does not cut semver tags).
|
|
||||||
|
|
||||||
## Considered options
|
|
||||||
|
|
||||||
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
|
|
||||||
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
|
|
||||||
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
|
|
||||||
GitOps continuous-deploy.
|
|
||||||
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
|
|
||||||
webhook use-cases.
|
|
||||||
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
|
|
||||||
recurring action surface (methodology skills; third-party/owned MCP such as
|
|
||||||
phpIPAM, which homelab does NOT duplicate).
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
|
|
||||||
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
|
|
||||||
and falls through to the legacy `-use-case` path verbatim.
|
|
||||||
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
|
|
||||||
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.
|
|
||||||
|
|
@ -1,23 +0,0 @@
|
||||||
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
|
|
||||||
|
|
||||||
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
|
|
||||||
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
|
|
||||||
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
|
|
||||||
commands and where agents lose the most time and leak the most presence claims.
|
|
||||||
|
|
||||||
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
|
|
||||||
relying on existing gates (harness permission mode, presence claims, plan
|
|
||||||
approval). But every verb records a `read|write` tier (visible in `manifest`), so
|
|
||||||
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
|
|
||||||
later with zero restructuring.
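
A minimal sketch of how a recorded tier could later drive such a classifier hook. The `verbTiers` registry, the `autoAllow` helper, and the individual tier assignments are hypothetical illustrations, not the CLI's actual code; only the verb names and the `read|write` split come from the text above.

```go
package main

import "fmt"

// Tier marks whether a verb only reads state or may change it.
type Tier string

const (
	TierRead  Tier = "read"
	TierWrite Tier = "write"
)

// verbTiers is a hypothetical registry. The tier assigned to each verb here
// is an illustrative guess, not taken from the real manifest.
var verbTiers = map[string]Tier{
	"work start":  TierWrite,
	"work land":   TierWrite,
	"tf validate": TierRead,
	"tf apply":    TierWrite,
	"claim":       TierWrite,
	"release":     TierWrite,
}

// autoAllow reports whether a classifier hook could skip the prompt.
// Only verbs registered as read-tier are auto-allowed; unknown verbs are not.
func autoAllow(verb string) bool {
	return verbTiers[verb] == TierRead
}

func main() {
	for _, v := range []string{"tf validate", "tf apply", "unknown"} {
		fmt.Printf("%-12s auto-allow=%v\n", v, autoAllow(v))
	}
}
```

Treating unknown verbs as not auto-allowed keeps the hook fail-closed if a new verb ships without a tier.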
|
|
||||||
|
|
||||||
## Considered options
|
|
||||||
|
|
||||||
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
|
|
||||||
value, but defers the toil that motivated the project.
|
|
||||||
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
|
|
||||||
|
|
||||||
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
|
|
||||||
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
|
|
||||||
coupling, branch-protection PR fallback) for the biggest immediate toil
|
|
||||||
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.
|
|
||||||
|
|
@ -1,29 +0,0 @@
|
||||||
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
|
|
||||||
|
|
||||||
Four behaviours of the infra-loop verbs are surprising enough to record:
|
|
||||||
|
|
||||||
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
|
|
||||||
native harness worktree tool.** A CLI is a child process and cannot change the
|
|
||||||
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
|
|
||||||
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
|
|
||||||
prints the path — the agent enters it with native `EnterWorktree({path})`.
|
|
||||||
|
|
||||||
2. **`work land` is auto-land, but gated on verification.** It merges master in →
|
|
||||||
runs verification → pushes `HEAD:master` (fetch+merge+retry on
|
|
||||||
non-fast-forward) → falls back to pushing the feature branch for a PR when the
|
|
||||||
direct push is rejected (branch protection). It **refuses to push when it
|
|
||||||
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
|
|
||||||
`--no-verify` is passed — added after an accidental smoke-test land pushed
|
|
||||||
unverified WIP to master (benign: the infra CI applied 0 stacks because the
|
|
||||||
diff was `cli/`-only, but an unverified land must be deliberate, not default).
|
|
||||||
|
|
||||||
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
|
|
||||||
Local applies are out-of-band (CI applies canonically on push) but happen
|
|
||||||
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
|
|
||||||
delegates to `scripts/tg apply --non-interactive`, and **always releases on
|
|
||||||
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
|
|
||||||
documented ~200-claim leak — and prints an out-of-band reminder.
|
|
||||||
|
|
||||||
4. **Known v0.1 limitation:** `work land` does not yet wait for CI to go green; that
|
|
||||||
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
|
|
||||||
the pipeline manually.
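
The release-on-every-exit behaviour in item 3 can be sketched as below. This is a minimal illustration assuming a `sync.Once` guard plus a signal handler, as the text describes; `claimGuard`, `withClaim`, and the callback shapes are hypothetical names, not the real implementation.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// claimGuard releases a presence claim exactly once, however many exit
// paths call Release.
type claimGuard struct {
	once    sync.Once
	release func()
}

func (g *claimGuard) Release() { g.once.Do(g.release) }

// withClaim runs fn while holding the claim and releases it on normal
// return, on error, and on SIGINT/SIGTERM.
func withClaim(claim, release func(), fn func() error) error {
	claim()
	g := &claimGuard{release: release}
	defer g.Release()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
	defer signal.Stop(sigs)

	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-sigs:
			g.Release() // release before exiting on a signal
			os.Exit(130)
		case <-done: // fn returned; the deferred Release handles it
		}
	}()
	return fn()
}

func main() {
	released := 0
	err := withClaim(
		func() { fmt.Println("claim stack:vault") },
		func() { released++; fmt.Println("release stack:vault") },
		func() error { return fmt.Errorf("apply failed") },
	)
	fmt.Println("err:", err, "released:", released)
}
```

Because both the deferred call and the signal goroutine go through the same `sync.Once`, the release callback cannot run twice even if a signal arrives while the function is returning.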
|
|
||||||
|
|
@ -1,30 +0,0 @@
|
||||||
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
|
|
||||||
|
|
||||||
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
|
|
||||||
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
|
|
||||||
than every other domain combined).
|
|
||||||
|
|
||||||
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
|
|
||||||
one app, so `<app>` defaults to the namespace, and the target defaults to
|
|
||||||
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
|
|
||||||
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
|
|
||||||
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
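
A minimal sketch of the defaulting rules described above, assuming only the `-n` and `--pod` overrides. The `target` type and `resolve` function are hypothetical names for illustration; the real resolver also handles `-c`, `-l`, and `--tty`.

```go
package main

import "fmt"

// target is what the resolver would hand to kubectl.
type target struct {
	Namespace string
	Ref       string // e.g. "deploy/vault" or "pod/mysql-standalone-0"
}

// resolve applies the defaults: namespace = app and ref = deploy/<app>,
// unless a namespace or pod override is given.
func resolve(app, nsOverride, podOverride string) (target, error) {
	if app == "" {
		return target{}, fmt.Errorf("app name is required")
	}
	t := target{Namespace: app, Ref: "deploy/" + app}
	if nsOverride != "" {
		t.Namespace = nsOverride
	}
	if podOverride != "" {
		t.Ref = "pod/" + podOverride
	}
	return t, nil
}

func main() {
	a, _ := resolve("vault", "", "")
	b, _ := resolve("mysql", "dbaas", "mysql-standalone-0")
	fmt.Printf("%+v\n%+v\n", a, b)
}
```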

Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.

## Decisions worth recording

- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
  `scale`/`create`). They stay raw `kubectl`, by design, per the repo's
  Terraform-only policy — the corpus confirms they're low-frequency, and a
  friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
  config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
  sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
  `psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
  `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
  the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
  unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.
@ -1,30 +0,0 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path

v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).

Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.

## Deprecation path (deliberate follow-up — NOT done in v0.3)

The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:

1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
   to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.

This is done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.
@ -1,29 +0,0 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration

v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).

## Decisions

- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
  not its Postgres schema (which drifts across upgrades — column renames bit us
  mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
  while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
  equivalent of the house `curl --resolve` pattern). Token from
  `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
  git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
  under load (it flapped through the whole build session); `getJSON` retries
  empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
  on the landed commit and fails if the pipeline does — closing the gap ADR-0005
  deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
  the deployment image to reference the expected sha, *then* blocks on rollout
  status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
  endpoints were the least reliable this session (often empty); `status`/`watch`
  rely on the list endpoint that works. A DB-backed `ci logs` is a possible
  follow-up if the API path stays flaky.
@ -1,37 +0,0 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value

v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)

## Decisions

- **Internal ingresses, reached via the LB.** Everything routes through the
  Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
  Go form of the house `curl --resolve host:443:10.0.20.203` pattern
  (`probe.go: clientDialingIP`). Verified live before building: Prometheus
  (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
  answer JSON over the LB with **no auth gate and no port-forward** — so these
  stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
  (→ Cloudflare) AND dials the internal LB, reporting both — because the useful
  question is *where* a break is (CF edge vs the app vs the LB path), which a
  single curl can't answer. The external leg forces public resolution (the devvm
  resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
  `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
  Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
  alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
  queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
  raw `*.svc` services) that would force port-forward/`kubectl run`. The
  reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
  command-wrappers (low savings); only compound node ops (serial console, VM
  wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
  unless a concrete pain surfaces — the high-value deterministic surface
  (tf/work/ci/k8s/memory + these probes) is now covered.
@ -1,34 +0,0 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction

v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).

## Decisions

- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
  the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
  emit sees only the matched verb name, not the arguments. This is what makes
  cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
  CLI writes its own invocations (attributed to its OS user) to the shared Loki
  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
  back with a LogQL metric query. This is the privacy-preserving resolution to
  "what does everyone (e.g. another user) use" — it never touches anyone's
  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
  managed-settings; reading another user's home is off-limits even for an owner
  in-session — a fresh session under changed MDM policy is the only legitimate
  path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
  must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
  path (same host, same LB dial). Presence MySQL was the alternative (queryable
  SQL) but would add a write dependency and creds; Loki needs neither.