Compare commits

..

6 commits

Author SHA1 Message Date
Viktor Barzin
731de63150 fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:00:21 +00:00
Viktor Barzin
9ce9a9a7f7 Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.
2026-04-17 19:52:36 +00:00
Viktor Barzin
277babc696 [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
  distinct cert material but equivalent coverage.

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (now matched by the new rule, though for the symlink target the
   repo-root rule already applied)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:49:09 +00:00
Viktor Barzin
d91fbd4a60 [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:42:55 +00:00
Viktor Barzin
e81e836d3a [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:41:08 +00:00
Viktor Barzin
d3be9b50af [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:39:35 +00:00
997 changed files with 24656 additions and 118228 deletions

File diff suppressed because one or more lines are too long

View file

@ -1,543 +0,0 @@
---
name: k8s-version-upgrade-DEPRECATED
description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---
# DEPRECATED — Do NOT invoke this agent
Retired **2026-05-11** after a self-preemption incident: this agent ran inside
the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
workers at v1.34.2).
## Replaced by
A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
preempt itself because each Job's pod and its target node are always
different.
| Old | New |
|-----|-----|
| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |
## Where the logic lives now
- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
every Job pod.
- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
stuck Job, skip a phase, manually re-trigger from a specific phase).
## Why kept (not deleted)
Documents the prompted-agent design and is useful as historical reference when
reading post-mortem discussions or comparing approaches. The `name` field has
been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
`claude-agent-service`.
---
# Original prompt — DO NOT EXECUTE (reference only)
You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
## Your Job
Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
## Inputs
The user prompt contains a JSON object with these fields:
```json
{
"target_version": "1.34.5",
"kind": "patch",
"dry_run": false,
"stages": "all"
}
```
| Field | Required | Description |
|---|---|---|
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload").
## Environment
- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role <role> --release <X.Y.Z>`.
### Credentials — fetched at startup
The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
```bash
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
# SSH private key — mode 0400 required by openssh
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key
# Slack webhook (URL string)
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.slack_webhook}' | base64 -d)
```
The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
```bash
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
```
Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
## NEVER do
- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
- Never skip the etcd snapshot — even for patch
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
- Never run two stages in parallel — sequential only
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
## Slack + Pushgateway helpers
Every transition posts to Slack:
```bash
slack() {
local msg="$1"
local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
curl -sS -X POST -H 'Content-Type: application/json' \
--data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
"$hook"
}
```
Start every message with `[k8s-upgrade]` so it's grep-able.
Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
```bash
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
push_metric() {
# push_metric <name> <value>
local name="$1" val="$2"
printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
| curl -sS --data-binary @- "$PG"
}
```
Pushes you must make at specific stages (skipped in dry_run):
| When | Metric | Value |
|---|---|---|
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
## Stage 0: Parse inputs + announce
1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
2. Derive `target_minor` from `target_version` (split on `.`).
3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
```bash
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
viktorbarzin.me/k8s-upgrade-target="$target_version" \
--overwrite
push_metric k8s_upgrade_in_flight 1
push_metric k8s_upgrade_snapshot_taken 0
fi
```
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
## Stage 1: Pre-flight (`stages` includes `preflight`)
Skip if `stages` excludes `preflight`.
### Check 1.1 — All nodes Ready, no pressure
```bash
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
| jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
```
Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
### Check 1.2 — Halt-on-alert (same query kured uses)
```bash
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
if [ -n "$ALERTS" ]; then
slack "ABORT preflight — firing alerts:\n$ALERTS"
exit 1
fi
```
### Check 1.3 — 24h-quiet baseline
Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
```bash
RECENT_REBOOT=0
while IFS= read -r ts; do
[ -z "$ts" ] && continue
diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
[ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$RECENT_REBOOT" -eq 1 ]; then
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
exit 1
fi
```
### Check 1.4 — kubeadm upgrade plan reports our target
```bash
PLAN_TARGET=$($SSH \
wizard@k8s-master 'sudo kubeadm upgrade plan' \
| grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \
| grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
```
If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort:
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.
```bash
JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"
if [ "$dry_run" = "false" ]; then
$KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"
# Wait up to 10 min for snapshot Job to complete
$KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
$KUBECTL -n default describe "job/$JOB_NAME" | tail -30
exit 1
}
# Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
echo "$LOG"
SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')
if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
exit 1
fi
TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
$KUBECTL annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
push_metric k8s_upgrade_snapshot_taken 1
else
TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
SIZE="dry-run"
fi
slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
```
## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
Only run if master containerd version < highest worker containerd version.
```bash
get_ctr_version() {
$SSH \
"wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
}
MASTER_CTR=$(get_ctr_version k8s-master)
WORKER_MAX="0.0.0"
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
v=$(get_ctr_version "$n")
# Compare semver-ish
if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
WORKER_MAX="$v"
fi
done
if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
&& [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
# Master is behind — bump
slack "Master containerd $MASTER_CTR < workers $WORKER_MAX bumping master"
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master "sudo apt-mark unhold containerd.io \
&& sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
&& sudo apt-mark hold containerd.io \
&& sudo systemctl restart containerd"
# Wait until kubelet on master is Ready again
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
[ "$STATUS" = "True" ] && break
sleep 10
done
[ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
fi
slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
else
echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
fi
```
## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
Only run if `kind=minor`.
For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
```bash
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
if [ "$dry_run" = "false" ]; then
$SSH \
"wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
&& sudo apt-get update"
fi
```
Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
## Stage 5: Master upgrade (`stages` includes `master`)
```bash
# 5.1 Drain
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
fi
# 5.2 Run the library script via SSH pipe
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role master --release "$target_version"
fi
# 5.3 Uncordon + wait Ready
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
fi
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
# 5.4 All control-plane pods Running
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
-l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
# 5.5 Re-check halt-on-alert
# (re-run the Check 1.2 query, abort if anything new fires)
slack "Master upgrade complete. Cluster on v$target_version. Healthy."
```
## Stage 6: Workers sequentially (`stages` includes `workers`)
Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).
For each worker `$node`:
1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
4. `kubectl uncordon $node`
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
7. Slack: `Worker $node complete ($i/4)`.
```bash
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
i=0
for node in $WORKERS; do
i=$((i+1))
# Halt-on-alert recheck with retry
for attempt in $(seq 1 30); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -z "$ALERTS" ] && break
echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
sleep 60
done
[ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
$SSH \
"wizard@$node" 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role worker --release "$target_version"
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
fi
# Wait Ready + version match
for w in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
# 10-min soak with halt-on-alert
echo "Soaking $node for 10 min..."
for sec in $(seq 1 10); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
| sort -u)
[ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
sleep 60
done
slack "Worker $node upgrade complete ($i/4). Soaked clean."
done
```
Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts).
## Stage 7: Post-flight (`stages` includes `postflight`)
```bash
# All 5 nodes at target
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
-o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
echo "$VERSIONS"
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; }
# Upgrade Gates all inactive
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING"
# pod-ready ratio >= 0.9
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
| jq -r '.data.result[0].value[1] // "0"')
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
# Clear the in-flight annotation + Pushgateway gauges
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path- || true
push_metric k8s_upgrade_in_flight 0
push_metric k8s_upgrade_snapshot_taken 0
fi
slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
```
## Rollback
This agent does NOT auto-rollback. If anything aborts mid-flight:
1. Slack the failure with the last known stage + node.
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
## Notes for tests
- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
## Edge cases
- **Slack down**: Don't block the upgrade — continue, log to stderr.
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
## Verification claims you must make
When you `slack` a SUCCESS message, you must have actually verified:
- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
- No alerts firing outside the ignore-list
- pod-ready ratio computed from Prometheus
Do not declare success without those three confirmations.

View file

@ -1,194 +0,0 @@
---
name: payslip-extractor
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
model: haiku
allowedTools:
- Bash
- Read
---
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
## Your single job
Given a prompt that contains EITHER:
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
## RSU handling (important — Meta UK payslips)
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
If the payslip has no stock component, leave both as 0.
## Earnings decomposition (v2)
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20``600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
## Fast path: PAYSLIP_TEXT is present
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
## Processing steps
### Step 1. Extract and decode the base64 PDF
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
Preferred method (handles whitespace and very long blobs robustly):
```bash
python3 - <<'PY'
import base64, re, pathlib, sys, os
prompt = os.environ.get("PAYSLIP_PROMPT", "")
# If the orchestrator didn't set an env var, fall back to reading the transcript via CWD stdin mechanism.
# In practice the agent receives the prompt in its conversation — you extract the PDF_BASE64 value
# from the prompt text you were given, strip whitespace, and base64-decode.
PY
```
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
```bash
python3 -c "
import base64, sys
data = sys.stdin.read().strip()
open('/tmp/payslip.pdf','wb').write(base64.b64decode(data))
print('decoded bytes:', len(base64.b64decode(data)))
" <<'B64'
<paste-the-base64-here>
B64
```
Or pipe via shell `base64 -d`:
```bash
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
```
Verify the file looks like a PDF:
```bash
head -c 8 /tmp/payslip.pdf | xxd
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
```
### Step 2. Extract text from the PDF
Try tools in this order. Use the first one that works; do not chain all of them.
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
```bash
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
```
2. Python `pypdf` fallback:
```bash
python3 -c "
from pypdf import PdfReader
r = PdfReader('/tmp/payslip.pdf')
for p in r.pages:
print(p.extract_text() or '')
"
```
3. Python `pdfplumber` fallback:
```bash
python3 -c "
import pdfplumber
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
for page in pdf.pages:
print(page.extract_text() or '')
"
```
4. If none of those are installed, check what IS available:
```bash
which pdftotext pdf2txt.py mutool
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
```
and use whatever you find (e.g. `mutool draw -F txt`).
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
### Step 3. Parse the extracted text
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
- "Gross Pay" / "Total Gross" — sum of payments.
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
### Step 4. Map to the schema and emit JSON
Rules that apply regardless of the caller's exact schema:
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`.
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
## Failure mode
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
```json
{"error": "<short human reason>"}
```
Examples of acceptable error reasons:
- `"base64 did not decode to a valid PDF"`
- `"pdf has no extractable text layer (image-only scan)"`
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
- `"document does not appear to be a UK payslip"`
- `"pay_date not found on document"`
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
## Hard constraints — things you MUST NOT do
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
## Output discipline — summary
- Exactly one JSON object, UTF-8, no BOM.
- Keys match the schema the caller gave you.
- Numeric fields are JSON numbers, not strings.
- `pay_date` is `YYYY-MM-DD`.
- `other_deductions` is always present and is an object (possibly `{}`).
- Missing money → `0`, missing string → `""`, missing object → `{}`.
- On unrecoverable failure, one JSON object with a single `error` key.
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.

View file

@ -34,11 +34,7 @@ You receive these parameters in your invocation:
- **Infra repo**: `/home/wizard/code/infra` - **Infra repo**: `/home/wizard/code/infra`
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json` - **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
- **Kubeconfig**: `/home/wizard/code/infra/config` - **Kubeconfig**: `/home/wizard/code/infra/config`
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`: - **Vault**: Authenticate with `vault login -method=oidc` if needed. Secrets at `secret/viktor` and `secret/platform`.
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
- **Git remote**: `origin``github.com/ViktorBarzin/infra.git` - **Git remote**: `origin``github.com/ViktorBarzin/infra.git`
## NEVER Do ## NEVER Do
@ -122,6 +118,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL 3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
4. If auto-detect fails, verify the repo exists: 4. If auto-detect fails, verify the repo exists:
```bash ```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -sf -H "Authorization: token $GITHUB_TOKEN" \ curl -sf -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null "https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
``` ```
@ -131,6 +128,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
## Step 3: Fetch Changelogs via GitHub API ## Step 3: Fetch Changelogs via GitHub API
```bash ```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \ curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100" "https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
``` ```
@ -173,9 +171,11 @@ Scan all intermediate release notes for breaking change indicators from the conf
## Step 5: Slack Notification — Starting ## Step 5: Slack Notification — Starting
```bash ```bash
SLACK_WEBHOOK=$(vault kv get -field=alertmanager_slack_api_url secret/platform)
curl -s -X POST -H 'Content-type: application/json' \ curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \ --data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
"$SLACK_WEBHOOK_URL" "$SLACK_WEBHOOK"
``` ```
For CAUTION risk, include breaking change excerpts in the Slack message. For CAUTION risk, include breaking change excerpts in the Slack message.
@ -266,28 +266,23 @@ UPGRADE_SHA=$(git rev-parse HEAD)
## Step 9: Wait for Woodpecker CI ## Step 9: Wait for Woodpecker CI
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19). The commit triggers the `app-stacks.yml` pipeline (or `default.yml` for platform stacks).
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
```bash ```bash
# Find the pipeline for our commit WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_token secret/viktor)
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
# → $PIPELINE_NUMBER
# Fetch detail (includes workflows[])
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
| jq '.workflows[] | select(.name=="default") | .state'
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
``` ```
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes. Poll for the pipeline triggered by our commit:
```bash
# Get latest pipeline
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=5"
```
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state. Find the pipeline matching our commit SHA. Poll every 30 seconds until status is `success`, `failure`, `error`, or `killed`. Timeout after 15 minutes.
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
**If CI fails** → proceed to Step 10 (rollback).
**If CI succeeds** → proceed to verification.
## Step 10: Verify ## Step 10: Verify
@ -346,7 +341,7 @@ Re-run verification checks to confirm rollback succeeded. If rollback verificati
```bash ```bash
curl -s -X POST -H 'Content-type: application/json' \ curl -s -X POST -H 'Content-type: application/json' \
--data '{"text":"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required."}' \ --data '{"text":"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required."}' \
"$SLACK_WEBHOOK_URL" "$SLACK_WEBHOOK"
``` ```
## Step 11: Report Results ## Step 11: Report Results
@ -355,14 +350,14 @@ curl -s -X POST -H 'Content-type: application/json' \
```bash ```bash
curl -s -X POST -H 'Content-type: application/json' \ curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \ --data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
"$SLACK_WEBHOOK_URL" "$SLACK_WEBHOOK"
``` ```
### On failure + rollback ### On failure + rollback
```bash ```bash
curl -s -X POST -H 'Content-type: application/json' \ curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \ --data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
"$SLACK_WEBHOOK_URL" "$SLACK_WEBHOOK"
``` ```
## Edge Cases ## Edge Cases

1728
.claude/cluster-health.sh Executable file

File diff suppressed because it is too large Load diff

View file

@ -7,7 +7,6 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
import argparse import argparse
import json import json
import os import os
import subprocess
import sys import sys
from urllib.parse import urljoin from urllib.parse import urljoin
@ -18,29 +17,13 @@ except ImportError:
print(" pip install requests") print(" pip install requests")
sys.exit(1) sys.exit(1)
# Configuration from environment variables (ha-sofia specific)
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
def _token_from_homelab(): if not HA_URL or not HA_TOKEN:
"""Resolve the token via the homelab CLI when the env var isn't set, so the print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
script works from any directory / unprovisioned session (see ADR-0012).""" print("These should be set when activating the Claude venv (~/.venvs/claude)")
try:
out = subprocess.run(
["homelab", "ha", "token", "--instance", "sofia"],
capture_output=True, text=True, timeout=30)
if out.returncode == 0 and out.stdout.strip():
return out.stdout.strip()
except Exception:
pass
return None
# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
if not HA_TOKEN:
print("ERROR: no ha-sofia API token available.")
print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
sys.exit(1) sys.exit(1)
HEADERS = { HEADERS = {

View file

@ -2,41 +2,20 @@
> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks. > Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.
## Applications (11) ## Applications (10)
| Application | Provider Type | Auth Flow | | Application | Provider Type | Auth Flow |
|-------------|--------------|-----------| |-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | implicit consent | | Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent | | Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | implicit consent | | Forgejo | OAuth2/OIDC | explicit consent |
| Grafana | OAuth2/OIDC | implicit consent | | Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | implicit consent | | Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | implicit consent | | Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent | | Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent | | linkwarden | OAuth2/OIDC | explicit consent |
| linkwarden | OAuth2/OIDC | implicit consent | | Matrix | OAuth2/OIDC | implicit consent |
| Vault | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent | | wrongmove | OAuth2/OIDC | implicit consent |
> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
> and Vault (53) were switched from
> `default-provider-authorization-explicit-consent` via the API (these
> providers are UI-managed, not in TF). All are first-party apps; the
> expiring consent screen (re-shown every 4 weeks per app) only slowed
> first-time signin.
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12), so the dashboard runs
> on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a
> future SSO retry once apiserver OIDC is fixed.
>
> **admin-services-restriction** policy (TF-managed in
> `stacks/authentik/admin-services-restriction.tf`, adopted 2026-06-04): gates the
> 15 admin-only hostnames to `Home Server Admins`, with a carve-out admitting the
> `kubernetes-*` RBAC groups to `k8s.viktorbarzin.me` (dashboard login page).
## Groups (9) ## Groups (9)
| Group | Parent | Superuser | Purpose | | Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------| |-------|--------|-----------|---------|
@ -57,7 +36,7 @@
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users | | vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users | | emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users | | ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users, kubernetes-namespace-owners, sops-vabbit81 | | vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users | | valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users | | anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users | | kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
@ -69,27 +48,8 @@
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation) - All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
## Authorization Flows ## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10 - **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers - **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
## Authentication Flow (single-screen login, 2026-06-10)
`default-authentication-flow` bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (`default-authentication-identification`, pk
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
`default-authentication-password`, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
was DELETED via the API — authentik requires removing it when the
identification stage embeds the password field. `password_stage` is pinned in
Terraform (`authentik_stage_identification.default_identification` in
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
UI-managed via `ignore_changes`. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users would briefly see a second password prompt — delete the
binding again.
## Invitation Enrollment Flow ## Invitation Enrollment Flow
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1` Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@ -159,87 +119,3 @@ Removed bindings from:
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth - `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level). Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
## Session Duration (2026-05-01)
Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
## WebAuthn / Passkeys (2026-06-20)
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow``webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes``tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
## Upgrade Validation Checklist
Run after **any** of these:
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
- `goauthentik/authentik` Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pods (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
# A row count > a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
```
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52` — the upstream patch might let us drop our `kubernetes_json_patches.service` workaround.

View file

@ -23,28 +23,14 @@ module "nfs_data" {
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"` 3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
4. Verify: `showmount -e 192.168.1.127` 4. Verify: `showmount -e 192.168.1.127`
## Static Site Hosting
Two patterns for serving a folder of static files (HTML/CSS/JS/media):
1. **Image-baked** (default for git-native content): bake files into an `nginx:*-alpine` image at build time, deploy like any owned app (CI builds + pushes, Keel/Woodpecker rolls out). Reference: `stacks/blog` (Hugo → nginx, `Website/Dockerfile`). Use when content lives in git and changes via commits.
2. **NFS-backed** (for externally-authored / large / non-git content): a stock `nginx:1.28-alpine` Deployment mounts an `nfs_volume` PVC **read-only** at `/usr/share/nginx/html`; a tiny ConfigMap supplies `/etc/nginx/conf.d/default.conf` (just `root` + `index <entry>.html`). Files are dropped on `/srv/nfs/<site>` out-of-band (Nextcloud "PVE NFS Pool" or rsync) — no rebuild, auto-backed-up by `nfs-mirror`. Reference: `stacks/stem95su` (established 2026-06-07). Use when content is authored outside git (e.g. exported tools), is large (avoids git/image bloat), or a non-dev updates it. **The export subdir on the PVE host must exist before the pod mounts** — the `nfs_volume` module does NOT create it (see "Adding NFS Exports"; a subdir under the already-exported `/srv/nfs` needs no new `/etc/exports` line).
Both front with `ingress_factory` (`auth="none"` for open public content → CrowdSec + ai-bot-block still apply; or chain `anubis_instance` for a PoW gate, as `blog` does).
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm) ## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned. > iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
## Anti-AI Scraping (4 Active Layers) (Updated 2026-05-10) ## Anti-AI Scraping (5-Layer Defense)
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`. Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
1. **Anubis PoW challenge** (per-site reverse proxy) — `modules/kubernetes/anubis_instance/`. Latest: `ghcr.io/techarohq/anubis:v1.25.0`. Difficulty 2 (~250 ms desktop / ~700 ms mobile), 30-day JWT cookie scoped to `viktorbarzin.me` so a single solve covers every Anubis-fronted subdomain. Active on: `viktorbarzin.me`, `kms.viktorbarzin.me`, `travel.viktorbarzin.me`. Add to a stack: `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<svc>.<ns>.svc.cluster.local" }`, then point ingress_factory at `module.anubis.service_name` + `port = module.anubis.service_port` and set `anti_ai_scraping = false`. Shared ed25519 signing key in Vault `secret/viktor` -> `anubis_ed25519_key`. **Avoid putting Anubis in front of CLI/API/Git endpoints (Forgejo, APIs, WebDAV)** — clients without JS can't solve PoW. 1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Trap links before `</body>`
2. **Bot blocking forwardAuth** (ForwardAuth → bot-block-proxy → poison-fountain) — global default for non-Anubis sites. `bot-block-proxy` (OpenResty in `traefik` ns) is fail-open with 100 ms connect / 200 ms read timeouts so a downed poison-fountain costs ≤200 ms per request. Source: `stacks/traefik/modules/traefik/main.tf`. 4. Tarpit (~100 bytes/sec) 5. Poison content (CronJob every 6h, `--http1.1` required)
3. **X-Robots-Tag noai** — set by `traefik-anti-ai-headers` middleware. Anubis additionally serves a comprehensive `/robots.txt` (`SERVE_ROBOTS_TXT=true`) to well-behaved bots. Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`
4. **Tarpit/poison content** (standalone at poison.viktorbarzin.me, `stacks/poison-fountain/`). Currently scaled to `replicas = 0` — fail-open path means no live traffic, no penalty.
Trap links (formerly a layer) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
Key files: `modules/kubernetes/anubis_instance/`, `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/traefik/modules/traefik/main.tf`
## Terragrunt Architecture ## Terragrunt Architecture
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block - Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block

View file

@ -92,21 +92,19 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes | | VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------| |------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall | | 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. | | 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 | | 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) | | 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) | | 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 | | 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker | | 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker | | 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker | | 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) | | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM | | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` | | ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6). **Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs
## VM Templates ## VM Templates
| VMID | Name | Purpose | | VMID | Name | Purpose |
@ -124,9 +122,8 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) | | `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` | | `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
## GPU Node (currently k8s-node1) ## GPU Node (k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin - **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node) - **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) - GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration - Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it

File diff suppressed because one or more lines are too long

View file

@ -7,6 +7,7 @@
"docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver", "docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
"mailserver/docker-mailserver": "docker-mailserver/docker-mailserver", "mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
"docker.n8n.io/n8nio/n8n": "n8n-io/n8n", "docker.n8n.io/n8nio/n8n": "n8n-io/n8n",
"matrixdotorg/synapse": "element-hq/synapse",
"headscale/headscale": "juanfont/headscale", "headscale/headscale": "juanfont/headscale",
"technitium/dns-server": "TechnitiumSoftware/DnsServer", "technitium/dns-server": "TechnitiumSoftware/DnsServer",
"ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx", "ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx",
@ -81,6 +82,7 @@
"dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true }, "dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true },
"health": { "type": "postgresql", "db_name": "health", "shared": true }, "health": { "type": "postgresql", "db_name": "health", "shared": true },
"linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true }, "linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true },
"matrix": { "type": "postgresql", "db_name": "matrix", "shared": true },
"n8n": { "type": "postgresql", "db_name": "n8n", "shared": true }, "n8n": { "type": "postgresql", "db_name": "n8n", "shared": true },
"netbox": { "type": "postgresql", "db_name": "netbox", "shared": true }, "netbox": { "type": "postgresql", "db_name": "netbox", "shared": true },
"rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true }, "rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true },

View file

@ -177,33 +177,6 @@ Tell the user to share these onboarding instructions with the new user:
- K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner` - K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
- README: `https://github.com/ViktorBarzin/infra#new-user-onboarding` - README: `https://github.com/ViktorBarzin/infra#new-user-onboarding`
**Web dashboard access** (auto-login, no token paste): the `rbac` stack
auto-creates a `dashboard-<user>` SA + token for every namespace-owner
(`dashboard-sa.tf`), and the **k8s-dashboard** stack's token-injector maps the
user's Authentik identity → that token (`dashboard_injector.tf`, auto-derived
from `k8s_users`). The new user just logs into `https://k8s.viktorbarzin.me` and
lands in the dashboard scoped to their namespace (`admin` on their namespace +
read-only on the namespace list & nodes for nav — no cross-tenant resource reads).
> **Apply order for a new namespace-owner:** after the vault/rbac/woodpecker
> applies above, ALSO `cd stacks/k8s-dashboard && ../../scripts/tg apply` so the
> injector map picks up the new user. (Manual token fallback:
> `kubectl -n NAMESPACE get secret dashboard-USERNAME-token -o jsonpath='{.data.token}' | base64 -d`.)
> Seamless OIDC SSO is built but blocked — see
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12.
> **Auto-login works only for the user's `k8s_users` HOME namespace.** The
> dashboard injects the user's `dashboard-<user>` SA token, which the `rbac`
> stack binds to `admin` on their home namespace only. If their workload lives
> in a DIFFERENT / pre-existing namespace (e.g. gheorghe's app is in `novelapp`,
> not his home `vabbit81`), that namespace's stack must ALSO grant their
> **dashboard SA** — `kind: ServiceAccount, name: dashboard-<user>, namespace:
> <home-ns>` — not just their OIDC `User` email (the dashboard uses the SA, and
> apiserver OIDC is blocked). See `stacks/novelapp/main.tf` `novelapp_owner_vabbit81`
> for the pattern (two subjects: User + SA). Best practice: set the user's
> `k8s_users` namespace to where their workload actually runs, so the home-ns
> auto-path covers them with no extra binding.
The user can decrypt their stack's state with: The user can decrypt their stack's state with:
```bash ```bash
vault login -method=oidc # authenticates via Authentik SSO vault login -method=oidc # authenticates via Authentik SSO

View file

@ -0,0 +1,102 @@
# Setup Shared Remote Executor
Skill for setting up Claude Code's shared remote executor in new projects.
## When to Use
- When adding Claude Code support to a new project
- When the user says "set up remote executor for this project"
- When working on a new project that needs remote command execution
## Prerequisites
- Shared executor already deployed at `~/.claude/` on wizard@10.0.10.10
- Project accessible via NFS from both macOS and the remote VM
## Setup Steps
### 1. Create .claude Directory
```bash
mkdir -p .claude/sessions
```
### 2. Create session-exec.sh Wrapper
Create `.claude/session-exec.sh` with the following content (adjust PROJECT_ROOT):
```bash
#!/bin/bash
# Project-Local Session Helper - Wrapper for shared executor
set -euo pipefail
SHARED_SESSION_EXEC="/home/wizard/.claude/session-exec.sh"
PROJECT_ROOT="/home/wizard/path/to/project" # UPDATE THIS
if [ -f "$SHARED_SESSION_EXEC" ]; then
if [ "${1:-}" = "create" ] || [ -z "${1:-}" ]; then
"$SHARED_SESSION_EXEC" create "$PROJECT_ROOT"
else
"$SHARED_SESSION_EXEC" "$@"
fi
else
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSIONS_DIR="$SCRIPT_DIR/sessions"
SESSION_ID="${1:-$(date +%s)-$$-$RANDOM}"
ACTION="${2:-create}"
SESSION_DIR="$SESSIONS_DIR/$SESSION_ID"
case "$ACTION" in
create|init|"")
mkdir -p "$SESSION_DIR"
echo "ready" > "$SESSION_DIR/cmd_status.txt"
echo "$PROJECT_ROOT" > "$SESSION_DIR/workdir.txt"
> "$SESSION_DIR/cmd_input.txt"
> "$SESSION_DIR/cmd_output.txt"
echo "$SESSION_ID"
;;
cleanup|remove|delete)
[ -d "$SESSION_DIR" ] && rm -rf "$SESSION_DIR"
;;
status)
[ -d "$SESSION_DIR" ] && cat "$SESSION_DIR/cmd_status.txt"
;;
list)
[ -d "$SESSIONS_DIR" ] && ls -1 "$SESSIONS_DIR" 2>/dev/null
;;
esac
fi
```
Make executable: `chmod +x .claude/session-exec.sh`
### 3. Link Sessions Directory (on remote VM)
Run on the remote VM to add project sessions to the shared executor:
```bash
# Option A: Symlink project sessions (if using project-local sessions)
ln -sfn /path/to/project/.claude/sessions ~/.claude/sessions
# Option B: Use shared sessions (all projects share one directory)
# Just ensure ~/.claude/sessions exists
```
### 4. Create CLAUDE.md
Add execution instructions to `.claude/CLAUDE.md`:
```markdown
## Remote Command Execution
Uses shared executor at `~/.claude/` on wizard@10.0.10.10.
### Usage
\```bash
SESSION_ID=$(.claude/session-exec.sh)
echo "command" > .claude/sessions/$SESSION_ID/cmd_input.txt
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt
cat .claude/sessions/$SESSION_ID/cmd_output.txt
\```
Start executor: `~/.claude/remote-executor.sh` (on remote VM)
```
## Shared Executor Location
- Scripts: `~/.claude/remote-executor.sh`, `~/.claude/session-exec.sh`
- Sessions: `~/.claude/sessions/`
- Remote VM: wizard@10.0.10.10

View file

@ -7,448 +7,339 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff, (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health", (4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems". (5) User asks "is everything running" or "any problems".
Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs, Runs 8 standard K8s health checks with safe auto-fix for evicted pods
backups, external reachability, PVE host thermals + load, HA Sofia and stuck CrashLoopBackOff pods.
status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
with safe auto-fix for evicted pods.
author: Claude Code author: Claude Code
version: 2.0.0 version: 1.0.0
date: 2026-04-19 date: 2026-02-21
--- ---
# Cluster Health Check # Cluster Health Check
## MANDATORY: Run the script first ## Overview
When this skill is invoked, your **first action** must be to run the - **Script**: `/workspace/infra/.claude/cluster-health.sh`
cluster health check script and reason over its output before doing - **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
anything else. Do not improvise individual `kubectl` calls — the - **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
script is the authoritative surface. - **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
- **Exit code**: 0 = healthy, 1 = issues found
## Quick Check
Run the health check interactively:
```bash ```bash
cd /home/wizard/code # Report only, no Slack notification
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json bash /workspace/infra/.claude/cluster-health.sh --no-slack
# Full run with Slack notification
bash /workspace/infra/.claude/cluster-health.sh
# Report only, no auto-fix and no Slack
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
``` ```
If the session is rooted elsewhere, fall back to the absolute path: ## What It Checks
```bash | # | Check | Auto-Fix | Alerts |
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json |---|-------|----------|--------|
``` | 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
Then: | 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict. | 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
2. Iterate every FAIL and WARN check, describe what tripped, and propose | 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
the remediation path (use the recipes below). | 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
3. Only reach for ad-hoc `kubectl` commands when investigating a | 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |
specific failure beyond what the script reported.
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
## Quick flags
```bash
# Human-readable report (default), no auto-fix
bash infra/scripts/cluster_healthcheck.sh
# Machine-readable JSON summary
bash infra/scripts/cluster_healthcheck.sh --json
# Only show WARN + FAIL (suppress PASS noise)
bash infra/scripts/cluster_healthcheck.sh --quiet
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
bash infra/scripts/cluster_healthcheck.sh --fix
# Combined: quiet JSON without auto-fix
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
# Custom kubeconfig
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (47 checks)
| # | Check | Notes |
|---|-------|-------|
| 1 | Node Status | NotReady nodes, version drift |
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
| 6 | DaemonSets | desired == ready |
| 7 | Deployments | ready == desired replicas |
| 8 | PVC Status | all Bound |
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
| 11 | CrowdSec Agents | all pods Running |
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
| 13 | Prometheus Alerts | count of firing alerts |
| 14 | Uptime Kuma Monitors | internal + external monitors up |
| 15 | ResourceQuota Pressure | any quota >80% used |
| 16 | StatefulSets | ready == desired |
| 17 | Node Disk Usage | ephemeral-storage <80% |
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
| 19 | Kyverno Policy Engine | all pods Running |
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
| 21 | DNS Resolution | Technitium resolves internal + external |
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
| 23 | GPU Health | nvidia namespace + device-plugin Running |
| 24 | Cloudflare Tunnel | pods Running |
| 25 | Resource Usage | node CPU/mem headroom |
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache check `clip-index-prewarm` CronJob |
| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares real virtio-scsi CSI disks in `qm config <vmid>` (SSH PVE) vs attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set <vmid> --delete scsiN` (frees slot, retains LV) |
## Safe Auto-Fix Rules ## Safe Auto-Fix Rules
`--fix` only performs operations that are genuinely reversible and ### Safe to auto-fix (the script does these automatically)
observable. Nothing here rewrites Terraform state or mutates the cluster
beyond "delete pod".
### Done automatically by `--fix` 1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:
- **Evicted / Failed pods** — delete them; the controller recreates.
```bash ```bash
kubectl delete pods -A --field-selector=status.phase=Failed kubectl delete pods -A --field-selector=status.phase=Failed
``` ```
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
backoff timer. 2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:
```bash
kubectl delete pod -n <namespace> <pod-name> --grace-period=0
```
### NEVER auto-fix (requires human investigation) ### NEVER auto-fix (requires human investigation)
- NotReady nodes - **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
- MemoryPressure / DiskPressure / PIDPressure - **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
- ImagePullBackOff (usually a bad tag / registry credential) - **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
- Deployment ready-replica mismatch - **Failed deployments** — Could be resource limits, bad config, missing secrets
- Pending PVCs - **Pending PVCs** — Usually NFS export missing or storage class issue
- Node CPU/memory >90% - **Resource pressure >90%** — Need to identify which pods are consuming resources
- CronJob failures - **CronJob failures** — Need to check job logs to understand why it failed
- DaemonSet desired != ready - **DaemonSet issues** — Could be node taints, resource limits, or image issues
- Vault sealed
- ClusterSecretStore not Ready
- cert-manager Certificate failures
- Backup freshness regressions
- Any external-reachability failure
## Deep-investigation recipes per failure mode ## Deep Investigation
### Node Issues (checks 1, 3, 17, 25) When the health check reports issues, use these commands to investigate further.
### Node Issues
```bash ```bash
kubectl describe node <node> # Describe the problematic node (events, conditions, capacity)
kubectl describe node <node-name>
# Check resource usage across all nodes
kubectl top nodes kubectl top nodes
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
# SSH to the node # Check recent events on a specific node
ssh root@10.0.20.10X kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
# SSH to the node for direct inspection
ssh root@<node-ip>
systemctl status kubelet systemctl status kubelet
journalctl -u kubelet --since "30 minutes ago" | tail -100 journalctl -u kubelet --since "30 minutes ago" | tail -100
df -h ; free -h df -h
free -h
``` ```
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2, ### Pod Issues
`.103` node3, `.104` node4.
### Pod Issues (checks 4, 5, 11, 19)
```bash ```bash
kubectl describe pod -n <ns> <pod> # Describe the pod (events, conditions, container statuses)
kubectl logs -n <ns> <pod> --tail=200 kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <ns> <pod> --previous --tail=200
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20 # Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100
# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100
# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide
``` ```
Common failure causes: OOMKilled (raise mem limit in Terraform), bad ### Deployment Issues
config / missing env var, DB connection failure (check `dbaas` pods),
NFS mount failure (`showmount -e 192.168.1.127`), stale
imagePullSecret.
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
```bash ```bash
kubectl describe deployment -n <ns> <name> # Describe the deployment (strategy, conditions, events)
kubectl rollout status deployment -n <ns> <name> kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <ns> <name>
kubectl get rs -n <ns> -l app=<app> # Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>
# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>
# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>
``` ```
### PVC (check 8) ### PVC Issues
```bash ```bash
kubectl describe pvc -n <ns> <pvc> # Describe the PVC (events, status, storage class)
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp' kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv | grep <pvc>
showmount -e 192.168.1.127 # Check PVs
kubectl get pv
# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
# Verify NFS export exists
showmount -e 10.0.10.15 | grep <service-name>
``` ```
### cert-manager (checks 31, 32, 33) ### Resource Pressure
```bash ```bash
kubectl get certificate -A # Top nodes (CPU and memory usage)
kubectl describe certificate -n <ns> <name>
kubectl get certificaterequest -A
kubectl describe certificaterequest -n <ns> <name>
kubectl logs -n cert-manager deploy/cert-manager | tail -50
```
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
DNS provider secret, rate-limit from Let's Encrypt.
### Backups (checks 34, 35, 36)
```bash
# Per-DB dumps (inside the DB pod)
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
# Pushgateway metrics
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
grep backup_last_success_timestamp
# LVM snapshots on PVE host
ssh -o BatchMode=yes root@192.168.1.127 \
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
```
If offsite sync is stale, the common cause is the
`offsite-sync-backup.service` systemd unit on the PVE host failing.
`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
### Monitoring stack (checks 37, 38, 39)
```bash
# Prometheus
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
kubectl logs -n monitoring deploy/prometheus-server --tail=100
# Alertmanager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
# Vault
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
# ClusterSecretStore
kubectl get clustersecretstore
kubectl describe clustersecretstore vault-kv vault-database
kubectl logs -n external-secrets deploy/external-secrets --tail=100
```
### External reachability (checks 40, 41, 42)
```bash
# Cloudflared
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100
# Authentik (Helm chart names the deployment goauthentik-server)
kubectl get deployment -n authentik goauthentik-server
kubectl logs -n authentik deploy/goauthentik-server --tail=100
# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
# Traefik 5xx — find the hot service
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
| python3 -m json.tool
```
### OOMKilled remediation
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
`resources.limits.memory`.
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
`terraform apply -target=module.<service>` as appropriate.
### ImagePullBackOff remediation
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
2. Verify tag exists on the source registry.
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
4. Update the image tag in Terraform + re-apply.
### Persistent CrashLoopBackOff after auto-fix
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
- `OOMKilled` → raise memory limit
- Exit code 137 → OOM or probe killed
- Exit code 143 → SIGTERM / graceful shutdown failed
3. Cross-check dbaas + NFS + secrets are healthy.
## Performance forensics — top consumers + optimization hints
When the cluster is healthy (script returns 0) but the host is hot or load
is elevated, switch from "what broke?" to "what's expensive?". Run these
in order; stop as soon as the root cause is obvious.
### Step 1 — Snapshot top consumers cluster-wide
```bash
# Top 15 pods by current CPU
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
# Top 5 nodes by CPU + memory pressure
kubectl top nodes kubectl top nodes
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes) # Top pods sorted by memory (cluster-wide)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \ kubectl top pods -A --sort-by=memory | head -20
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
| python3 -m json.tool | head -80 # Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20
# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
``` ```
### Step 2 — For each suspect pod, get the WHY ## Common Remediation
For every pod in the top-N, gather these BEFORE proposing a fix: ### Persistent CrashLoopBackOff
A pod keeps crashing even after the auto-fix deletes it.
1. **Check logs from the crashed container**:
```bash ```bash
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}') kubectl logs -n <namespace> <pod-name> --previous --tail=200
# What it does (image + command)
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
# Resource limits + current usage
kubectl -n $NS top pod $POD --containers
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
# Recent logs filtered for reconcile loops, watch storms, slow queries
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
# Restart count + recent OOM
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
# Self-exported metrics (for apps that publish on /metrics)
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
``` ```
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot) 2. **Check the pod description for clues**:
```bash ```bash
# Top request producers by verb+resource (last 30 min) kubectl describe pod -n <namespace> <pod-name>
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Top user agents (which clients are hammering)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
| python3 -m json.tool
# etcd write rate + DB size
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
| python3 -m json.tool
``` ```
Look for:
- `OOMKilled` in Last State — the container ran out of memory
- `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
- `Error` with exit code 137 — killed by OOM killer or liveness probe
- `Error` with exit code 143 — SIGTERM (graceful shutdown failure)
### Step 4 — PVE host specific deep-dive (when temp / load is high) 3. **Common causes**:
- **OOMKilled**: Increase memory limits in Terraform (see below)
- **Bad config**: Check environment variables, secrets, config maps
- **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
- **NFS mount failure**: Verify NFS export exists (`showmount -e 10.0.10.15`)
- **Missing secret**: Check if TLS secret or other secrets exist in the namespace
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL ### OOMKilled
thresholds — that's the first stop. When those WARN or FAIL, the
follow-up commands below trace which VM / process is the source:
The container was killed because it exceeded its memory limit.
1. **Check current limits**:
```bash ```bash
# Per-core temps (broader than the package summary in check 43) kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
val=$(cat "$f"); echo " $label: $((val/1000))°C"
done'
# Per-VM CPU (each VM = one kvm process)
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
# Stale snapshots (any '_pre-*' that survived past their rollback window)
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
``` ```
### Step 5 — Optimization decision 2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:
```hcl
For each consumer in the top-N, fill in a row: resources {
limits = {
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort | memory = "2Gi" # Increase from current value
|---|---|---|---|---|---|---| }
}
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.** ```
### Common causes + tunables (catalogue)
| Symptom | Likely cause | Tunable |
|---|---|---|
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
### What NOT to touch
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
### Source-of-truth notes
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
## Notes on the canonical / hardlink setup
The authoritative copy of this SKILL.md lives at
`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
points to the same inode so infra-rooted sessions also discover the
skill.
To verify the hardlink is intact:
3. **Apply the change**:
```bash ```bash
stat -c '%i %n' \ cd /workspace/infra
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \ terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
``` ```
Both should print the same inode number. If they diverge (e.g. `git ### ImagePullBackOff
checkout` replaced the file rather than updating it), re-link:
The container image cannot be pulled.
1. **Check the exact error**:
```bash ```bash
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \ kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
``` ```
2. **Common causes**:
- **Wrong image tag**: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
- **Private registry without credentials**: Check if imagePullSecrets are configured
- **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry
```bash
# Check pull-through cache ports:
# 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
```
- **Registry rate limit**: Docker Hub free tier has pull limits; pull-through cache helps avoid this
3. **Fix**: Update the image tag in the service's Terraform module and re-apply.
### Node NotReady
A node has gone NotReady.
1. **Check node conditions**:
```bash
kubectl describe node <node-name> | grep -A 20 "Conditions"
```
2. **SSH to the node and check kubelet**:
```bash
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" | tail -50
```
3. **Check resources**:
```bash
# On the node
df -h # Disk space
free -h # Memory
top -bn1 # CPU/processes
```
4. **Node IPs** (for SSH):
- `10.0.20.100` — k8s-master
- `10.0.20.101` — k8s-node1 (GPU)
- `10.0.20.102` — k8s-node2
- `10.0.20.103` — k8s-node3
- `10.0.20.104` — k8s-node4
## Slack Webhook
The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:
- All clear: green checkmark with node/pod count
- Warnings only: warning icon with details
- Issues found: red alert icon with auto-fixes applied and remaining issues
The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.
## Infrastructure
| Component | Path / Location |
|-----------|----------------|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
| Infra repo (in pod) | `/workspace/infra` |
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |
## Auto-File Incidents for SEV1/SEV2
After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:
### Severity Classification
- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
- **SEV3**: Warnings only, resource pressure <90%, cosmetic do NOT auto-file
### Workflow
1. **Dedup check**: Before filing, query open incidents:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
```
If an open issue already covers the same service/namespace, **skip filing**.
2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required`:
- Title: `[AUTO] <Service/Namespace> — <brief symptom>`
- Body: full diagnostic dump (pod status, events, alerts, node state)
- The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
```bash
# Comment and close
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
-d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
-d '{"state": "closed"}'
```
## Post-Mortem Auto-Suggest
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
This ensures incidents are documented while context is fresh.
## Notes
1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
4. All Terraform changes must go through the `.tf` files — never use `kubectl apply/edit/patch` for persistent changes

View file

@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia. There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control. Always use Home Assistant for smart home control.
author: Claude Code author: Claude Code
version: 2.1.0 version: 2.0.0
date: 2026-06-24 date: 2026-02-07
--- ---
# Home Assistant Control # Home Assistant Control
@ -44,12 +44,6 @@ There are **two** Home Assistant instances:
- Environment variables for each instance: - Environment variables for each instance:
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN` - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN` - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
## API Control ## API Control
@ -395,27 +389,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map ## ha-london Knowledge Map
### Overview ### Overview
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied). - **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **Location**: London, UK - **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS - **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **Access from the Sofia devvm**: london is **remote**`homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs. - **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) - **Config path**: `/config/` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home) - **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed``detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems ### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -437,15 +418,10 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors - PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`) #### 3. Cowboy E-Bike
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration). - `sensor.bike_state_of_charge`: Battery %
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`) - `sensor.bike_total_distance`: Total km
- `sensor.classic_performance_remaining_range`: Range km - `sensor.bike_total_co2_saved`: CO2 saved (grams)
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot) #### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime - `sensor.blog`: blog uptime
@ -464,17 +440,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
- Scripts: `script.start_netflix`, `script.start_stremio` - Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs) - Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components (HACS integrations) ### Custom Components
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it. - **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken. - **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations ### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB. ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants ### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -489,8 +460,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
- Anca arrival/departure notifications - Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle - Night scene: turns off Livia + Michelle
### Platform (HAOS — ignore any legacy `docker run` snippet) ### Docker Setup
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker). ```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### SSH Access ### SSH Access
```bash ```bash

View file

@ -1,227 +0,0 @@
---
name: upgrade-state
description: |
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
unattended-upgrades+kured, K8s components via the version-check chain).
Use when:
(1) User asks "/upgrade-state" or "are we current",
(2) User asks "what's pending upgrade" or "what's the upgrade state",
(3) User asks if Keel / kured / k8s-version-check is healthy,
(4) User asks about kept-back / held packages or pending reboots,
(5) Periodic survey before the next `k8s-version-check` daily run.
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
author: Claude Code
version: 1.0.0
date: 2026-05-18
---
# Upgrade-state
## MANDATORY: Run the script first
When this skill is invoked, your **first action** must be to run
`upgrade_state.sh` and reason over its output before doing anything
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
is the authoritative surface.
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh
```
For programmatic use:
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
```
Then:
1. Report the rendered table verbatim — it answers the user's
"are we current" question in three lines.
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
underneath and propose a next action (links in the table below).
3. Only reach for ad-hoc commands when investigating beyond what the
script reported.
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
## What it covers (3 pipelines)
| Layer | What runs | Cadence | Data sources |
|---|---|---|---|
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
The K8s pipeline pushes a small set of gauges to the Prometheus
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
- `k8s_version_check_last_run_timestamp` — when detection last ran
- `k8s_upgrade_in_flight` — 0/1
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running
>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally
failed — including a **preflight that aborted before `in_flight` was set**
(the gates exit pre-metric). The script raises `✗` for either, and reads the
Jobs directly, so it also catches a Failed preflight that left no metric.
## Status-icon legend
| Icon | Meaning |
|---|---|
| `✓` | Healthy, fully current |
| `→` | Update available, not yet applied (K8s patch/minor) |
| `…` | In flight — chain currently running |
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed |
## Drill-down — when a row trips, what to do
### Apps `⚠` — pending approvals or errors
```bash
# Read recent Keel log lines
kubectl -n keel logs deploy/keel --since=24h --tail=200
# What is Keel currently tracking?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=count by (image) (registries_scanned_total)'
# Is the scrape live?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
```
Common Keel errors:
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
- `registry authentication required` — bad imagePullSecret on the watched Deployment
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
### OS `⚠` — held packages with bumps
The script flags any package held via `apt-mark hold` that ALSO appears
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
owns those) and the kernel (kured handles the reboot half).
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
runc 1.1 → 1.4). These are held because they need cluster-wide
coordination, not silent in-release patching.
```bash
# Inspect the situation on the flagged node
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
# Unhold + upgrade a specific package
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
```
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
### OS `⚠` — pending reboot
A node has `/var/run/reboot-required`. Kured will reboot it inside the
next 02:00-06:00 London window (any day of the week).
```bash
# Force a manual reboot inside the window (rare)
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
ssh wizard@10.0.20.10X sudo systemctl reboot
```
### OS `✗` — kured not Running
```bash
kubectl -n kured get pods
kubectl -n kured logs daemonset/kured --tail=100
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
kubectl -n kured get pods -l name=kured-sentinel-gate
```
### K8s `→` — patch/minor available
Detection ran, target identified, chain NOT started. The chain spawns
on the same daily detection cycle — typically within ~24h of the
target first being detected.
```bash
# Inspect Pushgateway state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
# Trigger a manual run of the detection CronJob
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
```
### K8s `…` — in flight
The Job chain is running. Watch its progress:
```bash
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
```
### K8s `✗ stalled``K8sUpgradeStalled` would fire
Chain in-flight >90m. The Job is most likely stuck on drain or a
pre-flight check.
```bash
kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <stuck-job>
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
# If you need to clear the in-flight flag (after diagnosing):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
--header='Content-Type: text/plain'"
```
### K8s `✗ chain failed` — a phase Job terminally failed
`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted
on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) —
these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and
the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge).
```bash
kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <failed-job> # check the Failed reason
# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
# them. Replay the gate instead — which critical alerts were firing at the
# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)
```
Recovery is now mostly automatic: the detection CronJob and `spawn_next`
re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a
transient gate clears within ~24h. To expedite, delete the Failed Job and
trigger detection:
```bash
kubectl -n k8s-upgrade delete job <failed-job>
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
```
### K8s `✗ detection stale` — last detection >9 days
```bash
kubectl -n k8s-upgrade get cronjob k8s-version-check
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
```
If the CronJob hasn't fired on time, suspect:
- `suspend=true` on the CronJob (`var.enabled=false` in the
`k8s-version-upgrade` Terraform stack)
- Image-pull failure on the version-check pod
- Pushgateway scrape gone stale
## Companion command-line flags
```bash
bash infra/scripts/upgrade_state.sh # rendered table (default)
bash infra/scripts/upgrade_state.sh --json # machine output
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
```

View file

@ -155,19 +155,3 @@ Common port is 80. Exceptions:
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading 3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
4. Homepage dashboard widget slug: `cluster-internal` 4. Homepage dashboard widget slug: `cluster-internal`
5. Cloudflare-proxied at `uptime.viktorbarzin.me` 5. Cloudflare-proxied at `uptime.viktorbarzin.me`
## Terraform-Managed Monitors
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
declarative monitor management in this stack:
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
- **Internal monitors (DBs, non-HTTP)** — declared in the
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
list (provide `name`, `type`, `database_connection_string`,
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
if missing, patches if drifted. Existing monitors keep their id and history.

View file

@ -1,36 +0,0 @@
name: Build android-emulator
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
# Large image (Android SDK + emulator); on-demand workload (scaled 0). Rebuilds
# rare → dispatch + path trigger.
on:
push:
branches: [master]
paths:
- 'stacks/android-emulator/docker/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/android-emulator/docker
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/android-emulator:latest
ghcr.io/viktorbarzin/android-emulator:${{ github.sha }}

View file

@ -1,39 +0,0 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/chrome/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/chrome
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-browser:latest
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}

View file

@ -1,36 +0,0 @@
name: Build chrome-service-novnc
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
# Source Dockerfile identical on both git remotes, so the github checkout builds
# the current image. Rebuilds are rare (stable noVNC proxy) → dispatch + path.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/novnc/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/novnc
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-novnc:latest
ghcr.io/viktorbarzin/chrome-service-novnc:${{ github.sha }}

View file

@ -1,41 +0,0 @@
name: Build infra CLI
# ADR-0002: infra CLI built off-infra on GHA. Replaces the Woodpecker
# build-cli.yml. Pushes to DockerHub (public distribution, kept) + ghcr.
# Not a cluster workload — a distributed tool image.
on:
push:
branches: [master]
paths:
- 'cli/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: cli
platforms: linux/amd64
provenance: false
push: true
tags: |
viktorbarzin/infra:latest
ghcr.io/viktorbarzin/infra-cli:latest
ghcr.io/viktorbarzin/infra-cli:${{ github.sha }}

View file

@ -1,37 +0,0 @@
name: Build infra-ci
# ADR-0002: the infra CI toolbox image (terraform/terragrunt/sops/kubectl/vault)
# built off-infra on GHA → ghcr (public). BOOTSTRAP-CRITICAL: .woodpecker/default.yml's
# apply step runs in this image. The Woodpecker build-ci-image.yml is kept until a
# ghcr-based apply is proven, then removed.
on:
push:
branches: [master]
paths:
- 'ci/Dockerfile'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: ci
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/infra-ci:latest
ghcr.io/viktorbarzin/infra-ci:${{ github.sha }}

View file

@ -1,36 +0,0 @@
name: Build k8s-portal
# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
# the in-cluster .woodpecker/k8s-portal.yml build.
on:
push:
branches: [master]
paths:
- 'stacks/k8s-portal/modules/k8s-portal/files/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/k8s-portal/modules/k8s-portal/files
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/k8s-portal:latest
ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}

18
.gitignore vendored
View file

@ -65,11 +65,6 @@ state/infra/
backend.tf backend.tf
providers.tf providers.tf
.terraform.lock.hcl .terraform.lock.hcl
cloudflare_provider.tf
tiers.tf
stacks/*/cloudflare_provider.tf
stacks/*/tiers.tf
stacks/*/terragrunt_rendered.json
# Kubernetes config (sensitive) # Kubernetes config (sensitive)
config config
@ -103,16 +98,3 @@ stacks/terminal/clipboard-upload/clipboard-upload
# Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only) # Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only)
terraform.tfstate terraform.tfstate
terraform.tfstate.backup terraform.tfstate.backup
# Per-feature git worktrees (worktree-first workflow — execution.md)
.worktrees/
# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
# secrets; created by terraform state ops. The patterns above miss the timestamped form.
terraform.tfstate.*.backup
# Python test artifacts (pytest bytecode cache) — e.g. from
# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
__pycache__/
*.pyc
.pytest_cache/

View file

@ -1,11 +0,0 @@
# git-crypt encrypts these at rest; the working-tree plaintext is local-only.
# gitleaks scans the staged working-tree copy and can't see that they're
# encrypted on disk in git, so allowlist by fingerprint.
stacks/recruiter-responder/secrets/privkey.pem:private-key:1
# False positives: the `curl-auth-user` rule flags `-u "admin:..."` in the
# nextcloud-todos webhook-register provisioner, but the password is a shell
# variable ($NC_ADMIN_APP_PW) resolved at apply time from Vault — no literal
# secret is committed.
stacks/nextcloud-todos/main.tf:curl-auth-user:383
stacks/nextcloud-todos/main.tf:curl-auth-user:400

View file

@ -1,8 +0,0 @@
{
"mcpServers": {
"ha": {
"type": "http",
"url": "${HA_MCP_URL}"
}
}
}

View file

@ -1,31 +0,0 @@
# Break-glass: save the ghcr infra-ci image to a tarball on the registry VM
# (10.0.20.10) so it can be `docker load`-ed onto a node if ghcr is ever
# unreachable during a recovery. infra-ci now builds on GHA → ghcr (ADR-0002),
# which is external + node-cached, so this is a belt-and-braces DR artifact —
# run MANUALLY after an infra-ci rebuild (or periodically). Pulls from ghcr
# (public, no login). Recovery: docs/runbooks/forgejo-registry-breakglass.md.
when:
- event: manual
steps:
- name: breakglass-tarball
image: alpine:3.20
failure: ignore
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
commands:
- apk add --no-cache openssh-client
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- |
ssh -n -o BatchMode=yes root@10.0.20.10 "
set -e
mkdir -p /opt/registry/data/private/_breakglass
IMAGE=ghcr.io/viktorbarzin/infra-ci:latest
docker pull \$IMAGE
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
ls -lh /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
"

View file

@ -0,0 +1,41 @@
# Build the CI tools Docker image used by all infra pipelines.
# Triggers on changes to ci/Dockerfile only (push to master).
when:
event: push
branch: master
path:
include:
- 'ci/Dockerfile'
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
repo: registry.viktorbarzin.me:5050/infra-ci
dockerfile: ci/Dockerfile
context: ci/
tags:
- latest
- "${CI_COMMIT_SHA:0:8}"
platforms: linux/amd64
registry: registry.viktorbarzin.me:5050
logins:
- registry: registry.viktorbarzin.me:5050
username:
from_secret: registry_user
password:
from_secret: registry_password
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [success]

33
.woodpecker/build-cli.yml Normal file
View file

@ -0,0 +1,33 @@
when:
event: push
clone:
git:
image: woodpeckerci/plugin-git
settings:
attempts: 5
backoff: 10s
steps:
- name: build-image
image: woodpeckerci/plugin-docker-buildx
settings:
username: "viktorbarzin"
password:
from_secret: dockerhub-pat
repo:
- viktorbarzin/infra
- registry.viktorbarzin.me:5050/infra
logins:
- registry: https://index.docker.io/v1/
username: viktorbarzin
password:
from_secret: dockerhub-pat
dockerfile: cli/Dockerfile
context: cli
auto_tag: true
# cache_from/cache_to removed: registry cache corruption causes
# "short read: expected 32 bytes" BuildKit errors. Inline cache
# will be re-populated once a clean image is pushed.
# cache_from: "registry.viktorbarzin.me:5050/infra:latest"
# cache_to: "type=inline"

View file

@ -19,34 +19,13 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
depth: 2 depth: 2
attempts: 5 attempts: 5
backoff: 10s backoff: 10s
steps: steps:
# Audit feed for the allow-then-audit contribution model: any master push by
# a NON-admin author is surfaced in Slack (Viktor's own pushes are not).
# Runs before apply and never blocks it. Note: [ci skip] commits never reach
# this step (Woodpecker skips the whole pipeline) — hence the rule that
# non-admins must not use [ci skip].
- name: notify-nonadmin-push
image: curlimages/curl
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
case "$CI_COMMIT_AUTHOR" in
viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;;
esac
SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\')
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \
"$SLACK_WEBHOOK" || true
- name: apply - name: apply
image: ghcr.io/viktorbarzin/infra-ci:latest image: registry.viktorbarzin.me:5050/infra-ci:latest
pull: true pull: true
backend_options: backend_options:
kubernetes: kubernetes:
@ -58,12 +37,6 @@ steps:
environment: environment:
SLACK_WEBHOOK: SLACK_WEBHOOK:
from_secret: slack_webhook from_secret: slack_webhook
# Each `- |` command runs in a fresh shell, so we can't rely on an
# `export VAULT_ADDR=...` in the auth command persisting — pin it at
# step level. VAULT_TOKEN is still per-command; we persist it to
# ~/.vault-token (auto-read by `vault` CLI) so downstream commands
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands: commands:
# ── Skip CI commits ── # ── Skip CI commits ──
- | - |
@ -82,49 +55,9 @@ steps:
# ── Vault auth ── # ── Vault auth ──
- | - |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \ export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token) -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
exit 1
fi
# Persist for downstream `- |` blocks (each runs in a fresh shell,
# so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
# and `scripts/state-sync` all fall through to ~/.vault-token when
# the env var is unset.
umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
# ── Generate kubeconfig from projected SA token ──
# terragrunt.hcl injects `-var kube_config_path=<repo>/config` for every
# terraform invocation, so we need a kubeconfig file at that path. The
# `default` SA in the woodpecker namespace is cluster-admin (via the
# `woodpecker-default` ClusterRoleBinding), so the projected token is
# sufficient to apply any stack. Using `tokenFile` (not an inline token)
# so the provider re-reads it if kubelet rotates the projected token
# mid-pipeline.
- |
cat > config <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: kubernetes
cluster:
server: https://10.0.20.100:6443
certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
contexts:
- name: ci
context:
cluster: kubernetes
user: ci
current-context: ci
users:
- name: ci
user:
tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
EOF
chmod 600 config
# Sanity check: kubeconfig works
kubectl --kubeconfig=config get ns kube-system -o name >/dev/null
# ── Detect changed stacks ── # ── Detect changed stacks ──
- | - |
@ -136,25 +69,6 @@ steps:
git fetch --deepen=1 origin master 2>/dev/null || true git fetch --deepen=1 origin master 2>/dev/null || true
fi fi
# Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA).
# HEAD~1 is WRONG for merge commits — it is the first parent (the
# feature-branch side), so the diff shows the OTHER lineage's files
# and silently skips the stacks this push actually changed
# (bit ci-pipeline-health on 2026-06-12, pipeline 128).
DIFF_BASE="HEAD~1"
if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then
git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true
# Restarted pipelines after master moved produce REVERSE diffs
# (CI_PREV ahead of the checked-out HEAD re-applied stale trees and
# reverted a sibling apply on 2026-06-12, pipeline 148). Only use
# CI_PREV when it is an ancestor of HEAD.
if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \
&& git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then
DIFF_BASE="$CI_PREV_COMMIT_SHA"
fi
fi
echo "Diff base: $DIFF_BASE"
# If still no parent, apply all platform stacks as a safe fallback # If still no parent, apply all platform stacks as a safe fallback
if ! git rev-parse HEAD~1 >/dev/null 2>&1; then if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
echo "Cannot determine changed files — applying ALL platform stacks" echo "Cannot determine changed files — applying ALL platform stacks"
@ -162,14 +76,14 @@ steps:
> .app_apply > .app_apply
else else
# Check if global files changed (triggers full platform apply) # Check if global files changed (triggers full platform apply)
GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true) GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
if [ -n "$GLOBAL_CHANGED" ]; then if [ -n "$GLOBAL_CHANGED" ]; then
echo "Global files changed — applying ALL platform stacks" echo "Global files changed — applying ALL platform stacks"
echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
else else
# Detect platform stacks that changed # Detect platform stacks that changed
git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
> .platform_apply > .platform_apply
while read -r stack; do while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
@ -180,7 +94,7 @@ steps:
# Detect app stacks that changed # Detect app stacks that changed
> .app_apply > .app_apply
git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
continue # Skip platform stacks continue # Skip platform stacks
fi fi
@ -200,7 +114,7 @@ steps:
# ── Pre-warm provider cache ── # ── Pre-warm provider cache ──
- | - |
if [ -s .platform_apply ] || [ -s .app_apply ]; then if [ -s .platform_apply ] || [ -s .app_apply ]; then
FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1) FIRST_STACK=$(head -1 .platform_apply .app_apply 2>/dev/null | head -1)
if [ -n "$FIRST_STACK" ]; then if [ -n "$FIRST_STACK" ]; then
echo "Pre-warming provider cache from stacks/$FIRST_STACK..." echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../.. cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
@ -209,7 +123,6 @@ steps:
# ── Apply platform stacks (serial, with Vault advisory locks) ── # ── Apply platform stacks (serial, with Vault advisory locks) ──
- | - |
FAILED_PLATFORM_STACKS=""
if [ -s .platform_apply ]; then if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ===" echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do while read -r stack; do
@ -222,9 +135,8 @@ steps:
if echo "$OUTPUT" | grep -q "is locked by"; then if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)" echo "[$stack] SKIPPED (locked by another session)"
else else
echo "$OUTPUT" | tail -50 echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)" echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
fi fi
else else
echo "$OUTPUT" | tail -3 echo "$OUTPUT" | tail -3
@ -232,12 +144,9 @@ steps:
fi fi
done < .platform_apply done < .platform_apply
fi fi
# Deferred until after app stacks so both lists get a chance to run.
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
# ── Apply app stacks (serial, with Vault advisory locks) ── # ── Apply app stacks (serial, with Vault advisory locks) ──
- | - |
FAILED_APP_STACKS=""
if [ -s .app_apply ]; then if [ -s .app_apply ]; then
echo "=== Applying app stacks (serial, locked) ===" echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do while read -r stack; do
@ -250,9 +159,8 @@ steps:
if echo "$OUTPUT" | grep -q "is locked by"; then if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)" echo "[$stack] SKIPPED (locked by another session)"
else else
echo "$OUTPUT" | tail -50 echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)" echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
fi fi
else else
echo "$OUTPUT" | tail -3 echo "$OUTPUT" | tail -3
@ -260,15 +168,6 @@ steps:
fi fi
done < .app_apply done < .app_apply
fi fi
# Fail the step loudly so the pipeline `default` workflow state
# reflects reality — the service-upgrade agent and CI alert cascade
# both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
# counted as failures.
FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
exit 1
fi
# ── Commit and push state changes ── # ── Commit and push state changes ──
- | - |

View file

@ -9,13 +9,12 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
depth: 1 depth: 1
attempts: 3 attempts: 3
steps: steps:
- name: detect-drift - name: detect-drift
image: ghcr.io/viktorbarzin/infra-ci:latest image: registry.viktorbarzin.me:5050/infra-ci:latest
pull: true pull: true
backend_options: backend_options:
kubernetes: kubernetes:
@ -42,44 +41,11 @@ steps:
export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \ export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token) -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
# ── Generate kubeconfig from projected SA token ──
# See default.yml for rationale. terragrunt.hcl injects
# `-var kube_config_path=<repo>/config` for every terraform invocation,
# so we need a kubeconfig file at that path. The woodpecker default SA
# is cluster-admin, so the projected token is sufficient.
- |
cat > config <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: kubernetes
cluster:
server: https://10.0.20.100:6443
certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
contexts:
- name: ci
context:
cluster: kubernetes
user: ci
current-context: ci
users:
- name: ci
user:
tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
EOF
chmod 600 config
kubectl --kubeconfig=config get ns kube-system -o name >/dev/null
# ── Run terraform plan on all stacks ── # ── Run terraform plan on all stacks ──
# Emits two timestamps per drifted stack so the Pushgateway/Prometheus
# side can compute drift-age-hours via `time() - drift_stack_first_seen`.
- | - |
DRIFTED="" DRIFTED=""
CLEAN=0 CLEAN=0
ERRORS="" ERRORS=""
NOW=$(date +%s)
# Metrics accumulator — written once per stack, then pushed as a batch.
METRICS=""
for stack_dir in stacks/*/; do for stack_dir in stacks/*/; do
stack=$(basename "$stack_dir") stack=$(basename "$stack_dir")
@ -90,50 +56,12 @@ steps:
EXIT=$? EXIT=$?
case $EXIT in case $EXIT in
0) 0) echo "OK (no changes)"; CLEAN=$((CLEAN + 1)) ;;
echo "OK (no changes)" 1) echo "ERROR"; ERRORS="$ERRORS $stack" ;;
CLEAN=$((CLEAN + 1)) 2) echo "DRIFT DETECTED"; DRIFTED="$DRIFTED $stack" ;;
# drift_stack_state=0 means clean; age-hours irrelevant so we
# still push 0 so per-stack gauges don't go stale.
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 0\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} 0\n"
;;
1)
echo "ERROR"
ERRORS="$ERRORS $stack"
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 2\n"
;;
2)
echo "DRIFT DETECTED"
DRIFTED="$DRIFTED $stack"
# Fetch first-seen timestamp from Pushgateway (preserve across runs).
FIRST_SEEN=$(curl -s "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics" \
| awk -v s="$stack" '$1 == "drift_stack_first_seen{stack=\""s"\"}" {print $2; exit}')
if [ -z "$FIRST_SEEN" ] || [ "$FIRST_SEEN" = "0" ]; then
FIRST_SEEN="$NOW"
fi
AGE_HOURS=$(( (NOW - FIRST_SEEN) / 3600 ))
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 1\n"
METRICS="${METRICS}drift_stack_first_seen{stack=\"$stack\"} $FIRST_SEEN\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} $AGE_HOURS\n"
;;
esac esac
done done
# Summary counters — single gauge per run.
DRIFT_COUNT=$(echo "$DRIFTED" | wc -w)
ERROR_COUNT=$(echo "$ERRORS" | wc -w)
METRICS="${METRICS}drift_stack_count $DRIFT_COUNT\n"
METRICS="${METRICS}drift_error_count $ERROR_COUNT\n"
METRICS="${METRICS}drift_clean_count $CLEAN\n"
METRICS="${METRICS}drift_detection_last_run_timestamp $NOW\n"
# ── Push to Pushgateway ──
# One batched push keeps the run atomic: either all metrics land or none.
printf "%b" "$METRICS" | curl -s --data-binary @- \
http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drift-detection \
|| echo "(pushgateway unavailable, metrics lost for this run)"
echo "" echo ""
echo "=== Drift Detection Summary ===" echo "=== Drift Detection Summary ==="
echo "Clean: $CLEAN stacks" echo "Clean: $CLEAN stacks"

View file

@ -5,75 +5,56 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
depth: 2 depth: 2
steps: steps:
- name: run-issue-responder - name: run-issue-responder
image: alpine:3.20 image: python:3.12-alpine
commands: commands:
- apk add --no-cache curl jq - apk add --no-cache openssh-client curl jq
# Authenticate to Vault via K8s SA JWT # Authenticate to Vault via K8s SA JWT
- | - |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \ VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$$SA_TOKEN\"}") -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
VAULT_TOKEN=$(echo "$$VAULT_RESP" | jq -r .auth.client_token) VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
if [ -z "$$VAULT_TOKEN" ] || [ "$$VAULT_TOKEN" = "null" ]; then if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault authentication failed" echo "ERROR: Vault authentication failed"
exit 1 exit 1
fi fi
echo "Vault authenticated" echo "Vault authenticated"
# Fetch API token for claude-agent-service # Fetch DevVM SSH key
- | - |
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $$VAULT_TOKEN" \ curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \ http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
jq -r '.data.data.api_bearer_token') jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
if [ -z "$$AGENT_TOKEN" ] || [ "$$AGENT_TOKEN" = "null" ]; then chmod 600 /tmp/devvm-key
echo "ERROR: Failed to fetch agent API token" if [ ! -s /tmp/devvm-key ]; then
echo "ERROR: Failed to fetch DevVM SSH key"
exit 1 exit 1
fi fi
echo "Agent token fetched" echo "SSH key fetched"
# Submit job to claude-agent-service # SSH to DevVM and run issue-responder agent
- | - |
ISSUE_NUM="${ISSUE_NUMBER:-}" ISSUE_NUM="${ISSUE_NUMBER:-}"
ISSUE_TITLE="${ISSUE_TITLE:-}" ISSUE_TITLE="${ISSUE_TITLE:-}"
ISSUE_LABELS="${ISSUE_LABELS:-}" ISSUE_LABELS="${ISSUE_LABELS:-}"
ISSUE_URL="${ISSUE_URL:-}" ISSUE_URL="${ISSUE_URL:-}"
if [ -z "$$ISSUE_NUM" ]; then if [ -z "$ISSUE_NUM" ]; then
echo "ERROR: No issue number provided" echo "ERROR: No issue number provided"
exit 1 exit 1
fi fi
echo "Processing issue #$$ISSUE_NUM: $$ISSUE_TITLE" echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
echo "Labels: $ISSUE_LABELS"
PAYLOAD=$(jq -n \ ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
--arg prompt "Process GitHub Issue #$$ISSUE_NUM: $$ISSUE_TITLE. Labels: $$ISSUE_LABELS. URL: $$ISSUE_URL. Read the issue body via GitHub API, investigate, and take appropriate action." \ "cd ~/code && git -C infra stash && git -C infra pull --rebase && git -C infra stash pop 2>/dev/null; \
--arg agent ".claude/agents/issue-responder" \ ~/.local/bin/claude -p \
'{prompt: $prompt, agent: $agent, max_budget_usd: 10, timeout_seconds: 1800}') --agent infra/.claude/agents/issue-responder \
--dangerously-skip-permissions \
RESP=$(curl -sf -X POST \ --max-budget-usd 10 \
-H "Authorization: Bearer $$AGENT_TOKEN" \ 'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
-H "Content-Type: application/json" \ # Cleanup
-d "$$PAYLOAD" \ - rm -f /tmp/devvm-key
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
JOB_ID=$(echo "$$RESP" | jq -r '.job_id')
echo "Job submitted: $$JOB_ID"
# Poll for completion (30min max)
- |
for i in $(seq 1 120); do
sleep 15
RESULT=$(curl -sf \
-H "Authorization: Bearer $$AGENT_TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$$JOB_ID)
STATUS=$(echo "$$RESULT" | jq -r '.status')
echo "[$$i/120] Status: $$STATUS"
if [ "$$STATUS" != "running" ]; then
echo "$$RESULT" | jq .
if [ "$$STATUS" = "completed" ]; then exit 0; else exit 1; fi
fi
done
echo "ERROR: Job timed out after 30 minutes"
exit 1

View file

@ -0,0 +1,49 @@
when:
event: push
branch: master
path:
include:
- "stacks/platform/modules/k8s-portal/files/**"
clone:
git:
image: woodpeckerci/plugin-git
settings:
attempts: 5
backoff: 10s
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
username: "viktorbarzin"
password:
from_secret: dockerhub-pat
repo: viktorbarzin/k8s-portal
dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
context: stacks/platform/modules/k8s-portal/files
platforms:
- linux/amd64
tag: ["${CI_PIPELINE_NUMBER}", "latest"]
cache_from: "viktorbarzin/k8s-portal:latest"
cache_to: "type=inline"
- name: deploy
image: bitnami/kubectl:latest
commands:
- "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
- "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
- "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [success, failure]

View file

@ -11,14 +11,13 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
depth: 5 depth: 5
steps: steps:
- name: parse-and-implement - name: parse-and-implement
image: python:3.12-alpine image: python:3.12-alpine
commands: commands:
- apk add --no-cache jq curl git - apk add --no-cache jq curl git openssh-client
- sh scripts/postmortem-pipeline.sh - sh scripts/postmortem-pipeline.sh
- name: notify-slack - name: notify-slack

View file

@ -5,7 +5,6 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
attempts: 5 attempts: 5
backoff: 10s backoff: 10s

View file

@ -1,64 +0,0 @@
# Sync infra/scripts/pve-nfs-exports → PVE host /etc/exports on change.
#
# Wave 6b of the state-drift consolidation plan: move the "scp + exportfs -ra"
# deploy step out of runbook-human-hands and into CI so the Proxmox NFS export
# table tracks git.
#
# Trigger: push to master that touches `scripts/pve-nfs-exports`. The file
# header documents the deploy invocation; this pipeline codifies it.
#
# Credentials:
# - pve_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
# 2026-04-18 as `woodpecker-pve-nfs-exports-sync`). Public key lives in
# /root/.ssh/authorized_keys on the PVE host. Private key mirrored in
# Vault `secret/woodpecker/pve_ssh_key` for recovery.
when:
- event: push
branch: master
path: scripts/pve-nfs-exports
- event: manual
clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
steps:
- name: deploy
image: alpine:3.20
environment:
PVE_SSH_KEY:
from_secret: pve_ssh_key
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- apk add --no-cache openssh-client curl
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$PVE_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
- ssh-keyscan -t ed25519 192.168.1.127 >> ~/.ssh/known_hosts 2>/dev/null
# Diff what we'd ship, so pipeline logs show the intended change.
- echo '---diff---' && ssh -o BatchMode=yes root@192.168.1.127 "cat /etc/exports" > /tmp/remote.exports || true
- diff -u /tmp/remote.exports scripts/pve-nfs-exports || true
- echo '---applying---'
- scp -o BatchMode=yes scripts/pve-nfs-exports root@192.168.1.127:/etc/exports
- ssh -o BatchMode=yes root@192.168.1.127 "exportfs -ra && exportfs -s | head -5"
- echo '---done---'
- name: slack
image: curlimages/curl:8.11.0
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]

View file

@ -1,157 +0,0 @@
# Sync modules/docker-registry/* → /opt/registry/ on docker-registry VM
# (10.0.20.10) on change, and bounce containers + nginx when needed.
#
# Replaces the manual "ssh + scp + docker compose up -d" that was required
# after the 2026-04-19 `registry:2 → registry:2.8.3` pin landed. The deploy
# flow is now: edit a file in modules/docker-registry/ → git push → this
# pipeline runs → registry VM picks up the change.
#
# Trigger: push to master that touches any managed file (see `when.path`),
# or a manual run via Woodpecker UI / API.
#
# Credentials:
# - registry_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
# 2026-04-19 as `woodpecker-registry-config-sync`). Public key lives in
# /root/.ssh/authorized_keys on 10.0.20.10. Private key mirrored in
# Vault `secret/woodpecker/registry_ssh_key` (subkeys private_key /
# public_key / known_hosts_entry) for recovery.
#
# Why bounce nginx every time: nginx caches upstream DNS at startup, so if
# any registry-* container gets recreated (new IP on the docker bridge),
# nginx keeps forwarding to a stale address. Always restart nginx as the
# last step — see docs/runbooks/registry-vm.md § "Bouncing registry
# containers — the nginx DNS trap".
when:
- event: push
branch: master
path:
include:
- 'modules/docker-registry/docker-compose.yml'
- 'modules/docker-registry/fix-broken-blobs.sh'
- 'modules/docker-registry/cleanup-tags.sh'
- 'modules/docker-registry/nginx_registry.conf'
- 'modules/docker-registry/config-private.yml'
- event: manual
clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
steps:
- name: deploy
image: alpine:3.20
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
commands:
- apk add --no-cache openssh-client rsync
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- echo '---detecting changed files---'
- |
# Mirror the remote state of each file so we can diff and decide what bounces.
CHANGED=""
for f in docker-compose.yml fix-broken-blobs.sh cleanup-tags.sh nginx_registry.conf config-private.yml; do
LOCAL="modules/docker-registry/$f"
REMOTE="/opt/registry/$f"
if [ ! -f "$LOCAL" ]; then
echo "skip $f (not in repo)"
continue
fi
# Pull the remote copy into /tmp for a diff. ssh -n avoids stdin-hogging.
REMOTE_CONTENT=$(ssh -n -o BatchMode=yes root@10.0.20.10 "cat $REMOTE 2>/dev/null || true")
LOCAL_CONTENT=$(cat "$LOCAL")
if [ "$LOCAL_CONTENT" = "$REMOTE_CONTENT" ]; then
echo "unchanged: $f"
else
echo "---diff: $f ---"
echo "$REMOTE_CONTENT" > /tmp/remote.txt
diff -u /tmp/remote.txt "$LOCAL" | head -40 || true
CHANGED="$CHANGED $f"
fi
done
echo "CHANGED_FILES=$CHANGED"
printf '%s' "$CHANGED" > /tmp/changed
- echo '---applying---'
- |
CHANGED=$(cat /tmp/changed)
if [ -z "$CHANGED" ]; then
echo "No files changed — exiting cleanly (manual run with no drift)."
exit 0
fi
# Ship every managed file unconditionally — scp is cheap, idempotency is safe.
scp -o BatchMode=yes \
modules/docker-registry/docker-compose.yml \
modules/docker-registry/fix-broken-blobs.sh \
modules/docker-registry/cleanup-tags.sh \
modules/docker-registry/nginx_registry.conf \
modules/docker-registry/config-private.yml \
root@10.0.20.10:/opt/registry/
ssh -n -o BatchMode=yes root@10.0.20.10 '
chmod +x /opt/registry/fix-broken-blobs.sh /opt/registry/cleanup-tags.sh
'
- echo '---bouncing containers + nginx---'
- |
CHANGED=$(cat /tmp/changed)
# Compose-visible files: docker-compose.yml (image tag, mounts) and
# config-private.yml (registry config → needs registry-private reload).
BOUNCE_COMPOSE=0
BOUNCE_NGINX=0
echo "$CHANGED" | grep -q "docker-compose.yml" && BOUNCE_COMPOSE=1
echo "$CHANGED" | grep -q "config-private.yml" && BOUNCE_COMPOSE=1
echo "$CHANGED" | grep -q "nginx_registry.conf" && BOUNCE_NGINX=1
if [ "$BOUNCE_COMPOSE" = "1" ]; then
echo "compose-visible change → pull + up -d"
ssh -n -o BatchMode=yes root@10.0.20.10 '
cd /opt/registry
docker compose pull 2>&1 | tail -5
docker compose up -d 2>&1 | tail -20
'
# Any compose recreate requires nginx DNS refresh too.
BOUNCE_NGINX=1
fi
if [ "$BOUNCE_NGINX" = "1" ]; then
echo "bouncing nginx to flush upstream DNS cache"
ssh -n -o BatchMode=yes root@10.0.20.10 '
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" | grep -E "registry-"
'
fi
if [ "$BOUNCE_COMPOSE" = "0" ] && [ "$BOUNCE_NGINX" = "0" ]; then
echo "only script files changed (cron-picks-up semantics) — no bounce needed"
fi
- echo '---verify---'
- |
ssh -n -o BatchMode=yes root@10.0.20.10 '
echo "=== catalog ==="
# Prove auth + routing survived.
curl -sk -o /dev/null -w "catalog (unauth → 401 expected): HTTP %{http_code}\n" \
https://127.0.0.1:5050/v2/
echo "=== integrity scan (dry-run) ==="
python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | tail -5
'
- name: slack
image: curlimages/curl:8.11.0
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]

View file

@ -6,7 +6,6 @@ clone:
git: git:
image: woodpeckerci/plugin-git image: woodpeckerci/plugin-git
settings: settings:
partial: false
attempts: 5 attempts: 5
backoff: 10s backoff: 10s

208
AGENTS.md
View file

@ -9,55 +9,12 @@
- **Ask before `git push`** — always confirm with the user first - **Ask before `git push`** — always confirm with the user first
## Execution ## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`) - **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars) - **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config` - **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet` - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan` - **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do **NOT** use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
**Canonical workflow:**
1. Write the `resource` block that matches the live object.
2. In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:
```hcl
import {
to = helm_release.kured
id = "kured/kured" # Helm ID format: <namespace>/<release-name>
}
resource "helm_release" "kured" {
name = "kured"
namespace = "kured"
repository = "https://kubereboot.github.io/charts/"
chart = "kured"
version = "5.7.0"
# ... values matching the live release
}
```
3. `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is **0 changes**.
4. `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
5. After the apply lands cleanly, **delete the `import {}` block** in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
**Why `import {}` and not `terraform import`:**
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
- Plan-safe: the `import` plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
**Finding the provider-specific ID:** each provider has its own convention.
| Resource | ID format | Example |
|---|---|---|
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
| `kubernetes_manifest` | `{"apiVersion":"...","kind":"...","metadata":{"namespace":"...","name":"..."}}` | (pass as HCL object literal) |
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
## Secrets Management (SOPS) ## Secrets Management (SOPS)
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys) - **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys) - **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
@ -90,7 +47,6 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS) - **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
- **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs - **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
- **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks - **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
- **CI compute is external (ADR-0002, 2026-06-12)**: builds, tests, lint, and release jobs run on GitHub Actions hosted runners via each repo's GitHub mirror — never on cluster nodes. In-cluster pipelines exist only for steps that need cluster access (Woodpecker `kubectl set image` deploys, terragrunt applies, certbot). Never add an in-cluster build or test pipeline to any repo; the fallback-build pattern was deliberately removed. After pushing anything that fires a build chain, watch it end-to-end (GHA run → Woodpecker deploy → rollout) before calling the change done — verify live state, not the checkmark.
## Key Paths ## Key Paths
- `stacks/<service>/main.tf` — service definition - `stacks/<service>/main.tf` — service definition
@ -100,111 +56,25 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- `config.tfvars` — non-secret configuration (plaintext) - `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON) - `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference) - `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
- `scripts/cluster_healthcheck.sh`42-check cluster health script (nodes, workloads, monitoring, certs, backups, external reachability) - `scripts/cluster_healthcheck.sh` — 25-check cluster health script
## Storage ## Storage
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks. - **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
- **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host. - **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver. - **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
- **NFS server**: Proxmox host at 192.168.1.127 (sole NFS). HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (VM 9000, 10.0.10.15) decommissioned 2026-04-13. Legacy `nfs-truenas` StorageClass name retained (48 PVs bind it; SC names are immutable on PVs) but now points to the Proxmox host, identical to `nfs-proxmox`. - **NFS server**: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases. - **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state). - **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV. - **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config, **VM images via `vzdump-vms`**). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking). - **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
- **vzdump-vms** (Daily 01:00): live `vzdump --mode snapshot` of hand-managed VMs (NOT in TF) → `/mnt/backup/vzdump/`, keep 3/VMID. `VZDUMP_VMIDS` default `102` (devvm) — the only VM imaged today; before this (2026-06-09) no VM was ever imaged. NOT in the incremental offsite manifest; monthly full pass mirrors it. See `docs/architecture/backup-dr.md`.
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify). - **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`. - **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds. - **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). - **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). `truenas/` renamed to `nfs/`, `pve-backup/nfs-mirror/` removed.
## Shared Variables (never hardcode) ## Shared Variables (never hardcode)
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host` `var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Redis Service Naming (read before wiring a new consumer)
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
| Endpoint | Port(s) | Use for | Backed by |
|----------|---------|---------|-----------|
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
**Rule**: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following `lifecycle` block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
```hcl
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
# kubernetes_cron_job_v1 (extra job_template nesting)
lifecycle {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
**Why not a shared module?** Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects these annotations on every Deployment / StatefulSet / DaemonSet:
```
keel.sh/policy: patch
keel.sh/trigger: poll
keel.sh/pollSchedule: "@every 1h"
```
**`keel.sh/match-tag` is NO LONGER injected — it is actively STRIPPED.** It was the pre-2026-05-26 default (`force + match-tag`), proven unreliable: under `force` it let Keel rewrite tag strings and cross-assign images between containers in multi-image pods. The `blog` deployment was a casualty — its `nginx``nginx-exporter` images got swapped and the site was down 2026-05-26 → 2026-06-01. The policy now sets the annotation to `null` (strips on admission); the 194 pre-existing workloads still carrying it were swept once via `kubectl annotate … keel.sh/match-tag-` on 2026-06-01. The `ignore_changes` line for it (below) is retained as a harmless no-op. See `docs/post-mortems/2026-06-01-keel-match-tag-image-swap.md`.
To suppress the resulting Terraform drift, **enrolled workloads** must carry the complete `ignore_changes` block below. This is the canonical form — it folds together every marker (see the legend after it):
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
```
**Marker legend** (the names are historical; grep each to audit coverage):
| Marker | Ignores | Why |
|---|---|---|
| `# KYVERNO_LIFECYCLE_V1` | `dns_config` | Kyverno injects pod DNS `ndots` config |
| `# KYVERNO_LIFECYCLE_V2` | `keel.sh/policy`, `/trigger`, `/pollSchedule` | Kyverno-injected Keel control annotations |
| `# KEEL_IGNORE_IMAGE` | `container[N].image` (one line **per container index**, incl. `init_container[N]`) | Keel rewrites the image tag on `policy=patch`; without this, `apply` reverts the bump (a **downgrade**) |
| `# KEEL_LIFECYCLE_V1` | `keel.sh/match-tag`, `keel.sh/update-time` (pod template), `kubernetes.io/change-cause`, `deployment.kubernetes.io/revision` | every Keel digest-update restamps these; without ignoring them `apply` strips them → forces a rollout → Keel re-stamps → fight loop |
**Multi-container caveat**: `container[0].image` only covers the first container. Add one `container[N].image` line for **every** container index, plus `init_container[N].image` for init containers — otherwise the un-ignored container's image still drifts/downgrades.
The `KEEL_LIFECYCLE_V1` + per-container `KEEL_IGNORE_IMAGE` lines were swept across all enrolled workloads on **2026-05-28** (previously only `llama-cpp` had them; the rest fought on every apply). New enrolled workloads must include the full block. Workloads in un-enrolled namespaces don't receive the annotations and don't need the block.
Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment metadata (not pod template); the policy's `exclude` clause respects it, no annotation gets injected, no `ignore_changes` needed.
**Audit**: `rg "KYVERNO_LIFECYCLE_V2" stacks/` — count should equal the number of enrolled workloads. `rg "KEEL_LIFECYCLE_V1" stacks/` should match it (every enrolled workload also carries the V1 lines).
**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
## Tier System ## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label. `0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps) - Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
@ -214,10 +84,10 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
## Infrastructure ## Infrastructure
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM) - **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4 - **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits. - **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass. - **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding) - **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated - **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch) - **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
## Contributor Onboarding ## Contributor Onboarding
@ -227,69 +97,7 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
4. Viktor reviews → CI applies → Slack notification 4. Viktor reviews → CI applies → Slack notification
5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide 5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide
### Non-admin workstation users — the AGENT does the git work
Non-admin devvm users (power-user / namespace-owner tiers) may not know git at
all. Their agent handles every version-control step silently — never ask them
to commit, push, pull, or open a PR, and never surface git jargon at them.
Their infra clone arrives preconfigured: git identity, a `forgejo` remote
authenticated via `~/.git-credentials`, and `master` tracking `forgejo/master`
(auto-freshened hourly and at session launch, fast-forward only).
Two per-user layouts exist (`code_layout` in
`scripts/workstation/roster.yaml`): `single` (the default) — `~/code` IS the
locked infra clone — and `workspace``~/code` is a plain directory of
per-project clones: the infra clone at `~/code/infra`, plus each roster
`repos` entry (e.g. `~/code/tripit`) cloned from Forgejo `viktor/<name>` with
the user's own PAT. The reconcile auto-migrates a single-layout `~/code` when
a user is flipped to `workspace`, and keeps every clone fresh either way.
The model is **allow-then-audit** (Viktor, 2026-06-10): whitelisted users (emo)
push straight to `master` — no PR gate — and the record of *what changed and
why* is what matters. Force-push is disabled for everyone, so master history
is append-only.
**Feature-sized work is worktree-first** (org rule, 2026-06-10): develop in an
isolated worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>` off
`forgejo/master`) so concurrent agent sessions never collide in the clone, then
land by merging latest master into the branch and pushing it
(`git push forgejo HEAD:master`, or the PR fallback below if not whitelisted) —
the audit-trail rules below apply to the branch's commit messages all the same.
Locked (git-crypt) clones can use plain `git worktree add`. Trivial
single-commit fixes may be committed directly on a clean `master`. Full
lifecycle: `~/.claude/rules/execution.md` §3.
To land a finished change from such a clone:
1. Commit on `master`. **The commit message is the audit trail** — this matters
more than the change itself:
- subject: what changed, specific ("ha-sofia: lower fan curve bias to -5")
- body: WHY, in plain words — paraphrase the user's actual request and any
reasoning ("Emil asked for quieter fans in the evening; curve was
overshooting after the 2026-06-08 redesign")
2. `git push forgejo master`. If rejected non-fast-forward: `git pull --rebase
forgejo master` and push again.
3. **Never use `[ci skip]`** as a non-admin — it hides the change from the
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a
`<os-user>/<short-topic>` branch + PR with the user's own PAT
(`write:repository` suffices — verified 2026-06-10):
```bash
TOK=$(sed -E 's#https://[^:]+:([^@]+)@.*#\1#' ~/.git-credentials)
curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' \
https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/pulls \
-d '{"title":"<title>","head":"<os-user>/<short-topic>","base":"master","body":"<what + why>"}'
```
## Common Operations ## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service. - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
@ -297,7 +105,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`. - **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
## Automated Service Upgrades ## Automated Service Upgrades
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s)`claude -p` (upgrade agent) - **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH`claude -p` (upgrade agent)
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure - **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns - **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow) - **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)

View file

@ -1,239 +0,0 @@
# Infra
Terragrunt-managed homelab declaring a 7-node Kubernetes cluster (1 control plane + 6 workers) on a single Proxmox host. Vault is the secrets source of truth; everything else flows from this repo via `scripts/tg apply`.
## Language
### Code organization
**Service**:
The deployed app as a domain concept — one logical thing that runs in the cluster (e.g. immich, technitium, freshrss). Defined by exactly one **Stack**.
_Avoid_: bare "app" without the Service definition; "deployment" (collides with K8s `Deployment`).
**Stack**:
The HCL directory under `stacks/<name>/` that defines a Service, applied independently with `scripts/tg apply`. A Stack is the unit of Terraform organisation; a Service is the running thing. They are 1:1 but not synonyms. A Stack is either **flat** (resources declared directly in its own `.tf` files — the majority, ~94, e.g. immich) or wraps a **Stack-local module** (~31, the larger/older ones).
_Avoid_: using "Stack" when you mean the running Service.
**Module**:
A unit of HCL consumed via `source =`. Two homes, two purposes: **shared** modules under the top-level `modules/` tree (reused across many Stacks) and **Stack-local** modules nested under `stacks/<name>/modules/` (one Stack only). Bare "Module" means the shared kind.
_Avoid_: "library", "package".
**Factory module**:
A shared **Module** that hides convention (defaults, drift handling, secret wiring) behind a small input surface. `modules/kubernetes/` holds exactly four, all factories: `ingress_factory` (103 Stacks), `setup_tls_secret` (93), `nfs_volume` (41), `anubis_instance` (8).
_Avoid_: "wrapper"; citing `k8s_app` / `helm_app` / `postgres_app` (these never existed in the repo).
**Stack-local module**:
A single Stack's implementation factored into a nested `stacks/<name>/modules/<name>/`, sourced by that one Stack only — organisation, not reuse. ~31 Stacks (authentik, kyverno, dbaas, mailserver, metallb, cloudflared, technitium, …). The alternative to a **flat** Stack.
_Avoid_: calling it a "Module" unqualified (it isn't reusable); "submodule".
**State tier**:
Terraform state-backend partition. **Tier 0** = bootstrap Stacks (`infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`) on local SOPS-encrypted state. **Tier 1** = every other Stack, on PG-backed state.
_Avoid_: "phase", "bootstrap stack" — say Tier 0 explicitly.
### Cluster
**Node**:
A K8s cluster VM — `k8s-master` (control plane) plus `k8s-node1..6` (workers). Default reading of the bare word "node" in this repo.
_Avoid_: "k8s node" (redundant), "host" (ambiguous).
**PVE node** / **PVE host**:
The single physical Dell R730 running Proxmox; sole hypervisor and sole NFS server. There is exactly one.
_Avoid_: "server", "hypervisor", "Proxmox" alone when you mean the host.
**Namespace tier**:
A namespace-prefix partition (`0-core-*`, `1-cluster-*`, `2-gpu-*`, `3-edge-*`, `4-aux-*`) driving PriorityClass, default resources, and ResourceQuota — generated by **Kyverno policy** from the namespace name. Orthogonal to **State tier**.
_Avoid_: "Service tier" (the partition is on the namespace, not the Service); collapsing Namespace tier with State tier — they are different axes.
**Kyverno policy**:
The convention engine of the cluster — a ClusterPolicy or Policy resource that mutates/generates/validates on admission. Owns Namespace tier limits/quotas, `dns_config` injection on every pod-owning workload, Forgejo pull-credential sync across namespaces, TLS-secret replication. When the repo says "this happens automatically", a Kyverno policy is usually the actor.
_Avoid_: bare "policy" (overloaded with Vault, RBAC, NetworkPolicy).
**Critical-path Service**:
One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas ≥3, PDB enforced, monitored independently.
_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
**Namespace-owner**:
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
_Avoid_: bare "user", "tenant".
### Workstation (multi-user devvm)
**devvm**:
The dev VM (`10.0.10.10`), a non-cluster VM on the **PVE host** that hosts each person's Claude Code coding environment (the `t3-serve@<user>` and terminal-lobby sessions). Not a **Node** (it isn't in the cluster).
_Avoid_: calling it a "Node"; "host" (reserved for the PVE host).
**Workstation**:
A person's identity-scoped Claude Code environment on the **devvm** — one OS account, their session runs as that uid. The same human may also be a **Namespace-owner**; the cluster identity and the Workstation are two facets of one person.
_Avoid_: "t3 instance" (only one surface of a Workstation); bare "user".
**RBAC tier**:
The role band that governs a person everywhere — `kubernetes-admins` (Viktor; cluster-admin, secrets, apply), `kubernetes-power-users` (infra-aware, broad read, no destructive change), `kubernetes-namespace-owners` (own-namespace app dev). The single axis that keys both cluster RBAC **and** the **Workstation profile**.
_Avoid_: inventing per-service roles; conflating with **Namespace tier** / **State tier** (those are not identity).
**Workstation profile**:
The **RBAC tier**-keyed bundle a **Workstation** receives: **Config inheritance** (identical for everyone) plus the person's **Infra visibility** and cluster scope (varies by tier). Never hand-tuned per person — one identity decision (Authentik group + `k8s_users`) provisions the cluster facet and the Workstation together.
_Avoid_: per-person bespoke setup (the rejected "stitched-together" status quo).
**Config inheritance**:
The universal half of every **Workstation profile** — Viktor's *static* Claude config (skills, rules, agents, commands, `CLAUDE.md`, hooks) **live-extends** from a **Config base**, it is NOT copied: each person's `~/.claude` draws these from the shared base, so an edit Viktor makes appears in every Workstation immediately, with no seed/copy/sync step. Users may layer their own items on top (rarely do). **RBAC tier**-independent. Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, sessions) is never shared — local only.
_Avoid_: a periodic copy/seed/sync of `~/.claude` (rejected — inheritance must be live); sharing `~/.claude.json` / `.credentials.json` (per-user, secret-bearing, corrupts under concurrent writes — see emo's multi-session profile).
**Config base**:
The shared, secret-free, version-controlled source of truth for the *static* Claude config that every **Workstation** live-extends (see **Config inheritance**). Viktor's authoring surface — when he edits a skill/rule, he edits the base; the chezmoi dotfiles repo is its versioned form (commit = audit/rollback, NOT a push to users). Holds only skills/rules/agents/commands/`CLAUDE.md`/hooks — never secrets or per-user mutable state.
_Avoid_: treating it as a per-user seed target (it is a live shared source, not a copy); putting secrets in it.
**Infra visibility**:
What a non-admin **Workstation** may SEE of the infra: the public repo **code** and the person's own **RBAC**-scoped view of the live cluster (kubectl / dashboard within their namespaces). Explicitly excludes the **git-crypt** secrets (`terraform.tfvars`, `secrets/`) and any out-of-scope mutation. The boundary that "respect their permissions" enforces — violated today because `~/code` is one git-crypt-*unlocked* tree shared via the `code-shared` group.
_Avoid_: reading "see the infra" as access to secrets or apply rights.
### Networking
**Public domain**:
`viktorbarzin.me`, served through Cloudflare. DNS records are either **proxied** (Cloudflare CDN/WAF in front) or **non-proxied** (direct A/AAAA reachable via Cloudflared Tunnel).
_Avoid_: "external", "outside".
**Internal domain**:
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
**Authentik outpost**:
A standalone Authentik deployment that terminates the proxy/auth flow for a specific binding model. The repo runs two distinct ones: the default outpost (used by `auth = "required"`) and the `public` outpost (anonymous binding, used by `auth = "public"`).
_Avoid_: conflating outpost with Authentik core; "Authentik instance".
**Cloudflared Tunnel**:
The channel by which non-proxied **public domain** traffic reaches the cluster, terminating at Traefik. Backs every `dns_type = "non-proxied"` record and is the fallback path for the wildcard `*.viktorbarzin.me`.
_Avoid_: "the tunnel" without "Cloudflared" (could mean Headscale).
**Ingress chain**:
The opinionated stack of Traefik middlewares that `ingress_factory` layers onto every Ingress. Slots, in order: forward-auth (per **Ingress auth**) → anti-AI scraping (default-on when no Authentik is in the path) → CrowdSec bouncer (fail-open) → retry (2× / 100ms) → rate-limit (429, not 503). Adding or removing a middleware is a Stack-level choice, but the chain order is convention.
_Avoid_: "middleware list", "Traefik chain". The Anubis PoW gate is upstream of this chain, not inside it.
**MetalLB / LB IP**:
The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Services. Two IPs matter: the **shared LB IP** `10.0.20.200` (~10 services — PG state-backend, headscale, wireguard, coturn, xray… — all `externalTrafficPolicy: Cluster`) and **Traefik's dedicated LB IP** `10.0.20.203` (`externalTrafficPolicy: Local`). Traefik runs on its own IP because ETP:Local preserves the **real client IP** (for CrowdSec) and enables QUIC, and MetalLB forbids mixed ETP on one shared IP.
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#security` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:
Default StorageClass for any workload holding sensitive data (databases, auth, password managers, email, financial data). LUKS2 over a Proxmox LVM-thin LV.
_Avoid_: bare "encrypted PVC" — name the StorageClass.
**proxmox-lvm**:
Block StorageClass for non-sensitive workloads (caches, monitoring data, indexes, app state without secrets).
**NFS volume**:
RWX file storage for shared media libraries, large datasets, or anything that needs to be inspected from outside K8s. Provisioned via the `nfs_volume` Module.
_Avoid_: "shared storage" (ambiguous).
**nfs-truenas StorageClass**:
A historical SC name retained only because StorageClass strings are immutable on bound PVs. The underlying server is the **PVE host**, not TrueNAS; TrueNAS is decommissioned.
_Avoid_: assuming this means TrueNAS.
**local-path**:
The cluster's Kubernetes default StorageClass (`rancher.io/local-path`) — node-local hostpath, **non-replicated**, no CSI snapshots, outside the backup pipeline. A PVC that omits `storageClassName` silently binds here, pinned to one Node's disk. Always set an explicit `storageClassName`; reach for local-path only for genuinely throwaway, node-pinned data.
_Avoid_: relying on the default. Note the two senses of "default": local-path is the *cluster default SC* (what an unspecified PVC gets); proxmox-lvm-encrypted is the *default choice* for sensitive data. Different things.
**3-2-1 backup**:
The named posture of where data lives: **Copy 1** = live on the PVE thin pool (sdc), **Copy 2** = sda backup disk (`/mnt/backup`), **Copy 3** = offsite Synology NAS. Per-PVC file-level rsync from LVM thin snapshots; databases additionally dump to NFS for per-DB restore.
_Avoid_: bare "backup" without saying which copy you mean (a service is "backed up" only once it's on Copy 2; Copy 3 is the disaster floor).
### Data
**CNPG** / **pg-cluster**:
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
### Secrets
**Vault path**:
Convention: `secret/<service>` for Service-owned secrets, `secret/viktor` for personal/global, `secret/platform` for cluster-wide maps (`k8s_users`, `homepage_credentials`).
_Avoid_: conflating Vault path (e.g. `secret/viktor`) with Vault field (e.g. `forgejo_pull_token`).
**ExternalSecret** / **ESO**:
A K8s manifest that materialises a Vault KV value as a K8s Secret. Two ClusterSecretStores: `vault-kv` (KV engine) and `vault-database` (rotating DB creds).
**Plan-time secret**:
A secret value read in Terraform via `data "kubernetes_secret"` (i.e. via the ESO-created K8s Secret) at plan time, with no Vault provider call. Distinct from a **vault data source** read (`data "vault_kv_secret_v2"`), which still goes through the Vault provider. A few Stacks remain hybrid (plan-time for env vars, vault data source for module inputs).
**Sealed Secret**:
A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinct from ExternalSecret — Sealed Secrets carry their own bytes, ExternalSecrets reference Vault.
### CI/CD
**GHA build + Woodpecker deploy**:
The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
**Canonical repo**:
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
**GitHub mirror**:
The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
**GitHub-first repo**:
The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
**Forgejo registry**:
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
**Keel**:
The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**).
_Avoid_: conflating with **Woodpecker deploy** (push-driven, fires on commit) or **Diun** (watches but only notifies). Never point Keel / `set image` at operator-managed StatefulSets.
**Diun**:
**Notify-only** image-update monitoring — reports that a newer image exists, never rolls anything (contrast **Keel**, which acts). Disabled on pinned images (MySQL, PostgreSQL, Redis) so version pins aren't nagged.
_Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
## Relationships
- A **Service** is defined by exactly one **Stack****flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
- A **Namespace-owner** owns one or more namespaces and one or more public subdomains.
- A **Service** owns its **Vault path** at `secret/<service>`, surfaces values through **ExternalSecrets**, and reads them at plan time via **plan-time secrets**.
- An **Ingress** picks exactly one **Ingress auth** mode; the choice defines how strangers reach the backend.
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
## Example dialogue
> **Dev:** "I'm adding a new **Service** — FastAPI backend with its own JWT login. Do I need Authentik?"
> **Domain expert:** "If the FastAPI login is the gate, set `auth = "app"` on the ingress. That records the intent that you _chose_ not to layer Authentik — leave a one-line comment above stating what gates the Service, or `scripts/tg` will refuse the apply."
> **Dev:** "And storage?"
> **Domain expert:** "Does it hold user data? If yes, `proxmox-lvm-encrypted` — that's the default for anything sensitive. Add a backup CronJob writing to `/mnt/main/<service>-backup/`. If the data is just caches, plain `proxmox-lvm` is fine."
> **Dev:** "What about a Secret with the JWT signing key?"
> **Domain expert:** "Put the key in `secret/<service>` in Vault, then declare an **ExternalSecret** to materialise it as a K8s Secret. Read it at plan time with `data "kubernetes_secret"` — that keeps Vault out of the plan path."
## Flagged ambiguities
- **"tier"** has exactly two senses — always qualify which: *State tier* (Tier 0 / Tier 1, Terraform backend partition) and *Namespace tier* (`0-core`..`4-aux`, scheduling priority/quota). They are orthogonal axes. Do **not** coin new "tier"s: **Ingress auth** is a *mode* (not a tier), and storage speed (SSD vs HDD) is *not* a "tier" either.
- **"node"** can mean a K8s Node (default) or a PVE node. For Proxmox-level statements, say **PVE node** explicitly.
- **"service"** spans two distinct concepts: the deployed app (capitalised **Service**, this repo's domain noun) and the K8s `Service` object (in backticks or qualified "K8s Service"). Lowercase "service" in prose is fine when context disambiguates; flag it when it doesn't.
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
- **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine.
- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one.

View file

@ -5,13 +5,10 @@ ARG TERRAFORM_VERSION=1.5.7
ARG TERRAGRUNT_VERSION=0.99.4 ARG TERRAGRUNT_VERSION=0.99.4
ARG SOPS_VERSION=3.9.4 ARG SOPS_VERSION=3.9.4
ARG KUBECTL_VERSION=1.34.0 ARG KUBECTL_VERSION=1.34.0
ARG VAULT_VERSION=1.18.1
# Install system packages (single layer). # Install system packages (single layer)
# python3: required by scripts/check-ingress-auth-comments.py, invoked
# by scripts/tg before every plan/apply.
RUN apk add --no-cache \ RUN apk add --no-cache \
bash curl git git-crypt jq openssh-client openssl python3 unzip \ bash curl git git-crypt jq openssh-client openssl unzip \
&& rm -rf /var/cache/apk/* && rm -rf /var/cache/apk/*
# Terraform # Terraform
@ -37,16 +34,6 @@ RUN curl -fsSL "https://dl.k8s.io/release/v${KUBECTL_VERSION}/bin/linux/amd64/ku
-o /usr/local/bin/kubectl \ -o /usr/local/bin/kubectl \
&& chmod +x /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl
# Vault CLI — required by scripts/tg for Tier 1 stack PG credential reads
# and Tier 0 advisory locks. Pinned to server version (1.18.1). Without this
# the CI pipeline surfaces the misleading "Cannot read PG credentials" error
# because scripts/tg swallows stderr ("vault: not found").
RUN curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
-o /tmp/vault.zip \
&& unzip /tmp/vault.zip -d /usr/local/bin/ \
&& rm /tmp/vault.zip \
&& vault version
# Provider cache directory (shared across stacks) # Provider cache directory (shared across stacks)
ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache
ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1 ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1

View file

@ -1,224 +1,2 @@
# homelab # What is this?
This is a CLI to manipulate files in the terraform repo and commit and push them
`homelab` is the unified, agent-facing CLI for operating this homelab — one
composable, JSON-capable surface for the operations agents run over and over,
discovered progressively at runtime. It is grown **in place** from this
directory (the former `infra-cli`), and the legacy webhook use-cases still work
(see below).
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
## Usage
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
### v0.1 verbs — the infra inner-loop
| Command | Tier | What it does |
|---|---|---|
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
| `release <kind>:<name>` | write | release a presence claim |
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
| `tf validate <stack>` | read | `scripts/tg validate` |
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
### v0.2 verbs — Kubernetes
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
ambient kubeconfig.
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
`tf` resolves the stack dir by walking up from cwd to the infra root and
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
operations in the encrypted infra repo.
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
reads / prompt writes; v0.1 allows everything and relies on existing gates
(permission mode, presence claims, plan approval).
### v0.3 verbs — memory
A thin HTTP client over the **claude-memory** service (the same backend the
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
ingress). Because it hits the HTTP API directly, it **works even when the MCP
frontend is down**.
| Command | Tier | What it does |
|---|---|---|
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
| `memory list [--category --tag --limit]` | read | recent memories |
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
| `memory secret <id>` | read | reveal a sensitive memory's content |
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
| `memory delete <id>` | write | delete a memory |
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up**
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home**
the privacy-preserving answer to "what does the team use."
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha` **MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
## Design
See `infra/docs/adr/0004``0013` for the architecture decisions.

View file

@ -1 +0,0 @@
v0.8.1

View file

@ -1,388 +0,0 @@
package main
import (
_ "embed"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"os/signal"
"path/filepath"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60
const (
chromeServiceNamespace = "chrome-service"
chromeServiceName = "chrome-service"
chromeServiceCDPPort = 9222
)
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string
// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
mode string // "run" | "open"
script string // path to the user Playwright script (run mode)
url string // initial URL (run: optional; open: required positional)
sharedCtx bool // use the warmed persistent profile instead of a fresh context
keepOpen bool // leave the created context/pages open on exit
port int // explicit local port for the forward (0 = auto)
timeout int // CDP readiness timeout, seconds
help bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
var positionals []string
atoi := func(s, flag string) (int, error) {
n, err := strconv.Atoi(s)
if err != nil {
return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
}
return n, nil
}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-h" || a == "--help":
o.help = true
case a == "--shared-context":
o.sharedCtx = true
case a == "--keep-open":
o.keepOpen = true
case a == "--url":
if i+1 < len(args) {
o.url = args[i+1]
i++
}
case strings.HasPrefix(a, "--url="):
o.url = strings.TrimPrefix(a, "--url=")
case a == "--port":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--port")
if err != nil {
return o, err
}
o.port = n
i++
}
case strings.HasPrefix(a, "--port="):
n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
if err != nil {
return o, err
}
o.port = n
case a == "--timeout":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--timeout")
if err != nil {
return o, err
}
o.timeout = n
i++
}
case strings.HasPrefix(a, "--timeout="):
n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
if err != nil {
return o, err
}
o.timeout = n
case strings.HasPrefix(a, "-"):
return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
default:
positionals = append(positionals, a)
}
}
if o.help {
return o, nil
}
switch mode {
case "run":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
}
o.script = positionals[0]
case "open":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
}
o.url = positionals[0]
}
return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, ~a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
deadline := time.Now().Add(timeout)
client := &http.Client{Timeout: 3 * time.Second}
var lastErr error
for time.Now().Before(deadline) {
resp, err := client.Get(cdpURL + "/json/version")
if err != nil {
lastErr = err
time.Sleep(300 * time.Millisecond)
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
browser, healthy, herr := cdpHealthy(body)
if herr != nil {
lastErr = herr
time.Sleep(300 * time.Millisecond)
continue
}
if !healthy {
return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
}
return browser, nil
}
if lastErr == nil {
lastErr = fmt.Errorf("timed out after %s", timeout)
}
return "", lastErr
}
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

View file

@ -1,106 +0,0 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});

View file

@ -1,54 +0,0 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

View file

@ -1,117 +0,0 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
return `homelab browser drive the cluster's HEADFUL Chrome (anti-bot) over CDP
The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connect_over_cdp,
injects the same stealth.js the in-cluster callers use, and runs your script.
USAGE
homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
homelab browser open <url> [--shared-context] [--timeout S]
WHEN TO USE THIS escalation only; DEFAULT to the headless/MCP browser
Default to the Playwright MCP / headless browser for ALL routine browsing and
automation it's interactive (snapshot per step), fast to start, isolated.
Reach for THIS command ONLY when headless is demonstrably blocked: a site
LOADS fine but a gated action FAILS or HANGS a submit/login/checkout spins
forever, or ONE request errors while its siblings 200. That is the signature
of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
disable-devtool traps). It presents as a real Chrome and usually succeeds
first try but it's the shared cluster browser (slower startup, one batch
run, no per-step feedback), so it's the escalation path, never the default.
ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
ERR_FILE_NOT_FOUND (-6) request intercepted/resolved locally by the
automation layer NOT a network/egress problem.
(This is what silently broke the headless submit.)
ERR_CONNECTION_REFUSED / real egress failure (DNS/route/firewall). These also
ERR_TIMED_OUT / break the initial page load if the page loaded,
ERR_NAME_NOT_RESOLVED egress is fine and the cause is elsewhere.
one endpoint 500s while server-side bot rejection of the automation, not
its siblings 200 your payload.
HABITS
- Inspect the network panel BEFORE retrying a deterministic form; a blind
retry just repeats the same silent failure.
- Don't park a half-filled multi-step form across a user pause the session
can expire; re-run the whole flow from this command in one shot.
- Uploads stream over CDP via setInputFiles from THIS host no chmod/staging
of $HOME needed; just point setInputFiles at a local path.
CONTEXT
Default: a FRESH incognito context, closed on exit safe for the shared
browser and concurrent callers (e.g. tripit). Your script does its own login.
--shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
SCRIPT CONTRACT (run mode)
Your file's body runs with page, context, browser and log() already in scope
(top-level await allowed). Return a value to print it. Example flow.js:
await page.goto('https://portal.example.com/login');
await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
await page.click('button[type=submit]');
await page.waitForURL('**/dashboard');
return 'logged in: ' + page.url();
Run it: homelab browser run flow.js
NOTES
- The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
- The port-forward is always torn down, on success and on error.
`
}

View file

@ -1,172 +0,0 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}

View file

@ -1,99 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func ciCommands() []Command {
return []Command{
{Path: []string{"ci", "status"}, Tier: TierRead,
Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
{Path: []string{"ci", "watch"}, Tier: TierRead,
Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
}
}
func short(s string) string {
if len(s) > 8 {
return s[:8]
}
return s
}
func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return ""
}
sha, _ := gitOutput(root, "rev-parse", "HEAD")
return sha
}
func ciStatus(args []string) error {
commit, _ := firstPositional(args)
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
p, err := c.findPipeline(id, commit)
if err != nil {
return err
}
fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
return nil
}
func ciWatch(args []string) error {
commit, _ := firstPositional(args)
if commit == "" {
commit = currentHEAD()
}
if commit == "" {
return fmt.Errorf("no commit given and not in a git repo")
}
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
timeout := 20 * time.Minute
deadline := time.Now().Add(timeout)
last := ""
for time.Now().Before(deadline) {
p, err := c.findPipeline(id, commit)
if err != nil {
if last != "waiting" {
fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
last = "waiting"
}
} else {
if p.Status != last {
fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
last = p.Status
}
if isTerminalStatus(p.Status) {
fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
if isFailureStatus(p.Status) {
return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
}
return nil
}
}
time.Sleep(15 * time.Second)
}
return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}

View file

@ -1,56 +0,0 @@
package main
import (
"fmt"
"strings"
)
func claimCommands() []Command {
return []Command{
{Path: []string{"claim"}, Tier: TierWrite,
Summary: "claim a shared infra resource on the presence board",
Run: runClaim},
{Path: []string{"release"}, Tier: TierWrite,
Summary: "release a presence claim",
Run: runRelease},
}
}
// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
var label, purpose string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--purpose" || a == "-purpose":
if i+1 < len(args) {
purpose = args[i+1]
i++
}
case strings.HasPrefix(a, "--purpose="):
purpose = strings.TrimPrefix(a, "--purpose=")
case !strings.HasPrefix(a, "-") && label == "":
label = a
}
}
if label == "" {
return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
}
return presenceClaim(label, purpose)
}
func runRelease(args []string) error {
var label string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
label = a
break
}
}
if label == "" {
return fmt.Errorf("usage: homelab release <kind>:<name>")
}
return presenceRelease(label)
}

View file

@ -1,51 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func deployCommands() []Command {
return []Command{
{Path: []string{"deploy", "wait"}, Tier: TierRead,
Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
}
}
// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
target, _ := firstPositional(args)
if target == "" || !strings.Contains(target, "/") {
return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA] [--timeout 10m]")
}
parts := strings.SplitN(target, "/", 2)
ns, deploy := parts[0], parts[1]
sha := flagValue(args, "--sha")
if sha == "" {
sha = short(currentHEAD())
}
deadline := time.Now().Add(10 * time.Minute)
if sha != "" {
fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
matched := false
for time.Now().Before(deadline) {
img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
if strings.Contains(img, sha) {
matched = true
break
}
time.Sleep(10 * time.Second)
}
if !matched {
return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
}
}
fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}

View file

@ -1,172 +0,0 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strings"
)
// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
"sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
return []Command{
{Path: []string{"ha", "token"}, Tier: TierRead,
Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
{Path: []string{"ha", "ssh"}, Tier: TierWrite,
Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
}
}
// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
if name == "" {
name = haDefaultInstance
}
inst, ok := haInstances[name]
if !ok {
return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
}
return inst, nil
}
// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
return "", fmt.Errorf("base64-decode secret value: %w", err)
}
return string(raw), nil
}
func haToken(args []string) error {
name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
for i := 0; i < len(args); i++ {
if args[i] == "--instance" && i+1 < len(args) {
name = args[i+1]
} else if strings.HasPrefix(args[i], "--instance=") {
name = strings.TrimPrefix(args[i], "--instance=")
}
}
inst, err := resolveHAInstance(name)
if err != nil {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
"-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
tok, err := decodeSecretValue(b64)
if err != nil {
return err
}
fmt.Println(tok)
return nil
}
// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
if home, err := os.UserHomeDir(); err == nil && home != "" {
return filepath.Join(home, ".ssh", "id_ed25519")
}
return filepath.Join("~", ".ssh", "id_ed25519")
}
// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
name := haDefaultInstance
keyPath = defaultHAKeyPath()
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
remote = append(remote, args[i+1:]...)
i = len(args)
case a == "--instance":
if i+1 < len(args) {
name = args[i+1]
i++
}
case strings.HasPrefix(a, "--instance="):
name = strings.TrimPrefix(a, "--instance=")
case a == "--key" || a == "-i":
if i+1 < len(args) {
keyPath = args[i+1]
i++
}
case strings.HasPrefix(a, "--key="):
keyPath = strings.TrimPrefix(a, "--key=")
default:
remote = append(remote, a)
}
}
inst, err = resolveHAInstance(name)
return inst, keyPath, remote, err
}
// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
args := []string{
"-F", "/dev/null",
"-o", "IdentityFile=" + keyPath,
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
inst.sshUser + "@" + inst.sshHost,
}
return append(args, remote...)
}
func haSSH(args []string) error {
inst, keyPath, remote, err := parseHASSH(args)
if err != nil {
return err
}
if len(remote) == 0 {
return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
}
return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}

View file

@ -1,92 +0,0 @@
package main
import (
"encoding/base64"
"reflect"
"strings"
"testing"
)
func TestResolveHAInstance(t *testing.T) {
// empty defaults to sofia (the devvm sits on the Sofia LAN)
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
}
}
func TestDecodeSecretValue(t *testing.T) {
// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
// returns that base64, which decodeSecretValue turns back into the raw token.
enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
// trailing whitespace/newline from jsonpath output must be tolerated
if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
if _, err := decodeSecretValue("not-base64!!"); err == nil {
t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
func TestBuildHASSHArgs(t *testing.T) {
inst, _ := resolveHAInstance("sofia")
got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
want := []string{
"-F", "/dev/null",
"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
"vbarzin@192.168.1.8",
"cat", "/config/configuration.yaml",
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
}
}
func TestParseHASSH(t *testing.T) {
// instance flag + everything after `--` is the verbatim remote command
inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if inst.name != "sofia" {
t.Errorf("instance = %q, want sofia", inst.name)
}
if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
}
if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
t.Errorf("remote = %v, want [docker ps -a]", remote)
}
// bare args (no `--`) are also taken as the remote command; -i overrides the key
_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if key2 != "/tmp/k" {
t.Errorf("key = %q, want /tmp/k", key2)
}
if !reflect.DeepEqual(remote2, []string{"uptime"}) {
t.Errorf("remote = %v, want [uptime]", remote2)
}
// unknown instance surfaces as an error
if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
t.Errorf("parseHASSH should error on unknown instance")
}
}

View file

@ -1,288 +0,0 @@
package main
import (
"fmt"
"os"
"strings"
)
func k8sCommands() []Command {
return []Command{
{Path: []string{"k8s", "status"}, Tier: TierRead,
Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
{Path: []string{"k8s", "get"}, Tier: TierRead,
Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
{Path: []string{"k8s", "logs"}, Tier: TierRead,
Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
{Path: []string{"k8s", "describe"}, Tier: TierRead,
Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
{Path: []string{"k8s", "debug"}, Tier: TierRead,
Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
{Path: []string{"k8s", "pf"}, Tier: TierRead,
Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
{Path: []string{"k8s", "db"}, Tier: TierWrite,
Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
{Path: []string{"k8s", "exec"}, Tier: TierWrite,
Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
{Path: []string{"k8s", "restart"}, Tier: TierWrite,
Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
{Path: []string{"k8s", "probe"}, Tier: TierRead,
Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
}
}
func k8sStatus(args []string) error {
t := parseK8sTarget(args)
ns := t.namespace() // "" when no app/ns given → cluster-wide
get := []string{"get", "pods", "-o", "wide"}
ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
if ns == "" {
get = append(get, "-A")
ev = append(ev, "-A")
}
if err := kubectlStream(ns, get...); err != nil {
return err
}
fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
_ = kubectlStream(ns, ev...) // best-effort
return nil
}
func k8sGet(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
}
return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}
func k8sLogs(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
}
a := []string{"logs"}
if t.selector != "" {
a = append(a, "-l", t.selector)
} else {
a = append(a, t.objectRef())
}
if t.container != "" {
a = append(a, "-c", t.container)
}
if !containsPrefix(t.rest, "--tail") {
a = append(a, "--tail=200")
}
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sDescribe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
}
if len(t.rest) > 0 {
return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
}
return kubectlStream(t.namespace(), "describe", t.objectRef())
}
func k8sDebug(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s debug <app>")
}
ns := t.namespace()
sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
sec("pods")
_ = kubectlStream(ns, "get", "pods", "-o", "wide")
sec("workloads")
_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
sec("describe "+t.objectRef())
_ = kubectlStream(ns, "describe", t.objectRef())
sec("recent logs (--tail=50)")
_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
sec("events (type!=Normal)")
_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
return nil
}
func k8sPortForward(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
}
ports := t.rest[0]
target := "svc/" + t.app
if len(t.rest) > 1 {
target = t.rest[1]
}
return kubectlStream(t.namespace(), "port-forward", target, ports)
}
func k8sDB(args []string) error {
var app, dbName, sql string
mysql := false
for i := 0; i < len(args); i++ {
a := args[i]
if a == "--" {
sql = strings.Join(args[i+1:], " ")
break
}
switch {
case a == "--mysql":
mysql = true
case a == "--db":
if i+1 < len(args) {
dbName = args[i+1]
i++
}
case strings.HasPrefix(a, "--db="):
dbName = strings.TrimPrefix(a, "--db=")
case !strings.HasPrefix(a, "-") && app == "":
app = a
}
}
if app == "" {
return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
}
p := planDBExec(app, dbName, sql, mysql)
pod := p.pod
if pod == "" && p.selector != "" {
resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
if err != nil || resolved == "" {
return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
}
pod = resolved
}
exec := []string{"exec"}
if sql == "" {
exec = append(exec, "-it") // interactive client when no SQL given
}
exec = append(exec, pod)
if p.container != "" {
exec = append(exec, "-c", p.container)
}
exec = append(exec, "--")
exec = append(exec, p.argv...)
return kubectlStream(p.ns, exec...)
}
func k8sExec(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
}
if len(t.rest) == 0 {
return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
}
a := []string{"exec"}
if t.tty {
a = append(a, "-it")
}
a = append(a, t.objectRef())
if t.container != "" {
a = append(a, "-c", t.container)
}
a = append(a, "--")
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sRmPod(args []string) error {
var pod, ns, grace string
force, job := false, false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-n" || a == "--namespace":
if i+1 < len(args) {
ns = args[i+1]
i++
}
case a == "--force":
force = true
case a == "--job":
job = true
case a == "--grace":
if i+1 < len(args) {
grace = args[i+1]
i++
}
case !strings.HasPrefix(a, "-") && pod == "":
pod = a
}
}
if pod == "" || ns == "" {
return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
}
kind := "pod"
if job {
kind = "job"
}
a := []string{"delete", kind, pod}
if grace != "" {
a = append(a, "--grace-period="+grace)
}
if force {
a = append(a, "--force")
}
return kubectlStream(ns, a...)
}
func k8sRolloutStatus(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s rollout-status <app>")
}
return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}
func k8sRestart(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s restart <app>")
}
ns := t.namespace()
if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
return err
}
return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}
func k8sProbe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
}
ns := t.namespace()
url := "http://" + t.app + "." + ns + ".svc.cluster.local"
if port := flagValue(args, "--port"); port != "" {
url += ":" + port
}
if len(t.rest) > 0 {
p := t.rest[0]
if !strings.HasPrefix(p, "/") {
p = "/" + p
}
url += p
}
return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
"--image=curlimages/curl:latest", "--",
"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}
// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
for _, a := range args {
if strings.HasPrefix(a, prefix) {
return true
}
}
return false
}

View file

@ -1,302 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"strings"
)
func memoryCommands() []Command {
return []Command{
{Path: []string{"memory", "recall"}, Tier: TierRead,
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
{Path: []string{"memory", "list"}, Tier: TierRead,
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
{Path: []string{"memory", "categories"}, Tier: TierRead,
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
{Path: []string{"memory", "tags"}, Tier: TierRead,
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
{Path: []string{"memory", "stats"}, Tier: TierRead,
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
{Path: []string{"memory", "secret"}, Tier: TierRead,
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
{Path: []string{"memory", "store"}, Tier: TierWrite,
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
{Path: []string{"memory", "update"}, Tier: TierWrite,
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
{Path: []string{"memory", "delete"}, Tier: TierWrite,
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
}
}
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
if jsonOut {
fmt.Println(string(raw))
return nil
}
var r struct {
Memories []struct {
ID int `json:"id"`
Content string `json:"content"`
Category string `json:"category"`
Tags string `json:"tags"`
Importance float64 `json:"importance"`
} `json:"memories"`
}
if err := json.Unmarshal(raw, &r); err != nil {
fmt.Println(string(raw))
return nil
}
if len(r.Memories) == 0 {
fmt.Println("(no memories)")
return nil
}
for _, m := range r.Memories {
c := strings.ReplaceAll(m.Content, "\n", " ")
if len(c) > 240 {
c = c[:240] + "…"
}
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Printf(" tags: %s\n", m.Tags)
}
}
return nil
}
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--query":
if i+1 < len(args) {
req.ExpandedQuery = args[i+1]
i++
}
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--sort":
if i+1 < len(args) {
req.SortBy = args[i+1]
i++
}
case a == "--limit":
if i+1 < len(args) {
fmt.Sscanf(args[i+1], "%d", &req.Limit)
i++
}
case a == "--json":
jsonOut = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Context = strings.Join(pos, " ")
if req.Context == "" {
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/recall", req)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memoryList(args []string) error {
q := url.Values{}
jsonOut := false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
q.Set("category", args[i+1])
i++
}
case a == "--tag":
if i+1 < len(args) {
q.Set("tag", args[i+1])
i++
}
case a == "--limit":
if i+1 < len(args) {
q.Set("limit", args[i+1])
i++
}
case a == "--json":
jsonOut = true
}
}
c, err := newMemoryClient()
if err != nil {
return err
}
path := "/api/memories"
if len(q) > 0 {
path += "?" + q.Encode()
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memorySimpleGet(path string) func([]string) error {
return func(args []string) error {
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
}
func memorySecret(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory secret <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/"+id+"/secret", nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryStore(args []string) error {
req := memStoreReq{Category: "facts", Importance: 0.5}
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--tags":
if i+1 < len(args) {
req.Tags = args[i+1]
i++
}
case a == "--keywords":
if i+1 < len(args) {
req.ExpandedKeywords = args[i+1]
i++
}
case a == "--importance":
if i+1 < len(args) {
fmt.Sscanf(args[i+1], "%f", &req.Importance)
i++
}
case a == "--sensitive":
req.ForceSensitive = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Content = strings.Join(pos, " ")
if req.Content == "" {
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories", req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryUpdate(args []string) error {
var id string
req := memUpdateReq{}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--content":
if i+1 < len(args) {
v := args[i+1]
req.Content = &v
i++
}
case a == "--tags":
if i+1 < len(args) {
v := args[i+1]
req.Tags = &v
i++
}
case a == "--keywords":
if i+1 < len(args) {
v := args[i+1]
req.ExpandedKeywords = &v
i++
}
case a == "--importance":
if i+1 < len(args) {
var f float64
fmt.Sscanf(args[i+1], "%f", &f)
req.Importance = &f
i++
}
case !strings.HasPrefix(a, "-") && id == "":
id = a
}
}
if id == "" {
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("PUT", "/api/memories/"+id, req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryDelete(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory delete <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("DELETE", "/api/memories/"+id, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}

View file

@ -1,83 +0,0 @@
package main
import (
"fmt"
"strings"
"time"
)
func netCommands() []Command {
return []Command{
{Path: []string{"net", "check"}, Tier: TierRead,
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
{Path: []string{"dns", "lookup"}, Tier: TierRead,
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
}
}
func fmtProbe(code int, d time.Duration, err error) string {
if err != nil {
return "ERR " + err.Error()
}
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
}
func netCheck(args []string) error {
host, rest := firstPositional(args)
if host == "" {
return fmt.Errorf("usage: homelab net check <host> [path]")
}
path := "/"
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
path = rest[0]
if !strings.HasPrefix(path, "/") {
path = "/" + path
}
}
u := "https://" + host + path
fmt.Printf("%s\n", u)
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
if pubIP := firstLine(pubOut); pubIP != "" {
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
} else {
fmt.Println(" external (public) no public A record")
}
// internal leg: dial the Traefik LB directly
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
return nil
}
func dnsLookup(args []string) error {
name, rest := firstPositional(args)
if name == "" {
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
}
rr := ""
if len(rest) > 0 {
rr = rest[0]
}
tech, _ := dig(name, "10.0.20.201", rr)
pub, _ := dig(name, "1.1.1.1", rr)
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
}
return nil
}
func hostOnly(h string) string { // strip any path accidentally included
return strings.SplitN(h, "/", 2)[0]
}
func oneLineList(s string) string {
s = strings.TrimSpace(s)
if s == "" {
return "(none)"
}
return strings.ReplaceAll(s, "\n", ", ")
}

View file

@ -1,197 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
"strings"
"time"
)
const (
promHost = "prometheus-query.viktorbarzin.lan"
lokiHost = "loki.viktorbarzin.lan"
)
func obsCommands() []Command {
return []Command{
{Path: []string{"metrics", "query"}, Tier: TierRead,
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
{Path: []string{"logs", "query"}, Tier: TierRead,
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
}
}
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
// passed as a single quoted argument; this also tolerates unquoted multi-token).
func queryArg(args []string, valueFlags map[string]bool) string {
var parts []string
for i := 0; i < len(args); i++ {
a := args[i]
if valueFlags[a] {
i++
continue
}
if strings.HasPrefix(a, "-") {
continue
}
parts = append(parts, a)
}
return strings.Join(parts, " ")
}
func labelStr(m map[string]string) string {
name := m["__name__"]
var kv []string
for k, v := range m {
if k != "__name__" {
kv = append(kv, k+"="+v)
}
}
sort.Strings(kv)
return name + "{" + strings.Join(kv, ",") + "}"
}
func metricsQuery(args []string) error {
q := queryArg(args, nil)
if q == "" {
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
}
v := url.Values{}
v.Set("query", q)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no series)")
return nil
}
for _, s := range r.Data.Result {
val := ""
if len(s.Value) == 2 {
val = fmt.Sprint(s.Value[1])
}
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
}
return nil
}
func metricsAlerts(args []string) error {
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
// set is exposed as the synthetic ALERTS series, queryable the normal way.
v := url.Values{}
v.Set("query", `ALERTS{alertstate="firing"}`)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no firing alerts)")
return nil
}
for _, a := range r.Data.Result {
m := a.Metric
scope := ""
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
if v := m[k]; v != "" {
scope = k + "=" + v
break
}
}
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
}
return nil
}
func logsQuery(args []string) error {
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
if q == "" {
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
}
since := flagValue(args, "--since")
if since == "" {
since = "1h"
}
dur, err := time.ParseDuration(since)
if err != nil {
return fmt.Errorf("bad --since %q: %w", since, err)
}
limit := flagValue(args, "--limit")
if limit == "" {
limit = "100"
}
end := time.Now()
v := url.Values{}
v.Set("query", q)
v.Set("limit", limit)
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Values [][]string `json:"values"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
n := 0
for _, s := range r.Data.Result {
for _, val := range s.Values {
if len(val) == 2 {
fmt.Println(val[1])
n++
}
}
}
if n == 0 {
fmt.Println("(no log lines)")
}
return nil
}

View file

@ -1,122 +0,0 @@
package main
import (
"fmt"
"os"
"os/signal"
"path/filepath"
"strings"
"sync"
"syscall"
)
func tfCommands() []Command {
return []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead,
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
{Path: []string{"tf", "validate"}, Tier: TierRead,
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
{Path: []string{"tf", "fmt"}, Tier: TierRead,
Summary: "terraform fmt a stack's files", Run: tfFmt},
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
{Path: []string{"tf", "apply"}, Tier: TierWrite,
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
}
}
// firstPositional returns the first non-flag arg and the remaining args with it removed.
func firstPositional(args []string) (string, []string) {
for i, a := range args {
if !strings.HasPrefix(a, "-") {
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
return a, rest
}
}
return "", args
}
// resolveTfStack finds the infra root (from cwd) and the stack directory named
// by the first positional arg, returning the remaining args.
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
stackName, rest = firstPositional(args)
if stackName == "" {
err = fmt.Errorf("missing <stack> argument")
return
}
cwd, e := os.Getwd()
if e != nil {
err = e
return
}
infraRoot, err = findInfraRoot(cwd)
if err != nil {
return
}
stackDir, err = resolveStack(infraRoot, stackName)
return
}
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
func tfPassthrough(verb string) func([]string) error {
return func(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
}
}
func tfFmt(args []string) error {
_, _, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
}
func tfForceUnlock(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
if len(rest) < 1 {
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
}
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
}
// tfApply applies a stack out-of-band: claim the stack on the presence board,
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
// and warn that CI applies canonically on push.
func tfApply(args []string) error {
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
label := "stack:" + stackName
fmt.Fprintf(os.Stderr,
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
}
// Release exactly once, whether we exit normally, on error, or on signal —
// sync.Once makes the defer and the signal goroutine safe to both call it.
var once sync.Once
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
defer release()
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
go func() {
<-sig
release()
os.Exit(130)
}()
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
}

View file

@ -1,27 +0,0 @@
package main
import (
"reflect"
"testing"
)
func TestFirstPositional(t *testing.T) {
cases := []struct {
args []string
wantName string
wantRest []string
}{
{[]string{"vault"}, "vault", []string{}},
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
}
for _, c := range cases {
gotName, gotRest := firstPositional(c.args)
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
c.args, gotName, gotRest, c.wantName, c.wantRest)
}
}
}

View file

@ -1,77 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
)
func usageCommands() []Command {
return []Command{
{Path: []string{"usage", "top"}, Tier: TierRead,
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
}
}
// usageQuery builds the LogQL metric query that counts invocations per verb.
func usageQuery(since, user string) string {
sel := `job="` + usageJob + `"`
if user != "" {
sel += `, user="` + user + `"`
}
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
func usageTop(args []string) error {
since := flagValue(args, "--since")
if since == "" {
since = "30d"
}
v := url.Values{}
v.Set("query", usageQuery(since, flagValue(args, "--user")))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
type row struct {
verb string
n int
}
var rows []row
for _, s := range r.Data.Result {
n := 0
if len(s.Value) == 2 {
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
n = int(f)
}
}
rows = append(rows, row{s.Metric["verb"], n})
}
if len(rows) == 0 {
fmt.Println("(no usage recorded yet)")
return nil
}
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
for _, r := range rows {
fmt.Printf("%6d %s\n", r.n, r.verb)
}
return nil
}

View file

@ -1,663 +0,0 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
func vaultCommands() []Command {
return []Command{
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
}
// vaultHelp is shown for bare `homelab vault`.
func vaultHelp() string {
return `homelab vault read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
homelab vault setup one-time: store your master password + API key in your Vault path
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] list your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
TTY clipboard (auto-clears); piped stdout
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
Creds live only in your own Vault path; the admin never sees them. Identity is
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
(note: anything running as your user can decrypt your vault the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), err
}
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), err
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwStatusArgs() []string { return []string{"status"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// xterm proper supports it only when the program is a known-good emulator.
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
// never blocks or fails the command). Goes to syslog so it ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
}
type getOpts struct {
name string
field string
json bool
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
}
if !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
// getValue opens a session and fetches one field. Pure of I/O side effects
// besides the runner, so it is unit-tested with a fake runner.
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
// secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
// clearClipboardAfter spawns a detached background clear so the secret doesn't
// linger in the clipboard. Best-effort.
func clearClipboardAfter(seconds int) {
exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear())).Start()
}
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultPatchPublicArgs(user, email, clientID string) []string {
return []string{"kv", "patch", vwCredsPath(user),
"vaultwarden_email=" + email,
"vaultwarden_client_id=" + clientID,
}
}
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
// on stdin by realRunnerStdin.
func vaultPatchSecretArgs(user, key string) []string {
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
}
// writeCreds stores all four fields in the user's Vault path. The two real
// secrets (master password, API client_secret) go via stdin — never argv.
func writeCreds(user string, c vwCreds) error {
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
func promptNoEcho(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
exec.Command("stty", "-echo").Run()
defer func() { exec.Command("stty", "echo").Run(); fmt.Fprintln(os.Stderr) }()
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator — a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
fmt.Printf("{%q:%q}\n", o.field, val)
return nil
}
emitSecret(val)
return nil
}

View file

@ -1,368 +0,0 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"reflect"
"strings"
"testing"
)
func TestVaultCommandsRegistered(t *testing.T) {
want := map[string]Tier{
"vault setup": TierWrite,
"vault status": TierRead,
"vault list": TierRead,
"vault get": TierRead,
"vault search": TierRead,
"vault code": TierRead,
"vault lock": TierWrite,
}
got := map[string]Tier{}
for _, c := range vaultCommands() {
got[c.name()] = c.Tier
}
for name, tier := range want {
if got[name] != tier {
t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
}
}
}
func TestVaultGroupInRegistry(t *testing.T) {
if !isCommandGroup(buildRegistry(), "vault") {
t.Fatal("`vault` group not wired into buildRegistry()")
}
}
func TestVaultCredsPath(t *testing.T) {
if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
t.Fatalf("vwCredsPath = %q", got)
}
}
func TestBwAppDataDir(t *testing.T) {
if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
t.Fatalf("bwAppDataDir = %q", got)
}
}
// fakeRunner records calls and returns canned stdout/err keyed by argv[0]+first arg.
type fakeRunner struct {
calls [][]string
out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched
err map[string]error
lastEnv []string
}
func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
f.calls = append(f.calls, append([]string{name}, argv...))
f.lastEnv = envv
key := name + " " + strings.Join(argv, " ")
for k, v := range f.out {
if strings.HasPrefix(key, k) {
return v, f.err[k]
}
}
return "", f.err[key]
}
func TestLoadCredsReadsFourFields(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek",
}}
c, err := loadCreds(f.run, "emo")
if err != nil {
t.Fatalf("loadCreds: %v", err)
}
want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
if !reflect.DeepEqual(c, want) {
t.Fatalf("loadCreds = %+v want %+v", c, want)
}
}
func TestLoadCredsUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // every field empty
if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
t.Fatalf("want 'not configured' error, got %v", err)
}
}
func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
joined := strings.Join(env, "\n")
for _, want := range []string{
"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
} {
if !strings.Contains(joined, want) {
t.Errorf("bwSecretEnv missing %q", want)
}
}
if strings.Contains(joined, "PATH=") == false {
t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
}
}
func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
argv := bwGetArgs("password", "github")
for _, a := range argv {
if strings.Contains(a, "SESSION") || a == "--session" {
t.Fatalf("session must travel via env, not argv: %v", argv)
}
}
if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
t.Fatalf("bwGetArgs = %v", argv)
}
}
func TestBwListArgs(t *testing.T) {
if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
t.Fatalf("bwListArgs('') = %v", got)
}
if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
t.Fatalf("bwListArgs('git') = %v", got)
}
}
func TestBwUnlockReturnsSession(t *testing.T) {
f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
sess, err := bwUnlock(f.run, env)
if err != nil || sess != "THE-SESSION-KEY" {
t.Fatalf("bwUnlock = %q, %v", sess, err)
}
// argv must use --passwordenv + --raw, never the password literal
last := f.calls[len(f.calls)-1]
if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
t.Fatalf("unlock argv = %v", last)
}
}
func TestReturnMode(t *testing.T) {
if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
t.Fatal("returnMode wrong")
}
}
func TestOSC52Encode(t *testing.T) {
got := osc52("secret")
want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
if got != want {
t.Fatalf("osc52 = %q want %q", got, want)
}
if osc52clear() != "\x1b]52;c;\a" {
t.Fatalf("osc52clear wrong: %q", osc52clear())
}
}
func TestTerminalAllowed(t *testing.T) {
allow := []struct{ term, prog string }{
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
}
for _, c := range allow {
if !terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
}
}
deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
for _, c := range deny {
if terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
}
}
}
func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
if !strings.Contains(line, must) {
t.Errorf("op-log missing %q: %s", must, line)
}
}
for _, mustNot := range []string{"Chase", "password", "secret"} {
if strings.Contains(line, mustNot) {
t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
}
}
}
func TestLockPath(t *testing.T) {
if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
t.Fatalf("vaultLockPath = %q", got)
}
}
func TestParseGetArgs(t *testing.T) {
o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
if err != nil || o.name != "github" || o.field != "username" || !o.json {
t.Fatalf("parseGetArgs = %+v err=%v", o, err)
}
d, _ := parseGetArgs([]string{"github"})
if d.field != "password" || d.json {
t.Fatalf("defaults wrong: %+v", d)
}
if _, err := parseGetArgs([]string{}); err == nil {
t.Fatal("get with no name must error")
}
if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
t.Fatal("invalid --field must error")
}
}
func TestListNamesParsing(t *testing.T) {
// bw list items returns JSON; listNames extracts name + id only.
js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
names := listNames(js)
if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
t.Fatalf("listNames = %v", names)
}
}
func TestStatusSummaryUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // no creds
s := statusSummary(f.run, "emo", "1001")
if !strings.Contains(s, "not configured") {
t.Fatalf("status = %q", s)
}
}
func TestVaultPatchPublicArgs(t *testing.T) {
got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchPublicArgs = %v", got)
}
for _, a := range got {
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
t.Fatalf("secret key leaked into public argv: %v", got)
}
}
}
func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
got := vaultPatchSecretArgs("emo", key)
want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
}
if got[len(got)-1] != key+"=-" {
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
}
}
}
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
// value may appear in any command's argv — secrets travel via env/stdin only.
func TestNoSecretInArgvAcrossFlow(t *testing.T) {
uid := fmt.Sprintf("%d", os.Getuid())
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESSIONXYZ",
"bw get password github": "p@ss",
}}
if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
t.Fatalf("getValue: %v", err)
}
for _, call := range f.calls {
for _, arg := range call {
for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
if strings.Contains(arg, s) {
t.Errorf("secret %q leaked into argv: %v", s, call)
}
}
}
}
if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
}
}
func TestClipboardDecision(t *testing.T) {
cases := []struct {
stdoutTTY, stderrTTY bool
term, prog, want string
}{
{false, true, "xterm-kitty", "", "stdout"},
{true, true, "xterm-kitty", "", "clipboard"},
{true, true, "dumb", "", "refuse"},
{true, false, "xterm-kitty", "", "refuse"},
}
for _, c := range cases {
if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want)
}
}
}
func TestJSONToStdoutOK(t *testing.T) {
if jsonToStdoutOK(true) {
t.Error("must refuse JSON secret on a terminal")
}
if !jsonToStdoutOK(false) {
t.Error("must allow JSON when piped")
}
}
func TestBwNeedsLogin(t *testing.T) {
if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
t.Error("unauthenticated → needs login")
}
if bwNeedsLogin(`{"status":"locked"}`) {
t.Error("locked → no login (just unlock)")
}
if bwNeedsLogin(`{"status":"unlocked"}`) {
t.Error("unlocked → no login")
}
if !bwNeedsLogin(`not json`) {
t.Error("unparseable → attempt login")
}
}
func TestVaultHelpMentionsSecurity(t *testing.T) {
h := vaultHelp()
for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
if !strings.Contains(h, want) {
t.Errorf("vault help missing %q", want)
}
}
}
func TestVaultBareGroupRegistered(t *testing.T) {
for _, c := range vaultCommands() {
if len(c.Path) == 1 && c.Path[0] == "vault" {
return
}
}
t.Fatal("bare `vault` help command not registered")
}
// getValue is the testable core: given a runner + opts, returns the secret value.
func TestGetValueFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw get password github": "p@ss",
}}
// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
uid := fmt.Sprintf("%d", os.Getuid())
val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
if err != nil || val != "p@ss" {
t.Fatalf("getValue = %q, %v", val, err)
}
}

View file

@ -1,212 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
func workCommands() []Command {
return []Command{
{Path: []string{"work", "start"}, Tier: TierWrite,
Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
{Path: []string{"work", "land"}, Tier: TierWrite,
Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
{Path: []string{"work", "clean"}, Tier: TierWrite,
Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
}
}
// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
for i, a := range args {
if a == name && i+1 < len(args) {
return args[i+1]
}
if strings.HasPrefix(a, name+"=") {
return strings.TrimPrefix(a, name+"=")
}
}
return ""
}
func remotesOrEmpty(repoRoot string) []string {
r, _ := gitRemotes(repoRoot)
return r
}
// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work start <topic>")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
branch := currentUser() + "/" + topic
wtRel := filepath.Join(".worktrees", topic)
ensureWorktreesIgnored(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch %s failed: %w", remote, err)
}
if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
return fmt.Errorf("worktree add failed: %w", err)
}
wtPath := filepath.Join(repoRoot, wtRel)
fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
return nil
}
// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
verifyCmd := flagValue(args, "--verify-cmd")
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
if err != nil {
return err
}
if branch == "master" || branch == "main" {
return fmt.Errorf("refusing to land: already on %s", branch)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch failed: %w", err)
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
}
if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
return fmt.Errorf("not landing: %w", err)
}
if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
return landFallback(repoRoot, flags, remote, branch, err)
}
fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
if containsArg(args, "--no-ci-watch") {
fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
return nil
}
landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
if err := ciWatch([]string{landed}); err != nil {
return fmt.Errorf("landed, but CI did not go green: %w", err)
}
return nil
}
// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
if verifyCmd != "" {
fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
}
if isFile(filepath.Join(repoRoot, "go.mod")) {
fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
return runStreamingIn(repoRoot, "go", "test", "./...")
}
if allowSkip {
fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
return nil
}
return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}
// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
var lastErr error
for i := 0; i < attempts; i++ {
if err := gitStream(repoRoot, flags, "push", remote, "HEAD:master"); err == nil {
return nil
} else {
lastErr = err
}
if i < attempts-1 {
fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return err
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return err
}
}
}
return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
return fmt.Errorf("fallback branch push also failed: %w", err)
}
fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
return nil
}
// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
flags := cryptFlagsFor(repoRoot)
wtRel := filepath.Join(".worktrees", topic)
branch := currentUser() + "/" + topic
if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
}
if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
}
fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
return nil
}
// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
return
}
gi := filepath.Join(repoRoot, ".gitignore")
f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
if err != nil {
return
}
defer f.Close()
if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
}
}

View file

@ -1,32 +0,0 @@
package main
import "testing"
func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
dir := t.TempDir() // no go.mod, no verify cmd
if err := runVerify(dir, "", false); err == nil {
t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
}
if err := runVerify(dir, "", true); err != nil {
t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
}
}
func TestFlagValue(t *testing.T) {
cases := []struct {
args []string
name string
want string
}{
{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
{[]string{"topic"}, "--verify-cmd", ""},
{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
}
for _, c := range cases {
if got := flagValue(c.args, c.name); got != c.want {
t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
}
}
}

View file

@ -1,104 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"sort"
"strings"
)
// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string
const (
TierRead Tier = "read"
TierWrite Tier = "write"
)
// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
Path []string
Tier Tier
Summary string
Run func(args []string) error
}
// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
best := -1
bestLen := 0
for i, c := range reg {
if len(c.Path) > len(args) {
continue
}
match := true
for j, p := range c.Path {
if args[j] != p {
match = false
break
}
}
if match && len(c.Path) >= bestLen {
best = i
bestLen = len(c.Path)
}
}
if best < 0 {
return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
}
matched := reg[best]
runErr := matched.Run(args[bestLen:])
emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
return runErr
}
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }
// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
out := make([]Command, len(reg))
copy(out, reg)
sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
return out
}
// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
cmds := sortedByName(reg)
width := 0
for _, c := range cmds {
if n := len(c.name()); n > width {
width = n
}
}
var b strings.Builder
for _, c := range cmds {
fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
}
return b.String()
}
// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
type entry struct {
Command string `json:"command"`
Tier string `json:"tier"`
Summary string `json:"summary"`
}
entries := make([]entry, 0, len(reg))
for _, c := range sortedByName(reg) {
entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
}
b, err := json.MarshalIndent(entries, "", " ")
if err != nil {
return "", err
}
return string(b), nil
}

View file

@ -1,73 +0,0 @@
package main
import (
"encoding/json"
"reflect"
"strings"
"testing"
)
// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
var gotArgs []string
ran := ""
reg := []Command{
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
}
if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
t.Fatalf("dispatch returned error: %v", err)
}
if ran != "tf plan" {
t.Fatalf("routed to %q, want %q", ran, "tf plan")
}
if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
t.Fatalf("command got args %v, want %v", gotArgs, want)
}
}
func TestDispatchUnknownCommandErrors(t *testing.T) {
reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
if err := dispatch(reg, []string{"bogus"}); err == nil {
t.Fatal("expected error for unknown command, got nil")
}
}
// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
reg := []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
}
out := manifestText(reg)
for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
if !strings.Contains(out, want) {
t.Errorf("manifest text missing %q\n---\n%s", want, out)
}
}
// sorted: claim (c) must appear before tf plan (t)
if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
t.Errorf("manifest not sorted by path:\n%s", out)
}
}
func TestManifestJSONIsParsableAndTagged(t *testing.T) {
reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
out, err := manifestJSON(reg)
if err != nil {
t.Fatalf("manifestJSON error: %v", err)
}
var got []map[string]string
if err := json.Unmarshal([]byte(out), &got); err != nil {
t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
}
if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
t.Fatalf("unexpected manifest JSON: %v", got)
}
}

View file

@ -1,98 +0,0 @@
package main
import (
"fmt"
"strings"
)
// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"
// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
var reg []Command
reg = append(reg, claimCommands()...)
reg = append(reg, tfCommands()...)
reg = append(reg, workCommands()...)
reg = append(reg, k8sCommands()...)
reg = append(reg, memoryCommands()...)
reg = append(reg, ciCommands()...)
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}
// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
if len(args) == 0 {
fmt.Print(usage())
return true, nil
}
switch args[0] {
case "help", "-h", "--help":
fmt.Print(usage())
return true, nil
case "version", "--version":
fmt.Println("homelab " + version)
return true, nil
case "manifest":
reg := buildRegistry()
if containsArg(args[1:], "--json") {
out, err := manifestJSON(reg)
if err != nil {
return true, err
}
fmt.Println(out)
return true, nil
}
fmt.Print(manifestText(reg))
return true, nil
}
if strings.HasPrefix(args[0], "-") {
return false, nil
}
reg := buildRegistry()
if !isCommandGroup(reg, args[0]) {
return false, nil
}
return true, dispatch(reg, args)
}
func isCommandGroup(reg []Command, group string) bool {
for _, c := range reg {
if len(c.Path) > 0 && c.Path[0] == group {
return true
}
}
return false
}
func containsArg(args []string, want string) bool {
for _, a := range args {
if a == want {
return true
}
}
return false
}
func usage() string {
var b strings.Builder
fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
if line != "" {
b.WriteString(" " + line + "\n")
}
}
b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
b.WriteString(" version print version\n")
b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
return b.String()
}

View file

@ -1,138 +0,0 @@
package main
import (
"fmt"
"os/exec"
"strings"
)
// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
func kubectlBase(ns string, args ...string) []string {
var full []string
if ns != "" {
full = append(full, "-n", ns)
}
return append(full, args...)
}
func kubectlStream(ns string, args ...string) error {
return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}
// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
return strings.TrimSpace(string(out)), err
}
// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
app string
ns string
pod string
container string
selector string
tty bool
rest []string // passthrough flags and, after `--`, the exec command
}
// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
t := k8sTarget{}
i := 0
take := func() string {
if i+1 < len(args) {
i++
return args[i]
}
return ""
}
for i = 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
t.rest = append(t.rest, args[i+1:]...)
return t
case a == "-n" || a == "--namespace":
t.ns = take()
case strings.HasPrefix(a, "--namespace="):
t.ns = strings.TrimPrefix(a, "--namespace=")
case a == "--pod":
t.pod = take()
case strings.HasPrefix(a, "--pod="):
t.pod = strings.TrimPrefix(a, "--pod=")
case a == "-c" || a == "--container":
t.container = take()
case strings.HasPrefix(a, "--container="):
t.container = strings.TrimPrefix(a, "--container=")
case a == "-l" || a == "--selector":
t.selector = take()
case strings.HasPrefix(a, "--selector="):
t.selector = strings.TrimPrefix(a, "--selector=")
case a == "--tty" || a == "-it" || a == "-ti":
t.tty = true
case !strings.HasPrefix(a, "-") && t.app == "":
t.app = a
default:
t.rest = append(t.rest, a)
}
}
return t
}
// namespace defaults to the app name (most namespaces hold exactly one app).
func (t k8sTarget) namespace() string {
if t.ns != "" {
return t.ns
}
return t.app
}
// objectRef is the kubectl object for logs/exec: an explicit pod, else
// deploy/<app> (kubectl resolves a pod from the Deployment).
func (t k8sTarget) objectRef() string {
if t.pod != "" {
return "pod/" + t.pod
}
return "deploy/" + t.app
}
// --- database access (the dbaas exec pattern) ---
type dbPlan struct {
ns string
pod string // explicit pod (e.g. mysql-standalone-0)
selector string // resolve the pod by this label when pod == "" (CNPG primary)
container string // "" = default container
argv []string // command + args to run inside the pod
}
// planDBExec builds the in-pod command to run sql against app's database.
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
// Service, not an exec target), psql -U postgres -d <db>.
// MySQL: mysql-standalone-0, password from env (never on the command line).
// dbName defaults to app. sql empty => interactive client.
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
if dbName == "" {
dbName = app
}
if mysql {
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
if sql != "" {
inner += " -e " + shellQuote(sql)
}
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
}
argv := []string{"psql", "-U", "postgres", "-d", dbName}
if sql != "" {
argv = append(argv, "-tAc", sql)
}
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
}
// shellQuote single-quotes s for safe embedding in a bash -c string.
func shellQuote(s string) string {
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

View file

@ -1,65 +0,0 @@
package main
import (
"reflect"
"strings"
"testing"
)
func TestParseK8sTarget(t *testing.T) {
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
}
}
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
t.Errorf("namespace() = %q, want immich", ns)
}
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
t.Errorf("namespace() = %q, want dbaas", ns)
}
}
func TestK8sTargetObjectRef(t *testing.T) {
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
t.Errorf("objectRef() = %q, want deploy/tripit", r)
}
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
}
}
func TestPlanDBExecPostgresDefault(t *testing.T) {
p := planDBExec("fire-planner", "", "SELECT 1", false)
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
// label rather than naming an (un-exec-able) Service.
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
t.Fatalf("unexpected pg target: %+v", p)
}
// db name defaults to the app; SQL passed via -tAc
joined := strings.Join(p.argv, " ")
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
t.Fatalf("pg argv missing db/sql: %v", p.argv)
}
}
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
if p.pod != "mysql-standalone-0" {
t.Fatalf("unexpected mysql pod: %+v", p)
}
inner := strings.Join(p.argv, " ")
// password must come from the env var, never inline
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
}
}
func TestShellQuoteEscapes(t *testing.T) {
if got := shellQuote("a'b"); got != `'a'\''b'` {
t.Fatalf("shellQuote = %q", got)
}
}

View file

@ -26,16 +26,8 @@ var (
) )
func main() { func main() {
// homelab verb surface (work/tf/claim/...) is tried first; if the args are err := run()
// not a homelab verb, fall through to the legacy webhook -use-case path.
if handled, err := dispatchTop(os.Args[1:]); handled {
if err != nil { if err != nil {
fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
os.Exit(1)
}
return
}
if err := run(); err != nil {
glog.Errorf("run failed: %s", err.Error()) glog.Errorf("run failed: %s", err.Error())
os.Exit(255) os.Exit(255)
} }

View file

@ -1,103 +0,0 @@
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
)
// defaultMemoryURL is used when no env override is present (agents normally have
// CLAUDE_MEMORY_API_URL set by the memory hooks).
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
type memoryClient struct {
base string
key string
http *http.Client
}
func firstEnv(keys ...string) string {
for _, k := range keys {
if v := os.Getenv(k); v != "" {
return v
}
}
return ""
}
func resolveMemoryBase() string {
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
return strings.TrimRight(b, "/")
}
return defaultMemoryURL
}
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
// the MCP wraps), so it works even when the MCP frontend is down.
func newMemoryClient() (*memoryClient, error) {
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
if key == "" {
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
}
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
}
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
var r io.Reader
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return nil, err
}
r = bytes.NewReader(b)
}
req, err := http.NewRequest(method, c.base+path, r)
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+c.key)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
out, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
}
return out, nil
}
// Request bodies mirror src/claude_memory/api/models.py.
type memRecallReq struct {
Context string `json:"context"`
ExpandedQuery string `json:"expanded_query,omitempty"`
Category string `json:"category,omitempty"`
SortBy string `json:"sort_by,omitempty"`
Limit int `json:"limit,omitempty"`
}
type memStoreReq struct {
Content string `json:"content"`
Category string `json:"category,omitempty"`
Tags string `json:"tags,omitempty"`
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
Importance float64 `json:"importance"`
ForceSensitive bool `json:"force_sensitive,omitempty"`
}
type memUpdateReq struct {
Content *string `json:"content,omitempty"`
Tags *string `json:"tags,omitempty"`
Importance *float64 `json:"importance,omitempty"`
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
}

View file

@ -1,51 +0,0 @@
package main
import (
"encoding/json"
"os"
"strings"
"testing"
)
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
os.Unsetenv("CLAUDE_MEMORY_API_URL")
os.Unsetenv("MEMORY_API_URL")
if got := resolveMemoryBase(); got != defaultMemoryURL {
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
}
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
if got := resolveMemoryBase(); got != "https://m.example" {
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
}
}
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
s := string(b)
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
t.Fatalf("memStoreReq JSON missing fields: %s", s)
}
}
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
tags := "a,b"
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
s := string(b)
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
t.Fatalf("unset update fields must be omitted: %s", s)
}
if !strings.Contains(s, `"tags":"a,b"`) {
t.Fatalf("set field missing: %s", s)
}
}
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
b, _ := json.Marshal(memRecallReq{Context: "hi"})
s := string(b)
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
t.Fatalf("empty optionals must be omitted: %s", s)
}
}

View file

@ -1,58 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
func presenceScript() string {
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
return p
}
home, err := os.UserHomeDir()
if err != nil {
return "presence"
}
return filepath.Join(home, "code", "scripts", "presence")
}
// validateLabel checks a presence label is <kind>:<name> with a known kind.
func validateLabel(label string) error {
parts := strings.SplitN(label, ":", 2)
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
}
for _, k := range validPresenceKinds {
if parts[0] == k {
return nil
}
}
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
}
// presenceClaim claims label on the board with a purpose note.
func presenceClaim(label, purpose string) error {
if err := validateLabel(label); err != nil {
return err
}
args := []string{"claim", label}
if purpose != "" {
args = append(args, "--purpose", purpose)
}
return runStreaming(presenceScript(), args...)
}
// presenceRelease releases a prior claim on label.
func presenceRelease(label string) error {
if err := validateLabel(label); err != nil {
return err
}
return runStreaming(presenceScript(), "release", label)
}

View file

@ -1,24 +0,0 @@
package main
import "testing"
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
good := []string{
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
}
for _, l := range good {
if err := validateLabel(l); err != nil {
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
}
}
}
func TestValidateLabelRejectsBadLabels(t *testing.T) {
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
for _, l := range bad {
if err := validateLabel(l); err == nil {
t.Errorf("validateLabel(%q) = nil, want error", l)
}
}
}

View file

@ -1,76 +0,0 @@
package main
import (
"context"
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os/exec"
"strings"
"time"
)
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
const internalLBIP = "10.0.20.203"
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
// host:443:ip`. TLS verification is skipped (these are reachability/observability
// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
d := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if i := strings.LastIndex(addr, ":"); i >= 0 {
addr = ip + addr[i:]
}
return d.DialContext(ctx, network, addr)
},
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
return &http.Client{Timeout: timeout, Transport: tr}
}
// probeURL issues a GET and returns status code + elapsed time.
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
start := time.Now()
resp, err := c.Get(rawurl)
dur := time.Since(start)
if err != nil {
return 0, dur, err
}
resp.Body.Close()
return resp.StatusCode, dur, nil
}
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
u := "https://" + host + path
if len(q) > 0 {
u += "?" + q.Encode()
}
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return body, nil
}
// dig runs `dig +short` against a resolver, optionally for a record type.
func dig(name, server, rrtype string) (string, error) {
args := []string{"+short", "+time=3", "+tries=1"}
if rrtype != "" {
args = append(args, rrtype)
}
args = append(args, name, "@"+server)
out, err := exec.Command("dig", args...).Output()
return strings.TrimSpace(string(out)), err
}

View file

@ -1,49 +0,0 @@
package main
import "testing"
func TestQueryArg(t *testing.T) {
if got := queryArg([]string{"up"}, nil); got != "up" {
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
}
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
t.Errorf(`--json should be dropped, got %q`, got)
}
// single quoted PromQL arrives as one token
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
t.Errorf(`quoted query mangled: %q`, got)
}
// value-flags and their values are skipped, query survives
vf := map[string]bool{"--since": true, "--limit": true}
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
t.Errorf(`value-flag skipping failed: %q`, got)
}
}
func TestLabelStr(t *testing.T) {
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
t.Errorf("labelStr = %q", got)
}
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
t.Errorf("labelStr (no __name__) = %q", got)
}
}
func TestOneLineList(t *testing.T) {
if got := oneLineList(" "); got != "(none)" {
t.Errorf("empty = %q, want (none)", got)
}
if got := oneLineList("a\nb"); got != "a, b" {
t.Errorf("multi = %q, want 'a, b'", got)
}
}
func TestHostOnly(t *testing.T) {
if got := hostOnly("foo.me/path"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
if got := hostOnly("foo.me"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
}

View file

@ -1,101 +0,0 @@
package main
import (
"os"
"os/exec"
"os/user"
"path/filepath"
"strings"
)
// preferRemote picks the canonical remote: forgejo if present, else origin,
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
func preferRemote(remotes []string) string {
has := map[string]bool{}
for _, r := range remotes {
has[r] = true
}
switch {
case has["forgejo"]:
return "forgejo"
case has["origin"]:
return "origin"
case len(remotes) > 0:
return remotes[0]
default:
return ""
}
}
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
func hasGitCryptAttr(gitattributes string) bool {
return strings.Contains(gitattributes, "filter=git-crypt")
}
// gitCryptFlags are the per-command flags that disable smudge/clean so git
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
func gitCryptFlags() []string {
return []string{
"-c", "filter.git-crypt.smudge=cat",
"-c", "filter.git-crypt.clean=cat",
"-c", "filter.git-crypt.required=false",
}
}
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
func gitOutput(dir string, args ...string) (string, error) {
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
out, err := cmd.Output()
return strings.TrimSpace(string(out)), err
}
func gitRepoRoot(dir string) (string, error) {
return gitOutput(dir, "rev-parse", "--show-toplevel")
}
// gitRemotes lists configured remote names for the repo at dir.
func gitRemotes(dir string) ([]string, error) {
out, err := gitOutput(dir, "remote")
if err != nil {
return nil, err
}
if out == "" {
return nil, nil
}
return strings.Split(out, "\n"), nil
}
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
func isGitCryptRepo(repoRoot string) bool {
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
if err != nil {
return false
}
return hasGitCryptAttr(string(b))
}
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
// else nil. These are injected per-command and never persisted.
func cryptFlagsFor(repoRoot string) []string {
if isGitCryptRepo(repoRoot) {
return gitCryptFlags()
}
return nil
}
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
return runStreamingIn("", "git", full...)
}
// currentUser returns the OS username for branch naming (<user>/<topic>).
func currentUser() string {
if u := os.Getenv("USER"); u != "" {
return u
}
if u, err := user.Current(); err == nil && u.Username != "" {
return u.Username
}
return "user"
}

View file

@ -1,37 +0,0 @@
package main
import "testing"
func TestPreferRemote(t *testing.T) {
cases := []struct {
in []string
want string
}{
{[]string{"origin", "forgejo"}, "forgejo"},
{[]string{"forgejo"}, "forgejo"},
{[]string{"origin"}, "origin"},
{[]string{"upstream"}, "upstream"},
{nil, ""},
}
for _, c := range cases {
if got := preferRemote(c.in); got != c.want {
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
}
}
}
func TestHasGitCryptAttr(t *testing.T) {
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
t.Error("expected git-crypt detected")
}
if hasGitCryptAttr("*.md text\n*.png binary") {
t.Error("expected no git-crypt")
}
}
func TestGitCryptFlagsShape(t *testing.T) {
f := gitCryptFlags()
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
t.Fatalf("unexpected git-crypt flags: %v", f)
}
}

View file

@ -1,23 +0,0 @@
package main
import (
"os"
"os/exec"
)
// runStreaming executes name with args, wiring std streams to this process so
// the caller sees live output, and returns the command's error (non-nil on
// non-zero exit — preserved so homelab's own exit code reflects the child's).
func runStreaming(name string, args ...string) error {
return runStreamingIn("", name, args...)
}
// runStreamingIn is runStreaming with a working directory (empty = inherit).
func runStreamingIn(dir, name string, args ...string) error {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

View file

@ -1,54 +0,0 @@
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// findInfraRoot walks up from start to the infra repo root — the directory
// holding both terragrunt.hcl and a stacks/ directory.
func findInfraRoot(start string) (string, error) {
dir := start
for {
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
return dir, nil
}
parent := filepath.Dir(dir)
if parent == dir {
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
}
dir = parent
}
}
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
func resolveStack(infraRoot, name string) (string, error) {
dir := filepath.Join(infraRoot, "stacks", name)
if isDir(dir) {
return dir, nil
}
avail := listStacks(infraRoot)
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
}
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
func listStacks(infraRoot string) []string {
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
if err != nil {
return nil
}
var out []string
for _, e := range entries {
if e.IsDir() {
out = append(out, e.Name())
}
}
sort.Strings(out)
return out
}
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }

View file

@ -1,52 +0,0 @@
package main
import (
"os"
"path/filepath"
"testing"
)
func newInfraTree(t *testing.T, stacks ...string) string {
t.Helper()
root := t.TempDir()
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
t.Fatal(err)
}
for _, s := range stacks {
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
t.Fatal(err)
}
}
return root
}
func TestFindInfraRootWalksUp(t *testing.T) {
root := newInfraTree(t, "vault")
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
if err != nil {
t.Fatalf("findInfraRoot error: %v", err)
}
if got != root {
t.Fatalf("findInfraRoot = %q, want %q", got, root)
}
}
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
if _, err := findInfraRoot(t.TempDir()); err == nil {
t.Fatal("expected error outside an infra checkout")
}
}
func TestResolveStack(t *testing.T) {
root := newInfraTree(t, "vault", "monitoring")
dir, err := resolveStack(root, "vault")
if err != nil {
t.Fatalf("resolveStack error: %v", err)
}
if want := filepath.Join(root, "stacks", "vault"); dir != want {
t.Fatalf("resolveStack = %q, want %q", dir, want)
}
if _, err := resolveStack(root, "nonesuch"); err == nil {
t.Fatal("expected error for unknown stack")
}
}

View file

@ -1,62 +0,0 @@
package main
import (
"bytes"
"encoding/json"
"net/http"
"os"
"strconv"
"strings"
"time"
)
// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"
// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
switch os.Getenv("HOMELAB_TELEMETRY") {
case "0", "off", "false", "no":
return
}
if verb == "" || strings.HasPrefix(verb, "usage") {
return // don't self-record the analytics reader
}
exit := 0
if runErr != nil {
exit = 1
}
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
Values: [][2]string{{
strconv.FormatInt(time.Now().UnixNano(), 10),
"exit=" + strconv.Itoa(exit) + " ver=" + version,
}},
}}})
if err != nil {
return
}
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
if err != nil {
return
}
req.Header.Set("Content-Type", "application/json")
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
if err != nil {
return
}
resp.Body.Close()
}
type lokiPush struct {
Streams []lokiStream `json:"streams"`
}
type lokiStream struct {
Stream map[string]string `json:"stream"`
Values [][2]string `json:"values"`
}

View file

@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
if err != nil { if err != nil {
return errors.Wrapf(err, "Error reading response") return errors.Wrapf(err, "Error reading response")
} }
glog.Infof("Response: %s", string(responseBody)) glog.Infof("Response:", string(responseBody))
return nil return nil
} }

View file

@ -1,18 +0,0 @@
package main
import (
"strings"
"testing"
)
func TestUsageQuery(t *testing.T) {
got := usageQuery("30d", "")
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
if got != want {
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
}
withUser := usageQuery("7d", "emo")
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
}
}

View file

@ -1,191 +0,0 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
)
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
const (
wpHost = "ci.viktorbarzin.me"
wpLBIP = "10.0.20.203"
)
type wpClient struct {
base string
token string
http *http.Client
}
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
func wpToken() string {
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
return t
}
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
if err != nil {
return ""
}
return strings.TrimSpace(string(out))
}
func newWPClient() (*wpClient, error) {
tok := wpToken()
if tok == "" {
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
}
ip := firstEnv("HOMELAB_WP_IP")
if ip == "" {
ip = wpLBIP
}
dialer := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if strings.HasPrefix(addr, wpHost+":") {
addr = ip + addr[strings.LastIndex(addr, ":"):]
}
return dialer.DialContext(ctx, network, addr)
},
}
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
}
// getJSON GETs path into v, retrying the transient empty/5xx responses the
// Woodpecker API intermittently returns under load.
func (c *wpClient) getJSON(path string, v interface{}) error {
var lastErr error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(2 * time.Second)
}
req, _ := http.NewRequest("GET", c.base+path, nil)
req.Header.Set("Authorization", "Bearer "+c.token)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
continue
}
if resp.StatusCode >= 300 {
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return json.Unmarshal(body, v)
}
return lastErr
}
type wpPipeline struct {
Number int `json:"number"`
Status string `json:"status"`
Event string `json:"event"`
Commit string `json:"commit"`
Message string `json:"message"`
}
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
var ps []wpPipeline
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
return ps, err
}
// findPipeline returns the pipeline for commit (prefix match), or the latest when
// commit is empty.
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
ps, err := c.recentPipelines(repoID, 25)
if err != nil {
return wpPipeline{}, err
}
if len(ps) == 0 {
return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
}
if commit == "" {
return ps[0], nil
}
for _, p := range ps {
if strings.HasPrefix(p.Commit, commit) {
return p, nil
}
}
return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
}
func (c *wpClient) repoID() (int, error) {
owner, repo, err := repoOwnerName()
if err != nil {
return 0, err
}
var r struct {
ID int `json:"id"`
}
if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
return 0, err
}
if r.ID == 0 {
return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
}
return r.ID, nil
}
// repoOwnerName derives <owner>/<repo> from the cwd git remote.
func repoOwnerName() (string, string, error) {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return "", "", fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(root))
url, err := gitOutput(root, "remote", "get-url", remote)
if err != nil {
return "", "", err
}
return parseOwnerRepo(url)
}
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
func parseOwnerRepo(url string) (string, string, error) {
u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
u = strings.TrimSuffix(u, "/")
if i := strings.Index(u, "://"); i >= 0 {
u = u[i+3:]
}
u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
parts := strings.Split(u, "/")
if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
}
return parts[len(parts)-2], parts[len(parts)-1], nil
}
func isTerminalStatus(s string) bool {
switch s {
case "success", "failure", "error", "killed", "declined", "blocked":
return true
}
return false
}
func isFailureStatus(s string) bool {
return s == "failure" || s == "error" || s == "killed" || s == "declined"
}
func min(a, b int) int {
if a < b {
return a
}
return b
}

View file

@ -1,40 +0,0 @@
package main
import "testing"
func TestParseOwnerRepo(t *testing.T) {
cases := []struct{ in, owner, repo string }{
{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
}
for _, c := range cases {
o, r, err := parseOwnerRepo(c.in)
if err != nil || o != c.owner || r != c.repo {
t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
}
}
if _, _, err := parseOwnerRepo("nonsense"); err == nil {
t.Error("expected error for unparseable remote")
}
}
func TestStatusClassification(t *testing.T) {
for _, s := range []string{"success", "failure", "error", "killed"} {
if !isTerminalStatus(s) {
t.Errorf("%q should be terminal", s)
}
}
for _, s := range []string{"running", "pending"} {
if isTerminalStatus(s) {
t.Errorf("%q should not be terminal", s)
}
}
if !isFailureStatus("failure") || !isFailureStatus("error") {
t.Error("failure/error should classify as failure")
}
if isFailureStatus("success") {
t.Error("success must not classify as failure")
}
}

Binary file not shown.

View file

@ -80,6 +80,8 @@ def sofia():
pfsense >> k8s_switch pfsense >> k8s_switch
with Cluster('Management Network'): with Cluster('Management Network'):
mgt_switch = Switch() mgt_switch = Switch()
# Truenas
truenas = Storage("Truenas")
# pxe server # pxe server
pxe_server = Rack("PXE Server") pxe_server = Rack("PXE Server")
# HA # HA
@ -89,6 +91,7 @@ def sofia():
devvm_vpn_client = VPN("Tailscale Client") devvm_vpn_client = VPN("Tailscale Client")
vpn_clients["devvm"] = devvm_vpn_client vpn_clients["devvm"] = devvm_vpn_client
mgt_switch >> truenas
mgt_switch >> pxe_server mgt_switch >> pxe_server
mgt_switch >> home_assistant mgt_switch >> home_assistant
mgt_switch >> devvm mgt_switch >> devvm

View file

@ -20,7 +20,7 @@ This repository contains the configuration and documentation for a homelab Kuber
| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog | | [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules | | [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules |
| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration | | [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration |
| [Storage](architecture/storage.md) | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management | | [Storage](architecture/storage.md) | TrueNAS NFS, democratic-csi, and persistent volume management |
| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration | | [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration |
| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls | | [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls |
| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack | | [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack |

View file

@ -1,42 +0,0 @@
---
status: accepted
---
# The Android testing environment is a privileged KVM emulator pod in-cluster
Viktor's apps are growing Android clients (first: tripit's Capacitor shell —
see tripit ADR-0013/0014), and agents need a native Android instance to test
changes against before shipping. All K8s nodes already run with CPU type
`host`, so `/dev/kvm` works inside the cluster.
Decision (2026-06-11): one shared **Android 16 (API 36) Google-emulator
instance** runs as a privileged pod in namespace `android-emulator`
(stack `stacks/android-emulator`), with `/dev/kvm` via hostPath, adb exposed
LAN-only on the shared MetalLB IP (10.0.20.200:5555), and a noVNC screen view
at android-emulator.viktorbarzin.lan. The SDK/system-image/AVD live on a PVC;
the image is a slim manually-built shell.
## Considered options
- **devvm-local docker emulator** — rejected as the durable home: shared
24GB workstation, ~13GB free disk, per-machine, not shared across agents.
- **Dedicated Proxmox VM** — rejected: burns scarce PVE host headroom 24/7
and adds a whole VM lifecycle for one emulator.
- **redroid (container-native Android)** — rejected: requires binder kernel
modules on every node (documented binderfs incompatibilities), max
Android 15; most invasive for the least version coverage.
- **budtmo/docker-android** — rejected: turnkey but capped at Android 14;
the native features driving the Android work (Live Updates, background
GPS) are Android 16 behaviors, matching the real target device.
- **/dev/kvm device plugin instead of privileged** — deferred: a new
cluster component to avoid one namespace-scoped exclude-list entry; the
exclude pattern (kured/woodpecker/frigate/changedetection) already exists.
## Consequences
- `android-emulator` joins the Kyverno `security_policy_exclude_namespaces`
list (privileged allowed; registry policy also bypassed in-namespace).
- adb is unauthenticated by design — the LB IP must remain LAN-only.
- Single shared instance: concurrent agent sessions share Android state;
long destructive work should presence-claim `service:android-emulator`.
- Rendering is swiftshader (CPU) — the contended T4 stays out of the path.

View file

@ -1,24 +0,0 @@
---
status: accepted
date: 2026-06-12
---
# All owned images build off-infra on GitHub Actions and live on ghcr.io
In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it.
## Considered options
- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected.
- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes.
- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected.
## Consequences
- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster").
- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally.
- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source).
- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro).
- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it.
- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern.
- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup.

View file

@ -1,30 +0,0 @@
# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
Status: accepted (extends ADR-0002)
## Context
Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
## Decision
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
## Considered options
- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
## Consequences
- Divergence becomes structurally impossible — one push target per repo.
- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.

View file

@ -1,30 +0,0 @@
# homelab: a unified infra-ops CLI grown in place from infra/cli
Agents re-derive the same operational command boilerplate every session — mining
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
the deterministic, repeated **actions** (not judgment) agents run — composable in
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
file (the infra repo deploys continuously and does not cut semver tags).
## Considered options
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
GitOps continuous-deploy.
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
webhook use-cases.
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
recurring action surface (methodology skills; third-party/owned MCP such as
phpIPAM, which homelab does NOT duplicate).
## Consequences
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
and falls through to the legacy `-use-case` path verbatim.
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.

View file

@ -1,23 +0,0 @@
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
commands and where agents lose the most time and leak the most presence claims.
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
relying on existing gates (harness permission mode, presence claims, plan
approval). But every verb records a `read|write` tier (visible in `manifest`), so
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
later with zero restructuring.
## Considered options
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
value, but defers the toil that motivated the project.
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
coupling, branch-protection PR fallback) for the biggest immediate toil
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.

View file

@ -1,29 +0,0 @@
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
Four behaviours of the infra-loop verbs are surprising enough to record:
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
native harness worktree tool.** A CLI is a child process and cannot change the
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
prints the path — the agent enters it with native `EnterWorktree({path})`.
2. **`work land` is auto-land, but gated on verification.** It merges master in →
runs verification → pushes `HEAD:master` (fetch+merge+retry on
non-fast-forward) → falls back to pushing the feature branch for a PR when the
direct push is rejected (branch protection). It **refuses to push when it
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
`--no-verify` is passed — added after an accidental smoke-test land pushed
unverified WIP to master (benign: the infra CI applied 0 stacks because the
diff was `cli/`-only, but an unverified land must be deliberate, not default).
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
Local applies are out-of-band (CI applies canonically on push) but happen
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
delegates to `scripts/tg apply --non-interactive`, and **always releases on
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
documented ~200-claim leak — and prints an out-of-band reminder.
4. **Known v0.1 limitation:** `work land` does not yet block on CI to green; that
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
the pipeline manually.

View file

@ -1,30 +0,0 @@
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
than every other domain combined).
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
one app, so `<app>` defaults to the namespace, and the target defaults to
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
## Decisions worth recording
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
`scale`/`create`). They stay raw `kubectl`, by design, per the repo's
Terraform-only policy — the corpus confirms they're low-frequency, and a
friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
`psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
`bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.

View file

@ -1,30 +0,0 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).
Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.
## Deprecation path (deliberate follow-up — NOT done in v0.3)
The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:
1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.
Done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.

View file

@ -1,29 +0,0 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).
## Decisions
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
not its Postgres schema (which drifts across upgrades — column renames bit us
mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
equivalent of the house `curl --resolve` pattern). Token from
`WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
under load (it flapped through the whole build session); `getJSON` retries
empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
on the landed commit and fails if the pipeline does — closing the gap ADR-0005
deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
the deployment image to reference the expected sha, *then* blocks on rollout
status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
endpoints were the least reliable this session (often empty); `status`/`watch`
rely on the list endpoint that works. A DB-backed `ci logs` is a possible
follow-up if the API path stays flaky.

View file

@ -1,37 +0,0 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)
## Decisions
- **Internal ingresses, reached via the LB.** Everything routes through the
Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
Go form of the house `curl --resolve host:443:10.0.20.203` pattern
(`probe.go: clientDialingIP`). Verified live before building: Prometheus
(`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
answer JSON over the LB with **no auth gate and no port-forward** — so these
stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
(→ Cloudflare) AND dials the internal LB, reporting both — because the useful
question is *where* a break is (CF edge vs the app vs the LB path), which a
single curl can't answer. The external leg forces public resolution (the devvm
resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
`prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
raw `*.svc` services) that would force port-forward/`kubectl run`. The
reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
command-wrappers (low savings); only compound node ops (serial console, VM
wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
unless a concrete pain surfaces — the high-value deterministic surface
(tf/work/ci/k8s/memory + these probes) is now covered.

View file

@ -1,34 +0,0 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
**No args, paths, flags, hostnames, or secrets** ever leave the process — the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
back with a LogQL metric query. This is the privacy-preserving resolution to
"what does everyone (e.g. another user) use" — it never touches anyone's
`~/.claude`, which the org per-user policy bars (see the per-user red-line in
managed-settings; reading another user's home is off-limits even for an owner
in-session — a fresh session under changed MDM policy is the only legitimate
path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.

Some files were not shown because too many files have changed in this diff Show more