diff --git a/docs/architecture/secrets.md b/docs/architecture/secrets.md
index 8a46eef4..4aa15d6c 100644
--- a/docs/architecture/secrets.md
+++ b/docs/architecture/secrets.md
@@ -77,7 +77,7 @@ graph LR
 - Application configuration secrets
 - Encryption keys
 
-Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider.
+Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider. On `devvm`, `~/.vault-token` instead holds a long-lived **periodic** admin token auto-renewed daily by a systemd user timer (no weekly re-login) — see the [vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
 
 ### External Secrets Operator (ESO)
 
@@ -260,7 +260,14 @@ spec:
 
 ### Terraform Provider Auth
 
-`~/.vault-token` created by `vault login -method=oidc`:
+The provider reads `VAULT_ADDR` from env and the token from `~/.vault-token`.
+That file is populated by `vault login -method=oidc` (humans, ad-hoc) — except
+on `devvm`, where it holds a long-lived **periodic** admin token (`display_name
+token-devvm-wizard`, `period=768h`, `explicit_max_ttl=0`, policies
+`default`+`sops-admin`+`vault-admin`) that a systemd user timer renews daily, so
+no weekly re-login is needed. A drift guard refuses to renew if a stray
+`vault login` clobbers the file with a foreign token. Deploy + recovery:
+[vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
 
 ```hcl
 provider "vault" {
diff --git a/docs/runbooks/vault-token-renew-devvm.md b/docs/runbooks/vault-token-renew-devvm.md
new file mode 100644
index 00000000..2dc4d35b
--- /dev/null
+++ b/docs/runbooks/vault-token-renew-devvm.md
@@ -0,0 +1,114 @@
+# Runbook: devvm Vault token auto-renewal
+
+**Host:** `devvm` (10.0.10.10), user `wizard`
+**Source of truth:** `infra/scripts/vault-token-renew.{sh,service,timer}`
+**Live paths:** `~/.local/bin/vault-token-renew`, `~/.config/systemd/user/vault-token-renew.{service,timer}`
+
+## What this is
+
+`wizard@devvm` authenticates to Vault with a **periodic, orphan** token stored
+in `~/.vault-token`, instead of a 7-day OIDC login that needed weekly
+re-auth. A systemd **user** timer renews it daily so it never expires.
+
+| Property | Value |
+|---|---|
+| `display_name` | `token-devvm-wizard` |
+| `period` | `768h` (32 days) |
+| `explicit_max_ttl` | `0` (no hard cap) |
+| `policies` | `default`, `sops-admin`, `vault-admin` |
+| `orphan` | `true` (not revoked when any parent expires) |
+
+Periodic tokens have no max-TTL; they only need renewing once per `period`.
+Daily renewal leaves a 32× margin. **If devvm is decommissioned and the timer
+stops, the token self-expires within ~32 days** — deliberately, unlike a root
+token which would live forever (this is the security trade-off Viktor chose:
+periodic + renewer over a never-expiring root token).
+
+## Deploy on a fresh devvm
+
+The renewer is a host-side script + user systemd units, deployed manually (same
+model as the other `infra/scripts/` host scripts). From a checkout of the repo
+**as user `wizard` on devvm**:
+
+```bash
+cd ~/code/infra/scripts
+install -m 0755 vault-token-renew.sh ~/.local/bin/vault-token-renew # strip .sh
+install -m 0644 vault-token-renew.service vault-token-renew.timer ~/.config/systemd/user/
+
+# user manager must survive logout, so the daily timer fires headless
+loginctl enable-linger "$USER"
+
+systemctl --user daemon-reload
+systemctl --user enable --now vault-token-renew.timer
+```
+
+Then mint the token (one-time, interactive — see below). The script and units
+carry no secret; only the token itself is sensitive and stays out of git.
+
+## Mint / re-mint the token
+
+Requires an interactive OIDC login (browser), so it can't run unattended:
+
+```bash
+export VAULT_ADDR=https://vault.viktorbarzin.me
+vault login -method=oidc
+# mint into a temp file: `> ~/.vault-token` would truncate the login token before vault reads it
+(umask 077; vault token create -orphan -period=768h -policy=vault-admin \
+  -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token.new) &&
+  mv ~/.vault-token.new ~/.vault-token
+```
+
+Vault prefixes the display name, so it becomes `token-devvm-wizard` (which is
+what the drift guard checks for). `-orphan` is essential: a child of the 7-day
+OIDC token would be revoked when that parent expired.
+
+## Health check
+
+```bash
+export VAULT_ADDR=https://vault.viktorbarzin.me
+vault token lookup | grep -E 'display_name|period|explicit_max_ttl|policies'
+# expect: display_name token-devvm-wizard, period 768h, explicit_max_ttl 0s,
+# policies [default sops-admin vault-admin]
+
+# authoritative write-capability check (do NOT trust the policies field alone —
+# an OIDC token shows policies=[default] but carries vault-admin via identity):
+vault token capabilities secret/data/viktor # expect create/update/.../sudo
+
+# renewer health
+systemctl --user list-timers | grep vault-token-renew # next/last run
+tail -5 ~/.local/state/vault-token-renew.log # recent results
+```
+
+A healthy log line looks like:
+`TIMESTAMP OK renewed (dn=token-devvm-wizard ttl=2764800s)` (`TIMESTAMP` is `date -Is` output; ttl 2764800s = 768h).
+
+## Drift guard & recovery
+
+`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
+overwrites it. Two confirmed clobber vectors:
+
+1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
+   can't push past the OIDC role's 7-day `token_max_ttl`).
+2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
+   writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
+   **cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
+   two days — reads worked, writes silently 403'd.
+
+To stop the renewer from silently keeping a foreign token alive, it runs a
+**drift guard** first: it refuses to renew unless the token is
+`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
+exits non-zero (the systemd unit goes `failed`) rather than renewing someone
+else's token. Symptom in the log (prefixed with a `date -Is` timestamp):
+
+`TIMESTAMP DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
+
+**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
+[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
+**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
+recovery is the manual re-mint above.
+
+## Tests
+
+`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
+and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
+case). Run: `bash infra/scripts/test-vault-token-renew.sh`.
diff --git a/scripts/test-vault-token-renew.sh b/scripts/test-vault-token-renew.sh
new file mode 100644
index 00000000..d64d02b4
--- /dev/null
+++ b/scripts/test-vault-token-renew.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+# Unit tests for the pure drift-guard functions in vault-token-renew.sh.
+# Sources the script (vtr_main is guarded) and exercises the decision logic that
+# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign
+# token that clobbered the file (refuse, fail loud). This is exactly the logic
+# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed
+# for two days. Run: bash infra/scripts/test-vault-token-renew.sh
+set -uo pipefail
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=/dev/null
+source "$DIR/vault-token-renew.sh"
+
+pass=0 fail=0
+ok() { # ok NAME CMD...: expects the command to succeed (renew-OK)
+  if "${@:2}"; then pass=$((pass + 1)); else
+    fail=$((fail + 1)); printf 'FAIL: %s — expected OK, got refuse\n' "$1"
+  fi
+}
+no() { # no NAME CMD...: expects the command to fail (drift, refuse)
+  if "${@:2}"; then
+    fail=$((fail + 1)); printf 'FAIL: %s — expected DRIFT, got OK\n' "$1"
+  else pass=$((pass + 1)); fi
+}
+eq() { # eq NAME EXPECTED ACTUAL: expects the two strings to be equal
+  if [[ "$2" == "$3" ]]; then pass=$((pass + 1)); else
+    fail=$((fail + 1)); printf 'FAIL: %s — expected [%s] got [%s]\n' "$1" "$2" "$3"
+  fi
+}
+
+# --- vtr_drift_ok: ONLY our periodic admin token (right name AND vault-admin) renews ---
+ok "our token renews" vtr_drift_ok token-devvm-wizard "default,sops-admin,vault-admin"
+ok "vault-admin anywhere in list" vtr_drift_ok token-devvm-wizard "default,vault-admin"
+ok "policy order irrelevant" vtr_drift_ok token-devvm-wizard "vault-admin,default"
+no "woodpecker clobber refused" vtr_drift_ok kubernetes-woodpecker-default "ci,default,terraform-state"
+no "oidc token (admin but wrong dn)" vtr_drift_ok oidc-vbarzin "default,sops-admin,vault-admin"
+no "right name, no vault-admin" vtr_drift_ok token-devvm-wizard "default,sops-admin"
+no "empty display_name" vtr_drift_ok "" "vault-admin"
+no "empty policies" vtr_drift_ok token-devvm-wizard ""
+no "no substring false-positive" vtr_drift_ok token-devvm-wizard "default,vault-admin-ro"
+
+# --- vtr_display_name / vtr_policies_csv: parse real `vault token lookup -format=json` ---
+LOOKUP_OURS='{"data":{"display_name":"token-devvm-wizard","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
+LOOKUP_OIDC='{"data":{"display_name":"oidc-vbarzin","policies":["default"],"identity_policies":["sops-admin","vault-admin"]}}'
+LOOKUP_WP='{"data":{"display_name":"kubernetes-woodpecker-default","policies":["ci","default","terraform-state"],"identity_policies":[]}}'
+eq "dn ours" "token-devvm-wizard" "$(vtr_display_name "$LOOKUP_OURS")"
+eq "dn oidc" "oidc-vbarzin" "$(vtr_display_name "$LOOKUP_OIDC")"
+eq "pols ours" "default,sops-admin,vault-admin" "$(vtr_policies_csv "$LOOKUP_OURS")"
+eq "pols oidc merges token+identity" "default,sops-admin,vault-admin" "$(vtr_policies_csv "$LOOKUP_OIDC")"
+eq "pols woodpecker" "ci,default,terraform-state" "$(vtr_policies_csv "$LOOKUP_WP")"
+
+# --- parse + decide end-to-end (the real lookup-JSON -> renew/refuse path) ---
+ok "ours: parse+decide renews" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OURS")" "$(vtr_policies_csv "$LOOKUP_OURS")"
+no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")" "$(vtr_policies_csv "$LOOKUP_WP")"
+no "oidc: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")"
+
+printf '\n%d passed, %d failed\n' "$pass" "$fail"
+(( fail == 0 ))
diff --git a/scripts/vault-token-renew.service b/scripts/vault-token-renew.service
new file mode 100644
index 00000000..4580fd21
--- /dev/null
+++ b/scripts/vault-token-renew.service
@@ -0,0 +1,9 @@
+[Unit]
+Description=Renew the periodic Vault/OpenBao token in ~/.vault-token
+Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/vault-token-renew.sh
+Wants=network-online.target
+After=network-online.target
+
+[Service]
+Type=oneshot
+ExecStart=%h/.local/bin/vault-token-renew
diff --git a/scripts/vault-token-renew.sh b/scripts/vault-token-renew.sh
new file mode 100644
index 00000000..2d73c862
--- /dev/null
+++ b/scripts/vault-token-renew.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+# Renew the long-lived PERIODIC Vault/OpenBao token stored in ~/.vault-token.
+#
+# Background: wizard@devvm used to hold a 7-day OIDC login token (re-auth weekly
+# via `vault login -method=oidc`). On 2026-06-05 that was replaced with a
+# periodic, orphan token so it never expires. Periodic tokens have no max-TTL;
+# they only need renewing within each `period` (768h / 32d here). This unit
+# renews daily, so the token stays alive indefinitely with huge margin. If the
+# box is ever decommissioned and this stops running, the token self-expires
+# within ~32 days (unlike a root token, which would live forever).
+#
+# Token was minted with (vault-admin = path "*" sudo; sops-admin = transit for SOPS):
+#   vault token create -orphan -period=768h \
+#     -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard
+# To recreate if ever lost: `vault login -method=oidc`, run the above with
+# `-field=token > ~/.vault-token.new` (umask 077; not ~/.vault-token itself, which `>` would truncate before vault reads it), then `mv` it into place.
+#
+# Source of truth: infra/scripts/vault-token-renew.sh (deployed to
+# ~/.local/bin/vault-token-renew). Driven by the systemd USER units
+# vault-token-renew.{service,timer}. Deploy + recovery runbook:
+#   infra/docs/runbooks/vault-token-renew-devvm.md
+
+EXPECTED_DN="token-devvm-wizard"
+REQUIRED_POLICY="vault-admin"
+
+# vtr_display_name LOOKUP_JSON -> display_name (empty if absent).
+vtr_display_name() {
+  printf '%s' "$1" | jq -r '.data.display_name // ""'
+}
+
+# vtr_policies_csv LOOKUP_JSON -> comma-joined token policies + identity policies.
+# Both are merged because a token minted via OIDC carries vault-admin only in
+# identity_policies, while .data.policies shows just [default] (misleading on its
+# own — see memory id=4211). Our periodic token carries them as token policies.
+vtr_policies_csv() {
+  printf '%s' "$1" | jq -r '((.data.policies // []) + (.data.identity_policies // [])) | join(",")'
+}
+
+# vtr_drift_ok DISPLAY_NAME POLICIES_CSV -> 0 if this is OUR periodic admin
+# token (right display name AND vault-admin present), 1 otherwise. The comma
+# fencing makes the policy match exact (so "vault-admin-ro" never matches).
+vtr_drift_ok() {
+  local dn="$1" pols="$2"
+  [ "$dn" = "$EXPECTED_DN" ] || return 1
+  printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1
+}
+
+vtr_main() {
+  set -euo pipefail
+  export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}"
+  export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}"
+
+  local log info dn pols out ttl
+  log="${XDG_STATE_HOME:-$HOME/.local/state}/vault-token-renew.log"
+  mkdir -p "$(dirname "$log")"
+
+  if ! info=$(vault token lookup -format=json 2>&1); then
+    printf '%s FAIL: token lookup: %s\n' "$(date -Is)" "$info" >>"$log"
+    exit 1
+  fi
+  dn=$(vtr_display_name "$info")
+  pols=$(vtr_policies_csv "$info")
+
+  # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
+  # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
+  # with a read-only woodpecker token, and this script then silently renewed THAT
+  # for two days — masking the loss of write access. So before renewing, confirm
+  # the token is our periodic admin token; if it has drifted, fail loudly (systemd
+  # marks the unit failed) instead of keeping someone else's token alive.
+  if ! vtr_drift_ok "$dn" "$pols"; then
+    printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && (umask 077; vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token.new) && mv ~/.vault-token.new ~/.vault-token\n' \
+      "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
+    exit 1
+  fi
+
+  # `vault token renew` with no argument renews the calling token (renew-self).
+  # On success, log only the new TTL (never the raw JSON — it contains the token).
+  if out=$(vault token renew -format=json 2>&1); then
+    ttl=$(printf '%s' "$out" | jq -r '.auth.lease_duration' 2>/dev/null || echo '?')
+    printf '%s OK renewed (dn=%s ttl=%ss)\n' "$(date -Is)" "$dn" "$ttl" >>"$log"
+  else
+    printf '%s FAIL: %s\n' "$(date -Is)" "$out" >>"$log"
+    exit 1
+  fi
+}
+
+# Run main only when executed directly, so the test can source the pure functions.
+if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
+  vtr_main "$@"
+fi
diff --git a/scripts/vault-token-renew.timer b/scripts/vault-token-renew.timer
new file mode 100644
index 00000000..83edaef2
--- /dev/null
+++ b/scripts/vault-token-renew.timer
@@ -0,0 +1,10 @@
+[Unit]
+Description=Daily renewal of the periodic Vault token in ~/.vault-token
+
+[Timer]
+OnCalendar=daily
+Persistent=true
+RandomizedDelaySec=300
+
+[Install]
+WantedBy=timers.target
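
The comma fencing that `vtr_drift_ok` relies on can be tried in isolation. The sketch below is a standalone illustration, not part of the patch: `has_policy` is a hypothetical helper name, and it uses `grep -F` (fixed-string matching) where the patch uses a plain `grep -q`, which should behave the same for a policy name with no regex metacharacters.

```shell
#!/usr/bin/env bash
# Standalone sketch of a comma-fenced membership test on a CSV policy list.
# has_policy is a hypothetical name used only for this illustration.
has_policy() {
  local pols="$1" want="$2"
  # Wrapping both the list and the needle in commas means only a whole
  # list element can match; -F treats the needle as a fixed string.
  printf ',%s,' "$pols" | grep -qF ",$want,"
}

for pols in "default,sops-admin,vault-admin" "default,vault-admin-ro" ""; do
  if has_policy "$pols" vault-admin; then
    printf 'match    [%s]\n' "$pols"
  else
    printf 'no match [%s]\n' "$pols"
  fi
done
```

Only the first list should report a match: `vault-admin-ro` is followed by `-ro,` rather than a comma, and the empty list becomes `,,`, which contains no fenced element at all.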