vault-token-renew runbook: document the self-heal behavior
All checks were successful
ci/woodpecker/push/default Pipeline was successful

Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-07-03 20:20:44 +00:00
parent 4a7b6db806
commit d9717a53bf
2 changed files with 39 additions and 23 deletions


@@ -82,33 +82,48 @@ tail -5 ~/.local/state/vault-token-renew.log # recent results
 
 A healthy log line looks like:
 `<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
+After an OIDC login you'll instead see, at the next nightly run:
+`<ts> HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))`
+— that's the self-heal working as designed.
 
-## Drift guard & recovery
+## Drift guard & self-heal
 
 `~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
 overwrites it. Two confirmed clobber vectors:
 
 1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
-can't push past the OIDC role's 7-day `token_max_ttl`).
+can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
+prescribe this login before applies, so it recurs — it went unnoticed for
+weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
+weekly".
 2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
 writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
-**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
-two days — reads worked, writes silently 403'd.
+**cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
 
-To stop the renewer from silently keeping a foreign token alive, it runs a
-**drift guard** first: it refuses to renew unless the token is
-`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
-exits non-zero (the systemd unit goes `failed`) rather than renewing someone
-else's token. Symptom in the log:
-`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
+Since 2026-07-03 the renewer **self-heals**
+(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
+it attempts the re-mint **with the clobbering token's own authority** and lets
+Vault's authz decide:
+
+- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
+sanity-checks it against the drift guard, atomically replaces
+`~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
+(anti-sprawl), logs
+`HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
+and exits 0. The clobbering token is NOT revoked — it may still back a live
+login session; it ages out on its own.
+- **Weak clobber (read-only k8s token)** → the mint is denied; logs
+`DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
+and exits non-zero (unit `failed`). Deliberately loud: this signals a
+misbehaving agent flow — exactly the 2026-06-05 case.
 
-**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
-[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
-**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
-recovery is the manual re-mint above.
+**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
+line still contains the exact command) — run the
+[mint/re-mint](#mint--re-mint-the-token) block.
 
 ## Tests
 
-`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
-and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
-case). Run: `bash infra/scripts/test-vault-token-renew.sh`.
+`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
+the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
+case), and the self-heal's revoke filter (which stale periodic tokens a heal
+may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`.
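
The decision flow documented in this runbook change (renew our own token, heal an admin-capable clobber, fail loudly on a weak one) can be sketched as a pair of pure Bash functions. This is only an illustration under stated assumptions: the names `demo_is_ours` and `demo_outcome`, the comma-separated policy string, and passing the re-mint attempt's exit status in as an argument are all hypothetical and are not taken from `vault-token-renew.sh`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names and argument shapes are assumptions.

# Drift guard: the token counts as "ours" only if its display name is
# token-devvm-wizard AND its policy list contains vault-admin.
demo_is_ours() {
  local dn="$1" policies="$2"
  [[ "$dn" == "token-devvm-wizard" ]] || return 1
  # Wrap in commas so vault-admin matches only as a whole list element.
  [[ ",${policies}," == *",vault-admin,"* ]]
}

# Outcome for one nightly run.
#   $1 display name, $2 comma-separated policies,
#   $3 exit status of a re-mint attempted with the foreign token (0 = allowed).
demo_outcome() {
  local dn="$1" policies="$2" mint_rc="$3"
  if demo_is_ours "$dn" "$policies"; then
    echo "renew"      # our periodic token: plain renewal
  elif [[ "$mint_rc" -eq 0 ]]; then
    echo "healed"     # admin-capable clobber: Vault allowed the re-mint
  else
    echo "drift"      # weak clobber: Vault denied the re-mint, fail loudly
    return 1
  fi
}
```

Calling `demo_outcome kubernetes-woodpecker-default default 1` prints `drift` and returns non-zero, mirroring the weak-clobber path. The same foreign token with a mint status of 0 would print `healed`. Leaving the allow/deny decision to the mint attempt's status keeps the sketch in line with the runbook's point that Vault's authz, not the script, decides whether a heal is possible.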


@@ -1,10 +1,11 @@
 #!/usr/bin/env bash
-# Unit tests for the pure drift-guard functions in vault-token-renew.sh.
-# Sources the script (vtr_main is guarded) and exercises the decision logic that
-# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign
-# token that clobbered the file (refuse, fail loud). This is exactly the logic
-# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed
-# for two days. Run: bash infra/scripts/test-vault-token-renew.sh
+# Unit tests for the pure functions in vault-token-renew.sh.
+# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
+# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
+# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
+# clobber be silently renewed for two days, and (b) the self-heal's revoke
+# filter — which stale token-devvm-wizard tokens a heal may sweep.
+# Run: bash infra/scripts/test-vault-token-renew.sh
 set -uo pipefail
 DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=/dev/null
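
The header comment above says the test sources the script and that `vtr_main` is guarded. A common Bash idiom for such a guard compares `BASH_SOURCE[0]` with `$0`, so the entry point runs only on direct execution. Whether `vault-token-renew.sh` uses exactly this form is an assumption; `demo_main`, `demo_decide`, and the temporary file below are hypothetical names used purely for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the "guarded main" idiom; all demo_* names are illustrative.

demo="$(mktemp)"
cat >"$demo" <<'EOF'
demo_decide() { [[ "$1" == "token-devvm-wizard" ]]; }  # a pure function a test can call
demo_main()   { echo "main ran"; }
# Run the entry point only when executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  demo_main "$@"
fi
EOF

# Sourcing defines the functions but does not call demo_main.
sourced_output="$(source "$demo")"
# Executing the file directly does call demo_main.
executed_output="$(bash "$demo")"
rm -f "$demo"
```

With a guard like this, a test file can source the script, call the pure functions directly, and never trigger a real renewal against Vault.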