vault-token-renew: version the devvm renewer + user units in the repo

The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.)

- scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access).

- scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions).

- docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery.

- docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-07 16:26:01 +00:00
parent f9d5cd6243
commit d4ec5768b2
6 changed files with 289 additions and 2 deletions

View file

@ -77,7 +77,7 @@ graph LR
- Application configuration secrets
- Encryption keys
Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider.
Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider. On `devvm`, `~/.vault-token` instead holds a long-lived **periodic** admin token auto-renewed daily by a systemd user timer (no weekly re-login) — see the [vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
### External Secrets Operator (ESO)
@ -260,7 +260,14 @@ spec:
### Terraform Provider Auth
`~/.vault-token` created by `vault login -method=oidc`:
The provider reads `VAULT_ADDR` from env and the token from `~/.vault-token`.
That file is populated by `vault login -method=oidc` (humans, ad-hoc) — except
on `devvm`, where it holds a long-lived **periodic** admin token (`display_name
token-devvm-wizard`, `period=768h`, `explicit_max_ttl=0`, policies
`default`+`sops-admin`+`vault-admin`) that a systemd user timer renews daily, so
no weekly re-login is needed. A drift guard refuses to renew if a stray
`vault login` clobbers the file with a foreign token. Deploy + recovery:
[vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
```hcl
provider "vault" {

View file

@ -0,0 +1,114 @@
# Runbook: devvm Vault token auto-renewal
**Host:** `devvm` (10.0.10.10), user `wizard`
**Source of truth:** `infra/scripts/vault-token-renew.{sh,service,timer}`
**Live paths:** `~/.local/bin/vault-token-renew`, `~/.config/systemd/user/vault-token-renew.{service,timer}`
## What this is
`wizard@devvm` authenticates to Vault with a **periodic, orphan** token stored
in `~/.vault-token`, instead of a 7-day OIDC login that needed weekly
re-auth. A systemd **user** timer renews it daily so it never expires.
| Property | Value |
|---|---|
| `display_name` | `token-devvm-wizard` |
| `period` | `768h` (32 days) |
| `explicit_max_ttl` | `0` (no hard cap) |
| `policies` | `default`, `sops-admin`, `vault-admin` |
| `orphan` | `true` (not revoked when any parent expires) |
Periodic tokens have no max-TTL; they only need renewing once per `period`.
Daily renewal leaves a 32× margin. **If devvm is decommissioned and the timer
stops, the token self-expires within ~32 days** — deliberately, unlike a root
token which would live forever (this is the security trade-off Viktor chose:
periodic + renewer over a never-expiring root token).
## Deploy on a fresh devvm
The renewer is a host-side script + user systemd units, deployed manually (same
model as the other `infra/scripts/` host scripts). From a checkout of the repo
**as user `wizard` on devvm**:
```bash
cd ~/code/infra/scripts
install -m 0755 vault-token-renew.sh ~/.local/bin/vault-token-renew # strip .sh
install -m 0644 vault-token-renew.service vault-token-renew.timer ~/.config/systemd/user/
# user manager must survive logout, so the daily timer fires headless
loginctl enable-linger "$USER"
systemctl --user daemon-reload
systemctl --user enable --now vault-token-renew.timer
```
Then mint the token (one-time, interactive — see below). The script and units
carry no secret; only the token itself is sensitive and stays out of git.
## Mint / re-mint the token
Requires an interactive OIDC login (browser), so it can't run unattended:
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault login -method=oidc
vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
-field=token > ~/.vault-token
chmod 600 ~/.vault-token
```
Vault prefixes the display name, so it becomes `token-devvm-wizard` (which is
what the drift guard checks for). `-orphan` is essential: a child of the 7-day
OIDC token would be revoked when that parent expired.
## Health check
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault token lookup | grep -E 'display_name|period|explicit_max_ttl|policies'
# expect: display_name token-devvm-wizard, period 768h, explicit_max_ttl 0s,
# policies [default sops-admin vault-admin]
# authoritative write-capability check (do NOT trust the policies field alone —
# an OIDC token shows policies=[default] but carries vault-admin via identity):
vault token capabilities secret/data/viktor # expect create/update/.../sudo
# renewer health
systemctl --user list-timers | grep vault-token-renew # next/last run
tail -5 ~/.local/state/vault-token-renew.log # recent results
```
A healthy log line looks like:
`<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
## Drift guard & recovery
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`).
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
two days — reads worked, writes silently 403'd.
To stop the renewer from silently keeping a foreign token alive, it runs a
**drift guard** first: it refuses to renew unless the token is
`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
exits non-zero (the systemd unit goes `failed`) rather than renewing someone
else's token. Symptom in the log:
`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
recovery is the manual re-mint above.
## Tests
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case). Run: `bash infra/scripts/test-vault-token-renew.sh`.