40 commits

Author SHA1 Message Date
Viktor Barzin
94dfbb9a9c state(vault): update encrypted state 2026-05-22 14:16:41 +00:00
Viktor Barzin
5b255cf6f2 state(vault): update encrypted state 2026-05-07 23:29:35 +00:00
Viktor Barzin
df2fa0a31d state(vault): update encrypted state 2026-04-25 17:09:35 +00:00
Viktor Barzin
7dd580972a state(vault): update encrypted state 2026-04-25 16:57:42 +00:00
Viktor Barzin
08b13858dd state(vault): update encrypted state 2026-04-25 16:16:35 +00:00
Viktor Barzin
3f85cee1ef state(vault): update encrypted state 2026-04-25 16:08:38 +00:00
Viktor Barzin
2eca011cc3 [ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has resulted in a `build-cli` workflow
failure AND a `default` workflow success, but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED`, and the step exited 0 regardless.

Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
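
A minimal sketch of the accumulate-then-fail pattern described in this fix, not the actual contents of `.woodpecker/default.yml`. `apply_stack` is a hypothetical stand-in for `tg apply`, and the stack names and the exit code 2 for lock-skipped stacks are assumptions made for illustration. Only the `.platform_failed` file and the "=== FAILED STACKS ===" banner come from this commit message. The sketch also swaps the `set +e ... set -e` toggle for an `if` around the command substitution, which captures the exit code without changing shell options.

```shell
# Hypothetical stand-in for `tg apply` on one stack.
# Returns 0 on success, 2 when the stack is lock-skipped, 1 on a real failure.
apply_stack() {
  case "$1" in
    servarr) echo "ERROR: Cannot read PG credentials from Vault."; return 1 ;;
    locked)  echo "state lock held, skipping"; return 2 ;;
    *)       echo "Apply complete!"; return 0 ;;
  esac
}

# Apply every stack, continue past failures, and collect the names of the
# stacks that really failed. Lock-skipped stacks are logged but not recorded.
apply_all() {
  FAILED_STACKS=""
  for stack in "$@"; do
    echo "[$stack] Starting apply..."
    if OUTPUT=$(apply_stack "$stack" 2>&1); then EXIT=0; else EXIT=$?; fi
    echo "$OUTPUT"
    case "$EXIT" in
      0) echo "[$stack] OK" ;;
      2) echo "[$stack] SKIPPED (locked)" ;;
      *) echo "[$stack] FAILED (exit $EXIT)"
         FAILED_STACKS="$FAILED_STACKS $stack" ;;
    esac
  done
}

# Final gate: $1 is the file written by the platform step, $2 is the list of
# failed app stacks. Returns 1 if either one names a stack.
report_failures() {
  failed_file=$1
  app_failed=$2
  platform_failed=""
  [ -f "$failed_file" ] && platform_failed=$(cat "$failed_file")
  if [ -n "$(printf '%s%s' "$platform_failed" "$app_failed" | tr -d '[:space:]')" ]; then
    echo "=== FAILED STACKS ==="
    echo "platform:$platform_failed"
    echo "app:$app_failed"
    return 1
  fi
  return 0
}

# Usage sketch (commented out so that sourcing this block has no side effects):
#   platform step:  apply_all dbaas servarr; echo "$FAILED_STACKS" > .platform_failed
#   app step:       apply_all qbittorrent
#                   report_failures .platform_failed "$FAILED_STACKS" || exit 1
```

Writing the platform list to a file matters because each Woodpecker step starts a fresh shell, so a plain variable would not survive to the final gate.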

Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.

## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.

## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]
  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...]    # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.

### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
Viktor Barzin
9e5d7cd825 state(vault): update encrypted state 2026-04-18 22:12:55 +00:00
Viktor Barzin
1860cd1dfb state(vault): update encrypted state 2026-04-17 14:14:05 +00:00
Viktor Barzin
14fa2b9762 state(vault): update encrypted state 2026-04-16 18:43:06 +00:00
Viktor Barzin
a34df78158 state(vault): update encrypted state 2026-04-16 10:24:29 +00:00
Viktor Barzin
aac81e0a1f state(vault): update encrypted state 2026-04-14 11:06:27 +00:00
Viktor Barzin
0eb96e4e22 state(vault): update encrypted state 2026-04-13 23:04:57 +01:00
Viktor Barzin
b7aec4c617 state: update encrypted terraform state 2026-04-12 14:17:12 +01:00
Viktor Barzin
8363efc56b state: update encrypted terraform state 2026-04-12 12:59:01 +01:00
Viktor Barzin
c54a36e7ca state(vault): update encrypted state 2026-04-10 13:33:33 +00:00
Viktor Barzin
cd2d00703c state(vault): update encrypted state 2026-04-06 12:40:54 +03:00
Viktor Barzin
9f91a3db88 state: update encrypted terraform state 2026-04-06 11:26:45 +03:00
Viktor Barzin
f48e400087 state(vault): update encrypted state 2026-04-04 16:10:25 +03:00
Viktor Barzin
e65647edb4 state(vault): add vabbit81 user resources 2026-03-26 17:32:34 +02:00
Viktor Barzin
b6ac68d7f2 state(vault): update encrypted state 2026-03-26 12:21:23 +02:00
Viktor Barzin
45cb49416e state(vault): update encrypted state 2026-03-25 02:48:15 +02:00
Viktor Barzin
41f53a0f3e state(vault): update encrypted state 2026-03-25 02:24:45 +02:00
Viktor Barzin
ab95e0ab2f state(vault): update encrypted state 2026-03-22 15:18:03 +02:00
Viktor Barzin
527bfb1c9e state(vault): update encrypted state 2026-03-22 01:13:02 +02:00
Viktor Barzin
03f55d969f state(vault): update encrypted state 2026-03-18 21:30:59 +00:00
Viktor Barzin
5b29cfc73a state(vault): update encrypted state 2026-03-17 23:46:56 +00:00
Viktor Barzin
4d40c51a97 state(vault): update encrypted state 2026-03-17 23:14:24 +00:00
Viktor Barzin
7a8452e4c7 state(vault): update encrypted state 2026-03-17 23:14:16 +00:00
Viktor Barzin
0215d81622 state(vault): update encrypted state 2026-03-17 23:13:57 +00:00
Viktor Barzin
750cfcce7c state(vault): update encrypted state 2026-03-17 23:13:55 +00:00
Viktor Barzin
e54ad33315 state(vault): update encrypted state 2026-03-17 23:13:19 +00:00
Viktor Barzin
02d0291797 state(vault): update encrypted state 2026-03-17 23:12:58 +00:00
Viktor Barzin
468df3c5c4 state(vault): update encrypted state 2026-03-17 23:12:35 +00:00
Viktor Barzin
cf570c3d3b state(vault): update encrypted state 2026-03-17 23:12:03 +00:00
Viktor Barzin
4277b41c28 state(vault): update encrypted state 2026-03-17 23:11:55 +00:00
Viktor Barzin
77143dfd6b state: per-stack Transit keys for namespace-owner access control
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
  - sops-admin: wildcard access to all transit keys
  - sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
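
As an illustration of the per-stack key naming above, the following sketch builds the Transit URI for one stack. `transit_uri_for_stack` is a hypothetical helper, not a function from `scripts/state-sync`. It assumes `VAULT_ADDR` is set and that the Transit engine is mounted at `transit/`, as the `transit/keys/sops-state-<stack>` path suggests.

```shell
# Hypothetical helper: print the Vault Transit key URI for one stack,
# following the transit/keys/sops-state-<stack> naming from this commit.
transit_uri_for_stack() {
  stack=$1
  # Fail loudly if VAULT_ADDR is missing rather than emitting a relative URI.
  addr=${VAULT_ADDR:?VAULT_ADDR must be set}
  # Strip one trailing slash so the result has no double slash.
  printf '%s/v1/transit/keys/sops-state-%s\n' "${addr%/}" "$stack"
}

# Usage sketch (not run here; the file names are assumptions):
#   sops --encrypt --input-type json --output-type json \
#     --hc-vault-transit "$(transit_uri_for_stack plotting-book)" \
#     terraform.tfstate > terraform.tfstate.enc
```

Because each stack's URI names a different key, a Vault policy can grant a namespace owner access to that one key path without exposing any other stack's state.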
2026-03-17 23:08:18 +00:00
Viktor Barzin
4e7ca1ad61 state: add Vault Transit as primary SOPS backend, age as fallback
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
2026-03-17 22:56:33 +00:00
Viktor Barzin
9f80eb7ba0 state: add devvm as SOPS recipient
Add devvm age public key to .sops.yaml and re-encrypt all 101 state
files with both laptop and devvm keys.
2026-03-17 22:41:19 +00:00
Viktor Barzin
b6faa24349 state: add SOPS-encrypted terraform state to git
- SOPS + age encrypts all 101 .tfstate files (JSON-aware: keys visible, values encrypted)
- scripts/state-sync: encrypt/decrypt/commit wrapper
- scripts/tg: auto-decrypt before ops, auto-encrypt+commit after apply/destroy
- terragrunt.hcl: -backup=- prevents backup file accumulation
- .gitignore: track .tfstate.enc, ignore plaintext .tfstate
- Cleaned 964MB of stale backups (state/backups/, .backup files)
2026-03-17 22:37:56 +00:00