Commit graph

722 commits

Author SHA1 Message Date
Viktor Barzin
2eca011cc3 [ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow succeed — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.

Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.

Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.

## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.

## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]
  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...]    # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.

### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
Viktor Barzin
9e5d7cd825 state(vault): update encrypted state 2026-04-18 22:12:55 +00:00
Viktor Barzin
402fd1fbac state(dbaas): update encrypted state 2026-04-18 22:12:09 +00:00
Viktor Barzin
6cf3575ed9 state(dbaas): update encrypted state 2026-04-18 19:17:31 +00:00
Viktor Barzin
30fa411bf7 state(dbaas): update encrypted state 2026-04-18 19:17:20 +00:00
Viktor Barzin
61e94c21fe state(dbaas): update encrypted state 2026-04-18 19:16:41 +00:00
Viktor Barzin
81e7c3d6ee state(dbaas): update encrypted state 2026-04-18 18:59:51 +00:00
Viktor Barzin
9780c04ca0 state(dbaas): update encrypted state 2026-04-17 22:33:13 +00:00
Viktor Barzin
1860cd1dfb state(vault): update encrypted state 2026-04-17 14:14:05 +00:00
Viktor Barzin
f0ddfb8cae state(dbaas): update encrypted state 2026-04-17 14:08:49 +00:00
Viktor Barzin
0c4fe98d75 state(dbaas): update encrypted state 2026-04-17 10:08:04 +00:00
Viktor Barzin
8b206a63ad state(dbaas): update encrypted state 2026-04-16 22:55:52 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
ef30f27ac9 state(dbaas): update encrypted state 2026-04-16 18:56:59 +00:00
Viktor Barzin
b6fc1e63a6 state(dbaas): import postgresql-lb service 2026-04-16 18:55:40 +00:00
Viktor Barzin
14fa2b9762 state(vault): update encrypted state 2026-04-16 18:43:06 +00:00
Viktor Barzin
1a42f750f8 state(dbaas): update encrypted state 2026-04-16 18:41:34 +00:00
Viktor Barzin
0a43b5c2ac state(dbaas): update encrypted state 2026-04-16 18:31:33 +00:00
Viktor Barzin
cd513a2226 state(dbaas): update encrypted state 2026-04-16 18:24:31 +00:00
Viktor Barzin
0368601eff state(dbaas): update encrypted state 2026-04-16 18:24:20 +00:00
Viktor Barzin
8bd2ace00d state(technitium): update encrypted state 2026-04-16 17:21:06 +00:00
Viktor Barzin
7680d4e009 state(dawarich): update encrypted state 2026-04-16 17:19:29 +00:00
Viktor Barzin
611c67b92c state(paperless-ngx): update encrypted state 2026-04-16 17:10:57 +00:00
Viktor Barzin
e9275534b6 state(dawarich): update encrypted state 2026-04-16 17:08:15 +00:00
Viktor Barzin
1f589a403c state(dawarich): update encrypted state 2026-04-16 17:04:44 +00:00
Viktor Barzin
178fc4b398 state(matrix): update encrypted state 2026-04-16 17:01:28 +00:00
Viktor Barzin
449f1af9d6 state(immich): update encrypted state 2026-04-16 17:00:59 +00:00
Viktor Barzin
0ec48d942f state(paperless-ngx): update encrypted state 2026-04-16 17:00:58 +00:00
Viktor Barzin
88c47efa1d state(url): update encrypted state 2026-04-16 16:54:31 +00:00
Viktor Barzin
bf257414b3 state(dawarich): update encrypted state 2026-04-16 16:53:19 +00:00
Viktor Barzin
727b3c4570 state(coturn): update encrypted state 2026-04-16 16:48:48 +00:00
Viktor Barzin
1171b390c5 state(owntracks): update encrypted state 2026-04-16 16:48:40 +00:00
Viktor Barzin
a0ea11a4b4 state(coturn): update encrypted state 2026-04-16 16:48:12 +00:00
Viktor Barzin
5d610baed8 state(ollama): update encrypted state 2026-04-16 16:44:34 +00:00
Viktor Barzin
541bee7176 state(ebooks): update encrypted state 2026-04-16 16:05:27 +00:00
root
af090c818b Woodpecker CI deploy [CI SKIP] 2026-04-16 13:46:08 +00:00
Viktor Barzin
95d2a6abf8 state(wealthfolio): update encrypted state 2026-04-16 11:30:59 +00:00
Viktor Barzin
e8874dd37a state(cloudflared): update encrypted state 2026-04-16 10:59:30 +00:00
Viktor Barzin
997fd4f85b state(linkwarden): update encrypted state 2026-04-16 10:35:35 +00:00
Viktor Barzin
2ae31148cb state(ytdlp): update encrypted state 2026-04-16 10:33:55 +00:00
Viktor Barzin
43b0316978 state(xray): update encrypted state 2026-04-16 10:33:39 +00:00
Viktor Barzin
f0e7de8e57 state(woodpecker): update encrypted state 2026-04-16 10:33:27 +00:00
Viktor Barzin
deff4ae9f5 state(webhook_handler): update encrypted state 2026-04-16 10:33:11 +00:00
Viktor Barzin
1557ce0084 state(servarr): update encrypted state 2026-04-16 10:30:30 +00:00
Viktor Barzin
6d0772df60 state(vpa): update encrypted state 2026-04-16 10:25:07 +00:00
Viktor Barzin
1616b3c483 state(vaultwarden): update encrypted state 2026-04-16 10:24:42 +00:00
Viktor Barzin
a34df78158 state(vault): update encrypted state 2026-04-16 10:24:29 +00:00
Viktor Barzin
fc813bd5bd state(tuya-bridge): update encrypted state 2026-04-16 10:19:56 +00:00
Viktor Barzin
192bb2348f state(traefik): update encrypted state 2026-04-16 10:19:35 +00:00
Viktor Barzin
90189a4307 state(trading-bot): update encrypted state 2026-04-16 10:19:13 +00:00