[ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)

Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption.
Operators push PRs → Viktor reviews → CI applies. No encryption keys needed
for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations
(0 remaining CRITICALs).
This commit is contained in:
Viktor Barzin 2026-03-07 13:55:05 +00:00
parent 5b28319bc3
commit 91d11e5cda

View file

@ -0,0 +1,366 @@
# SOPS Multi-User Secrets Migration — Design Document (v3)
## Goal
Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.
## Current State
- **terraform.tfvars**: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
- **secrets/**: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
- **.gitattributes**: encrypts `*.tfvars`, `*.tfstate`, `secrets/**`
- **Woodpecker CI**: unlocks git-crypt via K8s ConfigMap, applies `stacks/platform/` on push
- **Terragrunt**: loads `terraform.tfvars` via `required_var_files` for all stacks
## Design
### 1. Split terraform.tfvars into Two Files
**`config.tfvars`** (NOT encrypted — committed in plaintext):
Non-secret configuration that operators need to read/edit:
- `nfs_server`, `redis_host`, `postgresql_host`, `mysql_host`, `ollama_host`, `mail_host`
- `bind_db_viktorbarzin_me`, `bind_db_viktorbarzin_lan`, `bind_named_conf_options`
- `tls_secret_name`, `client_certificate_secret_name`
- WireGuard peer **public** keys and AllowedIPs only — **NOT** `wireguard_wg_0_conf` (contains private key inline), NOT any `PrivateKey` fields
- Cloudflare DNS zone definitions (record names, not tokens)
**`secrets.sops.json`** (SOPS-encrypted, per-value, JSON format):
All actual secrets, including complex types. JSON format chosen because:
- `sops -d` outputs the same format as input — JSON in, JSON out
- Terraform natively supports `*.auto.tfvars.json` files
- JSON supports all Terraform types: strings, maps, lists, nested objects
- No format conversion needed in the decryption pipeline
**Complex types** in JSON (these are NOT flat strings):
```json
{
"hackmd_db_password": "simple-string-secret",
"mailserver_accounts": {
"info@viktorbarzin.me": "password1",
"admin@viktorbarzin.me": "password2"
},
"homepage_credentials": {
"technitium": {"token": "abc123"},
"crowdsec": {"username": "user", "password": "pass"}
},
"k8s_users": {
"viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
},
"xray_reality_clients": [
{"id": "uuid-here", "flow": "xtls-rprx-vision"}
],
"webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
"wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
}
```
### 2. SOPS Configuration
```yaml
# .sops.yaml
creation_rules:
- path_regex: ^secrets\.sops\.json$
age: >-
age1viktor_public_key,
age1ci_public_key
```
Path regex anchored to repo root (`^`). All secrets encrypted to Viktor + CI.
### 3. Terragrunt Changes
```hcl
# terragrunt.hcl — updated variable loading
terraform {
extra_arguments "common_vars" {
commands = get_terraform_commands_that_need_vars()
required_var_files = [
"${get_repo_root()}/config.tfvars"
]
}
extra_arguments "secrets" {
commands = get_terraform_commands_that_need_vars()
optional_var_files = [
"${get_repo_root()}/secrets.auto.tfvars.json"
]
}
# Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
before_hook "check_secrets" {
commands = ["apply", "plan", "destroy"]
execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
}
}
```
**Global decrypt-once wrapper** (run instead of raw terragrunt):
```bash
#!/usr/bin/env bash
# scripts/tg — wrapper: decrypt then terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
SOPS_FILE="$REPO_ROOT/secrets.sops.json"
OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"
if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
TEMP=$(mktemp "$OUT_FILE.XXXXXX")
trap "rm -f '$TEMP'" EXIT
sops -d "$SOPS_FILE" > "$TEMP"
mv "$TEMP" "$OUT_FILE"
echo "Decrypted secrets → secrets.auto.tfvars.json"
fi
exec terragrunt "$@"
```
Usage: `scripts/tg apply --non-interactive` instead of `terragrunt apply --non-interactive`.
**Why not before_hook/after_hook for decryption?** When using `run --all`, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.
**Why before_hook for the existence check?** It's read-only (just `test -f`) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
### 4. File Protection
**.gitignore** (add these entries):
```
/secrets.auto.tfvars.json
/secrets.auto.tfvars.json.*
```
**.gitattributes** changes (done atomically in Phase 4):
```
# KEEP for binary files
secrets/** filter=git-crypt diff=git-crypt
*.tfstate filter=git-crypt diff=git-crypt
# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
```
### 5. Woodpecker CI Pipeline Changes
**default.yml**:
```yaml
steps:
- name: prepare
image: alpine
commands:
- "apk update && apk add jq curl git git-crypt"
# git-crypt for secrets/ directory (TLS certs, deploy key)
# Note: K8s Secret .data values are base64-encoded by the API
- |
curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
-H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
| jq -r '.data.key' | base64 -d > /tmp/key
- "git-crypt unlock /tmp/key && rm /tmp/key"
# Install SOPS to workspace (shared across steps via workspace volume)
- "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
- "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417 ./sops' | sha256sum -c"
- "chmod +x ./sops"
# Write age key to file (Woodpecker from_secret injects as env var, not file)
- "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
- "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
- "shred -u /tmp/age-key.txt"
environment:
SOPS_AGE_KEY:
from_secret: sops_age_key # CI's age private key material
- name: terragrunt-plan
image: alpine
commands:
- "apk update && apk add curl unzip git openssh-client"
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
- "chmod 755 /usr/local/bin/terragrunt"
- "cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
when:
event: pull_request
- name: terragrunt-apply
image: alpine
commands:
- "apk update && apk add curl unzip git openssh-client"
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
- "chmod 755 /usr/local/bin/terragrunt"
- "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
when:
event: push
branch: master
- name: cleanup-and-push
image: alpine
commands:
- "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
- "apk update && apk add openssh-client git git-crypt"
- "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
- "chmod 400 secrets/deploy_key"
- "git add stacks/ state/ .woodpecker/ || true"
- "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
- "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
- "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
when:
- event: push
branch: master
- status: [success, failure] # Always clean up, even on failure
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
- status: [success, failure]
```
**renew-tls.yml** — ALSO update this pipeline:
- Change `git add .` to `git add secrets/ state/` in the `commit-certs` step
- Same defense-in-depth as default.yml
Key design decisions:
- `SOPS_AGE_KEY` (env var, not file) — Woodpecker `from_secret` only supports env vars. The prepare step writes it to a temp file, uses `SOPS_AGE_KEY_FILE`, then `shred`s the file
- SOPS binary in workspace (shared volume) — not per-container `/usr/local/bin/`
- `cleanup-and-push` runs on `status: [success, failure]` — always cleans up decrypted file
- `git add stacks/ state/ .woodpecker/` — never `git add .`
- Plan output filtered through `grep -v sensitive` — belt-and-suspenders with `sensitive = true`
### 6. Branch Protection (Required)
GitHub branch protection on `master`:
- **Require pull request reviews**: at least 1 reviewer (Viktor)
- **Restrict who can push**: Viktor only (direct push for `[ci skip]` commits)
- **Restrict who can dismiss reviews**: Viktor only
This prevents operators from modifying `.woodpecker/`, `terragrunt.hcl`, or `.sops.yaml` without review.
**Residual risk**: An operator can add `provisioner "local-exec" { command = "echo ${var.secret}" }` in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) `sensitive = true` hides values in plan output, (3) `local-exec` provisioners are unusual in this codebase and should be flagged during review.
### 7. K8s RBAC for Operators
Scoped operator role — no cluster-wide secrets access:
```hcl
resource "kubernetes_cluster_role" "operator" {
metadata { name = "cluster-operator" }
rule {
api_groups = [""]
resources = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
verbs = ["get", "list", "watch"]
}
rule {
api_groups = ["apps"]
resources = ["deployments", "statefulsets", "daemonsets", "replicasets"]
verbs = ["get", "list", "watch"]
}
}
# Per-namespace full access (edit role includes secrets within namespace — accepted residual risk)
resource "kubernetes_role_binding" "operator_namespace" {
for_each = toset(var.operator_namespaces)
metadata {
name = "operator-access"
namespace = each.value
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "edit"
}
subject {
kind = "Group"
name = "operators"
}
}
```
**Excluded namespaces** (never in `operator_namespaces`): `woodpecker`, `kube-system`, `dbaas`, `monitoring`, `authentik`.
### 8. Operator Workflow
**Setup (one-time)**: GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.
**Day-to-day**: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.
**kubectl**: `kubectl oidc-login` → Authentik → scoped to assigned namespaces.
**New secrets**: Comment on PR, Viktor adds to `secrets.sops.json`.
### 9. Migration Plan (Phased)
**Phase 1 — Setup tooling (no functional change)**
- Install `sops` and `age` locally (Docker)
- Generate age keys: Viktor + CI
- Store CI age key as Woodpecker secret (`sops_age_key`)
- Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
- Create `.sops.yaml` config file
- Add `/secrets.auto.tfvars.json` to `.gitignore`
- Create `scripts/tg` wrapper
- Backup Viktor's age private key to Vaultwarden
**Phase 2 — Create SOPS file alongside existing tfvars**
- Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
- Extract non-secret config into `config.tfvars` (plaintext)
- Extract secrets into `secrets.sops.json` (JSON, including complex types: maps, lists, nested objects)
- Encrypt with SOPS
- Verify round-trip: `sops -d secrets.sops.json | jq .` produces valid JSON
- Verify SSH keys: `sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f -`
- Verify complex types: `sops -d secrets.sops.json | jq '.mailserver_accounts'` returns expected map
- Add `sensitive = true` to ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
**Phase 3 — Switch terragrunt to SOPS**
- Update `terragrunt.hcl`: `config.tfvars` (required) + `secrets.auto.tfvars.json` (optional) + existence check hook
- Test: `scripts/tg apply --non-interactive` works per-stack
- Test: `scripts/tg run --all -- plan` works (no race condition)
- Test failure mode: delete `secrets.auto.tfvars.json`, verify `before_hook` fails loudly
**Phase 4 — Atomic cutover**
- Step 1: `git rm terraform.tfvars` (removes file while git-crypt filter still active — clean deletion)
- Step 2: Remove `*.tfvars filter=git-crypt` from `.gitattributes`
- Step 3: `git commit` both changes
**Phase 5 — Update CI pipelines**
- Update `.woodpecker/default.yml` with new pipeline
- Update `.woodpecker/renew-tls.yml`: change `git add .` to `git add secrets/ state/`
- Add `sops_age_key` Woodpecker secret
- Enable GitHub branch protection on master
- Test: CI pipeline applies successfully
**Phase 6 — Security hardening**
- Create scoped operator RBAC role
- Remove `secrets` from `power-user` ClusterRole
- Update CLAUDE.md and AGENTS.md documentation
**Phase 7 — Onboard operator**
- Add as GitHub collaborator
- Create Authentik account in "operators" group
- Walk through first PR workflow
### 10. Rollback Plan
- **Phase 1-2**: No functional change — delete SOPS artifacts
- **Phase 3**: Revert `terragrunt.hcl` to load `terraform.tfvars`
- **Phase 4+**: `git show HEAD~1:terraform.tfvars > terraform.tfvars`, re-add `.gitattributes` rule. Backfill any secrets added during SOPS period.
- Git-crypt stays functional for `secrets/` and `*.tfstate`
### 11. What Stays with git-crypt
- `secrets/` directory: TLS certs, deploy keys (binary)
- `*.tfstate` files: Terraform state
- git-crypt key: K8s **Secret** in `woodpecker` namespace (migrated from ConfigMap)
### 12. Security Considerations
- **Decrypted file**: temporary, `.gitignore`d, never staged by CI, cleaned up on success AND failure
- **CI staging**: `git add stacks/ state/ .woodpecker/` — never `git add .` (all pipelines)
- **Age key in CI**: `SOPS_AGE_KEY` env var → written to temp file → `SOPS_AGE_KEY_FILE``shred` after use
- **Age key backup**: Viktor's in Vaultwarden. CI's as Woodpecker secret
- **Branch protection**: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
- **RBAC**: Operator role excludes cluster-wide secrets. Namespace `edit` role allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik
- **Terraform variables**: `sensitive = true` on all secret vars — applied in Phase 2 BEFORE plan step is enabled
- **Plan output**: filtered through `grep -v sensitive` as belt-and-suspenders
- **`local-exec` exfiltration**: residual risk mitigated by PR review requirement — Viktor must review all PRs
- **State files**: contain secret values, git-crypt encrypted. Future: remote backend
- **Rotation**: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
- **Git history**: old `terraform.tfvars` remains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)