[ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)
Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption. Operators push PRs → Viktor reviews → CI applies. No encryption keys needed for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations (0 remaining CRITICALs).
This commit is contained in:
parent
5b28319bc3
commit
91d11e5cda
1 changed files with 366 additions and 0 deletions
366
docs/plans/2026-03-07-sops-migration-design.md
Normal file
366
docs/plans/2026-03-07-sops-migration-design.md
Normal file
|
|
@ -0,0 +1,366 @@
|
|||
# SOPS Multi-User Secrets Migration — Design Document (v3)
|
||||
|
||||
## Goal
|
||||
Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.
|
||||
|
||||
## Current State
|
||||
- **terraform.tfvars**: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
|
||||
- **secrets/**: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
|
||||
- **.gitattributes**: encrypts `*.tfvars`, `*.tfstate`, `secrets/**`
|
||||
- **Woodpecker CI**: unlocks git-crypt via K8s ConfigMap, applies `stacks/platform/` on push
|
||||
- **Terragrunt**: loads `terraform.tfvars` via `required_var_files` for all stacks
|
||||
|
||||
## Design
|
||||
|
||||
### 1. Split terraform.tfvars into Two Files
|
||||
|
||||
**`config.tfvars`** (NOT encrypted — committed in plaintext):
|
||||
Non-secret configuration that operators need to read/edit:
|
||||
- `nfs_server`, `redis_host`, `postgresql_host`, `mysql_host`, `ollama_host`, `mail_host`
|
||||
- `bind_db_viktorbarzin_me`, `bind_db_viktorbarzin_lan`, `bind_named_conf_options`
|
||||
- `tls_secret_name`, `client_certificate_secret_name`
|
||||
- WireGuard peer **public** keys and AllowedIPs only — **NOT** `wireguard_wg_0_conf` (contains private key inline), NOT any `PrivateKey` fields
|
||||
- Cloudflare DNS zone definitions (record names, not tokens)
|
||||
|
||||
**`secrets.sops.json`** (SOPS-encrypted, per-value, JSON format):
|
||||
All actual secrets, including complex types. JSON format chosen because:
|
||||
- `sops -d` outputs the same format as input — JSON in, JSON out
|
||||
- Terraform natively supports `*.auto.tfvars.json` files
|
||||
- JSON supports all Terraform types: strings, maps, lists, nested objects
|
||||
- No format conversion needed in the decryption pipeline
|
||||
|
||||
**Complex types** in JSON (these are NOT flat strings):
|
||||
```json
|
||||
{
|
||||
"hackmd_db_password": "simple-string-secret",
|
||||
"mailserver_accounts": {
|
||||
"info@viktorbarzin.me": "password1",
|
||||
"admin@viktorbarzin.me": "password2"
|
||||
},
|
||||
"homepage_credentials": {
|
||||
"technitium": {"token": "abc123"},
|
||||
"crowdsec": {"username": "user", "password": "pass"}
|
||||
},
|
||||
"k8s_users": {
|
||||
"viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
|
||||
},
|
||||
"xray_reality_clients": [
|
||||
{"id": "uuid-here", "flow": "xtls-rprx-vision"}
|
||||
],
|
||||
"webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
|
||||
"wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
|
||||
}
|
||||
```
|
||||
|
||||
### 2. SOPS Configuration
|
||||
|
||||
```yaml
|
||||
# .sops.yaml
|
||||
creation_rules:
|
||||
- path_regex: ^secrets\.sops\.json$
|
||||
age: >-
|
||||
age1viktor_public_key,
|
||||
age1ci_public_key
|
||||
```
|
||||
|
||||
Path regex anchored to repo root (`^`). All secrets encrypted to Viktor + CI.
|
||||
|
||||
### 3. Terragrunt Changes
|
||||
|
||||
```hcl
|
||||
# terragrunt.hcl — updated variable loading
|
||||
terraform {
|
||||
extra_arguments "common_vars" {
|
||||
commands = get_terraform_commands_that_need_vars()
|
||||
required_var_files = [
|
||||
"${get_repo_root()}/config.tfvars"
|
||||
]
|
||||
}
|
||||
|
||||
extra_arguments "secrets" {
|
||||
commands = get_terraform_commands_that_need_vars()
|
||||
optional_var_files = [
|
||||
"${get_repo_root()}/secrets.auto.tfvars.json"
|
||||
]
|
||||
}
|
||||
|
||||
# Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
|
||||
before_hook "check_secrets" {
|
||||
commands = ["apply", "plan", "destroy"]
|
||||
execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Global decrypt-once wrapper** (run instead of raw terragrunt):
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
# scripts/tg — wrapper: decrypt then terragrunt
|
||||
set -euo pipefail
|
||||
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
||||
SOPS_FILE="$REPO_ROOT/secrets.sops.json"
|
||||
OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"
|
||||
|
||||
if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
|
||||
TEMP=$(mktemp "$OUT_FILE.XXXXXX")
|
||||
trap "rm -f '$TEMP'" EXIT
|
||||
sops -d "$SOPS_FILE" > "$TEMP"
|
||||
mv "$TEMP" "$OUT_FILE"
|
||||
echo "Decrypted secrets → secrets.auto.tfvars.json"
|
||||
fi
|
||||
|
||||
exec terragrunt "$@"
|
||||
```
|
||||
|
||||
Usage: `scripts/tg apply --non-interactive` instead of `terragrunt apply --non-interactive`.
|
||||
|
||||
**Why not before_hook/after_hook for decryption?** When using `run --all`, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.
|
||||
|
||||
**Why before_hook for the existence check?** It's read-only (just `test -f`) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
|
||||
|
||||
### 4. File Protection
|
||||
|
||||
**.gitignore** (add these entries):
|
||||
```
|
||||
/secrets.auto.tfvars.json
|
||||
/secrets.auto.tfvars.json.*
|
||||
```
|
||||
|
||||
**.gitattributes** changes (done atomically in Phase 4):
|
||||
```
|
||||
# KEEP for binary files
|
||||
secrets/** filter=git-crypt diff=git-crypt
|
||||
*.tfstate filter=git-crypt diff=git-crypt
|
||||
|
||||
# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
|
||||
```
|
||||
|
||||
### 5. Woodpecker CI Pipeline Changes
|
||||
|
||||
**default.yml**:
|
||||
```yaml
|
||||
steps:
|
||||
- name: prepare
|
||||
image: alpine
|
||||
commands:
|
||||
- "apk update && apk add jq curl git git-crypt"
|
||||
# git-crypt for secrets/ directory (TLS certs, deploy key)
|
||||
# Note: K8s Secret .data values are base64-encoded by the API
|
||||
- |
|
||||
curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
|
||||
-H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
|
||||
| jq -r '.data.key' | base64 -d > /tmp/key
|
||||
- "git-crypt unlock /tmp/key && rm /tmp/key"
|
||||
# Install SOPS to workspace (shared across steps via workspace volume)
|
||||
- "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
|
||||
- "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417 ./sops' | sha256sum -c"
|
||||
- "chmod +x ./sops"
|
||||
# Write age key to file (Woodpecker from_secret injects as env var, not file)
|
||||
- "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
|
||||
- "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
|
||||
- "shred -u /tmp/age-key.txt"
|
||||
environment:
|
||||
SOPS_AGE_KEY:
|
||||
from_secret: sops_age_key # CI's age private key material
|
||||
|
||||
- name: terragrunt-plan
|
||||
image: alpine
|
||||
commands:
|
||||
- "apk update && apk add curl unzip git openssh-client"
|
||||
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
|
||||
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
|
||||
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
|
||||
- "chmod 755 /usr/local/bin/terragrunt"
|
||||
- "cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
|
||||
when:
|
||||
event: pull_request
|
||||
|
||||
- name: terragrunt-apply
|
||||
image: alpine
|
||||
commands:
|
||||
- "apk update && apk add curl unzip git openssh-client"
|
||||
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
|
||||
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
|
||||
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
|
||||
- "chmod 755 /usr/local/bin/terragrunt"
|
||||
- "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
|
||||
when:
|
||||
event: push
|
||||
branch: master
|
||||
|
||||
- name: cleanup-and-push
|
||||
image: alpine
|
||||
commands:
|
||||
- "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
|
||||
- "apk update && apk add openssh-client git git-crypt"
|
||||
- "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
|
||||
- "chmod 400 secrets/deploy_key"
|
||||
- "git add stacks/ state/ .woodpecker/ || true"
|
||||
- "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
|
||||
- "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
|
||||
- "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
|
||||
when:
|
||||
- event: push
|
||||
branch: master
|
||||
- status: [success, failure] # Always clean up, even on failure
|
||||
|
||||
- name: slack
|
||||
image: curlimages/curl
|
||||
commands:
|
||||
- |
|
||||
curl -s -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
|
||||
"$SLACK_WEBHOOK" || true
|
||||
environment:
|
||||
SLACK_WEBHOOK:
|
||||
from_secret: slack_webhook
|
||||
when:
|
||||
- status: [success, failure]
|
||||
```
|
||||
|
||||
**renew-tls.yml** — ALSO update this pipeline:
|
||||
- Change `git add .` to `git add secrets/ state/` in the `commit-certs` step
|
||||
- Same defense-in-depth as default.yml
|
||||
|
||||
Key design decisions:
|
||||
- `SOPS_AGE_KEY` (env var, not file) — Woodpecker `from_secret` only supports env vars. The prepare step writes it to a temp file, uses `SOPS_AGE_KEY_FILE`, then `shred`s the file
|
||||
- SOPS binary in workspace (shared volume) — not per-container `/usr/local/bin/`
|
||||
- `cleanup-and-push` runs on `status: [success, failure]` — always cleans up decrypted file
|
||||
- `git add stacks/ state/ .woodpecker/` — never `git add .`
|
||||
- Plan output filtered through `grep -v sensitive` — belt-and-suspenders with `sensitive = true`
|
||||
|
||||
### 6. Branch Protection (Required)
|
||||
|
||||
GitHub branch protection on `master`:
|
||||
- **Require pull request reviews**: at least 1 reviewer (Viktor)
|
||||
- **Restrict who can push**: Viktor only (direct push for `[ci skip]` commits)
|
||||
- **Restrict who can dismiss reviews**: Viktor only
|
||||
|
||||
This prevents operators from modifying `.woodpecker/`, `terragrunt.hcl`, or `.sops.yaml` without review.
|
||||
|
||||
**Residual risk**: An operator can add `provisioner "local-exec" { command = "echo ${var.secret}" }` in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) `sensitive = true` hides values in plan output, (3) `local-exec` provisioners are unusual in this codebase and should be flagged during review.
|
||||
|
||||
### 7. K8s RBAC for Operators
|
||||
|
||||
Scoped operator role — no cluster-wide secrets access:
|
||||
|
||||
```hcl
|
||||
resource "kubernetes_cluster_role" "operator" {
|
||||
metadata { name = "cluster-operator" }
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
|
||||
verbs = ["get", "list", "watch"]
|
||||
}
|
||||
rule {
|
||||
api_groups = ["apps"]
|
||||
resources = ["deployments", "statefulsets", "daemonsets", "replicasets"]
|
||||
verbs = ["get", "list", "watch"]
|
||||
}
|
||||
}
|
||||
|
||||
# Per-namespace full access (edit role includes secrets within namespace — accepted residual risk)
|
||||
resource "kubernetes_role_binding" "operator_namespace" {
|
||||
for_each = toset(var.operator_namespaces)
|
||||
metadata {
|
||||
name = "operator-access"
|
||||
namespace = each.value
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = "edit"
|
||||
}
|
||||
subject {
|
||||
kind = "Group"
|
||||
name = "operators"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Excluded namespaces** (never in `operator_namespaces`): `woodpecker`, `kube-system`, `dbaas`, `monitoring`, `authentik`.
|
||||
|
||||
### 8. Operator Workflow
|
||||
|
||||
**Setup (one-time)**: GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.
|
||||
|
||||
**Day-to-day**: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.
|
||||
|
||||
**kubectl**: `kubectl oidc-login` → Authentik → scoped to assigned namespaces.
|
||||
|
||||
**New secrets**: Comment on PR, Viktor adds to `secrets.sops.json`.
|
||||
|
||||
### 9. Migration Plan (Phased)
|
||||
|
||||
**Phase 1 — Setup tooling (no functional change)**
|
||||
- Install `sops` and `age` locally (Docker)
|
||||
- Generate age keys: Viktor + CI
|
||||
- Store CI age key as Woodpecker secret (`sops_age_key`)
|
||||
- Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
|
||||
- Create `.sops.yaml` config file
|
||||
- Add `/secrets.auto.tfvars.json` to `.gitignore`
|
||||
- Create `scripts/tg` wrapper
|
||||
- Backup Viktor's age private key to Vaultwarden
|
||||
|
||||
**Phase 2 — Create SOPS file alongside existing tfvars**
|
||||
- Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
|
||||
- Extract non-secret config into `config.tfvars` (plaintext)
|
||||
- Extract secrets into `secrets.sops.json` (JSON, including complex types: maps, lists, nested objects)
|
||||
- Encrypt with SOPS
|
||||
- Verify round-trip: `sops -d secrets.sops.json | jq .` produces valid JSON
|
||||
- Verify SSH keys: `sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f -`
|
||||
- Verify complex types: `sops -d secrets.sops.json | jq '.mailserver_accounts'` returns expected map
|
||||
- Add `sensitive = true` to ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
|
||||
|
||||
**Phase 3 — Switch terragrunt to SOPS**
|
||||
- Update `terragrunt.hcl`: `config.tfvars` (required) + `secrets.auto.tfvars.json` (optional) + existence check hook
|
||||
- Test: `scripts/tg apply --non-interactive` works per-stack
|
||||
- Test: `scripts/tg run --all -- plan` works (no race condition)
|
||||
- Test failure mode: delete `secrets.auto.tfvars.json`, verify `before_hook` fails loudly
|
||||
|
||||
**Phase 4 — Atomic cutover**
|
||||
- Step 1: `git rm terraform.tfvars` (removes file while git-crypt filter still active — clean deletion)
|
||||
- Step 2: Remove `*.tfvars filter=git-crypt` from `.gitattributes`
|
||||
- Step 3: `git commit` both changes
|
||||
|
||||
**Phase 5 — Update CI pipelines**
|
||||
- Update `.woodpecker/default.yml` with new pipeline
|
||||
- Update `.woodpecker/renew-tls.yml`: change `git add .` to `git add secrets/ state/`
|
||||
- Add `sops_age_key` Woodpecker secret
|
||||
- Enable GitHub branch protection on master
|
||||
- Test: CI pipeline applies successfully
|
||||
|
||||
**Phase 6 — Security hardening**
|
||||
- Create scoped operator RBAC role
|
||||
- Remove `secrets` from `power-user` ClusterRole
|
||||
- Update CLAUDE.md and AGENTS.md documentation
|
||||
|
||||
**Phase 7 — Onboard operator**
|
||||
- Add as GitHub collaborator
|
||||
- Create Authentik account in "operators" group
|
||||
- Walk through first PR workflow
|
||||
|
||||
### 10. Rollback Plan
|
||||
- **Phase 1-2**: No functional change — delete SOPS artifacts
|
||||
- **Phase 3**: Revert `terragrunt.hcl` to load `terraform.tfvars`
|
||||
- **Phase 4+**: `git show HEAD~1:terraform.tfvars > terraform.tfvars`, re-add `.gitattributes` rule. Backfill any secrets added during SOPS period.
|
||||
- Git-crypt stays functional for `secrets/` and `*.tfstate`
|
||||
|
||||
### 11. What Stays with git-crypt
|
||||
- `secrets/` directory: TLS certs, deploy keys (binary)
|
||||
- `*.tfstate` files: Terraform state
|
||||
- git-crypt key: K8s **Secret** in `woodpecker` namespace (migrated from ConfigMap)
|
||||
|
||||
### 12. Security Considerations
|
||||
- **Decrypted file**: temporary, `.gitignore`d, never staged by CI, cleaned up on success AND failure
|
||||
- **CI staging**: `git add stacks/ state/ .woodpecker/` — never `git add .` (all pipelines)
|
||||
- **Age key in CI**: `SOPS_AGE_KEY` env var → written to temp file → `SOPS_AGE_KEY_FILE` → `shred` after use
|
||||
- **Age key backup**: Viktor's in Vaultwarden. CI's as Woodpecker secret
|
||||
- **Branch protection**: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
|
||||
- **RBAC**: Operator role excludes cluster-wide secrets. Namespace `edit` role allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik
|
||||
- **Terraform variables**: `sensitive = true` on all secret vars — applied in Phase 2 BEFORE plan step is enabled
|
||||
- **Plan output**: filtered through `grep -v sensitive` as belt-and-suspenders
|
||||
- **`local-exec` exfiltration**: residual risk mitigated by PR review requirement — Viktor must review all PRs
|
||||
- **State files**: contain secret values, git-crypt encrypted. Future: remote backend
|
||||
- **Rotation**: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
|
||||
- **Git history**: old `terraform.tfvars` remains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)
|
||||
Loading…
Add table
Add a link
Reference in a new issue