Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption. Operators push PRs → Viktor reviews → CI applies. No encryption keys needed for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations (0 remaining CRITICALs).
16 KiB
SOPS Multi-User Secrets Migration — Design Document (v3)
Goal
Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.
Current State
- terraform.tfvars: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
- secrets/: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
- .gitattributes: encrypts
*.tfvars,*.tfstate,secrets/** - Woodpecker CI: unlocks git-crypt via K8s ConfigMap, applies
stacks/platform/on push - Terragrunt: loads
terraform.tfvarsviarequired_var_filesfor all stacks
Design
1. Split terraform.tfvars into Two Files
config.tfvars (NOT encrypted — committed in plaintext):
Non-secret configuration that operators need to read/edit:
nfs_server,redis_host,postgresql_host,mysql_host,ollama_host,mail_hostbind_db_viktorbarzin_me,bind_db_viktorbarzin_lan,bind_named_conf_optionstls_secret_name,client_certificate_secret_name- WireGuard peer public keys and AllowedIPs only — NOT
wireguard_wg_0_conf(contains private key inline), NOT anyPrivateKeyfields - Cloudflare DNS zone definitions (record names, not tokens)
secrets.sops.json (SOPS-encrypted, per-value, JSON format):
All actual secrets, including complex types. JSON format chosen because:
sops -doutputs the same format as input — JSON in, JSON out- Terraform natively supports
*.auto.tfvars.jsonfiles - JSON supports all Terraform types: strings, maps, lists, nested objects
- No format conversion needed in the decryption pipeline
Complex types in JSON (these are NOT flat strings):
{
"hackmd_db_password": "simple-string-secret",
"mailserver_accounts": {
"info@viktorbarzin.me": "password1",
"admin@viktorbarzin.me": "password2"
},
"homepage_credentials": {
"technitium": {"token": "abc123"},
"crowdsec": {"username": "user", "password": "pass"}
},
"k8s_users": {
"viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
},
"xray_reality_clients": [
{"id": "uuid-here", "flow": "xtls-rprx-vision"}
],
"webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
"wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
}
2. SOPS Configuration
# .sops.yaml
creation_rules:
- path_regex: ^secrets\.sops\.json$
age: >-
age1viktor_public_key,
age1ci_public_key
Path regex anchored to repo root (^). All secrets encrypted to Viktor + CI.
3. Terragrunt Changes
# terragrunt.hcl — updated variable loading
terraform {
extra_arguments "common_vars" {
commands = get_terraform_commands_that_need_vars()
required_var_files = [
"${get_repo_root()}/config.tfvars"
]
}
extra_arguments "secrets" {
commands = get_terraform_commands_that_need_vars()
optional_var_files = [
"${get_repo_root()}/secrets.auto.tfvars.json"
]
}
# Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
before_hook "check_secrets" {
commands = ["apply", "plan", "destroy"]
execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
}
}
Global decrypt-once wrapper (run instead of raw terragrunt):
#!/usr/bin/env bash
# scripts/tg — wrapper: decrypt then terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
SOPS_FILE="$REPO_ROOT/secrets.sops.json"
OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"
if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
TEMP=$(mktemp "$OUT_FILE.XXXXXX")
trap "rm -f '$TEMP'" EXIT
sops -d "$SOPS_FILE" > "$TEMP"
mv "$TEMP" "$OUT_FILE"
echo "Decrypted secrets → secrets.auto.tfvars.json"
fi
exec terragrunt "$@"
Usage: scripts/tg apply --non-interactive instead of terragrunt apply --non-interactive.
Why not before_hook/after_hook for decryption? When using run --all, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.
Why before_hook for the existence check? It's read-only (just test -f) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
4. File Protection
.gitignore (add these entries):
/secrets.auto.tfvars.json
/secrets.auto.tfvars.json.*
.gitattributes changes (done atomically in Phase 4):
# KEEP for binary files
secrets/** filter=git-crypt diff=git-crypt
*.tfstate filter=git-crypt diff=git-crypt
# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
5. Woodpecker CI Pipeline Changes
default.yml:
steps:
- name: prepare
image: alpine
commands:
- "apk update && apk add jq curl git git-crypt"
# git-crypt for secrets/ directory (TLS certs, deploy key)
# Note: K8s Secret .data values are base64-encoded by the API
- |
curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
-H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
| jq -r '.data.key' | base64 -d > /tmp/key
- "git-crypt unlock /tmp/key && rm /tmp/key"
# Install SOPS to workspace (shared across steps via workspace volume)
- "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
- "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417 ./sops' | sha256sum -c"
- "chmod +x ./sops"
# Write age key to file (Woodpecker from_secret injects as env var, not file)
- "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
- "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
- "shred -u /tmp/age-key.txt"
environment:
SOPS_AGE_KEY:
from_secret: sops_age_key # CI's age private key material
- name: terragrunt-plan
image: alpine
commands:
- "apk update && apk add curl unzip git openssh-client"
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
- "chmod 755 /usr/local/bin/terragrunt"
- "cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
when:
event: pull_request
- name: terragrunt-apply
image: alpine
commands:
- "apk update && apk add curl unzip git openssh-client"
- "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
- "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
- "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
- "chmod 755 /usr/local/bin/terragrunt"
- "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
when:
event: push
branch: master
- name: cleanup-and-push
image: alpine
commands:
- "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
- "apk update && apk add openssh-client git git-crypt"
- "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
- "chmod 400 secrets/deploy_key"
- "git add stacks/ state/ .woodpecker/ || true"
- "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
- "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
- "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
when:
- event: push
branch: master
- status: [success, failure] # Always clean up, even on failure
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
- status: [success, failure]
renew-tls.yml — ALSO update this pipeline:
- Change
git add .togit add secrets/ state/in thecommit-certsstep - Same defense-in-depth as default.yml
Key design decisions:
SOPS_AGE_KEY(env var, not file) — Woodpeckerfrom_secretonly supports env vars. The prepare step writes it to a temp file, usesSOPS_AGE_KEY_FILE, thenshreds the file- SOPS binary in workspace (shared volume) — not per-container
/usr/local/bin/ cleanup-and-pushruns onstatus: [success, failure]— always cleans up decrypted filegit add stacks/ state/ .woodpecker/— nevergit add .- Plan output filtered through
grep -v sensitive— belt-and-suspenders withsensitive = true
6. Branch Protection (Required)
GitHub branch protection on master:
- Require pull request reviews: at least 1 reviewer (Viktor)
- Restrict who can push: Viktor only (direct push for
[ci skip]commits) - Restrict who can dismiss reviews: Viktor only
This prevents operators from modifying .woodpecker/, terragrunt.hcl, or .sops.yaml without review.
Residual risk: An operator can add provisioner "local-exec" { command = "echo ${var.secret}" } in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) sensitive = true hides values in plan output, (3) local-exec provisioners are unusual in this codebase and should be flagged during review.
7. K8s RBAC for Operators
Scoped operator role — no cluster-wide secrets access:
resource "kubernetes_cluster_role" "operator" {
metadata { name = "cluster-operator" }
rule {
api_groups = [""]
resources = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
verbs = ["get", "list", "watch"]
}
rule {
api_groups = ["apps"]
resources = ["deployments", "statefulsets", "daemonsets", "replicasets"]
verbs = ["get", "list", "watch"]
}
}
# Per-namespace full access (edit role includes secrets within namespace — accepted residual risk)
resource "kubernetes_role_binding" "operator_namespace" {
for_each = toset(var.operator_namespaces)
metadata {
name = "operator-access"
namespace = each.value
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "edit"
}
subject {
kind = "Group"
name = "operators"
}
}
Excluded namespaces (never in operator_namespaces): woodpecker, kube-system, dbaas, monitoring, authentik.
8. Operator Workflow
Setup (one-time): GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.
Day-to-day: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.
kubectl: kubectl oidc-login → Authentik → scoped to assigned namespaces.
New secrets: Comment on PR, Viktor adds to secrets.sops.json.
9. Migration Plan (Phased)
Phase 1 — Setup tooling (no functional change)
- Install
sopsandagelocally (Docker) - Generate age keys: Viktor + CI
- Store CI age key as Woodpecker secret (
sops_age_key) - Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
- Create
.sops.yamlconfig file - Add
/secrets.auto.tfvars.jsonto.gitignore - Create
scripts/tgwrapper - Backup Viktor's age private key to Vaultwarden
Phase 2 — Create SOPS file alongside existing tfvars
- Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
- Extract non-secret config into
config.tfvars(plaintext) - Extract secrets into
secrets.sops.json(JSON, including complex types: maps, lists, nested objects) - Encrypt with SOPS
- Verify round-trip:
sops -d secrets.sops.json | jq .produces valid JSON - Verify SSH keys:
sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f - - Verify complex types:
sops -d secrets.sops.json | jq '.mailserver_accounts'returns expected map - Add
sensitive = trueto ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
Phase 3 — Switch terragrunt to SOPS
- Update
terragrunt.hcl:config.tfvars(required) +secrets.auto.tfvars.json(optional) + existence check hook - Test:
scripts/tg apply --non-interactiveworks per-stack - Test:
scripts/tg run --all -- planworks (no race condition) - Test failure mode: delete
secrets.auto.tfvars.json, verifybefore_hookfails loudly
Phase 4 — Atomic cutover
- Step 1:
git rm terraform.tfvars(removes file while git-crypt filter still active — clean deletion) - Step 2: Remove
*.tfvars filter=git-cryptfrom.gitattributes - Step 3:
git commitboth changes
Phase 5 — Update CI pipelines
- Update
.woodpecker/default.ymlwith new pipeline - Update
.woodpecker/renew-tls.yml: changegit add .togit add secrets/ state/ - Add
sops_age_keyWoodpecker secret - Enable GitHub branch protection on master
- Test: CI pipeline applies successfully
Phase 6 — Security hardening
- Create scoped operator RBAC role
- Remove
secretsfrompower-userClusterRole - Update CLAUDE.md and AGENTS.md documentation
Phase 7 — Onboard operator
- Add as GitHub collaborator
- Create Authentik account in "operators" group
- Walk through first PR workflow
10. Rollback Plan
- Phase 1-2: No functional change — delete SOPS artifacts
- Phase 3: Revert
terragrunt.hclto loadterraform.tfvars - Phase 4+:
git show HEAD~1:terraform.tfvars > terraform.tfvars, re-add.gitattributesrule. Backfill any secrets added during SOPS period. - Git-crypt stays functional for
secrets/and*.tfstate
11. What Stays with git-crypt
secrets/directory: TLS certs, deploy keys (binary)*.tfstatefiles: Terraform state- git-crypt key: K8s Secret in
woodpeckernamespace (migrated from ConfigMap)
12. Security Considerations
- Decrypted file: temporary,
.gitignored, never staged by CI, cleaned up on success AND failure - CI staging:
git add stacks/ state/ .woodpecker/— nevergit add .(all pipelines) - Age key in CI:
SOPS_AGE_KEYenv var → written to temp file →SOPS_AGE_KEY_FILE→shredafter use - Age key backup: Viktor's in Vaultwarden. CI's as Woodpecker secret
- Branch protection: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
- RBAC: Operator role excludes cluster-wide secrets. Namespace
editrole allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik - Terraform variables:
sensitive = trueon all secret vars — applied in Phase 2 BEFORE plan step is enabled - Plan output: filtered through
grep -v sensitiveas belt-and-suspenders local-execexfiltration: residual risk mitigated by PR review requirement — Viktor must review all PRs- State files: contain secret values, git-crypt encrypted. Future: remote backend
- Rotation: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
- Git history: old
terraform.tfvarsremains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)