infra/docs/plans/2026-03-07-sops-migration-design.md
Viktor Barzin 91d11e5cda [ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x)
Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption.
Operators push PRs → Viktor reviews → CI applies. No encryption keys needed
for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations
(0 remaining CRITICALs).
2026-03-07 13:55:05 +00:00

SOPS Multi-User Secrets Migration — Design Document (v3)

Goal

Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.

Current State

  • terraform.tfvars: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
  • secrets/: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
  • .gitattributes: encrypts *.tfvars, *.tfstate, secrets/**
  • Woodpecker CI: unlocks git-crypt via K8s ConfigMap, applies stacks/platform/ on push
  • Terragrunt: loads terraform.tfvars via required_var_files for all stacks

Design

1. Split terraform.tfvars into Two Files

config.tfvars (NOT encrypted — committed in plaintext): Non-secret configuration that operators need to read/edit:

  • nfs_server, redis_host, postgresql_host, mysql_host, ollama_host, mail_host
  • bind_db_viktorbarzin_me, bind_db_viktorbarzin_lan, bind_named_conf_options
  • tls_secret_name, client_certificate_secret_name
  • WireGuard peer public keys and AllowedIPs only — NOT wireguard_wg_0_conf (contains private key inline), NOT any PrivateKey fields
  • Cloudflare DNS zone definitions (record names, not tokens)
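Because config.tfvars is committed in plaintext, a cheap guard against misclassified values may be worth running before each commit. The following is a minimal sketch and not part of the design itself: the function name check_plaintext_config and the pattern list are assumptions and would need tuning to the real variable names.

```shell
#!/usr/bin/env bash
# Hypothetical guard: fail if a plaintext tfvars file contains secret-looking content.
# The pattern list is an illustrative assumption, not an exhaustive secret detector.
check_plaintext_config() {
  local file="$1"
  # -E extended regex, -i case-insensitive, -q quiet (never echo the matching line)
  if grep -Eiq 'PrivateKey|BEGIN [A-Z ]*PRIVATE KEY|password|token' "$file"; then
    echo "ERROR: $file contains secret-looking content" >&2
    return 1
  fi
  return 0
}
```

A call such as check_plaintext_config config.tfvars could run from a pre-commit hook. Names like tls_secret_name pass because the patterns target key material and credentials, not the word "secret".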

secrets.sops.json (SOPS-encrypted, per-value, JSON format): All actual secrets, including complex types. JSON format chosen because:

  • sops -d outputs the same format as input — JSON in, JSON out
  • Terraform natively supports *.auto.tfvars.json files
  • JSON supports all Terraform types: strings, maps, lists, nested objects
  • No format conversion needed in the decryption pipeline

Complex types in JSON (these are NOT flat strings):

{
  "hackmd_db_password": "simple-string-secret",
  "mailserver_accounts": {
    "info@viktorbarzin.me": "password1",
    "admin@viktorbarzin.me": "password2"
  },
  "homepage_credentials": {
    "technitium": {"token": "abc123"},
    "crowdsec": {"username": "user", "password": "pass"}
  },
  "k8s_users": {
    "viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
  },
  "xray_reality_clients": [
    {"id": "uuid-here", "flow": "xtls-rprx-vision"}
  ],
  "webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
  "wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
}
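Multi-line values such as the SSH key and WireGuard config above have to be stored as single JSON strings with \n escapes. A minimal sketch of producing such a string from a file with standard tools follows. The helper name to_json_string is hypothetical, and it only escapes backslashes, double quotes, and newlines (tabs and other control characters are not handled), which should be enough for PEM keys and typical WireGuard configs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the contents of a text file as one JSON string literal.
to_json_string() {
  local file="$1"
  printf '"'
  # Escape backslashes first, then double quotes; awk then ends every line
  # with a literal backslash-n instead of a real newline.
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$file" | awk '{ printf "%s\\n", $0 }'
  printf '"\n'
}
```

The output can be pasted as the value of a key while editing the secrets file. Every input line, including the last, ends up terminated by \n, matching the trailing \n in the SSH key example.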

2. SOPS Configuration

# .sops.yaml
creation_rules:
  - path_regex: ^secrets\.sops\.json$
    age: >-
      age1viktor_public_key,
      age1ci_public_key

Path regex anchored to repo root (^). All secrets encrypted to Viktor + CI.
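To illustrate what the anchors buy, the sketch below applies the same regex with grep -E to a few candidate paths. It assumes SOPS evaluates path_regex against the path relative to the directory holding .sops.yaml, and that grep's extended regex syntax agrees with SOPS's regex engine for this simple pattern. The function name matches_creation_rule and the sample paths are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: would this repo-relative path be covered by the creation rule?
matches_creation_rule() {
  printf '%s\n' "$1" | grep -Eq '^secrets\.sops\.json$'
}

for path in secrets.sops.json stacks/platform/secrets.sops.json secrets.sops.json.bak; do
  if matches_creation_rule "$path"; then
    echo "rule matches:    $path"
  else
    echo "no rule matches: $path"
  fi
done
```

Under these assumptions only the root-level file matches; a copy in a subdirectory or a file with an extra suffix would not pick up the rule.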

3. Terragrunt Changes

# terragrunt.hcl — updated variable loading
terraform {
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/config.tfvars"
    ]
  }

  extra_arguments "secrets" {
    commands = get_terraform_commands_that_need_vars()
    optional_var_files = [
      "${get_repo_root()}/secrets.auto.tfvars.json"
    ]
  }

  # Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
  before_hook "check_secrets" {
    commands = ["apply", "plan", "destroy"]
    execute  = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
  }
}

Global decrypt-once wrapper (run instead of raw terragrunt):

#!/usr/bin/env bash
# scripts/tg — wrapper: decrypt then terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
SOPS_FILE="$REPO_ROOT/secrets.sops.json"
OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"

if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
  TEMP=$(mktemp "$OUT_FILE.XXXXXX")
  trap "rm -f '$TEMP'" EXIT
  sops -d "$SOPS_FILE" > "$TEMP"
  mv "$TEMP" "$OUT_FILE"
  echo "Decrypted secrets → secrets.auto.tfvars.json"
fi

exec terragrunt "$@"

Usage: scripts/tg apply --non-interactive instead of terragrunt apply --non-interactive.

Why not before_hook/after_hook for decryption? When using run --all, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.

Why before_hook for the existence check? It's read-only (just test -f) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
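One consequence of decrypting only when the output file is missing is that the decrypted copy can go stale after secrets.sops.json changes (for example after a git pull). A possible refinement, sketched here as an assumption and not as part of the reviewed design, is to also decrypt when the encrypted file is newer than the decrypted one. The function name needs_decrypt is hypothetical, and the -nt test is a bash extension to POSIX test.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when the decrypted file is missing or stale.
needs_decrypt() {
  local sops_file="$1" out_file="$2"
  if [ ! -f "$sops_file" ]; then
    return 1                      # nothing to decrypt from
  fi
  if [ ! -f "$out_file" ]; then
    return 0                      # no decrypted copy yet
  fi
  # Stale when the encrypted file was modified after the last decryption.
  [ "$sops_file" -nt "$out_file" ]
}

# Sketch of use inside scripts/tg, replacing the plain existence test:
#   if needs_decrypt "$SOPS_FILE" "$OUT_FILE"; then ...decrypt as in the wrapper...; fi
```

This relies on file modification times, so it is a heuristic: it catches the common case of a checkout updating secrets.sops.json, but not clock skew or manually touched files.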

4. File Protection

.gitignore (add these entries):

/secrets.auto.tfvars.json
/secrets.auto.tfvars.json.*

.gitattributes changes (done atomically in Phase 4):

# KEEP for binary files
secrets/** filter=git-crypt diff=git-crypt
*.tfstate filter=git-crypt diff=git-crypt

# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
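As an additional safety net on top of .gitignore, a pre-commit check could refuse any staged path that looks like the decrypted file or one of its temp files. This is a sketch under assumptions: the function name reject_decrypted_paths is hypothetical, and wiring it into a hook is left to the repository.

```shell
#!/usr/bin/env bash
# Hypothetical guard: reads repo-relative paths on stdin (one per line) and fails
# if any of them is the decrypted secrets file or a temp file derived from it.
reject_decrypted_paths() {
  local path status=0
  while IFS= read -r path; do
    case "$path" in
      secrets.auto.tfvars.json|secrets.auto.tfvars.json.*)
        echo "refusing to commit decrypted secrets: $path" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}

# Possible use in a pre-commit hook:
#   git diff --cached --name-only | reject_decrypted_paths
```

The patterns mirror the two .gitignore entries, so the guard only matters if someone force-adds the file with git add -f.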

5. Woodpecker CI Pipeline Changes

default.yml:

steps:
  - name: prepare
    image: alpine
    commands:
      - "apk update && apk add jq curl git git-crypt"
      # git-crypt for secrets/ directory (TLS certs, deploy key)
      # Note: K8s Secret .data values are base64-encoded by the API
      - |
        curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
          -H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          | jq -r '.data.key' | base64 -d > /tmp/key
      - "git-crypt unlock /tmp/key && rm /tmp/key"
      # Install SOPS to workspace (shared across steps via workspace volume)
      - "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
      - "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417  ./sops' | sha256sum -c"
      - "chmod +x ./sops"
      # Write age key to file (Woodpecker from_secret injects as env var, not file)
      - "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
      - "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
      - "shred -u /tmp/age-key.txt"
    environment:
      SOPS_AGE_KEY:
        from_secret: sops_age_key  # CI's age private key material

  - name: terragrunt-plan
    image: alpine
    commands:
      - "apk update && apk add curl unzip git openssh-client"
      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
      - "chmod 755 /usr/local/bin/terragrunt"
      - "cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
    when:
      event: pull_request

  - name: terragrunt-apply
    image: alpine
    commands:
      - "apk update && apk add curl unzip git openssh-client"
      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
      - "chmod 755 /usr/local/bin/terragrunt"
      - "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
    when:
      event: push
      branch: master

  - name: cleanup-and-push
    image: alpine
    commands:
      - "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
      - "apk update && apk add openssh-client git git-crypt"
      - "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
      - "chmod 400 secrets/deploy_key"
      - "git add stacks/ state/ .woodpecker/ || true"
      - "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
      - "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
      - "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
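    when:
      # One entry only: Woodpecker ORs separate list entries, so a standalone status entry would match every event
      - event: push
        branch: master
        status: [success, failure]  # Always clean up, even on failure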

  - name: slack
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      - status: [success, failure]

renew-tls.yml — ALSO update this pipeline:

  • Change git add . to git add secrets/ state/ in the commit-certs step
  • Same defense-in-depth as default.yml

Key design decisions:

  • SOPS_AGE_KEY (env var, not file) — Woodpecker from_secret only supports env vars. The prepare step writes it to a temp file, uses SOPS_AGE_KEY_FILE, then shreds the file
  • SOPS binary in workspace (shared volume) — not per-container /usr/local/bin/
  • cleanup-and-push runs on status: [success, failure] — always cleans up decrypted file
  • git add stacks/ state/ .woodpecker/ — never git add .
  • Plan output filtered through grep -v sensitive — belt-and-suspenders with sensitive = true
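One caveat with filtering plan output through grep: in a shell without pipefail, a pipeline reports the exit status of its last command, so a failed plan piped into grep -v can still look successful. The sketch below shows one way to keep the filter while preserving the plan's exit status. The function name plan_filtered is hypothetical, and it assumes a shell that supports set -o pipefail (bash does).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, drop output lines containing "sensitive",
# and return the command's own exit status instead of grep's.
plan_filtered() (
  # Runs in a subshell so pipefail does not leak into the caller.
  set -o pipefail
  # "|| true" keeps grep's "no lines selected" status (1) from failing the pipeline.
  "$@" 2>&1 | { grep -v 'sensitive' || true; }
)

# Sketch of use in the plan step:
#   plan_filtered terragrunt plan --non-interactive -out=tfplan
```

With this shape, a non-zero exit from the wrapped command fails the CI step, while the filtering behaves as before.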

6. Branch Protection (Required)

GitHub branch protection on master:

  • Require pull request reviews: at least 1 reviewer (Viktor)
  • Restrict who can push: Viktor only (direct push for [ci skip] commits)
  • Restrict who can dismiss reviews: Viktor only

This prevents operators from modifying .woodpecker/, terragrunt.hcl, or .sops.yaml without review.

Residual risk: An operator can add provisioner "local-exec" { command = "echo ${var.secret}" } in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) sensitive = true hides values in plan output, (3) local-exec provisioners are unusual in this codebase and should be flagged during review.

7. K8s RBAC for Operators

Scoped operator role — no cluster-wide secrets access:

resource "kubernetes_cluster_role" "operator" {
  metadata { name = "cluster-operator" }
  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
    verbs      = ["get", "list", "watch"]
  }
  rule {
    api_groups = ["apps"]
    resources  = ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs      = ["get", "list", "watch"]
  }
}

# Per-namespace full access (edit role includes secrets within namespace — accepted residual risk)
resource "kubernetes_role_binding" "operator_namespace" {
  for_each = toset(var.operator_namespaces)
  metadata {
    name      = "operator-access"
    namespace = each.value
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "edit"
  }
  subject {
    kind = "Group"
    name = "operators"
  }
}

Excluded namespaces (never in operator_namespaces): woodpecker, kube-system, dbaas, monitoring, authentik.
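The exclusion list is only a convention unless something enforces it. A minimal sketch of a check that could run in CI or during review follows; the function name check_operator_namespaces is hypothetical, and how the namespace list is extracted from the Terraform variable is left open.

```shell
#!/usr/bin/env bash
# Namespaces that must never be granted to operators (taken from the list above).
EXCLUDED_NAMESPACES="woodpecker kube-system dbaas monitoring authentik"

# Hypothetical check: takes the proposed operator namespaces as arguments and
# fails if any of them is on the exclusion list.
check_operator_namespaces() {
  local ns excluded status=0
  for ns in "$@"; do
    for excluded in $EXCLUDED_NAMESPACES; do
      if [ "$ns" = "$excluded" ]; then
        echo "ERROR: '$ns' must never be in operator_namespaces" >&2
        status=1
      fi
    done
  done
  return "$status"
}
```

A Terraform validation block on var.operator_namespaces would be another place to enforce the same rule; the shell version is just the smallest self-contained illustration.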

8. Operator Workflow

Setup (one-time): GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.

Day-to-day: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.

kubectl: kubectl oidc-login → Authentik → scoped to assigned namespaces.

New secrets: Comment on PR, Viktor adds to secrets.sops.json.

9. Migration Plan (Phased)

Phase 1 — Setup tooling (no functional change)

  • Install sops and age locally (Docker)
  • Generate age keys: Viktor + CI
  • Store CI age key as Woodpecker secret (sops_age_key)
  • Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
  • Create .sops.yaml config file
  • Add /secrets.auto.tfvars.json to .gitignore
  • Create scripts/tg wrapper
  • Backup Viktor's age private key to Vaultwarden

Phase 2 — Create SOPS file alongside existing tfvars

  • Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
  • Extract non-secret config into config.tfvars (plaintext)
  • Extract secrets into secrets.sops.json (JSON, including complex types: maps, lists, nested objects)
  • Encrypt with SOPS
  • Verify round-trip: sops -d secrets.sops.json | jq . produces valid JSON
  • Verify SSH keys: sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f -
  • Verify complex types: sops -d secrets.sops.json | jq '.mailserver_accounts' returns expected map
  • Add sensitive = true to ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
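The last item is easy to get wrong across many stacks, so a mechanical check may help. The sketch below reports which of a given set of variable names are declared without sensitive = true in one .tf file. It is an assumption-laden illustration: the function name vars_missing_sensitive is hypothetical, and it expects terraform fmt style multi-line blocks, each opening with variable "name" { on its own line and closing with } in column 0.

```shell
#!/usr/bin/env bash
# Hypothetical check: vars_missing_sensitive FILE NAME...
# Prints each NAME that is declared in FILE without "sensitive = true".
vars_missing_sensitive() {
  local file="$1"
  shift
  awk -v wanted="$*" '
    BEGIN {
      n = split(wanted, names, " ")
      for (i = 1; i <= n; i++) secret[names[i]] = 1
    }
    # Start of a variable block: remember its name, reset the marker.
    /^variable[ \t]+"/ {
      name = $2
      gsub(/"/, "", name)
      in_block = 1
      marked = 0
      next
    }
    in_block && /sensitive[ \t]*=[ \t]*true/ { marked = 1 }
    # A closing brace in column 0 ends the block.
    in_block && /^}/ {
      if ((name in secret) && !marked) print name
      in_block = 0
    }
  ' "$file"
}
```

Empty output means every listed variable found in the file is marked sensitive. Names that are not declared in the file are silently ignored, so the list of secret names can be shared across stacks.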

Phase 3 — Switch terragrunt to SOPS

  • Update terragrunt.hcl: config.tfvars (required) + secrets.auto.tfvars.json (optional) + existence check hook
  • Test: scripts/tg apply --non-interactive works per-stack
  • Test: scripts/tg run --all -- plan works (no race condition)
  • Test failure mode: delete secrets.auto.tfvars.json, verify before_hook fails loudly

Phase 4 — Atomic cutover

  • Step 1: git rm terraform.tfvars (removes file while git-crypt filter still active — clean deletion)
  • Step 2: Remove *.tfvars filter=git-crypt from .gitattributes
  • Step 3: git commit both changes
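Step 2 can be done by hand, but a scripted edit reduces the chance of removing the wrong rule. A minimal sketch follows; the function name drop_tfvars_filter is hypothetical, and the git commands in the trailing comments simply restate the three steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove exactly the *.tfvars git-crypt rule from a
# .gitattributes file, leaving every other line untouched.
drop_tfvars_filter() {
  local file="$1" tmp
  tmp=$(mktemp "$file.XXXXXX")
  # -F fixed string, -x whole-line match, -v keep the lines that do NOT match.
  # grep exits 1 when nothing is left to print, hence "|| true".
  grep -Fxv '*.tfvars filter=git-crypt diff=git-crypt' "$file" > "$tmp" || true
  # Write back in place so the original file mode is preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Sketch of the cutover sequence:
#   git rm terraform.tfvars
#   drop_tfvars_filter .gitattributes
#   git add .gitattributes
#   git commit
```

Matching the whole line as a fixed string means the secrets/** and *.tfstate rules cannot be removed by accident.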

Phase 5 — Update CI pipelines

  • Update .woodpecker/default.yml with new pipeline
  • Update .woodpecker/renew-tls.yml: change git add . to git add secrets/ state/
  • Add sops_age_key Woodpecker secret
  • Enable GitHub branch protection on master
  • Test: CI pipeline applies successfully

Phase 6 — Security hardening

  • Create scoped operator RBAC role
  • Remove secrets from power-user ClusterRole
  • Update CLAUDE.md and AGENTS.md documentation

Phase 7 — Onboard operator

  • Add as GitHub collaborator
  • Create Authentik account in "operators" group
  • Walk through first PR workflow

10. Rollback Plan

  • Phase 1-2: No functional change — delete SOPS artifacts
  • Phase 3: Revert terragrunt.hcl to load terraform.tfvars
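  • Phase 4+: Re-add the *.tfvars rule to .gitattributes, then restore the file from the commit before the cutover with git checkout <pre-cutover-commit> -- terraform.tfvars. HEAD~1 only points at that commit while the cutover is still the latest commit, and git show HEAD~1:terraform.tfvars prints the stored git-crypt blob, whereas checkout runs the decrypting smudge filter. Backfill any secrets added during the SOPS period.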
  • Git-crypt stays functional for secrets/ and *.tfstate

11. What Stays with git-crypt

  • secrets/ directory: TLS certs, deploy keys (binary)
  • *.tfstate files: Terraform state
  • git-crypt key: K8s Secret in woodpecker namespace (migrated from ConfigMap)

12. Security Considerations

  • Decrypted file: temporary, .gitignored, never staged by CI, cleaned up on success AND failure
  • CI staging: git add stacks/ state/ .woodpecker/ — never git add . (all pipelines)
  • Age key in CI: SOPS_AGE_KEY env var → written to temp file → SOPS_AGE_KEY_FILE → shred after use
  • Age key backup: Viktor's in Vaultwarden. CI's as Woodpecker secret
  • Branch protection: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
  • RBAC: Operator role excludes cluster-wide secrets. Namespace edit role allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik
  • Terraform variables: sensitive = true on all secret vars — applied in Phase 2 BEFORE plan step is enabled
  • Plan output: filtered through grep -v sensitive as belt-and-suspenders
  • local-exec exfiltration: residual risk mitigated by PR review requirement — Viktor must review all PRs
  • State files: contain secret values, git-crypt encrypted. Future: remote backend
  • Rotation: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
  • Git history: old terraform.tfvars remains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)