diff --git a/docs/plans/2026-03-07-sops-migration-design.md b/docs/plans/2026-03-07-sops-migration-design.md
new file mode 100644
index 00000000..b7a762ca
--- /dev/null
+++ b/docs/plans/2026-03-07-sops-migration-design.md
@@ -0,0 +1,366 @@
+# SOPS Multi-User Secrets Migration — Design Document (v3)
+
+## Goal
+Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.
+
+## Current State
+- **terraform.tfvars**: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
+- **secrets/**: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
+- **.gitattributes**: encrypts `*.tfvars`, `*.tfstate`, `secrets/**`
+- **Woodpecker CI**: unlocks git-crypt via K8s ConfigMap, applies `stacks/platform/` on push
+- **Terragrunt**: loads `terraform.tfvars` via `required_var_files` for all stacks
+
+## Design
+
+### 1. Split terraform.tfvars into Two Files
+
+**`config.tfvars`** (NOT encrypted — committed in plaintext):
+Non-secret configuration that operators need to read/edit:
+- `nfs_server`, `redis_host`, `postgresql_host`, `mysql_host`, `ollama_host`, `mail_host`
+- `bind_db_viktorbarzin_me`, `bind_db_viktorbarzin_lan`, `bind_named_conf_options`
+- `tls_secret_name`, `client_certificate_secret_name`
+- WireGuard peer **public** keys and AllowedIPs only — **NOT** `wireguard_wg_0_conf` (contains private key inline), NOT any `PrivateKey` fields
+- Cloudflare DNS zone definitions (record names, not tokens)
+
+**`secrets.sops.json`** (SOPS-encrypted, per-value, JSON format):
+All actual secrets, including complex types. JSON format chosen because:
+- `sops -d` outputs the same format as input — JSON in, JSON out
+- Terraform natively supports `*.auto.tfvars.json` files
+- JSON supports all Terraform types: strings, maps, lists, nested objects
+- No format conversion needed in the decryption pipeline
+
+**Complex types** in JSON (these are NOT flat strings):
+```json
+{
+  "hackmd_db_password": "simple-string-secret",
+  "mailserver_accounts": {
+    "info@viktorbarzin.me": "password1",
+    "admin@viktorbarzin.me": "password2"
+  },
+  "homepage_credentials": {
+    "technitium": {"token": "abc123"},
+    "crowdsec": {"username": "user", "password": "pass"}
+  },
+  "k8s_users": {
+    "viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
+  },
+  "xray_reality_clients": [
+    {"id": "uuid-here", "flow": "xtls-rprx-vision"}
+  ],
+  "webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
+  "wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
+}
+```
+
+### 2. SOPS Configuration
+
+```yaml
+# .sops.yaml
+creation_rules:
+  - path_regex: ^secrets\.sops\.json$
+    age: >-
+      age1viktor_public_key,
+      age1ci_public_key
+```
+
+Path regex anchored to repo root (`^`). All secrets encrypted to Viktor + CI.
+
+### 3. Terragrunt Changes
+
+```hcl
+# terragrunt.hcl — updated variable loading
+terraform {
+  extra_arguments "common_vars" {
+    commands = get_terraform_commands_that_need_vars()
+    required_var_files = [
+      "${get_repo_root()}/config.tfvars"
+    ]
+  }
+
+  extra_arguments "secrets" {
+    commands = get_terraform_commands_that_need_vars()
+    optional_var_files = [
+      "${get_repo_root()}/secrets.auto.tfvars.json"
+    ]
+  }
+
+  # Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
+  before_hook "check_secrets" {
+    commands = ["apply", "plan", "destroy"]
+    execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
+  }
+}
+```
+
+**Global decrypt-once wrapper** (run instead of raw terragrunt):
+```bash
+#!/usr/bin/env bash
+# scripts/tg — wrapper: decrypt then terragrunt
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+SOPS_FILE="$REPO_ROOT/secrets.sops.json"
+OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"
+
+if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
+  TEMP=$(mktemp "$OUT_FILE.XXXXXX")
+  trap "rm -f '$TEMP'" EXIT
+  sops -d "$SOPS_FILE" > "$TEMP"
+  mv "$TEMP" "$OUT_FILE"
+  echo "Decrypted secrets → secrets.auto.tfvars.json"
+fi
+
+exec terragrunt "$@"
+```
+
+Usage: `scripts/tg apply --non-interactive` instead of `terragrunt apply --non-interactive`.
+
+**Why not before_hook/after_hook for decryption?** When using `run --all`, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.
+
+**Why before_hook for the existence check?** It's read-only (just `test -f`) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
+
+### 4. File Protection
+
+**.gitignore** (add these entries):
+```
+/secrets.auto.tfvars.json
+/secrets.auto.tfvars.json.*
+```
+
+**.gitattributes** changes (done atomically in Phase 4):
+```
+# KEEP for binary files
+secrets/** filter=git-crypt diff=git-crypt
+*.tfstate filter=git-crypt diff=git-crypt
+
+# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
+```
+
+### 5. Woodpecker CI Pipeline Changes
+
+**default.yml**:
+```yaml
+steps:
+  - name: prepare
+    image: alpine
+    commands:
+      - "apk update && apk add jq curl git git-crypt"
+      # git-crypt for secrets/ directory (TLS certs, deploy key)
+      # Note: K8s Secret .data values are base64-encoded by the API
+      - |
+        curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
+          -H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
+          | jq -r '.data.key' | base64 -d > /tmp/key
+      - "git-crypt unlock /tmp/key && rm /tmp/key"
+      # Install SOPS to workspace (shared across steps via workspace volume)
+      - "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
+      - "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417  ./sops' | sha256sum -c"
+      - "chmod +x ./sops"
+      # Write age key to file (Woodpecker from_secret injects as env var, not file)
+      - "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
+      - "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
+      - "shred -u /tmp/age-key.txt"
+    environment:
+      SOPS_AGE_KEY:
+        from_secret: sops_age_key  # CI's age private key material
+
+  - name: terragrunt-plan
+    image: alpine
+    commands:
+      - "apk update && apk add curl unzip git openssh-client"
+      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
+      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
+      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
+      - "chmod 755 /usr/local/bin/terragrunt"
+      - "set -o pipefail; cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
+    when:
+      event: pull_request
+
+  - name: terragrunt-apply
+    image: alpine
+    commands:
+      - "apk update && apk add curl unzip git openssh-client"
+      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
+      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
+      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
+      - "chmod 755 /usr/local/bin/terragrunt"
+      - "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
+    when:
+      event: push
+      branch: master
+
+  - name: cleanup-and-push
+    image: alpine
+    commands:
+      - "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
+      - "apk update && apk add openssh-client git git-crypt"
+      - "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
+      - "chmod 400 secrets/deploy_key"
+      - "git add stacks/ state/ .woodpecker/ || true"
+      - "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
+      - "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
+      - "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
+    when:
+      - event: push
+        branch: master
+      - status: [success, failure]  # Always clean up, even on failure
+
+  - name: slack
+    image: curlimages/curl
+    commands:
+      - |
+        curl -s -X POST -H 'Content-type: application/json' \
+          --data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
+          "$SLACK_WEBHOOK" || true
+    environment:
+      SLACK_WEBHOOK:
+        from_secret: slack_webhook
+    when:
+      - status: [success, failure]
+```
+
+**renew-tls.yml** — ALSO update this pipeline:
+- Change `git add .` to `git add secrets/ state/` in the `commit-certs` step
+- Same defense-in-depth as default.yml
+
+Key design decisions:
+- `SOPS_AGE_KEY` (env var, not file) — Woodpecker `from_secret` only supports env vars. The prepare step writes it to a temp file, uses `SOPS_AGE_KEY_FILE`, then `shred`s the file
+- SOPS binary in workspace (shared volume) — not per-container `/usr/local/bin/`
+- `cleanup-and-push` runs on `status: [success, failure]` — always cleans up decrypted file
+- `git add stacks/ state/ .woodpecker/` — never `git add .`
+- Plan output filtered through `grep -v sensitive` — belt-and-suspenders with `sensitive = true`; `set -o pipefail` keeps a failed plan from being masked by grep's exit status
+
+### 6. Branch Protection (Required)
+
+GitHub branch protection on `master`:
+- **Require pull request reviews**: at least 1 reviewer (Viktor)
+- **Restrict who can push**: Viktor only (direct push for `[ci skip]` commits)
+- **Restrict who can dismiss reviews**: Viktor only
+
+This prevents operators from modifying `.woodpecker/`, `terragrunt.hcl`, or `.sops.yaml` without review.
+
+**Residual risk**: An operator can add `provisioner "local-exec" { command = "echo ${var.secret}" }` in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) `sensitive = true` hides values in plan output, (3) `local-exec` provisioners are unusual in this codebase and should be flagged during review.
+
+### 7. K8s RBAC for Operators
+
+Scoped operator role — no cluster-wide secrets access:
+
+```hcl
+resource "kubernetes_cluster_role" "operator" {
+  metadata { name = "cluster-operator" }
+  rule {
+    api_groups = [""]
+    resources = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
+    verbs = ["get", "list", "watch"]
+  }
+  rule {
+    api_groups = ["apps"]
+    resources = ["deployments", "statefulsets", "daemonsets", "replicasets"]
+    verbs = ["get", "list", "watch"]
+  }
+}
+
+# Per-namespace full access (edit role includes secrets within namespace — accepted residual risk)
+resource "kubernetes_role_binding" "operator_namespace" {
+  for_each = toset(var.operator_namespaces)
+  metadata {
+    name = "operator-access"
+    namespace = each.value
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind = "ClusterRole"
+    name = "edit"
+  }
+  subject {
+    kind = "Group"
+    name = "operators"
+  }
+}
+```
+
+**Excluded namespaces** (never in `operator_namespaces`): `woodpecker`, `kube-system`, `dbaas`, `monitoring`, `authentik`.
+
+### 8. Operator Workflow
+
+**Setup (one-time)**: GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.
+
+**Day-to-day**: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.
+
+**kubectl**: `kubectl oidc-login` → Authentik → scoped to assigned namespaces.
+
+**New secrets**: Comment on PR, Viktor adds to `secrets.sops.json`.
+
+### 9. Migration Plan (Phased)
+
+**Phase 1 — Setup tooling (no functional change)**
+- Install `sops` and `age` locally (Docker)
+- Generate age keys: Viktor + CI
+- Store CI age key as Woodpecker secret (`sops_age_key`)
+- Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
+- Create `.sops.yaml` config file
+- Add `/secrets.auto.tfvars.json` to `.gitignore`
+- Create `scripts/tg` wrapper
+- Backup Viktor's age private key to Vaultwarden
+
+**Phase 2 — Create SOPS file alongside existing tfvars**
+- Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
+- Extract non-secret config into `config.tfvars` (plaintext)
+- Extract secrets into `secrets.sops.json` (JSON, including complex types: maps, lists, nested objects)
+- Encrypt with SOPS
+- Verify round-trip: `sops -d secrets.sops.json | jq .` produces valid JSON
+- Verify SSH keys: `sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f -`
+- Verify complex types: `sops -d secrets.sops.json | jq '.mailserver_accounts'` returns expected map
+- Add `sensitive = true` to ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
+
+**Phase 3 — Switch terragrunt to SOPS**
+- Update `terragrunt.hcl`: `config.tfvars` (required) + `secrets.auto.tfvars.json` (optional) + existence check hook
+- Test: `scripts/tg apply --non-interactive` works per-stack
+- Test: `scripts/tg run --all -- plan` works (no race condition)
+- Test failure mode: delete `secrets.auto.tfvars.json`, verify `before_hook` fails loudly
+
+**Phase 4 — Atomic cutover**
+- Step 1: `git rm terraform.tfvars` (removes file while git-crypt filter still active — clean deletion)
+- Step 2: Remove `*.tfvars filter=git-crypt` from `.gitattributes`
+- Step 3: `git commit` both changes
+
+**Phase 5 — Update CI pipelines**
+- Update `.woodpecker/default.yml` with new pipeline
+- Update `.woodpecker/renew-tls.yml`: change `git add .` to `git add secrets/ state/`
+- Add `sops_age_key` Woodpecker secret
+- Enable GitHub branch protection on master
+- Test: CI pipeline applies successfully
+
+**Phase 6 — Security hardening**
+- Create scoped operator RBAC role
+- Remove `secrets` from `power-user` ClusterRole
+- Update CLAUDE.md and AGENTS.md documentation
+
+**Phase 7 — Onboard operator**
+- Add as GitHub collaborator
+- Create Authentik account in "operators" group
+- Walk through first PR workflow
+
+### 10. Rollback Plan
+- **Phase 1-2**: No functional change — delete SOPS artifacts
+- **Phase 3**: Revert `terragrunt.hcl` to load `terraform.tfvars`
+- **Phase 4+**: Re-add the `*.tfvars` rule to `.gitattributes`, then restore the file with `git checkout <cutover-commit>~1 -- terraform.tfvars` (`git show <rev>:<path>` prints the stored blob without running the git-crypt smudge filter, and `HEAD~1` points at the pre-cutover commit only immediately after the cutover). Backfill any secrets added during the SOPS period.
+- Git-crypt stays functional for `secrets/` and `*.tfstate`
+
+### 11. What Stays with git-crypt
+- `secrets/` directory: TLS certs, deploy keys (binary)
+- `*.tfstate` files: Terraform state
+- git-crypt key: K8s **Secret** in `woodpecker` namespace (migrated from ConfigMap)
+
+### 12. Security Considerations
+- **Decrypted file**: temporary, `.gitignore`d, never staged by CI, cleaned up on success AND failure
+- **CI staging**: `git add stacks/ state/ .woodpecker/` — never `git add .` (all pipelines)
+- **Age key in CI**: `SOPS_AGE_KEY` env var → written to temp file → `SOPS_AGE_KEY_FILE` → `shred` after use
+- **Age key backup**: Viktor's in Vaultwarden. CI's as Woodpecker secret
+- **Branch protection**: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
+- **RBAC**: Operator role excludes cluster-wide secrets. Namespace `edit` role allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik
+- **Terraform variables**: `sensitive = true` on all secret vars — applied in Phase 2 BEFORE plan step is enabled
+- **Plan output**: filtered through `grep -v sensitive` as belt-and-suspenders
+- **`local-exec` exfiltration**: residual risk mitigated by PR review requirement — Viktor must review all PRs
+- **State files**: contain secret values, git-crypt encrypted. Future: remote backend
+- **Rotation**: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
+- **Git history**: old `terraform.tfvars` remains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)
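
Review note: the "Decrypted file" guarantee in Section 12 depends on the two `.gitignore` entries from Section 4 actually being present. A minimal sketch of a preflight check follows, assuming a POSIX shell. The function name `check_gitignore` is hypothetical and not part of the design; it only reports which required patterns are missing.

```bash
# Hypothetical helper (an assumption, not part of the design document):
# verify that a .gitignore file contains every entry Section 4 requires.
# Usage: check_gitignore <path-to-.gitignore>
check_gitignore() {
    gitignore_file="$1"
    missing=0
    for pattern in "/secrets.auto.tfvars.json" "/secrets.auto.tfvars.json.*"; do
        # -x matches the whole line; -F treats the pattern as a fixed string,
        # so the trailing ".*" is compared literally rather than as a regex.
        if ! grep -qxF -- "$pattern" "$gitignore_file" 2>/dev/null; then
            echo "missing .gitignore entry: $pattern" >&2
            missing=1
        fi
    done
    # Returns 0 when every entry is present, 1 otherwise.
    return "$missing"
}
```

One possible use is calling `check_gitignore "$REPO_ROOT/.gitignore" || exit 1` in `scripts/tg` before the decrypt step, so that the plaintext file is never written into a checkout that would not ignore it.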