From 86648f684f07e49b94b11023c395403c317d2c02 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 22 Feb 2026 00:46:57 +0000
Subject: [PATCH] [ci skip] Add Terragrunt migration design document

---
 .../2026-02-22-terragrunt-migration-design.md | 387 ++++++++++++++++++
 1 file changed, 387 insertions(+)
 create mode 100644 docs/plans/2026-02-22-terragrunt-migration-design.md

diff --git a/docs/plans/2026-02-22-terragrunt-migration-design.md b/docs/plans/2026-02-22-terragrunt-migration-design.md
new file mode 100644
index 00000000..417db72e
--- /dev/null
+++ b/docs/plans/2026-02-22-terragrunt-migration-design.md
@@ -0,0 +1,387 @@
+# Terragrunt Migration Design
+
+**Date**: 2026-02-22
+**Status**: Approved
+
+## Problem
+
+The infrastructure repo has a monolithic Terraform setup:
+- 15MB state file, 857 resources, 85+ service modules in a single root
+- `terraform plan/apply` evaluates all modules even when targeting one service
+- `null_resource.core_services` bottleneck blocks 73 services behind 12 core modules
+- 150+ variables passed through root -> kubernetes_cluster -> individual services
+- 3 providers (kubernetes, helm, proxmox) initialize on every run
+
+## Goals
+
+- **Speed**: Faster plan/apply by splitting state into independent stacks
+- **Blast radius isolation**: Bad apply can't break unrelated services
+- **DRY config**: Shared provider/backend configuration via Terragrunt
+- **Proper DAG**: Full references between stacks (not hardcoded DNS strings)
+- **Bootstrappable**: `terragrunt run-all apply` works from scratch
+- **CI/CD**: Changed-stack detection in Drone CI
+
+## Architecture: Flat Stacks
+
+### Directory Structure
+
+```
+infra/
+├── terragrunt.hcl              # Root config (providers, backend, common vars)
+├── stacks/
+│   ├── infra/                  # Proxmox VMs, templates, docker-registry
+│   │   ├── terragrunt.hcl
+│   │   └── main.tf
+│   ├── platform/               # Core: traefik, metallb, redis, dbaas, authentik, etc.
+│   │   ├── terragrunt.hcl
+│   │   └── main.tf
+│   ├── blog/                   # One dir per user service
+│   │   ├── terragrunt.hcl
+│   │   └── main.tf
+│   ├── immich/
+│   │   ├── terragrunt.hcl
+│   │   └── main.tf
+│   └── ... (~65 service dirs)
+├── modules/                    # UNCHANGED — existing modules stay where they are
+│   ├── kubernetes/
+│   │   ├── ingress_factory/
+│   │   ├── setup_tls_secret/
+│   │   ├── blog/
+│   │   ├── immich/
+│   │   └── ...
+│   ├── create-vm/
+│   └── create-template-vm/
+├── state/                      # Per-stack state files
+│   ├── infra/terraform.tfstate
+│   ├── platform/terraform.tfstate
+│   ├── blog/terraform.tfstate
+│   └── ...
+├── terraform.tfvars            # UNCHANGED — encrypted secrets
+├── secrets/                    # UNCHANGED — TLS certs
+├── main.tf                     # LEGACY — gradually emptied during migration
+└── terraform.tfstate           # LEGACY — gradually emptied during migration
+```
+
+Each stack has a thin `main.tf` wrapper that calls the existing module via
+`source = "../../modules/kubernetes/"`. We do NOT use Terragrunt's
+`terraform { source }` directive because our modules use relative paths
+(`../ingress_factory`, `../setup_tls_secret`) that would break when Terragrunt
+copies them to `.terragrunt-cache/`.
+
+### Stack Composition
+
+**Infra stack** (~10 resources):
+- Proxmox VM templates (k8s, non-k8s, docker-registry)
+- Docker registry VM
+- Uses proxmox provider (not kubernetes/helm)
+
+**Platform stack** (~200 resources, ~20 services):
+- traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
+- monitoring (prometheus, alertmanager, grafana, loki, alloy)
+- kyverno, metrics-server, nvidia, mailserver, authelia
+- wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
+- Exports outputs consumed by service stacks
+
+**Per-service stacks** (~65, each 5-25 resources):
+- One stack per user-facing service
+- Each depends on platform via Terragrunt `dependency` block
+- Some depend on other services (f1-stream -> coturn, etc.)
+
+### Dependency Graph
+
+```
+                     ┌─────────┐
+                     │  infra  │
+                     └────┬────┘
+                          │
+                     ┌────▼────┐
+                     │platform │  exports: redis_host, postgresql_host,
+                     │         │  mysql_host, smtp_host, tls_secret_name, ...
+                     └────┬────┘
+                          │
+     ┌────────┬───────────┼───────────┬────────┐
+     │        │           │           │        │
+┌────▼──┐ ┌───▼───┐  ┌────▼───┐ ┌─────▼──┐  ┌──▼───┐
+│ blog  │ │immich │  │ affine │ │ollama  │  │coturn│ ...
+└───────┘ └───────┘  └────────┘ └───┬────┘  └──┬───┘
+                                    │          │
+                               ┌────▼───┐  ┌───▼──────┐
+                               │openclaw│  │f1-stream │
+                               │gramps  │  └──────────┘
+                               │ytdlp   │
+                               └────────┘
+```
+
+### Platform Stack Outputs
+
+| Output | Value | Consumers |
+|--------|-------|-----------|
+| `redis_host` | `redis.redis.svc.cluster.local` | 10 services |
+| `postgresql_host` | `postgresql.dbaas.svc.cluster.local` | 10 services |
+| `postgresql_port` | `5432` | 10 services |
+| `mysql_host` | `mysql.dbaas.svc.cluster.local` | 8 services |
+| `mysql_port` | `3306` | 8 services |
+| `smtp_host` | `mail.viktorbarzin.me` | 6 services |
+| `smtp_port` | `587` | 6 services |
+| `tls_secret_name` | from variable | all services |
+| `authentik_outpost_url` | `http://ak-outpost-...` | traefik |
+| `crowdsec_lapi_host` | `crowdsec-service...` | traefik |
+| `alertmanager_url` | `http://prometheus-alertmanager...` | loki |
+| `loki_push_url` | `http://loki...` | alloy |
+
+Service-to-service dependencies:
+
+| Service | Depends on | Outputs consumed |
+|---------|-----------|-----------------|
+| f1-stream | coturn | `coturn_host`, `coturn_port` |
+| real-estate-crawler | osm-routing | `osrm_foot_host`, `osrm_bicycle_host` |
+| openclaw, grampsweb, ytdlp | ollama | `ollama_host` |
+
+### Module Modifications
+
+Service modules that hardcode DNS names need modification to accept hosts as variables.
+~20 modules affected.
+Example for affine:
+
+**Before:**
+```hcl
+# modules/kubernetes/affine/main.tf
+DATABASE_URL      = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
+REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
+```
+
+**After:**
+```hcl
+variable "redis_host" { type = string }
+variable "postgresql_host" { type = string }
+variable "postgresql_port" { type = number }
+
+DATABASE_URL      = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
+REDIS_SERVER_HOST = var.redis_host
+```
+
+## Root Terragrunt Configuration
+
+```hcl
+# infra/terragrunt.hcl
+
+remote_state {
+  backend = "local"
+  generate = {
+    path      = "backend.tf"
+    if_exists = "overwrite_terragrunt"
+  }
+  config = {
+    # path_relative_to_include() yields "stacks/<name>" for a stack; basename()
+    # keeps only "<name>" so state lands in state/<name>/terraform.tfstate,
+    # matching the directory structure.
+    path = "${get_repo_root()}/state/${basename(path_relative_to_include())}/terraform.tfstate"
+  }
+}
+
+terraform {
+  extra_arguments "common_vars" {
+    commands = get_terraform_commands_that_need_vars()
+    required_var_files = [
+      "${get_repo_root()}/terraform.tfvars"
+    ]
+  }
+}
+
+generate "k8s_providers" {
+  path      = "providers.tf"
+  if_exists = "overwrite_terragrunt"
+  contents = < coturn)
+
+```hcl
+# stacks/f1-stream/terragrunt.hcl
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path = "../platform"
+}
+
+dependency "coturn" {
+  config_path = "../coturn"
+}
+
+inputs = {
+  tls_secret_name = dependency.platform.outputs.tls_secret_name
+  coturn_host     = dependency.coturn.outputs.coturn_host
+  coturn_port     = dependency.coturn.outputs.coturn_port
+}
+```
+
+## Migration Strategy
+
+### Phase 0: Setup
+- Install Terragrunt
+- Create root `terragrunt.hcl`, `stacks/`, `state/` directories
+- No state changes, no risk
+
+### Phase 1: Infra Stack (VMs)
+- Create `stacks/infra/` with Proxmox provider + VM module calls
+- `terraform state mv` 4 root-level module resources to `state/infra/`
+- Remove from root `main.tf`
+- Verify: `cd stacks/infra && terragrunt plan` shows no changes
+
+### Phase 2: Platform Stack (Core Services)
+- Create
+`stacks/platform/main.tf` with ~20 core services + outputs
+- `terraform state mv` ~200 resources from `module.kubernetes_cluster.module.`
+- Remove `null_resource.core_services` (Terragrunt handles ordering)
+- Verify: `cd stacks/platform && terragrunt plan` shows no changes
+
+### Phase 3: Simple Services (No DB Dependencies)
+- blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
+- Create stack, move state, verify — one at a time
+
+### Phase 4: Database-Backed Services
+- Modify modules to accept hosts as variables
+- affine, immich, linkwarden, nextcloud, grampsweb, etc.
+- Create stack, move state, verify
+
+### Phase 5: Service-to-Service Dependencies
+- ollama -> openclaw, grampsweb, ytdlp
+- coturn -> f1-stream
+- osm-routing -> real-estate-crawler
+
+### Phase 6: Cleanup
+- Delete DEFCON system from `modules/kubernetes/main.tf`
+- Delete legacy `terraform.tfstate`
+- Delete root `main.tf` kubernetes_cluster module call
+- Update CI/CD to Terragrunt
+
+### Rollback
+At any phase, `terraform state mv` resources back to monolith state and
+restore module calls.
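
The per-phase state moves and the rollback path are the same operation run in opposite directions, which the following sketch tries to make explicit. It only prints the `terraform state mv` commands rather than running them. The addresses `module.kubernetes_cluster.module.blog` and `module.blog` are hypothetical placeholders, not taken from the repository, and the helper names are invented for illustration. `-state` and `-state-out` are Terraform's legacy local-backend options; their behaviour has varied between Terraform versions, so it is worth trying the commands on a copy of the state first.

```bash
# Sketch only: prints the commands instead of running them.
# Assumes paths and resource addresses contain no whitespace or quotes.

# Print one `terraform state mv` command that moves an address between two
# local state files (-state / -state-out are legacy local-backend options).
state_mv_cmd() {
  src_state="$1"; dst_state="$2"; src_addr="$3"; dst_addr="$4"
  printf 'terraform state mv -state=%s -state-out=%s %s %s\n' \
    "$src_state" "$dst_state" "$src_addr" "$dst_addr"
}

# Migrate: monolith state -> per-stack state.
# Args: stack name, address in the monolith, address in the stack.
migrate_cmd() {
  state_mv_cmd "terraform.tfstate" "state/$1/terraform.tfstate" "$2" "$3"
}

# Rollback: per-stack state -> monolith state (same arguments, reversed move).
rollback_cmd() {
  state_mv_cmd "state/$1/terraform.tfstate" "terraform.tfstate" "$3" "$2"
}

# Hypothetical addresses for the blog stack:
migrate_cmd  blog module.kubernetes_cluster.module.blog module.blog
rollback_cmd blog module.kubernetes_cluster.module.blog module.blog
```

Printing first and executing later keeps each phase reviewable: the generated commands can be checked into the migration notes and followed by the `terragrunt plan` verification step described for each phase.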
+
+## CI/CD: Changed-Stack Detection
+
+Drone CI pipeline detects changed files per commit and maps to affected stacks:
+
+| Changed file | Affected stack |
+|-------------|---------------|
+| `stacks/blog/*` | blog |
+| `modules/kubernetes/blog/*` | blog |
+| `terraform.tfvars` | all stacks |
+| `terragrunt.hcl` | all stacks |
+| `modules/kubernetes/ingress_factory/*` | all stacks |
+
+### Manual Workflow
+
+```bash
+# Apply single service
+cd stacks/blog && terragrunt apply
+
+# Apply everything (respects DAG ordering)
+cd stacks && terragrunt run-all apply
+
+# Plan everything
+cd stacks && terragrunt run-all plan
+```
+
+## Decisions Made
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Tool | Terragrunt | DRY config, dependency management, run-all orchestration |
+| Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core |
+| Migration | Incremental | Lower risk, verify each step |
+| Shared modules | Relative paths | Simple, no registry overhead |
+| State backend | Local files | No external dependencies |
+| Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch |
+| CI/CD | Changed-stack detection | Only apply what changed |
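
The changed-file table in the CI/CD section could be implemented roughly as below. This is a minimal sketch, not the pipeline's actual script. The function name `affected_stacks` is invented here, and treating `modules/kubernetes/setup_tls_secret/*` as shared (like `ingress_factory`) is an assumption. The sketch also maps a module directory straight to a stack of the same name, so core modules that live inside the platform stack (traefik, redis, and so on) would need an extra rule.

```bash
# Read changed file paths (one per line) on stdin and print the affected
# stacks, one per line, or the single word ALL when a shared file changed.
affected_stacks() {
  result="$(
    while IFS= read -r path; do
      case "$path" in
        # Root-level files that every stack consumes.
        terraform.tfvars|terragrunt.hcl)
          echo ALL ;;
        # Shared helper modules (setup_tls_secret is an assumption).
        modules/kubernetes/ingress_factory/*|modules/kubernetes/setup_tls_secret/*)
          echo ALL ;;
        # stacks/<name>/... -> <name>
        stacks/*/*)
          path="${path#stacks/}"; echo "${path%%/*}" ;;
        # modules/kubernetes/<name>/... -> <name>
        modules/kubernetes/*/*)
          path="${path#modules/kubernetes/}"; echo "${path%%/*}" ;;
      esac
    done | sort -u
  )"
  if printf '%s\n' "$result" | grep -qx ALL; then
    echo ALL
  elif [ -n "$result" ]; then
    printf '%s\n' "$result"
  fi
}

# Possible use in a pipeline step (commit range is an assumption):
#   git diff --name-only HEAD~1 HEAD | affected_stacks
```

A result of `ALL` would translate to `terragrunt run-all apply` from `stacks/`, while individual names would translate to `terragrunt apply` inside each listed stack directory.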