# Terragrunt Migration Design **Date**: 2026-02-22 **Status**: Approved ## Problem The infrastructure repo has a monolithic Terraform setup: - 15MB state file, 857 resources, 85+ service modules in a single root - `terraform plan/apply` evaluates all modules even when targeting one service - `null_resource.core_services` bottleneck blocks 73 services behind 12 core modules - 150+ variables passed through root -> kubernetes_cluster -> individual services - 3 providers (kubernetes, helm, proxmox) initialize on every run ## Goals - **Speed**: Faster plan/apply by splitting state into independent stacks - **Blast radius isolation**: Bad apply can't break unrelated services - **DRY config**: Shared provider/backend configuration via Terragrunt - **Proper DAG**: Full references between stacks (not hardcoded DNS strings) - **Bootstrappable**: `terragrunt run-all apply` works from scratch - **CI/CD**: Changed-stack detection in Drone CI ## Architecture: Flat Stacks ### Directory Structure ``` infra/ ├── terragrunt.hcl # Root config (providers, backend, common vars) ├── stacks/ │ ├── infra/ # Proxmox VMs, templates, docker-registry │ │ ├── terragrunt.hcl │ │ └── main.tf │ ├── platform/ # Core: traefik, metallb, redis, dbaas, authentik, etc. │ │ ├── terragrunt.hcl │ │ └── main.tf │ ├── blog/ # One dir per user service │ │ ├── terragrunt.hcl │ │ └── main.tf │ ├── immich/ │ │ ├── terragrunt.hcl │ │ └── main.tf │ └── ... (~65 service dirs) ├── modules/ # UNCHANGED — existing modules stay where they are │ ├── kubernetes/ │ │ ├── ingress_factory/ │ │ ├── setup_tls_secret/ │ │ ├── blog/ │ │ ├── immich/ │ │ └── ... │ ├── create-vm/ │ └── create-template-vm/ ├── state/ # Per-stack state files │ ├── infra/terraform.tfstate │ ├── platform/terraform.tfstate │ ├── blog/terraform.tfstate │ └── ... ├── terraform.tfvars # UNCHANGED — encrypted secrets ├── secrets/ # UNCHANGED — TLS certs ├── main.tf # LEGACY — gradually emptied during migration └── terraform.tfstate # LEGACY — gradually emptied during migration ``` Each stack has a thin `main.tf` wrapper that calls the existing module via `source = "../../modules/kubernetes/"`. We do NOT use Terragrunt's `terraform { source }` directive because our modules use relative paths (`../ingress_factory`, `../setup_tls_secret`) that would break when Terragrunt copies them to `.terragrunt-cache/`. ### Stack Composition **Infra stack** (~10 resources): - Proxmox VM templates (k8s, non-k8s, docker-registry) - Docker registry VM - Uses proxmox provider (not kubernetes/helm) **Platform stack** (~200 resources, ~20 services): - traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared - monitoring (prometheus, alertmanager, grafana, loki, alloy) - kyverno, metrics-server, nvidia, mailserver, authelia - wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy - Exports outputs consumed by service stacks **Per-service stacks** (~65, each 5-25 resources): - One stack per user-facing service - Each depends on platform via Terragrunt `dependency` block - Some depend on other services (f1-stream -> coturn, etc.) ### Dependency Graph ``` ┌─────────┐ │ infra │ └────┬────┘ │ ┌────▼────┐ │platform │ exports: redis_host, postgresql_host, │ │ mysql_host, smtp_host, tls_secret_name, ... └────┬────┘ │ ┌────────┬───────────┼───────────┬────────┐ │ │ │ │ │ ┌────▼──┐ ┌───▼───┐ ┌────▼───┐ ┌─────▼──┐ ┌──▼───┐ │ blog │ │immich │ │ affine │ │ollama │ │coturn│ ... └───────┘ └───────┘ └────────┘ └───┬────┘ └──┬───┘ │ │ ┌────▼───┐ ┌───▼──────┐ │openclaw│ │f1-stream │ │gramps │ └──────────┘ │ytdlp │ └────────┘ ``` ### Platform Stack Outputs | Output | Value | Consumers | |--------|-------|-----------| | `redis_host` | `redis.redis.svc.cluster.local` | 10 services | | `postgresql_host` | `postgresql.dbaas.svc.cluster.local` | 10 services | | `postgresql_port` | `5432` | 10 services | | `mysql_host` | `mysql.dbaas.svc.cluster.local` | 8 services | | `mysql_port` | `3306` | 8 services | | `smtp_host` | `mail.viktorbarzin.me` | 6 services | | `smtp_port` | `587` | 6 services | | `tls_secret_name` | from variable | all services | | `authentik_outpost_url` | `http://ak-outpost-...` | traefik | | `crowdsec_lapi_host` | `crowdsec-service...` | traefik | | `alertmanager_url` | `http://prometheus-alertmanager...` | loki | | `loki_push_url` | `http://loki...` | alloy | Service-to-service dependencies: | Service | Depends on | Outputs consumed | |---------|-----------|-----------------| | f1-stream | coturn | `coturn_host`, `coturn_port` | | real-estate-crawler | osm-routing | `osrm_foot_host`, `osrm_bicycle_host` | | openclaw, grampsweb, ytdlp | ollama | `ollama_host` | ### Module Modifications Service modules that hardcode DNS names need modification to accept hosts as variables. ~20 modules affected. Example for affine: **Before:** ```hcl # modules/kubernetes/affine/main.tf DATABASE_URL = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine" REDIS_SERVER_HOST = "redis.redis.svc.cluster.local" ``` **After:** ```hcl variable "redis_host" { type = string } variable "postgresql_host" { type = string } variable "postgresql_port" { type = number } DATABASE_URL = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine" REDIS_SERVER_HOST = var.redis_host ``` ## Root Terragrunt Configuration ```hcl # infra/terragrunt.hcl remote_state { backend = "local" generate = { path = "backend.tf" if_exists = "overwrite_terragrunt" } config = { path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate" } } terraform { extra_arguments "common_vars" { commands = get_terraform_commands_that_need_vars() required_var_files = [ "${get_repo_root()}/terraform.tfvars" ] } } generate "k8s_providers" { path = "providers.tf" if_exists = "overwrite_terragrunt" contents = < coturn) ```hcl # stacks/f1-stream/terragrunt.hcl include "root" { path = find_in_parent_folders() } dependency "platform" { config_path = "../platform" } dependency "coturn" { config_path = "../coturn" } inputs = { tls_secret_name = dependency.platform.outputs.tls_secret_name coturn_host = dependency.coturn.outputs.coturn_host coturn_port = dependency.coturn.outputs.coturn_port } ``` ## Migration Strategy ### Phase 0: Setup - Install Terragrunt - Create root `terragrunt.hcl`, `stacks/`, `state/` directories - No state changes, no risk ### Phase 1: Infra Stack (VMs) - Create `stacks/infra/` with Proxmox provider + VM module calls - `terraform state mv` 4 root-level module resources to `state/infra/` - Remove from root `main.tf` - Verify: `cd stacks/infra && terragrunt plan` shows no changes ### Phase 2: Platform Stack (Core Services) - Create `stacks/platform/main.tf` with ~20 core services + outputs - `terraform state mv` ~200 resources from `module.kubernetes_cluster.module.` - Remove `null_resource.core_services` (Terragrunt handles ordering) - Verify: `cd stacks/platform && terragrunt plan` shows no changes ### Phase 3: Simple Services (No DB Dependencies) - blog, echo, privatebin, excalidraw, city-guesser, dashy, etc. - Create stack, move state, verify — one at a time ### Phase 4: Database-Backed Services - Modify modules to accept hosts as variables - affine, immich, linkwarden, nextcloud, grampsweb, etc. - Create stack, move state, verify ### Phase 5: Service-to-Service Dependencies - ollama -> openclaw, grampsweb, ytdlp - coturn -> f1-stream - osm-routing -> real-estate-crawler ### Phase 6: Cleanup - Delete DEFCON system from `modules/kubernetes/main.tf` - Delete legacy `terraform.tfstate` - Delete root `main.tf` kubernetes_cluster module call - Update CI/CD to Terragrunt ### Rollback At any phase, `terraform state mv` resources back to monolith state and restore module calls. ## CI/CD: Changed-Stack Detection Drone CI pipeline detects changed files per commit and maps to affected stacks: | Changed file | Affected stack | |-------------|---------------| | `stacks/blog/*` | blog | | `modules/kubernetes/blog/*` | blog | | `terraform.tfvars` | all stacks | | `terragrunt.hcl` | all stacks | | `modules/kubernetes/ingress_factory/*` | all stacks | ### Manual Workflow ```bash # Apply single service cd stacks/blog && terragrunt apply # Apply everything (respects DAG ordering) cd stacks && terragrunt run-all apply # Plan everything cd stacks && terragrunt run-all plan ``` ## Decisions Made | Decision | Choice | Rationale | |----------|--------|-----------| | Tool | Terragrunt | DRY config, dependency management, run-all orchestration | | Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core | | Migration | Incremental | Lower risk, verify each step | | Shared modules | Relative paths | Simple, no registry overhead | | State backend | Local files | No external dependencies | | Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch | | CI/CD | Changed-stack detection | Only apply what changed |