Terragrunt Migration Design
Date: 2026-02-22 Status: Approved
Problem
The infrastructure repo has a monolithic Terraform setup:
- 15MB state file, 857 resources, 85+ service modules in a single root
- terraform plan/apply evaluates all modules even when targeting one service
- null_resource.core_services bottleneck blocks 73 services behind 12 core modules
- 150+ variables passed through root -> kubernetes_cluster -> individual services
- 3 providers (kubernetes, helm, proxmox) initialize on every run
Goals
- Speed: Faster plan/apply by splitting state into independent stacks
- Blast radius isolation: Bad apply can't break unrelated services
- DRY config: Shared provider/backend configuration via Terragrunt
- Proper DAG: Full references between stacks (not hardcoded DNS strings)
- Bootstrappable: terragrunt run-all apply works from scratch
- CI/CD: Changed-stack detection in Drone CI
Architecture: Flat Stacks
Directory Structure
infra/
├── terragrunt.hcl # Root config (providers, backend, common vars)
├── stacks/
│ ├── infra/ # Proxmox VMs, templates, docker-registry
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── platform/ # Core: traefik, metallb, redis, dbaas, authentik, etc.
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── blog/ # One dir per user service
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── immich/
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ └── ... (~65 service dirs)
├── modules/ # UNCHANGED — existing modules stay where they are
│ ├── kubernetes/
│ │ ├── ingress_factory/
│ │ ├── setup_tls_secret/
│ │ ├── blog/
│ │ ├── immich/
│ │ └── ...
│ ├── create-vm/
│ └── create-template-vm/
├── state/ # Per-stack state files
│ ├── infra/terraform.tfstate
│ ├── platform/terraform.tfstate
│ ├── blog/terraform.tfstate
│ └── ...
├── terraform.tfvars # UNCHANGED — encrypted secrets
├── secrets/ # UNCHANGED — TLS certs
├── main.tf # LEGACY — gradually emptied during migration
└── terraform.tfstate # LEGACY — gradually emptied during migration
Each stack has a thin main.tf wrapper that calls the existing module via
source = "../../modules/kubernetes/<service>". We do NOT use Terragrunt's
terraform { source } directive because our modules use relative paths
(../ingress_factory, ../setup_tls_secret) that would break when Terragrunt
copies them to .terragrunt-cache/.
Stack Composition
Infra stack (~10 resources):
- Proxmox VM templates (k8s, non-k8s, docker-registry)
- Docker registry VM
- Uses proxmox provider (not kubernetes/helm)
Platform stack (~200 resources, ~20 services):
- traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
- monitoring (prometheus, alertmanager, grafana, loki, alloy)
- kyverno, metrics-server, nvidia, mailserver, authelia
- wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
- Exports outputs consumed by service stacks
Per-service stacks (~65, each 5-25 resources):
- One stack per user-facing service
- Each depends on platform via a Terragrunt dependency block
- Some depend on other services (f1-stream -> coturn, etc.)
Dependency Graph
                     ┌───────────┐
                     │   infra   │
                     └─────┬─────┘
                           │
                     ┌─────▼─────┐
                     │ platform  │  exports: redis_host, postgresql_host,
                     │           │  mysql_host, smtp_host, tls_secret_name, ...
                     └─────┬─────┘
                           │
     ┌──────────┬──────────┼──────────┬──────────┐
     │          │          │          │          │
 ┌───▼───┐  ┌───▼───┐  ┌───▼───┐  ┌───▼───┐  ┌───▼───┐
 │ blog  │  │immich │  │affine │  │ollama │  │coturn │ ...
 └───────┘  └───────┘  └───────┘  └───┬───┘  └───┬───┘
                                      │          │
                             ┌────────▼──┐ ┌─────▼─────┐
                             │ openclaw  │ │ f1-stream │
                             │ grampsweb │ └───────────┘
                             │ ytdlp     │
                             └───────────┘
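The infra -> platform edge in the graph carries ordering only; no infra outputs are listed as consumed. A minimal sketch of how that edge could be declared, assuming the platform stack really reads nothing from infra (if it does, a dependency block with outputs would be needed instead):

```hcl
# stacks/platform/terragrunt.hcl (sketch; not part of the approved design)
include "root" {
  path = find_in_parent_folders()
}

# Ordering-only edge: run-all processes ../infra before this stack.
# No outputs are read from infra, so a plain `dependencies` block suffices.
dependencies {
  paths = ["../infra"]
}
```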
Platform Stack Outputs
| Output | Value | Consumers |
|---|---|---|
| redis_host | redis.redis.svc.cluster.local | 10 services |
| postgresql_host | postgresql.dbaas.svc.cluster.local | 10 services |
| postgresql_port | 5432 | 10 services |
| mysql_host | mysql.dbaas.svc.cluster.local | 8 services |
| mysql_port | 3306 | 8 services |
| smtp_host | mail.viktorbarzin.me | 6 services |
| smtp_port | 587 | 6 services |
| tls_secret_name | from variable | all services |
| authentik_outpost_url | http://ak-outpost-... | traefik |
| crowdsec_lapi_host | crowdsec-service... | traefik |
| alertmanager_url | http://prometheus-alertmanager... | loki |
| loki_push_url | http://loki... | alloy |
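A few of the outputs above could be declared in the platform stack roughly as follows. This is a sketch: the file name outputs.tf and the variable name var.tls_secret_name are assumptions, and only outputs whose full values appear in the table are shown.

```hcl
# stacks/platform/outputs.tf (hypothetical file name)

output "redis_host" {
  description = "Cluster-internal DNS name of the shared Redis service"
  value       = "redis.redis.svc.cluster.local"
}

output "postgresql_host" {
  value = "postgresql.dbaas.svc.cluster.local"
}

output "postgresql_port" {
  value = 5432
}

output "smtp_host" {
  value = "mail.viktorbarzin.me"
}

output "smtp_port" {
  value = 587
}

# Assumes the platform stack declares variable "tls_secret_name"
# (the table only says the value comes "from variable").
output "tls_secret_name" {
  value = var.tls_secret_name
}
```

Service stacks then read these through dependency.platform.outputs.<name> instead of repeating the DNS strings.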
Service-to-service dependencies:
| Service | Depends on | Outputs consumed |
|---|---|---|
| f1-stream | coturn | coturn_host, coturn_port |
| real-estate-crawler | osm-routing | osrm_foot_host, osrm_bicycle_host |
| openclaw, grampsweb, ytdlp | ollama | ollama_host |
Module Modifications
Service modules that hardcode DNS names need modification to accept hosts as variables. ~20 modules affected. Example for affine:
Before:
# modules/kubernetes/affine/main.tf
DATABASE_URL = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
After:
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
DATABASE_URL = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
REDIS_SERVER_HOST = var.redis_host
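During the incremental migration, a modified module may still be called from the legacy root until its own stack exists. One option, offered here as an assumption rather than part of the approved design, is to declare the new variables with defaults equal to the previously hardcoded values, in place of the bare declarations, so the legacy call keeps working unchanged:

```hcl
# modules/kubernetes/affine/variables.tf (hypothetical file name)
# Defaults mirror the strings that were hardcoded before the change,
# so callers that do not pass these inputs behave exactly as before.

variable "redis_host" {
  type    = string
  default = "redis.redis.svc.cluster.local"
}

variable "postgresql_host" {
  type    = string
  default = "postgresql.dbaas.svc.cluster.local"
}

variable "postgresql_port" {
  type    = number
  default = 5432
}
```

The defaults could be dropped during cleanup once every caller passes the values from platform outputs.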
Root Terragrunt Configuration
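# infra/terragrunt.hcl
remote_state {
  backend = "local"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    # basename(get_terragrunt_dir()) is the stack name (e.g. "blog"), which matches
    # the state/<stack>/terraform.tfstate layout. path_relative_to_include() would
    # return "stacks/blog" and put the state under state/stacks/blog/ instead.
    path = "${get_repo_root()}/state/${basename(get_terragrunt_dir())}/terraform.tfstate"
  }
}

terraform {
  # Pass the shared tfvars file to every command that accepts variables
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/terraform.tfvars"
    ]
  }
}

# Written into each stack directory as providers.tf
generate "k8s_providers" {
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

provider "kubernetes" {
  config_path = var.kube_config_path
}

provider "helm" {
  kubernetes {
    config_path = var.kube_config_path
  }
}
EOF
}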
Stack Wrapper Examples
Simple service (blog)
# stacks/blog/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}

# stacks/blog/main.tf
# kube_config_path is declared in the generated providers.tf; declaring it
# again here would fail with a duplicate variable declaration.
variable "tls_secret_name" {}

module "blog" {
  source          = "../../modules/kubernetes/blog"
  tls_secret_name = var.tls_secret_name
  tier            = "4-aux"
}
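On a fresh checkout, terragrunt run-all plan evaluates dependency.platform.outputs before the platform stack has ever been applied, and Terragrunt reports an error when a dependency has no outputs. Terragrunt's mock_outputs can cover that case. A sketch for the blog stack, where the placeholder value is hypothetical:

```hcl
# stacks/blog/terragrunt.hcl (variant with mock outputs; sketch only)
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"

  # Placeholder used only while platform has no real outputs yet.
  mock_outputs = {
    tls_secret_name = "mock-tls-secret" # hypothetical placeholder value
  }

  # Limit mocks to read-only commands so apply never runs with them.
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}
```

With run-all apply the mocks are not needed, because platform is applied first and its real outputs are available by the time blog runs.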
Database-backed service (affine)
# stacks/affine/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  redis_host      = dependency.platform.outputs.redis_host
  postgresql_host = dependency.platform.outputs.postgresql_host
  postgresql_port = dependency.platform.outputs.postgresql_port
  smtp_host       = dependency.platform.outputs.smtp_host
  smtp_port       = dependency.platform.outputs.smtp_port
}

# stacks/affine/main.tf
# kube_config_path is declared in the generated providers.tf; declaring it
# again here would fail with a duplicate variable declaration.
variable "tls_secret_name" {}
variable "affine_postgresql_password" {} # supplied by terraform.tfvars
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }

module "affine" {
  source              = "../../modules/kubernetes/affine"
  tls_secret_name     = var.tls_secret_name
  postgresql_password = var.affine_postgresql_password
  redis_host          = var.redis_host
  postgresql_host     = var.postgresql_host
  postgresql_port     = var.postgresql_port
  smtp_host           = var.smtp_host
  smtp_port           = var.smtp_port
  tier                = "4-aux"
}
Service-to-service dependency (f1-stream -> coturn)
# stacks/f1-stream/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

dependency "coturn" {
  config_path = "../coturn"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  coturn_host     = dependency.coturn.outputs.coturn_host
  coturn_port     = dependency.coturn.outputs.coturn_port
}
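For the coturn references to resolve, the coturn stack has to export coturn_host and coturn_port, and the f1-stream wrapper has to accept them. A sketch of both sides follows. The file layout, the coturn module outputs host and port, and the f1-stream module path and input names are all assumptions made for illustration.

```hcl
# stacks/coturn/outputs.tf (hypothetical)
output "coturn_host" {
  value = module.coturn.host # assumes the coturn module exposes "host"
}

output "coturn_port" {
  value = module.coturn.port # assumes the coturn module exposes "port"
}

# stacks/f1-stream/main.tf (hypothetical; mirrors the blog wrapper)
# kube_config_path is not redeclared: the generated providers.tf has it.
variable "tls_secret_name" {}
variable "coturn_host" { type = string }
variable "coturn_port" { type = number }

module "f1-stream" {
  # Path assumed from the modules/kubernetes/<service> convention
  source          = "../../modules/kubernetes/f1-stream"
  tls_secret_name = var.tls_secret_name
  coturn_host     = var.coturn_host
  coturn_port     = var.coturn_port
}
```

Because f1-stream declares a dependency on ../coturn, run-all applies coturn first, so these outputs exist before f1-stream is planned.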
Migration Strategy
Phase 0: Setup
- Install Terragrunt
- Create root terragrunt.hcl, stacks/, state/ directories
- No state changes, no risk
Phase 1: Infra Stack (VMs)
- Create stacks/infra/ with Proxmox provider + VM module calls
- terraform state mv 4 root-level module resources to state/infra/
- Remove from root main.tf
- Verify: cd stacks/infra && terragrunt plan shows no changes
Phase 2: Platform Stack (Core Services)
- Create stacks/platform/main.tf with ~20 core services + outputs
- terraform state mv ~200 resources from module.kubernetes_cluster.module.<core>
- Remove null_resource.core_services (Terragrunt handles ordering)
- Verify: cd stacks/platform && terragrunt plan shows no changes
Phase 3: Simple Services (No DB Dependencies)
- blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
- Create stack, move state, verify — one at a time
Phase 4: Database-Backed Services
- Modify modules to accept hosts as variables
- affine, immich, linkwarden, nextcloud, grampsweb, etc.
- Create stack, move state, verify
Phase 5: Service-to-Service Dependencies
- ollama -> openclaw, grampsweb, ytdlp
- coturn -> f1-stream
- osm-routing -> real-estate-crawler
Phase 6: Cleanup
- Delete DEFCON system from modules/kubernetes/main.tf
- Delete legacy terraform.tfstate
- Delete root main.tf kubernetes_cluster module call
- Update CI/CD to Terragrunt
Rollback
At any phase, roll back by running terraform state mv to move the resources
back into the monolith state and restoring the module calls in the root main.tf.
CI/CD: Changed-Stack Detection
Drone CI pipeline detects changed files per commit and maps to affected stacks:
| Changed file | Affected stack |
|---|---|
| stacks/blog/* | blog |
| modules/kubernetes/blog/* | blog |
| terraform.tfvars | all stacks |
| terragrunt.hcl | all stacks |
| modules/kubernetes/ingress_factory/* | all stacks |
Manual Workflow
# Apply single service
cd stacks/blog && terragrunt apply
# Apply everything (respects DAG ordering)
cd stacks && terragrunt run-all apply
# Plan everything
cd stacks && terragrunt run-all plan
Decisions Made
| Decision | Choice | Rationale |
|---|---|---|
| Tool | Terragrunt | DRY config, dependency management, run-all orchestration |
| Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core |
| Migration | Incremental | Lower risk, verify each step |
| Shared modules | Relative paths | Simple, no registry overhead |
| State backend | Local files | No external dependencies |
| Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch |
| CI/CD | Changed-stack detection | Only apply what changed |