infra/docs/plans/2026-02-22-terragrunt-migration-design.md
2026-02-22 00:46:57 +00:00

12 KiB

Terragrunt Migration Design

Date: 2026-02-22 Status: Approved

Problem

The infrastructure repo has a monolithic Terraform setup:

  • 15MB state file, 857 resources, 85+ service modules in a single root
  • terraform plan/apply evaluates all modules even when targeting one service
  • null_resource.core_services bottleneck blocks 73 services behind 12 core modules
  • 150+ variables passed through root -> kubernetes_cluster -> individual services
  • 3 providers (kubernetes, helm, proxmox) initialize on every run

Goals

  • Speed: Faster plan/apply by splitting state into independent stacks
  • Blast radius isolation: Bad apply can't break unrelated services
  • DRY config: Shared provider/backend configuration via Terragrunt
  • Proper DAG: Full references between stacks (not hardcoded DNS strings)
  • Bootstrappable: terragrunt run-all apply works from scratch
  • CI/CD: Changed-stack detection in Drone CI

Architecture: Flat Stacks

Directory Structure

infra/
├── terragrunt.hcl              # Root config (providers, backend, common vars)
├── stacks/
│   ├── infra/                  # Proxmox VMs, templates, docker-registry
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── platform/               # Core: traefik, metallb, redis, dbaas, authentik, etc.
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── blog/                   # One dir per user service
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── immich/
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   └── ... (~65 service dirs)
├── modules/                    # UNCHANGED — existing modules stay where they are
│   ├── kubernetes/
│   │   ├── ingress_factory/
│   │   ├── setup_tls_secret/
│   │   ├── blog/
│   │   ├── immich/
│   │   └── ...
│   ├── create-vm/
│   └── create-template-vm/
├── state/                      # Per-stack state files
│   ├── infra/terraform.tfstate
│   ├── platform/terraform.tfstate
│   ├── blog/terraform.tfstate
│   └── ...
├── terraform.tfvars            # UNCHANGED — encrypted secrets
├── secrets/                    # UNCHANGED — TLS certs
├── main.tf                     # LEGACY — gradually emptied during migration
└── terraform.tfstate           # LEGACY — gradually emptied during migration

Each stack has a thin main.tf wrapper that calls the existing module via source = "../../modules/kubernetes/<service>". We do NOT use Terragrunt's terraform { source } directive because our modules use relative paths (../ingress_factory, ../setup_tls_secret) that would break when Terragrunt copies them to .terragrunt-cache/.

Stack Composition

Infra stack (~10 resources):

  • Proxmox VM templates (k8s, non-k8s, docker-registry)
  • Docker registry VM
  • Uses proxmox provider (not kubernetes/helm)

Platform stack (~200 resources, ~20 services):

  • traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
  • monitoring (prometheus, alertmanager, grafana, loki, alloy)
  • kyverno, metrics-server, nvidia, mailserver, authelia
  • wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
  • Exports outputs consumed by service stacks

Per-service stacks (~65, each 5-25 resources):

  • One stack per user-facing service
  • Each depends on platform via Terragrunt dependency block
  • Some depend on other services (f1-stream -> coturn, etc.)

Dependency Graph

                         ┌─────────┐
                         │  infra  │
                         └────┬────┘
                              │
                         ┌────▼────┐
                         │platform │  exports: redis_host, postgresql_host,
                         │         │  mysql_host, smtp_host, tls_secret_name, ...
                         └────┬────┘
                              │
         ┌────────┬───────────┼───────────┬────────┐
         │        │           │           │        │
    ┌────▼──┐ ┌───▼───┐ ┌────▼───┐ ┌─────▼──┐ ┌──▼───┐
    │ blog  │ │immich │ │ affine │ │ollama  │ │coturn│  ...
    └───────┘ └───────┘ └────────┘ └───┬────┘ └──┬───┘
                                       │         │
                                  ┌────▼───┐ ┌───▼──────┐
                                  │openclaw│ │f1-stream │
                                  │gramps  │ └──────────┘
                                  │ytdlp   │
                                  └────────┘

Platform Stack Outputs

Output Value Consumers
redis_host redis.redis.svc.cluster.local 10 services
postgresql_host postgresql.dbaas.svc.cluster.local 10 services
postgresql_port 5432 10 services
mysql_host mysql.dbaas.svc.cluster.local 8 services
mysql_port 3306 8 services
smtp_host mail.viktorbarzin.me 6 services
smtp_port 587 6 services
tls_secret_name from variable all services
authentik_outpost_url http://ak-outpost-... traefik
crowdsec_lapi_host crowdsec-service... traefik
alertmanager_url http://prometheus-alertmanager... loki
loki_push_url http://loki... alloy

Service-to-service dependencies:

Service Depends on Outputs consumed
f1-stream coturn coturn_host, coturn_port
real-estate-crawler osm-routing osrm_foot_host, osrm_bicycle_host
openclaw, grampsweb, ytdlp ollama ollama_host

Module Modifications

Service modules that hardcode DNS names need modification to accept hosts as variables. ~20 modules affected. Example for affine:

Before:

# modules/kubernetes/affine/main.tf
DATABASE_URL      = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"

After:

variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }

DATABASE_URL      = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
REDIS_SERVER_HOST = var.redis_host

Root Terragrunt Configuration

# infra/terragrunt.hcl

remote_state {
  backend = "local"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
  }
}

terraform {
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/terraform.tfvars"
    ]
  }
}

generate "k8s_providers" {
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

provider "kubernetes" {
  config_path = var.kube_config_path
}

provider "helm" {
  kubernetes {
    config_path = var.kube_config_path
  }
}
EOF
}

Stack Wrapper Examples

Simple service (blog)

# stacks/blog/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}
# stacks/blog/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }

module "blog" {
  source          = "../../modules/kubernetes/blog"
  tls_secret_name = var.tls_secret_name
  tier            = "4-aux"
}

Database-backed service (affine)

# stacks/affine/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  redis_host      = dependency.platform.outputs.redis_host
  postgresql_host = dependency.platform.outputs.postgresql_host
  postgresql_port = dependency.platform.outputs.postgresql_port
  smtp_host       = dependency.platform.outputs.smtp_host
  smtp_port       = dependency.platform.outputs.smtp_port
}
# stacks/affine/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "affine_postgresql_password" {}
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }

module "affine" {
  source              = "../../modules/kubernetes/affine"
  tls_secret_name     = var.tls_secret_name
  postgresql_password = var.affine_postgresql_password
  redis_host          = var.redis_host
  postgresql_host     = var.postgresql_host
  postgresql_port     = var.postgresql_port
  smtp_host           = var.smtp_host
  smtp_port           = var.smtp_port
  tier                = "4-aux"
}

Service-to-service dependency (f1-stream -> coturn)

# stacks/f1-stream/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

dependency "coturn" {
  config_path = "../coturn"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  coturn_host     = dependency.coturn.outputs.coturn_host
  coturn_port     = dependency.coturn.outputs.coturn_port
}

Migration Strategy

Phase 0: Setup

  • Install Terragrunt
  • Create root terragrunt.hcl, stacks/, state/ directories
  • No state changes, no risk

Phase 1: Infra Stack (VMs)

  • Create stacks/infra/ with Proxmox provider + VM module calls
  • terraform state mv 4 root-level module resources to state/infra/
  • Remove from root main.tf
  • Verify: cd stacks/infra && terragrunt plan shows no changes

Phase 2: Platform Stack (Core Services)

  • Create stacks/platform/main.tf with ~20 core services + outputs
  • terraform state mv ~200 resources from module.kubernetes_cluster.module.<core>
  • Remove null_resource.core_services (Terragrunt handles ordering)
  • Verify: cd stacks/platform && terragrunt plan shows no changes

Phase 3: Simple Services (No DB Dependencies)

  • blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
  • Create stack, move state, verify — one at a time

Phase 4: Database-Backed Services

  • Modify modules to accept hosts as variables
  • affine, immich, linkwarden, nextcloud, grampsweb, etc.
  • Create stack, move state, verify

Phase 5: Service-to-Service Dependencies

  • ollama -> openclaw, grampsweb, ytdlp
  • coturn -> f1-stream
  • osm-routing -> real-estate-crawler

Phase 6: Cleanup

  • Delete DEFCON system from modules/kubernetes/main.tf
  • Delete legacy terraform.tfstate
  • Delete root main.tf kubernetes_cluster module call
  • Update CI/CD to Terragrunt

Rollback

At any phase, terraform state mv resources back to monolith state and restore module calls.

CI/CD: Changed-Stack Detection

Drone CI pipeline detects changed files per commit and maps to affected stacks:

Changed file Affected stack
stacks/blog/* blog
modules/kubernetes/blog/* blog
terraform.tfvars all stacks
terragrunt.hcl all stacks
modules/kubernetes/ingress_factory/* all stacks

Manual Workflow

# Apply single service
cd stacks/blog && terragrunt apply

# Apply everything (respects DAG ordering)
cd stacks && terragrunt run-all apply

# Plan everything
cd stacks && terragrunt run-all plan

Decisions Made

Decision Choice Rationale
Tool Terragrunt DRY config, dependency management, run-all orchestration
Stack granularity 1 platform + 1 per service Max isolation for apps, grouped core
Migration Incremental Lower risk, verify each step
Shared modules Relative paths Simple, no registry overhead
State backend Local files No external dependencies
Cross-stack refs Full references via outputs Proper DAG, bootstrappable from scratch
CI/CD Changed-stack detection Only apply what changed