infra/docs/plans/2026-02-22-terragrunt-migration-plan.md

Terragrunt Migration Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Migrate the monolithic Terraform setup (857 resources, 15 MB state) to Terragrunt with per-service state isolation, proper DAG dependencies, and CI/CD detection of changed stacks.

Architecture: Flat stacks under stacks/ with thin main.tf wrappers calling existing modules. Root terragrunt.hcl provides DRY provider/backend config. Platform stack groups ~20 core services and exports outputs (redis_host, postgresql_host, etc.) consumed by ~65 per-service stacks via Terragrunt dependency blocks.

Tech Stack: Terragrunt, Terraform 1.14.x, local state backend, Drone CI

Design Doc: docs/plans/2026-02-22-terragrunt-migration-design.md


Task 1: Install Terragrunt and Create Directory Skeleton

Files:

  • Create: stacks/ directory
  • Create: state/ directory
  • Create: .gitignore updates

Step 1: Install Terragrunt

Run:

brew install terragrunt

Expected: terragrunt --version prints the installed version.

Step 2: Create directory skeleton

Run:

mkdir -p stacks/{infra,platform}
mkdir -p state

Step 3: Update .gitignore

Add to .gitignore:

# Terragrunt
.terragrunt-cache/
state/

The state/ directory contains per-stack Terraform state files. These are local-only and must not be committed: like the current terraform.tfstate, they contain resource IDs and potentially sensitive data.

Step 4: Commit

git add stacks/ .gitignore
git commit -m "[ci skip] Add Terragrunt directory skeleton"

Task 2: Create Root Terragrunt Configuration

Files:

  • Create: terragrunt.hcl

Step 1: Write root terragrunt.hcl

# Root Terragrunt configuration
# Provides DRY provider, backend, and variable loading for all stacks.

# Each stack gets its own local state file under state/<stack-name>/
remote_state {
  backend = "local"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
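    # path_relative_to_include() returns "stacks/<stack-name>", so take the basename
    # to get state/<stack-name>/terraform.tfstate.
    path = "${get_repo_root()}/state/${basename(path_relative_to_include())}/terraform.tfstate"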
  }
}

# Load terraform.tfvars for all stacks.
# Variables in the var file that a stack does not declare produce a warning, not an error.
# (This applies to var files only; an undeclared variable passed with -var is an error.)
terraform {
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/terraform.tfvars"
    ]
  }

  extra_arguments "kube_config" {
    commands = get_terraform_commands_that_need_vars()
    arguments = [
      "-var", "kube_config_path=${get_repo_root()}/config"
    ]
  }
}

# Generate kubernetes + helm providers for K8s stacks.
# The infra stack overrides this to add the proxmox provider.
generate "k8s_providers" {
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

provider "kubernetes" {
  config_path = var.kube_config_path
}

provider "helm" {
  kubernetes {
    config_path = var.kube_config_path
  }
}
EOF
}
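
As a quick sanity check on the backend layout, the following sketch prints the state file a given stack is expected to use. It assumes each stack's state is meant to live at state/<stack-name>/terraform.tfstate, as the comment in the root config and the state mv targets in later tasks suggest. The function name state_path_for is hypothetical and not part of Terragrunt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the state file a stack is expected to use,
# assuming the layout state/<stack-name>/terraform.tfstate.
state_path_for() {
  local repo_root="$1" stack_dir="$2"
  local stack_name
  # Drop a trailing slash, then keep only the last path component.
  stack_name="$(basename "${stack_dir%/}")"
  printf '%s/state/%s/terraform.tfstate\n' "$repo_root" "$stack_name"
}

# Example: prints /repo/state/infra/terraform.tfstate
state_path_for /repo stacks/infra
```

Comparing this path with the one in the generated backend.tf after terragrunt init is a cheap way to catch a mismatch before any state is moved.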

Step 2: Verify Terragrunt parses it

Run:

cd stacks/infra && terragrunt validate-inputs 2>&1 | head -5

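Expected: No parse errors (warnings about a missing main.tf are fine). stacks/infra/terragrunt.hcl is not created until Task 3, so if Terragrunt reports that it cannot find a config file here, rerun this check after Task 3 Step 1.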

Step 3: Commit

git add terragrunt.hcl
git commit -m "[ci skip] Add root Terragrunt configuration"

Task 3: Create Infra Stack (Proxmox VMs)

Files:

  • Create: stacks/infra/terragrunt.hcl
  • Create: stacks/infra/main.tf

Step 1: Write infra terragrunt.hcl

This stack needs the proxmox provider instead of the default kubernetes and helm providers, so it overrides the root provider generation.

# stacks/infra/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

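# Override provider generation to include proxmox.
# The label must match the root block ("k8s_providers") for the child to replace it;
# a different label would leave two generate blocks writing to providers.tf.
generate "k8s_providers" {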
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

variable "proxmox_pm_api_url" { type = string }
variable "proxmox_pm_api_token_id" { type = string }
variable "proxmox_pm_api_token_secret" { type = string }

provider "proxmox" {
  pm_api_url          = var.proxmox_pm_api_url
  pm_api_token_id     = var.proxmox_pm_api_token_id
  pm_api_token_secret = var.proxmox_pm_api_token_secret
  pm_tls_insecure     = true
}
EOF
}

Step 2: Write infra main.tf

Copy the VM template and docker-registry module calls from the current root main.tf (lines 196-400). These reference ./modules/create-template-vm and ./modules/create-vm — adjust paths to ../../modules/....

# stacks/infra/main.tf
# Proxmox VM templates and docker registry VM

variable "proxmox_host" { type = string }
variable "ssh_private_key" {
  type    = string
  default = ""
}
variable "ssh_public_key" {
  type    = string
  default = ""
}
variable "vm_wizard_password" { type = string }
variable "k8s_join_command" { type = string }
variable "dockerhub_registry_password" { type = string }

locals {
  k8s_vm_template             = "ubuntu-2404-cloudinit-k8s-template"
  k8s_cloud_init_snippet_name = "k8s_cloud_init.yaml"
  k8s_cloud_init_image_path   = "/var/lib/vz/template/iso/noble-server-cloudimg-amd64-k8s.img"

  non_k8s_vm_template             = "ubuntu-2404-cloudinit-non-k8s-template"
  non_k8s_cloud_init_snippet_name = "non_k8s_cloud_init.yaml"
  non_k8s_cloud_init_image_path   = "/var/lib/vz/template/iso/noble-server-cloudimg-amd64-non-k8s.img"

  cloud_init_image_url = "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"
}

module "k8s-node-template" {
  source       = "../../modules/create-template-vm"
  proxmox_host = var.proxmox_host
  proxmox_user = "root"

  ssh_private_key = var.ssh_private_key
  ssh_public_key  = var.ssh_public_key

  cloud_image_url = local.cloud_init_image_url
  image_path      = local.k8s_cloud_init_image_path
  template_id     = 2000
  template_name   = local.k8s_vm_template
  user_passwd     = var.vm_wizard_password

  is_k8s_template = true
  snippet_name    = local.k8s_cloud_init_snippet_name
  containerd_config_update_command = <<-EOF
  sed -i 's|config_path = ""|config_path = "/etc/containerd/certs.d"|' /etc/containerd/config.toml
  mkdir -p /etc/containerd/certs.d/docker.io
  printf 'server = "https://registry-1.docker.io"\n\n[host."http://10.0.20.10:5000"]\n  capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/docker.io/hosts.toml
  mkdir -p /etc/containerd/certs.d/ghcr.io
  printf 'server = "https://ghcr.io"\n\n[host."http://10.0.20.10:5010"]\n  capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/ghcr.io/hosts.toml
  mkdir -p /etc/containerd/certs.d/quay.io
  printf 'server = "https://quay.io"\n\n[host."http://10.0.20.10:5020"]\n  capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/quay.io/hosts.toml
  mkdir -p /etc/containerd/certs.d/registry.k8s.io
  printf 'server = "https://registry.k8s.io"\n\n[host."http://10.0.20.10:5030"]\n  capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/registry.k8s.io/hosts.toml
  mkdir -p /etc/containerd/certs.d/reg.kyverno.io
  printf 'server = "https://reg.kyverno.io"\n\n[host."http://10.0.20.10:5040"]\n  capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/reg.kyverno.io/hosts.toml
  sed -i 's/.*max_concurrent_downloads = 3/max_concurrent_downloads = 20/g' /etc/containerd/config.toml
  sudo sed -i '/serializeImagePulls:/d' /var/lib/kubelet/config.yaml && \
  sudo sed -i '/maxParallelImagePulls:/d' /var/lib/kubelet/config.yaml && \
  echo -e 'serializeImagePulls: false\nmaxParallelImagePulls: 50' | sudo tee -a /var/lib/kubelet/config.yaml
  EOF
  k8s_join_command = var.k8s_join_command
}

module "non-k8s-node-template" {
  source       = "../../modules/create-template-vm"
  proxmox_host = var.proxmox_host
  proxmox_user = "root"

  ssh_private_key = var.ssh_private_key
  ssh_public_key  = var.ssh_public_key

  cloud_image_url = local.cloud_init_image_url
  image_path      = local.non_k8s_cloud_init_image_path
  template_id     = 1000
  template_name   = local.non_k8s_vm_template
  user_passwd     = var.vm_wizard_password

  is_k8s_template = false
  snippet_name    = local.non_k8s_cloud_init_snippet_name
}

module "docker-registry-template" {
  source       = "../../modules/create-template-vm"
  proxmox_host = var.proxmox_host
  proxmox_user = "root"

  ssh_private_key = var.ssh_private_key
  ssh_public_key  = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDHLhYDfyx237eJgOGVoJRECpUS95+7rEBS9vacsIxtx devvm"

  cloud_image_url = local.cloud_init_image_url
  image_path      = local.non_k8s_cloud_init_image_path
  template_id     = 1001
  template_name   = "docker-registry-template"

  user_passwd     = var.vm_wizard_password
  is_k8s_template = false
  snippet_name    = "docker-registry.yaml"

  provision_cmds = [
    "mkdir -p /etc/docker-registry",
    format("echo %s | base64 -d > /etc/docker-registry/config.yml",
      base64encode(
        templatefile("${path.root}/../../modules/docker-registry/config.yaml", {
          password = var.dockerhub_registry_password
        })
      )
    ),
    # ... (copy remaining provision_cmds from main.tf lines 305-371)
  ]
}

module "docker-registry-vm" {
  source         = "../../modules/create-vm"
  vmid           = 220
  vm_cpus        = 4
  vm_mem_mb      = 4196
  vm_disk_size   = "64G"
  template_name  = "docker-registry-template"
  vm_name        = "docker-registry"
  cisnippet_name = "docker-registry.yaml"
  vm_mac_address = "DE:AD:BE:EF:22:22"
  bridge         = "vmbr1"
  vlan_tag       = "20"
  ipconfig0      = "ip=10.0.20.10/24,gw=10.0.20.1"
}

Note: The provision_cmds for docker-registry-template is long (~60 lines). Copy it exactly from the current main.tf lines 296-371. The only change is templatefile paths: prefix with ${path.root}/../../ since the working directory is now stacks/infra/.

Step 3: Verify with init (do NOT apply yet)

Run:

cd stacks/infra && terragrunt init

Expected: Successful init, providers downloaded

Step 4: Commit

git add stacks/infra/
git commit -m "[ci skip] Add infra stack (Proxmox VMs)"

Task 4: Migrate Infra Stack State

CRITICAL: This task modifies live state. Take a backup first.

Step 1: Backup current state

Run:

cp terraform.tfstate terraform.tfstate.backup-pre-terragrunt

Step 2: List current infra resources in state

Run:

terraform state list | grep -E '^module\.(k8s-node-template|non-k8s-node-template|docker-registry-template|docker-registry-vm)\.'

Expected: List of ~10 resources belonging to these 4 modules
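
To make the "~10 resources" expectation easier to check, a small filter can count state entries per top-level module. This is a sketch only: the function name count_resources_by_module is hypothetical, and the resource addresses in the sample invocation are made up for illustration.

```shell
#!/usr/bin/env bash
# Reads `terraform state list` output on stdin and prints
# "<top-level-module> <resource-count>" per module, sorted by name.
count_resources_by_module() {
  awk -F. '$1 == "module" { count[$2]++ }
           END { for (m in count) print m, count[m] }' | sort
}

# Sample run with made-up addresses (real input comes from `terraform state list`).
printf '%s\n' \
  'module.k8s-node-template.null_resource.example_a' \
  'module.k8s-node-template.null_resource.example_b' \
  'module.docker-registry-vm.null_resource.example_c' \
  | count_resources_by_module
```

In practice the grep from this step would be piped into the function, and the same counts can be compared against the new state file after the move.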

Step 3: Move resources to new state file

Move each of the four modules from step 2 (moving a module address moves every resource under it):

mkdir -p state/infra
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.k8s-node-template' 'module.k8s-node-template'
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.non-k8s-node-template' 'module.non-k8s-node-template'
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.docker-registry-template' 'module.docker-registry-template'
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.docker-registry-vm' 'module.docker-registry-vm'
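
The four moves above differ only in the module name, so they can be generated from a list. The sketch below prints the commands rather than running them, which allows a review before touching live state. The function name print_infra_state_moves is hypothetical; the module names and default paths are the ones used in this task.

```shell
#!/usr/bin/env bash
# Dry run: print one `terraform state mv` command per infra module.
# Args (optional): source state file, destination state file.
print_infra_state_moves() {
  local src="${1:-terraform.tfstate}"
  local dst="${2:-state/infra/terraform.tfstate}"
  local module
  for module in k8s-node-template non-k8s-node-template \
                docker-registry-template docker-registry-vm; do
    printf "terraform state mv -state=%s -state-out=%s 'module.%s' 'module.%s'\n" \
      "$src" "$dst" "$module" "$module"
  done
}

print_infra_state_moves
```

Once the printed commands look right, they can be pasted into the shell or piped to bash from the repository root.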

Step 4: Verify no changes in new state

Run:

cd stacks/infra && terragrunt plan

Expected: No changes. Your infrastructure matches the configuration.

If there are changes, something went wrong — restore from backup and investigate.
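
The "no changes" check can also be made machine-readable. Terraform's plan command accepts -detailed-exitcode, which exits with 0 for an empty diff, 2 for a non-empty diff, and 1 on error. The sketch below assumes Terragrunt passes that exit code through for a single stack; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Map a `plan -detailed-exitcode` status to a short label.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;
    2) echo "changes-present" ;;
    *) echo "error" ;;
  esac
}

# Run a plan in the given stack directory and print the label.
# Assumption: terragrunt forwards terraform's exit code unchanged.
check_stack_plan() {
  local stack_dir="$1" rc=0
  ( cd "$stack_dir" && terragrunt plan -detailed-exitcode >/dev/null ) || rc=$?
  describe_plan_exit "$rc"
}
```

For example, check_stack_plan stacks/infra should print no-changes after a clean migration; anything else is a signal to restore from the backup.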

Step 5: Remove infra modules from root main.tf

Remove (or comment out) the module.k8s-node-template, module.non-k8s-node-template, module.docker-registry-template, and module.docker-registry-vm blocks from main.tf (lines 208-400).

Also remove the corresponding locals block (lines 196-206) since they're now in stacks/infra/main.tf.

Step 6: Verify legacy state is clean

Run:

terraform plan -var="kube_config_path=$(pwd)/config"

Expected: No changes (the moved resources are gone from both this state and main.tf)

Step 7: Commit

git add main.tf stacks/infra/
git commit -m "[ci skip] Migrate infra stack (VMs) to Terragrunt"

Task 5: Create Platform Stack

Files:

  • Create: stacks/platform/terragrunt.hcl
  • Create: stacks/platform/main.tf

This is the largest task — it groups ~20 core services into one stack.

Step 1: Write platform terragrunt.hcl

# stacks/platform/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "infra" {
  config_path = "../infra"
  skip_outputs = true
}

Step 2: Write platform main.tf

This file contains all core/cluster service module calls. Copy each from modules/kubernetes/main.tf, adjusting source paths from "./<service>" to "../../modules/kubernetes/<service>". Remove for_each conditionals (core services are always present). Remove depends_on = [null_resource.core_services].

Platform services (from modules/kubernetes/main.tf):

# stacks/platform/main.tf

# Variables — declare all variables needed by platform services
# kube_config_path is declared in the generated providers.tf; redeclaring it here
# would be a duplicate variable declaration.
variable "tls_secret_name" {}
variable "prod" { default = false }

# dbaas vars
variable "dbaas_root_password" {}
variable "dbaas_postgresql_root_password" {}
variable "dbaas_pgadmin_password" {}

# traefik vars
variable "ingress_crowdsec_api_key" {}

# technitium vars
variable "technitium_db_password" {}
variable "homepage_credentials" { type = map(any) }

# headscale vars
variable "headscale_config" {}
variable "headscale_acl" {}

# authentik vars
variable "authentik_secret_key" {}
variable "authentik_postgres_password" {}
variable "k8s_users" {
  type    = map(any)
  default = {}
}
variable "ssh_private_key" {
  type      = string
  default   = ""
  sensitive = true
}

# crowdsec vars
variable "crowdsec_enroll_key" { type = string }
variable "crowdsec_db_password" { type = string }
variable "crowdsec_dash_api_key" { type = string }
variable "crowdsec_dash_machine_id" { type = string }
variable "crowdsec_dash_machine_password" { type = string }
variable "alertmanager_slack_api_url" {}

# cloudflared vars
variable "cloudflare_api_key" {}
variable "cloudflare_email" {}
variable "cloudflare_account_id" {}
variable "cloudflare_zone_id" {}
variable "cloudflare_tunnel_id" {}
variable "public_ip" {}
variable "cloudflare_proxied_names" {}
variable "cloudflare_non_proxied_names" {}
variable "cloudflare_tunnel_token" {}

# monitoring vars
variable "alertmanager_account_password" {}
variable "idrac_username" { default = "" }
variable "idrac_password" { default = "" }
variable "tiny_tuya_service_secret" { type = string }
variable "haos_api_token" { type = string }
variable "pve_password" { type = string }
variable "grafana_db_password" { type = string }
variable "grafana_admin_password" { type = string }

# vaultwarden vars
variable "vaultwarden_smtp_password" {}

# reverse-proxy vars (homepage tokens are in homepage_credentials)

# wireguard vars
variable "wireguard_wg_0_conf" {}
variable "wireguard_wg_0_key" {}
variable "wireguard_firewall_sh" {}

# xray vars
variable "xray_reality_clients" { type = list(map(string)) }
variable "xray_reality_private_key" { type = string }
variable "xray_reality_short_ids" { type = list(string) }

# nvidia vars (none beyond tls_secret_name + tier)

# mailserver vars
variable "mailserver_accounts" {}
variable "mailserver_aliases" {}
variable "mailserver_opendkim_key" {}
variable "mailserver_sasl_passwd" {}
variable "mailserver_roundcubemail_db_password" { type = string }

# infra-maintenance vars
variable "webhook_handler_git_user" {}
variable "webhook_handler_git_token" {}
variable "technitium_username" {}
variable "technitium_password" {}

# uptime-kuma (no extra vars)
# metrics-server (no extra vars)
# kyverno (no extra vars)

locals {
  tiers = {
    core    = "0-core"
    cluster = "1-cluster"
    gpu     = "2-gpu"
    edge    = "3-edge"
    aux     = "4-aux"
  }
}

# --- Core Services (no dependencies, deployed first) ---

module "metallb" {
  source = "../../modules/kubernetes/metallb"
  tier   = local.tiers.core
}

module "dbaas" {
  source                   = "../../modules/kubernetes/dbaas"
  prod                     = var.prod
  tls_secret_name          = var.tls_secret_name
  dbaas_root_password      = var.dbaas_root_password
  postgresql_root_password = var.dbaas_postgresql_root_password
  pgadmin_password         = var.dbaas_pgadmin_password
  tier                     = local.tiers.cluster
}

module "redis" {
  source          = "../../modules/kubernetes/redis"
  tls_secret_name = var.tls_secret_name
  tier            = local.tiers.cluster
}

module "traefik" {
  source           = "../../modules/kubernetes/traefik"
  tier             = local.tiers.core
  crowdsec_api_key = var.ingress_crowdsec_api_key
  tls_secret_name  = var.tls_secret_name
}

module "technitium" {
  source                 = "../../modules/kubernetes/technitium"
  tls_secret_name        = var.tls_secret_name
  homepage_token         = var.homepage_credentials["technitium"]["token"]
  technitium_db_password = var.technitium_db_password
  tier                   = local.tiers.core
}

module "headscale" {
  source           = "../../modules/kubernetes/headscale"
  tls_secret_name  = var.tls_secret_name
  headscale_config = var.headscale_config
  headscale_acl    = var.headscale_acl
  tier             = local.tiers.core
}

module "authentik" {
  source            = "../../modules/kubernetes/authentik"
  tier              = local.tiers.cluster
  tls_secret_name   = var.tls_secret_name
  secret_key        = var.authentik_secret_key
  postgres_password = var.authentik_postgres_password
}

module "rbac" {
  source          = "../../modules/kubernetes/rbac"
  tier            = local.tiers.cluster
  tls_secret_name = var.tls_secret_name
  k8s_users       = var.k8s_users
  ssh_private_key = var.ssh_private_key
}

module "k8s-portal" {
  source          = "../../modules/kubernetes/k8s-portal"
  tier            = local.tiers.edge
  tls_secret_name = var.tls_secret_name
}

module "crowdsec" {
  source                         = "../../modules/kubernetes/crowdsec"
  tier                           = local.tiers.cluster
  tls_secret_name                = var.tls_secret_name
  homepage_username              = var.homepage_credentials["crowdsec"]["username"]
  homepage_password              = var.homepage_credentials["crowdsec"]["password"]
  enroll_key                     = var.crowdsec_enroll_key
  db_password                    = var.crowdsec_db_password
  crowdsec_dash_api_key          = var.crowdsec_dash_api_key
  crowdsec_dash_machine_id       = var.crowdsec_dash_machine_id
  crowdsec_dash_machine_password = var.crowdsec_dash_machine_password
  slack_webhook_url              = var.alertmanager_slack_api_url
}

module "cloudflared" {
  source                       = "../../modules/kubernetes/cloudflared"
  tier                         = local.tiers.core
  tls_secret_name              = var.tls_secret_name
  cloudflare_api_key           = var.cloudflare_api_key
  cloudflare_email             = var.cloudflare_email
  cloudflare_account_id        = var.cloudflare_account_id
  cloudflare_zone_id           = var.cloudflare_zone_id
  cloudflare_tunnel_id         = var.cloudflare_tunnel_id
  public_ip                    = var.public_ip
  cloudflare_proxied_names     = var.cloudflare_proxied_names
  cloudflare_non_proxied_names = var.cloudflare_non_proxied_names
  cloudflare_tunnel_token      = var.cloudflare_tunnel_token
}

module "monitoring" {
  source                        = "../../modules/kubernetes/monitoring"
  tls_secret_name               = var.tls_secret_name
  alertmanager_account_password = var.alertmanager_account_password
  idrac_username                = var.idrac_username
  idrac_password                = var.idrac_password
  alertmanager_slack_api_url    = var.alertmanager_slack_api_url
  tiny_tuya_service_secret      = var.tiny_tuya_service_secret
  haos_api_token                = var.haos_api_token
  pve_password                  = var.pve_password
  grafana_db_password           = var.grafana_db_password
  grafana_admin_password        = var.grafana_admin_password
  tier                          = local.tiers.cluster
}

module "vaultwarden" {
  source          = "../../modules/kubernetes/vaultwarden"
  tls_secret_name = var.tls_secret_name
  smtp_password   = var.vaultwarden_smtp_password
  tier            = local.tiers.edge
}

module "reverse-proxy" {
  source                 = "../../modules/kubernetes/reverse_proxy"
  tls_secret_name        = var.tls_secret_name
  truenas_homepage_token = var.homepage_credentials["reverse_proxy"]["truenas_token"]
  pfsense_homepage_token = var.homepage_credentials["reverse_proxy"]["pfsense_token"]
}

module "metrics-server" {
  source          = "../../modules/kubernetes/metrics-server"
  tier            = local.tiers.cluster
  tls_secret_name = var.tls_secret_name
}

module "nvidia" {
  source          = "../../modules/kubernetes/nvidia"
  tls_secret_name = var.tls_secret_name
  tier            = local.tiers.gpu
}

module "kyverno" {
  source = "../../modules/kubernetes/kyverno"
}

module "uptime-kuma" {
  source          = "../../modules/kubernetes/uptime-kuma"
  tls_secret_name = var.tls_secret_name
  tier            = local.tiers.cluster
}

module "wireguard" {
  source          = "../../modules/kubernetes/wireguard"
  tls_secret_name = var.tls_secret_name
  wg_0_conf       = var.wireguard_wg_0_conf
  wg_0_key        = var.wireguard_wg_0_key
  firewall_sh     = var.wireguard_firewall_sh
  tier            = local.tiers.core
}

module "xray" {
  source                   = "../../modules/kubernetes/xray"
  tls_secret_name          = var.tls_secret_name
  tier                     = local.tiers.core
  xray_reality_clients     = var.xray_reality_clients
  xray_reality_private_key = var.xray_reality_private_key
  xray_reality_short_ids   = var.xray_reality_short_ids
}

module "mailserver" {
  source                  = "../../modules/kubernetes/mailserver"
  tls_secret_name         = var.tls_secret_name
  mailserver_accounts     = var.mailserver_accounts
  postfix_account_aliases = var.mailserver_aliases
  opendkim_key            = var.mailserver_opendkim_key
  sasl_passwd             = var.mailserver_sasl_passwd
  roundcube_db_password   = var.mailserver_roundcubemail_db_password
  tier                    = local.tiers.edge
}

module "infra-maintenance" {
  source              = "../../modules/kubernetes/infra-maintenance"
  git_user            = var.webhook_handler_git_user
  git_token           = var.webhook_handler_git_token
  technitium_username = var.technitium_username
  technitium_password = var.technitium_password
}

# --- OUTPUTS (consumed by service stacks via Terragrunt dependency) ---

output "tls_secret_name"   { value = var.tls_secret_name }
output "redis_host"        { value = "redis.redis.svc.cluster.local" }
output "postgresql_host"   { value = "postgresql.dbaas.svc.cluster.local" }
output "postgresql_port"   { value = 5432 }
output "mysql_host"        { value = "mysql.dbaas.svc.cluster.local" }
output "mysql_port"        { value = 3306 }
output "smtp_host"         { value = "mail.viktorbarzin.me" }
output "smtp_port"         { value = 587 }

Step 3: Verify init succeeds

Run:

cd stacks/platform && terragrunt init

Step 4: Commit

git add stacks/platform/
git commit -m "[ci skip] Add platform stack (core services)"

Task 6: Migrate Platform Stack State

CRITICAL: Largest state migration. Backup first.

Step 1: Backup

cp terraform.tfstate terraform.tfstate.backup-pre-platform

Step 2: Move core service resources

The resources are currently at module.kubernetes_cluster.module.<service>["<key>"] (the for_each key). Services without for_each are at module.kubernetes_cluster.module.<service>.

Run state mv for each platform service. Example pattern:

mkdir -p state/platform

# Services WITH for_each (note the ["key"] suffix):
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.redis["redis"]' \
  'module.redis'

terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.traefik["traefik"]' \
  'module.traefik'

# Services WITHOUT for_each:
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.metallb' \
  'module.metallb'

terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.dbaas' \
  'module.dbaas'

terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.cloudflared' \
  'module.cloudflared'

terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.infra-maintenance' \
  'module.infra-maintenance'

terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.reverse-proxy["reverse-proxy"]' \
  'module.reverse-proxy'

Repeat for all platform services. Check whether each has for_each by looking at the state list:

terraform state list | grep 'module.kubernetes_cluster.module' | sort

Services with for_each have ["key"] suffix; those without don't.
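
The address mapping described above is mechanical: strip the module.kubernetes_cluster. prefix and, where present, the ["key"] suffix. The sketch below derives the unique module addresses from state list output and computes the target address for each. The function names are hypothetical, and the sketch assumes for_each keys contain no double quotes.

```shell
#!/usr/bin/env bash
# Reads `terraform state list` output on stdin and prints the unique
# module addresses directly under module.kubernetes_cluster.
module_addresses() {
  grep -oE '^module\.kubernetes_cluster\.module\.[A-Za-z0-9_-]+(\["[^"]*"\])?' | sort -u
}

# module.kubernetes_cluster.module.redis["redis"] -> module.redis
# module.kubernetes_cluster.module.metallb        -> module.metallb
target_address() {
  printf '%s\n' "$1" \
    | sed -E -e 's/^module\.kubernetes_cluster\.//' -e 's/\["[^"]*"\]$//'
}

# Succeeds when the address carries a for_each key suffix.
has_for_each_key() {
  case "$1" in
    *'["'*'"]') return 0 ;;
    *) return 1 ;;
  esac
}
```

A loop such as terraform state list | module_addresses | while read -r addr; do echo "$addr -> $(target_address "$addr")"; done gives a reviewable move list before any state mv is run.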

Step 3: Remove null_resource.core_services from the legacy state

# This resource is not needed in the new layout: remove it from state instead of moving it
terraform state rm 'module.kubernetes_cluster.null_resource.core_services'

Step 4: Verify platform state

Run:

cd stacks/platform && terragrunt plan

Expected: No changes. If the plan does show diffs, confirm that each one traces back to the removed for_each wrappers before continuing.

Step 5: Remove platform services from modules/kubernetes/main.tf

Remove the module blocks for all services that moved to the platform stack. Also remove null_resource.core_services and the defcon_modules/active_modules locals that reference these modules.

Step 6: Verify legacy state

Run:

terraform plan -var="kube_config_path=$(pwd)/config"

Expected: No changes for remaining services

Step 7: Commit

git add main.tf modules/kubernetes/main.tf stacks/platform/
git commit -m "[ci skip] Migrate platform stack (core services) to Terragrunt"

Task 7: Create Simple Service Stack Template + Migrate First Service (blog)

Files:

  • Create: stacks/blog/terragrunt.hcl
  • Create: stacks/blog/main.tf

Step 1: Write blog terragrunt.hcl

# stacks/blog/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}

Step 2: Write blog main.tf

# stacks/blog/main.tf
variable "tls_secret_name" {}
# kube_config_path is declared in the generated providers.tf; do not redeclare it here.

module "blog" {
  source          = "../../modules/kubernetes/blog"
  tls_secret_name = var.tls_secret_name
  tier            = "4-aux"
}

Step 3: Move blog state

mkdir -p state/blog
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/blog/terraform.tfstate \
  'module.kubernetes_cluster.module.blog["blog"]' \
  'module.blog'

Step 4: Verify

cd stacks/blog && terragrunt plan

Expected: No changes.

Step 5: Remove blog from modules/kubernetes/main.tf

Delete the module "blog" { ... } block (lines 197-205).

Step 6: Commit

git add stacks/blog/ modules/kubernetes/main.tf
git commit -m "[ci skip] Migrate blog to Terragrunt stack"

Task 8: Batch-Migrate Remaining Simple Services

Simple services only need tls_secret_name (and possibly a few non-DB variables). These follow the exact same pattern as blog.

Simple services to migrate (one stack each):

  • echo, privatebin, excalidraw, city-guesser, dashy, travel_blog, jsoncrack, cyberchef, stirling-pdf, networking-toolbox, meshcentral, ntfy, plotting-book, reloader, descheduler, homepage, tor-proxy, forgejo, freshrss, navidrome, audiobookshelf, ebook2audiobook, whisper, frigate, matrix, changedetection, isponsorblocktv

Services with a few extra variables (still no DB host refs):

  • shadowsocks (password), kms, hackmd (db_password), drone (github creds, rpc_secret), diun (nfty_token, slack_url), calibre (homepage creds), owntracks (credentials), webhook_handler (many tokens), coturn (turn_secret, public_ip), wealthfolio (password_hash), actualbudget (credentials), servarr (aiostreams), onlyoffice (db_password, jwt_token), tuya-bridge (api keys), openclaw (ssh_key, api keys), f1-stream (turn_secret), paperless-ngx (db_password), freedify (credentials)

For each service, create:

  1. stacks/<service>/terragrunt.hcl — include root, dependency on platform, inputs from platform outputs
  2. stacks/<service>/main.tf — variable declarations + module call with source = "../../modules/kubernetes/<service>"
  3. terraform state mv from legacy state
  4. Remove module block from modules/kubernetes/main.tf
  5. Verify with terragrunt plan

Automation script (run for each simple service):

#!/bin/bash
# Usage: ./migrate-service.sh <service-name> [source-dir] [for-each-key] [tier]
# Example: ./migrate-service.sh echo echo echo 3-edge
# Run from the repository root.
set -euo pipefail

SERVICE=${1:?usage: $0 <service-name> [source-dir] [for-each-key] [tier]}
SOURCE_DIR=${2:-$1}
FOR_EACH_KEY=${3:-$1}
TIER=${4:-4-aux}

mkdir -p "stacks/$SERVICE"

cat > stacks/$SERVICE/terragrunt.hcl <<'TGEOF'
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}
TGEOF

cat > stacks/$SERVICE/main.tf <<EOF
variable "tls_secret_name" {}
# kube_config_path is declared in the generated providers.tf; do not redeclare it here.

module "$SERVICE" {
  source          = "../../modules/kubernetes/$SOURCE_DIR"
  tls_secret_name = var.tls_secret_name
  tier            = "$TIER"
}
EOF

# Move state
mkdir -p "state/$SERVICE"
terraform state mv \
  -state=terraform.tfstate \
  -state-out="state/$SERVICE/terraform.tfstate" \
  "module.kubernetes_cluster.module.${SERVICE}[\"${FOR_EACH_KEY}\"]" \
  "module.${SERVICE}"

# Verify (the subshell keeps the caller's working directory unchanged)
(cd "stacks/$SERVICE" && terragrunt plan)

Services with extra variables need manual wrapper main.tf files (add the variable declarations and pass them to the module).

Commit after each batch of ~5-10 services:

git add stacks/*/
git commit -m "[ci skip] Migrate <service-list> to Terragrunt stacks"

Task 9: Modify Service Modules to Accept Host Variables

~20 modules need modification to replace hardcoded DNS names with variables.

For each module, the change is mechanical:

  1. Add variable "redis_host" { type = string } (and/or postgresql_host, etc.)
  2. Replace the hardcoded string with var.redis_host

Modules to modify and their needed variables:

| Module | Add variables | Replace in |
|---|---|---|
| affine | redis_host, postgresql_host, postgresql_port, smtp_host, smtp_port | main.tf:25,29,50-64 |
| immich | redis_host, postgresql_host | main.tf:80,96 |
| nextcloud | redis_host, mysql_host | chart_values.yaml:31,37 |
| grampsweb | redis_host, smtp_host, smtp_port | main.tf:37,41,45,57 |
| dawarich | redis_host, postgresql_host | main.tf:75,79,147 |
| send | redis_host | main.tf:75 |
| linkwarden | postgresql_host, postgresql_port | main.tf:67 |
| n8n | postgresql_host | main.tf:56 |
| health | postgresql_host, postgresql_port | main.tf:54 |
| tandoor | postgresql_host, smtp_host, smtp_port | main.tf:66,98 |
| rybbit | postgresql_host | main.tf:162 |
| netbox | postgresql_host | main.tf:73 |
| speedtest | mysql_host | main.tf:85 |
| real-estate-crawler | redis_host, mysql_host | main.tf:140,153,157,301,305,309,401,405,409 |
| ytdlp | redis_host, ollama_host | main.tf:241,255 |
| resume | smtp_host, smtp_port | main.tf:186 |
| monitoring | mysql_host, smtp_host | grafana_chart_values.yaml:51, prometheus_chart_values.tpl:35,37 |
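
Line numbers in a table like this drift as modules change, so it can help to locate the hardcoded names by content instead. The sketch below greps a module directory for the hostnames that the platform stack exports in Task 5. The function name find_hardcoded_hosts is hypothetical, and the pattern only covers those four hostnames.

```shell
#!/usr/bin/env bash
# Print file:line:content for every hardcoded platform hostname under a directory.
# Returns success even when nothing matches, so it is safe under `set -e`.
find_hardcoded_hosts() {
  local module_dir="$1"
  grep -rnE \
    'redis\.redis\.svc\.cluster\.local|(postgresql|mysql)\.dbaas\.svc\.cluster\.local|mail\.viktorbarzin\.me' \
    "$module_dir" || true
}
```

Running find_hardcoded_hosts modules/kubernetes/affine before and after the change shows which references remain; an empty result means the module no longer embeds any of these names.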

Example modification for affine:

In modules/kubernetes/affine/main.tf, add variables:

variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }

Replace:

# Before:
DATABASE_URL = "postgresql://postgres:${var.postgresql_password}@postgresql.dbaas.svc.cluster.local:5432/affine"
# After:
DATABASE_URL = "postgresql://postgres:${var.postgresql_password}@${var.postgresql_host}:${var.postgresql_port}/affine"

# Before:
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
# After:
REDIS_SERVER_HOST = var.redis_host
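Before committing a module, it can help to confirm that no in-cluster DNS names were missed. The helper below is a minimal sketch and not part of the plan itself: `find_hardcoded_hosts` is a hypothetical name, and the pattern assumes every hardcoded host has the `<service>.<namespace>.svc.cluster.local` shape of the two hostnames shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list remaining hardcoded in-cluster DNS names.
# Searches every file under the given directory (default: modules/kubernetes),
# because the table above lists chart values files as well as main.tf.
# Prints file:line:text for each hit; returns non-zero when nothing is found.
find_hardcoded_hosts() {
  grep -rnE '[a-z0-9-]+\.[a-z0-9-]+\.svc\.cluster\.local' "${1:-modules/kubernetes}"
}
```

Running `find_hardcoded_hosts modules/kubernetes/affine` after the edit should print nothing once every listed line has been replaced. Any remaining hit is either a missed replacement or a hostname the table does not cover.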

Step: Commit each module modification

git add modules/kubernetes/<service>/
git commit -m "[ci skip] Accept host variables in <service> module"

Task 10: Migrate Database-Backed Services to Terragrunt Stacks

After modules are modified (Task 9), create stacks that wire platform outputs to module inputs.

Example: stacks/affine/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  redis_host      = dependency.platform.outputs.redis_host
  postgresql_host = dependency.platform.outputs.postgresql_host
  postgresql_port = dependency.platform.outputs.postgresql_port
  smtp_host       = dependency.platform.outputs.smtp_host
  smtp_port       = dependency.platform.outputs.smtp_port
}

stacks/affine/main.tf:

variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "affine_postgresql_password" {}
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }

module "affine" {
  source              = "../../modules/kubernetes/affine"
  tls_secret_name     = var.tls_secret_name
  postgresql_password = var.affine_postgresql_password
  redis_host          = var.redis_host
  postgresql_host     = var.postgresql_host
  postgresql_port     = var.postgresql_port
  smtp_host           = var.smtp_host
  smtp_port           = var.smtp_port
  tier                = "4-aux"
}

State migration follows the same pattern as Task 7.

Repeat for all DB-backed services from the table in Task 9.
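Since the terragrunt.hcl for each DB-backed service differs only in its list of inputs, the repetition could be scripted. The following is a rough sketch under two assumptions: every input is exported by the platform stack under the same name (as in the affine example), and `scaffold_stack_hcl` is a hypothetical helper name. The wrapper main.tf still has to be written by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a terragrunt.hcl that wires platform outputs
# to same-named inputs. tls_secret_name is always included, matching the
# affine example; pass the remaining variable names as arguments.
scaffold_stack_hcl() {
  printf 'include "root" {\n  path = find_in_parent_folders()\n}\n\n'
  printf 'dependency "platform" {\n  config_path = "../platform"\n}\n\n'
  printf 'inputs = {\n'
  for name in tls_secret_name "$@"; do
    printf '  %s = dependency.platform.outputs.%s\n' "$name" "$name"
  done
  printf '}\n'
}
```

For instance, `scaffold_stack_hcl redis_host postgresql_host > stacks/immich/terragrunt.hcl` would produce the immich wiring from the Task 9 table. Services with a non-platform dependency, such as ytdlp with ollama_host, would need that dependency block added manually.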


Task 11: Migrate Service-to-Service Dependencies

Services that depend on other non-platform services need multi-dependency stacks.

Step 1: Create ollama stack with outputs

# stacks/ollama/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "ollama_api_credentials" {}

module "ollama" {
  source                 = "../../modules/kubernetes/ollama"
  tls_secret_name        = var.tls_secret_name
  tier                   = "2-gpu"
  ollama_api_credentials = var.ollama_api_credentials
}

output "ollama_host" {
  value = "ollama.ollama.svc.cluster.local"
}

Step 2: Create openclaw stack with ollama dependency

# stacks/openclaw/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

dependency "ollama" {
  config_path = "../ollama"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  ollama_host     = dependency.ollama.outputs.ollama_host
}

Step 3: Repeat the pattern for coturn → f1-stream and osm-routing → real-estate-crawler, where the stack on the right declares a dependency on the stack on the left.


Task 12: Final Cleanup

Step 1: Remove legacy modules/kubernetes/main.tf

After all services are migrated, this file should be empty (or contain only commented-out blocks). Delete it.

Step 2: Remove kubernetes_cluster module call from root main.tf

The root main.tf should now only contain provider blocks (which can also be removed since Terragrunt generates them) and the variable declarations for terraform.tfvars loading.

Step 3: Archive legacy state

mkdir -p state/backups
mv terraform.tfstate terraform.tfstate.legacy
mv terraform.tfstate.backup-* state/backups/

Step 4: Verify full DAG

cd stacks && terragrunt run-all plan

Expected: All stacks show No changes.
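With roughly 65 stacks, scanning the combined plan output for "No changes" by eye is error-prone. One possible way to check each stack individually is sketched below. It relies on Terraform's `-detailed-exitcode` flag for `plan` (exit 0 for no changes, 1 for an error, 2 for pending changes) and assumes Terragrunt passes that exit code through for a single-stack plan. `stacks_with_drift` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: plan each stack on its own and report any that are
# not clean. Takes the stacks directory as an argument (default: stacks).
# Returns 0 only when every stack plans with no changes.
stacks_with_drift() {
  root="${1:-stacks}"
  status=0
  for dir in "$root"/*/; do
    # Skip directories that are not Terragrunt stacks.
    [ -f "${dir}terragrunt.hcl" ] || continue
    rc=0
    (cd "$dir" && terragrunt plan -detailed-exitcode >/dev/null 2>&1) || rc=$?
    case "$rc" in
      0) ;;                                              # no changes
      2) echo "CHANGES $(basename "$dir")"; status=1 ;;  # plan has a diff
      *) echo "ERROR $(basename "$dir")"; status=1 ;;    # plan failed
    esac
  done
  return "$status"
}
```

Calling `stacks_with_drift stacks` from the repository root prints one line per stack that needs attention and nothing when the migration is clean. The sequential loop is slower than `run-all` but keeps per-stack results easy to read.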

Step 5: Update CLAUDE.md

Update the knowledge file to reflect the new Terragrunt architecture, commands, and workflow.

Step 6: Final commit

git add -A
git commit -m "[ci skip] Complete Terragrunt migration — remove legacy monolith"

Task 13: Update CI/CD (Drone Pipeline)

Files:

  • Modify: .drone.yml

Create a Drone pipeline that:

  1. Detects changed files
  2. Maps to affected stacks
  3. Runs terragrunt plan (on PR) or terragrunt apply (on master merge)

See design doc section "CI/CD: Changed-Stack Detection" for the pipeline logic.
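As an illustration of the first two steps only, the path-to-stack mapping might look like the sketch below. The design doc remains the source of truth. This version assumes that a change under stacks/<name>/ affects only that stack and that a module under modules/kubernetes/<name>/ is used by the stack of the same name. Modules grouped into the platform stack, and changes to the root terragrunt.hcl, would need explicit handling. `changed_stacks` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read changed file paths (one per line) on stdin and
# print the unique stack names they map to, sorted.
# Paths outside stacks/ and modules/kubernetes/ are ignored.
changed_stacks() {
  sed -nE \
    -e 's#^stacks/([^/]+)/.*#\1#p' \
    -e 's#^modules/kubernetes/([^/]+)/.*#\1#p' |
    sort -u
}
```

In a pipeline step this could be fed from git, for example `git diff --name-only origin/master...HEAD | changed_stacks`, with the resulting names driving a loop that runs terragrunt plan or terragrunt apply in each stacks/<name> directory.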


Execution Order Summary

| Task | Phase | Risk | Reversible |
|---|---|---|---|
| 1. Install + skeleton | 0 | None | Yes (delete dirs) |
| 2. Root terragrunt.hcl | 0 | None | Yes (delete file) |
| 3. Infra stack files | 1 | None | Yes (delete stack) |
| 4. Infra state migration | 1 | Medium | Yes (state mv back) |
| 5. Platform stack files | 2 | None | Yes (delete stack) |
| 6. Platform state migration | 2 | High | Yes (state mv back) |
| 7. First simple service (blog) | 3 | Low | Yes (state mv back) |
| 8. Batch simple services | 3 | Low | Yes (state mv back) |
| 9. Module host variable mods | 4 | Low | Yes (revert changes) |
| 10. DB service stacks | 4 | Low | Yes (state mv back) |
| 11. Service-to-service deps | 5 | Low | Yes (state mv back) |
| 12. Final cleanup | 6 | Medium | Harder to reverse |
| 13. CI/CD update | 6 | Low | Yes (revert .drone.yml) |