[ci skip] Remove legacy files and orphaned modules

Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
Viktor Barzin 2026-02-22 15:23:27 +00:00
parent c7c7047f1c
commit 116c4d9c30
56 changed files with 2 additions and 9402 deletions


@@ -1,140 +0,0 @@
# Centralized Log Collection Design
## Date: 2026-02-13
## Goal
Centrally collect logs from all Kubernetes pods for monitoring and alerting. Minimize disk I/O by holding logs in memory for extended periods, flushing to NFS once daily. Alert on log patterns via existing Alertmanager pipeline.
## Requirements
- **Primary use case**: Monitoring and alerting (log-based alert rules evaluated in real-time)
- **Retention**: 7 days on disk after flush
- **Memory budget**: 4-8GB total (~6.6GB used)
- **Disk strategy**: 24h in-memory chunks, WAL on tmpfs, single daily flush to NFS
- **Crash policy**: Accept up to 24h log loss on pod/node crash (alerts still fire in real-time before flush)
- **Alert delivery**: Loki Ruler -> existing Alertmanager -> Slack/email
## Architecture
```
┌──────────────────┐     ┌────────────────────────┐     ┌──────────────┐
│ Alloy DaemonSet  │     │ Loki SingleBinary      │     │ Grafana      │
│ 5 pods, 128Mi ea │────>│ 1 pod, 6Gi RAM         │<────│ (existing)   │
│ tails /var/log/  │     │                        │     │ + Loki       │
│ pods on each node│     │ Ingester: 24h chunks   │     │ datasource   │
└──────────────────┘     │ WAL: tmpfs (in-memory) │     └──────────────┘
                         │ Storage: NFS 15Gi      │
┌──────────────────┐     │ Ruler ──> Alertmanager │
│ Sysctl DaemonSet │     └────────────────────────┘
│ 5 pods (pause)   │
│ sets inotify     │
│ limits on nodes  │
└──────────────────┘
```
## Components
### 1. Sysctl DaemonSet
Solves the `too many open files` / fsnotify watcher exhaustion problem that previously blocked Alloy.
- Privileged init container runs `sysctl -w` on each node
- Settings: `fs.inotify.max_user_watches=1048576`, `fs.inotify.max_user_instances=512`, `fs.inotify.max_queued_events=1048576`
- Main container: `pause` image (near-zero resources)
- Survives node reboots (DaemonSet recreates pod)
- Namespace: `monitoring`
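To confirm the limits actually landed on a node, they can be read back through an ephemeral debug pod — a quick sketch (the node name is a placeholder):
```bash
# Read the inotify limits back from a node via an ephemeral debug pod
# (k8s-node1 is an example; any node name works)
kubectl debug node/k8s-node1 -it --image=busybox -- \
  sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances fs.inotify.max_queued_events
```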
### 2. Loki (Helm Release)
Single-binary deployment. Existing Helm chart config in `loki.yaml`, updated with:
**Ingester tuning (disk-friendly):**
- `chunk_idle_period: 12h` — don't flush idle streams quickly
- `max_chunk_age: 24h` — hold chunks in memory for full day
- `chunk_retain_period: 1m` — brief retain after flush
- `chunk_target_size: 1572864` (1.5MB) — larger chunks = fewer writes
- WAL: tmpfs emptyDir (`medium: Memory`, 2Gi limit)
**Retention:**
- `retention_period: 168h` (7 days)
- Compactor enabled for retention enforcement
**Ruler:**
- Evaluates LogQL alert rules in real-time (before chunk flush)
- Fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`
**Storage:**
- NFS PV/PVC at `/mnt/main/loki/loki` (15Gi, existing)
- TSDB index with 24h period
**Resources:**
- Memory: 6Gi limit
- CPU: 1 limit
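Loki serves its effective configuration over HTTP, so the tuning above can be verified in the running pod — a sketch, assuming the single-binary pod is reachable as `deploy/loki` (as in the verification steps later in this plan):
```bash
# Confirm the rendered ingester settings inside the running pod
# (Loki exposes its effective config at /config)
kubectl exec -n monitoring deploy/loki -- \
  wget -qO- http://localhost:3100/config | grep -E 'chunk_idle_period|max_chunk_age|chunk_target_size'
```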
### 3. Alloy (Helm Release)
DaemonSet log collector. Existing config in `alloy.yaml` is complete:
- Discovers pods via `discovery.kubernetes`
- Labels: namespace, pod, container, app, job, container_runtime, cluster
- Tails `/var/log/pods/` on each node
- Forwards to `http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push`
**Resources per pod:**
- Memory: 128Mi limit
- CPU: 200m limit
### 4. Grafana Datasource
ConfigMap with label `grafana_datasource: "1"` for sidecar auto-discovery:
- Name: Loki
- Type: loki
- URL: `http://loki.monitoring.svc.cluster.local:3100`
- Existing `loki.json` dashboard already in dashboards directory
### 5. Starter Alert Rules
Configured in Loki Ruler (evaluated in real-time, before disk flush):
| Alert | LogQL Expression | Severity |
|-------|-----------------|----------|
| HighErrorRate | `sum(rate({namespace=~".+"} \|= "error" [5m])) by (namespace) > 10` | warning |
| PodCrashLoopBackOff | `count_over_time({namespace=~".+"} \|= "CrashLoopBackOff" [5m]) > 0` | critical |
| OOMKilled | `count_over_time({namespace=~".+"} \|= "OOMKilled" [5m]) > 0` | critical |
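Each expression can be dry-run against Loki's query API before being wired into the ruler — a sketch using the HighErrorRate rule (assumes a local port-forward):
```bash
# Dry-run the HighErrorRate expression against the Loki query API
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2  # give the port-forward a moment to establish
curl -sG 'http://localhost:3100/loki/api/v1/query' \
  --data-urlencode 'query=sum(rate({namespace=~".+"} |= "error" [5m])) by (namespace)'
```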
## Memory Budget
| Component | Per-pod | Pods | Total |
|-----------|---------|------|-------|
| Alloy | 128Mi | 5 | 640Mi |
| Loki | 6Gi | 1 | 6Gi |
| Sysctl DS | ~0 (pause) | 5 | ~0 |
| **Total** | | | **~6.6 GB** |
## Files to Change
| File | Action |
|------|--------|
| `modules/kubernetes/monitoring/loki.tf` | Uncomment Loki + Alloy helm releases, add sysctl DaemonSet, add Grafana Loki datasource ConfigMap |
| `modules/kubernetes/monitoring/loki.yaml` | Update with ingester tuning, ruler config, retention, resource limits |
| `modules/kubernetes/monitoring/alloy.yaml` | Add resource limits in Helm values wrapper |
| `secrets/nfs_directories.txt` | Ensure `/mnt/main/loki` entries exist |
## Implementation Steps
1. Add sysctl DaemonSet to `loki.tf`
2. Update `loki.yaml` with disk-friendly tuning, ruler, retention, resources
3. Update `alloy.yaml` with resource limits
4. Uncomment Loki Helm release in `loki.tf`, wire up NFS PV/PVC
5. Uncomment Alloy Helm release in `loki.tf`
6. Add Grafana Loki datasource ConfigMap to `loki.tf`
7. Add alert rules to Loki config
8. Ensure NFS exports exist in `secrets/nfs_directories.txt`
9. `terraform apply -target=module.kubernetes_cluster.module.monitoring`
10. Verify: Grafana Explore -> Loki datasource -> query `{namespace="monitoring"}`
## Risks
- **24h data loss on crash**: Accepted trade-off. Alerts fire in real-time before flush, so alert coverage is not affected — only historical log browsing is at risk.
- **Memory pressure**: 6Gi for Loki on a 16GB node is significant. Monitor with existing Prometheus memory alerts.
- **Log volume spikes**: A chatty pod could cause Loki to OOM. Alloy can be configured with rate limiting if needed (future enhancement).


@@ -1,532 +0,0 @@
# Centralized Log Collection Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.
**Architecture:** Alloy DaemonSet tails pod logs on all 5 nodes, forwards to single-binary Loki which holds chunks in 6Gi RAM for 24h before flushing to NFS. Loki Ruler evaluates LogQL alert rules in real-time and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.
**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana
**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`
---
### Task 1: Add sysctl DaemonSet for inotify limits
Alloy uses fsnotify to tail log files, and the default kernel inotify limits cause "too many open files" errors. This DaemonSet raises the limits on every node and re-applies them after reboots, since the DaemonSet recreates its pod.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)
**Step 1: Write the sysctl DaemonSet resource**
Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:
```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
metadata {
name = "sysctl-inotify"
namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = {
app = "sysctl-inotify"
}
}
spec {
selector {
match_labels = {
app = "sysctl-inotify"
}
}
template {
metadata {
labels = {
app = "sysctl-inotify"
}
}
spec {
init_container {
name = "sysctl"
image = "busybox:1.37"
command = [
"sh", "-c",
"sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
]
security_context {
privileged = true
}
}
container {
name = "pause"
image = "registry.k8s.io/pause:3.10"
resources {
requests = {
cpu = "1m"
memory = "4Mi"
}
limits = {
cpu = "1m"
memory = "4Mi"
}
}
}
host_pid = true
toleration {
operator = "Exists"
}
}
}
}
}
```
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Run terraform plan to verify**
Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)
**Step 4: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```
---
### Task 2: Update Loki Helm values with disk-friendly tuning
Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)
**Step 1: Write updated loki.yaml**
Replace entire contents of `loki.yaml` with:
```yaml
loki:
commonConfig:
replication_factor: 1
schemaConfig:
configs:
- from: "2025-04-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: loki_index_
period: 24h
ingester:
chunk_idle_period: 12h
max_chunk_age: 24h
chunk_retain_period: 1m
chunk_target_size: 1572864
wal:
dir: /loki-wal
pattern_ingester:
enabled: true
limits_config:
allow_structured_metadata: true
volume_enabled: true
retention_period: 168h
compactor:
retention_enabled: true
working_directory: /loki/compactor
compaction_interval: 1h
delete_request_store: filesystem
ruler:
enable_api: true
storage:
type: local
local:
directory: /loki/rules
alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
ring:
kvstore:
store: inmemory
rule_path: /loki/scratch
storage:
type: "filesystem"
auth_enabled: false
minio:
enabled: false
deploymentMode: SingleBinary
singleBinary:
replicas: 1
persistence:
enabled: true
size: 15Gi
storageClass: ""
extraVolumes:
- name: wal
emptyDir:
medium: Memory
sizeLimit: 2Gi
- name: rules
configMap:
name: loki-alert-rules
extraVolumeMounts:
- name: wal
mountPath: /loki-wal
- name: rules
mountPath: /loki/rules/fake
resources:
requests:
cpu: 250m
memory: 4Gi
limits:
cpu: "1"
memory: 6Gi
# Zero out replica counts of other deployment modes
backend:
replicas: 0
read:
replicas: 0
write:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
```
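Before committing, the values can be sanity-checked by rendering the chart locally — a sketch (the chart repo matches the `helm_release` in Task 4; the chart version is left unpinned here):
```bash
# Render the chart with the new values to catch YAML/schema errors early
helm repo add grafana https://grafana.github.io/helm-charts
helm template loki grafana/loki --namespace monitoring \
  -f modules/kubernetes/monitoring/loki.yaml > /dev/null && echo "values render OK"
```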
**Step 2: Commit**
```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```
---
### Task 3: Update Alloy Helm values with resource limits
The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.
**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)
**Step 1: Add resource limits to alloy.yaml**
Append after the existing `alloy.configMap.content` block (after the last line):
```yaml
# Resource limits for DaemonSet pods
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
```
The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:`.
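One way to verify the merge, assuming `yq` (the Go implementation) is available:
```bash
# Sanity-check the final structure: configMap and resources should both be
# keys under alloy (hypothetical check, requires mikefarah yq)
yq '.alloy | keys' modules/kubernetes/monitoring/alloy.yaml
# Expected output includes: configMap, resources
```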
**Step 2: Commit**
```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```
---
### Task 4: Uncomment Loki Helm release and PV in loki.tf
Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)
**Step 1: Uncomment the Loki Helm release (lines 1-12)**
Uncomment and update the helm_release to:
```hcl
resource "helm_release" "loki" {
namespace = kubernetes_namespace.monitoring.metadata[0].name
create_namespace = true
name = "loki"
repository = "https://grafana.github.io/helm-charts"
chart = "loki"
values = [templatefile("${path.module}/loki.yaml", {})]
timeout = 300
depends_on = [kubernetes_config_map.loki_alert_rules]
}
```
**Step 2: Uncomment the Loki NFS PV (lines 14-32)**
Uncomment the `kubernetes_persistent_volume.loki` resource as-is.
**Step 3: Remove the minio PV block (lines 34-52)**
Delete the entire `kubernetes_persistent_volume.loki-minio` commented block — minio is disabled.
**Step 4: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 5: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```
---
### Task 5: Uncomment Alloy Helm release in loki.tf
Enable the Alloy Helm release.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)
**Step 1: Uncomment and update the Alloy Helm release**
Replace the commented Alloy block with:
```hcl
resource "helm_release" "alloy" {
namespace = kubernetes_namespace.monitoring.metadata[0].name
create_namespace = true
name = "alloy"
repository = "https://grafana.github.io/helm-charts"
chart = "alloy"
values = [file("${path.module}/alloy.yaml")]
atomic = true
depends_on = [helm_release.loki]
}
```
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```
---
### Task 6: Add Grafana Loki datasource ConfigMap
Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)
**Step 1: Add the datasource ConfigMap**
Add to `loki.tf`:
```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
metadata {
name = "grafana-loki-datasource"
namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = {
grafana_datasource = "1"
}
}
data = {
"loki-datasource.yaml" = yamlencode({
apiVersion = 1
datasources = [{
name = "Loki"
type = "loki"
access = "proxy"
url = "http://loki.monitoring.svc.cluster.local:3100"
isDefault = false
}]
})
}
}
```
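After apply, the sidecar should pick the ConfigMap up; a rough check against Grafana's datasource API (the deployment name, admin credential, and `$GRAFANA_PASSWORD` variable are assumptions):
```bash
# Confirm Grafana provisioned the datasource
# (assumes the admin password is exported locally as $GRAFANA_PASSWORD)
kubectl exec -n monitoring deploy/grafana -- \
  wget -qO- "http://admin:$GRAFANA_PASSWORD@localhost:3000/api/datasources" | grep -i loki
```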
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```
---
### Task 7: Add Loki alert rules ConfigMap
Create the ConfigMap that Loki's ruler reads for alert rules. Mounted into the Loki pod at `/loki/rules/fake/`.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)
**Step 1: Add the alert rules ConfigMap**
Add to `loki.tf`:
```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
metadata {
name = "loki-alert-rules"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
data = {
"rules.yaml" = yamlencode({
groups = [{
name = "log-alerts"
rules = [
{
alert = "HighErrorRate"
expr = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
for = "5m"
labels = {
severity = "warning"
}
annotations = {
summary = "High error rate in {{ $labels.namespace }}"
}
},
{
alert = "PodCrashLoopBackOff"
expr = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
for = "1m"
labels = {
severity = "critical"
}
annotations = {
summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
}
},
{
alert = "OOMKilled"
expr = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
for = "1m"
labels = {
severity = "critical"
}
annotations = {
summary = "OOMKilled detected in {{ $labels.namespace }}"
}
}
]
}]
})
}
}
```
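Once the ConfigMap is mounted and Loki restarts, the ruler API (enabled via `enable_api: true` in Task 2) should list the group:
```bash
# Verify the ruler loaded the mounted rules
kubectl exec -n monitoring deploy/loki -- \
  wget -qO- http://localhost:3100/loki/api/v1/rules
# Expected: a log-alerts group containing the three rules above
```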
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```
---
### Task 8: Deploy and verify
Apply all changes via Terraform and verify the stack is working.
**Files:** None (deployment only)
**Step 1: Run terraform apply for monitoring module**
Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)
**Step 2: Verify sysctl DaemonSet is running on all nodes**
Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5
**Step 3: Verify Loki pod is running**
Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running
**Step 4: Verify Alloy DaemonSet is running**
Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5
**Step 5: Verify Loki is receiving logs**
Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'`
Expected: JSON response with labels like `namespace`, `pod`, `container`
**Step 6: Verify Grafana has Loki datasource**
Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods
**Step 7: Commit final state**
```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```
---
### Troubleshooting
**If Alloy pods crash with inotify errors:**
- Check sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify sysctl values on node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`
**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently
**If Grafana doesn't show Loki datasource:**
- Verify ConfigMap has correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart Grafana sidecar: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`
**If Loki PV won't bind:**
- Check NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run NFS export script: `cd secrets && bash nfs_exports.sh`


@@ -1,154 +0,0 @@
# Multi-User Kubernetes Access Design
**Date**: 2026-02-17
**Status**: Approved
## Problem
The cluster uses a single `kubernetes-admin` client certificate for all access. There is no way to:
- Give different users different levels of access
- Track who performed which actions
- Enforce resource limits per user
- Onboard new users without sharing admin credentials
## Decision
Native OIDC authentication on the kube-apiserver using Authentik as the identity provider, with Terraform-managed RBAC and a self-service Svelte portal for user onboarding.
### Alternatives Considered
1. **Pinniped (Concierge + Supervisor)**: Avoids API server changes but adds two components to maintain. Requires Pinniped CLI on user machines. Overkill for a single-cluster setup.
2. **kube-oidc-proxy**: Avoids API server changes but adds a proxy in the request path (single point of failure, extra latency). Sporadic maintenance from JetStack.
## Architecture
```
User → Self-Service Portal → Authentik Login → Download Kubeconfig
User → kubectl (with kubelogin) → kube-apiserver → OIDC validation → Authentik
RBAC evaluation
Audit logging → Alloy → Loki → Grafana
```
### User Roles
| Role | Scope | Access |
|------|-------|--------|
| `admin` | Cluster-wide | Full `cluster-admin` access |
| `power-user` | Cluster-wide | Deploy/manage workloads, view all resources, no RBAC/node modification |
| `namespace-owner` | Specific namespaces | Full `admin` within assigned namespaces only |
## Components
### 1. Authentik OIDC Provider
New OAuth2/OIDC application in Authentik configured via Terraform (`modules/kubernetes/authentik/`).
- **Application name**: `kubernetes`
- **Provider type**: OAuth2/OpenID Connect
- **Client type**: Public (no client secret, kubelogin uses PKCE)
- **Redirect URIs**: `http://localhost:8000/callback` (kubelogin default)
- **Scopes**: `openid`, `email`, `profile`, `groups`
- **Property mappings**: Include `groups` claim for RBAC group matching
### 2. kube-apiserver OIDC Flags
One-time change on k8s-master (`10.0.20.100`), automated via Terraform `null_resource` with `remote-exec`.
Added to `/etc/kubernetes/manifests/kube-apiserver.yaml`:
```yaml
- --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
- --oidc-client-id=kubernetes
- --oidc-username-claim=email
- --oidc-groups-claim=groups
```
Kubelet auto-restarts the API server pod when the manifest changes. These flags persist through `kubeadm upgrade apply`.
### 3. RBAC (Terraform-managed)
New module: `modules/kubernetes/rbac/main.tf`
**User definition** in `terraform.tfvars`:
```hcl
k8s_users = {
"viktor" = {
role = "admin"
email = "viktor@viktorbarzin.me"
}
"alice" = {
role = "power-user"
email = "alice@example.com"
}
"bob" = {
role = "namespace-owner"
namespaces = ["bob-apps", "bob-dev"]
email = "bob@example.com"
}
}
```
**Resources created per role:**
| Role | Terraform Resources |
|------|-------------------|
| `admin` | `ClusterRoleBinding` → built-in `cluster-admin` ClusterRole for user email |
| `power-user` | Custom `ClusterRole` (workload management, no RBAC/node access) + `ClusterRoleBinding` |
| `namespace-owner` | `Namespace`(s) + `RoleBinding` → built-in `admin` ClusterRole + `ResourceQuota` per namespace |
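The resulting bindings can be spot-checked with kubectl's impersonation flags — a sketch using the example users from the tfvars snippet above:
```bash
# Verify effective permissions via impersonation (example users from above)
kubectl auth can-i create deployments --as=alice@example.com -n default   # power-user: yes
kubectl auth can-i create clusterroles --as=alice@example.com             # power-user: no
kubectl auth can-i create deployments --as=bob@example.com -n bob-apps    # namespace-owner: yes
kubectl auth can-i create deployments --as=bob@example.com -n default     # namespace-owner: no
```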
### 4. Self-Service Portal
Svelte (SvelteKit) app at `https://k8s-portal.viktorbarzin.me`.
**Flow:**
1. User visits portal → Authentik login via Traefik forward auth
2. Portal displays user's role and assigned namespaces
3. User downloads pre-configured kubeconfig (generated server-side)
4. Portal shows setup instructions (install kubectl + kubelogin)
**Kubeconfig template** includes:
- Cluster: `https://10.0.20.100:6443` with CA cert
- Auth: `exec` credential plugin pointing to kubelogin
- OIDC issuer URL and client ID pre-configured
**Deployment**: Standard Kubernetes deployment + service + ingress, Terraform-managed like other services. No database needed — user role info read from Kubernetes RBAC bindings or a Terraform-generated ConfigMap.
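Client-side, the downloaded kubeconfig boils down to an exec credential entry for kubelogin. Expressed as kubectl commands, it looks roughly like this sketch (assumes the int128/kubelogin plugin installed as `kubectl oidc-login`, and a cluster entry named `home`):
```bash
# Equivalent of the generated kubeconfig's user entry
kubectl config set-credentials oidc \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/ \
  --exec-arg=--oidc-client-id=kubernetes
kubectl config set-context oidc@home --cluster=home --user=oidc
```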
### 5. Audit Logging
Kubernetes audit policy deployed to master via the same `null_resource`.
**Policy** (`/etc/kubernetes/audit-policy.yaml`):
- `RequestResponse` level for OIDC-authenticated users (captures what they changed)
- `Metadata` level for system/service accounts (keeps volume down)
- Secrets logged at `Metadata` level only (no request/response bodies)
**Log pipeline**: Audit log file → Alloy (DaemonSet on master) → Loki → Grafana dashboard
**Grafana dashboard** shows: who accessed what resource, when, from where, and the outcome (allow/deny).
### 6. Resource Quotas
Each namespace-owner namespace gets a `ResourceQuota`:
```hcl
requests.cpu = "2"
requests.memory = "4Gi"
limits.cpu = "4"
limits.memory = "8Gi"
pods = "20"
```
Defaults can be overridden per-user via an optional `quota` field in the `k8s_users` variable.
## Implementation Order
1. Authentik OIDC application setup
2. kube-apiserver OIDC flag configuration
3. RBAC Terraform module
4. Audit logging
5. Self-service portal
6. Grafana dashboard for audit logs

File diff suppressed because it is too large


@@ -1,111 +0,0 @@
# OpenClaw Cluster Management Agent — Design
**Date**: 2026-02-21
**Status**: Approved
## Goal
Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.
## Architecture
```
CronJob (every 30min)
└─ kubectl exec into OpenClaw pod
└─ /workspace/infra/.claude/cluster-health.sh
├─ kubectl get nodes (check health)
├─ kubectl get pods -A (find problems)
├─ kubectl delete pod (evicted/stuck)
└─ curl Slack webhook (report)
```
Interactive path: User asks OpenClaw via UI -> `cluster-health` skill triggers -> runs same script -> LLM analyzes output and can do deeper investigation.
## Components
### 1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)
Teaches OpenClaw:
- What health checks to run
- What's safe to auto-fix vs alert-only
- How to format Slack alerts
- How to do deeper investigation when asked interactively
Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.
### 2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)
Reusable script that performs all checks:
**Checks:**
- Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
- Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
- Evicted pods
- Failed deployments (unavailable replicas)
- Pending PVCs
- Resource pressure (high CPU/memory allocation)
- Failed CronJobs
- DaemonSet health (missing pods)
**Safe auto-fix actions:**
- Delete evicted pods
- Delete completed/succeeded pods older than 24h
- Restart (delete) pods in CrashLoopBackOff for more than 1 hour
**Alert-only (never auto-fix):**
- Node NotReady
- Persistent OOMKilled
- ImagePullBackOff
- Pending PVCs
- Failed deployments with 0 available replicas
**Output:**
- Structured text summary
- Posts to Slack via webhook
- Exit code 0 = healthy, 1 = issues found
### 3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)
- Schedule: `*/30 * * * *`
- Container: `bitnami/kubectl` (minimal image with kubectl)
- Command: `kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh`
- ServiceAccount with RBAC to exec into pods in `openclaw` namespace
- `concurrencyPolicy: Forbid`
- `failedJobsHistoryLimit: 3`
- `successfulJobsHistoryLimit: 3`
### 4. Slack Integration
- Webhook URL from `openclaw_skill_secrets["slack"]` (already configured)
- Passed as `SLACK_WEBHOOK_URL` env var to the OpenClaw pod
## Slack Message Format
```
:white_check_mark: Cluster Health Check — All Clear
Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues
```
```
:warning: Cluster Health Check — 3 Issues Found
Auto-fixed:
- Deleted 4 evicted pods in monitoring namespace
- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)
Needs attention:
- Node k8s-node3: MemoryPressure condition detected
- PVC data-tandoor pending for 45 minutes
```
## Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Mode | Proactive (scheduled) | Want automated monitoring |
| Alert channel | Slack | Existing webhook in openclaw_skill_secrets |
| Auto-fix | Safe fixes only | Delete evicted, restart stuck; alert for the rest |
| Frequency | 30 minutes | Balance between detection speed and overhead |
| Checks scope | Standard K8s health | Pod/node/deployment/PVC/CronJob/DaemonSet |
| Trigger mechanism | CronJob execs into OpenClaw pod | Reuses OpenClaw's tools; LLM available interactively |
| Fallback | None | Uptime Kuma monitors OpenClaw availability |


@@ -1,800 +0,0 @@
# OpenClaw Cluster Management Agent — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via `kubectl exec`.
**Architecture:** CronJob (bitnami/kubectl) -> `kubectl exec` into OpenClaw pod -> runs `cluster-health.sh` which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.
**Tech Stack:** Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl
---
### Task 1: Add Slack webhook to openclaw_skill_secrets
**Files:**
- Modify: `terraform.tfvars:1291-1295` (add slack_webhook key)
- Modify: `modules/kubernetes/openclaw/main.tf:350-376` (add SLACK_WEBHOOK_URL env var)
**Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars**
Add a new key `slack_webhook` to the existing `openclaw_skill_secrets` map. The user must provide the webhook URL. For now, use the existing `alertmanager_slack_api_url` value or a dedicated one.
In `terraform.tfvars`, change:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
}
```
to:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
slack_webhook = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
```
**NOTE:** Ask the user which Slack webhook URL to use. Candidates:
- `alertmanager_slack_api_url` (line 4 in tfvars)
- `tiny_tuya_slack_url` (line 1213, comment says "K8s bot slack")
- A new webhook the user creates
**Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container**
In `modules/kubernetes/openclaw/main.tf`, add after the `UPTIME_KUMA_PASSWORD` env block (around line 370):
```hcl
# Skill secrets - Slack
env {
name = "SLACK_WEBHOOK_URL"
value = var.skill_secrets["slack_webhook"]
}
```
**Step 3: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
```
Do NOT commit `terraform.tfvars` separately — it will be committed with the full set of changes at the end.
---
### Task 2: Create the cluster-health.sh helper script
**Files:**
- Create: `.claude/cluster-health.sh`
**Step 1: Write the health check script**
Create `.claude/cluster-health.sh` with the following structure. The script:
- Uses `$KUBECONFIG` (already set in OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods stuck >1 hour
- Posts structured Slack message via `$SLACK_WEBHOOK_URL`
- Exit code 0 = healthy, 1 = issues found
```bash
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
# --no-slack Skip Slack notification (useful for interactive/debug runs)
# --no-fix Skip auto-fix actions (report only)
set -euo pipefail
SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()
# --- Argument parsing ---
for arg in "$@"; do
case "$arg" in
--no-slack) SEND_SLACK=false ;;
--no-fix) AUTO_FIX=false ;;
esac
done
KUBECTL="kubectl"
# --- 1. Node Health ---
check_nodes() {
local nodes not_ready
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)
if [[ -n "$not_ready" ]]; then
while IFS= read -r node; do
ISSUES+=("Node NotReady: $node")
done <<< "$not_ready"
fi
# Check conditions
local conditions
conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
name = node["metadata"]["name"]
for c in node["status"]["conditions"]:
if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
print(name + ": " + c["type"])
' 2>/dev/null) || true
if [[ -n "$conditions" ]]; then
while IFS= read -r line; do
ISSUES+=("$line")
done <<< "$conditions"
fi
}
# --- 2. Pod Health ---
check_pods() {
local bad
bad=$( {
$KUBECTL get pods -A --no-headers 2>/dev/null \
| grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
} | awk '!seen[$1,$2]++' | sed '/^$/d') || true
if [[ -z "$bad" ]]; then return; fi
while IFS= read -r line; do
local ns pod status
ns=$(echo "$line" | awk '{print $1}')
pod=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $4}')
if [[ "$status" == "CrashLoopBackOff" ]]; then
# Restart count is used as a proxy for being stuck (>10 restarts ≈ >1h of backoff)
local restart_count
restart_count=$(echo "$line" | awk '{print $5}')
if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
$KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
WARNINGS+=("Failed to restart $ns/$pod")
else
ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
fi
elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
ISSUES+=("ImagePullBackOff: $ns/$pod")
else
ISSUES+=("Error: $ns/$pod ($status)")
fi
done <<< "$bad"
}
# --- 3. Evicted/Failed Pods ---
check_evicted() {
local evicted count
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
if [[ -z "$evicted" ]]; then return; fi
count=$(echo "$evicted" | wc -l | tr -d ' ')
if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
$KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
FIXES+=("Deleted $count evicted/failed pod(s)") || \
WARNINGS+=("Failed to delete evicted pods")
else
ISSUES+=("$count evicted/failed pod(s)")
fi
}
# --- 4. Failed Deployments ---
check_deployments() {
local deps
deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name ready current desired
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
ready=$(echo "$line" | awk '{print $3}')
current=$(echo "$ready" | cut -d/ -f1)
desired=$(echo "$ready" | cut -d/ -f2)
if [[ "$current" != "$desired" ]]; then
ISSUES+=("Deployment $ns/$name: $current/$desired ready")
fi
done <<< "$deps"
}
# --- 5. Pending PVCs ---
check_pvcs() {
local pvcs
pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return
if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi
while IFS= read -r line; do
local ns name status
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $3}')
if [[ "$status" != "Bound" ]]; then
ISSUES+=("PVC $ns/$name: $status")
fi
done <<< "$pvcs"
}
# --- 6. Resource Pressure ---
check_resources() {
local top
top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local node cpu_pct mem_pct
node=$(echo "$line" | awk '{print $1}')
cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
[[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue
if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
fi
done <<< "$top"
}
# --- 7. CronJob Failures ---
check_cronjobs() {
local failures
failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
for job in data.get("items", []):
meta = job.get("metadata", {})
ns = meta.get("namespace", "")
name = meta.get("name", "")
owners = meta.get("ownerReferences", [])
if not any(o.get("kind") == "CronJob" for o in owners):
continue
for c in job.get("status", {}).get("conditions", []):
if c.get("type") == "Failed" and c.get("status") == "True":
ts = c.get("lastTransitionTime", "")
if ts:
try:
t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
if t > cutoff:
print(f"{ns}/{name}")
except:
print(f"{ns}/{name}")
' 2>/dev/null) || true
if [[ -n "$failures" ]]; then
local count
count=$(echo "$failures" | wc -l | tr -d ' ')
ISSUES+=("$count CronJob failure(s) in last 24h")
fi
}
# --- 8. DaemonSet Health ---
check_daemonsets() {
local ds
ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name desired ready
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
desired=$(echo "$line" | awk '{print $3}')
ready=$(echo "$line" | awk '{print $5}')
if [[ "$desired" != "$ready" ]]; then
ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
fi
done <<< "$ds"
}
# --- Cluster summary stats ---
get_summary_stats() {
local node_count ready_count pod_count
node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}
# --- Send Slack message ---
send_slack() {
local webhook_url="${SLACK_WEBHOOK_URL:-}"  # default to empty: may be unset under set -u
if [[ -z "${webhook_url:-}" ]]; then
echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
return
fi
local summary issue_count fix_count warning_count
summary=$(get_summary_stats)
issue_count=${#ISSUES[@]}
fix_count=${#FIXES[@]}
warning_count=${#WARNINGS[@]}
local text=""
local total_problems=$((issue_count + warning_count))
if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
else
if [[ "$issue_count" -gt 0 ]]; then
text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
elif [[ "$warning_count" -gt 0 ]]; then
text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
else
text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
fi
if [[ "$fix_count" -gt 0 ]]; then
text+="\n\n*Auto-fixed:*"
for fix in "${FIXES[@]}"; do
text+="\n• ${fix}"
done
fi
if [[ "$issue_count" -gt 0 ]]; then
text+="\n\n*Needs attention:*"
for issue in "${ISSUES[@]}"; do
text+="\n• ${issue}"
done
fi
if [[ "$warning_count" -gt 0 ]]; then
text+="\n\n*Warnings:*"
for warning in "${WARNINGS[@]}"; do
text+="\n• ${warning}"
done
fi
fi
curl -s -X POST "$webhook_url" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"${text}\"}" > /dev/null 2>&1
}
# --- Main ---
main() {
echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="
check_nodes
check_pods
check_evicted
check_deployments
check_pvcs
check_resources
check_cronjobs
check_daemonsets
local issue_count=${#ISSUES[@]}
local fix_count=${#FIXES[@]}
local warning_count=${#WARNINGS[@]}
echo ""
echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"
if [[ "$fix_count" -gt 0 ]]; then
echo ""
echo "Auto-fixed:"
for fix in "${FIXES[@]}"; do echo " - $fix"; done
fi
if [[ "$issue_count" -gt 0 ]]; then
echo ""
echo "Issues:"
for issue in "${ISSUES[@]}"; do echo " - $issue"; done
fi
if [[ "$warning_count" -gt 0 ]]; then
echo ""
echo "Warnings:"
for warning in "${WARNINGS[@]}"; do echo " - $warning"; done
fi
if [[ "$SEND_SLACK" == true ]]; then
send_slack
echo ""
echo "Slack notification sent."
fi
# Exit code
if [[ "$issue_count" -gt 0 ]]; then
exit 1
fi
exit 0
}
main "$@"
```
**Step 2: Make it executable**
```bash
chmod +x .claude/cluster-health.sh
```
**Step 3: Test locally (dry run)**
```bash
KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack
```
Expected: Script runs, prints check results, no Slack post.
**Step 4: Commit**
```bash
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
```
---
### Task 3: Create the cluster-health skill
**Files:**
- Create: `.claude/skills/cluster-health/SKILL.md`
**Step 1: Write the skill document**
````markdown
---
name: cluster-health
description: |
Check Kubernetes cluster health and fix common issues. Use when:
(1) User asks to check the cluster, check health, or "what's wrong",
(2) User asks about pod status, node health, or deployment issues,
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Cluster Health Check
## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)
## Quick Check
Run the health check script:
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-slack
```
Or with Slack notification:
```bash
bash /workspace/infra/.claude/cluster-health.sh
```
Report-only (no auto-fix):
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-fix
```
## What It Checks
| # | Check | Auto-Fix | Alert |
|---|-------|----------|-------|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80%) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |
## Safe Auto-Fix Rules
These are the ONLY things the script auto-fixes:
1. **Evicted/failed pods**: `kubectl delete pods -A --field-selector=status.phase=Failed`
2. **CrashLoopBackOff pods with >10 restarts**: `kubectl delete pod -n <ns> <pod> --grace-period=30`
Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)
## Deep Investigation
When the script reports issues and the user asks for more detail, use these commands:
### Node issues
```bash
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
```
### Pod issues
```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
### Deployment issues
```bash
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
```
### PVC issues
```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
```
### Resource pressure
```bash
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
```
## Common Remediation
### CrashLoopBackOff (persistent)
1. Check logs: `kubectl logs -n <ns> <pod> --previous --tail=100`
2. Check events: `kubectl describe pod -n <ns> <pod>`
3. Common causes: OOMKilled (increase memory limit), bad config, missing env var
4. If image issue: check if newer image exists, update in Terraform
### OOMKilled
1. Check current limits: `kubectl describe pod -n <ns> <pod> | grep -A2 Limits`
2. Fix: Update resource limits in Terraform module for the service
3. Apply: `terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"`
### ImagePullBackOff
1. Check image: `kubectl describe pod -n <ns> <pod> | grep Image`
2. Check registry: Is the image tag valid? Is the registry reachable?
3. Check pull-through cache: Docker registry at 10.0.20.10
### Node NotReady
1. Check kubelet: SSH to node, `systemctl status kubelet`
2. Check resources: `kubectl top node <node>`
3. Check conditions: `kubectl describe node <node> | grep -A10 Conditions`
## Slack Webhook
Messages are posted to the webhook at `$SLACK_WEBHOOK_URL`. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics
## Infrastructure
- **Terraform module**: `modules/kubernetes/openclaw/main.tf`
- **CronJob**: Runs in `openclaw` namespace every 30 min
- **Existing healthcheck**: `scripts/cluster_healthcheck.sh` (local-only, not for OpenClaw)
- **Repo path inside pod**: `/workspace/infra/`
````
**Step 2: Commit**
```bash
git add .claude/skills/cluster-health/SKILL.md
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
```
---
### Task 4: Add CronJob and RBAC to Terraform
**Files:**
- Modify: `modules/kubernetes/openclaw/main.tf` (append CronJob + ServiceAccount + Role + RoleBinding)
**Step 1: Add CronJob resources**
Append the following to `modules/kubernetes/openclaw/main.tf` after the `module "ingress"` block:
```hcl
# --- CronJob: Scheduled cluster health check ---
resource "kubernetes_service_account" "healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
}
resource "kubernetes_role" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["get", "list"]
}
rule {
api_groups = [""]
resources = ["pods/exec"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.healthcheck.metadata[0].name
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.healthcheck_exec.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
labels = {
app = "cluster-healthcheck"
tier = var.tier
}
}
spec {
schedule = "*/30 * * * *"
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
active_deadline_seconds = 300
template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
restart_policy = "Never"
container {
name = "healthcheck"
image = "bitnami/kubectl:1.34"
command = ["bash", "-c", <<-EOF
# Find the openclaw pod
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [ -z "$POD" ]; then
echo "ERROR: OpenClaw pod not found"
exit 1
fi
echo "Executing health check in pod $POD..."
kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
EOF
]
resources {
requests = {
cpu = "50m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
}
}
}
}
}
}
}
```
**Step 2: Verify Terraform formatting**
```bash
terraform fmt modules/kubernetes/openclaw/main.tf
```
**Step 3: Verify Terraform plan**
```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
```
Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.
**Step 4: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
```
---
### Task 5: Deploy and verify
**Step 1: Apply Terraform**
```bash
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
```
**Step 2: Verify CronJob exists**
```bash
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
```
Expected: `cluster-healthcheck` with schedule `*/30 * * * *`
**Step 3: Verify RBAC**
```bash
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
```
Expected: `cluster-healthcheck` SA, `healthcheck-pod-exec` role and rolebinding
**Step 4: Trigger a manual run**
```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
```
**Step 5: Check job output**
```bash
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
```
Expected: Health check output with results. If `SLACK_WEBHOOK_URL` is set, check Slack for the message.
**Step 6: Clean up test job**
```bash
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
```
**Step 7: Final commit**
```bash
git add -A modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"
```


@@ -1,387 +0,0 @@
# Terragrunt Migration Design
**Date**: 2026-02-22
**Status**: Approved
## Problem
The infrastructure repo has a monolithic Terraform setup:
- 15MB state file, 857 resources, 85+ service modules in a single root
- `terraform plan/apply` evaluates all modules even when targeting one service
- `null_resource.core_services` bottleneck blocks 73 services behind 12 core modules
- 150+ variables passed through root -> kubernetes_cluster -> individual services
- 3 providers (kubernetes, helm, proxmox) initialize on every run
## Goals
- **Speed**: Faster plan/apply by splitting state into independent stacks
- **Blast radius isolation**: Bad apply can't break unrelated services
- **DRY config**: Shared provider/backend configuration via Terragrunt
- **Proper DAG**: Full references between stacks (not hardcoded DNS strings)
- **Bootstrappable**: `terragrunt run-all apply` works from scratch
- **CI/CD**: Changed-stack detection in Drone CI
## Architecture: Flat Stacks
### Directory Structure
```
infra/
├── terragrunt.hcl # Root config (providers, backend, common vars)
├── stacks/
│ ├── infra/ # Proxmox VMs, templates, docker-registry
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── platform/ # Core: traefik, metallb, redis, dbaas, authentik, etc.
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── blog/ # One dir per user service
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── immich/
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ └── ... (~65 service dirs)
├── modules/ # UNCHANGED — existing modules stay where they are
│ ├── kubernetes/
│ │ ├── ingress_factory/
│ │ ├── setup_tls_secret/
│ │ ├── blog/
│ │ ├── immich/
│ │ └── ...
│ ├── create-vm/
│ └── create-template-vm/
├── state/ # Per-stack state files
│ ├── infra/terraform.tfstate
│ ├── platform/terraform.tfstate
│ ├── blog/terraform.tfstate
│ └── ...
├── terraform.tfvars # UNCHANGED — encrypted secrets
├── secrets/ # UNCHANGED — TLS certs
├── main.tf # LEGACY — gradually emptied during migration
└── terraform.tfstate # LEGACY — gradually emptied during migration
```
Each stack has a thin `main.tf` wrapper that calls the existing module via
`source = "../../modules/kubernetes/<service>"`. We do NOT use Terragrunt's
`terraform { source }` directive because our modules use relative paths
(`../ingress_factory`, `../setup_tls_secret`) that would break when Terragrunt
copies them to `.terragrunt-cache/`.
### Stack Composition
**Infra stack** (~10 resources):
- Proxmox VM templates (k8s, non-k8s, docker-registry)
- Docker registry VM
- Uses proxmox provider (not kubernetes/helm)
**Platform stack** (~200 resources, ~20 services):
- traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
- monitoring (prometheus, alertmanager, grafana, loki, alloy)
- kyverno, metrics-server, nvidia, mailserver, authelia
- wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
- Exports outputs consumed by service stacks
**Per-service stacks** (~65, each 5-25 resources):
- One stack per user-facing service
- Each depends on platform via Terragrunt `dependency` block
- Some depend on other services (f1-stream -> coturn, etc.)
### Dependency Graph
```
┌─────────┐
│ infra │
└────┬────┘
┌────▼────┐
│platform │ exports: redis_host, postgresql_host,
│ │ mysql_host, smtp_host, tls_secret_name, ...
└────┬────┘
┌────────┬───────────┼───────────┬────────┐
│ │ │ │ │
┌────▼──┐ ┌───▼───┐ ┌────▼───┐ ┌─────▼──┐ ┌──▼───┐
│ blog │ │immich │ │ affine │ │ollama │ │coturn│ ...
└───────┘ └───────┘ └────────┘ └───┬────┘ └──┬───┘
│ │
┌────▼───┐ ┌───▼──────┐
│openclaw│ │f1-stream │
│gramps │ └──────────┘
│ytdlp │
└────────┘
```
### Platform Stack Outputs
| Output | Value | Consumers |
|--------|-------|-----------|
| `redis_host` | `redis.redis.svc.cluster.local` | 10 services |
| `postgresql_host` | `postgresql.dbaas.svc.cluster.local` | 10 services |
| `postgresql_port` | `5432` | 10 services |
| `mysql_host` | `mysql.dbaas.svc.cluster.local` | 8 services |
| `mysql_port` | `3306` | 8 services |
| `smtp_host` | `mail.viktorbarzin.me` | 6 services |
| `smtp_port` | `587` | 6 services |
| `tls_secret_name` | from variable | all services |
| `authentik_outpost_url` | `http://ak-outpost-...` | traefik |
| `crowdsec_lapi_host` | `crowdsec-service...` | traefik |
| `alertmanager_url` | `http://prometheus-alertmanager...` | loki |
| `loki_push_url` | `http://loki...` | alloy |
Service-to-service dependencies:
| Service | Depends on | Outputs consumed |
|---------|-----------|-----------------|
| f1-stream | coturn | `coturn_host`, `coturn_port` |
| real-estate-crawler | osm-routing | `osrm_foot_host`, `osrm_bicycle_host` |
| openclaw, grampsweb, ytdlp | ollama | `ollama_host` |
### Module Modifications
Service modules that hardcode DNS names need modification to accept hosts as variables.
~20 modules affected. Example for affine:
**Before:**
```hcl
# modules/kubernetes/affine/main.tf
DATABASE_URL = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
```
**After:**
```hcl
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
DATABASE_URL = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
REDIS_SERVER_HOST = var.redis_host
```
## Root Terragrunt Configuration
```hcl
# infra/terragrunt.hcl
remote_state {
backend = "local"
generate = {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
}
config = {
path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
}
}
terraform {
extra_arguments "common_vars" {
commands = get_terraform_commands_that_need_vars()
required_var_files = [
"${get_repo_root()}/terraform.tfvars"
]
}
}
generate "k8s_providers" {
path = "providers.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
variable "kube_config_path" {
type = string
default = "~/.kube/config"
}
provider "kubernetes" {
config_path = var.kube_config_path
}
provider "helm" {
kubernetes {
config_path = var.kube_config_path
}
}
EOF
}
```
## Stack Wrapper Examples
### Simple service (blog)
```hcl
# stacks/blog/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
}
```
```hcl
# stacks/blog/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
module "blog" {
source = "../../modules/kubernetes/blog"
tls_secret_name = var.tls_secret_name
tier = "4-aux"
}
```
### Database-backed service (affine)
```hcl
# stacks/affine/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
redis_host = dependency.platform.outputs.redis_host
postgresql_host = dependency.platform.outputs.postgresql_host
postgresql_port = dependency.platform.outputs.postgresql_port
smtp_host = dependency.platform.outputs.smtp_host
smtp_port = dependency.platform.outputs.smtp_port
}
```
```hcl
# stacks/affine/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "affine_postgresql_password" {}
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }
module "affine" {
source = "../../modules/kubernetes/affine"
tls_secret_name = var.tls_secret_name
postgresql_password = var.affine_postgresql_password
redis_host = var.redis_host
postgresql_host = var.postgresql_host
postgresql_port = var.postgresql_port
smtp_host = var.smtp_host
smtp_port = var.smtp_port
tier = "4-aux"
}
```
### Service-to-service dependency (f1-stream -> coturn)
```hcl
# stacks/f1-stream/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
dependency "coturn" {
config_path = "../coturn"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
coturn_host = dependency.coturn.outputs.coturn_host
coturn_port = dependency.coturn.outputs.coturn_port
}
```
## Migration Strategy
### Phase 0: Setup
- Install Terragrunt
- Create root `terragrunt.hcl`, `stacks/`, `state/` directories
- No state changes, no risk
### Phase 1: Infra Stack (VMs)
- Create `stacks/infra/` with Proxmox provider + VM module calls
- `terraform state mv` 4 root-level module resources to `state/infra/`
- Remove from root `main.tf`
- Verify: `cd stacks/infra && terragrunt plan` shows no changes
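Each phase repeats the same move pattern; a sketch for a single Phase 1 move (the resource address is illustrative — real addresses come from `terraform state list`):
```bash
# Move one root-level module from the monolith state into the infra stack
# (-state/-state-out select the source and destination state files)
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.docker_registry_vm' 'module.docker_registry_vm'
```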
### Phase 2: Platform Stack (Core Services)
- Create `stacks/platform/main.tf` with ~20 core services + outputs
- `terraform state mv` ~200 resources from `module.kubernetes_cluster.module.<core>`
- Remove `null_resource.core_services` (Terragrunt handles ordering)
- Verify: `cd stacks/platform && terragrunt plan` shows no changes
### Phase 3: Simple Services (No DB Dependencies)
- blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
- Create stack, move state, verify — one at a time
### Phase 4: Database-Backed Services
- Modify modules to accept hosts as variables
- affine, immich, linkwarden, nextcloud, grampsweb, etc.
- Create stack, move state, verify
### Phase 5: Service-to-Service Dependencies
- ollama -> openclaw, grampsweb, ytdlp
- coturn -> f1-stream
- osm-routing -> real-estate-crawler
### Phase 6: Cleanup
- Delete DEFCON system from `modules/kubernetes/main.tf`
- Delete legacy `terraform.tfstate`
- Delete root `main.tf` kubernetes_cluster module call
- Update CI/CD to Terragrunt
### Rollback
At any phase, `terraform state mv` resources back to monolith state and
restore module calls.
## CI/CD: Changed-Stack Detection
Drone CI pipeline detects changed files per commit and maps to affected stacks:
| Changed file | Affected stack |
|-------------|---------------|
| `stacks/blog/*` | blog |
| `modules/kubernetes/blog/*` | blog |
| `terraform.tfvars` | all stacks |
| `terragrunt.hcl` | all stacks |
| `modules/kubernetes/ingress_factory/*` | all stacks |
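Expressed as a shell step in the pipeline, the mapping above might look like this sketch (exact paths and the `ALL` sentinel are assumptions):
```bash
# Map changed files to affected stacks; "ALL" means run-all
git diff --name-only HEAD~1 HEAD | while read -r f; do
  case "$f" in
    terraform.tfvars|terragrunt.hcl|modules/kubernetes/ingress_factory/*) echo ALL ;;
    stacks/*)             echo "$f" | cut -d/ -f2 ;;
    modules/kubernetes/*) echo "$f" | cut -d/ -f3 ;;
  esac
done | sort -u
```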
### Manual Workflow
```bash
# Apply single service
cd stacks/blog && terragrunt apply
# Apply everything (respects DAG ordering)
cd stacks && terragrunt run-all apply
# Plan everything
cd stacks && terragrunt run-all plan
```
## Decisions Made
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tool | Terragrunt | DRY config, dependency management, run-all orchestration |
| Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core |
| Migration | Incremental | Lower risk, verify each step |
| Shared modules | Relative paths | Simple, no registry overhead |
| State backend | Local files | No external dependencies |
| Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch |
| CI/CD | Changed-stack detection | Only apply what changed |

File diff suppressed because it is too large