6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
terragrunt generates backend.tf per run (remote_state generate,
if_exists=overwrite_terragrunt) from get_env("PG_CONN_STR"); these 72 committed
copies are stale artifacts already covered by .gitignore:65. They held a
plaintext (Vault-rotated, ~expired) PG password + the .200 state-backend literal
and were re-committed by CI on every run. git rm --cached stops that; they
regenerate locally from PG_CONN_STR. The live .200:5432 literal now lives only
in scripts/tg (its single bootstrap source).
Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Goal: re-clone the worker template, boot, and have it appear as `kubectl
get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker
NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five
distinct ways on a clean boot.
Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today):
1. `indent(6, containerd_config_update_command)` indented the bodies of
`cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so
[plugins.*] TOML sections landed in /etc/containerd/config.toml at
col 6 — containerd refused to parse them. Source is now a normal
.sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`)
base64-embedded into `write_files`; YAML whitespace never touches
the heredoc bodies.
2. The same script tried to `cat >> /etc/containerd/config.toml`
`[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd
v2.2.4's `config default` ALREADY emits. Result: `toml: table …
already exists`. Patched with sed-in-place overrides instead.
3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from
the containerd setup script — BEFORE `kubeadm join` writes that
file. Sed aborted with "No such file or directory", `set -e` killed
the script, post-script cloud-init steps kept going (cloud-init
doesn't stop on runcmd failure). Split into a dedicated
`k8s-node-post-join-tune.sh` invoked AFTER kubeadm join.
4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE
kubeadm join. kubelet defaults to failSwapOn=true → exited 1
immediately. Replaced the swap setup with `swapoff -a` (node4
already runs this way and the cluster is fine).
5. Without `hostname:` in the shared user-data snippet, Proxmox's
auto-generated meta-data does NOT include local-hostname when
`cicustom user=…` is set — so cloud-init falls back to the cloud
image's default `ubuntu` and `kubeadm join` registers the wrong
node name. `provision-k8s-worker` now writes a per-node
`<NAME>-meta.yaml` snippet and passes both via
`cicustom user=…,meta=…`.
Other improvements rolled in while fixing the above:
- `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`,
added today) instead of `var.ssh_public_key`. The last
`terragrunt apply` was run with that var empty, leaving the snippet's
`ssh_authorized_keys` with a single blank entry; the wizard user
was effectively locked out of every fresh node.
- `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf`
with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it,
systemd-resolved only consulted Technitium (link-level), which
returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from
the Forgejo registry then failed DNS until I patched it manually
on node5.
- k8s apt repo bumped v1.32 → v1.34 (matches cluster).
- The containerd setup script now creates hosts.toml for forgejo,
quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4
had these added by hand post-bootstrap; now they're baked in.
- `config_path` sed matches both `""` (containerd v1) and `''`
(containerd v2.x). Without the v2 match, the certs.d mirror dir was
silently ignored.
- `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI
topology labels (region/zone, max-volume-attachments=28) apply on
next `tg apply`.
- `stacks/infra/main.tf` shed the 160-line inline containerd setup
heredoc — that whole thing now lives in the module as a .sh file.
Known unsolved gaps (deferred):
- iscsid restart hangs ~90s on first boot before SIGKILL releases it
(systemd-resolved restart kicks iscsid via dependency). Adds wall-
clock time but doesn't block the join.
- `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi`
afterward, so the CSI topology labels need a manual apply after
the node joins. Solving cleanly needs the CSI map to derive from
`kubectl get nodes` instead of a static local — separate work.
- `var.containerd_config_update_command` is now ignored when
is_k8s_template=true (replaced by the bundled .sh file). Variable
kept with a deprecation note to avoid breaking other call sites.
E2E proof: k8s-node6 (VMID 206) boots hands-off from
`provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as
`kubectl get nodes …Ready` ~7 min later (most of which is the apt
package_upgrade — separate optimization).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached
disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent
fields back from live state — every plan after a qm-set cap is applied
proposes to "fix" mbps 0 → N and the apply errors with the spurious
"the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes
does NOT block either failure mode.
Decision: stop trying to manage Linux VMs in this stack. The cloud-init
bootstrap stays in TF (via k8s-node-template, non-k8s-node-template,
docker-registry-template above), so a fresh node still clones the right
template and runs the same bootstrap. VM lifecycle stays in the Proxmox
UI. I/O caps are managed via qm-set on the PVE host (idempotent script
at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j).
Removed from TF state + HCL:
- module "k8s-master" (vmid 200)
- module "k8s-node2" (vmid 202) — pre-existing drift, never in state
- module "docker-registry-vm" (vmid 220) — was in state, hit refresh bug
Already hand-managed (never in HCL):
- 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough),
203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10.
Live I/O caps (qm set, all verified):
102=60/60 103=40/40 200=100/60 201=150/120 202=150/120
203=150/120 204=150/120 220=40/40
Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox
provider migration — telmate can't represent these VMs at all).
Closes: code-75ds
Two bugs found while rebuilding k8s-node4 (2026-05-26):
1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
interpolated a multi-line heredoc as bare list-item content. The
trailing lines lost their list-item prefix, breaking cloud-config
parsing. Cloud-init silently fell back to the minimal default
(hostname + package_upgrade only) — kubeadm join, containerd config,
kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
in `cloud-init status`.
Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.
2. **containerd v2 single-quote mismatch**: `containerd config default`
in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double).
The sed pattern matched only double quotes → silent no-op on fresh
containerd 2.x nodes → registry-mirror hosts.toml ignored → all image
pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop.
Fix: match any value with `config_path = .*`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
required_providers. Other stacks just don't instantiate a provider
block — harmless. Replaces the same-name override trick the infra
stack used to do, which stopped working under Terragrunt v0.77
("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
writes proxmox_provider.tf with the provider config; credentials
read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
= uncapped), wired into scsi0/scsi1 disk{} blocks as
mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.
WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
provider mangled proxmox-csi PVC slots on apply for vmid 202 and
203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
loss, K8s CSI reconciled PVC attachments within minutes. Removed
the 7 imports from state via `terraform state rm` and re-encrypted.
Tracked in beads code-xzbl: blocked on bpg/proxmox provider
migration (telmate has the same dynamic-disk defect that bit us on
iSCSI back in 2026-04-02; see memory id=539).
LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC):
102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60
201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120
204 k8s-node4 150/120 220 docker-registry 40/40
(pfSense 101 BSD + Windows10 300 intentionally out of scope.)
PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
ever imported into TF state — confirmed against the SOPS-encrypted
state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
This commit leaves both declarations in place but does NOT import
them; that's part of the code-xzbl follow-up.
Closes: code-s9xr
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror dirs
and removes unnecessary containerd restart from the rollout script.
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
- Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me
- Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template
(redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip)
- Enable Authentik protection on priority-pass ingress
- Grant kyverno-admission-controller and kyverno-background-controller
permissions to manage Secrets (required for generate clone rules)
- Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true
(wildcard cert doesn't cover IP SANs) — applied to all nodes + template
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
LimitRange defaults had a 4-8x limit/request ratio causing the scheduler
to overpack nodes. When pods burst, nodes OOM-thrashed and became
unresponsive (k8s-node3 and k8s-node4 both went down today).
Changes:
- Increase default memory requests across all tiers (ratio now 2x):
- core/cluster: 64Mi → 256Mi request (512Mi limit)
- gpu: 256Mi → 1Gi request (2Gi limit)
- edge/aux/fallback: 64Mi → 128Mi request (256Mi limit)
- Add kubelet memory reservation and eviction thresholds:
- systemReserved: 512Mi, kubeReserved: 512Mi
- evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset)
- Applied to all nodes and future node template
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.
Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.
Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images
The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.