infra/stacks/monitoring/main.tf

# =============================================================================
# Monitoring Stack — Prometheus / Grafana / Loki
# =============================================================================

variable "tls_secret_name" { type = string }
variable "nfs_server" { type = string }
variable "mysql_host" { type = string }
variable "monitoring_idrac_username" { type = string }

data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "platform"
}

data "vault_kv_secret_v2" "viktor" {
  mount = "secret"
  name  = "viktor"
}

module "monitoring" {
  source                        = "./modules/monitoring"
  tls_secret_name               = var.tls_secret_name
  nfs_server                    = var.nfs_server
  mysql_host                    = var.mysql_host
  alertmanager_account_password = data.vault_kv_secret_v2.secrets.data["alertmanager_account_password"]
  idrac_username                = var.monitoring_idrac_username
  idrac_password                = data.vault_kv_secret_v2.secrets.data["monitoring_idrac_password"]
  alertmanager_slack_api_url    = data.vault_kv_secret_v2.secrets.data["alertmanager_slack_api_url"]
  tiny_tuya_service_secret      = data.vault_kv_secret_v2.secrets.data["tiny_tuya_service_secret"]
  haos_api_token                = data.vault_kv_secret_v2.secrets.data["haos_api_token"]
  pve_password                  = data.vault_kv_secret_v2.secrets.data["pve_password"]
  grafana_admin_password        = data.vault_kv_secret_v2.secrets.data["grafana_admin_password"]
  kube_config_path              = var.kube_config_path
  registry_user                 = data.vault_kv_secret_v2.viktor.data["registry_user"]
  registry_password             = data.vault_kv_secret_v2.viktor.data["registry_password"]
  # try() so apply succeeds before the Vault key is populated during Phase 0
  # bootstrap (see docs/runbooks/forgejo-registry-setup.md). Empty token =
  # probe will report an auth failure and fire RegistryCatalogInaccessible —
  # that's the intended visible-broken state until the PAT is created.
  forgejo_pull_token = try(data.vault_kv_secret_v2.viktor.data["forgejo_pull_token"], "")
  tier               = local.tiers.cluster
}