infra/stacks/platform/modules/monitoring/grafana.tf



# resource "kubernetes_persistent_volume" "prometheus_grafana_pv" {
#   metadata {
#     name = "grafana-pv"
#   }
#   spec {
#     capacity = {
#       "storage" = "2Gi"
#     }
#     access_modes = ["ReadWriteOnce"]
#     persistent_volume_source {
#       nfs {
#         path   = "/mnt/main/grafana"
#         server = var.nfs_server
#       }
#       # iscsi {
#       #   target_portal = "iscsi.viktorbarzin.lan:3260"
#       #   iqn           = "iqn.2020-12.lan.viktorbarzin:storage:monitoring:grafana"
#       #   lun           = 0
#       #   fs_type       = "ext4"
#       # }
#     }
#   }
# }

resource "kubernetes_persistent_volume" "alertmanager_pv" {
  metadata {
    name = "alertmanager-pv"
  }
  spec {
    capacity = {
      "storage" = "2Gi"
    }
    access_modes = ["ReadWriteOnce"]
    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = "alertmanager-pv"
        volume_attributes = {
          server = var.nfs_server
          share  = "/mnt/main/alertmanager"
        }
      }
    }
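    # Static PVs do not inherit mountOptions from the StorageClass (those apply
    # only to dynamically provisioned volumes), so set them explicitly here.
    mount_options = [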
      "soft",
      "timeo=30",
      "retrans=3",
      "actimeo=5",
    ]
    storage_class_name = "nfs-truenas"
  }
}
# resource "kubernetes_persistent_volume_claim" "grafana_pvc" {
#   metadata {
#     name      = "grafana-pvc"
#     namespace = kubernetes_namespace.monitoring.metadata[0].name
#   }
#   spec {
#     access_modes = ["ReadWriteOnce"]
#     resources {
#       requests = {
#         "storage" = "2Gi"
#       }
#     }
#   }
# }

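# DB credentials come from the Vault database engine, which rotates them automatically.
# The ExternalSecret re-syncs every 15m, so the GF_DATABASE_PASSWORD key in the
# grafana-db-creds Secret follows each rotation without a Terraform apply.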
resource "kubernetes_manifest" "grafana_db_creds" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "grafana-db-creds"
      namespace = kubernetes_namespace.monitoring.metadata[0].name
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-database"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "grafana-db-creds"
        template = {
          data = {
            GF_DATABASE_PASSWORD = "{{ .password }}"
          }
        }
      }
      data = [{
        secretKey = "password"
        remoteRef = {
          key      = "static-creds/mysql-grafana"
          property = "password"
        }
      }]
    }
  }
}
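
# Sketch only: grafana_chart_values.yaml is not shown in this file, so the
# snippet below is an assumption about how the synced Secret might be consumed.
# The Grafana Helm chart has an `envFromSecrets` value that loads every key of
# a Secret as an environment variable, roughly like this:
#
#   envFromSecrets:
#     - name: grafana-db-creds
#
# Grafana treats GF_DATABASE_PASSWORD as an override for the `password` key of
# the [database] section, so no password needs to be templated into the values.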

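# One ConfigMap per JSON file in ./dashboards. The grafana_dashboard = "1" label
# is presumably what the Grafana dashboard sidecar watches in order to auto-load them.
resource "kubernetes_config_map" "grafana_dashboards" {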
  for_each = fileset("${path.module}/dashboards", "*.json")

  metadata {
    name      = "grafana-dashboard-${replace(trimsuffix(each.value, ".json"), "_", "-")}"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_dashboard = "1"
    }
  }
  data = {
    (each.value) = file("${path.module}/dashboards/${each.value}")
  }
}

resource "helm_release" "grafana" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "grafana"
  atomic           = true
  timeout          = 600

  repository = "https://grafana.github.io/helm-charts"
  chart      = "grafana"

  values = [templatefile("${path.module}/grafana_chart_values.yaml", {
    grafana_admin_password = var.grafana_admin_password
    mysql_host             = var.mysql_host
  })]

  # Create the ExternalSecret before installing the chart.
  depends_on = [kubernetes_manifest.grafana_db_creds]
}