2025-12-28 20:05:27 +00:00
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability

Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
2025-12-28 20:05:27 +00:00
# resource "kubernetes_persistent_volume" "prometheus_grafana_pv" {
#   metadata {
#     name = "grafana-pv"
#   }
#   spec {
#     capacity = {
#       "storage" = "2Gi"
#     }
#     access_modes = ["ReadWriteOnce"]
#     persistent_volume_source {
#       nfs {
#         path = "/mnt/main/grafana"
2026-02-23 22:05:28 +00:00
#         server = var.nfs_server
2025-12-28 20:05:27 +00:00
#       }
#       # iscsi {
#       #   target_portal = "iscsi.viktorbarzin.lan:3260"
#       #   iqn = "iqn.2020-12.lan.viktorbarzin:storage:monitoring:grafana"
#       #   lun = 0
#       #   fs_type = "ext4"
#       # }
#     }
#   }
# }

resource "kubernetes_persistent_volume" "alertmanager_pv" {
  metadata {
    name = "alertmanager-pv"
  }
  spec {
    capacity = {
      "storage" = "2Gi"
    }
    access_modes = ["ReadWriteOnce"]
    persistent_volume_source {
[ci skip] complete NFS CSI migration: complex stacks + platform modules

Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
      csi {
        driver = "nfs.csi.k8s.io"
        volume_handle = "alertmanager-pv"
        volume_attributes = {
          server = var.nfs_server
          share = "/mnt/main/alertmanager"
        }
2025-12-28 20:05:27 +00:00
      }
    }
2026-03-02 20:23:36 +00:00
    mount_options = [
      "soft",
      "timeo=30",
      "retrans=3",
      "actimeo=5",
    ]
2026-03-02 01:24:07 +00:00
    storage_class_name = "nfs-truenas"
2025-12-28 20:05:27 +00:00
  }
}
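
# Illustrative sketch only, not part of the applied configuration: a claim that
# would bind to the statically provisioned "alertmanager-pv" above. The resource
# name "alertmanager_pvc_example" and the claim name "alertmanager-pvc" are
# hypothetical; the real claim may be created elsewhere (for example by a Helm
# chart). A claim binds to a pre-created PV when volume_name points at it and
# the storage class, access mode, and requested size are compatible.
# resource "kubernetes_persistent_volume_claim" "alertmanager_pvc_example" {
#   metadata {
#     name = "alertmanager-pvc"
#     namespace = kubernetes_namespace.monitoring.metadata[0].name
#   }
#   spec {
#     access_modes = ["ReadWriteOnce"]
#     storage_class_name = "nfs-truenas"
#     volume_name = kubernetes_persistent_volume.alertmanager_pv.metadata[0].name
#     resources {
#       requests = {
#         "storage" = "2Gi"
#       }
#     }
#   }
# }
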
# resource "kubernetes_persistent_volume_claim" "grafana_pvc" {
#   metadata {
#     name = "grafana-pvc"
2025-12-29 10:23:42 +00:00
#     namespace = kubernetes_namespace.monitoring.metadata[0].name
2025-12-28 20:05:27 +00:00
#   }
#   spec {
#     access_modes = ["ReadWriteOnce"]
#     resources {
#       requests = {
#         "storage" = "2Gi"
#       }
#     }
#   }
# }
2026-03-17 07:39:29 +00:00
# DB credentials from Vault database engine (rotated automatically)
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
resource "kubernetes_manifest" "grafana_db_creds" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind = "ExternalSecret"
    metadata = {
      name = "grafana-db-creds"
      namespace = kubernetes_namespace.monitoring.metadata[0].name
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-database"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "grafana-db-creds"
        template = {
          data = {
            GF_DATABASE_PASSWORD = "{{ .password }}"
          }
        }
      }
      data = [{
        secretKey = "password"
        remoteRef = {
          key = "static-creds/mysql-grafana"
          property = "password"
        }
      }]
    }
  }
}
Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards

Noise reduction (8 alerts tuned):
- PoisonFountainDown: 2m→5m, critical→warning (fail-open service)
- NodeExporterDown: 2m→5m (flaps during node restarts)
- PowerOutage: add for:1m (debounce transient voltage dips)
- New Tailscale client: add for:5m (debounce headscale reauths)
- NoNodeLoadData: use absent() instead of OR vector(0)==0
- NodeHighCPUUsage: 30%→60% (normal for 70+ services)
- HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading)
- PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB)

Alert regrouping:
- Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health"
- Move New Tailscale client → "Infrastructure Health"

New alerts (14):
- Networking: Cloudflared (2), MetalLB (2), Technitium DNS
- Storage: NFS CSI, iSCSI CSI controllers
- Critical Services: PgBouncer, CNPG operator, MySQL operator
- Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker

Inhibit rules:
- Consolidate 3 NodeDown rules into 1 comprehensive rule
- Extend NFS rule to suppress NFS-dependent services
- Add PowerOutage → downstream suppression

Dashboard loading:
- Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards
- Remove duplicate caretta dashboard ConfigMap from caretta.tf
2026-03-14 10:22:22 +00:00
resource "kubernetes_config_map" "grafana_dashboards" {
  for_each = fileset("${path.module}/dashboards", "*.json")

  metadata {
    name = "grafana-dashboard-${replace(trimsuffix(each.value, ".json"), "_", "-")}"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_dashboard = "1"
    }
  }
  data = {
    (each.value) = file("${path.module}/dashboards/${each.value}")
  }
}
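
# Illustrative sketch only (the file name and local names are hypothetical):
# how the ConfigMap name above is derived from a dashboard file name.
# trimsuffix drops the ".json" extension and replace swaps underscores for
# hyphens, since Kubernetes object names may not contain underscores.
# locals {
#   example_dashboard_file = "node_exporter.json"
#   example_configmap_name = "grafana-dashboard-${replace(trimsuffix(local.example_dashboard_file, ".json"), "_", "-")}"
#   # example_configmap_name evaluates to "grafana-dashboard-node-exporter"
# }
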
2025-12-28 20:05:27 +00:00
resource "helm_release" "grafana" {
2025-12-29 10:23:42 +00:00
  namespace = kubernetes_namespace.monitoring.metadata[0].name
2025-12-28 20:05:27 +00:00
  create_namespace = true
  name = "grafana"
  atomic = true
2026-02-18 21:40:32 +00:00
  timeout = 600
2025-12-28 20:05:27 +00:00

  repository = "https://grafana.github.io/helm-charts"
  chart = "grafana"

2026-03-17 07:39:29 +00:00
  values = [templatefile("${path.module}/grafana_chart_values.yaml", {
    grafana_admin_password = var.grafana_admin_password
    mysql_host = var.mysql_host
  })]
  depends_on = [kubernetes_manifest.grafana_db_creds]
2025-12-28 20:05:27 +00:00
}