2026-03-17 21:34:11 +00:00
|
|
|
variable "tls_secret_name" {}
|
|
|
|
|
variable "tier" { type = string }
|
|
|
|
|
|
|
|
|
|
module "tls_secret" {
|
|
|
|
|
source = "../../../../modules/kubernetes/setup_tls_secret"
|
|
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_namespace" "nvidia" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "nvidia"
|
|
|
|
|
labels = {
|
|
|
|
|
"istio-injection" : "disabled"
|
2026-03-17 22:35:54 +00:00
|
|
|
tier = var.tier
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
and `kubernetes_job_v1` resource (plus `_v1` variants where they exist) in the
repo and ensuring each carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
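The path selection above can be sketched roughly as follows. This is a minimal illustration, not the contents of `/tmp/add_dns_config_ignore.py`; the function names `dns_config_path` and `lifecycle_block` are hypothetical, and only the two paths and the marker text come from this message.

```python
# Hypothetical helpers illustrating the path choice; not the real script.
MARKER = (
    "# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook "
    "mutates dns_config with ndots=2"
)

POD_TEMPLATE_PATH = "spec[0].template[0].spec[0].dns_config"
CRON_JOB_PATH = "spec[0].job_template[0].spec[0].template[0].spec[0].dns_config"

# Resource types whose pod template sits directly under spec[0].
DIRECT_OWNERS = {
    "kubernetes_deployment",
    "kubernetes_stateful_set",
    "kubernetes_daemon_set",
    "kubernetes_job",
}


def dns_config_path(resource_type: str) -> str:
    """Return the ignore_changes path for a pod-owning resource type."""
    # Treat "foo_v1" and "foo" as the same kind of resource.
    base = resource_type[:-3] if resource_type.endswith("_v1") else resource_type
    if base == "kubernetes_cron_job":
        return CRON_JOB_PATH
    if base in DIRECT_OWNERS:
        return POD_TEMPLATE_PATH
    raise ValueError(f"{resource_type} does not own pods")


def lifecycle_block(resource_type: str, indent: str = "  ") -> str:
    """Render a brand-new lifecycle block carrying the marker comment."""
    return "\n".join([
        f"{indent}lifecycle {{",
        f"{indent}  {MARKER}",
        f"{indent}  ignore_changes = [{dns_config_path(resource_type)}]",
        f"{indent}}}",
    ])
```

Calling `lifecycle_block("kubernetes_cron_job_v1")` would render the deeper `job_template[0]` path, while a non-pod type such as `kubernetes_service` raises instead of silently emitting a block.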
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
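As a rough illustration of the second insertion path, the sketch below extends an existing `ignore_changes` list in both inline and multiline forms. It is an assumption about how such a pass could work, not the actual script: `extend_ignore_changes` and `_list_span` are hypothetical names, and the bracket matching is naive (it does not understand brackets inside strings or comments).

```python
import re

DNS_PATH = "spec[0].template[0].spec[0].dns_config"


def _list_span(text):
    """Return (start, end) of the text between the list's outer brackets."""
    match = re.search(r"ignore_changes\s*=\s*\[", text)
    if match is None:
        return None
    depth = 1
    for i in range(match.end(), len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return match.end(), i
    return None


def extend_ignore_changes(block, path=DNS_PATH):
    """Append `path` to the ignore_changes list of one lifecycle block."""
    span = _list_span(block)
    if span is None:
        return block
    start, end = span
    body = block[start:end]
    if "dns_config" in body:
        return block  # already suppressed, so re-running is a no-op
    if "\n" not in body:
        # Inline form: ignore_changes = [x]
        existing = body.strip().rstrip(",")
        new_body = f"{existing}, {path}" if existing else path
    else:
        # Multiline form: make sure the last item ends with a comma.
        lines = body.split("\n")
        items = [i for i, line in enumerate(lines)
                 if line.strip() and not line.strip().startswith("#")]
        if not items:
            return block
        last = items[-1]
        if not lines[last].rstrip().endswith(","):
            lines[last] = lines[last].rstrip() + ","
        indent = lines[last][: len(lines[last]) - len(lines[last].lstrip())]
        lines.insert(last + 1, f"{indent}{path},")
        new_body = "\n".join(lines)
    return block[:start] + new_body + block[end:]
```

The depth counter matters because every path contains its own `[0]` indexes, so a plain non-greedy regex would stop at the first `]` inside an item.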
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
"resource-governance/custom-quota" = "true"
|
2026-03-17 22:35:54 +00:00
|
|
|
"resource-governance/custom-limitrange" = "true"
|
|
|
|
|
}
|
|
|
|
|
}
|
[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
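For readers who want to see the shape of the brace-depth pass described above, here is a minimal sketch. It is not the real `/tmp/add_goldilocks_ignore.py`; the function name `inject_goldilocks_ignore` is hypothetical, and the brace counting is deliberately naive (it assumes braces inside strings and comments are balanced on each line).

```python
import re

LABEL = "goldilocks.fairwinds.com/vpa-update-mode"
RESOURCE_RE = re.compile(r'^resource "kubernetes_namespace" ')
LIFECYCLE = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy "
    "stamps this label on every namespace",
    f'    ignore_changes = [metadata[0].labels["{LABEL}"]]',
    "  }",
]


def inject_goldilocks_ignore(source):
    """Insert the lifecycle block before each namespace's closing brace."""
    if LABEL in source:
        return source  # file already handled, keeps the pass idempotent
    out = []
    inside = False
    depth = 0
    for line in source.splitlines():
        if not inside and RESOURCE_RE.match(line):
            inside = True
            depth = 0
        if inside:
            new_depth = depth + line.count("{") - line.count("}")
            if depth > 0 and new_depth == 0:
                # This line closes the resource: inject just before it.
                out.extend(LIFECYCLE)
                inside = False
            depth = new_depth
        out.append(line)
    result = "\n".join(out)
    return result + "\n" if source.endswith("\n") else result
```

Because the idempotency check runs once per file before any injection, a file with two namespaces (as in the vault stack) still receives a block in each resource on the first run.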
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
|
|
|
lifecycle {
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
|
|
|
|
|
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
|
|
|
|
}
|
2026-03-17 22:35:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
# Custom LimitRange — overrides Kyverno tier-2-gpu default (1Gi per container)
|
|
|
|
|
# which was inflating NVIDIA operator init container requests by ~2.5Gi total.
|
|
|
|
|
# Init containers do quick validation checks and need minimal memory.
|
|
|
|
|
resource "kubernetes_limit_range" "nvidia_defaults" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "tier-defaults"
|
|
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
limit {
|
|
|
|
|
type = "Container"
|
|
|
|
|
default = {
|
|
|
|
|
memory = "128Mi"
|
|
|
|
|
}
|
|
|
|
|
default_request = {
|
|
|
|
|
cpu = "50m"
|
|
|
|
|
memory = "128Mi"
|
|
|
|
|
}
|
|
|
|
|
max = {
|
|
|
|
|
memory = "16Gi"
|
|
|
|
|
}
|
2026-03-17 21:34:11 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_resource_quota" "nvidia_quota" {
  metadata {
    name      = "tier-quota"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
  }
  spec {
    hard = {
      "limits.memory"   = "48Gi"
      "requests.cpu"    = "8"
      "requests.memory" = "12Gi"
      pods              = "40"
    }
  }
}
|
|
|
|
|
|
gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:
- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
(frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
auto-applied by gpu-feature-discovery on any node carrying an
NVIDIA PCI device, so the selector follows the card.
- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
'kubectl label gpu=true' since NFD handles labeling.
- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
the GPU node) but portable when the card relocates.
Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
|
|
|
# Apply GPU taint dynamically based on NFD-discovered GPU nodes. The
|
|
|
|
|
# NFD label `feature.node.kubernetes.io/pci-10de.present=true` is
|
|
|
|
|
# auto-applied on any node with an NVIDIA PCI device (vendor 0x10de),
|
|
|
|
|
# so the taint follows the card if it moves between nodes. Workload
|
|
|
|
|
# nodeSelectors key off `nvidia.com/gpu.present=true` (applied by
|
|
|
|
|
# gpu-feature-discovery once the operator is up).
|
2026-03-17 21:34:11 +00:00
|
|
|
resource "null_resource" "gpu_node_config" {
|
|
|
|
|
provisioner "local-exec" {
|
|
|
|
|
command = <<-EOT
|
2026-04-22 13:43:07 +00:00
|
|
|
set -euo pipefail
|
|
|
|
|
for node in $(kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o jsonpath='{.items[*].metadata.name}'); do
|
|
|
|
|
kubectl taint nodes "$node" nvidia.com/gpu=true:PreferNoSchedule --overwrite
|
|
|
|
|
done
|
2026-03-17 21:34:11 +00:00
|
|
|
EOT
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
triggers = {
|
2026-04-22 13:43:07 +00:00
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
command_hash = "dynamic-taint-v1"
|
2026-03-17 21:34:11 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
# [not needed anymore; part of the chart values] Apply to operator with:
|
|
|
|
|
# kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_config_map" "time_slicing_config" {
  metadata {
    name      = "time-slicing-config"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
  }

  data = {
    any = <<-EOF
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 100
    EOF
  }
  depends_on = [kubernetes_namespace.nvidia]
}
|
|
|
|
|
|
|
|
|
|
resource "helm_release" "nvidia-gpu-operator" {
  namespace = kubernetes_namespace.nvidia.metadata[0].name
  name      = "nvidia-gpu-operator"

  repository = "https://helm.ngc.nvidia.com/nvidia"
  chart      = "gpu-operator"
  atomic     = true
  # version = "0.9.3"
  timeout = 6000

  values     = [templatefile("${path.module}/values.yaml", {})]
  depends_on = [kubernetes_config_map.time_slicing_config]
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_deployment" "nvidia-exporter" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "nvidia-exporter"
|
|
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
app = "nvidia-exporter"
|
|
|
|
|
tier = var.tier
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
replicas = 1
|
|
|
|
|
selector {
|
|
|
|
|
match_labels = {
|
|
|
|
|
app = "nvidia-exporter"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
template {
|
|
|
|
|
metadata {
|
|
|
|
|
labels = {
|
|
|
|
|
app = "nvidia-exporter"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
node_selector = {
|
2026-04-22 13:43:07 +00:00
|
|
|
"nvidia.com/gpu.present" : "true"
|
2026-03-17 21:34:11 +00:00
|
|
|
}
|
|
|
|
|
toleration {
|
|
|
|
|
key = "nvidia.com/gpu"
|
|
|
|
|
operator = "Equal"
|
|
|
|
|
value = "true"
|
|
|
|
|
effect = "NoSchedule"
|
|
|
|
|
}
|
|
|
|
|
container {
|
|
|
|
|
image = "nvidia/dcgm-exporter:latest"
|
|
|
|
|
name = "nvidia-exporter"
|
|
|
|
|
port {
|
|
|
|
|
container_port = 9400
|
|
|
|
|
}
|
|
|
|
|
security_context {
|
|
|
|
|
privileged = true
|
|
|
|
|
capabilities {
|
|
|
|
|
add = ["SYS_ADMIN"]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
resources {
|
|
|
|
|
requests = {
|
|
|
|
|
memory = "192Mi"
|
|
|
|
|
}
|
|
|
|
|
limits = {
|
|
|
|
|
memory = "192Mi"
|
|
|
|
|
"nvidia.com/gpu" = "1"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
dns_config {
|
|
|
|
|
option {
|
|
|
|
|
name = "ndots"
|
|
|
|
|
value = "2"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
depends_on = [helm_release.nvidia-gpu-operator]
|
2026-04-18 21:19:48 +00:00
|
|
|
lifecycle {
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
|
|
|
|
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
|
|
|
|
}
|
2026-03-17 21:34:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "nvidia-exporter" {
  metadata {
    name      = "nvidia-exporter"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
    labels = {
      "app" = "nvidia-exporter"
    }
  }

  spec {
    selector = {
      app = "nvidia-exporter"
    }
    port {
      name        = "http"
      port        = 80
      target_port = 9400
    }
  }
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
module "ingress" {
|
2026-05-10 22:26:22 +00:00
|
|
|
source = "../../../../modules/kubernetes/ingress_factory"
|
|
|
|
|
# Auth disabled — HA Sofia REST sensors poll /metrics; the OIDC flow
|
|
|
|
|
# would 302 every request. Same pattern as idrac-redfish-exporter +
|
|
|
|
|
# snmp-exporter (commit 5c594291).
|
infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-11 19:25:48 +00:00
|
|
|
# auth = "none": HA Sofia REST sensors poll /metrics programmatically; OIDC flow would 302 every request breaking automation.
|
2026-05-10 22:26:22 +00:00
|
|
|
auth = "none"
|
2026-03-17 21:34:11 +00:00
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
name = "nvidia-exporter"
|
|
|
|
|
root_domain = "viktorbarzin.lan"
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
allow_local_access_only = true
|
|
|
|
|
ssl_redirect = false
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
# resource "kubernetes_ingress_v1" "nvidia-exporter" {
|
|
|
|
|
# metadata {
|
|
|
|
|
# name = "nvidia-exporter"
|
|
|
|
|
# namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
# annotations = {
|
|
|
|
|
# "kubernetes.io/ingress.class" = "nginx"
|
|
|
|
|
# "nginx.ingress.kubernetes.io/whitelist-source-range" : "192.168.1.0/24, 10.0.0.0/8"
|
|
|
|
|
# "nginx.ingress.kubernetes.io/ssl-redirect" : "false" # used only in LAN
|
|
|
|
|
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# spec {
|
|
|
|
|
# tls {
|
|
|
|
|
# hosts = ["nvidia-exporter.viktorbarzin.lan"]
|
|
|
|
|
# secret_name = var.tls_secret_name
|
|
|
|
|
# }
|
|
|
|
|
# rule {
|
|
|
|
|
# host = "nvidia-exporter.viktorbarzin.lan"
|
|
|
|
|
# http {
|
|
|
|
|
# path {
|
|
|
|
|
# backend {
|
|
|
|
|
# service {
|
|
|
|
|
# name = "nvidia-exporter"
|
|
|
|
|
# port {
|
|
|
|
|
# number = 80
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# resource "kubernetes_deployment" "gpu-container" {
|
|
|
|
|
# metadata {
|
|
|
|
|
# name = "gpu-container"
|
|
|
|
|
# namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
# labels = {
|
|
|
|
|
# app = "gpu-container"
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# spec {
|
|
|
|
|
# replicas = 1
|
|
|
|
|
# selector {
|
|
|
|
|
# match_labels = {
|
|
|
|
|
# app = "gpu-container"
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# template {
|
|
|
|
|
# metadata {
|
|
|
|
|
# labels = {
|
|
|
|
|
# app = "gpu-container"
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# spec {
|
|
|
|
|
# node_selector = {
|
|
|
|
|
# "gpu" : "true"
|
|
|
|
|
# }
|
|
|
|
|
# container {
|
|
|
|
|
# image = "ubuntu"
|
|
|
|
|
# name = "gpu-container"
|
|
|
|
|
# command = ["/usr/bin/sleep", "3600"]
|
|
|
|
|
# # security_context {
|
|
|
|
|
# # privileged = true
|
|
|
|
|
# # capabilities {
|
|
|
|
|
# # add = ["SYS_ADMIN"]
|
|
|
|
|
# # }
|
|
|
|
|
# # }
|
|
|
|
|
# resources {
|
|
|
|
|
# limits = {
|
|
|
|
|
# "nvidia.com/gpu" = "1"
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# }
|
|
|
|
|
# depends_on = [helm_release.nvidia-gpu-operator]
|
|
|
|
|
# }
|
|
|
|
|
|
|
|
|
|
# GPU Pod Memory Exporter - exposes per-pod GPU memory usage as Prometheus metrics
|
|
|
|
|
resource "kubernetes_config_map" "gpu_pod_exporter_script" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "gpu-pod-exporter-script"
|
|
|
|
|
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
data = {
|
|
|
|
|
"exporter.py" = <<-EOF
|
|
|
|
|
#!/usr/bin/env python3
|
|
|
|
|
"""GPU Pod Memory Exporter - Collects per-pod GPU memory usage."""
|
|
|
|
|
|
|
|
|
|
import subprocess
|
|
|
|
|
import time
|
|
|
|
|
import re
|
|
|
|
|
import os
|
|
|
|
|
import json
|
|
|
|
|
import urllib.request
|
|
|
|
|
import ssl
|
|
|
|
|
from http.server import HTTPServer, BaseHTTPRequestHandler
|
|
|
|
|
|
|
|
|
|
METRICS_PORT = 9401
|
|
|
|
|
SCRAPE_INTERVAL = 15
|
|
|
|
|
|
|
|
|
|
# Kubernetes API configuration
|
|
|
|
|
K8S_API = "https://kubernetes.default.svc"
|
|
|
|
|
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
|
|
|
|
|
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
|
|
|
|
|
|
|
|
|
|
# Cache for container ID to pod info mapping
|
|
|
|
|
container_cache = {}
|
|
|
|
|
cache_refresh_time = 0
|
|
|
|
|
CACHE_TTL = 60 # Refresh cache every 60 seconds
|
|
|
|
|
|
|
|
|
|
def get_k8s_token():
    """Read Kubernetes service account token."""
    try:
        with open(TOKEN_PATH, 'r') as f:
            return f.read().strip()
    except OSError:
        # Token file missing or unreadable (e.g. running outside a pod)
        return None
|
|
|
|
|
|
|
|
|
|
def refresh_container_cache():
    """Refresh the container ID to pod mapping from Kubernetes API."""
    global container_cache, cache_refresh_time

    token = get_k8s_token()
    if not token:
        return

    try:
        # Create SSL context with K8s CA
        ctx = ssl.create_default_context()
        if os.path.exists(CA_PATH):
            ctx.load_verify_locations(CA_PATH)

        # Get all pods on this node
        node_name = os.environ.get('NODE_NAME', '')
        url = f"{K8S_API}/api/v1/pods?fieldSelector=spec.nodeName={node_name}"

        req = urllib.request.Request(url, headers={
            'Authorization': f'Bearer {token}',
            'Accept': 'application/json'
        })

        with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
            data = json.loads(resp.read().decode())

        new_cache = {}
        for pod in data.get('items', []):
            pod_name = pod['metadata']['name']
            namespace = pod['metadata']['namespace']

            # Get container statuses
            for status in pod.get('status', {}).get('containerStatuses', []):
                container_id = status.get('containerID', '')
                # Extract the ID part (e.g., "containerd://abc123..." -> "abc123")
                if '://' in container_id:
                    container_id = container_id.split('://')[-1]
                if container_id:
                    short_id = container_id[:12]
                    new_cache[short_id] = {
                        'pod': pod_name,
                        'namespace': namespace,
                        'container': status.get('name', 'unknown')
                    }

        container_cache = new_cache
        cache_refresh_time = time.time()
        print(f"Refreshed container cache: {len(new_cache)} containers")

    except Exception as e:
        print(f"Error refreshing container cache: {e}")
|
|
|
|
|
|
|
|
|
|
def get_pod_info(container_id):
    """Look up pod info for a container ID."""
    global cache_refresh_time

    # Refresh cache if stale
    if time.time() - cache_refresh_time > CACHE_TTL:
        refresh_container_cache()

    return container_cache.get(container_id, {
        'pod': 'unknown',
        'namespace': 'unknown',
        'container': 'unknown'
    })

def get_gpu_processes():
    """Run nvidia-smi to get GPU process info."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid,used_memory,process_name", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode != 0:
            print(f"nvidia-smi error: {result.stderr}")
            return []

        processes = []
        for line in result.stdout.strip().split('\n'):
            if not line.strip():
                continue
            parts = [p.strip() for p in line.split(',')]
            if len(parts) >= 3:
                pid, memory_mib, process_name = parts[0], parts[1], parts[2]
                # Skip rows whose memory field is not a plain number (for example
                # a placeholder such as "[N/A]") so one row cannot abort the whole scrape
                if not memory_mib.isdigit():
                    continue
                processes.append({
                    'pid': pid,
                    'memory_bytes': int(memory_mib) * 1024 * 1024,
                    'process_name': process_name
                })
        return processes
    except Exception as e:
        print(f"Error running nvidia-smi: {e}")
        return []

def get_container_id(pid):
    """Map PID to container ID via cgroup."""
    cgroup_path = f"/host_proc/{pid}/cgroup"
    try:
        with open(cgroup_path, 'r') as f:
            for line in f:
                # Match container ID patterns (docker, containerd, cri-o)
                match = re.search(r'[:/]([a-f0-9]{64})', line)
                if match:
                    return match.group(1)[:12]
                match = re.search(r'cri-containerd-([a-f0-9]{64})', line)
                if match:
                    return match.group(1)[:12]
    except (FileNotFoundError, PermissionError):
        pass
    return "host"

# Global metrics storage
current_metrics = []

def collect_metrics():
    """Collect GPU memory metrics."""
    global current_metrics
    metrics = []
    processes = get_gpu_processes()

    for proc in processes:
        container_id = get_container_id(proc['pid'])
        pod_info = get_pod_info(container_id)
        metrics.append({
            'container_id': container_id,
            'pid': proc['pid'],
            'process_name': proc['process_name'],
            'memory_bytes': proc['memory_bytes'],
            'pod': pod_info['pod'],
            'namespace': pod_info['namespace'],
            'container': pod_info['container']
        })

    current_metrics = metrics

def format_metrics():
    """Format metrics in Prometheus exposition format."""
    lines = [
        "# HELP gpu_pod_memory_used_bytes GPU memory used by pod",
        "# TYPE gpu_pod_memory_used_bytes gauge"
    ]

    def esc(value):
        # Label values must escape backslash, double quote and newline
        return str(value).replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

    for m in current_metrics:
        labels = ','.join([
            f'namespace="{esc(m["namespace"])}"',
            f'pod="{esc(m["pod"])}"',
            f'container="{esc(m["container"])}"',
            f'process_name="{esc(m["process_name"])}"',
            f'pid="{esc(m["pid"])}"'
        ])
        lines.append(f'gpu_pod_memory_used_bytes{{{labels}}} {m["memory_bytes"]}')

    return '\n'.join(lines) + '\n'

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            content = format_metrics()
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.end_headers()
            self.wfile.write(content.encode())
        elif self.path == '/health':
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # Suppress request logging

def background_collector():
    """Background thread to collect metrics periodically."""
    import threading
    def run():
        while True:
            collect_metrics()
            time.sleep(SCRAPE_INTERVAL)
    thread = threading.Thread(target=run, daemon=True)
    thread.start()

if __name__ == '__main__':
    print(f"Starting GPU Pod Memory Exporter on port {METRICS_PORT}")
    refresh_container_cache()  # Initial cache load
    collect_metrics()  # Initial collection
    background_collector()

    server = HTTPServer(('', METRICS_PORT), MetricsHandler)
    server.serve_forever()
EOF
  }
}

resource "kubernetes_service_account" "gpu_pod_exporter" {
  metadata {
    name      = "gpu-pod-exporter"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
  }
}

resource "kubernetes_cluster_role" "gpu_pod_exporter" {
  metadata {
    name = "gpu-pod-exporter"
  }

  rule {
    api_groups = [""]
    resources  = ["pods"]
    verbs      = ["list"]
  }
}

resource "kubernetes_cluster_role_binding" "gpu_pod_exporter" {
  metadata {
    name = "gpu-pod-exporter"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.gpu_pod_exporter.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.gpu_pod_exporter.metadata[0].name
    namespace = kubernetes_namespace.nvidia.metadata[0].name
  }
}

resource "kubernetes_daemonset" "gpu_pod_exporter" {
  metadata {
    name      = "gpu-pod-exporter"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
    labels = {
      app  = "gpu-pod-exporter"
      tier = var.tier
    }
  }

  spec {
    selector {
      match_labels = {
        app = "gpu-pod-exporter"
      }
    }

    template {
      metadata {
        labels = {
          app = "gpu-pod-exporter"
        }
      }

      spec {
        host_pid             = true
        service_account_name = kubernetes_service_account.gpu_pod_exporter.metadata[0].name

        node_selector = {
          # gpu: schedule off NFD label, not k8s-node1 hostname
          #
          # Remove every hardcoded reference to k8s-node1 that pinned GPU
          # scheduling to a specific host:
          # - GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
          #   (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
          #   audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
          #   auto-applied by gpu-feature-discovery on any node carrying an
          #   NVIDIA PCI device, so the selector follows the card.
          # - null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
          #   nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
          #   each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
          #   'kubectl label gpu=true' since NFD handles labeling.
          # - MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
          #   nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
          #   the GPU node) but portable when the card relocates.
          #
          # Net effect: moving the GPU card between nodes no longer requires any
          # Terraform edit. Verified no-op for current scheduling: both old and
          # new labels resolve to node1 today.
          #
          # Docs updated to match: AGENTS.md, compute.md, overview.md,
          # proxmox-inventory.md, k8s-portal agent-guidance string.
          #
          # 2026-04-22 13:43:07 +00:00
          "nvidia.com/gpu.present" : "true"
        }

        toleration {
          key      = "nvidia.com/gpu"
          operator = "Equal"
          value    = "true"
          effect   = "NoSchedule"
        }

        container {
          name  = "exporter"
          image = "python:3.11-slim"

          command = ["/bin/bash", "-c"]
          args = [
            "python3 /scripts/exporter.py"
          ]

          env {
            name = "NODE_NAME"
            value_from {
              field_ref {
                field_path = "spec.nodeName"
              }
            }
          }

          port {
            container_port = 9401
            name           = "metrics"
          }

          volume_mount {
            name       = "scripts"
            mount_path = "/scripts"
            read_only  = true
          }

          volume_mount {
            name       = "host-proc"
            mount_path = "/host_proc"
            read_only  = true
          }

          resources {
            requests = {
              cpu    = "10m"
              memory = "128Mi"
            }
            limits = {
              memory           = "128Mi"
              "nvidia.com/gpu" = "1"
            }
          }

          liveness_probe {
            http_get {
              path = "/health"
              port = 9401
            }
            initial_delay_seconds = 30
            period_seconds        = 30
            timeout_seconds       = 5
          }
        }

        volume {
          name = "scripts"
          config_map {
            name         = kubernetes_config_map.gpu_pod_exporter_script.metadata[0].name
            default_mode = "0755"
          }
        }

        volume {
          name = "host-proc"
          host_path {
            path = "/proc"
            type = "Directory"
          }
        }
        dns_config {
          option {
            name  = "ndots"
            value = "2"
          }
        }
      }
    }
  }

  depends_on = [helm_release.nvidia-gpu-operator]
}

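# Illustrative sketch only, not part of this stack: the commit note on the
# node_selector above mentions keeping MySQL off the GPU node with
# "nvidia.com/gpu.present NotIn [true]". Assuming the MySQL pod spec is managed
# through the same Kubernetes provider, that rule might look roughly like the
# commented-out block below. The placement inside a pod spec and the use of a
# hard (required) rule are assumptions; the actual MySQL stack is not shown here.
#
# affinity {
#   node_affinity {
#     required_during_scheduling_ignored_during_execution {
#       node_selector_term {
#         match_expressions {
#           key      = "nvidia.com/gpu.present"
#           operator = "NotIn"
#           values   = ["true"]
#         }
#       }
#     }
#   }
# }
#
# NotIn also matches nodes that lack the label entirely, so nodes without an
# NVIDIA device stay eligible.
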
resource "kubernetes_service" "gpu_pod_exporter" {
  metadata {
    name      = "gpu-pod-exporter"
    namespace = kubernetes_namespace.nvidia.metadata[0].name
    labels = {
      app = "gpu-pod-exporter"
    }
  }

  spec {
    selector = {
      app = "gpu-pod-exporter"
    }

    port {
      name        = "metrics"
      port        = 80
      target_port = 9401
    }
  }
}