[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery

New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).
2026-03-01 21:04:19 +00:00 · f30ef660e1
commit f30ef660e1
parent 14a5b4d7d5
2 changed files with 230 additions and 2 deletions
--- a/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
+++ b/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
@@ -0,0 +1,189 @@
+---
+name: clickhouse-k8s-nfs-system-log-bloat
+description: |
+  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
+  NFS storage, caused by unbounded system log table growth triggering continuous background
+  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
+  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
+  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
+  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
+  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
+  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
+  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
+  system log truncation.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-01
+---
+
+# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
+
+## Problem
+
+ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
+even when actual user queries are negligible. The CPU is consumed by background merge
+operations on system log tables that grow unboundedly with no default TTL.
+
+## Context / Trigger Conditions
+
+- ClickHouse pod using 500m-1000m+ CPU with no active user queries
+- `SELECT * FROM system.processes` shows only diagnostic queries
+- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
+- System log tables have grown to gigabytes:
+  - `system.trace_log`: 5+ GiB, 200M+ rows
+  - `system.text_log`: 3+ GiB, 90M+ rows
+  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
+  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
+- Actual user data (e.g., `clickhouse.events`) is only kilobytes
+- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
+- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
+
+## Root Cause
+
+Two compounding issues:
+
+1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
+   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
+   retention policy and grow indefinitely.
+
+2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
+   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
+   slower than local disk, creating a feedback loop:
+   - Slow merges → parts accumulate faster than they can be merged
+   - More parts → more merge operations spawned
+   - More merges → more CPU for decompression/recompression while waiting on NFS I/O
+
+## Solution
+
+### Immediate Fix: Truncate System Tables
+
+```bash
+CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
+# Truncate each system log table in turn
+for table in metric_log trace_log text_log asynchronous_metric_log query_log part_log; do
+  kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.${table}"
+done
+```
+
+This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
+
+### Permanent Fix: CronJob for Periodic Truncation
+
+Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
+
+```hcl
+resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
+  metadata {
+    name      = "clickhouse-truncate-logs"
+    namespace = "<namespace>"
+  }
+  spec {
+    schedule                      = "0 */6 * * *"
+    successful_jobs_history_limit = 1
+    failed_jobs_history_limit     = 1
+    job_template {
+      metadata {}
+      spec {
+        template {
+          metadata {}
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name  = "truncate"
+              image = "curlimages/curl:8.12.1"
+              # --fail-with-body makes curl exit non-zero on an HTTP error, so a failed
+              # TRUNCATE breaks the && chain instead of being reported as success.
+              command = ["sh", "-c", join(" && ", [
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
+                "curl -sS --fail-with-body 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
+                "echo 'System logs truncated'"
+              ])]
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
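+
+The CronJob above embeds the password in every URL, where anyone who can read the
+CronJob spec can see it, and repeats the same `curl` call six times. A minimal sketch of
+one way to tighten this, assuming a Kubernetes Secret named `clickhouse-credentials` with
+a `password` key already exists (both names are hypothetical, as is
+`local.clickhouse_log_tables`). It builds the command list from a table list and passes
+credentials through ClickHouse's `X-ClickHouse-User` / `X-ClickHouse-Key` HTTP headers.
+Only the `container` block changes; the rest of the resource stays as shown.
+
+```hcl
+# Hypothetical local listing the system log tables to truncate
+locals {
+  clickhouse_log_tables = [
+    "metric_log", "trace_log", "text_log",
+    "asynchronous_metric_log", "query_log", "part_log",
+  ]
+}
+
+# Fragment: replacement for the container block inside the CronJob resource
+container {
+  name  = "truncate"
+  image = "curlimages/curl:8.12.1"
+
+  # Assumes a Secret "clickhouse-credentials" with key "password" (hypothetical names)
+  env {
+    name = "CH_PASSWORD"
+    value_from {
+      secret_key_ref {
+        name = "clickhouse-credentials"
+        key  = "password"
+      }
+    }
+  }
+
+  # $CH_PASSWORD is expanded by sh at run time; Terraform only interpolates ${...}
+  command = ["sh", "-c", join(" && ", [
+    for t in local.clickhouse_log_tables :
+    "curl -sS --fail-with-body -H 'X-ClickHouse-User: default' -H \"X-ClickHouse-Key: $CH_PASSWORD\" 'http://clickhouse.<ns>.svc.cluster.local:8123/' -d 'TRUNCATE TABLE IF EXISTS system.${t}'"
+  ])]
+}
+```
+
+Adding or removing a table then becomes a one-line change to the list.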
+
+### What Does NOT Work: Config.d XML Mount
+
+**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
+via Kubernetes ConfigMap. Every variant tried crashes ClickHouse with exit code 36 (BAD_ARGUMENTS):
+
+- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Shadows
+  the entire directory, hiding the built-in `docker_related_config.xml` that the
+  entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
+
+- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
+  with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
+
+- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
+  crash with exit code 36.
+
+This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
+and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
+
+## Verification
+
+After truncation, verify:
+
+```bash
+# CPU should drop from ~900m to ~100m within minutes
+kubectl top pod -n <namespace> -l app=clickhouse
+
+# Should return 0 (or close to it) once the backlog of merges has cleared
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT count() FROM system.merges"
+
+# System tables should be small
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
+   FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
+   FORMAT Pretty"
+```
+
+## Diagnostic Commands
+
+```bash
+# Check what's consuming CPU (merges vs queries)
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT * FROM system.merges FORMAT Pretty"
+
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
+
+# Check background pool config
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT name, value FROM system.server_settings \
+   WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
+   FORMAT Pretty"
+
+# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
+```
+
+## Notes
+
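+- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
+  of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
+  Exit code 76 matches ClickHouse error `CANNOT_OPEN_FILE`, whereas a raw SIGSEGV normally
+  surfaces as exit code 139, so check the server log to confirm which failure occurred.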
+
+- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
+  Kubernetes. Root cause unclear but reproducible across mount methods.
+
+- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
+  `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
+  workload. This overhead is unavoidable without config file changes.
+
+- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. Prefer local PV
+  storage, or `emptyDir` if data persistence is not critical (e.g., the analytics data is
+  small).
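+
+As a rough illustration of the `emptyDir` option, the pod template of the ClickHouse
+workload could declare its data volume as below. This is a sketch under assumptions: the
+volume name `clickhouse-data` and the `2Gi` size limit are hypothetical, and
+`/var/lib/clickhouse` is the official image's default data path rather than the NFS path
+used in this deployment. Anything stored here is lost when the pod is removed from the node.
+
+```hcl
+# Fragment of the pod template spec; names and sizes are illustrative assumptions
+spec {
+  container {
+    name  = "clickhouse"
+    image = "clickhouse/clickhouse-server:25.4.2"
+
+    volume_mount {
+      name       = "clickhouse-data"
+      mount_path = "/var/lib/clickhouse" # default data directory in the official image
+    }
+  }
+
+  volume {
+    name = "clickhouse-data"
+    empty_dir {
+      size_limit = "2Gi" # node-local scratch space, wiped when the pod is removed
+    }
+  }
+}
+```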
+
+## See Also
+
+- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
+- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers
--- a/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md
+++ b/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md
@@ -7,8 +7,8 @@ description: |
  but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
  Covers NVIDIA device plugin, time-slicing, and container runtime issues.
 author: Claude Code
-version: 1.0.0
-date: 2026-01-27
+version: 1.1.0
+date: 2026-03-01
 ---

 # Kubernetes GPU Pod - No NVIDIA Devices Found
@@ -140,6 +140,45 @@ $ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
 - Check GPU Operator status: `kubectl get pods -n nvidia`
 - View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`

+## Automatic GPU Recovery via Liveness Probe
+
+To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
+both GPU availability and application health. Example for Frigate (but applicable to any
+GPU workload):
+
+```hcl
+# Restart pod if GPU becomes unavailable or app hangs
+liveness_probe {
+  exec {
+    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
+  }
+  initial_delay_seconds = 120
+  period_seconds        = 60
+  timeout_seconds       = 10
+  failure_threshold     = 3
+}
+# Allow time for GPU model loading at startup
+startup_probe {
+  http_get {
+    path = "/health"
+    port = <port>
+  }
+  period_seconds    = 10
+  failure_threshold = 30  # up to 5 minutes
+}
+```
+
+The liveness probe checks:
+- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
+- `curl` health endpoint — fails if the application process is hung
+
+If either check fails 3 consecutive times (3 × 60 s, about 3 minutes), Kubernetes
+automatically restarts the container, which re-acquires the GPU device through the
+NVIDIA device plugin.
+
+**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
+(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
+configured with a short `initial_delay_seconds`.
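+
+The timing budgets implied by these probes are simple products of `period_seconds` and
+`failure_threshold`. A small sketch, using hypothetical local names, that keeps the values
+in one place and computes the resulting windows:
+
+```hcl
+locals {
+  # Same values as the probe example in this section
+  liveness_period_seconds    = 60
+  liveness_failure_threshold = 3
+  startup_period_seconds     = 10
+  startup_failure_threshold  = 30
+
+  # 60 * 3 = 180 seconds of consecutive failures before a restart
+  liveness_restart_after_seconds = local.liveness_period_seconds * local.liveness_failure_threshold
+
+  # 10 * 30 = 300 seconds allowed for model loading at startup
+  startup_budget_seconds = local.startup_period_seconds * local.startup_failure_threshold
+}
+```
+
+The probe blocks can then reference `local.liveness_period_seconds` and the other values
+instead of literals, so a change to one number updates both the probe and its budget.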
+
 ## References

 - [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)