[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery

New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).
2026-03-01 21:04:19 +00:00 · 2026-03-01 21:04:19 +00:00 · 53be356f41
commit 53be356f41
parent ab7c655776
2 changed files with 230 additions and 2 deletions
--- a/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
+++ b/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
@ -0,0 +1,189 @@
 ---
 name: clickhouse-k8s-nfs-system-log-bloat
 description: |
  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
  NFS storage, caused by unbounded system log table growth triggering continuous background
  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
  system log truncation.
 author: Claude Code
 version: 1.0.0
 date: 2026-03-01
 ---
 # ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
 ## Problem
 ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
 even when actual user queries are negligible. The CPU is consumed by background merge
 operations on system log tables that grow unboundedly with no default TTL.
 ## Context / Trigger Conditions
 - ClickHouse pod using 500m-1000m+ CPU with no active user queries
 - `SELECT * FROM system.processes` shows only diagnostic queries
 - `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
 - System log tables have grown to gigabytes:
  - `system.trace_log`: 5+ GiB, 200M+ rows
  - `system.text_log`: 3+ GiB, 90M+ rows
  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
 - Actual user data (e.g., `clickhouse.events`) is only kilobytes
 - ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
 - Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
 ## Root Cause
 Two compounding issues:
 1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
   retention policy and grow indefinitely.
 2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
   slower than local disk, creating a feedback loop:
   - Slow merges → parts accumulate faster than they can be merged
   - More parts → more merge operations spawned
   - More merges → more CPU for decompression/recompression while waiting on NFS I/O
 ## Solution
 ### Immediate Fix: Truncate System Tables
 ```bash
 CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
 ```
 This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
 ### Permanent Fix: CronJob for Periodic Truncation
 Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
 ```hcl
 resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
  metadata {
    name      = "clickhouse-truncate-logs"
    namespace = "<namespace>"
  }
  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name  = "truncate"
              image = "curlimages/curl:8.12.1"
              command = ["sh", "-c", join(" && ", [
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
                "echo 'System logs truncated'"
              ])]
            }
          }
        }
      }
    }
  }
 }
 ```
 ### What Does NOT Work: Config.d XML Mount
 **DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
 via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):
 - **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
  the entire directory, deleting the built-in `docker_related_config.xml` that the
  entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
 - **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
  with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
 - Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
  crash with exit code 36.
 This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
 and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
 ## Verification
 After truncation, verify:
 ```bash
 # CPU should drop from ~900m to ~100m within minutes
 kubectl top pod -n <namespace> -l app=clickhouse
 # No active merges
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
  "SELECT count() FROM system.merges"
 # System tables should be small
 kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
  "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
   FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
   FORMAT Pretty"
 ```
 ## Diagnostic Commands
 ```bash
 # Check what's consuming CPU (merges vs queries)
 kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT * FROM system.merges FORMAT Pretty"
 kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
 # Check background pool config
 kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT name, value FROM system.server_settings \
   WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
   FORMAT Pretty"
 # Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
 ```
 ## Notes
 - **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
  of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
 - **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
  Kubernetes. Root cause unclear but reproducible across mount methods.
 - **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
  `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
  workload. This overhead is unavoidable without config file changes.
 - **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
  persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
  local PV storage instead.
 ## See Also
 - `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
 - `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers
--- a/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md
+++ b/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md
@ -7,8 +7,8 @@ description: |
  but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
  Covers NVIDIA device plugin, time-slicing, and container runtime issues.
 author: Claude Code
-version: 1.0.0
+version: 1.1.0
-date: 2026-01-27
+date: 2026-03-01
 ---
 # Kubernetes GPU Pod - No NVIDIA Devices Found
@ -140,6 +140,45 @@ $ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
 - Check GPU Operator status: `kubectl get pods -n nvidia`
 - View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
 ## Automatic GPU Recovery via Liveness Probe
 To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
 both GPU availability and application health. Example for Frigate (but applicable to any
 GPU workload):
 ```hcl
 # Restart pod if GPU becomes unavailable or app hangs
 liveness_probe {
  exec {
    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
  }
  initial_delay_seconds = 120
  period_seconds        = 60
  timeout_seconds       = 10
  failure_threshold     = 3
 }
 # Allow time for GPU model loading at startup
 startup_probe {
  http_get {
    path = "/health"
    port = <port>
  }
  period_seconds    = 10
  failure_threshold = 30  # up to 5 minutes
 }
 ```
 The liveness probe checks:
 - `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
 - `curl` health endpoint — fails if the application process is hung
 If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
 which re-acquires the GPU device through the NVIDIA device plugin.
 **Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
 (TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
 configured with a short `initial_delay_seconds`.
 ## References
 - [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)