diff --git a/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md b/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md new file mode 100644 index 00000000..808259b8 --- /dev/null +++ b/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md @@ -0,0 +1,189 @@ +--- +name: clickhouse-k8s-nfs-system-log-bloat +description: | + Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on + NFS storage, caused by unbounded system log table growth triggering continuous background + merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries, + (2) system.merges shows constant merge activity on system.metric_log or system.trace_log, + (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have + grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code + 76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via + Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why + ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for + system log truncation. +author: Claude Code +version: 1.0.0 +date: 2026-03-01 +--- + +# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead + +## Problem + +ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously, +even when actual user queries are negligible. The CPU is consumed by background merge +operations on system log tables that grow unboundedly with no default TTL. + +## Context / Trigger Conditions + +- ClickHouse pod using 500m-1000m+ CPU with no active user queries +- `SELECT * FROM system.processes` shows only diagnostic queries +- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log` +- System log tables have grown to gigabytes: + - `system.trace_log`: 5+ GiB, 200M+ rows + - `system.text_log`: 3+ GiB, 90M+ rows + - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20) + - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows +- Actual user data (e.g., `clickhouse.events`) is only kilobytes +- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV) +- Data directory is on NFS (e.g., `/mnt/main/clickhouse`) + +## Root Cause + +Two compounding issues: + +1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`, + `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default + retention policy and grow indefinitely. + +2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background + merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x + slower than local disk, creating a feedback loop: + - Slow merges → parts accumulate faster than they can be merged + - More parts → more merge operations spawned + - More merges → more CPU for decompression/recompression while waiting on NFS I/O + +## Solution + +### Immediate Fix: Truncate System Tables + +```bash +CH_POD=$(kubectl get pod -n -l app=clickhouse -o jsonpath='{.items[0].metadata.name}') +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log" +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log" +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log" +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log" +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log" +kubectl exec -n $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log" +``` + +This can take 30-60+ seconds per table on NFS due to part cleanup I/O. + +### Permanent Fix: CronJob for Periodic Truncation + +Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API: + +```hcl +resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" { + metadata { + name = "clickhouse-truncate-logs" + namespace = "" + } + spec { + schedule = "0 */6 * * *" + successful_jobs_history_limit = 1 + failed_jobs_history_limit = 1 + job_template { + metadata {} + spec { + template { + metadata {} + spec { + restart_policy = "OnFailure" + container { + name = "truncate" + image = "curlimages/curl:8.12.1" + command = ["sh", "-c", join(" && ", [ + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'", + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'", + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.text_log'", + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'", + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.query_log'", + "curl -s 'http://clickhouse..svc.cluster.local:8123/?user=default&password=' -d 'TRUNCATE TABLE IF EXISTS system.part_log'", + "echo 'System logs truncated'" + ])] + } + } + } + } + } + } +} +``` + +### What Does NOT Work: Config.d XML Mount + +**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/` +via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS): + +- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces + the entire directory, deleting the built-in `docker_related_config.xml` that the + entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes. + +- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even + with minimal valid XML containing only `4`. + +- Both `remove="1"` (to disable tables) and `` (to set retention) config overrides + crash with exit code 36. + +This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image +and how it preprocesses config at startup. The CronJob approach bypasses this entirely. + +## Verification + +After truncation, verify: + +```bash +# CPU should drop from ~900m to ~100m within minutes +kubectl top pod -n -l app=clickhouse + +# No active merges +kubectl exec -n $CH_POD -- clickhouse-client --query \ + "SELECT count() FROM system.merges" + +# System tables should be small +kubectl exec -n $CH_POD -- clickhouse-client --query \ + "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \ + FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \ + FORMAT Pretty" +``` + +## Diagnostic Commands + +```bash +# Check what's consuming CPU (merges vs queries) +kubectl exec -n $CH_POD -- clickhouse-client --query \ + "SELECT * FROM system.merges FORMAT Pretty" + +kubectl exec -n $CH_POD -- clickhouse-client --query \ + "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty" + +# Check background pool config +kubectl exec -n $CH_POD -- clickhouse-client --query \ + "SELECT name, value FROM system.server_settings \ + WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \ + FORMAT Pretty" + +# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges +``` + +## Notes + +- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds + of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small. + +- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in + Kubernetes. Root cause unclear but reproducible across mount methods. + +- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and + `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table + workload. This overhead is unavoidable without config file changes. + +- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data + persistence is not critical (e.g., analytics data is small), consider `emptyDir` or + local PV storage instead. + +## See Also + +- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues +- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers diff --git a/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md b/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md index 1d31b060..b2afc882 100644 --- a/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md +++ b/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md @@ -7,8 +7,8 @@ description: | but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation. Covers NVIDIA device plugin, time-slicing, and container runtime issues. author: Claude Code -version: 1.0.0 -date: 2026-01-27 +version: 1.1.0 +date: 2026-03-01 --- # Kubernetes GPU Pod - No NVIDIA Devices Found @@ -140,6 +140,45 @@ $ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi - Check GPU Operator status: `kubectl get pods -n nvidia` - View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml` +## Automatic GPU Recovery via Liveness Probe + +To prevent GPU loss from requiring manual intervention, add a liveness probe that checks +both GPU availability and application health. Example for Frigate (but applicable to any +GPU workload): + +```hcl +# Restart pod if GPU becomes unavailable or app hangs +liveness_probe { + exec { + command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:/health > /dev/null"] + } + initial_delay_seconds = 120 + period_seconds = 60 + timeout_seconds = 10 + failure_threshold = 3 +} +# Allow time for GPU model loading at startup +startup_probe { + http_get { + path = "/health" + port = + } + period_seconds = 10 + failure_threshold = 30 # up to 5 minutes +} +``` + +The liveness probe checks: +- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues) +- `curl` health endpoint — fails if the application process is hung + +If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod, +which re-acquires the GPU device through the NVIDIA device plugin. + +**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading +(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe +configured with a short `initial_delay_seconds`. + ## References - [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)