[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).
parent 14a5b4d7d5
commit f30ef660e1
2 changed files with 230 additions and 2 deletions
189
.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
Normal file
@@ -0,0 +1,189 @@
---
name: clickhouse-k8s-nfs-system-log-bloat
description: |
  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
  NFS storage, caused by unbounded system log table growth triggering continuous background
  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
  system log truncation.
author: Claude Code
version: 1.0.0
date: 2026-03-01
---

# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead

## Problem

ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
even when there are almost no user queries. The CPU goes to background merge
operations on system log tables, which have no default TTL and grow without bound.

## Context / Trigger Conditions

- ClickHouse pod using 500m-1000m+ CPU with no active user queries
- `SELECT * FROM system.processes` shows only diagnostic queries
- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
- System log tables have grown to gigabytes:
  - `system.trace_log`: 5+ GiB, 200M+ rows
  - `system.text_log`: 3+ GiB, 90M+ rows
  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
- Actual user data (e.g., `clickhouse.events`) is only kilobytes
- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)

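The active-part condition in the list above can be checked mechanically. The following is a minimal sketch, not part of the original skill: `flag_bloated_tables` is a hypothetical helper name, the default threshold of 20 is the "healthy" figure quoted above, and `$CH_POD` and `<namespace>` are assumed to be set as in the rest of this document.

```bash
# Hypothetical helper: reads "table<TAB>active_parts" lines on stdin and
# prints every table whose active part count exceeds the threshold (default 20).
flag_bloated_tables() {
  threshold="${1:-20}"
  awk -F'\t' -v t="$threshold" \
    '$2 + 0 > t + 0 { printf "%s: %d active parts (threshold %d)\n", $1, $2, t }'
}

# Usage (needs cluster access; the query counts active parts per system table):
# kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
#   "SELECT table, count() FROM system.parts WHERE active AND database = 'system' GROUP BY table FORMAT TSV" \
#   | flag_bloated_tables 20
```

Keeping the threshold logic outside the SQL makes it easy to try the helper on sample data before pointing it at a live pod.
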
## Root Cause

Two compounding issues:

1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
   retention policy and grow indefinitely.

2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
   slower than local disk, creating a feedback loop:
   - Slow merges → parts accumulate faster than they can be merged
   - More parts → more merge operations spawned
   - More merges → more CPU for decompression/recompression while waiting on NFS I/O

## Solution

### Immediate Fix: Truncate System Tables

```bash
# Look up the ClickHouse pod, then truncate each system log table in turn
CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
```

Each `TRUNCATE` can take 30-60+ seconds per table on NFS due to part cleanup I/O.

### Permanent Fix: CronJob for Periodic Truncation

Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:

```hcl
resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
  metadata {
    name      = "clickhouse-truncate-logs"
    namespace = "<namespace>"
  }
  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name    = "truncate"
              image   = "curlimages/curl:8.12.1"
              command = ["sh", "-c", join(" && ", [
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
                "echo 'System logs truncated'"
              ])]
            }
          }
        }
      }
    }
  }
}
```

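One caveat with the `&&` chain in the CronJob: `curl -s` exits 0 even when the server answers with an HTTP error status, so the chain only stops on connection failures. The following is a minimal sketch of a loop the container could run instead, not the configuration from this commit. The names `CH_URL`, `CH_USER`, `CH_PASSWORD`, `DRY_RUN`, and `truncate_system_logs` are hypothetical; the `X-ClickHouse-User` and `X-ClickHouse-Key` headers are the ClickHouse HTTP interface's header-based alternative to credentials in the URL.

```bash
# Tables to truncate, matching the list used in the CronJob
SYSTEM_LOG_TABLES="metric_log trace_log text_log asynchronous_metric_log query_log part_log"

# Hypothetical helper: truncates each system log table over HTTP.
# With DRY_RUN=1 it only prints the statements and sends nothing.
truncate_system_logs() {
  for table in $SYSTEM_LOG_TABLES; do
    query="TRUNCATE TABLE IF EXISTS system.${table}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$query"
    else
      # -f makes curl exit non-zero on HTTP error statuses,
      # so a failed TRUNCATE aborts the loop and fails the Job.
      curl -sf "$CH_URL" \
        -H "X-ClickHouse-User: ${CH_USER:-default}" \
        -H "X-ClickHouse-Key: ${CH_PASSWORD}" \
        -d "$query" || return 1
    fi
  done
  echo "System logs truncated"
}
```

Running `DRY_RUN=1 truncate_system_logs` prints the six statements without contacting the server, which is a cheap way to review the list. In a real Job, `CH_PASSWORD` would presumably come from a Kubernetes Secret rather than being inlined.
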
### What Does NOT Work: Config.d XML Mount

**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
via Kubernetes ConfigMap. Both mount methods crash ClickHouse with exit code 36 (BAD_ARGUMENTS),
regardless of which override the file contains:

- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
  the entire directory, hiding the built-in `docker_related_config.xml` that the
  entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.

- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
  with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.

- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
  crash with exit code 36.

This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
and how it preprocesses config at startup. The CronJob approach bypasses this entirely.

## Verification

After truncation, verify:

```bash
# CPU should drop from ~900m to ~100m within minutes
kubectl top pod -n <namespace> -l app=clickhouse

# No active merges
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
  "SELECT count() FROM system.merges"

# System tables should be small
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
  "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
   FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
   FORMAT Pretty"
```

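The CPU check above can also be turned into a pass/fail test for scripting. This is a rough sketch under stated assumptions: `cpu_at_or_below` is a hypothetical helper, and it assumes `kubectl top pod --no-headers` prints the CPU column in millicores with an `m` suffix (for example `912m`).

```bash
# Hypothetical helper: reads `kubectl top pod --no-headers` output on stdin and
# succeeds only if every pod's CPU value is at or below the given millicore limit.
cpu_at_or_below() {
  limit="$1"
  awk -v limit="$limit" '
    { cpu = $2; sub(/m$/, "", cpu); if (cpu + 0 > limit + 0) over = 1 }
    END { exit over }'
}

# Usage (needs cluster access and metrics-server):
# kubectl top pod -n <namespace> -l app=clickhouse --no-headers | cpu_at_or_below 200 \
#   && echo "CPU back to normal"
```
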
## Diagnostic Commands

```bash
# $CH_POD is set as in the Immediate Fix section

# Check what's consuming CPU (merges vs queries)
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
  "SELECT * FROM system.merges FORMAT Pretty"

kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
  "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"

# Check background pool config
kubectl exec -n <namespace> "$CH_POD" -- clickhouse-client --query \
  "SELECT name, value FROM system.server_settings \
   WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
   FORMAT Pretty"

# Defaults: background_pool_size=16, background_merges_mutations_concurrency_ratio=2
# → up to 32 concurrent merges
```

## Notes

- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
  of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.

- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
  Kubernetes. Root cause unclear but reproducible across mount methods.

- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
  `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
  workload. This overhead is unavoidable without config file changes.

- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
  persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
  local PV storage instead.

## See Also

- `k8s-nfs-mount-troubleshooting`: NFS mount failures and permission issues
- `k8s-limitrange-oom-silent-kill`: LimitRange defaults causing OOM in ClickHouse containers

@@ -7,8 +7,8 @@ description: |
  but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
  Covers NVIDIA device plugin, time-slicing, and container runtime issues.
author: Claude Code
-version: 1.0.0
-date: 2026-01-27
+version: 1.1.0
+date: 2026-03-01
---

# Kubernetes GPU Pod - No NVIDIA Devices Found

@@ -140,6 +140,45 @@ $ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
- Check GPU Operator status: `kubectl get pods -n nvidia`
- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`

## Automatic GPU Recovery via Liveness Probe

So that a lost GPU does not require manual intervention, add a liveness probe that checks
both GPU availability and application health. The example was written for Frigate but
applies to any GPU workload:

```hcl
# Restart pod if GPU becomes unavailable or app hangs
liveness_probe {
  exec {
    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
  }
  initial_delay_seconds = 120
  period_seconds        = 60
  timeout_seconds       = 10
  failure_threshold     = 3
}
# Allow time for GPU model loading at startup
startup_probe {
  http_get {
    path = "/health"
    port = <port>
  }
  period_seconds    = 10
  failure_threshold = 30 # up to 5 minutes
}
```

The liveness probe checks:
- `nvidia-smi`: fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
- `curl` health endpoint: fails if the application process is hung

If either check fails 3 times in a row (3 minutes at `period_seconds = 60`), Kubernetes
automatically restarts the container, which re-acquires the GPU device through the NVIDIA device plugin.

**Important**: Always pair with a `startup_probe` when using GPU workloads. Model loading
(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
configured with a short `initial_delay_seconds`.

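Because the one-line probe discards all output, a failed probe does not say which of the two checks tripped. The following is a minimal sketch of a probe script that separates the two cases, not the probe shipped in this commit. `gpu_liveness_check`, `GPU_CHECK_CMD`, `HEALTH_CHECK_CMD`, and `HEALTH_URL` are hypothetical names whose defaults mirror the commands in the probe; the script would have to be baked into the image or mounted for the probe to call it.

```bash
# Hypothetical probe script: returns 1 if the GPU check fails, 2 if the
# application health check fails, 0 if both pass. The override variables
# exist only so the logic can be exercised without a GPU.
gpu_liveness_check() {
  gpu_cmd="${GPU_CHECK_CMD:-nvidia-smi}"
  health_cmd="${HEALTH_CHECK_CMD:-curl -sf --max-time 5 ${HEALTH_URL:-http://localhost:<port>/health}}"

  # Word splitting of the command strings is intentional here
  if ! $gpu_cmd > /dev/null 2>&1; then
    echo "liveness: GPU not accessible" >&2
    return 1
  fi
  if ! $health_cmd > /dev/null 2>&1; then
    echo "liveness: application health check failed" >&2
    return 2
  fi
  return 0
}
```

Kubernetes treats any non-zero exit as a probe failure, so the distinct codes and messages are for humans: the probe's output typically appears in the pod's events, which makes it easier to tell a GPU loss from an application hang.
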
## References

- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)