From ca43b97fa08667082dc2f87d55aa4f669c017ef8 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 13 Feb 2026 23:47:45 +0000
Subject: [PATCH] [ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup

---
 .../grafana-stale-datasource-cleanup/SKILL.md | 105 +++++++++++++
 .../loki-helm-deployment-pitfalls/SKILL.md    | 143 ++++++++++++++++++
 2 files changed, 248 insertions(+)
 create mode 100644 .claude/skills/grafana-stale-datasource-cleanup/SKILL.md
 create mode 100644 .claude/skills/loki-helm-deployment-pitfalls/SKILL.md

diff --git a/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md b/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md
new file mode 100644
index 00000000..040d5de6
--- /dev/null
+++ b/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md
@@ -0,0 +1,105 @@
+---
+name: grafana-stale-datasource-cleanup
+description: |
+  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
+  with provisioned ones, or when stale datasources persist in the MySQL database.
+  Use when: (1) Grafana shows "dial tcp: lookup no such host" for a datasource,
+  (2) the Grafana API returns "datasources:delete permissions needed" when trying to
+  remove a datasource, (3) a provisioned datasource exists but Grafana uses a stale
+  one from the database, (4) a Helm chart auto-creates a datasource pointing to a
+  disabled gateway service (e.g., loki-gateway). Requires direct MySQL access when
+  Grafana RBAC blocks API operations.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Grafana Stale Datasource Cleanup
+
+## Problem
+Grafana uses a stale or incorrect datasource from its MySQL database instead of
+the correctly provisioned one. This is common when a Helm chart auto-creates a
+datasource that points to a service you have disabled (e.g., the Loki gateway).
+
+## Context / Trigger Conditions
+- Grafana shows the error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
+- A provisioned datasource (via the ConfigMap sidecar) is correct, but Grafana uses
+  a different one stored in MySQL
+- The Grafana API returns `"permissions needed: datasources:delete"` or
+  `"permissions needed: datasources:write"` even with admin credentials
+- A dashboard references a datasource UID that points to the wrong URL
+
+## Solution
+
+### Step 1: Identify the stale datasource
+
+List all datasources via the API (this usually works even with RBAC):
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
+  "import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
+```
+
+### Step 2: Try API deletion first
+
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<UID>" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
+```
+
+If this returns a permissions error, proceed to Step 3.
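+
+Before dropping to MySQL, it can help to confirm which datasources are
+file-provisioned and which exist only in the database. A minimal sketch,
+assuming the API's `readOnly` field is populated for provisioned datasources
+(true on recent Grafana versions, but worth verifying on yours):
+```bash
+# readOnly=True  -> file-provisioned by the sidecar (leave in place)
+# readOnly=False -> stored only in MySQL (candidate for deletion)
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
+  "import sys,json; [print(d['uid'], d['name'], d.get('readOnly')) for d in json.load(sys.stdin)]"
+```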
+
+### Step 3: Delete directly from MySQL
+
+When Grafana RBAC blocks API operations, go through MySQL:
+
+```bash
+# Find the Grafana MySQL password
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'echo $GF_DATABASE_PASSWORD'
+
+# Find the stale datasource
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "SELECT id, uid, name, url FROM data_source;"
+
+# Delete it
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "DELETE FROM data_source WHERE uid='<UID>';"
+```
+
+### Step 4: Fix dashboards referencing the old UID
+
+Dashboards store datasource UIDs in their JSON. Update them via MySQL:
+```bash
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
+```
+
+### Step 5: Refresh Grafana
+
+Hard-refresh the browser (Cmd+Shift+R). If the datasource still doesn't appear:
+```bash
+kubectl rollout restart deploy -n monitoring grafana
+```
+
+## Verification
+```bash
+# Verify that only the correct datasources remain
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
+```
+
+## Notes
+- Grafana's sidecar auto-discovers ConfigMaps with the label `grafana_datasource: "1"`
+  and provisions datasources from them. These are file-provisioned and show as
+  "provisioned" in the UI.
+- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
+  database pointing to services like `loki-gateway`. If you disable the gateway,
+  this datasource becomes stale.
+- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
+  so the dashboard JSON files in the repo are reference copies only.
+- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
+- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
diff --git a/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md b/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md
new file mode 100644
index 00000000..a067fd5e
--- /dev/null
+++ b/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md
@@ -0,0 +1,143 @@
+---
+name: loki-helm-deployment-pitfalls
+description: |
+  Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
+  Use when: (1) the Loki pod fails with "mkdir: read-only file system" for compactor
+  or ruler paths, (2) the Helm chart fails with "Helm test requires the Loki Canary
+  to be enabled", (3) Helm install fails with "cannot re-use a name that is still
+  in use" after a failed atomic deploy, (4) a PV is stuck in Released state after a
+  failed Helm install, (5) "entry too far behind" errors flood the Loki logs after
+  the initial Alloy deployment. Covers single-binary mode with filesystem storage
+  on NFS.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Loki Helm Chart Deployment Pitfalls
+
+## Problem
+Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
+multiple non-obvious failures that aren't documented together.
+
+## Context / Trigger Conditions
+- Deploying Loki via `helm_release` in Terraform
+- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
+- First-time deployment, or redeployment after failures
+
+## Pitfall 1: Read-Only Root Filesystem
+
+**Error:** `mkdir /loki/compactor: read-only file system`
+
+**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
+for security. The compactor `working_directory` and ruler `rule_path` default to
+paths under `/loki/`, which is on the read-only root FS.
+
+**Fix:** Use paths under `/var/loki/`; the Helm chart mounts the persistence
+volume there:
+```yaml
+compactor:
+  working_directory: /var/loki/compactor  # NOT /loki/compactor
+ruler:
+  rule_path: /var/loki/scratch  # NOT /loki/scratch
+```
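+
+If you want to confirm that the read-only root filesystem is what you are
+hitting, inspect the container's security context. The label selector below is
+the same one used in the Verification section of this skill:
+```bash
+# Prints "true" when the root FS is read-only and only mounted
+# volumes such as /var/loki are writable
+kubectl get pod -n monitoring -l app.kubernetes.io/name=loki \
+  -o jsonpath='{.items[0].spec.containers[0].securityContext.readOnlyRootFilesystem}'
+```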
+
+## Pitfall 2: Canary Required
+
+**Error:** `Helm test requires the Loki Canary to be enabled`
+
+**Cause:** The Loki Helm chart's validation template requires `lokiCanary.enabled`
+to be true. You cannot disable it.
+
+**Fix:** Leave `lokiCanary` enabled (the default). You can disable `gateway`,
+`chunksCache`, and `resultsCache` to reduce resource usage:
+```yaml
+gateway:
+  enabled: false
+chunksCache:
+  enabled: false
+resultsCache:
+  enabled: false
+# Do NOT add: lokiCanary: enabled: false
+```
+
+## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
+
+**Error:** `cannot re-use a name that is still in use`
+
+**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
+sometimes leaves a stale release secret in Kubernetes. Terraform then can't
+create a new release with the same name.
+
+**Fix:** Delete the stale Helm secret:
+```bash
+kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
+```
+Also consider removing `atomic = true` for initial deployments and adding it
+back after the first successful install. Use a longer `timeout` (600s+) for the
+first deploy, since image pulls take time; see the sketch after this section.
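+
+A minimal `helm_release` sketch reflecting that advice. The repository URL is
+the public Grafana charts repo; the namespace and values-file path are
+assumptions to adapt to your module layout:
+```hcl
+resource "helm_release" "loki" {
+  name       = "loki"
+  namespace  = "monitoring"
+  repository = "https://grafana.github.io/helm-charts"
+  chart      = "loki"
+
+  # First deploy: skip atomic rollback and allow time for image pulls.
+  # Re-enable atomic = true once the initial install has succeeded.
+  timeout = 600
+  # atomic = true
+
+  values = [file("${path.module}/loki-values.yaml")]
+}
+```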
+
+## Pitfall 4: PV Stuck in Released State
+
+**Symptom:** The PV shows `Released` status, the PVC can't bind, and the Loki
+pod is stuck in Pending.
+
+**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
+`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
+
+**Fix:** Clear the stale claimRef:
+```bash
+kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
+```
+The PV will transition from `Released` to `Available` and can be bound again.
+
+## Pitfall 5: "Entry Too Far Behind" Log Spam
+
+**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
+
+**Cause:** Alloy reads all historical log files from the Kubernetes API on its
+first startup. Old entries are rejected by Loki's ingester because they're
+behind the newest entry for that stream.
+
+**Fix:** This is harmless and self-resolving: Alloy catches up to the present
+and the errors stop. To clear them immediately:
+```bash
+kubectl rollout restart ds -n monitoring alloy
+```
+After the restart, Alloy tails from approximately "now" for each container.
+
+## Pitfall 6: Alertmanager Service Name
+
+**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
+
+**Cause:** The Prometheus Helm chart names the Alertmanager service
+`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
+silent alert delivery failures.
+
+**Fix:**
+```yaml
+ruler:
+  alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
+```
+Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
+
+## Verification
+```bash
+# Loki pod running
+kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
+
+# Loki receiving logs
+kubectl port-forward -n monitoring svc/loki 3100:3100 &
+curl -s 'http://localhost:3100/loki/api/v1/labels'
+# Should return JSON with namespace, pod, and container labels
+
+# PV bound
+kubectl get pv loki
+# STATUS should be "Bound"
+```
+
+## Notes
+- Always check the PV status before retrying a failed deploy
+- The Loki Helm chart creates many components by default (gateway, canary,
+  memcached caches); disable what you don't need for single-binary mode
+- The WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
+  disk-friendly setups, but data is lost on a pod crash; see the sketch below
+- See also: `helm-release-force-rerender` for Helm values not updating resources
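+
+A sketch of the tmpfs WAL setup mentioned in the notes. The
+`singleBinary.extraVolumes`/`extraVolumeMounts` keys follow the Loki chart's
+conventions and `/var/loki/wal` is the chart's usual WAL location, but verify
+both against your chart version:
+```yaml
+singleBinary:
+  extraVolumes:
+    - name: wal
+      emptyDir:
+        medium: Memory    # tmpfs: fast, but contents vanish on pod crash
+  extraVolumeMounts:
+    - name: wal
+      mountPath: /var/loki/wal
+```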