[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup
parent a5b240629c, commit ca43b97fa0
2 changed files with 248 additions and 0 deletions

.claude/skills/grafana-stale-datasource-cleanup/SKILL.md (new file, 105 lines)
@@ -0,0 +1,105 @@
---
name: grafana-stale-datasource-cleanup
description: |
  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
  with provisioned ones, or when stale datasources persist in the MySQL database.
  Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
  (2) Grafana API returns "datasources:delete permissions needed" when trying to remove
  a datasource, (3) a provisioned datasource exists but Grafana uses a stale one from
  the database, (4) a Helm chart auto-creates a datasource pointing to a disabled gateway
  service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
  blocks API operations.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---

# Grafana Stale Datasource Cleanup

## Problem

Grafana uses a stale or incorrect datasource from its MySQL database instead of
the correctly provisioned one. This is common when Helm charts auto-create
datasources that point to services you've disabled (e.g., the Loki gateway).

## Context / Trigger Conditions

- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
- A provisioned datasource (via the ConfigMap sidecar; see the sketch after this list)
  is correct, but Grafana uses a different one stored in MySQL
- Grafana API returns `"permissions needed: datasources:delete"` or
  `"permissions needed: datasources:write"` even with admin credentials
- A dashboard references a datasource UID that points to a wrong URL
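
For reference, a sidecar-provisioned datasource ConfigMap has roughly this shape (a
minimal sketch; the ConfigMap name, datasource UID, and URL are illustrative, and the
label matches the one described in Notes below):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource            # illustrative name
  namespace: monitoring
  labels:
    grafana_datasource: "1"        # the label the Grafana sidecar watches for
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        uid: loki                  # a stable UID that dashboards can reference
        url: http://loki.monitoring.svc.cluster.local:3100
        access: proxy
```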

## Solution

### Step 1: Identify the stale datasource

List all datasources via the API (reads usually work even when RBAC blocks writes):

```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s "http://localhost:3000/api/datasources" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
  "import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
```

### Step 2: Try API deletion first

```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
```

If this returns a permissions error, proceed to Step 3.

### Step 3: Delete directly from MySQL

When Grafana RBAC blocks API operations, go through MySQL:

```bash
# Find the Grafana MySQL password
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'echo $GF_DATABASE_PASSWORD'

# Find the stale datasource
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "SELECT id, uid, name, url FROM data_source;"

# Delete it
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
```

### Step 4: Fix dashboards referencing the old UID

Dashboards store datasource UIDs in their JSON. Update via MySQL:

```bash
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
```
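
Before running the UPDATE above, you can check which dashboards actually reference
the old UID (same MySQL access; `dashboard.data` is the JSON column the UPDATE rewrites):

```bash
# List dashboards whose JSON still contains the stale UID
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "SELECT id, title FROM dashboard WHERE data LIKE '%<OLD_UID>%';"
```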

### Step 5: Refresh Grafana

Hard-refresh the browser (Cmd+Shift+R). If the datasource still doesn't appear:

```bash
kubectl rollout restart deploy -n monitoring grafana
```

## Verification

```bash
# Verify only correct datasources remain
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s "http://localhost:3000/api/datasources" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
```

## Notes

- Grafana's sidecar auto-discovers ConfigMaps with the label `grafana_datasource: "1"`
  and provisions datasources from them. These are file-provisioned and show as
  "provisioned" in the UI.
- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
  database pointing to services like `loki-gateway`. If you disable the gateway,
  this datasource becomes stale.
- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
  so the dashboard JSON files in the repo are reference copies only.
- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.

.claude/skills/loki-helm-deployment-pitfalls/SKILL.md (new file, 143 lines)
@@ -0,0 +1,143 @@
---
name: loki-helm-deployment-pitfalls
description: |
  Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
  Use when: (1) the Loki pod fails with "mkdir: read-only file system" for compactor
  or ruler paths, (2) the Helm chart fails with "Helm test requires the Loki Canary
  to be enabled", (3) Helm install fails with "cannot re-use a name that is still
  in use" after a failed atomic deploy, (4) a PV is stuck in Released state after a
  failed Helm install, (5) "entry too far behind" errors flood the Loki logs after
  the initial Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---

# Loki Helm Chart Deployment Pitfalls

## Problem

Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
multiple non-obvious failures that aren't documented together.

## Context / Trigger Conditions

- Deploying Loki via `helm_release` in Terraform
- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
- First-time deployment, or redeployment after failures

## Pitfall 1: Read-Only Root Filesystem

**Error:** `mkdir /loki/compactor: read-only file system`

**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
for security. The compactor `working_directory` and ruler `rule_path` default to
paths under `/loki/`, which is on the read-only root FS.

**Fix:** Use paths under `/var/loki/`; the Helm chart mounts the persistence
volume there:

```yaml
compactor:
  working_directory: /var/loki/compactor  # NOT /loki/compactor
ruler:
  rule_path: /var/loki/scratch            # NOT /loki/scratch
```
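
After redeploying, a quick probe can confirm the paths are writable (a sketch; the
pod name `loki-0` and container name `loki` assume the chart's default single-binary
StatefulSet naming):

```bash
# Create and remove a marker file under the persistent mount
kubectl exec -n monitoring loki-0 -c loki -- \
  sh -c 'touch /var/loki/.rw-test && rm /var/loki/.rw-test && echo "/var/loki writable"'
```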

## Pitfall 2: Canary Required

**Error:** `Helm test requires the Loki Canary to be enabled`

**Cause:** The Loki Helm chart's validation template requires `lokiCanary.enabled`
to be true. You cannot disable it.

**Fix:** Leave `lokiCanary` enabled (the default). You can disable `gateway`,
`chunksCache`, and `resultsCache` to reduce resource usage:

```yaml
gateway:
  enabled: false
chunksCache:
  enabled: false
resultsCache:
  enabled: false
# Do NOT add: lokiCanary: enabled: false
```

## Pitfall 3: Stale Helm Release After Failed Atomic Deploy

**Error:** `cannot re-use a name that is still in use`

**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
sometimes leaves a stale release secret in Kubernetes. Terraform then can't
create a new release with the same name.

**Fix:** Delete the stale Helm secret:

```bash
kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
```
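
If you're unsure which release secrets exist, list them first; Helm 3 stores release
state in Secrets labeled with `owner=helm` and the release `name`:

```bash
# Show every revision secret Helm kept for the "loki" release and its status
kubectl get secret -n monitoring -l owner=helm,name=loki \
  -o custom-columns=NAME:.metadata.name,STATUS:.metadata.labels.status
```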

Also consider removing `atomic = true` for initial deployments and adding it
back after the first successful install. Use a longer `timeout` (600s+) for the
first deploy, since image pulls take time.
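
In Terraform, the corresponding `helm_release` shape looks roughly like this (a
minimal sketch; the repository URL is the public Grafana charts repo, and the values
file name is illustrative):

```hcl
resource "helm_release" "loki" {
  name       = "loki"
  namespace  = "monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  timeout = 600    # generous first-install timeout; image pulls take time
  # atomic = true  # consider enabling only after the first successful install

  values = [file("${path.module}/loki-values.yaml")]
}
```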

## Pitfall 4: PV Stuck in Released State

**Symptom:** The PV shows `Released` status, the PVC can't bind, and the Loki pod
is stuck in Pending.

**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
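
You can confirm the stale reference before patching; this prints the PV phase and
the claim it still points at:

```bash
kubectl get pv loki \
  -o jsonpath='{.status.phase}{"  claimRef: "}{.spec.claimRef.namespace}{"/"}{.spec.claimRef.name}{"\n"}'
```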

**Fix:** Clear the stale claimRef:

```bash
kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
```

The PV will transition from `Released` to `Available` and can be bound again.

## Pitfall 5: "Entry Too Far Behind" Log Spam

**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`

**Cause:** On first startup, Alloy reads all historical log files from the
Kubernetes API. Loki's ingester rejects the old entries because they're behind
the newest entry for each stream.

**Fix:** This is harmless and self-resolving; Alloy catches up to the present
and the errors stop. To clear it immediately:

```bash
kubectl rollout restart ds -n monitoring alloy
```

After the restart, Alloy tails from approximately "now" for each container.

## Pitfall 6: Alertmanager Service Name

**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.

**Cause:** The Prometheus Helm chart names the Alertmanager service
`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
silent alert delivery failures.

**Fix:**

```yaml
ruler:
  alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
```

Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`

## Verification

```bash
# Loki pod running
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki

# Loki receiving logs
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s 'http://localhost:3100/loki/api/v1/labels'
# Should return JSON with namespace, pod, container labels

# PV bound
kubectl get pv loki
# STATUS should be "Bound"
```

## Notes

- Always check PV status before retrying a failed deploy.
- The Loki Helm chart creates many components by default (gateway, canary,
  memcached caches); disable what you don't need for single-binary mode.
- The WAL directory can be on tmpfs (an emptyDir with `medium: Memory`) for
  disk-friendly setups, but its contents are lost on pod crash (see the sketch
  after this list).
- See also: `helm-release-force-rerender` for Helm values not updating resources.
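
A minimal sketch of the tmpfs WAL volume mentioned above (generic pod-spec YAML;
the mount path is illustrative, and you'd wire it through whatever extra-volume
hooks your chart version exposes):

```yaml
volumes:
  - name: wal
    emptyDir:
      medium: Memory        # tmpfs: fast, no disk wear, but lost on pod crash
volumeMounts:
  - name: wal
    mountPath: /var/loki/wal
```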