From ca43b97fa08667082dc2f87d55aa4f669c017ef8 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 13 Feb 2026 23:47:45 +0000
Subject: [PATCH] [ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup

---
 .../grafana-stale-datasource-cleanup/SKILL.md | 105 +++++++++++++
 .../loki-helm-deployment-pitfalls/SKILL.md    | 143 ++++++++++++++++++
 2 files changed, 248 insertions(+)
 create mode 100644 .claude/skills/grafana-stale-datasource-cleanup/SKILL.md
 create mode 100644 .claude/skills/loki-helm-deployment-pitfalls/SKILL.md

diff --git a/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md b/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md
new file mode 100644
index 00000000..040d5de6
--- /dev/null
+++ b/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md
@@ -0,0 +1,105 @@
+---
+name: grafana-stale-datasource-cleanup
+description: |
+  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
+  with provisioned ones, or when stale datasources persist in the MySQL database.
+  Use when: (1) Grafana shows "dial tcp: lookup no such host" for a datasource,
+  (2) the Grafana API returns "datasources:delete permissions needed" when trying to
+  remove a datasource, (3) a provisioned datasource exists but Grafana uses a stale
+  one from the database, (4) a Helm chart auto-creates a datasource pointing to a
+  disabled gateway service (e.g., loki-gateway). Requires direct MySQL access when
+  Grafana RBAC blocks API operations.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Grafana Stale Datasource Cleanup
+
+## Problem
+Grafana uses a stale or incorrect datasource from its MySQL database instead of
+the correctly provisioned one. This is common when a Helm chart auto-creates a
+datasource that points to a service you have disabled (e.g., the Loki gateway).
+
+## Context / Trigger Conditions
+- Grafana shows the error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
+- A provisioned datasource (via the ConfigMap sidecar) is correct, but Grafana uses
+  a different one stored in MySQL
+- The Grafana API returns `"permissions needed: datasources:delete"` or
+  `"permissions needed: datasources:write"` even with admin credentials
+- A dashboard references a datasource UID that points to the wrong URL
+
+## Solution
+
+### Step 1: Identify the stale datasource
+
+List all datasources via the API (this usually works even with RBAC):
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
+  "import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
+```
+
+### Step 2: Try API deletion first
+
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<UID>" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
+```
+
+If this returns a permissions error, proceed to Step 3.
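+
+Before dropping to MySQL, it can help to confirm which datasources are
+file-provisioned and which exist only in the database. A minimal sketch,
+assuming the API's `readOnly` field is populated for provisioned datasources
+(true on recent Grafana versions, but worth verifying on yours):
+```bash
+# readOnly=True  -> file-provisioned by the sidecar (leave in place)
+# readOnly=False -> stored only in MySQL (candidate for deletion)
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
+  "import sys,json; [print(d['uid'], d['name'], d.get('readOnly')) for d in json.load(sys.stdin)]"
+```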
+
+### Step 3: Delete directly from MySQL
+
+When Grafana RBAC blocks API operations, go through MySQL:
+
+```bash
+# Find the Grafana MySQL password
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'echo $GF_DATABASE_PASSWORD'
+
+# Find the stale datasource
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "SELECT id, uid, name, url FROM data_source;"
+
+# Delete it
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "DELETE FROM data_source WHERE uid='<UID>';"
+```
+
+### Step 4: Fix dashboards referencing the old UID
+
+Dashboards store datasource UIDs in their JSON. Update them via MySQL:
+```bash
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
+```
+
+### Step 5: Refresh Grafana
+
+Hard-refresh the browser (Cmd+Shift+R). If the datasource still doesn't appear:
+```bash
+kubectl rollout restart deploy -n monitoring grafana
+```
+
+## Verification
+```bash
+# Verify that only the correct datasources remain
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
+```
+
+## Notes
+- Grafana's sidecar auto-discovers ConfigMaps with the label `grafana_datasource: "1"`
+  and provisions datasources from them. These are file-provisioned and show as
+  "provisioned" in the UI.
+- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
+  database pointing to services like `loki-gateway`. If you disable the gateway,
+  this datasource becomes stale.
+- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
+  so the dashboard JSON files in the repo are reference copies only.
+- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
+- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
diff --git a/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md b/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md
new file mode 100644
index 00000000..a067fd5e
--- /dev/null
+++ b/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md
@@ -0,0 +1,143 @@
+---
+name: loki-helm-deployment-pitfalls
+description: |
+  Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
+  Use when: (1) the Loki pod fails with "mkdir: read-only file system" for compactor
+  or ruler paths, (2) the Helm chart fails with "Helm test requires the Loki Canary
+  to be enabled", (3) Helm install fails with "cannot re-use a name that is still
+  in use" after a failed atomic deploy, (4) a PV is stuck in Released state after a
+  failed Helm install, (5) "entry too far behind" errors flood the Loki logs after
+  the initial Alloy deployment. Covers single-binary mode with filesystem storage
+  on NFS.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Loki Helm Chart Deployment Pitfalls
+
+## Problem
+Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
+multiple non-obvious failures that aren't documented together.
+
+## Context / Trigger Conditions
+- Deploying Loki via `helm_release` in Terraform
+- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
+- First-time deployment, or redeployment after failures
+
+## Pitfall 1: Read-Only Root Filesystem
+
+**Error:** `mkdir /loki/compactor: read-only file system`
+
+**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
+for security. The compactor `working_directory` and ruler `rule_path` default to
+paths under `/loki/`, which is on the read-only root FS.
+
+**Fix:** Use paths under `/var/loki/`; the Helm chart mounts the persistence
+volume there:
+```yaml
+compactor:
+  working_directory: /var/loki/compactor  # NOT /loki/compactor
+ruler:
+  rule_path: /var/loki/scratch  # NOT /loki/scratch
+```
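+
+If you want to confirm that the read-only root filesystem is what you are
+hitting, inspect the container's security context. The label selector below is
+the same one used in the Verification section of this skill:
+```bash
+# Prints "true" when the root FS is read-only and only mounted
+# volumes such as /var/loki are writable
+kubectl get pod -n monitoring -l app.kubernetes.io/name=loki \
+  -o jsonpath='{.items[0].spec.containers[0].securityContext.readOnlyRootFilesystem}'
+```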
+
+## Pitfall 2: Canary Required
+
+**Error:** `Helm test requires the Loki Canary to be enabled`
+
+**Cause:** The Loki Helm chart's validation template requires `lokiCanary.enabled`
+to be true. You cannot disable it.
+
+**Fix:** Leave `lokiCanary` enabled (the default). You can disable `gateway`,
+`chunksCache`, and `resultsCache` to reduce resource usage:
+```yaml
+gateway:
+  enabled: false
+chunksCache:
+  enabled: false
+resultsCache:
+  enabled: false
+# Do NOT add: lokiCanary: enabled: false
+```
+
+## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
+
+**Error:** `cannot re-use a name that is still in use`
+
+**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
+sometimes leaves a stale release secret in Kubernetes. Terraform then can't
+create a new release with the same name.
+
+**Fix:** Delete the stale Helm secret:
+```bash
+kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
+```
+Also consider removing `atomic = true` for initial deployments and adding it
+back after the first successful install. Use a longer `timeout` (600s+) for the
+first deploy, since image pulls take time; see the sketch after this section.
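+
+A minimal `helm_release` sketch reflecting that advice. The repository URL is
+the public Grafana charts repo; the namespace and values-file path are
+assumptions to adapt to your module layout:
+```hcl
+resource "helm_release" "loki" {
+  name       = "loki"
+  namespace  = "monitoring"
+  repository = "https://grafana.github.io/helm-charts"
+  chart      = "loki"
+
+  # First deploy: skip atomic rollback and allow time for image pulls.
+  # Re-enable atomic = true once the initial install has succeeded.
+  timeout = 600
+  # atomic = true
+
+  values = [file("${path.module}/loki-values.yaml")]
+}
+```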
+
+## Pitfall 4: PV Stuck in Released State
+
+**Symptom:** The PV shows `Released` status, the PVC can't bind, and the Loki
+pod is stuck in Pending.
+
+**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
+`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
+
+**Fix:** Clear the stale claimRef:
+```bash
+kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
+```
+The PV will transition from `Released` to `Available` and can be bound again.
+
+## Pitfall 5: "Entry Too Far Behind" Log Spam
+
+**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
+
+**Cause:** Alloy reads all historical log files from the Kubernetes API on its
+first startup. Old entries are rejected by Loki's ingester because they're
+behind the newest entry for that stream.
+
+**Fix:** This is harmless and self-resolving: Alloy catches up to the present
+and the errors stop. To clear them immediately:
+```bash
+kubectl rollout restart ds -n monitoring alloy
+```
+After the restart, Alloy tails from approximately "now" for each container.
+
+## Pitfall 6: Alertmanager Service Name
+
+**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
+
+**Cause:** The Prometheus Helm chart names the Alertmanager service
+`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
+silent alert delivery failures.
+
+**Fix:**
+```yaml
+ruler:
+  alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
+```
+Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
+
+## Verification
+```bash
+# Loki pod running
+kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
+
+# Loki receiving logs
+kubectl port-forward -n monitoring svc/loki 3100:3100 &
+curl -s 'http://localhost:3100/loki/api/v1/labels'
+# Should return JSON with namespace, pod, and container labels
+
+# PV bound
+kubectl get pv loki
+# STATUS should be "Bound"
+```
+
+## Notes
+- Always check the PV status before retrying a failed deploy
+- The Loki Helm chart creates many components by default (gateway, canary,
+  memcached caches); disable what you don't need for single-binary mode
+- The WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
+  disk-friendly setups, but data is lost on a pod crash; see the sketch below
+- See also: `helm-release-force-rerender` for Helm values not updating resources
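+
+A sketch of the tmpfs WAL setup mentioned in the notes. The
+`singleBinary.extraVolumes`/`extraVolumeMounts` keys follow the Loki chart's
+conventions and `/var/loki/wal` is the chart's usual WAL location, but verify
+both against your chart version:
+```yaml
+singleBinary:
+  extraVolumes:
+    - name: wal
+      emptyDir:
+        medium: Memory    # tmpfs: fast, but contents vanish on pod crash
+  extraVolumeMounts:
+    - name: wal
+      mountPath: /var/loki/wal
+```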