infra/docs/plans/2026-06-21-tripit-observability-tempo-otel.md
Viktor Barzin 1595bddfc2
feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00

Tracing capability: Grafana Tempo + OpenTelemetry Collector

Status: implemented (Phase 2) · 2026-06-22 · driver: TripIt observability
Companion to: tripit repo docs/adr/0032-observability-otel-traces-and-content-logging.md
Extends: monitoring architecture

Why

The monitoring stack has metrics (Prometheus), logs (Loki, 30d) and alerting, but until now had no distributed tracing. TripIt added end-to-end OpenTelemetry instrumentation to reproduce failed user flows and measure performance; its spans need a home, and logs and traces need to be correlated. This is a new shared cluster capability; TripIt is just the first consumer.

What landed (stacks/monitoring/modules/monitoring/)

  1. Grafana Tempo (tempo.tf / tempo.yaml): single-binary, filesystem storage on a proxmox-lvm PVC (20Gi), 30-day retention, OTLP receivers. tempo.resources is set explicitly (request 256Mi / limit 2Gi), because the single-binary chart ignores a top-level resources: key and the pod is otherwise OOMKilled under the namespace LimitRange default.
  2. OpenTelemetry Collector (otel-collector.tf / otel-collector.yaml): contrib image (the redaction processor is contrib-only), with a single traces pipeline: otlp -> redaction -> batch -> otlp/tempo. The redaction processor is the deny-list backstop (drops bearer/JWT/PEM-shaped values).
  3. Grafana correlation: a tempo datasource (tempo.tf), and a derivedFields addition on the Loki datasource (loki.tf) that pulls trace_id out of tripit's JSON logs and deep-links to Tempo. The change is additive (no uid change), so existing dashboards are unaffected.
  4. App flip (stacks/tripit/main.tf): tripit now sets LOG_FORMAT=json and points OTEL_EXPORTER_OTLP_ENDPOINT at the Collector.
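As a rough illustration of the Collector pipeline in item 2, the configuration might look like the fragment below. This is a sketch under stated assumptions, not the contents of otel-collector.yaml: the listen addresses, the Tempo service address, and the three regular expressions are hypothetical. Only the otlp -> redaction -> batch -> otlp/tempo shape comes from this document.

```yaml
# Collector configuration sketch (in the Helm chart this would sit under `config:`).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # assumed default OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318   # assumed default OTLP/HTTP port

processors:
  redaction:
    # Keep every attribute key; only the values are screened.
    allow_all_keys: true
    # Hypothetical deny-list patterns for bearer tokens, JWTs and PEM headers.
    blocked_values:
      - '(?i)bearer\s+[A-Za-z0-9._~+/=-]+'
      - 'eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+'
      - '-----BEGIN [A-Z ]+-----'
    summary: info
  batch: {}

exporters:
  otlp/tempo:
    # Hypothetical in-cluster address of the Tempo OTLP/gRPC receiver.
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [redaction, batch]   # redaction runs before batching
      exporters: [otlp/tempo]
```

Placing redaction ahead of batch means values are screened before spans are grouped for export. As far as the upstream contrib documentation describes it, values matching blocked_values are masked in place, so the exact effect on matching attributes is worth confirming against the deployed version.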

Both helm releases use atomic=true + cleanup_on_fail=true: a failed install auto-rolls-back rather than leaving a stuck failed release (the first-attempt failure mode — see infra memory #6479).
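For the Tempo release, the values file it consumes could be sketched as follows. The point it illustrates is the placement of resources under the tempo: key rather than at the top level. The key names follow the grafana/tempo single-binary chart layout as commonly published and should be checked against the pinned chart version. The storage paths and receiver ports are assumptions, and the 256Mi / 2Gi figures are read here as memory.

```yaml
# Values sketch for the single-binary Tempo chart (not the actual tempo.yaml).
tempo:
  retention: 720h            # 30 days
  resources:                 # must be nested under tempo:, not top-level
    requests:
      memory: 256Mi
    limits:
      memory: 2Gi
  storage:
    trace:
      backend: local         # filesystem storage on the PVC
      local:
        path: /var/tempo/traces   # assumed path
      wal:
        path: /var/tempo/wal      # assumed path
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317  # assumed default port
        http:
          endpoint: 0.0.0.0:4318  # assumed default port

persistence:
  enabled: true
  storageClassName: proxmox-lvm
  size: 20Gi
```

With an explicit limit in place, the namespace LimitRange default no longer applies to the Tempo container.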

Notes

  • Cardinality: trace_id and session.id are carried only as span attributes and log fields, never as Prometheus or Loki labels.
  • Privacy: ADR-0032 records the accepted trade-off that TripIt logs user content to shared monitoring; the Collector redaction processor enforces the deny-list on the trace path.
  • Apply: Terraform-only, presence-claimed (stack:monitoring). Update docs/architecture/monitoring.md (components table + diagram) once stable.
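The Grafana correlation and the cardinality note can be illustrated together with a datasource sketch in Grafana's provisioning format. The Terraform in tempo.tf and loki.tf may render this differently. The uids, URLs, ports, link label and regular expression below are hypothetical. The trace ID is extracted from the log line body by regex at query time, so it never needs to be a Loki label.

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo                # hypothetical uid
    access: proxy
    # Hypothetical URL; use the Tempo Service's actual HTTP port.
    url: http://tempo.monitoring.svc.cluster.local:3200

  - name: Loki
    type: loki
    uid: loki                 # existing uid stays unchanged (additive change)
    access: proxy
    url: http://loki.monitoring.svc.cluster.local:3100   # hypothetical URL
    jsonData:
      derivedFields:
        - name: trace_id
          # Capture the trace_id value from a JSON log line.
          matcherRegex: '"trace_id":\s*"(\w+)"'
          # `$$` escapes `$` so provisioning does not treat it as an env var.
          url: '$${__value.raw}'
          datasourceUid: tempo      # internal link to the Tempo datasource
          urlDisplayLabel: View trace
```

Because derivedFields is just one more key under the Loki datasource's jsonData, adding it does not alter the datasource identity that existing dashboards reference.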