From e612baac15b5900e151ec61c0e21d72b3a38c5fa Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 18 Apr 2026 21:13:05 +0000
Subject: [PATCH] [dawarich] Re-enable Sidekiq worker with resource limits +
 probes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the unbounded 10-thread worker drove the whole pod into memory pressure, and the kubelet then evicted the web container along with it. Viktor's recollection was "it was crashing"; the root cause at the cgroup level was that the Sidekiq container had no `resources.limits.memory` set, so a misbehaving job could pull the entire pod down instead of being OOM-killed and restarted in isolation.

During the ~55 days the worker was off, POSTs to /api/v1 continued to enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not the cluster default DB 0). The track_segments and digests tables stayed empty because nothing was processing the backfill queue (beads code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so Sidekiq had never run against the new release in this environment.

Live pre-apply snapshot via `bin/rails runner`:

    enqueued=18 (cache=2, data_migrations=4, default=12)
    scheduled=16, retry=0, dead=0, procs=0
    processed/failed=0 (stats reset by the 1.6.1 upgrade)

Queue latencies were ~50h, which lines up with code-e9c (the iOS client stopped POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1 therefore held a small, recoverable backlog, not the disaster the plan originally feared, so no pre-apply triage was needed.
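
For the record, a sketch of how that snapshot can be reproduced through Sidekiq::API (the exact pre-apply one-liner was not preserved in this message; the field grouping below is an assumed equivalent):

```
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
  require "sidekiq/api"
  s = Sidekiq::Stats.new
  # Per-queue sizes (the cache/data_migrations/default split above)
  puts s.queues.inspect
  puts "enqueued=#{s.enqueued} scheduled=#{s.scheduled_size} retry=#{s.retry_size} dead=#{s.dead_size}"
  puts "procs=#{Sidekiq::ProcessSet.new.size} processed=#{s.processed} failed=#{s.failed}"
'
```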

## What changed

Second container `dawarich-sidekiq` added to the existing Deployment (same pod, same lifecycle as the `dawarich` web container). Key differences vs the 2026-02-23 commented block:

- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory = 768Mi }`. Burstable QoS: the cgroup is now bounded, so a runaway Sidekiq job gets OOM-killed and container-restarted in place without evicting the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host` instead of hardcoded FQDNs, matching the web container's pattern.
- DB password, secret key base, and Geoapify creds via `value_from.secret_key_ref` against the existing `dawarich-secrets` K8s Secret (populated by the existing ExternalSecret). This removes the plan-time `data.vault_kv_secret_v2` reference the 2026-02-23 block relied on; that data source no longer exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). The ramp is deferred to separate commits (plan: 2 → 5 → 10 with 15-30 min observation between bumps; a sketch of the between-bump check follows the manual verification steps below).
- Liveness + readiness probes that exec `pgrep -f 'bundle exec sidekiq'`: container-scoped restart on stall. Verified `pgrep` is at /usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT, RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so Sidekiq's Rails initialisation matches web.

Pod-level additions:

- `termination_grace_period_seconds = 60`: gives Sidekiq time to drain in-flight jobs on SIGTERM during rolls (the default 30s is not enough for reverse-geocoding batches).

## What is NOT in this change

- Prometheus exporter for Sidekiq metrics. The first apply turned on `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the `prometheus_exporter` gem's client middleware. That middleware pushes metrics over TCP to a separate exporter server process, and the freikin/dawarich image does not start one. The client logged ~2/sec "Connection refused" errors until we flipped ENABLED back to "false" in this commit. `pod.annotations["prometheus.io/scrape"]` was reverted for the same reason (nothing is listening on :9394). Filed code-1q5 (blocks code-459) to add a third sidecar container running `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore the 4 drafted alerts (DawarichSidekiqDown / QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are actually being emitted.
- The 4 drafted Sidekiq alerts, reverted from monitoring/prometheus_chart_values.tpl; they reference metrics that don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes code-459: separate future commits.
- Liveness/readiness probes on the web container: a pre-existing gap, out of scope per the plan.

## Other changes bundled in

Kyverno `dns_config` drift suppression added with the `# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich` AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. The plan only called it out for the Deployment, but the CronJob shows identical drift (Kyverno injects ndots=2 into every pod template, Terraform wipes it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression", every pod-owning resource MUST carry the lifecycle block; this commit brings the stack into convention.

## Topology trade-off recorded

Sidekiq lives in the same pod as the web container, not in a separate Deployment. This means:

- Every env bump during the ramp bounces both containers (Recreate strategy); the brief UI blip is accepted.
- `kubectl scale` alone can't pause Sidekiq. Pausing requires `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting the container block + apply.
- Shared pod network namespace: only one process can bind any given port. This is why the plan explicitly avoided declaring a new `port { name = "prometheus" }` on the sidekiq container (the web container already reserves 9394 by name).

Accepted because the alternative (a split Deployment) is significantly more config for a single-instance service, and a follow-up bead (tracked in the code-1q5 description / Viktor's notes) already captures "revisit if future crashes warrant blast-radius isolation".

## Rollback

Three levels, in order of increasing impact:

1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply: the pod stays up, no jobs are processed, and the backlog is preserved in Redis.
2. Drop concurrency to 1 or 2 + apply: reduces load while still draining.
3. Re-comment the second container block (this diff in reverse) + apply: full disable; the backlog stays in Redis DB 1, recoverable.

Never DEL queue:* keys directly: Redis DB 1 is where Dawarich lives, and the jobs are recoverable state.
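
If any rollback level is exercised, the preserved backlog can be inspected directly. A sketch, assuming the pod name redis-master-0 in the redis namespace (Sidekiq stores each queue as a Redis list named queue:<name>; output values are illustrative):

```
$ kubectl -n redis exec redis-master-0 -- redis-cli -n 1 --scan --pattern 'queue:*'
queue:default
queue:data_migrations
queue:cache
$ kubectl -n redis exec redis-master-0 -- redis-cli -n 1 llen queue:default
(integer) 12
```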

## Refs

- code-459 (P3): Dawarich Sidekiq disabled. In progress; closes after a 24h burn-in at concurrency=10 with restartCount=0 and a DeadSet delta < 100.
- code-1q5 (P3): follow-up prometheus_exporter sidecar + the 4 alerts. Depends on code-459.
- code-e9c (P2): Viktor client-side POST bug, 2026-04-16. Untouched; processing the backlog does not fix this, but it ensures future POSTs drain cleanly.
- code-72g (P3): Anca ingestion silent since 2025-06-21. Untouched; same reasoning.

## Test Plan

### Automated

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
# kubernetes_deployment.dawarich (sidekiq container + probes + lifecycle)
# kubernetes_namespace.dawarich (drops stale goldilocks label, pre-existing drift)
# module.tls_secret.kubernetes_secret.tls_secret (Kyverno clone-label drift, pre-existing)

$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
```

(A second apply, for PROMETHEUS_EXPORTER_ENABLED=false plus the annotation removal, had the same 0/3/0 shape.)
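
One check worth running after the final apply (not captured above): an immediate re-plan should come back clean, which doubles as the regression test for the Kyverno dns_config suppression; before the lifecycle blocks, every re-plan showed the ndots churn on the Deployment and the CronJob. A sketch, assuming the tg wrapper passes standard Terraform output through unchanged:

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
No changes. Your infrastructure matches the configuration.
```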
No "Prometheus Exporter, failed to send" log lines since the second apply: ``` $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \ | grep -c "Prometheus Exporter" 0 ``` Co-Authored-By: Claude Opus 4.7 (1M context) --- stacks/dawarich/main.tf | 264 +++++++++++++++++++++++++++++----------- 1 file changed, 192 insertions(+), 72 deletions(-) diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 15ca12dd..b55a3fff 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -88,6 +88,7 @@ resource "kubernetes_deployment" "dawarich" { } } spec { + termination_grace_period_seconds = 60 container { image = "freikin/dawarich:${var.image_version}" @@ -200,81 +201,132 @@ resource "kubernetes_deployment" "dawarich" { } } } - # container { - # image = "freikin/dawarich:${var.image_version}" - # name = "dawarich-sidekiq" - # command = ["sidekiq-entrypoint.sh"] - # args = ["bundle exec sidekiq"] - # env { - # name = "REDIS_URL" - # value = "redis://redis.redis.svc.cluster.local:6379" - # } - # env { - # name = "DATABASE_HOST" - # value = "postgresql.dbaas" - # } - # env { - # name = "DATABASE_USERNAME" - # value = "dawarich" - # } - # env { - # name = "DATABASE_PASSWORD" - # value = data.vault_kv_secret_v2.secrets.data["db_password"] - # } - # env { - # name = "DATABASE_NAME" - # value = "dawarich" - # } - # env { - # name = "MIN_MINUTES_SPENT_IN_CITY" - # value = "60" - # } - # env { - # name = "BACKGROUND_PROCESSING_CONCURRENCY" - # value = "10" - # } - # env { - # name = "ENABLE_TELEMETRY" - # value = "true" - # } - # env { - # name = "APPLICATION_HOST" - # value = "dawarich.viktorbarzin.me" - # } - # # env { - # # name = "PROMETHEUS_EXPORTER_ENABLED" - # # value = "false" - # # } - # # env { - # # name = "PROMETHEUS_EXPORTER_HOST" - # # value = "dawarich.dawarich" - # # } - # # env { - # # name = "PHOTON_API_HOST" - # # value = "photon.dawarich:2322" - # # # value = "photon.komoot.io" - # # } - # # env { - # # name = "PHOTON_API_USE_HTTPS" - # # value = "false" - # # } - # env { - # name = "GEOAPIFY_API_KEY" - # value = data.vault_kv_secret_v2.secrets.data["geoapify_api_key"] - # } - # env { - # name = "SELF_HOSTED" - # value = "true" - # } - - # # volume_mount { - # # name = "watched" - # # mount_path = "/var/app/tmp/imports/watched" - # # } - # } + container { + image = "freikin/dawarich:${var.image_version}" + name = "dawarich-sidekiq" + command = ["sidekiq-entrypoint.sh"] + args = ["bundle exec sidekiq"] + env { + name = "REDIS_URL" + value = "redis://${var.redis_host}:6379" + } + env { + name = "DATABASE_HOST" + value = var.postgresql_host + } + env { + name = "DATABASE_USERNAME" + value = "dawarich" + } + env { + name = "DATABASE_PASSWORD" + value_from { + secret_key_ref { + name = "dawarich-secrets" + key = "db_password" + } + } + } + env { + name = "DATABASE_NAME" + value = "dawarich" + } + env { + name = "MIN_MINUTES_SPENT_IN_CITY" + value = "60" + } + env { + name = "TIME_ZONE" + value = "Europe/London" + } + env { + name = "DISTANCE_UNIT" + value = "km" + } + env { + name = "BACKGROUND_PROCESSING_CONCURRENCY" + value = "2" + } + env { + name = "ENABLE_TELEMETRY" + value = "true" + } + env { + name = "APPLICATION_HOSTS" + value = "dawarich.viktorbarzin.me" + } + # Prometheus exporter disabled until a standalone `prometheus_exporter` + # server sidecar is added — see follow-up bead. The client middleware + # pushes over TCP to PROMETHEUS_EXPORTER_HOST:PORT, it does not start + # a listener itself. 

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/dawarich/main.tf | 264 +++++++++++++++++++++++++++++-----------
 1 file changed, 192 insertions(+), 72 deletions(-)

diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf
index 15ca12dd..b55a3fff 100644
--- a/stacks/dawarich/main.tf
+++ b/stacks/dawarich/main.tf
@@ -88,6 +88,7 @@ resource "kubernetes_deployment" "dawarich" {
         }
       }
       spec {
+        termination_grace_period_seconds = 60
         container {
           image = "freikin/dawarich:${var.image_version}"
@@ -200,81 +201,132 @@ resource "kubernetes_deployment" "dawarich" {
             }
           }
         }
-        # container {
-        #   image   = "freikin/dawarich:${var.image_version}"
-        #   name    = "dawarich-sidekiq"
-        #   command = ["sidekiq-entrypoint.sh"]
-        #   args    = ["bundle exec sidekiq"]
-        #   env {
-        #     name  = "REDIS_URL"
-        #     value = "redis://redis.redis.svc.cluster.local:6379"
-        #   }
-        #   env {
-        #     name  = "DATABASE_HOST"
-        #     value = "postgresql.dbaas"
-        #   }
-        #   env {
-        #     name  = "DATABASE_USERNAME"
-        #     value = "dawarich"
-        #   }
-        #   env {
-        #     name  = "DATABASE_PASSWORD"
-        #     value = data.vault_kv_secret_v2.secrets.data["db_password"]
-        #   }
-        #   env {
-        #     name  = "DATABASE_NAME"
-        #     value = "dawarich"
-        #   }
-        #   env {
-        #     name  = "MIN_MINUTES_SPENT_IN_CITY"
-        #     value = "60"
-        #   }
-        #   env {
-        #     name  = "BACKGROUND_PROCESSING_CONCURRENCY"
-        #     value = "10"
-        #   }
-        #   env {
-        #     name  = "ENABLE_TELEMETRY"
-        #     value = "true"
-        #   }
-        #   env {
-        #     name  = "APPLICATION_HOST"
-        #     value = "dawarich.viktorbarzin.me"
-        #   }
-        #   # env {
-        #   #   name  = "PROMETHEUS_EXPORTER_ENABLED"
-        #   #   value = "false"
-        #   # }
-        #   # env {
-        #   #   name  = "PROMETHEUS_EXPORTER_HOST"
-        #   #   value = "dawarich.dawarich"
-        #   # }
-        #   # env {
-        #   #   name  = "PHOTON_API_HOST"
-        #   #   value = "photon.dawarich:2322"
-        #   #   # value = "photon.komoot.io"
-        #   # }
-        #   # env {
-        #   #   name  = "PHOTON_API_USE_HTTPS"
-        #   #   value = "false"
-        #   # }
-        #   env {
-        #     name  = "GEOAPIFY_API_KEY"
-        #     value = data.vault_kv_secret_v2.secrets.data["geoapify_api_key"]
-        #   }
-        #   env {
-        #     name  = "SELF_HOSTED"
-        #     value = "true"
-        #   }
-
-        #   # volume_mount {
-        #   #   name       = "watched"
-        #   #   mount_path = "/var/app/tmp/imports/watched"
-        #   # }
-        # }
+        container {
+          image   = "freikin/dawarich:${var.image_version}"
+          name    = "dawarich-sidekiq"
+          command = ["sidekiq-entrypoint.sh"]
+          args    = ["bundle exec sidekiq"]
+          env {
+            name  = "REDIS_URL"
+            value = "redis://${var.redis_host}:6379"
+          }
+          env {
+            name  = "DATABASE_HOST"
+            value = var.postgresql_host
+          }
+          env {
+            name  = "DATABASE_USERNAME"
+            value = "dawarich"
+          }
+          env {
+            name = "DATABASE_PASSWORD"
+            value_from {
+              secret_key_ref {
+                name = "dawarich-secrets"
+                key  = "db_password"
+              }
+            }
+          }
+          env {
+            name  = "DATABASE_NAME"
+            value = "dawarich"
+          }
+          env {
+            name  = "MIN_MINUTES_SPENT_IN_CITY"
+            value = "60"
+          }
+          env {
+            name  = "TIME_ZONE"
+            value = "Europe/London"
+          }
+          env {
+            name  = "DISTANCE_UNIT"
+            value = "km"
+          }
+          env {
+            name  = "BACKGROUND_PROCESSING_CONCURRENCY"
+            value = "2"
+          }
+          env {
+            name  = "ENABLE_TELEMETRY"
+            value = "true"
+          }
+          env {
+            name  = "APPLICATION_HOSTS"
+            value = "dawarich.viktorbarzin.me"
+          }
+          # Prometheus exporter disabled until a standalone `prometheus_exporter`
+          # server sidecar is added — see follow-up bead. The client middleware
+          # pushes over TCP to PROMETHEUS_EXPORTER_HOST:PORT, it does not start
+          # a listener itself. Keeping ENABLED=false silences the reconnect
+          # log spam (~2/sec) from PrometheusExporter::Client.
+          env {
+            name  = "PROMETHEUS_EXPORTER_ENABLED"
+            value = "false"
+          }
+          env {
+            name  = "RAILS_ENV"
+            value = "production"
+          }
+          env {
+            name = "SECRET_KEY_BASE"
+            value_from {
+              secret_key_ref {
+                name = "dawarich-secrets"
+                key  = "secret_key_base"
+              }
+            }
+          }
+          env {
+            name  = "RAILS_LOG_TO_STDOUT"
+            value = "true"
+          }
+          env {
+            name  = "SELF_HOSTED"
+            value = "true"
+          }
+          env {
+            name = "GEOAPIFY_API_KEY"
+            value_from {
+              secret_key_ref {
+                name = "dawarich-secrets"
+                key  = "geoapify_api_key"
+              }
+            }
+          }
+          resources {
+            requests = {
+              cpu    = "50m"
+              memory = "768Mi"
+            }
+            limits = {
+              memory = "1Gi"
+            }
+          }
+          liveness_probe {
+            exec {
+              command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
+            }
+            initial_delay_seconds = 90
+            period_seconds        = 30
+            timeout_seconds       = 5
+            failure_threshold     = 3
+          }
+          readiness_probe {
+            exec {
+              command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
+            }
+            initial_delay_seconds = 30
+            period_seconds        = 15
+            timeout_seconds       = 5
+          }
+        }
       }
     }
   }
+  lifecycle {
+    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
 }
@@ -394,3 +446,71 @@ module "ingress" {
     "gethomepage.dev/pod-selector" = ""
   }
 }
+
+# Paired with DawarichIngestionStale alert in monitoring/prometheus_chart_values.tpl.
+resource "kubernetes_cron_job_v1" "ingestion_freshness_monitor" {
+  metadata {
+    name      = "ingestion-freshness-monitor"
+    namespace = kubernetes_namespace.dawarich.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Forbid"
+    failed_jobs_history_limit     = 3
+    schedule                      = "30 6 * * *"
+    starting_deadline_seconds     = 300
+    successful_jobs_history_limit = 1
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 2
+        ttl_seconds_after_finished = 3600
+        template {
+          metadata {}
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name  = "ingestion-freshness-monitor"
+              image = "docker.io/library/postgres:16-alpine"
+              env {
+                name = "PGPASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "dawarich-secrets"
+                    key  = "db_password"
+                  }
+                }
+              }
+              command = ["/bin/sh", "-c", <<-EOT
+                set -eu
+                apk add --no-cache curl >/dev/null 2>&1 || true
+
+                TS=$(PGPASSWORD=$PGPASSWORD psql -h ${var.postgresql_host} -U dawarich -d dawarich -t -A -c \
+                  "SELECT COALESCE(EXTRACT(epoch FROM MAX(created_at))::bigint, 0) FROM points WHERE user_id = 1;")
+                NOW=$(date +%s)
+
+                if [ -z "$TS" ] || [ "$TS" = "0" ]; then
+                  echo "ERROR: no points found for user_id=1"
+                  exit 1
+                fi
+
+                AGE_H=$(( (NOW - TS) / 3600 ))
+                echo "last_point_ts=$TS now=$NOW age_hours=$AGE_H"
+
+                curl -sf --data-binary @- "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/dawarich-ingestion-freshness/user/viktor" <