[dawarich] Re-enable Sidekiq worker with resource limits + probes
## Context
Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the root cause, at the cgroup level, was
that the Sidekiq container had no `resources.limits.memory` set, so a misbehaving
job could pull the entire pod down instead of being OOM-killed and
restarted in isolation.
During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). The `track_segments` and `digests` tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.
Live pre-apply snapshot via `bin/rails runner`:
- enqueued=18 (cache=2, data_migrations=4, default=12)
- scheduled=16, retry=0, dead=0, procs=0
- processed/failed=0 (stats reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.
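For the record, the snapshot above can be reproduced with a `bin/rails runner`
one-liner against the Sidekiq API, roughly as follows (a sketch; the exact
invocation used pre-apply was not captured in this commit):
```
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
  require "sidekiq/api"
  s = Sidekiq::Stats.new
  # Backlog totals plus lifetime counters and registered worker processes
  puts "enqueued=#{s.enqueued} scheduled=#{s.scheduled_size} retry=#{s.retry_size} dead=#{s.dead_size}"
  puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
  # Per-queue depth and latency (seconds since the oldest job was enqueued)
  Sidekiq::Queue.all.each { |q| puts "#{q.name}: size=#{q.size} latency=#{q.latency.round}s" }
'
```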
## What changed
Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:
- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
job gets OOM-killed and container-restarted in place without evicting
the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
the existing `dawarich-secrets` K8s Secret (populated by the existing
ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
reference the 2026-02-23 block relied on — that data source no longer
exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
to separate commits (plan: 2 → 5 → 10 with 15-30min observation
between bumps).
- Liveness + readiness probes exec `pgrep -f 'bundle exec sidekiq'`, giving a
container-scoped restart on stall; `pgrep` was verified to exist at
/usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image (check
sketched after this list).
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
Sidekiq's Rails initialisation matches web.
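The probe expression can be re-verified in place after any image bump; a quick
sketch (not part of the applied config):
```
# Confirm pgrep exists in the image and the probe expression matches the worker.
$ kubectl -n dawarich exec deploy/dawarich -c dawarich-sidekiq -- sh -c \
    'command -v pgrep && pgrep -f "bundle exec sidekiq" >/dev/null && echo probe-ok'
```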
Pod-level additions:
- `termination_grace_period_seconds = 60` gives Sidekiq time to drain
in-flight jobs on SIGTERM during rolls; the default 30s is not enough for
reverse-geocoding batches (drain check sketched below).
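The 60s figure leans on Sidekiq's shutdown behaviour: on SIGTERM it stops
fetching, waits up to its own shutdown timeout (25s by default; assumed
unchanged in this image) for in-flight jobs, and pushes anything still running
back to Redis. A rough way to watch the drain during a roll (a sketch;
substitute the terminating pod's name):
```
$ kubectl -n dawarich rollout restart deploy/dawarich
# Tail the old pod while it terminates; Sidekiq should log its shutdown and the
# container should exit well inside the 60s grace period instead of being SIGKILLed.
$ kubectl -n dawarich logs -f <terminating-pod> -c dawarich-sidekiq
```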
## What is NOT in this change
- Prometheus exporter for Sidekiq metrics. The first apply turned on
`PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
`prometheus_exporter` gem's client middleware. That middleware pushes
metrics over TCP to a separate exporter server process, and the
freikin/dawarich image does not start one. The client logged ~2/sec
"Connection refused" errors until we flipped ENABLED back to "false"
in this commit. `pod.annotations["prometheus.io/scrape"]` was reverted
for the same reason (nothing listening on :9394; see the check sketched
after this list). Filed code-1q5
(blocks code-459) to add a third sidecar container running
`bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
the 4 drafted alerts (DawarichSidekiqDown /
QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
monitoring/prometheus_chart_values.tpl; they reference metrics that
don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
out of scope per plan.
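The "nothing listening on :9394" claim is easy to re-check from inside the pod
(a sketch; assumes `curl` is present in the image, otherwise swap in wget or a
Ruby one-liner):
```
# Today this should fail with connection refused: no exporter server runs in the pod.
# Once the code-1q5 sidecar lands, the same request should return Sidekiq metrics.
$ kubectl -n dawarich exec deploy/dawarich -c dawarich-sidekiq -- \
    curl -sf http://127.0.0.1:9394/metrics || echo "nothing on :9394"
```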
## Other changes bundled in
Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. The plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 into every pod template, Terraform wipes
it back out, and the two churn forever). Per the AGENTS.md "Kyverno Drift
Suppression" rule, every pod-owning resource MUST carry the lifecycle
block; this commit brings the stack in line with that convention.
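A quick way to confirm the suppression behaves as intended (a sketch; the
`app=dawarich` label selector is an assumption, adjust to the actual pod
labels):
```
# Kyverno's injected dnsConfig is visible on the live pod...
$ kubectl -n dawarich get pod -l app=dawarich -o jsonpath='{.items[0].spec.dnsConfig}{"\n"}'
# ...while a fresh plan should no longer try to strip it on every run.
$ cd stacks/dawarich && ../../scripts/tg plan | grep -c dns_config   # expect 0 hits
```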
## Topology trade-off recorded
Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
`BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
the container block + apply.
- Shared pod network namespace means only one process can bind any given
port. This is why the plan explicitly avoided declaring a new
`port { name = "prometheus" }` on the sidekiq container; the web
container already reserves 9394 by name (port check sketched after this
section).
Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service and a follow-up bead
(tracked in code-1q5 description area / Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".
## Rollback
Three levels, in order of increasing impact:
1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply — pod stays up,
no jobs processed, backlog preserved in Redis.
2. Drop concurrency to 1 or 2 + apply — reduce load but keep draining.
3. Re-comment the second container block (this diff in reverse) +
apply — full disable, backlog stays in Redis DB 1, recoverable.
Never `DEL queue:*` keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.
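For a read-only look at the backlog before or after any of these steps (a
sketch; the `redis-master-0` pod name is an assumption, and the keys follow
Sidekiq's standard `queue:<name>` naming):
```
# Inspect, never delete: list Sidekiq queue keys in DB 1 and check a queue's depth.
$ kubectl -n redis exec redis-master-0 -- redis-cli -n 1 --scan --pattern 'queue:*'
$ kubectl -n redis exec redis-master-0 -- redis-cli -n 1 llen queue:default
```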
## Refs
- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
Depends on code-459.
- code-e9c (P2) — Viktor's client-side POST bug from 2026-04-16.
Untouched; processing the backlog does not fix it but ensures
future POSTs drain cleanly.
- code-72g (P3) — Anca's ingestion silent since 2025-06-21. Untouched;
same reasoning.
## Test Plan
### Automated
```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
# kubernetes_deployment.dawarich (sidekiq container + probes + lifecycle)
# kubernetes_namespace.dawarich (drops stale goldilocks label, pre-existing drift)
# module.tls_secret.kubernetes_secret.tls_secret (Kyverno clone-label drift, pre-existing)
$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```
### Manual Verification
Setup: kubectl context against the k8s cluster (10.0.20.100).
1. Pod has both containers Ready with zero restarts:
```
$ kubectl -n dawarich get pods -o wide
NAME READY STATUS RESTARTS AGE
dawarich-75b4ff9fbf-qh56v 2/2 Running 0 <fresh>
```
2. Sidekiq container is actively processing jobs:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
Sidekiq 8.0.10 connecting to Redis ... db: 1
queues: [data_migrations, points, default, mailers, families,
imports, exports, stats, trips, tracks,
reverse_geocoding, visit_suggesting, places,
app_version_checking, cache, archival, digests,
low_priority]
Performing DataMigrations::BackfillMotionDataJob ...
Backfilled motion_data for N000 points (N climbing)
```
3. Rails Sidekiq::API snapshot — procs registered, counters moving:
```
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
require "sidekiq/api"
s = Sidekiq::Stats.new
puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
'
processed=7 failed=2 procs=1
retry=0 dead=0
```
(The 2 "failures" are cumulative across two pod lifecycles during
the Prometheus env flip — retried successfully, neither retry nor
dead set holds any jobs.)
4. Per-container memory well under the 1Gi limit:
```
$ kubectl -n dawarich top pod --containers
POD NAME CPU MEMORY
dawarich-75b4ff9fbf-qh56v dawarich 1m 272Mi (of 896Mi)
dawarich-75b4ff9fbf-qh56v dawarich-sidekiq 79m 333Mi (of 1Gi)
```
5. No "Prometheus Exporter, failed to send" log lines since the second
apply:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
| grep -c "Prometheus Exporter"
0
```
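6. (Ongoing) Burn-in checks for the code-459 closure gate; a sketch of the
two criteria (the `app=dawarich` selector is an assumption, and the DeadSet
baseline is whatever it reads at ramp start):
```
$ kubectl -n dawarich get pod -l app=dawarich -o \
    jsonpath='{range .items[0].status.containerStatuses[*]}{.name}{"="}{.restartCount}{"\n"}{end}'
# restartCount must stay 0 for both containers across the 24h window.
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner \
    'require "sidekiq/api"; puts "dead=#{Sidekiq::DeadSet.new.size}"'
# Dead set delta vs the ramp-start baseline must stay < 100.
```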
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent: 8a99be1194
Commit: e612baac15
1 changed file with 192 additions and 72 deletions
@@ -88,6 +88,7 @@ resource "kubernetes_deployment" "dawarich" {
        }
      }
      spec {
        termination_grace_period_seconds = 60

        container {
          image = "freikin/dawarich:${var.image_version}"
@@ -200,81 +201,132 @@ resource "kubernetes_deployment" "dawarich" {
            }
          }
        }
        # container {
        #   image   = "freikin/dawarich:${var.image_version}"
        #   name    = "dawarich-sidekiq"
        #   command = ["sidekiq-entrypoint.sh"]
        #   args    = ["bundle exec sidekiq"]
        #   env {
        #     name  = "REDIS_URL"
        #     value = "redis://redis.redis.svc.cluster.local:6379"
        #   }
        #   env {
        #     name  = "DATABASE_HOST"
        #     value = "postgresql.dbaas"
        #   }
        #   env {
        #     name  = "DATABASE_USERNAME"
        #     value = "dawarich"
        #   }
        #   env {
        #     name  = "DATABASE_PASSWORD"
        #     value = data.vault_kv_secret_v2.secrets.data["db_password"]
        #   }
        #   env {
        #     name  = "DATABASE_NAME"
        #     value = "dawarich"
        #   }
        #   env {
        #     name  = "MIN_MINUTES_SPENT_IN_CITY"
        #     value = "60"
        #   }
        #   env {
        #     name  = "BACKGROUND_PROCESSING_CONCURRENCY"
        #     value = "10"
        #   }
        #   env {
        #     name  = "ENABLE_TELEMETRY"
        #     value = "true"
        #   }
        #   env {
        #     name  = "APPLICATION_HOST"
        #     value = "dawarich.viktorbarzin.me"
        #   }
        #   # env {
        #   #   name  = "PROMETHEUS_EXPORTER_ENABLED"
        #   #   value = "false"
        #   # }
        #   # env {
        #   #   name  = "PROMETHEUS_EXPORTER_HOST"
        #   #   value = "dawarich.dawarich"
        #   # }
        #   # env {
        #   #   name  = "PHOTON_API_HOST"
        #   #   value = "photon.dawarich:2322"
        #   #   # value = "photon.komoot.io"
        #   # }
        #   # env {
        #   #   name  = "PHOTON_API_USE_HTTPS"
        #   #   value = "false"
        #   # }
        #   env {
        #     name  = "GEOAPIFY_API_KEY"
        #     value = data.vault_kv_secret_v2.secrets.data["geoapify_api_key"]
        #   }
        #   env {
        #     name  = "SELF_HOSTED"
        #     value = "true"
        #   }

        #   # volume_mount {
        #   #   name       = "watched"
        #   #   mount_path = "/var/app/tmp/imports/watched"
        #   # }
        # }
        container {
          image   = "freikin/dawarich:${var.image_version}"
          name    = "dawarich-sidekiq"
          command = ["sidekiq-entrypoint.sh"]
          args    = ["bundle exec sidekiq"]
          env {
            name  = "REDIS_URL"
            value = "redis://${var.redis_host}:6379"
          }
          env {
            name  = "DATABASE_HOST"
            value = var.postgresql_host
          }
          env {
            name  = "DATABASE_USERNAME"
            value = "dawarich"
          }
          env {
            name = "DATABASE_PASSWORD"
            value_from {
              secret_key_ref {
                name = "dawarich-secrets"
                key  = "db_password"
              }
            }
          }
          env {
            name  = "DATABASE_NAME"
            value = "dawarich"
          }
          env {
            name  = "MIN_MINUTES_SPENT_IN_CITY"
            value = "60"
          }
          env {
            name  = "TIME_ZONE"
            value = "Europe/London"
          }
          env {
            name  = "DISTANCE_UNIT"
            value = "km"
          }
          env {
            name  = "BACKGROUND_PROCESSING_CONCURRENCY"
            value = "2"
          }
          env {
            name  = "ENABLE_TELEMETRY"
            value = "true"
          }
          env {
            name  = "APPLICATION_HOSTS"
            value = "dawarich.viktorbarzin.me"
          }
          # Prometheus exporter disabled until a standalone `prometheus_exporter`
          # server sidecar is added — see follow-up bead. The client middleware
          # pushes over TCP to PROMETHEUS_EXPORTER_HOST:PORT, it does not start
          # a listener itself. Keeping ENABLED=false silences the reconnect
          # log spam (~2/sec) from PrometheusExporter::Client.
          env {
            name  = "PROMETHEUS_EXPORTER_ENABLED"
            value = "false"
          }
          env {
            name  = "RAILS_ENV"
            value = "production"
          }
          env {
            name = "SECRET_KEY_BASE"
            value_from {
              secret_key_ref {
                name = "dawarich-secrets"
                key  = "secret_key_base"
              }
            }
          }
          env {
            name  = "RAILS_LOG_TO_STDOUT"
            value = "true"
          }
          env {
            name  = "SELF_HOSTED"
            value = "true"
          }
          env {
            name = "GEOAPIFY_API_KEY"
            value_from {
              secret_key_ref {
                name = "dawarich-secrets"
                key  = "geoapify_api_key"
              }
            }
          }
          resources {
            requests = {
              cpu    = "50m"
              memory = "768Mi"
            }
            limits = {
              memory = "1Gi"
            }
          }
          liveness_probe {
            exec {
              command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
            }
            initial_delay_seconds = 90
            period_seconds        = 30
            timeout_seconds       = 5
            failure_threshold     = 3
          }
          readiness_probe {
            exec {
              command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
            }
            initial_delay_seconds = 30
            period_seconds        = 15
            timeout_seconds       = 5
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
  }
}
@@ -394,3 +446,71 @@ module "ingress" {
    "gethomepage.dev/pod-selector" = ""
  }
}

# Paired with DawarichIngestionStale alert in monitoring/prometheus_chart_values.tpl.
resource "kubernetes_cron_job_v1" "ingestion_freshness_monitor" {
  metadata {
    name      = "ingestion-freshness-monitor"
    namespace = kubernetes_namespace.dawarich.metadata[0].name
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    schedule                      = "30 6 * * *"
    starting_deadline_seconds     = 300
    successful_jobs_history_limit = 1
    job_template {
      metadata {}
      spec {
        backoff_limit              = 2
        ttl_seconds_after_finished = 3600
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name  = "ingestion-freshness-monitor"
              image = "docker.io/library/postgres:16-alpine"
              env {
                name = "PGPASSWORD"
                value_from {
                  secret_key_ref {
                    name = "dawarich-secrets"
                    key  = "db_password"
                  }
                }
              }
              command = ["/bin/sh", "-c", <<-EOT
                set -eu
                apk add --no-cache curl >/dev/null 2>&1 || true

                TS=$(PGPASSWORD=$PGPASSWORD psql -h ${var.postgresql_host} -U dawarich -d dawarich -t -A -c \
                  "SELECT COALESCE(EXTRACT(epoch FROM MAX(created_at))::bigint, 0) FROM points WHERE user_id = 1;")
                NOW=$(date +%s)

                if [ -z "$TS" ] || [ "$TS" = "0" ]; then
                  echo "ERROR: no points found for user_id=1"
                  exit 1
                fi

                AGE_H=$(( (NOW - TS) / 3600 ))
                echo "last_point_ts=$TS now=$NOW age_hours=$AGE_H"

                curl -sf --data-binary @- "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/dawarich-ingestion-freshness/user/viktor" <<METRICS
                # TYPE dawarich_last_point_ingested_timestamp gauge
                dawarich_last_point_ingested_timestamp $TS
                # TYPE dawarich_ingestion_monitor_last_push_timestamp gauge
                dawarich_ingestion_monitor_last_push_timestamp $NOW
                METRICS
              EOT
              ]
            }
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
  }
}