infra/stacks
Viktor Barzin e612baac15 [dawarich] Re-enable Sidekiq worker with resource limits + probes
## Context

Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure and
the kubelet evicted the pod, taking the web container down with it.
Viktor's recollection was "it was crashing"; the root cause at the
cgroup level was that the Sidekiq container had no
`resources.limits.memory` set, so a misbehaving job could pull the
entire pod down instead of being OOM-killed and restarted in isolation.

During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). The track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.

Live pre-apply snapshot via `bin/rails runner`:
  enqueued=18  (cache=2, data_migrations=4, default=12)
  scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats
  reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.

## What changed

Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:

- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
  768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
  job gets OOM-killed and container-restarted in place without evicting
  the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
  instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
  the existing `dawarich-secrets` K8s Secret (populated by the existing
  ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
  reference the 2026-02-23 block relied on — that data source no longer
  exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
  to separate commits (plan: 2 → 5 → 10 with 15-30min observation
  between bumps).
- Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes:
  container-scoped restart when the Sidekiq process is gone (a
  process-existence check, so a hung worker is not detected). Verified
  `pgrep` is at /usr/bin/pgrep in the Debian-trixie-based
  freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
  RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
  Sidekiq's Rails initialisation matches web.
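
For orientation, the container described in the bullets above might look
roughly like this inside the `kubernetes_deployment` resource. It is a
sketch, not the committed HCL: `var.dawarich_image`, the `REDIS_URL` /
`DATABASE_HOST` / `DATABASE_PASSWORD` env and secret key names, and the
probe timings are assumptions. Only the container name, resource values,
concurrency, secret name, host variables and `pgrep` command come from
this change.

```hcl
# Sketch only: nests inside spec.template.spec of
# kubernetes_deployment.dawarich, next to the existing web container.
container {
  name  = "dawarich-sidekiq"
  image = var.dawarich_image # hypothetical variable; reuse what the web container references

  # Start command omitted here: use the image's own Sidekiq entrypoint.

  resources {
    # The memory limit bounds this container's cgroup, so a runaway job
    # is OOM-killed and only this container restarts.
    limits = {
      memory = "1Gi"
    }
    requests = {
      cpu    = "50m"
      memory = "768Mi"
    }
  }

  env {
    name  = "BACKGROUND_PROCESSING_CONCURRENCY"
    value = "2"
  }
  env {
    name  = "REDIS_URL" # env name assumed; DB 1 as noted in Context
    value = "redis://${var.redis_host}:6379/1"
  }
  env {
    name  = "DATABASE_HOST" # env name assumed
    value = var.postgresql_host
  }
  env {
    name = "DATABASE_PASSWORD" # env name and secret key assumed
    value_from {
      secret_key_ref {
        name = "dawarich-secrets"
        key  = "DATABASE_PASSWORD"
      }
    }
  }

  liveness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
    initial_delay_seconds = 60 # timings are illustrative
    period_seconds        = 30
  }
  readiness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
    period_seconds = 30 # illustrative
  }
}
```

No CPU limit is set in the sketch, which is consistent with a memory-only
limit and Burstable QoS as described above.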

Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
  drain in-flight jobs on SIGTERM during rolls (default 30s not enough
  for reverse-geocoding batches).
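
As a rough illustration of where that setting sits, the fragment below
shows the pod spec level of the Deployment. The surrounding nesting is
standard for the Terraform Kubernetes provider; everything elided is
assumed to be unchanged.

```hcl
# Sketch only: pod-level setting inside kubernetes_deployment.dawarich.
spec {
  template {
    spec {
      # Applies to the whole pod, so both containers get 60s after SIGTERM.
      termination_grace_period_seconds = 60

      # container blocks (web, dawarich-sidekiq) follow here
    }
  }
}
```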

## What is NOT in this change

- Prometheus exporter for Sidekiq metrics. The first apply turned on
  `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
  `prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes
  metrics over TCP to a separate exporter server process — and the
  freikin/dawarich image does not start one. Client logged ~2/sec
  "Connection refused" errors until we flipped ENABLED back to "false"
  in this commit. `pod.annotations["prometheus.io/scrape"]` reverted
  for the same reason (nothing listening on :9394). Filed code-1q5
  (blocks code-459) to add a third sidecar container running
  `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
  the 4 drafted alerts (DawarichSidekiqDown /
  QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
  actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
  monitoring/prometheus_chart_values.tpl; they reference metrics that
  don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
  code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
  out of scope per plan.
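
For reference, the exporter sidecar deferred to code-1q5 could look
something like the fragment below. This is a hypothetical sketch of
future work and is NOT part of this commit: the container name,
`var.dawarich_image` and the memory limit are assumptions; only the
command line is taken from the bead description above.

```hcl
# Hypothetical sketch of the code-1q5 follow-up; not applied in this commit.
container {
  name  = "dawarich-prometheus-exporter" # hypothetical name
  image = var.dawarich_image             # hypothetical variable; same image assumed

  # Exporter server process that the client middleware pushes to.
  command = ["bundle", "exec", "prometheus_exporter", "-p", "9394", "-b", "0.0.0.0"]

  # No port block: the web container already declares 9394 by name.

  resources {
    limits = {
      memory = "128Mi" # illustrative value
    }
  }
}
```

With a server listening on :9394, `PROMETHEUS_EXPORTER_ENABLED` and the
scrape annotation could presumably be turned back on in the same change.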

## Other changes bundled in

Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every
pod-owning resource MUST carry the lifecycle block — this commit
brings this stack into convention.
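
A minimal sketch of what the suppression might look like on the two
resources. The exact attribute paths and marker placement are
assumptions based on the Terraform Kubernetes provider schema; AGENTS.md
remains the authority on the convention.

```hcl
resource "kubernetes_deployment" "dawarich" {
  # ... existing arguments ...

  lifecycle {
    # KYVERNO_LIFECYCLE_V1
    ignore_changes = [
      # Kyverno injects dns_config (ndots) into the pod template.
      spec[0].template[0].spec[0].dns_config,
    ]
  }
}

resource "kubernetes_cron_job_v1" "ingestion_freshness_monitor" {
  # ... existing arguments ...

  lifecycle {
    # KYVERNO_LIFECYCLE_V1
    ignore_changes = [
      # CronJobs nest the pod template one level deeper, under job_template.
      spec[0].job_template[0].spec[0].template[0].spec[0].dns_config,
    ]
  }
}
```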

## Topology trade-off recorded

Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
  strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
  `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
  the container block + apply.
- Shared pod network namespace — only one process can bind any given
  port. This is why the plan explicitly avoided declaring a new
  `port { name = "prometheus" }` on the sidekiq container (the web
  container already reserves 9394 by name).

Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service, and a follow-up bead
(tracked in the code-1q5 description and Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".

## Rollback

Three levels, in order of increasing impact:
1. Drop concurrency to 1 or 2 + apply: reduces load but keeps draining.
2. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply: pod stays up,
   no jobs processed, backlog preserved in Redis.
3. Re-comment the second container block (this diff in reverse) +
   apply: full disable, backlog stays in Redis DB 1, recoverable.
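
The concurrency-based options above come down to a one-line change in
the sidekiq container's env block, followed by the usual
`../../scripts/tg apply`. A sketch, assuming the env is declared inline
as a literal string:

```hcl
# Inside the dawarich-sidekiq container block.
env {
  name  = "BACKGROUND_PROCESSING_CONCURRENCY"
  value = "0" # pause processing; use "1" or "2" to keep draining at reduced load
}
```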

Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.

## Refs

- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
  after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
  delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
  Depends on code-459.
- code-e9c (P2) — Viktor client-side POST bug 2026-04-16.
  Untouched; processing the backlog does not fix this but ensures
  future POSTs drain cleanly.
- code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched;
  same reasoning.

## Test Plan

### Automated

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
#   kubernetes_deployment.dawarich         (sidekiq container + probes + lifecycle)
#   kubernetes_namespace.dawarich          (drops stale goldilocks label, pre-existing drift)
#   module.tls_secret.kubernetes_secret.tls_secret  (Kyverno clone-label drift, pre-existing)

$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.

(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```

### Manual Verification

Setup: kubectl context against the k8s cluster (10.0.20.100).

1. Pod has both containers Ready with zero restarts:
   ```
   $ kubectl -n dawarich get pods -o wide
   NAME                        READY  STATUS   RESTARTS  AGE
   dawarich-75b4ff9fbf-qh56v   2/2    Running  0         <fresh>
   ```

2. Sidekiq container is actively processing jobs:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
   Sidekiq 8.0.10 connecting to Redis ... db: 1
   queues: [data_migrations, points, default, mailers, families,
            imports, exports, stats, trips, tracks,
            reverse_geocoding, visit_suggesting, places,
            app_version_checking, cache, archival, digests,
            low_priority]
   Performing DataMigrations::BackfillMotionDataJob ...
   Backfilled motion_data for N000 points (N climbing)
   ```

3. Rails Sidekiq::API snapshot — procs registered, counters moving:
   ```
   $ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
       require "sidekiq/api"
       s = Sidekiq::Stats.new
       puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
       puts "retry=#{s.retry_size} dead=#{s.dead_size}"
     '
   processed=7 failed=2 procs=1
   retry=0 dead=0
   ```
   (The 2 "failures" are cumulative across two pod lifecycles during
   the Prometheus env flip — retried successfully, neither retry nor
   dead set holds any jobs.)

4. Per-container memory well under the 1Gi limit:
   ```
   $ kubectl -n dawarich top pod --containers
   POD                         NAME              CPU    MEMORY
   dawarich-75b4ff9fbf-qh56v   dawarich          1m     272Mi  (of 896Mi)
   dawarich-75b4ff9fbf-qh56v   dawarich-sidekiq  79m    333Mi  (of 1Gi)
   ```

5. No "Prometheus Exporter, failed to send" log lines since the second
   apply:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
       | grep -c "Prometheus Exporter"
   0
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:13:05 +00:00
_template [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
actualbudget [actualbudget] Upgrade 26.3.0 → 26.4.0 for native Sankey report 2026-04-18 13:19:27 +00:00
affine [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
authentik [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
beads-server [infra] Bump claude-agent-service + beadboard image tags 2026-04-18 19:24:37 +00:00
blog [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
broker-sync broker-sync: add Fidelity PlanViewer CronJob (suspended) 2026-04-18 18:51:20 +00:00
changedetection [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
city-guesser [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
claude-agent-service [infra] Bump claude-agent-service + beadboard image tags 2026-04-18 19:24:37 +00:00
claude-memory [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
cloudflared [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
cnpg [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
coturn [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
crowdsec [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
cyberchef [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
dashy [cleanup] Remove ollama from dashy + docs + nfs_directories 2026-04-18 11:17:59 +00:00
dawarich [dawarich] Re-enable Sidekiq worker with resource limits + probes 2026-04-18 21:13:05 +00:00
dbaas [infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip] 2026-04-18 19:19:48 +00:00
descheduler [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
diun [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
ebook2audiobook [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
ebooks [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
echo [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
excalidraw [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
external-secrets [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
f1-stream [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
foolery Add broker-sync Terraform stack (#7) 2026-04-17 21:17:45 +01:00
forgejo [forgejo] Probe /api/healthz for external monitor 2026-04-17 22:06:23 +00:00
freedify [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
freshrss [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
frigate [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
grampsweb [grampsweb] Align PVC resource to encrypted storage; imported state 2026-04-18 11:37:45 +00:00
hackmd [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
headscale [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
health [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
hermes-agent [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
homepage [cleanup] Remove ollama from dashy + docs + nfs_directories 2026-04-18 11:17:59 +00:00
immich [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
infra [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
infra-maintenance [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
insta2spotify [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
isponsorblocktv [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
jsoncrack [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
k8s-dashboard [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
k8s-portal [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
kms [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
kyverno kyverno: strip resources.limits.cpu cluster-wide via ClusterPolicy 2026-04-18 11:34:39 +00:00
linkwarden [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
mailserver [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
matrix [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
meshcentral [meshcentral] Remove accidentally-committed Terragrunt-generated files 2026-04-18 12:35:44 +00:00
metallb [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
metrics-server [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
monitoring [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role 2026-04-18 19:07:05 +00:00
n8n [n8n] Fix broken DIUN auto-upgrade pipeline — missing auth token to claude-agent-service 2026-04-18 10:41:09 +00:00
navidrome [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
netbox [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
networking-toolbox [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
nextcloud [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
nfs-csi [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
novelapp [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
ntfy [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
nvidia [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
onlyoffice [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
openclaw [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
osm_routing [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
owntracks [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
paperless-ngx [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
payslip-ingest [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role 2026-04-18 19:07:05 +00:00
phpipam [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
poison-fountain [infra] Scale down unused services + remove DoH ingress 2026-04-17 18:55:52 +00:00
priority-pass [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
privatebin [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
proxmox-csi feat(storage): migrate all sensitive services to proxmox-lvm-encrypted 2026-04-15 20:15:30 +00:00
pvc-autoresizer fix: disable cert-manager webhook for pvc-autoresizer, use self-signed cert [ci skip] 2026-04-03 23:44:49 +03:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
redis [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
reloader [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
resume [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
reverse-proxy [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
rybbit [rybbit] Narrow CF Worker routes to SITE_IDS hosts — fix free-tier quota breach 2026-04-18 13:23:15 +00:00
sealed-secrets [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
send [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
servarr [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
shadowsocks [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
speedtest [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
tandoor [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
technitium [infra] Scale down unused services + remove DoH ingress 2026-04-17 18:55:52 +00:00
terminal Add broker-sync Terraform stack (#7) 2026-04-17 21:17:45 +01:00
tor-proxy [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
trading-bot [infra] Document intended ignore_changes drift-workarounds [ci skip] 2026-04-18 14:08:10 +00:00
traefik [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
travel_blog [infra] Scale down unused services + remove DoH ingress 2026-04-17 18:55:52 +00:00
tuya-bridge [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
uptime-kuma [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
url [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
vault [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role 2026-04-18 19:07:05 +00:00
vaultwarden [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection 2026-04-17 12:41:17 +00:00
vpa [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
wealthfolio wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request) 2026-04-18 19:13:05 +00:00
webhook_handler [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
whisper [whisper] Remove ollama_tcp IngressRouteTCP (ollama decom) 2026-04-18 11:11:21 +00:00
wireguard [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
woodpecker [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
xray [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
ytdlp [ytdlp] Remove ollama_host variable and fallback env vars 2026-04-18 11:13:42 +00:00