## Context
Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the root cause at the cgroup level was that the
Sidekiq container had no `resources.limits.memory` set, so a misbehaving
job could pull the entire pod down instead of being OOM-killed and
restarted in isolation.
During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.
Live pre-apply snapshot via `bin/rails runner`:
```
enqueued=18 (cache=2, data_migrations=4, default=12)
scheduled=16, retry=0, dead=0, procs=0
processed/failed=0 (stats reset by the 1.6.1 upgrade)
```
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.
## What changed
Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as the `dawarich` web container). Key
differences vs the 2026-02-23 commented block (condensed HCL sketch
after the list):
- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
job gets OOM-killed and container-restarted in place without evicting
the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
the existing `dawarich-secrets` K8s Secret (populated by the existing
ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
reference the 2026-02-23 block relied on — that data source no longer
exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
to separate commits (plan: 2 → 5 → 10 with 15-30min observation
between bumps).
- Liveness + readiness probes running `pgrep -f 'bundle exec sidekiq'`
— container-scoped restart on stall. Verified that `pgrep` exists at
/usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
Sidekiq's Rails initialisation matches web.
Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
drain in-flight jobs on SIGTERM during rolls (default 30s not enough
for reverse-geocoding batches).
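A condensed sketch of the resulting container block, assuming the
Terraform kubernetes provider's `kubernetes_deployment` schema; the
image tag, `args`, and the abbreviated env list are illustrative, not
copied from the diff:
```
# Inside kubernetes_deployment.dawarich -> spec.template.spec
termination_grace_period_seconds = 60 # let Sidekiq drain on SIGTERM

container {
  name  = "dawarich-sidekiq"
  image = "freikin/dawarich:1.6.1" # assumed tag, pinned elsewhere in the stack
  args  = ["sidekiq"]              # assumed entrypoint arg

  env {
    name  = "BACKGROUND_PROCESSING_CONCURRENCY"
    value = "2" # ramp 2 -> 5 -> 10 in later commits
  }
  env {
    name = "SECRET_KEY_BASE"
    value_from {
      secret_key_ref {
        name = "dawarich-secrets" # populated by the existing ExternalSecret
        key  = "SECRET_KEY_BASE"
      }
    }
  }
  # ...remaining envs mirror the web container...

  resources {
    limits   = { memory = "1Gi" }
    requests = { cpu = "50m", memory = "768Mi" }
  }

  # Identical exec probes; a stalled Sidekiq restarts only this container
  liveness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
  }
  readiness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
  }
}
```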
## What is NOT in this change
- Prometheus exporter for Sidekiq metrics. The first apply turned on
`PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
`prometheus_exporter` gem's client-side middleware. That middleware
pushes metrics over TCP to a separate exporter server process — and
the freikin/dawarich image does not start one. The client logged
~2/sec "Connection refused" errors until this commit flipped
`PROMETHEUS_EXPORTER_ENABLED` back to "false".
`pod.annotations["prometheus.io/scrape"]` was reverted
for the same reason (nothing listening on :9394). Filed code-1q5
(blocked by code-459) to add a third sidecar container running
`bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` (sketched after
this list) and restore
the 4 drafted alerts (DawarichSidekiqDown /
QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
monitoring/prometheus_chart_values.tpl; they reference metrics that
don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
out of scope per plan.
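For orientation only, the code-1q5 sidecar would plausibly take this
shape. A sketch, not part of this diff: reusing the app image is an
assumption (the gem ships with the app bundle), and the port
declaration is deliberately omitted given the name-reservation issue
noted under the topology trade-off below.
```
# Planned in code-1q5, NOT part of this change
container {
  name    = "prometheus-exporter"
  image   = "freikin/dawarich:1.6.1" # assumed: app image bundles the prometheus_exporter gem
  command = ["bundle", "exec", "prometheus_exporter", "-p", "9394", "-b", "0.0.0.0"]
}
```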
## Other changes bundled in
Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per the AGENTS.md "Kyverno Drift Suppression"
rule, every pod-owning resource MUST carry the lifecycle block — this
commit brings the stack in line with that convention (sketch below).
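A minimal sketch of the suppression as it sits on
`kubernetes_deployment.dawarich`, assuming the kubernetes provider's
usual attribute path for the pod template; the CronJob variant nests
the same path under `job_template`:
```
lifecycle {
  # KYVERNO_LIFECYCLE_V1
  ignore_changes = [
    # Kyverno mutates every pod template with dns_config (ndots=2);
    # without this, Terraform strips it on each apply (endless churn).
    spec[0].template[0].spec[0].dns_config,
  ]
}
```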
## Topology trade-off recorded
Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
`BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
the container block + apply.
- Shared pod network namespace — only one process can bind any given
port. This is why the plan explicitly avoided declaring a new
`port { name = "prometheus" }` on the sidekiq container: the web
container already reserves 9394 by name (sketch below).
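For context, the existing reservation on the web container that rules
out a duplicate port name (a sketch of the declaration, assuming the
usual provider schema):
```
# On the existing `dawarich` web container
port {
  name           = "prometheus"
  container_port = 9394
}
```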
Accepted because the alternative (a split Deployment) is significantly
more config for a single-instance service, and a follow-up note
(in the code-1q5 description and Viktor's notes) already captures
"revisit if future crashes warrant blast-radius isolation".
## Rollback
Three levels, in order of increasing impact:
1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply — pod stays up,
no jobs processed, backlog preserved in Redis (one-line sketch below).
2. Drop concurrency to 1 or 2 + apply — reduce load but keep draining.
3. Re-comment the second container block (this diff in reverse) +
apply — full disable, backlog stays in Redis DB 1, recoverable.
Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.
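Level 1 amounts to a one-line value flip in the sidekiq container
block sketched above:
```
env {
  name  = "BACKGROUND_PROCESSING_CONCURRENCY"
  value = "0" # rollback level 1: pod stays up, nothing dequeues
}
```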
## Refs
- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
Depends on code-459.
- code-e9c (P2) — Viktor's iOS client stopped POSTing on 2026-04-16.
Untouched; processing the backlog does not fix this, but it ensures
future POSTs drain cleanly.
- code-72g (P3) — Anca's ingestion silent since 2025-06-21. Untouched;
same reasoning.
## Test Plan
### Automated
```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
# kubernetes_deployment.dawarich (sidekiq container + probes + lifecycle)
# kubernetes_namespace.dawarich (drops stale goldilocks label, pre-existing drift)
# module.tls_secret.kubernetes_secret.tls_secret (Kyverno clone-label drift, pre-existing)
$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```
### Manual Verification
Setup: kubectl context against the k8s cluster (10.0.20.100).
1. Pod has both containers Ready with zero restarts:
```
$ kubectl -n dawarich get pods -o wide
NAME READY STATUS RESTARTS AGE
dawarich-75b4ff9fbf-qh56v 2/2 Running 0 <fresh>
```
2. Sidekiq container is actively processing jobs:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
Sidekiq 8.0.10 connecting to Redis ... db: 1
queues: [data_migrations, points, default, mailers, families,
imports, exports, stats, trips, tracks,
reverse_geocoding, visit_suggesting, places,
app_version_checking, cache, archival, digests,
low_priority]
Performing DataMigrations::BackfillMotionDataJob ...
Backfilled motion_data for N000 points (N climbing)
```
3. Rails Sidekiq::API snapshot — procs registered, counters moving:
```
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
require "sidekiq/api"
s = Sidekiq::Stats.new
puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
puts "retry=#{s.retry_size} dead=#{s.dead_size}"
'
processed=7 failed=2 procs=1
retry=0 dead=0
```
(The 2 "failures" are cumulative across two pod lifecycles during the
Prometheus env flip — they were retried successfully, and neither the
retry set nor the dead set holds any jobs.)
4. Per-container memory well under each container's limit:
```
$ kubectl -n dawarich top pod --containers
POD NAME CPU MEMORY
dawarich-75b4ff9fbf-qh56v dawarich 1m 272Mi (of 896Mi)
dawarich-75b4ff9fbf-qh56v dawarich-sidekiq 79m 333Mi (of 1Gi)
```
5. No "Prometheus Exporter, failed to send" log lines since the second
apply:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
| grep -c "Prometheus Exporter"
0
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This repo contains my infra-as-code sources.
My infrastructure is built with Terraform and Kubernetes; CI/CD is done with Woodpecker CI.
Read more on my website: https://viktorbarzin.me
## Documentation
Full architecture documentation is available in `docs/` — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more.
## Adding a New User (Admin)
Adding a new namespace-owner to the cluster requires three steps — no code changes needed.
### 1. Authentik Group Assignment
In the Authentik admin UI, add the user to:
- `kubernetes-namespace-owners` group (grants OIDC group claim for K8s RBAC)
- `Headscale Users` group (if they need VPN access)
### 2. Vault KV Entry
Add a JSON entry to the `k8s_users` key under `secret/platform` in Vault:
"username": {
"role": "namespace-owner",
"email": "user@example.com",
"namespaces": ["username"],
"domains": ["myapp"],
"quota": {
"cpu_requests": "2",
"memory_requests": "4Gi",
"memory_limits": "8Gi",
"pods": "20"
}
}
- The `username` key must match the user's Forgejo username (for Woodpecker admin access)
- `namespaces` — K8s namespaces to create and grant admin access to
- `domains` — subdomains under `viktorbarzin.me` for Cloudflare DNS records
- `quota` — resource limits per namespace (defaults shown above)
### 3. Apply Stacks
```
vault login -method=oidc
cd stacks/vault && terragrunt apply --non-interactive
# Creates: namespace, Vault policy, identity entity, K8s deployer role
cd ../platform && terragrunt apply --non-interactive
# Creates: RBAC bindings, ResourceQuota, TLS secret, DNS records
cd ../woodpecker && terragrunt apply --non-interactive
# Adds user to Woodpecker admin list
```
### What Gets Auto-Generated

| Resource | Stack |
|---|---|
| Kubernetes namespace | vault |
| Vault policy (`namespace-owner-{user}`) | vault |
| Vault identity entity + OIDC alias | vault |
| K8s deployer Role + Vault K8s role | vault |
| RBAC RoleBinding (namespace admin) | platform |
| RBAC ClusterRoleBinding (cluster read-only) | platform |
| ResourceQuota | platform |
| TLS secret in namespace | platform |
| Cloudflare DNS records | platform |
| Woodpecker admin access | woodpecker |
## New User Onboarding
If you've been added as a namespace-owner, follow these steps to get started.
### 1. Join the VPN
```
# Install Tailscale: https://tailscale.com/download
tailscale login --login-server https://headscale.viktorbarzin.me
# Send the registration URL to Viktor, wait for approval
ping 10.0.20.100  # verify connectivity
```
### 2. Install Tools
Run the setup script to install kubectl, kubelogin, Vault CLI, Terraform, and Terragrunt:
```
# macOS
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)

# Linux
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
```
### 3. Authenticate
```
# Log into Vault (opens browser for SSO)
vault login -method=oidc

# Test kubectl (opens browser for OIDC login)
kubectl get pods -n YOUR_NAMESPACE
```
### 4. Deploy Your First App
```
# Clone the infra repo
git clone https://github.com/ViktorBarzin/infra.git && cd infra

# Copy the stack template
cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf
# Edit main.tf — replace all <placeholders>

# Store secrets in Vault
vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123

# Submit a PR
git checkout -b feat/myapp
git add stacks/myapp/
git commit -m "add myapp stack"
git push -u origin feat/myapp
```
After review and merge, an admin runs `cd stacks/myapp && terragrunt apply`.
### 5. Set Up CI/CD (Optional)
Create `.woodpecker.yml` in your app's Forgejo repo:
```
steps:
  - name: build
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: YOUR_DOCKERHUB_USER/myapp
      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
      username:
        from_secret: dockerhub-username
      password:
        from_secret: dockerhub-token
      platforms: linux/amd64
  - name: deploy
    image: hashicorp/vault:1.18.1
    commands:
      - export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
      - export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token))
      - KUBE_TOKEN=$(vault write -field=service_account_token kubernetes/creds/YOUR_NAMESPACE-deployer kubernetes_namespace=YOUR_NAMESPACE)
      - kubectl --server=https://kubernetes.default.svc --token=$KUBE_TOKEN --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt -n YOUR_NAMESPACE set image deployment/myapp myapp=YOUR_DOCKERHUB_USER/myapp:${CI_PIPELINE_NUMBER}
```
## Useful Commands
```
# Check your pods
kubectl get pods -n YOUR_NAMESPACE

# View quota usage
kubectl describe resourcequota -n YOUR_NAMESPACE

# Store/read secrets
vault kv put secret/YOUR_USERNAME/myapp KEY=value
vault kv get secret/YOUR_USERNAME/myapp

# Get a short-lived K8s deploy token
vault write kubernetes/creds/YOUR_NAMESPACE-deployer \
  kubernetes_namespace=YOUR_NAMESPACE
```
## Important Rules
- All changes go through Terraform — never `kubectl apply/edit/patch` directly
- Never put secrets in code — use Vault: `vault kv put secret/YOUR_USERNAME/...`
- Always use a PR — never push directly to master
- Docker images: build for `linux/amd64`, use versioned tags (not `:latest`)
## git-crypt setup
To decrypt the secrets, you need to set up git-crypt:
1. Install git-crypt.
2. Set up GPG keys on the machine.
3. Run:
```
git-crypt unlock
```
This unlocks the secrets; they are locked again on commit.