17 commits

Author SHA1 Message Date
Viktor Barzin
789cb61310 [servarr] Rewrite MAM ratio farming — break Mouse death spiral, adopt in TF
## Context

A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.

Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
   account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
   torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
   torrents that never started never qualified. Combined with the 500-
   torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
   field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above was Terraform drift with no reviewability.
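
The deadlock in root cause 4 can be illustrated with a small sketch. The class and function names here are hypothetical and the model is simplified (all 501 torrents are treated as never started); it only shows why a seed-time-only cleanup rule can never free slots under the 500-torrent cap.

```python
from dataclasses import dataclass

SEED_TIME_THRESHOLD_S = 3 * 24 * 3600  # old cleanup fired only after 3 days of seeding
TORRENT_CAP = 500                      # grabber stops adding at this many torrents


@dataclass
class Torrent:
    state: str         # e.g. "queuedDL"
    seeding_time: int  # seconds spent seeding; 0 if the download never finished


def old_cleanup_candidates(torrents):
    """Old rule: only torrents that have seeded for more than 3 days qualify."""
    return [t for t in torrents if t.seeding_time > SEED_TIME_THRESHOLD_S]


def grabber_may_add(torrents):
    """The grabber only adds while the client is below the torrent cap."""
    return len(torrents) < TORRENT_CAP


# Simplified model: 501 torrents that never started, so none has seeding time.
stuck = [Torrent(state="queuedDL", seeding_time=0) for _ in range(501)]
print(len(old_cleanup_candidates(stuck)))  # 0
print(grabber_may_add(stuck))              # False
```

Nothing qualifies for cleanup and the grabber is over its cap, so neither side can make progress without an outside rule.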

## This change

Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.

### Architecture

    MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
               ├─── loadSearchJSONbasic.php (freeleech search)
               ├─── bonusBuy.php (50 GiB min tier for API)
               └─── download.php (torrent file)
                               │
    Pushgateway <──┬────────────┤
                   │  mam_ratio            ┌────────────────────┐
                   │  mam_class_code       │ freeleech-grabber  │ */30
                   │  mam_bp_balance   ◄───│  (ratio-guarded)   │
                   │  mam_farming_*        └──────────┬─────────┘
                   │  mam_janitor_*                   │ adds to
                   │                                  ▼
                   │  Grafana panels      qBittorrent (mam-farming)
                   │  + 5 alerts                      ▲
                   │                                  │ deletes by rule
                   │                       ┌──────────┴─────────┐
                   │                   ◄───│ farming-janitor    │ */15
                   │                       │  (H&R-aware)       │
                   │                       └──────────┬─────────┘
                   │                                  │ buys credit
                   │                       ┌──────────┴─────────┐
                   └───────────────────────│ bp-spender         │ 0 */6
                                           │  (tier-aware)      │
                                           └────────────────────┘

### Key decisions

- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
  Mouse. Prevents the death spiral from deepening. Emits
  `mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
  `leechers*3 - seeders*0.5 + (200 if freeleech_wedge else 0)`; size band
  50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
  will actually upload.
- **Janitor decoupled from grabber** — runs every 15 min regardless of
  the ratio-guard state. Without this, stuck torrents accumulate
  fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
  never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
  reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
  buyers ("Automated spenders are limited to buying at least 50 GB...
  due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
  The spender picks the smallest tier that satisfies the ratio deficit
  AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
  tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
  `downloaded_bytes` / `uploaded_bytes` (integers) rather than the
  pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
  that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
  queued torrents that never announced. Now maps `category==mam-farming`
  to label `mam` first, only falls back to tracker-URL parsing when
  category is absent. Stops hundreds of MAM torrents collecting under
  `unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
  hundreds of active torrents don't OOM.
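
A minimal sketch of the grabber's guard and selection rules described above, not the deployed script. Function names and the `low_ratio` label are assumptions; the freeleech-wedge term is read as an additive 200-point bonus, which is one interpretation of the formula.

```python
RATIO_FLOOR = 1.2
MIN_SIZE = 50 * 1024**2  # 50 MiB
MAX_SIZE = 1024**3       # 1 GiB
MIN_LEECHERS = 1
MAX_SEEDERS = 50


def skip_reason(ratio, user_class):
    """Return a reason label if this run should skip grabbing, else None."""
    if user_class == "Mouse":
        return "mouse_class"
    if ratio < RATIO_FLOOR:
        return "low_ratio"  # hypothetical label, not taken from the commit
    return None


def eligible(size_bytes, seeders, leechers):
    """Size band, leecher floor, and seeder ceiling from the key decisions."""
    return (MIN_SIZE <= size_bytes <= MAX_SIZE
            and leechers >= MIN_LEECHERS
            and seeders <= MAX_SEEDERS)


def score(seeders, leechers, freeleech_wedge):
    """Demand-first score: leechers raise it, seeders lower it."""
    bonus = 200 if freeleech_wedge else 0
    return leechers * 3 - seeders * 0.5 + bonus


print(skip_reason(0.0, "Mouse"))  # mouse_class
print(score(10, 5, False))        # 10.0
print(score(2, 8, True))          # 223.0
```

Writing the bonus as its own variable avoids the precedence trap of a bare conditional expression, where `a + 200 if flag else 0` evaluates to 0 whenever the flag is false.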

### Emergency recovery performed this session

1. Adopted 5 in-cluster resources via root-module `import {}` blocks
   (Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
   466 `never_started` candidates, 0 false positives in any other
   reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
   preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
   eligible again.
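
The rule checked in the dry run can be sketched roughly as follows. This models only the H&R guard and the `never_started` reason; the other five delete reasons are omitted, and the 24-hour age threshold and the dictionary keys are assumptions made for illustration.

```python
HNR_WINDOW_S = 72 * 3600           # from the commit: keep completed torrents seeded < 72h
NEVER_STARTED_AFTER_S = 24 * 3600  # ASSUMPTION: hypothetical age threshold


def janitor_decision(progress, seeding_time_s, age_s):
    """Classify one torrent; only one of the six delete reasons is modelled."""
    if progress == 1.0 and seeding_time_s < HNR_WINDOW_S:
        return "preserved_hnr"
    if progress == 0.0 and age_s > NEVER_STARTED_AFTER_S:
        return "never_started"
    return "skipped_active"


def run(torrents, dry_run=True):
    """Return per-decision counts and the hashes selected for deletion.

    In dry-run mode the deletion list stays empty, so rules can be checked
    against live state before anything is removed.
    """
    counts = {}
    to_delete = []
    for t in torrents:
        decision = janitor_decision(t["progress"], t["seeding_time"], t["age"])
        counts[decision] = counts.get(decision, 0) + 1
        if decision == "never_started" and not dry_run:
            to_delete.append(t["hash"])
    return counts, to_delete
```

Checking the H&R guard first means a completed torrent inside the 72-hour window is never considered by any delete rule.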

The ratio is still 0 because the API cannot buy less than 50 GiB and the
account sits at 24 551 BP (the 50 GiB tier costs 25 000 BP, or 25 500
with the 500 BP reserve). A manual 1 GiB purchase via the
MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4
and unblock announces. Future automation cannot do this for us due to
MAM's anti-spam rule.
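
The tier arithmetic behind this can be sketched as below. The constants come from this commit message (tiers of 50/100/200/500 GiB, 500 BP/GiB, 500 BP reserve); the function names are hypothetical, and how the real script derives `needed_gib` from the ratio deficit is not modelled.

```python
API_TIERS_GIB = (50, 100, 200, 500)  # valid API purchase sizes
BP_PER_GIB = 500
BP_RESERVE = 500


def affordable_gib(bp_balance):
    """Whole GiB purchasable after setting the reserve aside."""
    return max(bp_balance - BP_RESERVE, 0) // BP_PER_GIB


def pick_tier(needed_gib, bp_balance):
    """Smallest tier that covers the need and fits the budget, else 0 (skip)."""
    affordable = affordable_gib(bp_balance)
    for tier in API_TIERS_GIB:
        if needed_gib <= tier <= affordable:
            return tier
    return 0


print(affordable_gib(24551))  # 48
print(pick_tier(3, 24551))    # 0
print(pick_tier(3, 25500))    # 50
```

With 24 551 BP the affordable amount is 48 GiB, below the smallest tier, so the sketch skips. This matches the `affordable=48 buy=0` line in the spender log.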

### What is NOT in this change

- qBittorrent prefs reconciliation (max_active_downloads=20,
  max_active_uploads=150, max_active_torrents=150). The plan wanted
  this; deferred to a follow-up because the janitor + ratio recovery
  handles the 500-torrent backlog first. A small reconciler CronJob
  posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.

## Alerts added

- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
  QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.

    # Re-plan after import block removal (idempotent)
    $ ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 0 to add, 1 to change, 0 to destroy.
    # The 1 change is a pre-existing MetalLB annotation drift on the
    # qbittorrent-torrenting Service — unrelated to this change.

    $ cd ../monitoring && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    # Python + JSON syntax
    $ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
        "infra/stacks/servarr/mam-farming/files/bp-spender.py",
        "infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
    $ python3 -c 'import json; json.load(open(
        "infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'

### Manual verification

1. Grabber ratio-guard path:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1
       Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class

2. BP-spender tier path:

       $ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
       $ kubectl -n servarr logs job/s1
       Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
         | deficit=1.40 GiB needed=3 affordable=48 buy=0
       Done: BP=24551, spent=0 GiB (needed=3, affordable=48)

   Correctly skips because affordable (48) < smallest API tier (50).

3. Janitor in enforce mode:

       $ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
       $ kubectl -n servarr logs job/j1 | tail -3
       Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
         per reason: {'never_started': 466, ...}

   Second run immediately after: `deleted=0 skipped_active=35` —
   steady state with only active/seeding torrents left.

4. Alerts loaded:

       $ kubectl -n monitoring get cm prometheus-server \
           -o jsonpath='{.data.alerting_rules\.yml}' \
           | grep -E "alert: MAM|alert: QBittorrent"
         - alert: MAMMouseClass
         - alert: MAMCookieExpired
         - alert: MAMRatioBelowOne
         - alert: MAMFarmingStuck
         - alert: MAMJanitorStuckBacklog
         - alert: QBittorrentDisconnected
         - alert: QBittorrentMAMUnsatisfied

5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
   "MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
   balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
   deletion stacked chart, janitor state stat, grabber state stat.

## Reproduce locally

1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
   0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
   mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
   and read logs; outputs match the manual-verification snippets above.

Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:45:38 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched: every stack's `resource "kubernetes_namespace"`
block gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
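
A minimal sketch of such a brace-depth pass, assuming one brace change per structural line. It is not the actual `/tmp/add_goldilocks_ignore.py`; the function name and the two-space indentation of the injected block are assumptions, and it counts braces naively, so braces inside strings or comments would only be safe when balanced on the same line.

```python
import re

MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
LIFECYCLE = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace",
    '    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]',
    "  }",
]
RESOURCE_RE = re.compile(r'^resource "kubernetes_namespace" ')


def inject(text):
    """Insert the lifecycle block before each namespace resource's closing brace."""
    if MARKER in text:
        return text  # idempotent: the file already mentions the label
    out = []
    depth = 0
    inside = False
    for line in text.splitlines():
        if not inside and RESOURCE_RE.match(line):
            inside = True
            depth = 0
        if inside:
            new_depth = depth + line.count("{") - line.count("}")
            if depth > 0 and new_depth == 0:
                # This line holds the outermost closing brace.
                out.extend(LIFECYCLE)
                inside = False
            depth = new_depth
        out.append(line)
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


src = 'resource "kubernetes_namespace" "demo" {\n  metadata {\n    name = "demo"\n  }\n}\n'
print(inject(src))
```

Because the idempotency check is per file, a file holding two namespace resources gets both blocks in a single pass, and a second run leaves it unchanged.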

The Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns), confirmed via file diff (+8 lines). This is
why the injection count is 108 across 107 files.

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
95e49134ae cleanup: remove old audiobook-search, superseded by book-search
- Delete servarr/audiobook-search TF module (moved to ebooks/book-search)
- Remove audiobook-search from cloudflare_proxied_names
- Remove commented-out module reference in servarr/main.tf
- Clean up "renamed from" comment in ebooks/main.tf
- K8s resources (deploy/svc/ingress) deleted from servarr namespace
- Cloudflare DNS record already absent
- Import book-search and insta2spotify DNS records into cloudflared state
2026-03-25 23:16:01 +02:00
Viktor Barzin
6e1d8c0c8b add ebooks stack: consolidate book services into single namespace [ci skip]
- New ebooks namespace with CWA, Stacks, Audiobookshelf, book-search
- book-search (renamed from audiobook-search) with CWA ingest volume
- Comment out audiobook_search module from servarr
- All NFS volumes and secrets consolidated
2026-03-25 15:04:27 +02:00
Viktor Barzin
5b5a7d8cb4 add MAM email/password env vars to audiobook-search deployment
Reads mam_email and mam_password from Vault secret/servarr via ESO.
2026-03-25 12:03:12 +02:00
Viktor Barzin
4ca7af8818 add audiobook-search service to servarr stack
- New audiobook-search deployment + service + ingress (Authentik-protected)
- qBittorrent: add NFS mount for /audiobooks (shared with Audiobookshelf)
- Cloudflare DNS: add audiobook-search.viktorbarzin.me
- Env vars: QBITTORRENT_URL/PASS, AUDIOBOOKSHELF_URL/TOKEN from ESO
2026-03-24 01:21:49 +02:00
Viktor Barzin
39b3c51709 migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)

This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.

Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
57eed07370 [ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma
Add API credentials to SOPS and wire homepage_credentials through
stacks. Re-add Uptime Kuma widget with new "infra" status page slug.
2026-03-07 20:39:55 +00:00
Viktor Barzin
10acdcd5a2 [ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale
Wire homepage_credentials through servarr parent stack for prowlarr.
Fix paperless-ngx widget to use internal service URL.
2026-03-07 20:39:55 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
eb32190461 [ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik
- osrm-bicycle: 1Gi limit (loads 403MB routing graph)
- aiostreams: 768Mi limit (loads 44K anime entries)
- listenarr: 1Gi limit (.NET + Playwright/Chromium)
- authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn)
- servarr: pass nfs_server variable to all submodules
2026-02-28 17:03:33 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00