Commit graph

713 commits

Author SHA1 Message Date
Viktor Barzin
bf4c7618d8 wealth: SQLite→PG ETL sidecar + new Grafana dashboard
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.

Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
  cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
  namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
  busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
  psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
  the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
  net worth / contribution / growth / ROI% stats, then time-series
  for net worth, contribution-vs-market, growth area, per-account
  stacked area, cash-vs-invested, and a 100-row activity log.

Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match SQLite latest snapshot exactly.
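
The hourly dump step can be sketched with stdlib sqlite3; table, path, and function names here are illustrative, not the sidecar's actual script:

```python
import sqlite3

def dump_table_tsv(sqlite_path, table, out_path):
    """Dump one table to TSV. In the sidecar this reads from a
    `sqlite3 .backup` copy so the live Wealthfolio DB is never
    locked mid-scan. Table name is illustrative."""
    src = sqlite3.connect(sqlite_path)
    with open(out_path, "w") as f:
        for row in src.execute(f"SELECT * FROM {table}"):
            f.write("\t".join("" if v is None else str(v) for v in row) + "\n")
    src.close()

# The truncate-and-reload half would then feed the TSV to psql, e.g.:
#   psql "$PG_URL" -c "TRUNCATE daily_account_valuation" \
#                  -c "\copy daily_account_valuation FROM 'dump.tsv'"
```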
2026-04-25 17:07:33 +00:00
Viktor Barzin
ac8d2f548b paperless-ngx: migrate to proxmox-lvm-encrypted
Document scans (receipts, contracts, IDs) are unambiguously sensitive
PII. The storage decision rule defaults sensitive data to
`proxmox-lvm-encrypted`, but paperless-ngx had stayed on plain
`proxmox-lvm` after an abandoned migration attempt, which left a dormant,
non-Terraform-managed encrypted PVC sitting unbound for 11 days.

Cleaned up the orphan, added the encrypted PVC properly via Terraform,
rsynced data with deployment scaled to 0, swapped claim_name. Plain
`proxmox-lvm` PVC retained for a 7-day soak before removal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:48:53 +00:00
Viktor Barzin
4f5f1ff8c2 monitoring(uk-payslip): add yearly receipt stacked barchart panel
New panel 16 (barchart, h=11, y=179): one stacked bar per tax year showing
total comp split into net pay (bank deposit), cash income tax, RSU tax
(band-aware marginal: PAYE+NI), cash NI, student loan, pension salary-
sacrifice, and RSU offset (Variant A only).

X-axis = tax_year (categorical), y-axis = currencyGBP. Bar height ≈
gross_pay + pension_sacrifice (small over-attribution in Variant A years
where the band-aware model exceeds recorded payslip PAYE).
2026-04-25 16:26:57 +00:00
Viktor Barzin
288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).

Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, causing CrashLoopBackOff. Fix: provide all five fields
explicitly so the override survives future chart bumps. vault-1 and
vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).
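
The replace-not-merge failure mode is just map-assignment semantics; a minimal sketch (field values illustrative, not the chart's exact defaults):

```python
# Chart-default pod securityContext (values illustrative).
chart_defaults = {
    "runAsNonRoot": True, "runAsUser": 100, "runAsGroup": 1000,
    "fsGroup": 1000,
}

# What the old partial customization effectively did: assign a whole
# new map, so every default key silently vanishes.
replaced = {"fsGroupChangePolicy": "OnRootMismatch"}
assert "fsGroup" not in replaced  # defaults dropped -> root:root volume

# The fix: state all five fields explicitly so the rendered pod spec
# no longer depends on merge behaviour.
explicit = dict(chart_defaults, fsGroupChangePolicy="OnRootMismatch")
assert explicit["fsGroup"] == 1000 and explicit["runAsNonRoot"] is True
```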

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:19:49 +00:00
Viktor Barzin
b3c29eda12 monitoring(uk-payslip): model UK income-tax bands + PA-taper for RSU marginal
Replaces the flat 47% (45 PAYE + 2 NI) RSU marginal across panels 3, 7, 8, 11,
and 12 with an exact piecewise band-aware computation. Each row computes
ani_prior/ani_pre/ani_post over the tax-year YTD (chronological model — the
RSU is taxed at the band its YTD ANI position occupies at the vest date,
mirroring PAYE withholding behaviour).

Bands (2024/25+, applied to all years):
  IT:  0% / 20% / 40% / 60% (PA-taper) / 45%   at 12,570 / 50,270 / 100k / 125,140
  NI:  0% / 8% / 2%                            at 12,570 / 50,270

PA-taper modelled as 60% effective IT marginal in £100k–£125,140
(40% on the £1 + 40% on the £0.50 of lost PA = 60%).
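
A sketch of the band-aware computation, assuming the edges above: `band_tax` integrates the IT+NI marginal over the YTD ANI slice a vest occupies (function names illustrative, not the dashboard SQL):

```python
# 2024/25 band edges and marginal rates; the PA taper is modelled as a
# flat 60% IT band between 100,000 and 125,140, per the commit.
IT_EDGES = [(0, 0.0), (12_570, 0.20), (50_270, 0.40),
            (100_000, 0.60), (125_140, 0.45)]
NI_EDGES = [(0, 0.0), (12_570, 0.08), (50_270, 0.02)]

def marginal(edges, x):
    """Marginal rate in force at ANI position x."""
    rate = 0.0
    for lo, r in edges:
        if x >= lo:
            rate = r
    return rate

def band_tax(ani_pre, ani_post):
    """IT+NI attracted by the income slice between YTD positions
    ani_pre..ani_post, i.e. what a vest of (ani_post - ani_pre)
    costs at that point in the tax year."""
    points = sorted({lo for lo, _ in IT_EDGES}
                    | {lo for lo, _ in NI_EDGES}
                    | {ani_pre, ani_post})
    total = 0.0
    for a, b in zip(points, points[1:]):
        width = max(0.0, min(b, ani_post) - max(a, ani_pre))
        total += width * (marginal(IT_EDGES, a) + marginal(NI_EDGES, a))
    return total
```

A £30k vest landing at YTD ANI 110k→140k straddles the 62% (60+2) taper band and the 47% (45+2) additional band, which is exactly the divergence from the flat-47% baseline described below.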

Spot-checked per tax-year totals via psql; numbers diverge from the flat
47% baseline most for years where vests cross PA-taper or basic-rate bands
(2020/21 ~35%, 2024/25 ~41%, 2025/26 ~43%).
2026-04-25 16:14:49 +00:00
Viktor Barzin
43e4f3f68e immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted
Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).

The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override so
initdb completes cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.

Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.

See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.

Closes: code-ahr7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:47:30 +00:00
Viktor Barzin
0d5f53f337 monitoring(uk-payslip): replace misleading take-home rates in Panel 3
Drop the two misleading series in "Effective rate & take-home % (YTD
cumulative)" — both used SUM(gross_pay) as denominator while only
counting cash deductions/net in the numerator, which understated
take-home by 25-30 pp because RSU shares are absent from the cash
deposit but present in gross. Replaced with three semantically clean
angles:

- ytd_paye_rate_pct: SUM(income_tax) / SUM(taxable_pay) — HMRC audit
  rate (~41-42% in additional-rate band), kept as before.
- ytd_cash_take_home_pct: SUM(net_pay) / SUM(gross_pay - rsu_vest) —
  what fraction of cash earnings hits the bank (~62-65%).
- ytd_total_keep_pct: (SUM(net_pay) + 0.53 × SUM(rsu_vest)) /
  SUM(gross_pay) — true "what I actually keep" including post-tax RSU
  shares (47% marginal applied to vest value), ~55-60%.
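
With made-up numbers the three angles diverge as described; 0.53 = 1 − 0.47 marginal (illustrative sketch, not the panel SQL):

```python
# Two illustrative months: one plain, one with a large vest.
rows = [
    # gross_pay, rsu_vest, net_pay, income_tax, taxable_pay
    (10_000, 0,      6_200,  3_100,  9_500),
    (40_000, 30_000, 6_500, 16_000, 39_500),
]
g   = sum(r[0] for r in rows)   # gross_pay
rv  = sum(r[1] for r in rows)   # rsu_vest
net = sum(r[2] for r in rows)   # net_pay
it  = sum(r[3] for r in rows)   # income_tax
tp  = sum(r[4] for r in rows)   # taxable_pay

ytd_paye_rate_pct      = 100 * it / tp             # HMRC audit rate
ytd_cash_take_home_pct = 100 * net / (g - rv)      # cash-only denominator
ytd_total_keep_pct     = 100 * (net + 0.53 * rv) / g
# The dropped series divided net_pay by full gross: 25.4% here vs the
# honest 63.5% cash take-home — the 25-30 pp understatement.
```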

Added field overrides for clear color-coding (red/green/blue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:45:47 +00:00
Viktor Barzin
8f0d13282c monitoring(uk-payslip): drop cash PAYE/NI from "Tax & pension — monthly"
Same reasoning as panel 2: cash-side income_tax and NI are inherently
bumpy in vest months due to UK cumulative PAYE catching up on YTD,
and the flat-47% strip can't fix it. Panel now shows only the
explicit RSU vest tax (orange, 47% × rsu_vest), student loan, and
pensions. The smooth view of total cash deductions stays available on
panel 12 (YTD cumulative).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:43:32 +00:00
Viktor Barzin
2230cb6cf4 monitoring(uk-payslip): drop tax/NI from "Monthly cash flow (RSU stripped)" panel
Vest months still bumped 4-5x in this panel after the flat-47% strip
because UK cumulative PAYE genuinely catches up YTD tax in vest
months, on top of the marginal RSU portion — no arithmetic split can
make that line flat without distorting the data. The cash-flow
question this panel answers (what hits the bank, RSU aside) is
already covered cleanly by cash_gross + net_pay; the tax detail lives
on Panel 11 where the RSU split is now linear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:30:46 +00:00
Viktor Barzin
cb3ffa6d8d monitoring(uk-payslip): smooth quarterly RSU tax bumps via flat 47% marginal
Replace the implicit pro-rata RSU/cash split with an explicit flat
47% marginal (45% PAYE + 2% NI) for the RSU vest tax stack. The orange
slice now scales linearly with rsu_vest instead of wobbling around the
month's effective PAYE rate; cash PAYE/NI slices have those amounts
subtracted out so the stack still totals to actual deductions.

Affects panel 7 (monthly), panel 12 (YTD cumulative), and the Sankey
panel. Verified on 35 months of live data:
sum invariant holds exactly (cash + rsu_marginal + cash_ni ==
income_tax + national_insurance), no negatives in cash slices.
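
The sum invariant can be checked mechanically; a minimal sketch of the split (rates from the commit, figures made up):

```python
def split_deductions(income_tax, national_insurance, rsu_vest):
    """Flat-marginal split: attribute 45% PAYE + 2% NI of the vest
    value to the RSU slice, subtract those from the cash slices so
    the stack still totals actual deductions. Illustrative model."""
    rsu_marginal = 0.45 * rsu_vest
    rsu_ni       = 0.02 * rsu_vest
    cash_paye = income_tax - rsu_marginal
    cash_ni   = national_insurance - rsu_ni
    return cash_paye, rsu_marginal + rsu_ni, cash_ni

cash_paye, rsu_tax, cash_ni = split_deductions(18_000, 1_200, 30_000)
# Invariant: cash + rsu_marginal + cash_ni == income_tax + NI.
assert abs((cash_paye + rsu_tax + cash_ni) - (18_000 + 1_200)) < 1e-9
```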

Out of scope (left raw): effective-rate %, data-integrity, payslip
table, P60/HMRC reconciliation — those are audit views that use
unmodified income_tax / cash_income_tax columns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:13:29 +00:00
Viktor Barzin
d231615ebb [monitoring] Fix fuse voltage alerts — divide raw deciVolt reading by 10
The tuya-bridge exporter reports `fuse_main_voltage` and
`fuse_garage_voltage` as raw uint16 from the Tuya protocol, which
encodes voltage in deciVolts (e.g. 2352 = 235.2V). The 200/260V
thresholds were comparing against the raw integer, so both
FuseMainVoltageAbnormal and FuseGarageVoltageAbnormal fired
continuously during normal mains conditions.

Dividing in the expression also makes `{{ $value }}V` render the
correct human-readable value in the alert summary.

Root fix would be in tuya-bridge `_decode_value()` where
`name.startswith("voltage")` returns `int.from_bytes(...)` without the
/10 scaling that `decode_voltage_threshold` applies. Leaving that
alone to avoid breaking the automatic_transfer_switch scrape which
uses a different code path (`parse_voltage_string`).
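
The scaling fix in miniature (thresholds from the alert, reading from the example above):

```python
def decivolts_to_volts(raw):
    """Tuya encodes mains voltage as raw uint16 deciVolts: 2352 -> 235.2 V."""
    return raw / 10

raw = 2352
volts = decivolts_to_volts(raw)
# Alert thresholds: abnormal outside 200-260 V.
assert volts == 235.2 and 200 <= volts <= 260
# Before the fix the raw integer was compared, so a healthy mains
# reading of 2352 "exceeded" 260 and the alert fired continuously.
assert not (200 <= raw <= 260)
```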
2026-04-24 11:12:56 +00:00
Viktor Barzin
a5e4db9af8 [monitoring] Tuya Cloud root-cause alert + cascade suppression
New alert TuyaCloudDown fires when any *_tuya_cloud_up gauge == 0
(i.e., the Tuya Cloud API rejects scrape calls — the symptom during
last night's iot.tuya.com trial expiry, code=28841002). 5m for-duration
beats the 15m window of the seven downstream *MetricsMissing alerts, so
the new Alertmanager inhibit rule suppresses the per-device noise and
only TuyaCloudDown pages.

Also flips helm_release.prometheus.force_update from true to false:
force_update was tripping on the pushgateway PVC added in rev 188
(commit e51c104) — Helm's --force path tried to reset spec.volumeName
on a bound PVC. Disabled here; re-enable temporarily when a
StatefulSet volumeClaimTemplate change actually needs --force.

Bundled with pre-existing working-tree additions for Fuse/Thermostat
threshold alerts and expanded PowerOutage inhibit regex (landed in the
same Helm revision 190).

Verified: rule loaded, value=7 (all 7 tuya-bridge devices report
cloud_up=0 right now), TuyaCloudDown moved pending→firing after 5m,
3 *MetricsMissing alerts currently suppressed in Alertmanager with
inhibitedBy=1 (thermostat alerts still pending their 15m window, will
be suppressed on transition).
2026-04-23 09:59:48 +00:00
Viktor Barzin
5ebd3a81c3 tuya-bridge: liveness probe hits /health so k8s restarts silently-hung bridge
The bridge was down 10h 40m on 2026-04-22 without being restarted —
the liveness probe hit `/` (trivial Flask handler) which passed while
the actual Tuya-cloud call path was stuck. /health now reports Tuya
cloud reachability via a background probe in the app; point both
probes at it. Liveness: 60s grace + 6x30s = 3min of 503s before
restart; readiness: 2x15s = 30s before removal from service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 07:47:41 +00:00
Viktor Barzin
8e55c4357a [poison-fountain] opt ingress out of Uptime Kuma external monitor
Deployment is scaled to replicas=0 to silence ExternalAccessDivergence,
but the ingress at poison.viktorbarzin.me was still auto-annotated
`external-monitor=true` by ingress_factory (dns_type=non-proxied path),
so external-monitor-sync kept creating `[External] poison` which
probed a backend with no endpoints and flagged DOWN.

Setting `external_monitor = false` emits the explicit opt-out
annotation; the next sync run deleted the orphaned monitor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:24:22 +00:00
Viktor Barzin
344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
   backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
   window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
   but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
   the main Deployment runs as root, so mkdir /data/cache fails
   every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
   is set on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00
Viktor Barzin
f1f723be83 [technitium] zone-sync now reconciles primaryNameServerAddresses
When a zone is created against a stale primary IP (e.g. the old primary
pod IP 10.10.36.189 before the technitium-primary ClusterIP service
existed), AXFR refresh keeps failing forever while every other zone on
the same replica refreshes fine from 10.110.37.186. The resync-only
branch didn't touch zone options, so the bad IP was pinned indefinitely.

This surfaced as rpi-sofia.viktorbarzin.lan returning 192.168.1.16
(pre-move) on secondaries while primary had the correct .10 from
2026-04-22 morning — Uptime Kuma Sofia RPI monitor DOWN,
cluster_healthcheck FAIL.

The sync loop now re-applies primaryNameServerAddresses on every run
for existing zones. Idempotent — Technitium accepts identical values
— and self-heals any drift within 30 min. Env renamed PRIMARY_IP →
PRIMARY_HOST for consistency with the reconcile semantics.

Hostname form (technitium-primary.technitium.svc.cluster.local) was
tried but Technitium's own resolver doesn't forward svc.cluster.local,
so the field must stay a literal IP. Terraform tracks the ClusterIP on
every apply and the reconcile loop propagates it to replicas.
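
The resync-vs-reconcile difference, modelled abstractly (this is not Technitium's API, just the idempotent re-apply semantics the loop now follows):

```python
def reconcile_zone(zone_options, desired_primary):
    """Re-apply the desired primary on every run, not only at zone
    creation. Idempotent: applying the same desired state twice
    yields the same result, so drift self-heals within one cycle."""
    updated = dict(zone_options)
    updated["primaryNameServerAddresses"] = [desired_primary]
    return updated

stale = {"primaryNameServerAddresses": ["10.10.36.189"]}  # pinned bad IP
once  = reconcile_zone(stale, "10.110.37.186")
twice = reconcile_zone(once, "10.110.37.186")
assert once == twice == {"primaryNameServerAddresses": ["10.110.37.186"]}
```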
2026-04-22 17:47:18 +00:00
Viktor Barzin
7dfe89a6e0 [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes
Five compounding factors produced the 2026-04-22 flap cascade:
- RC1: soft anti-affinity let 2/3 pods co-locate on k8s-node3, which
  bounced NotReady→Ready at 11:42Z and took quorum with it.
- RC2: aggressive sentinel/probe timing amplified LUKS-encrypted LVM
  I/O stalls into spurious +switch-master loops.
- RC3: HAProxy's 1s polling raced sentinel failovers and routed
  writes to demoted masters.
- RC4: publish_not_ready_addresses=true fed not-yet-ready pods into
  HAProxy DNS.
- RC5: realestate-crawler-celery CrashLoopBackOff closed the
  feedback loop.

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false

Post-rollout verification clean: 0 flaps, 0 +switch-master events,
0 celery ReadOnlyError in the 60s window after settle. Docs updated.
2026-04-22 15:59:00 +00:00
Viktor Barzin
fdced7577b [monitoring] HomeAssistantCriticalSensorUnavailable alert 2026-04-22 14:52:23 +00:00
Viktor Barzin
dc05c440bc [hermes-agent] disable deployment — PVC permission mismatch
Main container crashes with "mkdir: cannot create directory '/opt/data':
Permission denied". Init container writes fine but main container runs
with different fsGroup/runAsUser. Scaling to 0 until the PVC permission
model is reworked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 14:31:50 +00:00
Viktor Barzin
a4eafafe49 [monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned
After k8s-node1 was silently cordoned and broke Frigate camera streams,
existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the
root cause proactively. This alert fires within 5m of the GPU node being
cordoned, before any pod restart attempts to schedule and fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 14:05:12 +00:00
Viktor Barzin
e2146e6916 gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
Viktor Barzin
134d6b9a82 vault runbook + raft/HA stuck-leader alerts
Post-2026-04-22 Step 5 deliverables:
- docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart
  sequence that avoids zombie containerd-shim + kernel NFS
  corruption, qm reset no-op gotcha, boot-order gotcha.
- prometheus_chart_values.tpl — VaultRaftLeaderStuck +
  VaultHAStatusUnavailable. Silent until vault telemetry
  scraping lands (tracked as beads code-vkpn).

Epic for moving vault off NFS tracked as beads code-gy7h.
2026-04-22 12:44:46 +00:00
Viktor Barzin
2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
Viktor Barzin
d39770b30d monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection
Threshold was 48h + 30m for a job that runs daily. We don't need
to wait 2.5 days to detect a broken timer — bring it down to 30h
+ 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.

Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00
Viktor Barzin
9b4970da61 monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits
- HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775)
  labels so the two same-named alerts are distinguishable in routing.
- HeadscaleDown (deployment-replicas flavor, line 1414) → rename to
  HeadscaleReplicasMismatch. Line 2039 keeps HeadscaleDown as the real
  up-metric critical check. NodeDown inhibit rule updated to suppress
  the renamed alert too.
- EmailRoundtripStale (line 1816): for 10m → 20m. Survives one missed
  20-min probe cycle before firing, cuts flapping (12 short-burst fires
  over last 24h).

ATSOverload tuning skipped: 24h fire-count is 0, it's continuously
firing not flapping — already-known sustained 83% ATS load, tuning
would not change behavior.

8 backup *NeverSucceeded rules audited: all 7 using
kube_cronjob_status_last_successful_time target real K8s CronJobs with
active metrics (not Pushgateway-sourced). PrometheusBackupNeverRun
already uses absent() correctly. No fixes needed.
2026-04-21 22:29:15 +00:00
Viktor Barzin
ac695dea38 [registry] bulk-clean 34 orphan manifests + beads-server image bump
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:34 +00:00
Viktor Barzin
9041f52b05 monitoring: TechnitiumZoneCountMismatch — compare replicas only, exclude primary
Primary has only the Primary-type zones it owns (10). Replicas have those
+ built-in zones (localhost, in-addr.arpa reverse, etc.), so their count
(14) can never match primary. The alert expr compared max-min across all
instances, so it fired chronically.

Fix: instance!="primary" filter. The real signal this alert wants is
"did one replica drift from the others" — replica-to-replica comparison
captures that; primary was never comparable.
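
The fixed comparison, sketched (instance names and counts illustrative):

```python
def zone_count_drift(counts, exclude="primary"):
    """Replica-to-replica drift: max-min zone count over non-primary
    instances, mirroring the instance!="primary" filter."""
    vals = [c for inst, c in counts.items() if inst != exclude]
    return max(vals) - min(vals)

counts = {"primary": 10, "replica-1": 14, "replica-2": 14}
assert zone_count_drift(counts) == 0  # healthy: replicas agree
# Old expr compared across all instances -> 14-10 = 4, chronic fire.
assert max(counts.values()) - min(counts.values()) == 4
```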
2026-04-19 22:15:55 +00:00
Viktor Barzin
e092f159b3 monitoring: drop MAM Mouse-class + qBittorrent-unsatisfied alerts
Both alerts fired as expected noise while the MAM account is in new-member
Mouse class — tracker refuses announces and the 72h seed-gate can't be met
until ratio recovers. Keeping the rest of the MAM rules (cookie expiry,
ratio, farming/janitor stalls, qbt disconnect) which still signal real
pipeline failures.

Firing count drops from 7 → 3 in healthcheck.
2026-04-19 21:24:46 +00:00
Viktor Barzin
68a10905e0 [monitoring] uk-payslip Panel 13: stacked bars + sum-in-legend
"Monthly cash flow — tax impact (RSU excluded)" was already stacking
group A in normal mode but rendered as 70%-opacity filled lines — the
overlap made the total-per-month figure visually inaccessible.

Switch drawStyle to bars (100% fill, 0-width lineWidth, no per-point
markers) so each month reads as a single stacked bar whose top edge is
the total cash-side deduction. Add "sum" to legend.calcs so the
tax-year totals per series show in the legend table alongside last and
max.

Panel 11 (Tax & pension — monthly, RSU-inclusive) retains the line/
area style so the two panels remain visually distinct.
2026-04-19 20:31:53 +00:00
Viktor Barzin
2224a6b2cc [job-hunter] Bump image to 92afc38d — Frankfurter FX + comp_table COALESCE 2026-04-19 19:09:54 +00:00
Viktor Barzin
e813170960 [job-hunter] Bump image to 99ab188f — levels.fyi per-level + comp_points
99ab188f adds the structured-comp pipeline: levels.fyi __NEXT_DATA__
scraper, Robert Walters + Hays PDF parser, comp_points/levels tables
(alembic 0003), CLI comp/comp-table/comp-band/backfill-levels, and
Grafana panels 6-9. Alembic 0003 runs via the existing init container.

After apply, exec:
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter backfill-levels
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter refresh --source levels_fyi
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter refresh --source uk_surveys
2026-04-19 18:56:20 +00:00
Viktor Barzin
3f6dfb10aa [monitoring] job-hunter: panels 6-9 for comp_points tables + trends
Append the structured-comp dashboard surface to the job-hunter
dashboard:

Panel 6 — Per-company salary by level (p50 base, GBP table).
Panel 7 — Total-comp heatmap per (company, level), p50 GBP.
Panel 8 — Comp-point volume by source (daily time-series).
Panel 9 — Base-salary trend (p50) over time for the top 5 companies.

Adds templating: $location (multi, default london), $level (single,
default senior), $company (multi, default all) — populated from
comp_points + levels metadata so the selection reflects what was
actually ingested.

Closes: code-5ph
2026-04-19 18:50:48 +00:00
Viktor Barzin
a8280e77b6 [broker-sync] unsuspend IMAP + Panel 15 RSU vest reconciliation (Phase D)
Activates the Schwab/InvestEngine IMAP ingest CronJob that's been
scaffolded-but-suspended since Phase 2 of broker-sync, now that the
Schwab parser can detect vest-confirmation emails. Runs nightly 02:30 UK.

Current behaviour once deployed:
  - Trade confirmations (Schwab sell-to-cover, InvestEngine orders) →
    Activity rows posted to Wealthfolio. Unchanged.
  - Release Confirmations (Schwab RSU vests) → parser returns gross-vest
    BUY + sell-to-cover SELL Activities (to Wealthfolio) and a VestEvent
    object (NOT YET persisted — Postgres sink + DB grant pending; see
    follow-up under code-860). Vest detection uses a subject/body
    heuristic that will need tightening against a real email fixture.

Panel 15 of the UK payslip dashboard added: per-vest-month join of
payslip.rsu_vest vs rsu_vest_events (gross_value_gbp, tax_withheld_gbp)
with delta columns. Tax-delta-percent coloured green/orange/red at
0/2%/5% thresholds. Table is empty until broker-sync starts persisting
VestEvents — harmless until then.

Before applying:
  - Verify IMAP creds in Vault (secret/broker-sync: imap_host,
    imap_user, imap_password, imap_directory) are still valid.
  - Empty vest-event table is expected; delta columns show NULL until
    the postgres sink lands.

Part of: code-860
2026-04-19 18:29:01 +00:00
Viktor Barzin
1c0e1bcdde [payslip-ingest] ActualBudget payroll sync CronJob + Panel 14 (Phase C)
Wires the daily ActualBudget deposit sync from the payslip-ingest app into
K8s as a CronJob, and adds dashboard Panel 14 to overlay bank deposits
against payslip net_pay.

CronJob: actualbudget-payroll-sync in payslip-ingest namespace, runs
02:00 UTC. Calls `python -m payslip_ingest sync-meta-deposits`, which
hits budget-http-api-viktor in the actualbudget namespace and upserts
matching Meta payroll deposits into payslip_ingest.external_meta_deposits.

ExternalSecret extended with three new Vault keys:
  - ACTUALBUDGET_API_KEY (same as actualbudget-http-api-viktor's env API_KEY)
  - ACTUALBUDGET_ENCRYPTION_PASSWORD (Viktor's budget password)
  - ACTUALBUDGET_BUDGET_SYNC_ID (Viktor's sync_id)

These must be seeded at secret/payslip-ingest in Vault before the
CronJob will run — it'll CrashLoop on missing env vars otherwise. First
run can be triggered on demand via `kubectl -n payslip-ingest create
job --from=cronjob/actualbudget-payroll-sync initial-sync`.

Panel 14 plots monthly SUM(external_meta_deposits.amount) vs
SUM(payslip.net_pay), plus a delta bar series — |delta| > £50 flags
likely parser drift on net_pay.

Part of: code-860
2026-04-19 18:21:20 +00:00
Viktor Barzin
ef53053ae6 [job-hunter] Bump image to 48f8615d — London filter + AI CLI
New image adds Alembic 0002 (primary_location column), London-default
query/bands/report commands, and FX-priming on refresh so USD/EUR
salaries convert correctly. Applied live; 5826 rows backfilled.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:13:26 +00:00
Viktor Barzin
fca3dd4976 [monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL
Phase A of RSU tax spike fix. Two changes:

1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite
   the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart
   is honest once the Phase B back-fill populates cash_income_tax on
   variant-A slips. For slips where cash_income_tax is already populated
   (variant B, 2024+) the spike is removed immediately.

2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL
   on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange)
   highlights the back-fill remaining population — expected to drop to 0
   after Phase B lands.

Part of: code-860
2026-04-19 18:04:05 +00:00
Viktor Barzin
fec0bbb7dd [job-hunter] Pin to first built image tag 9c42eac9
Locally-built image pushed to registry.viktorbarzin.me/job-hunter:9c42eac9
after a Woodpecker v3.13 Forgejo webhook parsing bug left CI unable to
build the initial image (nil pointer panic on parse at
server/forge/forgejo/helper.go:57 — repaired webhooks still don't
trigger pipelines).

Unblocks code-97n (TF apply) without waiting for CI recovery.

Refs: code-snp, code-0c6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:48:16 +00:00
Viktor Barzin
f4d3fdb2e3 [monitoring] uk-payslip: drop RSU-vest annotations
Vertical orange markers at every vest month added more visual noise
than signal. Panel 13 (cash-only) already conveys the "no spike on
vest months" story without needing markers across panels 1/2/3/7/11/12.
2026-04-19 17:32:49 +00:00
Viktor Barzin
a641dc744f [monitoring] uk-payslip: RSU vest annotations + cash-only tax panel
Panel 11 stacks RSU-attributed income tax on top of cash PAYE, which
is mathematically correct but emotionally misleading since RSU tax is
withheld at source via sell-to-cover and never hits the bank. Adopts
the two-view convention: Panel 11 keeps the full PAYE picture; new
Panel 13 shows cash-only deductions. Dashboard-level "RSU vests"
annotation paints orange markers on every vest month across all
timeseries panels, with tooltips like "RSU vest: £31232 gross /
£15257 tax withheld".

Shifts Panels 4/5/6/8/9/10 down by 9 rows to make room for Panel 13
at y=29.
2026-04-19 17:24:35 +00:00
Viktor Barzin
c9d6343a9b [job-hunter] Switch ExternalSecret to explicit UPPERCASE data mappings
Replaces dataFrom.extract with per-key `data` entries so the Secret
keys in K8s (and therefore env vars in the pod) are always UPPERCASE:
WEBHOOK_BEARER_TOKEN, CDIO_API_KEY, SMTP_USERNAME, SMTP_PASSWORD,
DIGEST_TO_ADDRESS, DIGEST_FROM_ADDRESS. Vault KV keys at
secret/job-hunter stay lowercase (webhook_bearer_token etc.).

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:28 +00:00
Viktor Barzin
e7ce545da2 [job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow
New service stack at stacks/job-hunter/ mirroring the payslip-ingest
pattern: per-service CNPG database + role (via dbaas null_resource),
Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app
secrets and DB creds, Deployment with alembic-migrate init container,
ClusterIP Service, Grafana datasource ConfigMap.

Grafana dashboard job-hunter.json in Finance folder: new roles per
day, source breakdown, top companies, GBP salary distribution, recent
roles table (sorted by parse confidence then salary).

n8n weekly-digest workflow calls POST /digest/generate with bearer
auth every Monday 07:00 London; digest_runs table provides
idempotency.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:09:29 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
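The verify-integrity walk can be sketched roughly as below. The endpoint shapes it assumes are the standard Registry v2 API (`GET /v2/<name>/manifests/<ref>`, `HEAD /v2/<name>/blobs/<digest>`); the function names and the injectable `get_manifest`/`head_blob` callables are illustrative, not the pipeline's actual step:

```python
# Hypothetical sketch of the verify-integrity walk; names are illustrative.
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}

def blob_digests(manifest: dict) -> list[str]:
    """Digests of the config blob + every layer of a single image manifest."""
    return [manifest["config"]["digest"]] + [
        layer["digest"] for layer in manifest["layers"]
    ]

def verify(repo: str, tag: str, get_manifest, head_blob) -> list[str]:
    """Walk index -> child manifests -> config/layer blobs.

    get_manifest(repo, ref) returns parsed manifest JSON;
    head_blob(repo, digest) returns an HTTP status code.
    Returns every digest that did NOT answer 200.
    """
    root = get_manifest(repo, tag)
    if root.get("mediaType") in INDEX_TYPES:
        manifests = [get_manifest(repo, m["digest"]) for m in root["manifests"]]
    else:
        manifests = [root]
    return [
        digest
        for m in manifests
        for digest in blob_digests(m)
        if head_blob(repo, digest) != 200
    ]
```

The pipeline step would fail the build whenever `verify(...)` returns a non-empty list — exactly the orphan-index case, where the index resolves but a child's blobs 404.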

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
df2c53db8d [infra] TrueNAS decommission — remove active references from Terraform + configs
TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13.
The subagent-driven doc sweep in 5a0b24f5 covered the prose. This commit
removes the remaining in-code references:

- reverse-proxy: drop truenas Traefik ingress + Cloudflare record
  (truenas.viktorbarzin.me was 502-ing since the VM stopped), drop
  truenas_homepage_token variable.
- config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME
  truenas`, and the commented-out `iscsi`/`zabbix` A records.
- dashy/conf.yml: remove Truenas dashboard entry (&ref_28).
- monitoring/loki.yaml: change storageClass from the decommissioned
  `iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC
  (Loki is currently disabled).
- actualbudget/main.tf + freedify/main.tf: update new-deployment
  docstrings to cite Proxmox host NFS instead of TrueNAS.
- nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass
  noting the name is historical — 48 bound PVs reference it, SC names
  are immutable on PVs, rename not worth the churn.

Also cleaned out-of-band:
- Technitium DNS: deleted `truenas.viktorbarzin.lan` A and
  `iscsi.viktorbarzin.lan` CNAME records.
- Vault: `secret/viktor` → removed `truenas_api_key` and
  `truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed.
- Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas`
  destroyed the 3 K8s/Cloudflare resources cleanly.

Deferred:
- VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit
  user go-ahead.
- `nfs-truenas` StorageClass name retained (see nfs-csi comment above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:57:05 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
5f832e37d0 [monitoring] UK Payslip — add tax & pension breakdown panels
New Panel 11 (monthly) + Panel 12 (YTD cumulative), side-by-side at
y=19. Six series each: cash income tax, RSU-attributed income tax, NI,
student loan, employee pension, employer pension. Employer pension
included to show full retirement contribution picture (paid on top of
salary, not deducted from take-home). Downstream panels shifted down
by 10.
2026-04-19 16:53:32 +00:00
Viktor Barzin
ab402b3421 [monitoring] UK Payslip Panel 7 — trim to 5 semantic layers
Drop ytd_student_loan (~£200-300/mo noise) and ytd_rsu_offset (always
£0 on post-2024 Meta variant-B payslips) from the YTD uses stack. Now
mirrors Panel 1's 4-way source breakdown clarity: take-home, cash PAYE,
RSU PAYE, NI, pension. Student loan + RSU offset still surface on
Panel 8 Sankey.

Title: "YTD uses — where gross went" (mirrors Panel 1 label pattern).
2026-04-19 16:37:12 +00:00
Viktor Barzin
e55c549c9a [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.

 - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
   destroy cleaned up the StatefulSet redis-node (already scaled to 0),
   ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
   ClusterIP services that the chart owned.
 - Removed `null_resource.patch_redis_service` — the kubectl-patch hack
   that worked around the Bitnami chart's broken service selector. No
   Helm chart, no patch needed.
 - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
   deployment.
 - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
   orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
 - Simplified the top-of-file comment and the redis-v2 architecture
   comment — they talked about the parallel-cluster migration state that
   no longer exists. Folded in the sentinel hostname gotcha, the redis
   8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
   so the rationale survives in the code rather than only in beads.
 - `RedisDown` alert no longer matches `redis-node|redis-v2` — just
   `redis-v2` since that's the only StatefulSet now. Kept the `or on()
   vector(0)` so the alert fires when kube_state_metrics has no sample
   (e.g. after accidental delete).
 - `docs/architecture/databases.md` trimmed: no more "pending TF removal"
   or "cold rollback for 24h" language.

Verification after apply:
 - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
   (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
 - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
 - Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
 - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
 - Prometheus: 0 firing redis alerts.

Closes: code-v2b
Closes: code-2mw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00
Viktor Barzin
b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover. This must be done on the OLD sentinels too, not just v2 —
   they're the ones that kept fighting our REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   Every key synced across dbs (db0:76, db1:22, db4:16) including
   `immich_bull:*` BullMQ queues and `_kombu.*` Celery queues — the
   user-stated must-survive data class.
Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.
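The hostname-persistence setup, as a sentinel.conf sketch — the directives are the real Sentinel options named above; the concrete master hostname is illustrative:

```conf
# Sketch of the relevant sentinel settings; master hostname is illustrative.
# With both hostname options on, sentinel rewrites its config with the
# hostname rather than a resolved pod IP, so pod restarts don't go stale.
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis-v2-0.redis-v2-headless.redis.svc.cluster.local 6379 2
```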

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
   60s — target <3s of actual user-visible disruption.

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
Viktor Barzin
b7ea122355 payslip-ingest: pin image_tag=4f70681d — includes migrations 0004+0005
Aligns the stack with the repo HEAD carrying migration 0004
(cash_income_tax + ytd_rsu_* columns), migration 0005 (p60_reference
table), the bonus-dedup logic, and the Woodpecker path-filter fix.

Applied + verified:
- pod rolled out with the new image, Alembic ran 0003→0004→0005
- cash_income_tax backfilled on 71/71 existing rows
- dashboard Panel 7 YTD split query returns real numbers
- no existing (tax_year, bonus) duplicates found — guard ships for future

Closes: code-7z0
2026-04-19 15:54:24 +00:00
Viktor Barzin
bc866d53fa [servarr/mam-farming] Tune grabber for MAM's real catalogue
## Context

After the Mouse-class unblock on 2026-04-19, end-to-end testing of the
grabber revealed three issues with the plan's original filter values:

1. **`SEEDER_CEILING=50` rejects ~99% of MAM's catalogue.** MAM is a
   well-seeded private tracker — 100-700 seeders per torrent is normal.
   A ceiling of 50 makes the filter too tight: across 140 FL torrents
   sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied
   swarms") is still valid; the threshold was wrong for MAM's shape.

2. **`RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now
   over-tight.** Its job is preventing the death spiral where Mouse-class
   accounts can't announce, so any grab deepens the ratio hole. Once
   class > Mouse, MAM serves peer lists normally and demand-first
   filtering (`leechers>=1`) keeps new grabs upload-positive on average.
   With ratio sitting at 0.7 post-recovery (we over-downloaded while
   unblocking), 1.2 was preventing the very grabs that would earn us
   back to a healthy ratio.

3. **`parse_size` crashed on `"1,002.9 MiB"`.** MAM's pretty-printed
   sizes use thousands separators; `float("1,002.9")` raises
   `ValueError`. Every grabber run that hit a ≥1000-MiB candidate on
   the page crashed with a traceback instead of skipping the size.

## This change

- `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was
  rejecting viable demand-first candidates like `Zen and the Art of
  Motorcycle Maintenance` (S=156, L=1, score=125).
- `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips,
  but no longer a steady-state block. Class == Mouse remains an
  absolute skip (separate branch).
- `parse_size`: `s.replace(",", "").split()` before the float parse.
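A minimal reconstruction of the repaired parser — the real function's exact signature and return unit are assumptions; only the comma-stripping fix is from this change:

```python
# Hypothetical reconstruction of parse_size with the comma fix applied.
# Returns MiB; the unit table is assumed, not copied from the grabber.
UNITS_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

def parse_size(s: str) -> float:
    # "1,002.9 MiB" -> ["1002.9", "MiB"]. The .replace(",", "") is the fix:
    # float("1,002.9") raises ValueError on MAM's thousands separators.
    value, unit = s.replace(",", "").split()
    return float(value) * UNITS_MIB[unit]
```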

## Verified post-change

Manual grabber loop (5 runs at random offsets) after applying:

    run=1  parse_size crash on "1,002.9" (this crash motivated fix #3)
    run=2  GRABBED 3 torrents:
             Dean and Me: A Love Story      (240.7 MiB, S:18, L:1)  score=194
             Digital Nature Photography      (83.7 MiB, S:42, L:1)  score=182
             Zen and the Art of Motorcycle   (830.3 MiB, S:156, L:1) score=125
    run=3-5 grabbed=0 at offsets that landed on pages with no matches
            (expected — MAM returns 20/page, many offsets yield nothing)

MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock),
BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading
~8 MiB to MAM this session across 2 of the Maxximized comic issues.

## What is NOT in this change

- No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass`
  (1h) already handle the "going back to Mouse" case; lowering the floor
  on the grabber doesn't change alerting.
- No janitor changes — the janitor rules are H&R-based and independent
  of ratio/class state.

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    $ python3 -c 'import ast; ast.parse(open(
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())'

### Manual Verification

1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1 | head -5
       Profile: ratio=0.7 class=User | Farming: 33, 2.0 GiB, tracked IDs: 4
       Search offset=<random>, found=1323, page_results=20
       Added (score=...) ...

2. Repeat 3-5× at different random offsets. Over the course of a 30-min
   cron cadence, expect 2-5 grabs across the day given MAM's catalogue
   churn and our filter intersection.

## Reproduce locally

    cd infra/stacks/servarr
    ../../scripts/tg plan  # expect: 0 to add, 2 to change (configmap + cronjob)
    ../../scripts/tg apply --non-interactive
    kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
    kubectl -n servarr logs job/g1

Follow-up: `bd close code-qfs` already completed in the parent commit;
this is a post-shipping tune, no beads action needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:46:46 +00:00