Compare commits

2 commits

Author SHA1 Message Date
Viktor Barzin
e783cae2cb chrome-service + mam-farming: doc clarifications (+ re-trigger CI apply missed earlier)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).

- chrome-service: note the live pod's container order had drifted from this file's
  order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
  apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.
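
The HEAD~1..HEAD gap described above can be sketched as follows. The pipeline's real detection step is not shown in this diff, so this assumes it reduces changed paths to stack directories and that stacks sit at stacks/<group>/<stack>/ (modeled on the one path quoted in the mam-farming hunk). `stacks_from_paths`, `PREV_SHA` and `main.tf` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical: reduce changed file paths (one per line on stdin) to the set of
# stack directories, assuming a stacks/<group>/<stack>/... layout.
stacks_from_paths() {
  awk -F/ '$1 == "stacks" && NF >= 4 { print $1 "/" $2 "/" $3 }' | sort -u
}

# In CI the input would come from git, for example:
#   git diff --name-only HEAD~1..HEAD          # last commit only
#   git diff --name-only "$PREV_SHA"..HEAD     # every commit in the push
# PREV_SHA is a placeholder for the pre-push commit, however the CI exposes it.
printf '%s\n' \
  stacks/servarr/mam-farming/files/freeleech-grabber.py \
  stacks/servarr/mam-farming/main.tf \
  README.md |
  stacks_from_paths
```

With a multi-commit push, only the second form sees files touched by the earlier commits, which is why a stack changed only in the first commit can be skipped.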

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:23 +00:00
Viktor Barzin
b0e8e3599f nfs-mirror: exclude SQLite WAL/SHM sidecars + treat rsync exit 24 as success
NfsMirrorFailing fired ~13% of nights (3/23 runs, all rsync exit 24). Root cause:
calibre-web-automated keeps a WAL-mode SQLite queue.db on /srv/nfs, whose -wal/-shm
sidecars are created/checkpointed/deleted constantly and vanish between rsync's
file-list scan and the transfer ("file has vanished" -> exit 24). The mirror
actually completes every run; only transient files disappear.

Two fixes: (1) exclude *-wal/*-shm/*-journal -- these must never be in a raw mirror
anyway (a WAL without an atomic .db snapshot is useless to restore; daily-backup
makes the consistent SQLite copies). (2) Treat rsync exit 24 as success-with-warning
so the run still appends to the offsite manifest (a code-24 night previously skipped
that, delaying those changes to the monthly full sync) and the alert stops
false-firing.
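
A minimal sketch of fix (2), separate from the real script: it maps an rsync exit code to ok, warn or fail so that code 24 still counts as a completed run. The function names are hypothetical; the meanings of codes 23 and 24 are as documented in the rsync man page.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from nfs-mirror: maps an rsync exit code to
# "ok", "warn" (completed, but worth logging) or "fail".
classify_rsync_rc() {
  case "$1" in
    0)  echo ok ;;
    24) echo warn ;;   # rsync: partial transfer due to vanished source files
    *)  echo fail ;;   # everything else, e.g. 23 (partial transfer due to error)
  esac
}

# Succeeds for exit codes the mirror treats as a completed run.
rsync_run_succeeded() {
  [ "$(classify_rsync_rc "$1")" != fail ]
}

for rc in 0 24 23; do
  if rsync_run_succeeded "$rc"; then
    echo "rc=$rc -> $(classify_rsync_rc "$rc"): update manifest"
  else
    echo "rc=$rc -> fail: skip manifest, let the alert fire"
  fi
done
```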

Deployed to the PVE host via scp to /usr/local/bin/nfs-mirror (host script, not TF).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:22 +00:00
3 changed files with 21 additions and 1 deletion

@@ -93,6 +93,16 @@ EXCLUDES=(
--exclude='*@synoeastream'
--exclude='/.DS_Store'
--exclude='/Thumbs.db'
+# ---- transient SQLite sidecars (WAL mode) ----
+# Created/checkpointed/deleted constantly, so they vanish mid-rsync and trip
+# exit code 24 (root cause of NfsMirrorFailing on calibre-web-automated's
+# queue.db, 2026-05/06). They must NEVER be in a raw mirror anyway: a -wal/-shm
+# without an atomic .db snapshot is useless to restore from. Consistent SQLite
+# copies are made separately by daily-backup (SQLite backup API).
+--exclude='*-wal'
+--exclude='*-shm'
+--exclude='*-journal'
)
log() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG"; }
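
To preview which files the three new patterns would drop, the sketch below approximates rsync's rule that an exclude pattern without a slash is matched against the last path component only. `is_sqlite_sidecar` and the sample paths are hypothetical; this is a bash glob standing in for rsync's matcher, not rsync itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximates how rsync applies a slash-free exclude
# pattern, i.e. against the last path component only.
is_sqlite_sidecar() {
  local name="${1##*/}"   # strip directories without spawning basename
  case "$name" in
    *-wal|*-shm|*-journal) return 0 ;;
    *) return 1 ;;
  esac
}

for path in app/queue.db app/queue.db-wal app/queue.db-shm app/old.db-journal; do
  if is_sqlite_sidecar "$path"; then
    echo "exclude  $path"
  else
    echo "mirror   $path"
  fi
done
```

Because the patterns are suffix-only, they also exclude any non-SQLite file whose name happens to end in -wal, -shm or -journal.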
@@ -155,7 +165,12 @@ rsync \
DST_BYTES=$(df -B1 --output=used /mnt/backup | tail -1)
-if [ "$RSYNC_RC" -eq 0 ]; then
+# rsync exit 24 = "some source files vanished before transfer" — benign for a
+# backup mirror: everything else copied; the vanished files are transient (e.g.
+# SQLite WAL/SHM, now mostly caught by the excludes above). Treat as success so
+# the offsite manifest still updates and NfsMirrorFailing doesn't false-fire.
+if [ "$RSYNC_RC" -eq 0 ] || [ "$RSYNC_RC" -eq 24 ]; then
+[ "$RSYNC_RC" -eq 24 ] && warn "rsync exited 24 (source files vanished mid-transfer) — treating as success"
# Capture files that rsync created/modified and feed them to the offsite-sync
# manifest so daily Step 1 incremental picks them up tomorrow morning.
# Use -cnewer (ctime), not -newer (mtime): rsync -t preserves SOURCE mtime
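
The -cnewer versus -newer distinction in the comment above can be reproduced with throwaway files. The comment is cut off in this view, so the reading here is an assumption: a freshly mirrored file keeps its old source mtime but gets a new ctime. The sketch needs GNU touch and find; all paths are temporary and hypothetical.

```shell
#!/usr/bin/env bash
# Demonstrates why a ctime test catches freshly written files that an mtime
# test misses. Requires GNU touch and find. All paths are throwaway temp files.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

# Reference marker: stands in for "when the run started", one hour ago.
touch -d '1 hour ago' "$demo_dir/marker"

# A file written just now but carrying an old mtime, as rsync -t would leave it.
echo data > "$demo_dir/mirrored.txt"
touch -d '2 days ago' "$demo_dir/mirrored.txt"

by_mtime=$(find "$demo_dir" -type f -name '*.txt' -newer "$demo_dir/marker")
by_ctime=$(find "$demo_dir" -type f -name '*.txt' -cnewer "$demo_dir/marker")

echo "-newer  found: ${by_mtime:-nothing}"
echo "-cnewer found: ${by_ctime:-nothing}"
```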

@@ -445,6 +445,10 @@ resource "kubernetes_deployment" "chrome_service" {
# clobber to the novnc image stick (chromium-not-found crashloop 2026-06-16)
# because TF could not revert the ignored field. Removed so TF re-asserts the
# pinned image. Keel is inert (keel.sh/policy=never) and no deploy step touches these.
+# NOTE: the LIVE pod's container order had drifted to [novnc, chrome-service,
+# snapshot] vs this file's [chrome-service, novnc, snapshot]; a TF apply reorders
+# them to match here (harmless), so `containers[0]` differs between live and TF
+# until the next apply lands. Don't be alarmed reading it back mid-reconcile.
spec[0].template[0].spec[0].init_container[0].image,
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
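
One way to sidestep the containers[0] confusion noted above when reading a live Deployment back is to select the container by name rather than by index. This is a sketch using a kubectl JSONPath filter; `container_image` is a hypothetical helper, and the namespace and Deployment name are placeholders because neither appears in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the image of a container selected by name, so the
# result does not depend on where that container sits in the list.
# Arguments: <namespace> <deployment> <container-name>
container_image() {
  local ns="$1" deploy="$2" name="$3"
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath="{.spec.template.spec.containers[?(@.name==\"$name\")].image}"
}

# Example (both names are placeholders):
#   container_image my-namespace my-deployment chrome-service
```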

@@ -2840,6 +2840,7 @@ serverFiles:
annotations:
summary: "MAM ratio is {{ $value | printf \"%.2f\" }} for 24h (target: >= 1.0)"
- alert: MAMFarmingStuck
+# Metric source: stacks/servarr/mam-farming/files/freeleech-grabber.py
# Heartbeat-based: fires only when the grabber CronJob has not COMPLETED
# a run in >4h (the original failure mode: Forbid-blocked / wedged in
# ContainerCreating). The grabber heartbeats mam_grabber_last_run_timestamp
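
The heartbeat idea can be checked by hand with plain shell arithmetic. The alert's actual expression is not shown in this hunk, so this is only an analogue under the assumption that staleness means "last completed run more than 4h ago". `heartbeat_stale` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check mirroring the alert's idea: a heartbeat is stale when the
# last completed run is more than max_age seconds old.
# Arguments: <last_run_epoch> <now_epoch> [max_age_seconds, default 4h]
heartbeat_stale() {
  local last="$1" now="$2" max_age="${3:-14400}"
  (( now - last > max_age ))
}

now=$(date +%s)
if heartbeat_stale "$(( now - 5 * 3600 ))" "$now"; then
  echo "stale: no completed run in over 4h"
fi
if ! heartbeat_stale "$(( now - 600 ))" "$now"; then
  echo "fresh: last run completed 10 minutes ago"
fi
```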