payslip-ingest

7 commits, 1 branch, 0 tags, 265 KiB

Author	SHA1	Message	Date
Viktor Barzin	86cac65572	processor: skip non-payslip docs by title pattern The Paperless 'payslip' tag has been applied over the years to P60 annual summaries, performance/year-end letters, Compensation_EMEA/PSC letters, comp-review letters, and RSU grant agreements. These are legitimate financial docs but not monthly payslips, and including them pollutes the dashboards (a P60 amount is ~12x a single month). Filter by title regex before hitting Claude so we skip cheaply and don't burn extraction credit on them. Status returned is 'skipped_non_payslip' to distinguish from the 'already-ingested' skip. Covers: P60, performance(letter|year-end), compensation_emea, psc, comp-letter, rsu grant*. New parameterized tests cover both the exclude list and representative real payslip titles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:32:17 +00:00
Viktor Barzin	c696bf32f0	backfill: continue on per-document errors instead of aborting A single doc that isn't a real payslip (e.g., an RSU letter wrongly tagged as payslip in Paperless) makes Claude return pay_date=null, which pydantic rejects with ValidationError. Previously this killed the whole backfill at the first bad doc, leaving 60 of 88 docs unprocessed. Catch + log + continue so the backfill processes every doc. Failed docs can be re-tagged or fixed individually later.	2026-04-18 23:25:36 +00:00
Viktor Barzin	3da24fdf7a	extractor: preextract PDF text with pdftotext before calling Claude Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds. - Dockerfile installs poppler-utils in the runtime stage (one-liner). - _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image). - Agent file documents the PAYSLIP_TEXT fast path — still handles PDF_BASE64 for fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:48:04 +00:00
Viktor Barzin	693ec4a5d4	extractor: wait up to 15min for claude-agent-service to free lock Real UK payslip extractions routinely take 5-10min end-to-end (Haiku processing 100-300KB base64'd PDFs). With 10 retries × 5s = 50s we'd abort while another extraction was still in-flight. Bump to 90 retries × 10s = 900s wait — enough to cover the server-side timeout_seconds=600 plus some slack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:36:05 +00:00
Viktor Barzin	7a32885d26	extractor: bump claude poll timeout to 600s Real UK payslip PDFs are 100-200KB base64'd, which means ~300-500KB of prompt tokens. Claude (even Haiku) takes 1-5 minutes to process and emit structured JSON. The original 120s ceiling timed out before extraction could finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:32:34 +00:00
Viktor Barzin	11a8256e6a	alembic: create schema before initializing version table The payslip_ingest schema must exist before Alembic creates its alembic_version tracking table inside it. Add CREATE SCHEMA IF NOT EXISTS at the top of do_run_migrations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:23:30 +00:00
Viktor Barzin	57484619c1	Initial commit: event-driven UK payslip ingest service Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI can watch it. Identical content to /home/wizard/code commit e426028. See README.md for overview, env vars, and Paperless workflow config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:10:23 +00:00
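
Commits 86cac65572 and c696bf32f0 together describe a cheap title filter followed by a backfill loop that survives per-document failures. The following is a minimal sketch of that combination, not the repository's code: the regex is reconstructed from the "Covers:" list in the commit message, and `process_document`, `backfill`, the `extract` callable, and the `ingested` and `failed` statuses are hypothetical names chosen for illustration. Only the `skipped_non_payslip` status comes from the log.

```python
import logging
import re
from typing import Callable, Dict, Iterable

log = logging.getLogger("payslip_backfill_sketch")

# Assumed patterns, derived from the exclusions named in commit 86cac65572.
# The lookarounds keep short tokens such as "psc" from matching inside
# longer words; the real regex in the repository may be written differently.
_PATTERNS = [
    r"(?<![a-z0-9])p60(?![a-z0-9])",
    r"performance[ _-]*(?:letter|year-end)",
    r"compensation_emea",
    r"(?<![a-z0-9])psc(?![a-z0-9])",
    r"comp-letter",
    r"rsu[ _-]*grant",
]
NON_PAYSLIP_TITLE = re.compile("|".join(_PATTERNS), re.IGNORECASE)


def process_document(title: str, extract: Callable[[str], None]) -> str:
    """Skip by title before the expensive extraction call is made."""
    if NON_PAYSLIP_TITLE.search(title):
        return "skipped_non_payslip"
    extract(title)  # stands in for the Claude extraction plus validation
    return "ingested"  # hypothetical status name


def backfill(titles: Iterable[str], extract: Callable[[str], None]) -> Dict[str, str]:
    """Process every document; log and continue when one of them fails."""
    results: Dict[str, str] = {}
    for title in titles:
        try:
            results[title] = process_document(title, extract)
        except Exception as exc:  # e.g. a validation error on pay_date=null
            log.warning("skipping %r after error: %s", title, exc)
            results[title] = "failed"  # hypothetical status name
    return results
```

Filtering before extraction means an excluded document never reaches the extractor at all, and catching errors inside the loop means one mis-tagged document cannot stop the remaining ones from being processed.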