payslip-ingest

9 commits 1 branch 0 tags 260 KiB

Author	SHA1	Message	Date
Viktor Barzin	1f2e73e024	alembic 0002: fix down_revision to '0001' (matches 0001_initial's id) 0001_initial.py declares revision='0001', not '0001_initial'. My 0002 migration had down_revision='0001_initial', so alembic couldn't splice it into the chain and silently ran 'upgrade head' as a no-op on pod startup. The rsu_vest/rsu_offset columns never got created and every INSERT from the new code failed with 'column does not exist'.	2026-04-18 23:41:29 +00:00
Viktor Barzin	9105b6b79d	extractor: track rsu_vest + rsu_offset separately from cash pay UK payslips for equity-comp employees report RSU vests as notional pay for HMRC only. A paired same-magnitude deduction (Shares Retained / Stock Tax Withholding / RSU Offset) nets it back out of cash. The UK payslip's income_tax line shows tax on the grossed-up total, but the actual RSU tax is handled by Schwab (US broker) via share sale. No cash flows through UK payroll for RSU. Previously the extractor folded RSU notional into gross_pay and income_tax, which inflated the dashboard numbers — a payslip with £25k RSU vest looked like 2x salary with 80% tax rate. Changes: - schema: add rsu_vest + rsu_offset fields (default 0). - db + alembic 0002: add two new NUMERIC(12,2) columns with server default 0 (backward-compatible; existing rows get 0). - validate_totals: include rsu_offset in deductions sum so the gross + rsu_vest inflation is properly netted out. - extraction prompt: explicit rules for identifying RSU lines by the common Meta/Sage/Workday labels, and to NOT put them in other_deductions. Dashboards in a follow-up commit: cash_gross = gross_pay - rsu_vest, effective tax rate based on cash metrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:37:25 +00:00
Viktor Barzin	86cac65572	processor: skip non-payslip docs by title pattern The Paperless 'payslip' tag has been applied over the years to P60 annual summaries, performance/year-end letters, Compensation_EMEA/PSC letters, comp-review letters, and RSU grant agreements. These are legitimate financial docs but not monthly payslips, and including them pollutes the dashboards (a P60 amount is ~12x a single month). Filter by title regex before hitting Claude so we skip cheaply and don't burn extraction credit on them. Status returned is 'skipped_non_payslip' to distinguish from the 'already-ingested' skip. Covers: P60, performance(letter|year-end), compensation_emea, psc, comp-letter, rsu grant*. New parameterized tests cover both the exclude list and representative real payslip titles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:32:17 +00:00
Viktor Barzin	c696bf32f0	backfill: continue on per-document errors instead of aborting A single doc that isn't a real payslip (e.g., an RSU letter wrongly tagged as payslip in Paperless) makes Claude return pay_date=null, which pydantic rejects with ValidationError. Previously this killed the whole backfill at the first bad doc, leaving 60 of 88 docs unprocessed. Catch + log + continue so the backfill processes every doc. Failed docs can be re-tagged or fixed individually later.	2026-04-18 23:25:36 +00:00
Viktor Barzin	3da24fdf7a	extractor: preextract PDF text with pdftotext before calling Claude Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds. - Dockerfile installs poppler-utils in the runtime stage (one-liner). - _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image). - Agent file documents the PAYSLIP_TEXT fast path — still handles PDF_BASE64 for fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:48:04 +00:00
Viktor Barzin	693ec4a5d4	extractor: wait up to 15min for claude-agent-service to free lock Real UK payslip extractions routinely take 5-10min end-to-end (Haiku processing 100-300KB base64'd PDFs). With 10 retries × 5s = 50s we'd abort while another extraction was still in-flight. Bump to 90 retries × 10s = 900s wait — enough to cover the server-side timeout_seconds=600 plus some slack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:36:05 +00:00
Viktor Barzin	7a32885d26	extractor: bump claude poll timeout to 600s Real UK payslip PDFs are 100-200KB base64'd, which means ~300-500KB of prompt tokens. Claude (even Haiku) takes 1-5 minutes to process and emit structured JSON. The original 120s ceiling timed out before extraction could finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:32:34 +00:00
Viktor Barzin	11a8256e6a	alembic: create schema before initializing version table The payslip_ingest schema must exist before Alembic creates its alembic_version tracking table inside it. Add CREATE SCHEMA IF NOT EXISTS at the top of do_run_migrations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:23:30 +00:00
Viktor Barzin	57484619c1	Initial commit: event-driven UK payslip ingest service Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI can watch it. Identical content to /home/wizard/code commit e426028. See README.md for overview, env vars, and Paperless workflow config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:10:23 +00:00
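
The title filter described in commit 86cac65572 can be sketched roughly as follows. This is only an illustration under stated assumptions: the log does not show the repository's actual regex or function names, so `_NON_PAYSLIP_TITLE` and `classify_title` are hypothetical, and the pattern is reconstructed from the "Covers:" list in the commit message. Only the status string 'skipped_non_payslip' comes from the log itself.

```python
import re
from typing import Optional

# Hypothetical pattern reconstructed from the commit message's "Covers:" list;
# the real regex in the repository may differ.
_NON_PAYSLIP_TITLE = re.compile(
    r"""
      (?<![a-z0-9])p60(?![a-z0-9])                 # P60 annual summaries
    | performance[\s_-]*(?:letter|year[\s_-]*end)  # performance / year-end letters
    | compensation[\s_-]*emea                      # Compensation_EMEA letters
    | (?<![a-z])psc(?![a-z])                       # PSC letters
    | comp[\s_-]*letter                            # comp-letter titles
    | rsu[\s_-]*grant                              # RSU grant agreements
    """,
    re.IGNORECASE | re.VERBOSE,
)


def classify_title(title: str) -> Optional[str]:
    """Return 'skipped_non_payslip' for excluded titles, else None.

    None means "looks like a payslip, carry on to extraction".
    """
    if _NON_PAYSLIP_TITLE.search(title):
        return "skipped_non_payslip"
    return None


if __name__ == "__main__":
    # Made-up titles for illustration only.
    for title in ("P60 2024-25", "RSU Grant Agreement", "Payslip 2025-03"):
        print(title, "->", classify_title(title))
```

Checking the title before any extraction call keeps the skip cheap, which is the motivation the commit message gives. The lookarounds around `p60` and `psc` are one way to avoid matching those letters inside longer words; whether the real filter does this is not stated in the log.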