0001_initial.py declares revision='0001', not '0001_initial'. My 0002
migration had down_revision='0001_initial', so Alembic couldn't link
it into the chain, and 'upgrade head' silently ran as a no-op on pod
startup. The rsu_vest/rsu_offset columns never got created and every
INSERT from the new code failed with 'column does not exist'.
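The mismatch can be reproduced without a database. A minimal sketch, assuming each migration is reduced to a (revision, down_revision) pair; find_broken_links is a hypothetical helper for illustration, not part of Alembic:

```python
from __future__ import annotations

from typing import Optional


def find_broken_links(migrations: dict[str, Optional[str]]) -> list[str]:
    """Return revisions whose down_revision names no declared revision.

    `migrations` maps revision id -> down_revision (None for the root).
    """
    known = set(migrations)
    return sorted(
        rev
        for rev, down in migrations.items()
        if down is not None and down not in known
    )


# The failing layout: 0001_initial.py declares revision='0001', but the
# 0002 migration pointed at the filename stem instead of the revision id.
broken = {"0001": None, "0002": "0001_initial"}
fixed = {"0001": None, "0002": "0001"}

print(find_broken_links(broken))  # ['0002']
print(find_broken_links(fixed))   # []
```

A check like this could run in CI so a dangling down_revision fails the build instead of surfacing at pod startup.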
UK payslips for equity-comp employees report RSU vests as notional pay
for HMRC only. A paired same-magnitude deduction (Shares Retained /
Stock Tax Withholding / RSU Offset) nets it back out of cash. The UK
payslip's income_tax line shows tax on the grossed-up total, but the
actual RSU tax is handled by Schwab (US broker) via share sale. No
cash flows through UK payroll for RSU.
Previously the extractor folded RSU notional into gross_pay and
income_tax, which inflated the dashboard numbers: a payslip with a
£25k RSU vest looked like 2x salary with an 80% tax rate.
Changes:
- schema: add rsu_vest + rsu_offset fields (default 0).
- db + alembic 0002: add two new NUMERIC(12,2) columns with server
default 0 (backward-compatible; existing rows get 0).
- validate_totals: include rsu_offset in deductions sum so the
gross + rsu_vest inflation is properly netted out.
- extraction prompt: explicit rules for identifying RSU lines by the
common Meta/Sage/Workday labels, and an instruction NOT to put them in
other_deductions.
Dashboards in a follow-up commit: cash_gross = gross_pay - rsu_vest,
effective tax rate based on cash metrics.
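The netting arithmetic can be sketched as follows. This is an illustration, not the repo's validate_totals: it assumes gross_pay already includes the RSU notional, and national_insurance, net_pay, and the tolerance are hypothetical names and values chosen for the example.

```python
from decimal import Decimal


def validate_totals(
    gross_pay: Decimal,
    income_tax: Decimal,
    national_insurance: Decimal,  # hypothetical field name
    other_deductions: Decimal,
    rsu_offset: Decimal,
    net_pay: Decimal,             # hypothetical field name
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check gross - deductions == net, counting rsu_offset as a deduction."""
    deductions = income_tax + national_insurance + other_deductions + rsu_offset
    return abs(gross_pay - deductions - net_pay) <= tolerance


def cash_gross(gross_pay: Decimal, rsu_vest: Decimal) -> Decimal:
    """cash_gross = gross_pay - rsu_vest, as planned for the dashboards."""
    return gross_pay - rsu_vest


# Illustrative figures only: 5,000 salary plus a 25,000 notional RSU vest.
slip = dict(
    gross_pay=Decimal("30000.00"),
    income_tax=Decimal("1200.00"),
    national_insurance=Decimal("400.00"),
    other_deductions=Decimal("0.00"),
    rsu_offset=Decimal("25000.00"),
    net_pay=Decimal("3400.00"),
)
print(validate_totals(**slip))                            # True
print(cash_gross(slip["gross_pay"], Decimal("25000.00")))  # 5000.00
```

Leaving rsu_offset out of the deductions sum would put the same slip 25,000 out of balance, which is the failure this change avoids.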
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Paperless 'payslip' tag has been applied over the years to P60 annual
summaries, performance/year-end letters, Compensation_EMEA/PSC letters,
comp-review letters, and RSU grant agreements. These are legitimate
financial docs but not monthly payslips, and including them pollutes
the dashboards (a P60 amount is ~12x a single month).
Filter by title regex before hitting Claude so we skip cheaply and
don't burn extraction credit on them. Status returned is
'skipped_non_payslip' to distinguish from the 'already-ingested' skip.
Covers: P60*, *performance*(letter|year-end)*, compensation_emea,
*psc*, comp-letter, rsu grant*. New parameterized tests cover both
the exclude list and representative real payslip titles.
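A rough sketch of such a title filter, assuming case-insensitive matching and treating the globs above as regex alternatives. The names NON_PAYSLIP_TITLE and is_non_payslip are hypothetical, and the real pattern in the repo may differ:

```python
import re

# One alternative per entry in the exclude list. Anchored entries (P60*,
# "rsu grant*") must start the title; the rest may match anywhere.
NON_PAYSLIP_TITLE = re.compile(
    r"""
      ^p60                              # P60*
    | performance.*(letter|year-end)    # *performance*(letter|year-end)*
    | compensation_emea
    | psc                               # *psc*
    | comp-letter
    | ^rsu\ grant                       # rsu grant*
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_non_payslip(title: str) -> bool:
    """True when the title looks like a P60, letter, or grant document."""
    return NON_PAYSLIP_TITLE.search(title.strip()) is not None


for title in ["P60 2022-23", "RSU Grant Agreement", "Payslip 2023-03"]:
    print(title, "->", is_non_payslip(title))
# P60 2022-23 -> True
# RSU Grant Agreement -> True
# Payslip 2023-03 -> False
```

Running the check on the title alone means an excluded document never reaches the extraction call.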
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A single doc that isn't a real payslip (e.g., an RSU letter wrongly tagged
as payslip in Paperless) makes Claude return pay_date=null, which pydantic
rejects with ValidationError. Previously this killed the whole backfill at
the first bad doc, leaving 60 of 88 docs unprocessed.
Catch + log + continue so the backfill processes every doc. Failed docs
can be re-tagged or fixed individually later.
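The catch + log + continue shape looks roughly like this. It is a sketch, not the repo's backfill loop: backfill and ingest_one are hypothetical names, and ValueError stands in for pydantic's ValidationError so the example runs without pydantic installed.

```python
import logging

logger = logging.getLogger("backfill")


def backfill(doc_ids, ingest_one):
    """Ingest every doc; a failing doc is logged and skipped, not fatal."""
    processed, failed = [], []
    for doc_id in doc_ids:
        try:
            ingest_one(doc_id)
        except ValueError as exc:  # stand-in for pydantic's ValidationError
            logger.warning("skipping doc %s: %s", doc_id, exc)
            failed.append(doc_id)
            continue
        processed.append(doc_id)
    return processed, failed


def fake_ingest(doc_id):
    # Simulates a mis-tagged document that yields pay_date=null.
    if doc_id == 3:
        raise ValueError("pay_date: none is not an allowed value")


print(backfill([1, 2, 3, 4], fake_ingest))  # ([1, 2, 4], [3])
```

Returning the failed ids makes it straightforward to re-tag or retry those documents afterwards.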
Without this, each extraction took 5-10 minutes because the base64'd PDF
expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which
turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku)
then processes the text in seconds.
- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if
pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path and still handles
PDF_BASE64 as the fallback.
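The text-first path with a base64 fallback can be sketched like this. It is not the actual _build_prompt(): build_prompt_payload, the 30s timeout, and the (kind, payload) return shape are assumptions for illustration, and the temp-file handling assumes a POSIX system.

```python
from __future__ import annotations

import base64
import shutil
import subprocess
import tempfile


def build_prompt_payload(pdf_bytes: bytes) -> tuple[str, str]:
    """Return ("PAYSLIP_TEXT", text) when pdftotext works, else base64."""
    if shutil.which("pdftotext"):  # missing on some local dev machines
        with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
            tmp.write(pdf_bytes)
            tmp.flush()
            try:
                # "-" as the output file makes pdftotext write to stdout.
                result = subprocess.run(
                    ["pdftotext", "-layout", tmp.name, "-"],
                    capture_output=True,
                    timeout=30,
                )
            except (OSError, subprocess.TimeoutExpired):
                result = None
        if result is not None and result.returncode == 0:
            text = result.stdout.decode("utf-8", errors="replace").strip()
            if text:  # scanned-image PDFs yield no usable text
                return "PAYSLIP_TEXT", text
    return "PDF_BASE64", base64.b64encode(pdf_bytes).decode("ascii")


kind, payload = build_prompt_payload(b"not a pdf")
print(kind)  # PDF_BASE64
```

Any failure mode (binary missing, unreadable PDF, empty output) lands on the same fallback, so the caller only has to handle two payload kinds.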
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real UK payslip extractions routinely take 5-10min end-to-end (Haiku
processing 100-300KB base64'd PDFs). With 10 retries × 5s = 50s we'd
abort while another extraction was still in-flight. Bump to 90 retries
× 10s = 900s of waiting, enough to cover the server-side
timeout_seconds=600 plus some slack.
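The wait budget is simply retries × interval. A small sketch of a polling loop with that budget; wait_until_idle and is_busy are hypothetical names, and sleep is injectable only so the example runs instantly:

```python
import time


def wait_until_idle(is_busy, retries=90, interval_s=10.0, sleep=time.sleep):
    """Poll is_busy() up to `retries` times, sleeping between polls.

    Returns True once idle, False when the retries * interval_s budget
    is exhausted.
    """
    for _ in range(retries):
        if not is_busy():
            return True
        sleep(interval_s)
    return False


def simulate(busy_polls, retries, interval_s):
    """Run the loop against an extraction that stays busy for N polls."""
    waited = []
    states = iter([True] * busy_polls + [False])
    ok = wait_until_idle(lambda: next(states), retries, interval_s, waited.append)
    return ok, sum(waited)


# An extraction that holds the slot for 600s (the server-side timeout):
print(simulate(60, retries=90, interval_s=10))   # (True, 600)
# Enough busy polls to exhaust the old 10 x 5s budget:
print(simulate(120, retries=10, interval_s=5))   # (False, 50)
```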
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real UK payslip PDFs are 100-200KB base64'd, which means ~300-500KB of prompt
tokens. Claude (even Haiku) takes 1-5 minutes to process and emit structured JSON.
The original 120s ceiling timed out before extraction could finish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The payslip_ingest schema must exist before Alembic creates its
alembic_version tracking table inside it. Add CREATE SCHEMA IF NOT EXISTS
at the top of do_run_migrations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI
can watch it. Identical content to /home/wizard/code commit e426028.
See README.md for overview, env vars, and Paperless workflow config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>