processor + parser: fix 3 backfill failure modes

## Context

After the first v2 backfill (commit f62c533), 72 of 73 real payslips
landed correctly, but three residual failure modes remained:

1. **doc_id=215** — a 1442-byte empty-text PDF that Claude
   hallucinated a `pay_date=1900-01-01 / gross=0 / net=0` row for.
   Data poison waiting to happen.
2. **doc_id=39** — a P60 End of Year Certificate. Got tagged
   `payslip` in Paperless, has no Paperless title, so the title-based
   filter couldn't catch it; the regex parser then happily pulled
   bogus numbers out of the P60 layout.
3. **doc_id=49** — a real June 2021 variant-A payslip with an
   `EE Discount BIK` line in BOTH Payments and Deductions at 12.00.
   The parser was configured to drop `EE Discount BIK` from
   `other_deductions` (treating it as a known mapped field), which
   caused validate_totals to fail by exactly 12.00.

## This change

### processor.py — defence in depth

- **`NON_PAYSLIP_CONTENT_RE`** — new regex run against the first
  500 chars of pdftotext output. Catches `P60 End of Year
  Certificate`, `Employer's summary.+tax year ending`, and
  `Take-home income per month` (Viktor's comp estimation
  spreadsheet). Scoping to the first 500 chars avoids flagging a
  legit payslip that mentions "P60" in a footer.
- **Post-extraction sanity checks** — reject an extraction if
  `pay_date.year < 2010` (Viktor joined Meta in 2019) or if
  `gross_pay == net_pay == 0`. These raise rather than insert,
  so the backfill's existing `except Exception` block logs and
  continues without poisoning the DB. This covers the 1900-01-01
  case that would otherwise slip through.
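
The two guards above can be sketched in isolation as follows. This is a minimal illustration, not the code in `processor.py`: the helper names `looks_like_non_payslip` and `check_plausible` and the constant `PEEK_CHARS` are hypothetical, and only the regex alternation and the thresholds are taken from this change.

```python
import re
from datetime import date
from decimal import Decimal

# Same alternation as NON_PAYSLIP_CONTENT_RE in this change.
NON_PAYSLIP_CONTENT_RE = re.compile(
    r"P60 End of Year Certificate"
    r"|Employer's summary.+tax year ending"
    r"|Take-home income per month",
    re.IGNORECASE,
)

PEEK_CHARS = 500  # hypothetical name; the change uses a literal 500


def looks_like_non_payslip(text: str) -> bool:
    """True if a non-payslip signature appears in the first PEEK_CHARS chars."""
    return NON_PAYSLIP_CONTENT_RE.search(text[:PEEK_CHARS]) is not None


def check_plausible(pay_date: date, gross_pay: Decimal, net_pay: Decimal) -> None:
    """Raise ValueError for the two hallucination patterns (hypothetical helper)."""
    if pay_date.year < 2010:
        raise ValueError(f"implausible pay_date={pay_date}")
    if gross_pay == 0 and net_pay == 0:
        raise ValueError("zero gross and net")


# A P60 announces itself at the top of the text.
p60 = "P60 End of Year Certificate\nTax year to 5 April 2021\n"

# A payslip that only mentions the P60 beyond the first 500 chars.
payslip_with_footer = (
    "Salary 5,096.65\n" + "x" * 600 + "\nSee your P60 End of Year Certificate"
)

print(looks_like_non_payslip(p60))                  # True
print(looks_like_non_payslip(payslip_with_footer))  # False
```

Raising from the sanity check, rather than returning a status, relies on the caller already catching and logging exceptions, as the backfill loop is described as doing.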

### meta_uk.py — variant A BIK fix

Removed `EE Discount BIK` from `VARIANT_A_DEDUCTIONS_KNOWN`. That
set filters items OUT of `other_deductions` (because they have
dedicated schema fields). `EE Discount BIK` has no dedicated
field — it should stay in `other_deductions` like Private Dental
and Private Medical so the validation math balances.
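
To see why dropping the line broke validation by exactly 12.00, the arithmetic can be replayed with the doc_id=49 figures from the fixture added in this commit. The sketch below assumes the check is `gross - deductions ≈ net` within 2p, as the fixture validation test's docstring states; `totals_balance` is a hypothetical stand-in, not the real `validate_totals`.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # "within 2p", per the fixture validation test


def totals_balance(gross: Decimal, deductions: dict[str, Decimal], net: Decimal) -> bool:
    """Hypothetical stand-in for validate_totals: gross - deductions ~= net."""
    total = sum(deductions.values(), Decimal("0"))
    return abs(gross - total - net) <= TOLERANCE


# "This Period" figures from the June 2021 fixture (doc_id=49).
gross = Decimal("5095.86")
net = Decimal("2906.51")
deductions = {
    "Tax": Decimal("1410.07"),
    "National Insurance": Decimal("423.17"),
    "Student Loans": Decimal("244.00"),
    "Private Dental Insurance": Decimal("15.61"),
    "Private Medical Insurance": Decimal("84.50"),
    "EE Discount BIK": Decimal("12.00"),
}

# Old behaviour: the BIK line was filtered out of the deductions.
without_bik = {k: v for k, v in deductions.items() if k != "EE Discount BIK"}
shortfall = gross - sum(without_bik.values(), Decimal("0")) - net

print(totals_balance(gross, deductions, net))   # True
print(totals_balance(gross, without_bik, net))  # False
print(shortfall)                                # 12.00
```

The six deductions sum to 2,189.35, which matches the Deductions `Total` line in the fixture.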

### Fixtures + tests

- New fixture `meta_uk_2021_06_variant_a_bik.txt` — real
  pdftotext from doc_id=49 — encodes the BIK-in-both-columns
  case so a regression would fail this fixture's validation test.
- `test_parses_variant_a_with_ee_discount_bik` — explicitly
  asserts `EE Discount BIK` lands in `other_deductions`.
- `test_rejects_implausible_pay_date`, `test_rejects_zero_gross_zero_net`
  — cover the two sanity-check branches.
- `test_skips_p60_by_content_when_title_is_null` — covers the
  content-based non-payslip filter.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 57 passed in 2.42s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy + re-run backfill)

Expected DB shape:
- Total rows ≈ 71 (88 paperless tags − 15 non-payslip titles −
  2 null-title non-payslips caught by content filter)
- `validated = true` on ≥99% of rows
- No `pay_date < 2010` rows
- No rows with employer IS NULL
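
The last three expectations could be checked mechanically once the rows are loaded. The sketch below is an assumption-heavy illustration: it takes rows as plain dicts with `pay_date`, `validated` and `employer` keys (mirroring the column names mentioned above) and does not touch the real database or schema; `check_backfill_shape` is a hypothetical helper.

```python
from datetime import date


def check_backfill_shape(rows: list[dict], min_validated_ratio: float = 0.99) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    if not rows:
        return ["no rows"]
    problems = []
    validated = sum(1 for r in rows if r["validated"])
    if validated / len(rows) < min_validated_ratio:
        problems.append(f"only {validated}/{len(rows)} rows validated")
    early = [r for r in rows if r["pay_date"].year < 2010]
    if early:
        problems.append(f"{len(early)} rows with pay_date before 2010")
    no_employer = [r for r in rows if r["employer"] is None]
    if no_employer:
        problems.append(f"{len(no_employer)} rows with NULL employer")
    return problems


# Illustrative rows only; not real backfill output.
sample = [
    {"pay_date": date(2021, 6, 30), "validated": True, "employer": "Facebook UK Ltd"},
    {"pay_date": date(1900, 1, 1), "validated": False, "employer": None},
]
for problem in check_backfill_shape(sample):
    print(problem)
```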

## Reproduce locally

1. `cd payslip-ingest && poetry run pytest`
2. Expected: 57 passed, including the 3 new processor tests and
   the 5 parametrised fixture-total-validation tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 12:00:00 +00:00
parent f62c5332e3
commit d91f34ddb4
5 changed files with 141 additions and 1 deletions

@@ -328,7 +328,6 @@ VARIANT_A_DEDUCTIONS_KNOWN = {
"Student Loans",
"Student Loan",
"RSU Net Gain",
"EE Discount BIK",
}
VARIANT_A_RSU_LABELS = {

@@ -33,6 +33,17 @@ NON_PAYSLIP_TITLE_RE = re.compile(
re.IGNORECASE,
)
# Some Paperless docs have no title at all — the title filter can't catch
# them. These are detected by content signature in the pdftotext output.
# Only apply to the first ~500 chars so we don't accidentally false-positive
# a real payslip that happens to mention "P60" in a footnote somewhere.
NON_PAYSLIP_CONTENT_RE = re.compile(
r"P60 End of Year Certificate"
r"|Employer's summary.+tax year ending"
r"|Take-home income per month",
re.IGNORECASE,
)
PDFTOTEXT_PATH = shutil.which("pdftotext")
@@ -71,8 +82,25 @@ async def process_document(
return ProcessResult(doc_id=doc_id, status="skipped_non_payslip")
pdf_bytes = await paperless.download_document(doc_id)
# Content-level non-payslip check (catches P60s with no Paperless title,
# personal income spreadsheets, etc.) before we burn extractor budget.
text_peek = _pdftotext(pdf_bytes) or ""
if NON_PAYSLIP_CONTENT_RE.search(text_peek[:500]):
log.info("skipping doc_id=%s — content matches non-payslip signature", doc_id)
return ProcessResult(doc_id=doc_id, status="skipped_non_payslip")
extracted, which = await _extract(pdf_bytes, metadata, extractor)
# Sanity check: Viktor joined Meta UK in 2019. A pay_date earlier than
# 2010, or zero gross AND zero net, almost certainly means the extractor
# hallucinated on a non-payslip PDF that slipped past the earlier filters.
# Reject rather than poison the DB with a 1900-01-01 ghost row.
if extracted.pay_date.year < 2010:
raise ValueError(
f"doc_id={doc_id} extractor={which} produced implausible pay_date={extracted.pay_date}")
if extracted.gross_pay == 0 and extracted.net_pay == 0:
raise ValueError(f"doc_id={doc_id} extractor={which} produced zero gross and net")
validated = validate_totals(extracted)
if not validated:
log.warning(

@@ -0,0 +1,43 @@
254680A Mr Viktor Barzin Facebook UK Ltd
NI No : SZ762223D NI Letter : A Tax Code : 0T Pay By : BACS Date : 30 Jun 2021 Period : 3
Description Rate Units This Period This Year
Salary 5,096.65 15,289.95
AE Pension (152.90) (458.70)
Laundry Expense 40.00 120.00
RSU Gain Taxable 8,518.63
RSU Gain Nicable 8,518.63
Transportation Allowance 73.10
RSU Net Cash UK 31.38
Private Dental Insurance 15.61 46.83
Private Medical Insurance 84.50 253.50
EE Discount BIK 12.00 12.00
Total 5,095.86
Tax 1,410.07 7,657.00
National Insurance 423.17 1,440.88
RSU Net Gain 13,033.50
Private Dental Insurance 15.61 46.83
Private Medical Insurance 84.50 253.50
EE Discount BIK 12.00 12.00
Student Loans 244.00 1,504.00
Total 2,189.35
Tax District : Pay As You Earn
Tax Reference : 846/BA09294 Net Pay 2,906.51
Taxable Pay : This Period £5095.86 : To Date £23855.31
Employers NIC This Period : 587.71
Employers NIC To Date : 2,945.48
Employers Pension This Period : 458.70
Employers Pension To Date :1,376.10

@@ -86,6 +86,31 @@ def test_parses_variant_c_2022_11() -> None:
assert result.ytd_gross == Decimal("131034.64")
def test_parses_variant_a_with_ee_discount_bik() -> None:
"""June 2021 — variant A. Real pdftotext from doc_id=49.
Has an `EE Discount BIK` line present in BOTH Payments AND Deductions
blocks with value 12.00. Needs to land in `other_deductions` so the
validation formula accounts for it (earlier parser version filtered
it out, causing an off-by-12.00 validation failure).
"""
result = parse_meta_uk(_load("meta_uk_2021_06_variant_a_bik.txt"))
assert result.pay_date == date(2021, 6, 30)
assert result.salary == Decimal("5096.65")
assert result.pension_sacrifice == Decimal("152.90")
assert result.rsu_vest == Decimal("0") # RSU lines are YTD-only this period
assert result.rsu_offset == Decimal("0")
assert result.gross_pay == Decimal("5095.86")
assert result.income_tax == Decimal("1410.07")
assert result.national_insurance == Decimal("423.17")
assert result.student_loan == Decimal("244.00")
assert result.net_pay == Decimal("2906.51")
assert result.other_deductions.get("Private Dental Insurance") == Decimal("15.61")
assert result.other_deductions.get("Private Medical Insurance") == Decimal("84.50")
assert result.other_deductions.get("EE Discount BIK") == Decimal("12.00")
def test_parses_variant_a_2021_08() -> None:
"""Aug 2021 — variant A. Real pdftotext from doc_id=43.
@@ -135,6 +160,7 @@ def test_raises_on_empty_text() -> None:
"meta_uk_2025_03.txt",
"meta_uk_2022_11_variant_c.txt",
"meta_uk_2021_08_variant_a.txt",
"meta_uk_2021_06_variant_a_bik.txt",
])
def test_all_fixtures_validate_totals(fixture_name: str) -> None:
"""Every fixture must satisfy gross - deductions ≈ net within 2p."""

@@ -202,3 +202,47 @@ async def test_regex_miss_falls_back_to_claude(paperless: AsyncMock, extractor:
assert result.status == "inserted"
assert result.extractor == "claude"
extractor.extract.assert_awaited_once()
async def test_rejects_implausible_pay_date(paperless: AsyncMock, extractor: AsyncMock) -> None:
"""Reject 1900-01-01-style hallucinations before they poison the DB."""
bad = _sample_extraction()
bad_dict = bad.model_dump()
bad_dict["pay_date"] = date(1900, 1, 1)
extractor.extract.return_value = ExtractedPayslip.model_validate(bad_dict)
factory = _SessionFactory([_FakeSession(existing_ids=[])])
with pytest.raises(ValueError, match="implausible pay_date"):
await process_document(42, factory, paperless, extractor)
async def test_skips_p60_by_content_when_title_is_null(paperless: AsyncMock, extractor: AsyncMock,
monkeypatch: pytest.MonkeyPatch) -> None:
"""P60s get the `payslip` tag sometimes, and some have no title in Paperless.
The title filter can't catch them, so we also check the pdftotext output
for the `P60 End of Year Certificate` signature before hitting the
extractor.
"""
paperless.get_document.return_value = {"id": 42, "title": None}
monkeypatch.setattr(processor, "_pdftotext",
lambda _: "P60 End of Year Certificate\nTax year to 5 April 2021\n")
factory = _SessionFactory([_FakeSession(existing_ids=[])])
result = await process_document(42, factory, paperless, extractor)
assert result.status == "skipped_non_payslip"
extractor.extract.assert_not_called()
async def test_rejects_zero_gross_zero_net(paperless: AsyncMock, extractor: AsyncMock) -> None:
"""Reject the other common hallucination: all zeros on a non-payslip."""
bad = _sample_extraction()
bad_dict = bad.model_dump()
bad_dict["gross_pay"] = Decimal("0")
bad_dict["net_pay"] = Decimal("0")
extractor.extract.return_value = ExtractedPayslip.model_validate(bad_dict)
factory = _SessionFactory([_FakeSession(existing_ids=[])])
with pytest.raises(ValueError, match="zero gross and net"):
await process_document(42, factory, paperless, extractor)