meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)

## Context

The initial v2 parser (commit 9741816) only handled the modern template
(variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from
2021-07 through 2023-11 failed entirely — Claude fallback hit errors on
them and the rows never landed. Investigation via `kubectl exec` +
pdftotext on a sample of the failing docs revealed two previously-unseen
layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/
  This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31
  Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines:
  `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the
  earnings side with a matching `RSU Net Gain` on the deductions side.
  BIK items (Private Dental/Medical) appear on both sides, so they cancel
  out in net pay, but the deduction-side copy must land in
  `other_deductions` for the validation formula to hold.

- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|
  Year To Date (note capital "To", vs variant B's lowercase "to"). Date
  format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the
  abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU
  Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.
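
The variant A BIK point can be checked by hand against the 2021-08
fixture. This is a minimal sketch, assuming the validation formula is
"gross minus the sum of deductions equals net pay, within 2p"; the formula
is not spelled out in this message, and the `residual` helper and dict
names are hypothetical. The figures are the "This Period" values from
`meta_uk_2021_08_variant_a.txt`.

```python
from decimal import Decimal

# Figures copied from the variant A fixture ("This Period" column).
gross = Decimal("25738.37")    # first Total line
net_pay = Decimal("2678.54")   # trailing Net Pay line

# Deduction-side lines, excluding the BIK copies.
deductions = {
    "Tax": Decimal("5500.87"),
    "National Insurance": Decimal("627.72"),
    "RSU Net Gain": Decimal("15666.13"),
    "Student Loans": Decimal("1165.00"),
}
# Deduction-side BIK copies (these also appear on the payments side).
bik_deductions = {
    "Private Dental Insurance": Decimal("15.61"),
    "Private Medical Insurance": Decimal("84.50"),
}

TOLERANCE = Decimal("0.02")  # assumed 2p tolerance


def residual(include_bik: bool) -> Decimal:
    """Return gross - deductions - net; should be ~0 when totals validate."""
    total = sum(deductions.values(), Decimal("0"))
    if include_bik:
        total += sum(bik_deductions.values(), Decimal("0"))
    return gross - total - net_pay


print(residual(include_bik=True))   # 0.00
print(residual(include_bik=False))  # 100.11
```

Dropping the deduction-side BIK lines leaves a 100.11 residual (15.61 +
84.50), which is why they cannot be discarded as "net zero" noise.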

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches
  all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's
  accounting-style parenthesized negatives normalize to `-1234.56` in
  `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the
  side-by-side header with `Year [Tt]o Date`); their parsers share code
  because the column mechanics are identical — only the RSU-label set and
  date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the
  two `Total ...` anchors (payments → deductions), `pay_date` from the
  header's `Date : ...`, gross from the first `Total`, net from the
  trailing `Net Pay` line, `taxable_pay` from the
  `Taxable Pay : This Period £X` line at the bottom.
- `RSU_VEST_LABELS` is a shared set covering 8 aliases; `rsu_vest` sums
  every matching payment line. `rsu_offset` maps to `RSU Net Gain` on the
  deduction side when present (absent in variant B, present in A and C).

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta
output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`)
and replaced with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because
they encode specific bonus/no-bonus scenarios and the numbers are
derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across
  all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE and re-run the backfill. Expect the 73 real payslips to
   extract via regex (≥95% hit rate), with validated rows rising from 42
   to 70+.
2. Sample a row for each variant via psql: `employer`, `rsu_vest`, and
   `taxable_pay` should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 11:52:59 +00:00
parent 974181674d
commit f62c5332e3
6 changed files with 439 additions and 263 deletions

@@ -1,21 +0,0 @@
Facebook UK Limited Payslip
Employee: Viktor Barzin NI Number: AA123456A
Employee No: 254680 Tax Code: 1185L
Pay Date: 31/07/2019 Pay Period: 4
Period Start: 01/07/2019 Period End: 31/07/2019
Description This Period This Year
---------------------------------------------------------------------
Salary 7,083.33 28,333.32
Taxable Pay 6,583.33 26,333.32
Tax 1,480.00 5,920.00
National Insurance 564.73 2,258.92
AE Pension EE 500.00 2,000.00
Student Loans 120.00 480.00
---------------------------------------------------------------------
Gross Pay: 7,083.33
Net Pay: 4,418.60

@@ -0,0 +1,44 @@
254680A Mr Viktor Barzin Facebook UK Ltd
NI No : SZ762223D NI Letter : A Tax Code : 0T Pay By : BACS Date : 31 Aug 2021 Period : 5
Description Rate Units This Period This Year
Salary 5,096.65 25,483.25
AE Pension (152.90) (764.50)
Laundry Expense 40.00 200.00
Relocation Bonus 8,184.20
RSU Gain Taxable 10,239.30 18,757.93
RSU Gain Nicable 10,239.30 18,757.93
Transportation Allowance 73.10
RSU Net Cash UK 175.91 207.29
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Total 25,738.37
Tax 5,500.87 17,836.73
National Insurance 627.72 2,655.22
RSU Net Gain 15,666.13 28,699.63
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Student Loans 1,165.00 3,649.00
Total 23,059.83
Tax District : Pay As You Earn
Tax Reference : 846/BA09294 Net Pay 2,678.54
Taxable Pay : This Period £15323.16 : To Date £52446.53
Employers NIC This Period : 1,999.08
Employers NIC To Date : 6,660.03
Employers Pension This Period : 458.70
Employers Pension To Date :2,293.50

@@ -0,0 +1,60 @@
Page : 1
Employee Number: 254680
Facebook UK Limited
PRIVATE & CONFIDENTIAL
Viktor Barzin
Flat 37
Spenlow Apartments
London
N1 7GH
Company Name : Facebook UK Limited Tax Ref : 846/BA09294 Nat Ins No : SZ762223D
Pay Date : 30.11.2022 Tax Code : 0T Nat Ins Cat: A
Pay Method : BACS Transfer Tax Basis : 0 Cost Centre: 4220
Pay Period : 08/2022 Tax Period : 08
Payments Units Rate Amount Deductions Amount Year To Date Amount
Salary 8,983.33 Tax paid 5,800.07 Salary 8,983.33
AE Pension EE -539.00 Employee NIC 612.65 RSU Gain Taxabl 7,531.31
RSU Gain Taxabl 7,531.31 Student Loan 1,233.00 RSU Gain Nicabl 7,531.31
RSU Gain Nicabl 7,531.31 RSU Net Gain 11,522.91 RSU Net Cash 129.38
RSU Net Cash 129.38 Student Loan 7,271.00
RSU Net Gain -11,522.91
ER Pension This Period
AE Pension ER 808.50
Total Payment: 23,636.33 Total Deduction : 19,168.63 Net Pay: 4,467.70
This Period Amount Year To Date Amount Gross Benefits Payments
----------------------------------
Total Gross: 23,636.33 Total Gross: 131,034.64 Dent Ins TaxB 17.83
Non-Tax Ded: 0.00 Non-Tax ded: 0.00 Medi Ins TaxB 76.67
Taxable Pay: 16,070.14 Taxable Pay: 99,784.08
Tax Paid: 5,800.07 Tax Paid: 34,886.93
EEs NI : 612.65 EEs NI: 5,361.55
ERs NI : 2,100.04 ERs NI: 13,800.83
EEs Pension: -539.00 EEs Pension: -3,596.28
EEs AVC: 0.00 EEs AVC 0.00
ERs Pension: 5,504.38

@@ -1,24 +0,0 @@
Facebook UK Limited Payslip
Employee: Viktor Barzin NI Number: AA123456A Pay Date: 27/03/2024
Employee No: 254680 Tax Code: 1257L Pay Period: 12
Department: Engineering Period Start: 01/03/2024
Period End: 31/03/2024
Payments Units Rate Amount Deductions Amount Year to Date Amount
Salary 9,500.00 Tax paid 800.00 Salary 114,000.00
Perform Bonus 0.00 Employee NIC 280.00 Transportation 820.50
AE Pension EE -6,200.00 Student Loans 90.00
--------- ---------
Total Payment: 3,300.00 Total Deduction : 1,170.00 Net Pay: 2,130.00
This Period Amount Year To Date Amount
Total Gross: 3,300.00 Total Gross: 210,000.00
Taxable Pay: 3,300.00 Taxable Pay: 185,000.00
Tax Paid: 800.00 Tax Paid: 42,000.00
EEs NI: 280.00 EEs NI: 9,100.00
EEs Pension: -6,200.00 EEs Pension: -52,000.00