payslip-ingest/tests/fixtures/meta_uk_2022_11_variant_c.txt
Viktor Barzin f62c5332e3 meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)
## Context

The initial v2 parser (commit 9741816) only handled the modern template
(variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from
2021-07 through 2023-11 failed entirely — Claude fallback hit errors on
them and the rows never landed. Investigation via `kubectl exec` +
pdftotext on a sample of the failing docs revealed two previously-unseen
layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/
  This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31
  Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines:
  `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the
  earnings side with a matching `RSU Net Gain` on the deductions side.
  BIK items (Private Dental/Medical) appear on both sides — net zero in
  the gross, but the deduction-side copy must land in other_deductions
  for the validation formula to hold.

- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|
  Year To Date (note capital "To", vs variant B's lowercase "to"). Date
  format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the
  abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU
  Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches
  all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's
  accounting-style parenthesized negatives normalize to `-1234.56` in
  `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the
  side-by-side header with `Year [Tt]o Date`); their parsers share code
  because the column mechanics are identical — only the RSU-label set and
  date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the
  two `Total ...` anchors (payments → deductions), pay_date from the
  header's `Date : ...`, gross from first Total, net from the trailing
  `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X`
  line at the bottom.
- RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every
  matching payment line. rsu_offset maps to `RSU Net Gain` on the
  deduction side when present (absent in variant B, present in A and C).

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta
output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`)
and replaced with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because
they encode specific bonus/no-bonus scenarios and the numbers are
derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across
  all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via
   regex (≥95% hit rate), 42 → 70+ validated rows.
2. Sample a row for each variant via psql: employer, rsu_vest, and
   taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00

60 lines
3.4 KiB
Text
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Page : 1
Employee Number: 254680
Facebook UK Limited
PRIVATE & CONFIDENTIAL
Viktor Barzin
Flat 37
Spenlow Apartments
London
N1 7GH
Company Name : Facebook UK Limited Tax Ref : 846/BA09294 Nat Ins No : SZ762223D
Pay Date : 30.11.2022 Tax Code : 0T Nat Ins Cat: A
Pay Method : BACS Transfer Tax Basis : 0 Cost Centre: 4220
Pay Period : 08/2022 Tax Period : 08
Payments Units Rate Amount Deductions Amount Year To Date Amount
Salary 8,983.33 Tax paid 5,800.07 Salary 8,983.33
AE Pension EE -539.00 Employee NIC 612.65 RSU Gain Taxabl 7,531.31
RSU Gain Taxabl 7,531.31 Student Loan 1,233.00 RSU Gain Nicabl 7,531.31
RSU Gain Nicabl 7,531.31 RSU Net Gain 11,522.91 RSU Net Cash 129.38
RSU Net Cash 129.38 Student Loan 7,271.00
RSU Net Gain -11,522.91
ER Pension This Period
AE Pension ER 808.50
Total Payment: 23,636.33 Total Deduction : 19,168.63 Net Pay: 4,467.70
This Period Amount Year To Date Amount Gross Benefits Payments
----------------------------------
Total Gross: 23,636.33 Total Gross: 131,034.64 Dent Ins TaxB 17.83
Non-Tax Ded: 0.00 Non-Tax ded: 0.00 Medi Ins TaxB 76.67
Taxable Pay: 16,070.14 Taxable Pay: 99,784.08
Tax Paid: 5,800.07 Tax Paid: 34,886.93
EEs NI : 612.65 EEs NI: 5,361.55
ERs NI : 2,100.04 ERs NI: 13,800.83
EEs Pension: -539.00 EEs Pension: -3,596.28
EEs AVC: 0.00 EEs AVC 0.00
ERs Pension: 5,504.38