payslip-ingest/tests/fixtures/meta_uk_2021_08_variant_a.txt
Viktor Barzin f62c5332e3 meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)
## Context

The initial v2 parser (commit 9741816) only handled the modern template
(variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from
2021-07 through 2023-11 failed entirely — Claude fallback hit errors on
them and the rows never landed. Investigation via `kubectl exec` +
pdftotext on a sample of the failing docs revealed two previously-unseen
layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/
  This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31
  Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines:
  `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the
  earnings side with a matching `RSU Net Gain` on the deductions side.
  BIK items (Private Dental/Medical) appear on both sides — net zero in
  the gross, but the deduction-side copy must land in other_deductions
  for the validation formula to hold.

- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|
  Year To Date (note capital "To", vs variant B's lowercase "to"). Date
  format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the
  abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU
  Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches
  all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's
  accounting-style parenthesized negatives normalize to `-1234.56` in
  `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the
  side-by-side header with `Year [Tt]o Date`); their parsers share code
  because the column mechanics are identical — only the RSU-label set and
  date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the
  two `Total ...` anchors (payments → deductions), pay_date from the
  header's `Date : ...`, gross from first Total, net from the trailing
  `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X`
  line at the bottom.
- RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every
  matching payment line. rsu_offset maps to `RSU Net Gain` on the
  deduction side when present (absent in variant B, present in A and C).

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta
output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`)
and replaced with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because
they encode specific bonus/no-bonus scenarios and the numbers are
derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across
  all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via
   regex (≥95% hit rate), 42 → 70+ validated rows.
2. Sample a row for each variant via psql: employer, rsu_vest, and
   taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00

44 lines
3.3 KiB
Text
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

254680A Mr Viktor Barzin Facebook UK Ltd
NI No : SZ762223D NI Letter : A Tax Code : 0T Pay By : BACS Date : 31 Aug 2021 Period : 5
Description Rate Units This Period This Year
Salary 5,096.65 25,483.25
AE Pension (152.90) (764.50)
Laundry Expense 40.00 200.00
Relocation Bonus 8,184.20
RSU Gain Taxable 10,239.30 18,757.93
RSU Gain Nicable 10,239.30 18,757.93
Transportation Allowance 73.10
RSU Net Cash UK 175.91 207.29
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Total 25,738.37
Tax 5,500.87 17,836.73
National Insurance 627.72 2,655.22
RSU Net Gain 15,666.13 28,699.63
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Student Loans 1,165.00 3,649.00
Total 23,059.83
Tax District : Pay As You Earn
Tax Reference : 846/BA09294 Net Pay 2,678.54
Taxable Pay : This Period £15323.16 : To Date £52446.53
Employers NIC This Period : 1,999.08
Employers NIC To Date : 6,660.03
Employers Pension This Period : 458.70
Employers Pension To Date :2,293.50