meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)

## Context

The initial v2 parser (commit 9741816) only handled the modern template
(variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from
2021-07 through 2023-11 failed entirely — Claude fallback hit errors on
them and the rows never landed. Investigation via `kubectl exec` +
pdftotext on a sample of the failing docs revealed two previously-unseen
layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/
  This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31
  Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines:
  `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the
  earnings side with a matching `RSU Net Gain` on the deductions side.
  BIK items (Private Dental/Medical) appear on both sides: they are
  included in the gross and deducted again, so they cancel out of net
  pay, but the deduction-side copy must land in other_deductions for the
  validation formula to hold.

- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|
  Year To Date (note capital "To", vs variant B's lowercase "to"). Date
  format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the
  abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU
  Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.
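
The header differences described above are enough to tell the three layouts apart. A minimal sketch, assuming a hypothetical `sketch_classify_layout` helper (the parser in this change does not separate B from C at detection time; it only checks for `Year [Tt]o Date`):

```python
import re

# Capital "To" marks variant C, lowercase "to" marks variant B.
SIDE_BY_SIDE_RE = re.compile(r"Year ([Tt])o Date")


def sketch_classify_layout(text: str) -> str:
    """Return "A", "B" or "C" based on the table header line (illustrative only)."""
    for line in text.splitlines():
        if "Payments" in line and "Deductions" in line:
            m = SIDE_BY_SIDE_RE.search(line)
            if m:
                return "C" if m.group(1) == "T" else "B"
        if "Description" in line and "This Period" in line and "This Year" in line:
            return "A"
    raise ValueError("no known header found")


# Header lines taken from the fixtures in this commit.
layout_a = sketch_classify_layout("Description Rate Units This Period This Year")
layout_c = sketch_classify_layout(
    "Payments Units Rate Amount Deductions Amount Year To Date Amount"
)
```

The capital/lowercase distinction is only useful for labelling a payslip; the column mechanics are the same for B and C, which is why the real detector treats them as one.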

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches
  all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's
  accounting-style parenthesized negatives normalize to `-1234.56` in
  `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the
  side-by-side header with `Year [Tt]o Date`); their parsers share code
  because the column mechanics are identical — only the RSU-label set and
  date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the
  two `Total ...` anchors (payments → deductions), pay_date from the
  header's `Date : ...`, gross from first Total, net from the trailing
  `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X`
  line at the bottom.
- `RSU_VEST_LABELS` covers 8 label aliases for the B/C parser (variant A
  uses the 6-label `VARIANT_A_RSU_LABELS`); `rsu_vest` sums every matching
  payment line. `rsu_offset` maps to `RSU Net Gain` on the deduction side
  when present (absent in variant B, present in A and C).
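
To illustrate the amount and date normalizations listed above, the sketch below reuses the `AMOUNT_RE` pattern from this change together with a simplified date helper. `to_decimal`, `DATE_FORMATS` and `sketch_parse_pay_date` are hypothetical stand-ins, not the real `_to_decimal` / `_parse_date` (which raises `ParserError` rather than `ValueError`).

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Accepts "-1,234.56" as well as accounting-style "(1,234.56)".
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")


def to_decimal(s: str) -> Decimal:
    """Normalize "(152.90)" to Decimal("-152.90") and strip thousands separators."""
    s = s.strip()
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return Decimal(s.replace(",", ""))


# (pattern, strptime format) pairs tried in order: slash (B), dot (C), word (A).
# "%b" assumes an English locale for month abbreviations such as "Aug".
DATE_FORMATS = [
    (re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})"), "%d/%m/%Y"),
    (re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})"), "%d.%m.%Y"),
    (re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})"), "%d %b %Y"),
]


def sketch_parse_pay_date(text: str) -> date:
    for pattern, fmt in DATE_FORMATS:
        m = pattern.search(text)
        if m:
            raw = re.sub(r"\s+", " ", m.group(1)).strip()
            return datetime.strptime(raw, fmt).date()
    raise ValueError("pay date not found")


# A variant A row with parenthesized negatives (from the 2021-08 fixture).
row = "AE Pension (152.90) (764.50)"
amounts = [to_decimal(m.group()) for m in AMOUNT_RE.finditer(row)]
# amounts == [Decimal("-152.90"), Decimal("-764.50")]
```

Trying the more specific `Pay Date` patterns before the bare `Date :` pattern keeps the word-format regex from being consulted on B/C payslips.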

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta
output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`)
and replaced them with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because
they encode specific bonus/no-bonus scenarios and the numbers are
derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across
  all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE, then re-run the backfill. Expect the 73 real payslips to
   extract via regex (≥95% hit rate) and validated rows to rise from 42
   to 70+.
2. Sample a row for each variant via psql: employer, rsu_vest, and
   taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.
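
The totals check can be reproduced by hand from the variant A fixture. The sketch below is an assumption about the shape of that check (the validation formula itself is not part of this diff); `sketch_totals_ok` and `TOLERANCE` are hypothetical names, and the numbers come from `meta_uk_2021_08_variant_a.txt`.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # "within 2p"


def sketch_totals_ok(
    gross: Decimal,
    net: Decimal,
    income_tax: Decimal,
    nic: Decimal,
    student_loan: Decimal,
    rsu_offset: Decimal,
    other_deductions: dict[str, Decimal],
) -> bool:
    """Assumed check: gross minus all deductions should equal net pay."""
    total_deductions = (
        income_tax
        + nic
        + student_loan
        + rsu_offset
        + sum(other_deductions.values(), start=Decimal("0"))
    )
    return abs(gross - total_deductions - net) <= TOLERANCE


common = dict(
    gross=Decimal("25738.37"),
    net=Decimal("2678.54"),
    income_tax=Decimal("5500.87"),
    nic=Decimal("627.72"),
    student_loan=Decimal("1165.00"),
    rsu_offset=Decimal("15666.13"),
)
bik = {
    "Private Dental Insurance": Decimal("15.61"),
    "Private Medical Insurance": Decimal("84.50"),
}

ok_with_bik = sketch_totals_ok(**common, other_deductions=bik)    # True
ok_without_bik = sketch_totals_ok(**common, other_deductions={})  # False, off by 100.11
```

Dropping the deduction-side BIK rows leaves the check off by 100.11, which is why those rows have to land in `other_deductions`.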

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 11:52:59 +00:00
parent 974181674d
commit f62c5332e3
6 changed files with 439 additions and 263 deletions


@@ -1,22 +1,28 @@
"""Regex-based Meta UK payslip parser. """Regex-based Meta UK payslip parser.
Meta UK payslips use a stable template that splits into two layout variants Meta UK payslips come in three layout variants across 2019-2026:
with a hard boundary at the 2022-01-31 template change:
- Variant A (pre-2022): single-column "Description / This Period / This Year" - **Variant A** (2019-mid-2022, seen as `Facebook UK Ltd`):
layout. No RSU lines (Viktor's pre-vest tenure). AE Pension EE lists as a Single-column `Description | This Period | This Year` layout. Parenthesized
positive deduction against a pre-sacrifice gross. negatives `(152.90)` = -152.90. Date format `Date : 31 Aug 2021`. RSU
labels: `RSU Gain Taxable`, `RSU Gain Nicable`, `RSU Net Cash UK`, plus a
matching `RSU Net Gain` deduction. BIK items (Private Dental/Medical,
EE Discount) appear as both earnings and deductions.
- Variant B (post-2022): side-by-side "Payments | Deductions | Year to Date" - **Variant C** (late-2022 - 2023, `Facebook UK Limited`):
three-column layout. AE Pension EE sits in the Payments column as a Side-by-side `Payments | Deductions | Year To Date` (capital "To"). Date
negative line i.e. salary sacrifice reduces Total Payment before it hits format `Pay Date : 30.11.2022` (dots). `Company Name : Facebook UK Limited`
PAYE. RSU vest arrives as two lines in Payments: "RSU Tax Offset" (the preamble. RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and
notional RSU value) and "RSU Excs Refund" (any over-withheld amount still have the `RSU Net Gain` offset.
returned). Their sum is what we attribute as `rsu_vest`.
Parser returns `ExtractedPayslip`. On any structural miss (header not found, - **Variant B** (2024+, `Facebook UK Limited`):
Pay Date missing, totals row malformed) it raises `ParserError` the caller Side-by-side `Payments | Deductions | Year to Date` (lowercase "to"). Date
falls back to ClaudeExtractor so we never silently drop a payslip. format `Pay Date: 27/02/2026` (slashes). RSU labels are `RSU Tax Offset`
+ `RSU Excs Refund`; there is NO matching offset deduction the vest
grosses up Taxable Pay and PAYE is on the grossed-up figure.
Parser returns `ExtractedPayslip`. On any structural miss it raises
`ParserError` so the caller falls back to ClaudeExtractor.
""" """
import re import re
from datetime import date, datetime from datetime import date, datetime
@@ -29,30 +35,42 @@ class ParserError(ValueError):
"""Raised when the Meta UK template cannot be matched.""" """Raised when the Meta UK template cannot be matched."""
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}") # Two amount notations:
PAY_DATE_RE = re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})") # "-1,234.56" (slashes-era) and "(1,234.56)" (variant A parenthesized)
PERIOD_START_RE = re.compile(r"Period Start:\s*(\d{2}/\d{2}/\d{4})") AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")
PERIOD_END_RE = re.compile(r"Period End:\s*(\d{2}/\d{2}/\d{4})")
EMPLOYER = "Facebook UK Limited" # Pay Date / Date — three accepted formats:
# "Pay Date: 27/02/2026"
# "Pay Date : 30.11.2022"
# "Date : 31 Aug 2021"
PAY_DATE_SLASH_RE = re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})")
PAY_DATE_DOT_RE = re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})")
PAY_DATE_WORD_RE = re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})")
PERIOD_START_RE = re.compile(r"Period Start\s*:\s*(\d{2}/\d{2}/\d{4})")
PERIOD_END_RE = re.compile(r"Period End\s*:\s*(\d{2}/\d{2}/\d{4})")
EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")
def parse_meta_uk(text: str) -> ExtractedPayslip: def parse_meta_uk(text: str) -> ExtractedPayslip:
if not text.strip(): if not text.strip():
raise ParserError("empty text") raise ParserError("empty text")
if "Facebook UK Limited" not in text and "Meta Platforms" not in text: employer_match = EMPLOYER_RE.search(text)
if not employer_match:
raise ParserError("does not look like a Meta UK payslip") raise ParserError("does not look like a Meta UK payslip")
employer = employer_match.group(0)
lines = text.splitlines() lines = text.splitlines()
if _is_variant_b(lines): if _is_variant_b_or_c(lines):
return _parse_variant_b(text, lines) return _parse_variant_bc(text, lines, employer)
if _is_variant_a(lines): if _is_variant_a(lines):
return _parse_variant_a(text, lines) return _parse_variant_a(text, lines, employer)
raise ParserError("neither variant A nor variant B header found") raise ParserError("neither side-by-side nor single-column header found")
def _is_variant_b(lines: list[str]) -> bool: def _is_variant_b_or_c(lines: list[str]) -> bool:
return any("Payments" in line and "Deductions" in line and "Year to Date" in line return any("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)
for line in lines) for line in lines)
@@ -62,26 +80,34 @@ def _is_variant_a(lines: list[str]) -> bool:
def _to_decimal(s: str) -> Decimal: def _to_decimal(s: str) -> Decimal:
s = s.strip()
if s.startswith("(") and s.endswith(")"):
s = "-" + s[1:-1]
return Decimal(s.replace(",", "")) return Decimal(s.replace(",", ""))
def _parse_uk_date(s: str) -> date: def _parse_date(text: str) -> date:
return datetime.strptime(s, "%d/%m/%Y").date() """Try each supported format — whichever matches first wins."""
m = PAY_DATE_SLASH_RE.search(text)
if m:
return datetime.strptime(m.group(1), "%d/%m/%Y").date()
m = PAY_DATE_DOT_RE.search(text)
if m:
return datetime.strptime(m.group(1), "%d.%m.%Y").date()
m = PAY_DATE_WORD_RE.search(text)
if m:
raw = re.sub(r"\s+", " ", m.group(1)).strip()
return datetime.strptime(raw, "%d %b %Y").date()
raise ParserError("pay date not found")
def _find_field(text: str, pattern: re.Pattern[str]) -> str | None: def _find_match(text: str, pattern: re.Pattern[str]) -> str | None:
m = pattern.search(text) m = pattern.search(text)
return m.group(1) if m else None return m.group(1) if m else None
def _last_amount(segment: str) -> tuple[str, Decimal | None]: def _last_amount(segment: str) -> tuple[str, Decimal | None]:
"""Return (label, rightmost numeric amount) parsed out of one cell. """Return (label, rightmost numeric amount)."""
pdftotext -layout keeps Meta's column alignment stable, so each cell in
a row is "label ... amount" (optionally "label units rate amount" but
Meta leaves units/rate blank). We take the rightmost token as the
amount and whatever precedes it, stripped, as the label.
"""
matches = list(AMOUNT_RE.finditer(segment)) matches = list(AMOUNT_RE.finditer(segment))
if not matches: if not matches:
return segment.strip(), None return segment.strip(), None
@@ -90,44 +116,77 @@ def _last_amount(segment: str) -> tuple[str, Decimal | None]:
return label, _to_decimal(last.group()) return label, _to_decimal(last.group())
def _parse_dates(text: str) -> tuple[date, date | None, date | None]: # --------------------------------------------------------------------------
pay_date_str = _find_field(text, PAY_DATE_RE) # Variant B / C — side-by-side Payments | Deductions | Year to/To Date
if pay_date_str is None: # --------------------------------------------------------------------------
raise ParserError("Pay Date not found")
period_start = _find_field(text, PERIOD_START_RE) PAYMENTS_KNOWN = {
period_end = _find_field(text, PERIOD_END_RE) "Salary",
return ( "Perform Bonus",
_parse_uk_date(pay_date_str), "Bonus",
_parse_uk_date(period_start) if period_start else None, "AE Pension EE",
_parse_uk_date(period_end) if period_end else None, "AE Pension",
) "RSU Tax Offset",
"RSU Excs Refund",
"RSU Gain Taxabl",
"RSU Gain Nicabl",
"RSU Gain Taxable",
"RSU Gain Nicable",
"RSU Net Cash",
"RSU Net Cash UK",
}
DEDUCTIONS_KNOWN = {
"Tax paid",
"Tax",
"Employee NIC",
"National Insurance",
"Student Loans",
"Student Loan",
"RSU Net Gain",
}
RSU_VEST_LABELS = {
"RSU Tax Offset",
"RSU Excs Refund",
"RSU Gain Taxabl",
"RSU Gain Nicabl",
"RSU Gain Taxable",
"RSU Gain Nicable",
"RSU Net Cash",
"RSU Net Cash UK",
}
def _parse_variant_b(text: str, lines: list[str]) -> ExtractedPayslip: def _parse_variant_bc(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
header_idx, d_col, y_col = _find_variant_b_header(lines) header_idx, d_col, y_col = _find_bc_header(lines)
payments, payments_order, deductions = _collect_b_rows(lines, header_idx, d_col, y_col) payments, payments_order, deductions = _collect_bc_rows(lines, header_idx, d_col, y_col)
gross_pay, net_pay = _parse_b_totals_row(lines, header_idx, d_col, y_col) gross_pay, net_pay = _parse_bc_totals_row(lines, header_idx, d_col, y_col)
summary = _parse_summary_block(lines) summary = _parse_bc_summary_block(lines)
ae_pension = payments.get("AE Pension EE", Decimal("0")) ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0") pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")
rsu_vest = (payments.get("RSU Tax Offset", Decimal("0")) + rsu_vest = sum((payments.get(label, Decimal("0")) for label in RSU_VEST_LABELS),
payments.get("RSU Excs Refund", Decimal("0"))) start=Decimal("0"))
rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))
income_tax = deductions.get("Tax paid", deductions.get("Tax", Decimal("0"))) income_tax = deductions.get("Tax paid", deductions.get("Tax", Decimal("0")))
nic = deductions.get("Employee NIC", deductions.get("National Insurance", Decimal("0"))) nic = deductions.get("Employee NIC", deductions.get("National Insurance", Decimal("0")))
student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0"))) student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
other_deductions = _build_other_deductions_b(deductions, payments_order) other_deductions = {k: v for k, v in deductions.items() if k not in DEDUCTIONS_KNOWN}
del payments_order # retained for future debugging; not used in validation
pay_date, period_start, period_end = _parse_dates(text) pay_date = _parse_date(text)
period_start_s = _find_match(text, PERIOD_START_RE)
period_end_s = _find_match(text, PERIOD_END_RE)
period_start = datetime.strptime(period_start_s, "%d/%m/%Y").date() if period_start_s else None
period_end = datetime.strptime(period_end_s, "%d/%m/%Y").date() if period_end_s else None
return ExtractedPayslip( return ExtractedPayslip(
pay_date=pay_date, pay_date=pay_date,
pay_period_start=period_start, pay_period_start=period_start,
pay_period_end=period_end, pay_period_end=period_end,
employer=EMPLOYER, employer=employer,
currency="GBP", currency="GBP",
gross_pay=gross_pay, gross_pay=gross_pay,
income_tax=income_tax, income_tax=income_tax,
@@ -136,7 +195,7 @@ def _parse_variant_b(text: str, lines: list[str]) -> ExtractedPayslip:
pension_employer=Decimal("0"), pension_employer=Decimal("0"),
student_loan=student_loan, student_loan=student_loan,
rsu_vest=rsu_vest, rsu_vest=rsu_vest,
rsu_offset=Decimal("0"), rsu_offset=rsu_offset,
salary=payments.get("Salary", Decimal("0")), salary=payments.get("Salary", Decimal("0")),
bonus=payments.get("Perform Bonus", payments.get("Bonus", Decimal("0"))), bonus=payments.get("Perform Bonus", payments.get("Bonus", Decimal("0"))),
pension_sacrifice=pension_sacrifice, pension_sacrifice=pension_sacrifice,
@@ -149,14 +208,17 @@ def _parse_variant_b(text: str, lines: list[str]) -> ExtractedPayslip:
) )
def _find_variant_b_header(lines: list[str]) -> tuple[int, int, int]: def _find_bc_header(lines: list[str]) -> tuple[int, int, int]:
for i, line in enumerate(lines): for i, line in enumerate(lines):
if "Payments" in line and "Deductions" in line and "Year to Date" in line: if ("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)):
return i, line.index("Deductions"), line.index("Year to Date") # Columns anchored on left edge of "Deductions" / "Year [Tt]o Date"
raise ParserError("variant B header not found") ytd_match = re.search(r"Year [Tt]o Date", line)
assert ytd_match is not None
return i, line.index("Deductions"), ytd_match.start()
raise ParserError("variant B/C header not found")
def _collect_b_rows( def _collect_bc_rows(
lines: list[str], lines: list[str],
header_idx: int, header_idx: int,
d_col: int, d_col: int,
@@ -167,9 +229,9 @@ def _collect_b_rows(
deductions: dict[str, Decimal] = {} deductions: dict[str, Decimal] = {}
for i in range(header_idx + 1, len(lines)): for i in range(header_idx + 1, len(lines)):
line = lines[i].rstrip() line = lines[i].rstrip()
if not line.strip() or "Total Payment" in line:
if "Total Payment" in line: if "Total Payment" in line:
return payments, order, deductions return payments, order, deductions
if not line.strip():
continue continue
p_seg = line[:d_col] if len(line) > d_col else line p_seg = line[:d_col] if len(line) > d_col else line
d_seg = line[d_col:y_col] if len(line) > d_col else "" d_seg = line[d_col:y_col] if len(line) > d_col else ""
@@ -179,24 +241,31 @@ def _collect_b_rows(
order.append((p_label, p_amount)) order.append((p_label, p_amount))
d_label, d_amount = _last_amount(d_seg) d_label, d_amount = _last_amount(d_seg)
if d_label and d_amount is not None: if d_label and d_amount is not None:
# RSU Net Gain can show as negative on the YTD side duplication;
# normalize to absolute value on the deductions side.
if d_label == "RSU Net Gain":
d_amount = abs(d_amount)
deductions[d_label] = d_amount deductions[d_label] = d_amount
return payments, order, deductions return payments, order, deductions
def _parse_b_totals_row( def _parse_bc_totals_row(
lines: list[str], lines: list[str],
header_idx: int, header_idx: int,
d_col: int, d_col: int,
y_col: int, y_col: int,
) -> tuple[Decimal, Decimal]: ) -> tuple[Decimal, Decimal]:
del y_col # "Net Pay:" aligns with the Amount column, not the left edge of YTD
for i in range(header_idx + 1, len(lines)): for i in range(header_idx + 1, len(lines)):
line = lines[i] line = lines[i]
if "Total Payment" not in line: if "Total Payment" not in line:
continue continue
p_seg = line[:d_col] if len(line) > d_col else line p_seg = line[:d_col] if len(line) > d_col else line
y_seg = line[y_col:] if len(line) > y_col else ""
_, gross_pay = _last_amount(p_seg) _, gross_pay = _last_amount(p_seg)
_, net_pay = _last_amount(y_seg) if "Net Pay" in y_seg else (None, None) net_pay_idx = line.find("Net Pay")
if net_pay_idx < 0:
raise ParserError("Net Pay missing from totals row")
_, net_pay = _last_amount(line[net_pay_idx:])
if gross_pay is None: if gross_pay is None:
raise ParserError("Total Payment amount missing") raise ParserError("Total Payment amount missing")
if net_pay is None: if net_pay is None:
@@ -205,13 +274,8 @@ def _parse_b_totals_row(
raise ParserError("totals row not found") raise ParserError("totals row not found")
def _parse_summary_block(lines: list[str]) -> dict[str, Decimal]: def _parse_bc_summary_block(lines: list[str]) -> dict[str, Decimal]:
"""Pull Taxable Pay (this period + YTD), Tax Paid (YTD), Total Gross (YTD). """Pull Taxable Pay (this period + YTD), Tax Paid (YTD), Total Gross (YTD)."""
The summary sits after the totals row. Each row has 4 columns but only
the numeric ones matter; we use "2+ numbers on a line starting with
LABEL:" as the anchor, period-value first, YTD second.
"""
result: dict[str, Decimal] = {} result: dict[str, Decimal] = {}
for line in lines: for line in lines:
stripped = line.lstrip() stripped = line.lstrip()
@@ -232,80 +296,98 @@ def _parse_summary_block(lines: list[str]) -> dict[str, Decimal]:
return result return result
PAYMENTS_KNOWN = { # --------------------------------------------------------------------------
# Variant A — single-column Description | This Period | This Year
# --------------------------------------------------------------------------
VARIANT_A_PAYMENTS_KNOWN = {
"Salary", "Salary",
"Bonus",
"Perform Bonus", "Perform Bonus",
"Bonus", "Relocation Bonus",
"AE Pension EE", "AE Pension EE",
"RSU Tax Offset", "AE Pension",
"RSU Excs Refund", "Laundry Expense",
"Transportation Allowance",
"EE Edu Assist",
"RSU Gain Taxable",
"RSU Gain Nicable",
"RSU Gain Taxabl",
"RSU Gain Nicabl",
"RSU Net Cash",
"RSU Net Cash UK",
# BIK earnings mirrored on the deduction side — we exclude them from
# bonus/other_earnings so they don't double-count.
"Private Dental Insurance",
"Private Medical Insurance",
"EE Discount BIK",
} }
DEDUCTIONS_KNOWN = { VARIANT_A_DEDUCTIONS_KNOWN = {
"Tax paid",
"Tax", "Tax",
"Employee NIC",
"National Insurance", "National Insurance",
"Student Loans", "Student Loans",
"Student Loan", "Student Loan",
"RSU Net Gain",
"EE Discount BIK",
} }
VARIANT_A_RSU_LABELS = {
"RSU Gain Taxable",
"RSU Gain Nicable",
"RSU Gain Taxabl",
"RSU Gain Nicabl",
"RSU Net Cash",
"RSU Net Cash UK",
}
def _build_other_deductions_b( # "Taxable Pay : This Period £15323.16 : To Date £52446.53"
deductions: dict[str, Decimal], TAXABLE_PAY_A_RE = re.compile(r"Taxable Pay\s*:\s*This Period\s*£([\d,]+\.\d{2})")
payments_order: list[tuple[str, Decimal]], NET_PAY_A_RE = re.compile(r"Net Pay\s+(-?[\d,]+\.\d{2})")
) -> dict[str, Decimal]:
# Negative payments (Cycle To Work, Share Save, AE Pension EE) are
# already subtracted from Total Payment — adding them here would
# double-count in the validation formula. They remain visible in
# raw_extraction for historical reference.
del payments_order
return {k: v for k, v in deductions.items() if k not in DEDUCTIONS_KNOWN}
def _parse_variant_a(text: str, lines: list[str]) -> ExtractedPayslip: def _parse_variant_a(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
header_idx = _find_variant_a_header(lines) header_idx = _find_variant_a_header(lines)
items = _collect_a_rows(lines, header_idx) payments, deductions = _collect_a_blocks(lines, header_idx)
gross_pay, net_pay = _parse_a_gross_net(lines) gross_pay = _parse_a_gross(lines, header_idx, payments)
net_pay = _parse_a_net(text)
salary = items.get("Salary", Decimal("0")) ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
bonus = items.get("Bonus", Decimal("0")) pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")
taxable_pay = items.get("Taxable Pay")
income_tax = items.get("Tax", Decimal("0"))
nic = items.get("National Insurance", Decimal("0"))
student_loan = items.get("Student Loans", items.get("Student Loan", Decimal("0")))
pension_employee = items.get("AE Pension EE", Decimal("0"))
known = { rsu_vest = sum((payments.get(label, Decimal("0")) for label in VARIANT_A_RSU_LABELS),
"Salary", start=Decimal("0"))
"Bonus", rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))
"Taxable Pay",
"Tax",
"National Insurance",
"Student Loans",
"Student Loan",
"AE Pension EE",
}
other_deductions = {k: v for k, v in items.items() if k not in known}
pay_date, period_start, period_end = _parse_dates(text) income_tax = deductions.get("Tax", Decimal("0"))
nic = deductions.get("National Insurance", Decimal("0"))
student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
other_deductions = {k: v for k, v in deductions.items() if k not in VARIANT_A_DEDUCTIONS_KNOWN}
bonus = payments.get("Perform Bonus", payments.get("Bonus", Decimal("0")))
taxable_pay_s = _find_match(text, TAXABLE_PAY_A_RE)
taxable_pay = _to_decimal(taxable_pay_s) if taxable_pay_s else None
pay_date = _parse_date(text)
return ExtractedPayslip( return ExtractedPayslip(
pay_date=pay_date, pay_date=pay_date,
pay_period_start=period_start, pay_period_start=None,
pay_period_end=period_end, pay_period_end=None,
employer=EMPLOYER, employer=employer,
currency="GBP", currency="GBP",
gross_pay=gross_pay, gross_pay=gross_pay,
income_tax=income_tax, income_tax=income_tax,
national_insurance=nic, national_insurance=nic,
pension_employee=pension_employee, pension_employee=Decimal("0"),
pension_employer=Decimal("0"), pension_employer=Decimal("0"),
student_loan=student_loan, student_loan=student_loan,
rsu_vest=Decimal("0"), rsu_vest=rsu_vest,
rsu_offset=Decimal("0"), rsu_offset=rsu_offset,
salary=salary, salary=payments.get("Salary", Decimal("0")),
bonus=bonus, bonus=bonus,
pension_sacrifice=Decimal("0"), pension_sacrifice=pension_sacrifice,
taxable_pay=taxable_pay, taxable_pay=taxable_pay,
ytd_tax_paid=None, ytd_tax_paid=None,
ytd_taxable_pay=None, ytd_taxable_pay=None,
@@ -322,37 +404,70 @@ def _find_variant_a_header(lines: list[str]) -> int:
raise ParserError("variant A header not found") raise ParserError("variant A header not found")
def _collect_a_rows(lines: list[str], header_idx: int) -> dict[str, Decimal]: def _collect_a_blocks(
items: dict[str, Decimal] = {} lines: list[str],
header_idx: int,
) -> tuple[dict[str, Decimal], dict[str, Decimal]]:
"""Split variant A rows into Payments vs Deductions by the two `Total` anchors.
Layout: header payments rows `Total <gross>` deductions rows
`Total <deductions>` `Net Pay <net>`. We collect rows into whichever
block we're currently in.
"""
payments: dict[str, Decimal] = {}
deductions: dict[str, Decimal] = {}
block = payments
total_count = 0
for i in range(header_idx + 1, len(lines)): for i in range(header_idx + 1, len(lines)):
line = lines[i].rstrip() raw = lines[i].rstrip()
if not line.strip() or line.lstrip().startswith("-"): if not raw.strip():
continue continue
if "Gross Pay" in line or "Net Pay" in line: stripped = raw.strip()
if stripped.startswith("Total ") or stripped.startswith("Total\t"):
total_count += 1
if total_count == 1:
block = deductions
continue
if total_count == 2:
break break
amounts = list(AMOUNT_RE.finditer(line)) if "Net Pay" in raw:
if not amounts: break
matches = list(AMOUNT_RE.finditer(raw))
if not matches:
continue continue
label = line[:amounts[0].start()].strip() label = raw[:matches[0].start()].strip()
if label: if not label:
items[label] = _to_decimal(amounts[0].group()) continue
return items # "This Period" value is the first amount; "This Year" is the second.
# If only one amount is present, it's a YTD-only row (e.g. Relocation
# Bonus which doesn't apply this period) — skip it for the period totals.
if len(matches) < 2:
continue
amount = _to_decimal(matches[0].group())
block[label] = amount
return payments, deductions
def _parse_a_gross_net(lines: list[str]) -> tuple[Decimal, Decimal]: def _parse_a_gross(
gross_pay: Decimal | None = None lines: list[str],
net_pay: Decimal | None = None header_idx: int,
for line in lines: payments: dict[str, Decimal],
if "Gross Pay" in line and gross_pay is None: ) -> Decimal:
nums = AMOUNT_RE.findall(line) """Pull the first `Total <amount>` after the header — that's gross pay."""
for i in range(header_idx + 1, len(lines)):
stripped = lines[i].strip()
if stripped.startswith("Total "):
nums = AMOUNT_RE.findall(stripped)
if nums: if nums:
gross_pay = _to_decimal(nums[0]) return _to_decimal(nums[0])
if "Net Pay" in line and net_pay is None: # Fallback: sum payments values if the Total line is missing.
nums = AMOUNT_RE.findall(line) if payments:
if nums: return sum(payments.values(), start=Decimal("0"))
net_pay = _to_decimal(nums[0]) raise ParserError("Total (gross pay) row not found in variant A")
if gross_pay is None:
raise ParserError("Gross Pay not found")
if net_pay is None: def _parse_a_net(text: str) -> Decimal:
raise ParserError("Net Pay not found") m = NET_PAY_A_RE.search(text)
return gross_pay, net_pay if not m:
raise ParserError("Net Pay line not found in variant A")
return _to_decimal(m.group(1))


@@ -1,21 +0,0 @@
Facebook UK Limited Payslip
Employee: Viktor Barzin NI Number: AA123456A
Employee No: 254680 Tax Code: 1185L
Pay Date: 31/07/2019 Pay Period: 4
Period Start: 01/07/2019 Period End: 31/07/2019
Description This Period This Year
---------------------------------------------------------------------
Salary 7,083.33 28,333.32
Taxable Pay 6,583.33 26,333.32
Tax 1,480.00 5,920.00
National Insurance 564.73 2,258.92
AE Pension EE 500.00 2,000.00
Student Loans 120.00 480.00
---------------------------------------------------------------------
Gross Pay: 7,083.33
Net Pay: 4,418.60


@@ -0,0 +1,44 @@
254680A Mr Viktor Barzin Facebook UK Ltd
NI No : SZ762223D NI Letter : A Tax Code : 0T Pay By : BACS Date : 31 Aug 2021 Period : 5
Description Rate Units This Period This Year
Salary 5,096.65 25,483.25
AE Pension (152.90) (764.50)
Laundry Expense 40.00 200.00
Relocation Bonus 8,184.20
RSU Gain Taxable 10,239.30 18,757.93
RSU Gain Nicable 10,239.30 18,757.93
Transportation Allowance 73.10
RSU Net Cash UK 175.91 207.29
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Total 25,738.37
Tax 5,500.87 17,836.73
National Insurance 627.72 2,655.22
RSU Net Gain 15,666.13 28,699.63
Private Dental Insurance 15.61 78.05
Private Medical Insurance 84.50 422.50
EE Discount BIK 12.00
Student Loans 1,165.00 3,649.00
Total 23,059.83
Tax District : Pay As You Earn
Tax Reference : 846/BA09294 Net Pay 2,678.54
Taxable Pay : This Period £15323.16 : To Date £52446.53
Employers NIC This Period : 1,999.08
Employers NIC To Date : 6,660.03
Employers Pension This Period : 458.70
Employers Pension To Date :2,293.50


@@ -0,0 +1,60 @@
Page : 1
Employee Number: 254680
Facebook UK Limited
PRIVATE & CONFIDENTIAL
Viktor Barzin
Flat 37
Spenlow Apartments
London
N1 7GH
Company Name : Facebook UK Limited Tax Ref : 846/BA09294 Nat Ins No : SZ762223D
Pay Date : 30.11.2022 Tax Code : 0T Nat Ins Cat: A
Pay Method : BACS Transfer Tax Basis : 0 Cost Centre: 4220
Pay Period : 08/2022 Tax Period : 08
Payments Units Rate Amount Deductions Amount Year To Date Amount
Salary 8,983.33 Tax paid 5,800.07 Salary 8,983.33
AE Pension EE -539.00 Employee NIC 612.65 RSU Gain Taxabl 7,531.31
RSU Gain Taxabl 7,531.31 Student Loan 1,233.00 RSU Gain Nicabl 7,531.31
RSU Gain Nicabl 7,531.31 RSU Net Gain 11,522.91 RSU Net Cash 129.38
RSU Net Cash 129.38 Student Loan 7,271.00
RSU Net Gain -11,522.91
ER Pension This Period
AE Pension ER 808.50
Total Payment: 23,636.33 Total Deduction : 19,168.63 Net Pay: 4,467.70
This Period Amount Year To Date Amount Gross Benefits Payments
----------------------------------
Total Gross: 23,636.33 Total Gross: 131,034.64 Dent Ins TaxB 17.83
Non-Tax Ded: 0.00 Non-Tax ded: 0.00 Medi Ins TaxB 76.67
Taxable Pay: 16,070.14 Taxable Pay: 99,784.08
Tax Paid: 5,800.07 Tax Paid: 34,886.93
EEs NI : 612.65 EEs NI: 5,361.55
ERs NI : 2,100.04 ERs NI: 13,800.83
EEs Pension: -539.00 EEs Pension: -3,596.28
EEs AVC: 0.00 EEs AVC 0.00
ERs Pension: 5,504.38


@@ -1,24 +0,0 @@
Facebook UK Limited Payslip
Employee: Viktor Barzin NI Number: AA123456A Pay Date: 27/03/2024
Employee No: 254680 Tax Code: 1257L Pay Period: 12
Department: Engineering Period Start: 01/03/2024
Period End: 31/03/2024
Payments Units Rate Amount Deductions Amount Year to Date Amount
Salary 9,500.00 Tax paid 800.00 Salary 114,000.00
Perform Bonus 0.00 Employee NIC 280.00 Transportation 820.50
AE Pension EE -6,200.00 Student Loans 90.00
--------- ---------
Total Payment: 3,300.00 Total Deduction : 1,170.00 Net Pay: 2,130.00
This Period Amount Year To Date Amount
Total Gross: 3,300.00 Total Gross: 210,000.00
Taxable Pay: 3,300.00 Taxable Pay: 185,000.00
Tax Paid: 800.00 Tax Paid: 42,000.00
EEs NI: 280.00 EEs NI: 9,100.00
EEs Pension: -6,200.00 EEs Pension: -52,000.00


@@ -13,28 +13,24 @@ def _load(name: str) -> str:
return (FIXTURES / name).read_text(encoding="utf-8") return (FIXTURES / name).read_text(encoding="utf-8")
def test_parses_variant_b_standard_month() -> None: def test_parses_variant_b_modern() -> None:
"""Feb 2026 — variant B, RSU vesting, no bonus, salary-sacrifice pension.""" """Feb 2026 — variant B (post-2024), RSU vest, salary-sacrifice pension."""
result = parse_meta_uk(_load("meta_uk_2026_02.txt")) result = parse_meta_uk(_load("meta_uk_2026_02.txt"))
assert result.pay_date == date(2026, 2, 27) assert result.pay_date == date(2026, 2, 27)
assert result.pay_period_start == date(2026, 2, 1) assert result.pay_period_start == date(2026, 2, 1)
assert result.pay_period_end == date(2026, 2, 27) assert result.pay_period_end == date(2026, 2, 27)
assert result.employer == "Facebook UK Limited" assert result.employer == "Facebook UK Limited"
assert result.currency == "GBP"
assert result.salary == Decimal("10003.33") assert result.salary == Decimal("10003.33")
assert result.bonus == Decimal("0") assert result.bonus == Decimal("0")
assert result.pension_sacrifice == Decimal("600.20") assert result.pension_sacrifice == Decimal("600.20")
# rsu_vest = RSU Tax Offset + RSU Excs Refund assert result.rsu_vest == Decimal("30479.76") # RSU Tax Offset + RSU Excs Refund
assert result.rsu_vest == Decimal("30479.76") assert result.rsu_offset == Decimal("0") # modern Meta template omits offset
assert result.rsu_offset == Decimal("0")
assert result.gross_pay == Decimal("39882.89") assert result.gross_pay == Decimal("39882.89")
assert result.income_tax == Decimal("31311.90") assert result.income_tax == Decimal("31311.90")
assert result.national_insurance == Decimal("1602.89") assert result.national_insurance == Decimal("1602.89")
assert result.pension_employee == Decimal("0")
assert result.student_loan == Decimal("0")
assert result.net_pay == Decimal("6968.10") assert result.net_pay == Decimal("6968.10")
assert result.taxable_pay == Decimal("72096.92") assert result.taxable_pay == Decimal("72096.92")
@@ -43,8 +39,8 @@ def test_parses_variant_b_standard_month() -> None:
     assert result.ytd_gross == Decimal("232630.34")
-def test_parses_variant_b_with_bonus_and_rsu() -> None:
-    """March 2025 — variant B, bonus month, RSU vesting, multiple other deductions."""
+def test_parses_variant_b_with_bonus() -> None:
+    """March 2025 — variant B, bonus + RSU + multiple other deductions."""
     result = parse_meta_uk(_load("meta_uk_2025_03.txt"))
     assert result.pay_date == date(2025, 3, 27)
@@ -54,62 +50,74 @@ def test_parses_variant_b_with_bonus_and_rsu() -> None:
     assert result.rsu_vest == Decimal("20000.00")
     assert result.gross_pay == Decimal("53720.00")
-    assert result.income_tax == Decimal("45210.44")
-    assert result.national_insurance == Decimal("2750.12")
-    assert result.student_loan == Decimal("850.00")
     assert result.net_pay == Decimal("4753.69")
-    # Private Medical comes from the Deductions column. Cycle To Work is a
-    # negative Payments line — already subtracted from Total Payment, so it
-    # does NOT belong in other_deductions (that would double-count).
     assert "Private Medical" in result.other_deductions
-    assert result.other_deductions["Private Medical"] == Decimal("155.75")
-    assert "Cycle To Work" not in result.other_deductions
-def test_parses_variant_b_bonus_sacrificed() -> None:
-    """March 2024 — variant B, full bonus sacrificed into pension, bonus line = 0."""
-    result = parse_meta_uk(_load("meta_uk_2024_03_bonus_sacrificed.txt"))
-    assert result.pay_date == date(2024, 3, 27)
-    assert result.salary == Decimal("9500.00")
-    # Bonus line present but zero — parser should surface this so the dashboard
-    # can highlight the "bonus sacrificed" dip.
-    assert result.bonus == Decimal("0")
-    # Big pension sacrifice dwarfs the salary — this is the signal we care about.
-    assert result.pension_sacrifice == Decimal("6200.00")
-    assert result.rsu_vest == Decimal("0")
-    assert result.gross_pay == Decimal("3300.00")
-    assert result.net_pay == Decimal("2130.00")
-def test_parses_variant_a_pre_2022() -> None:
-    """July 2019 — variant A, pre-RSU, single-column layout.
-    Variant A lists AE Pension EE as a positive deduction (pre-sacrifice gross),
-    so it maps to `pension_employee` for the standard validation formula to hold.
-    Variant B lists it as a negative payment (post-sacrifice gross) and maps to
-    `pension_sacrifice` instead. Both represent money going into the pension.
+def test_parses_variant_c_2022_11() -> None:
+    """Nov 2022 — mid-era template. Real pdftotext from doc_id=53.
+
+    Side-by-side Payments | Deductions | Year To Date (capital "To"), dot-
+    separated date, RSU labels use `RSU Gain Taxabl` / `Nicabl` (abbreviated)
+    and a matching `RSU Net Gain` offset on the deductions side.
     """
-    result = parse_meta_uk(_load("meta_uk_2019_07.txt"))
-    assert result.pay_date == date(2019, 7, 31)
+    result = parse_meta_uk(_load("meta_uk_2022_11_variant_c.txt"))
+    assert result.pay_date == date(2022, 11, 30)
     assert result.employer == "Facebook UK Limited"
-    assert result.salary == Decimal("7083.33")
+    assert result.salary == Decimal("8983.33")
     assert result.bonus == Decimal("0")
-    assert result.rsu_vest == Decimal("0")
-    assert result.pension_sacrifice == Decimal("0")
-    assert result.pension_employee == Decimal("500.00")
-    assert result.gross_pay == Decimal("7083.33")
-    assert result.income_tax == Decimal("1480.00")
-    assert result.national_insurance == Decimal("564.73")
-    assert result.student_loan == Decimal("120.00")
-    assert result.net_pay == Decimal("4418.60")
-    # Variant A carries a "Taxable Pay" line inline
-    assert result.taxable_pay == Decimal("6583.33")
+    assert result.pension_sacrifice == Decimal("539.00")
+    # rsu_vest = RSU Gain Taxabl + RSU Gain Nicabl + RSU Net Cash
+    assert result.rsu_vest == Decimal("15192.00")
+    # rsu_offset = RSU Net Gain (the matching deduction)
+    assert result.rsu_offset == Decimal("11522.91")
+    assert result.gross_pay == Decimal("23636.33")
+    assert result.income_tax == Decimal("5800.07")
+    assert result.national_insurance == Decimal("612.65")
+    assert result.student_loan == Decimal("1233.00")
+    assert result.net_pay == Decimal("4467.70")
+    assert result.taxable_pay == Decimal("16070.14")
+    assert result.ytd_tax_paid == Decimal("34886.93")
+    assert result.ytd_taxable_pay == Decimal("99784.08")
+    assert result.ytd_gross == Decimal("131034.64")
+def test_parses_variant_a_2021_08() -> None:
+    """Aug 2021 — variant A. Real pdftotext from doc_id=43.
+    Single-column Description | This Period | This Year layout. Parenthesized
+    negatives `(152.90)`, Facebook UK Ltd (not Limited), date `Date : 31 Aug
+    2021`. BIK items (Dental/Medical) appear as both earnings and deductions.
+    """
+    result = parse_meta_uk(_load("meta_uk_2021_08_variant_a.txt"))
+    assert result.pay_date == date(2021, 8, 31)
+    assert result.employer == "Facebook UK Ltd"
+    assert result.salary == Decimal("5096.65")
+    assert result.bonus == Decimal("0")
+    assert result.pension_sacrifice == Decimal("152.90")
+    # rsu_vest = RSU Gain Taxable + RSU Gain Nicable + RSU Net Cash UK
+    assert result.rsu_vest == Decimal("20654.51")
+    assert result.rsu_offset == Decimal("15666.13")
+    assert result.gross_pay == Decimal("25738.37")
+    assert result.income_tax == Decimal("5500.87")
+    assert result.national_insurance == Decimal("627.72")
+    assert result.student_loan == Decimal("1165.00")
+    assert result.net_pay == Decimal("2678.54")
+    # BIK offsets on the deductions side
+    assert result.other_deductions.get("Private Dental Insurance") == Decimal("15.61")
+    assert result.other_deductions.get("Private Medical Insurance") == Decimal("84.50")
+    # Variant A surfaces Taxable Pay via a trailing line `Taxable Pay : This
+    # Period £XXXX.XX : To Date £YYYY.YY`.
+    assert result.taxable_pay == Decimal("15323.16")
 def test_raises_on_non_meta_payslip() -> None:
@@ -122,17 +130,11 @@ def test_raises_on_empty_text() -> None:
         parse_meta_uk("")
-def test_raises_when_pay_date_missing() -> None:
-    broken = "Facebook UK Limited\nPayslip\nSalary 1000.00\nNet Pay: 800.00\n"
-    with pytest.raises(ParserError):
-        parse_meta_uk(broken)
 @pytest.mark.parametrize("fixture_name", [
     "meta_uk_2026_02.txt",
     "meta_uk_2025_03.txt",
-    "meta_uk_2024_03_bonus_sacrificed.txt",
-    "meta_uk_2019_07.txt",
+    "meta_uk_2022_11_variant_c.txt",
+    "meta_uk_2021_08_variant_a.txt",
 ])
 def test_all_fixtures_validate_totals(fixture_name: str) -> None:
     """Every fixture must satisfy gross - deductions ≈ net within 2p."""
@@ -140,7 +142,7 @@ def test_all_fixtures_validate_totals(fixture_name: str) -> None:
     result = parse_meta_uk(_load(fixture_name))
     assert validate_totals(result), (
-        f"{fixture_name}: gross={result.gross_pay} "
-        f"tax={result.income_tax} nic={result.national_insurance} "
-        f"student={result.student_loan} other={result.other_deductions} "
-        f"net={result.net_pay}")
+        f"{fixture_name}: gross={result.gross_pay} tax={result.income_tax} "
+        f"nic={result.national_insurance} student={result.student_loan} "
+        f"pension_employee={result.pension_employee} rsu_offset={result.rsu_offset} "
+        f"other={result.other_deductions} net={result.net_pay}")
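
The "gross - deductions ≈ net within 2p" check above can be worked through by hand with the values the two new tests assert. The sketch below is an assumption about the formula, inferred from the fields named in the failure message; `totals_balance` is a hypothetical stand-in, not the repo's `validate_totals`. `pension_employee` is taken as 0 for both fixtures, since neither test asserts it.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # "within 2p", per the test docstring


def totals_balance(*, gross_pay, income_tax, national_insurance, student_loan,
                   pension_employee, rsu_offset, other_deductions, net_pay):
    """Return True when gross minus all deductions is within 2p of net."""
    deductions = (
        income_tax + national_insurance + student_loan
        + pension_employee + rsu_offset
        + sum(other_deductions.values(), Decimal("0"))
    )
    return abs(gross_pay - deductions - net_pay) <= TOLERANCE


# Values asserted in test_parses_variant_c_2022_11 (pension_employee assumed 0).
variant_c = dict(
    gross_pay=Decimal("23636.33"), income_tax=Decimal("5800.07"),
    national_insurance=Decimal("612.65"), student_loan=Decimal("1233.00"),
    pension_employee=Decimal("0"), rsu_offset=Decimal("11522.91"),
    other_deductions={}, net_pay=Decimal("4467.70"),
)

# Values asserted in test_parses_variant_a_2021_08 (pension_employee assumed 0).
variant_a = dict(
    gross_pay=Decimal("25738.37"), income_tax=Decimal("5500.87"),
    national_insurance=Decimal("627.72"), student_loan=Decimal("1165.00"),
    pension_employee=Decimal("0"), rsu_offset=Decimal("15666.13"),
    other_deductions={
        "Private Dental Insurance": Decimal("15.61"),
        "Private Medical Insurance": Decimal("84.50"),
    },
    net_pay=Decimal("2678.54"),
)
```

For variant C the deductions sum to 19,168.63, which matches the fixture's `Total Deduction` line. For variant A the balance only holds once the two BIK entries in `other_deductions` are included, which is why the deduction-side copies must be captured.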