payslip-ingest/payslip_ingest/parsers/meta_uk.py
Viktor Barzin f62c5332e3 meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)
## Context

The initial v2 parser (commit 9741816) only handled the modern template
(variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from
2021-07 through 2023-11 failed entirely — Claude fallback hit errors on
them and the rows never landed. Investigation via `kubectl exec` +
pdftotext on a sample of the failing docs revealed two previously-unseen
layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/
  This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31
  Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines:
  `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the
  earnings side with a matching `RSU Net Gain` on the deductions side.
  BIK items (Private Dental/Medical) appear on both sides — net zero in
  the gross, but the deduction-side copy must land in other_deductions
  for the validation formula to hold.

- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|
  Year To Date (note capital "To", vs variant B's lowercase "to"). Date
  format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the
  abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU
  Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.
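
As a rough illustration of how the three layouts can be told apart from the table header alone, here is a minimal sketch. `detect_variant` is a hypothetical helper, not part of the parser (which only distinguishes side-by-side from single-column), and it assumes the header strings quoted above land on a single line of pdftotext output.

```python
def detect_variant(text: str) -> str:
    """Return 'A', 'B' or 'C' based on the table header line.

    Hypothetical helper for illustration only; the header strings are the
    ones quoted in the variant descriptions above.
    """
    for line in text.splitlines():
        if "Payments" in line and "Deductions" in line:
            if "Year To Date" in line:  # capital "To": variant C
                return "C"
            if "Year to Date" in line:  # lowercase "to": variant B
                return "B"
        if "Description" in line and "This Period" in line and "This Year" in line:
            return "A"  # single-column layout
    raise ValueError("no known payslip header found")
```

The casing of "To" is the only header-level difference between B and C, which may be why the parser treats them as one layout and handles the differences further down.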

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches
  all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's
  accounting-style parenthesized negatives normalize to `-1234.56` in
  `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the
  side-by-side header with `Year [Tt]o Date`); their parsers share code
  because the column mechanics are identical — only the RSU-label set and
  date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the
  two `Total ...` anchors (payments → deductions), pay_date from the
  header's `Date : ...`, gross from first Total, net from the trailing
  `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X`
  line at the bottom.
- `RSU_VEST_LABELS` covers 8 aliases and is used by the B/C parser
  (variant A uses the 6-label `VARIANT_A_RSU_LABELS` subset); `rsu_vest`
  sums every matching payment line. `rsu_offset` maps to `RSU Net Gain` on
  the deduction side when present (absent in variant B, present in A and C).
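
The amount and date handling described above can be sketched in isolation. The regexes are copied from the parser; `to_decimal`, `parse_pay_date` and the `DATE_FORMATS` table are illustrative stand-ins for `_to_decimal` and `_parse_date`, and the sketch raises `ValueError` where the parser raises `ParserError`.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Accepts "-1,234.56" as well as accounting-style "(1,234.56)".
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")


def to_decimal(s: str) -> Decimal:
    """Normalize thousands separators and parenthesized negatives."""
    s = s.strip()
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return Decimal(s.replace(",", ""))


# Tried in order: slash (variant B), dot (variant C), word (variant A).
DATE_FORMATS = [
    (re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})"), "%d/%m/%Y"),
    (re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})"), "%d.%m.%Y"),
    (re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})"), "%d %b %Y"),
]


def parse_pay_date(text: str) -> date:
    for pattern, fmt in DATE_FORMATS:
        m = pattern.search(text)
        if m:
            # Collapse runs of whitespace so "31  Aug 2021" still parses.
            raw = re.sub(r"\s+", " ", m.group(1))
            return datetime.strptime(raw, fmt).date()
    raise ValueError("pay date not found")


print(to_decimal("(152.90)"))                 # -152.90
print(parse_pay_date("Date : 31 Aug 2021"))   # 2021-08-31
```

Note that `%b` parsing of month abbreviations depends on the process locale; the sketch assumes an English or C locale.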

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta
output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`)
and replaced them with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because
they encode specific bonus/no-bonus scenarios and the numbers are
derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across
  all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE + re-run backfill. Expect the 73 real payslips to extract via
   regex (≥95% hit rate), with validated rows rising from 42 to 70+.
2. Sample a row for each variant via psql: employer, rsu_vest, and
   taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.
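
For readers unfamiliar with the 2p check, the sketch below shows one plausible shape for it. The exact validation formula is not spelled out in this change, so the rule used here (gross minus all deductions equals net, within tolerance) is an assumption, `totals_balance` is a hypothetical name, and the figures are made up for illustration.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # "within 2p"


def totals_balance(
    gross_pay: Decimal,
    net_pay: Decimal,
    deductions: dict[str, Decimal],
    tolerance: Decimal = TOLERANCE,
) -> bool:
    """Assumed rule: gross - sum(deductions) should equal net within tolerance."""
    expected_net = gross_pay - sum(deductions.values(), start=Decimal("0"))
    return abs(expected_net - net_pay) <= tolerance


# Illustrative figures only, not taken from a real payslip.
sample = {"Tax": Decimal("200.00"), "National Insurance": Decimal("100.00")}
print(totals_balance(Decimal("1000.00"), Decimal("700.01"), sample))  # True
```

Under this assumed rule, a deduction-side row that is dropped from the extracted fields (such as a BIK mirror entry) would show up as an imbalance larger than the tolerance.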

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00


"""Regex-based Meta UK payslip parser.

Meta UK payslips come in three layout variants across 2019-2026:

- **Variant A** (2019-mid-2022, seen as `Facebook UK Ltd`):
  Single-column `Description | This Period | This Year` layout. Parenthesized
  negatives `(152.90)` = -152.90. Date format `Date : 31 Aug 2021`. RSU
  labels: `RSU Gain Taxable`, `RSU Gain Nicable`, `RSU Net Cash UK`, plus a
  matching `RSU Net Gain` deduction. BIK items (Private Dental/Medical,
  EE Discount) appear as both earnings and deductions.

- **Variant C** (late-2022 - 2023, `Facebook UK Limited`):
  Side-by-side `Payments | Deductions | Year To Date` (capital "To"). Date
  format `Pay Date : 30.11.2022` (dots). `Company Name : Facebook UK Limited`
  preamble. RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and
  still have the `RSU Net Gain` offset.

- **Variant B** (2024+, `Facebook UK Limited`):
  Side-by-side `Payments | Deductions | Year to Date` (lowercase "to"). Date
  format `Pay Date: 27/02/2026` (slashes). RSU labels are `RSU Tax Offset`
  + `RSU Excs Refund`; there is NO matching offset deduction, so the vest
  grosses up Taxable Pay and PAYE is on the grossed-up figure.

Parser returns `ExtractedPayslip`. On any structural miss it raises
`ParserError` so the caller falls back to ClaudeExtractor.
"""

import re
from datetime import date, datetime
from decimal import Decimal

from payslip_ingest.schema import ExtractedPayslip


class ParserError(ValueError):
    """Raised when the Meta UK template cannot be matched."""


# Two amount notations:
#   "-1,234.56" (leading minus, as on slash-date payslips) and
#   "(1,234.56)" (variant A, parenthesized negative)
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")
# Pay Date / Date — three accepted formats:
# "Pay Date: 27/02/2026"
# "Pay Date : 30.11.2022"
# "Date : 31 Aug 2021"
PAY_DATE_SLASH_RE = re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})")
PAY_DATE_DOT_RE = re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})")
PAY_DATE_WORD_RE = re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})")
PERIOD_START_RE = re.compile(r"Period Start\s*:\s*(\d{2}/\d{2}/\d{4})")
PERIOD_END_RE = re.compile(r"Period End\s*:\s*(\d{2}/\d{2}/\d{4})")
EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")


def parse_meta_uk(text: str) -> ExtractedPayslip:
    if not text.strip():
        raise ParserError("empty text")

    employer_match = EMPLOYER_RE.search(text)
    if not employer_match:
        raise ParserError("does not look like a Meta UK payslip")
    employer = employer_match.group(0)

    lines = text.splitlines()
    if _is_variant_b_or_c(lines):
        return _parse_variant_bc(text, lines, employer)
    if _is_variant_a(lines):
        return _parse_variant_a(text, lines, employer)
    raise ParserError("neither side-by-side nor single-column header found")


def _is_variant_b_or_c(lines: list[str]) -> bool:
    return any("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)
               for line in lines)


def _is_variant_a(lines: list[str]) -> bool:
    return any("Description" in line and "This Period" in line and "This Year" in line
               for line in lines)


def _to_decimal(s: str) -> Decimal:
    s = s.strip()
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return Decimal(s.replace(",", ""))


def _parse_date(text: str) -> date:
    """Try each supported format in turn; whichever matches first wins."""
    m = PAY_DATE_SLASH_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d/%m/%Y").date()
    m = PAY_DATE_DOT_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d.%m.%Y").date()
    m = PAY_DATE_WORD_RE.search(text)
    if m:
        raw = re.sub(r"\s+", " ", m.group(1)).strip()
        return datetime.strptime(raw, "%d %b %Y").date()
    raise ParserError("pay date not found")


def _find_match(text: str, pattern: re.Pattern[str]) -> str | None:
    m = pattern.search(text)
    return m.group(1) if m else None


def _last_amount(segment: str) -> tuple[str, Decimal | None]:
    """Return (label, rightmost numeric amount)."""
    matches = list(AMOUNT_RE.finditer(segment))
    if not matches:
        return segment.strip(), None
    last = matches[-1]
    label = segment[:last.start()].strip()
    return label, _to_decimal(last.group())


# --------------------------------------------------------------------------
# Variant B / C — side-by-side Payments | Deductions | Year to/To Date
# --------------------------------------------------------------------------
PAYMENTS_KNOWN = {
    "Salary",
    "Perform Bonus",
    "Bonus",
    "AE Pension EE",
    "AE Pension",
    "RSU Tax Offset",
    "RSU Excs Refund",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Net Cash",
    "RSU Net Cash UK",
}

DEDUCTIONS_KNOWN = {
    "Tax paid",
    "Tax",
    "Employee NIC",
    "National Insurance",
    "Student Loans",
    "Student Loan",
    "RSU Net Gain",
}

RSU_VEST_LABELS = {
    "RSU Tax Offset",
    "RSU Excs Refund",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Net Cash",
    "RSU Net Cash UK",
}


def _parse_variant_bc(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
    header_idx, d_col, y_col = _find_bc_header(lines)
    payments, payments_order, deductions = _collect_bc_rows(lines, header_idx, d_col, y_col)
    gross_pay, net_pay = _parse_bc_totals_row(lines, header_idx, d_col, y_col)
    summary = _parse_bc_summary_block(lines)

    ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
    pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")

    rsu_vest = sum((payments.get(label, Decimal("0")) for label in RSU_VEST_LABELS),
                   start=Decimal("0"))
    rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))

    income_tax = deductions.get("Tax paid", deductions.get("Tax", Decimal("0")))
    nic = deductions.get("Employee NIC", deductions.get("National Insurance", Decimal("0")))
    student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
    other_deductions = {k: v for k, v in deductions.items() if k not in DEDUCTIONS_KNOWN}
    del payments_order  # retained for future debugging; not used in validation

    pay_date = _parse_date(text)
    period_start_s = _find_match(text, PERIOD_START_RE)
    period_end_s = _find_match(text, PERIOD_END_RE)
    period_start = datetime.strptime(period_start_s, "%d/%m/%Y").date() if period_start_s else None
    period_end = datetime.strptime(period_end_s, "%d/%m/%Y").date() if period_end_s else None

    return ExtractedPayslip(
        pay_date=pay_date,
        pay_period_start=period_start,
        pay_period_end=period_end,
        employer=employer,
        currency="GBP",
        gross_pay=gross_pay,
        income_tax=income_tax,
        national_insurance=nic,
        pension_employee=Decimal("0"),
        pension_employer=Decimal("0"),
        student_loan=student_loan,
        rsu_vest=rsu_vest,
        rsu_offset=rsu_offset,
        salary=payments.get("Salary", Decimal("0")),
        bonus=payments.get("Perform Bonus", payments.get("Bonus", Decimal("0"))),
        pension_sacrifice=pension_sacrifice,
        taxable_pay=summary.get("taxable_pay"),
        ytd_tax_paid=summary.get("ytd_tax_paid"),
        ytd_taxable_pay=summary.get("ytd_taxable_pay"),
        ytd_gross=summary.get("ytd_gross"),
        other_deductions=other_deductions,
        net_pay=net_pay,
    )


def _find_bc_header(lines: list[str]) -> tuple[int, int, int]:
    for i, line in enumerate(lines):
        if ("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)):
            # Columns anchored on left edge of "Deductions" / "Year [Tt]o Date"
            ytd_match = re.search(r"Year [Tt]o Date", line)
            assert ytd_match is not None
            return i, line.index("Deductions"), ytd_match.start()
    raise ParserError("variant B/C header not found")


def _collect_bc_rows(
    lines: list[str],
    header_idx: int,
    d_col: int,
    y_col: int,
) -> tuple[dict[str, Decimal], list[tuple[str, Decimal]], dict[str, Decimal]]:
    payments: dict[str, Decimal] = {}
    order: list[tuple[str, Decimal]] = []
    deductions: dict[str, Decimal] = {}
    for i in range(header_idx + 1, len(lines)):
        line = lines[i].rstrip()
        if "Total Payment" in line:
            return payments, order, deductions
        if not line.strip():
            continue
        p_seg = line[:d_col] if len(line) > d_col else line
        d_seg = line[d_col:y_col] if len(line) > d_col else ""
        p_label, p_amount = _last_amount(p_seg)
        if p_label and p_amount is not None:
            payments[p_label] = p_amount
            order.append((p_label, p_amount))
        d_label, d_amount = _last_amount(d_seg)
        if d_label and d_amount is not None:
            # RSU Net Gain can show as negative on the YTD side duplication;
            # normalize to absolute value on the deductions side.
            if d_label == "RSU Net Gain":
                d_amount = abs(d_amount)
            deductions[d_label] = d_amount
    return payments, order, deductions


def _parse_bc_totals_row(
    lines: list[str],
    header_idx: int,
    d_col: int,
    y_col: int,
) -> tuple[Decimal, Decimal]:
    del y_col  # "Net Pay:" aligns with the Amount column, not the left edge of YTD
    for i in range(header_idx + 1, len(lines)):
        line = lines[i]
        if "Total Payment" not in line:
            continue
        p_seg = line[:d_col] if len(line) > d_col else line
        _, gross_pay = _last_amount(p_seg)
        net_pay_idx = line.find("Net Pay")
        if net_pay_idx < 0:
            raise ParserError("Net Pay missing from totals row")
        _, net_pay = _last_amount(line[net_pay_idx:])
        if gross_pay is None:
            raise ParserError("Total Payment amount missing")
        if net_pay is None:
            raise ParserError("Net Pay amount missing from totals row")
        return gross_pay, net_pay
    raise ParserError("totals row not found")


def _parse_bc_summary_block(lines: list[str]) -> dict[str, Decimal]:
    """Pull Taxable Pay (this period + YTD), Tax Paid (YTD), Total Gross (YTD)."""
    result: dict[str, Decimal] = {}
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("Taxable Pay:"):
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 1:
                result["taxable_pay"] = _to_decimal(nums[0])
            if len(nums) >= 2:
                result["ytd_taxable_pay"] = _to_decimal(nums[1])
        elif stripped.startswith("Total Gross:"):
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 2:
                result["ytd_gross"] = _to_decimal(nums[1])
        elif stripped.startswith("Tax Paid:"):
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 2:
                result["ytd_tax_paid"] = _to_decimal(nums[1])
    return result


# --------------------------------------------------------------------------
# Variant A — single-column Description | This Period | This Year
# --------------------------------------------------------------------------
VARIANT_A_PAYMENTS_KNOWN = {
    "Salary",
    "Bonus",
    "Perform Bonus",
    "Relocation Bonus",
    "AE Pension EE",
    "AE Pension",
    "Laundry Expense",
    "Transportation Allowance",
    "EE Edu Assist",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Net Cash",
    "RSU Net Cash UK",
    # BIK earnings mirrored on the deduction side; we exclude them from
    # bonus/other_earnings so they don't double-count.
    "Private Dental Insurance",
    "Private Medical Insurance",
    "EE Discount BIK",
}

VARIANT_A_DEDUCTIONS_KNOWN = {
    "Tax",
    "National Insurance",
    "Student Loans",
    "Student Loan",
    "RSU Net Gain",
    "EE Discount BIK",
}

VARIANT_A_RSU_LABELS = {
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Net Cash",
    "RSU Net Cash UK",
}

# "Taxable Pay : This Period £15323.16 : To Date £52446.53"
TAXABLE_PAY_A_RE = re.compile(r"Taxable Pay\s*:\s*This Period\s*£([\d,]+\.\d{2})")
NET_PAY_A_RE = re.compile(r"Net Pay\s+(-?[\d,]+\.\d{2})")


def _parse_variant_a(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
    header_idx = _find_variant_a_header(lines)
    payments, deductions = _collect_a_blocks(lines, header_idx)
    gross_pay = _parse_a_gross(lines, header_idx, payments)
    net_pay = _parse_a_net(text)

    ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
    pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")

    rsu_vest = sum((payments.get(label, Decimal("0")) for label in VARIANT_A_RSU_LABELS),
                   start=Decimal("0"))
    rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))

    income_tax = deductions.get("Tax", Decimal("0"))
    nic = deductions.get("National Insurance", Decimal("0"))
    student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
    other_deductions = {k: v for k, v in deductions.items() if k not in VARIANT_A_DEDUCTIONS_KNOWN}

    bonus = payments.get("Perform Bonus", payments.get("Bonus", Decimal("0")))
    taxable_pay_s = _find_match(text, TAXABLE_PAY_A_RE)
    taxable_pay = _to_decimal(taxable_pay_s) if taxable_pay_s else None
    pay_date = _parse_date(text)

    return ExtractedPayslip(
        pay_date=pay_date,
        pay_period_start=None,
        pay_period_end=None,
        employer=employer,
        currency="GBP",
        gross_pay=gross_pay,
        income_tax=income_tax,
        national_insurance=nic,
        pension_employee=Decimal("0"),
        pension_employer=Decimal("0"),
        student_loan=student_loan,
        rsu_vest=rsu_vest,
        rsu_offset=rsu_offset,
        salary=payments.get("Salary", Decimal("0")),
        bonus=bonus,
        pension_sacrifice=pension_sacrifice,
        taxable_pay=taxable_pay,
        ytd_tax_paid=None,
        ytd_taxable_pay=None,
        ytd_gross=None,
        other_deductions=other_deductions,
        net_pay=net_pay,
    )


def _find_variant_a_header(lines: list[str]) -> int:
    for i, line in enumerate(lines):
        if "Description" in line and "This Period" in line and "This Year" in line:
            return i
    raise ParserError("variant A header not found")


def _collect_a_blocks(
    lines: list[str],
    header_idx: int,
) -> tuple[dict[str, Decimal], dict[str, Decimal]]:
    """Split variant A rows into Payments vs Deductions by the two `Total` anchors.

    Layout: header → payments rows → `Total <gross>` → deductions rows →
    `Total <deductions>` → `Net Pay <net>`. We collect rows into whichever
    block we're currently in.
    """
    payments: dict[str, Decimal] = {}
    deductions: dict[str, Decimal] = {}
    block = payments
    total_count = 0
    for i in range(header_idx + 1, len(lines)):
        raw = lines[i].rstrip()
        if not raw.strip():
            continue
        stripped = raw.strip()
        if stripped.startswith("Total ") or stripped.startswith("Total\t"):
            total_count += 1
            if total_count == 1:
                block = deductions
                continue
            if total_count == 2:
                break
        if "Net Pay" in raw:
            break
        matches = list(AMOUNT_RE.finditer(raw))
        if not matches:
            continue
        label = raw[:matches[0].start()].strip()
        if not label:
            continue
        # "This Period" value is the first amount; "This Year" is the second.
        # If only one amount is present, it's a YTD-only row (e.g. Relocation
        # Bonus which doesn't apply this period), so skip it for the period totals.
        if len(matches) < 2:
            continue
        amount = _to_decimal(matches[0].group())
        block[label] = amount
    return payments, deductions


def _parse_a_gross(
    lines: list[str],
    header_idx: int,
    payments: dict[str, Decimal],
) -> Decimal:
    """Pull the first `Total <amount>` after the header; that's gross pay."""
    for i in range(header_idx + 1, len(lines)):
        stripped = lines[i].strip()
        if stripped.startswith("Total "):
            nums = AMOUNT_RE.findall(stripped)
            if nums:
                return _to_decimal(nums[0])
    # Fallback: sum payments values if the Total line is missing.
    if payments:
        return sum(payments.values(), start=Decimal("0"))
    raise ParserError("Total (gross pay) row not found in variant A")


def _parse_a_net(text: str) -> Decimal:
    m = NET_PAY_A_RE.search(text)
    if not m:
        raise ParserError("Net Pay line not found in variant A")
    return _to_decimal(m.group(1))