payslip-ingest/payslip_ingest/parsers/meta_uk.py

474 lines
17 KiB
Python

v2: regex parser for Meta UK template + accurate RSU tax attribution

## Context

v1 shipped a Claude Haiku-based extractor that validated only 10/71 backfilled rows. Haiku fumbles the arithmetic on pension salary-sacrifice, conflates RSU vest with regular earnings, and occasionally misreads YTD vs this-period columns, so 86% of rows land with validated=false and the downstream dashboards under-report take-home.

Meta UK uses a stable two-variant template (pre/post 2022-01-31 boundary), so a regex parser is both faster (ms vs. 30-90s + $0.01-0.05/call) and more accurate. v2 introduces that parser as the primary path, keeps Claude as the fallback for non-Meta payslips, and surfaces new fields the dashboard needs to attribute PAYE between cash salary and RSU vests correctly.

## This change

### Parser (new)

`payslip_ingest/parsers/meta_uk.py` detects the layout variant by header presence:

- **Variant A** (pre-2022): vertical Description/This Period/This Year. `AE Pension EE` is a positive deduction against a pre-sacrifice gross; it maps to `pension_employee` for the existing validation formula to hold.
- **Variant B** (post-2022): side-by-side Payments | Deductions | Year to Date. `AE Pension EE` is NEGATIVE in Payments (salary sacrifice); it maps to `pension_sacrifice` and is already netted into Total Payment. `rsu_vest = RSU Tax Offset + RSU Excs Refund` (Meta's template inflates Taxable Pay without using a matching offset deduction).

Column boundaries come from the header row's anchor positions; each data row slices into 3 cells and the last numeric token per cell is the amount. Anchor misses raise ParserError so the caller falls back to Claude rather than silently returning bad data.

### New fields

Schema + DB + Claude prompt gain:

- `salary`, `bonus`, `pension_sacrifice`: earnings decomposition for the dashboard's bonus-sacrifice visibility and earnings-breakdown chart
- `taxable_pay`, `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: powers the YTD-effective-rate method of attributing cash tax vs RSU tax, which is the only method that's accurate month-to-month

All new columns default to 0 / null so v1 rows continue to round-trip.

### Orchestration

processor.py tries `parse_meta_uk(pdftotext(pdf))` first. On success the result goes straight to the DB: zero Claude tokens spent, extraction in milliseconds. On ParserError it falls through to ClaudeExtractor as before. ProcessResult gains an `extractor` field ("meta_uk_regex" | "claude") so backfill logs show the hit rate.

## Tests

- `test_meta_uk_parser.py`: 11 tests covering variant A, variant B (standard + bonus month + bonus-sacrificed month), malformed inputs, and end-to-end totals validation for all 4 golden fixtures.
- `test_processor.py`: 2 new tests proving the regex-first short-circuit and the Claude fallback on non-Meta inputs.

Fixtures under `tests/fixtures/` are hand-crafted `pdftotext -layout` emulations: real Meta numbers from the plan's sample payslips for variant B, synthesized realistic variant A and bonus-sacrificed samples.

0001_initial.py reformat is yapf cleanup touched during the session's format pass; not a behavior change.

## Test Plan

### Automated

```
$ poetry run pytest
============================= test session starts ==============================
collected 53 items

tests/test_extractor.py .....                                            [  9%]
tests/test_meta_uk_parser.py ...........                                 [ 30%]
tests/test_paperless.py ......                                           [ 41%]
tests/test_processor.py ..............                                   [ 67%]
tests/test_schema.py ....                                                [ 75%]
tests/test_tax_year.py ........                                          [ 90%]
tests/test_webhook.py .....                                              [100%]

============================== 53 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
$ poetry run yapf --style pyproject.toml --diff --recursive payslip_ingest tests
(no output: all files are yapf-clean)
```

### Manual Verification

Smoke-test the parser against a real Meta payslip PDF on the deploy host:

```
# After 0003 migration applied to prod DB
$ poetry run python -c "
from payslip_ingest.parsers import parse_meta_uk
import subprocess
text = subprocess.check_output(['pdftotext', '-layout', '/path/to/real.pdf', '-']).decode()
p = parse_meta_uk(text)
print(p.model_dump_json(indent=2))
"
```

Expected: JSON with salary/bonus/rsu_vest/pension_sacrifice populated and `validate_totals(p)` returning True.

## Reproduce locally

1. `cd payslip-ingest && poetry install`
2. `poetry run pytest tests/test_meta_uk_parser.py -v`
3. Expected: 11 tests pass, each fixture validates totals within 2p.

Closes: code-un1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:53:52 +00:00
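The column-slicing step described in the commit message (boundaries from the header row's anchor positions, three cells per row, last numeric token per cell) could look roughly like the sketch below. It is an illustration only, not the parser's actual code: `column_starts`, `slice_row`, the sample header and the sample row are all hypothetical, and the amount pattern is the simpler minus-sign form.

```python
import re
from decimal import Decimal
from typing import Optional

# Simplified amount pattern (leading-minus notation only), for illustration.
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}")


class ParserError(ValueError):
    """Stand-in error type: raised when a header anchor is missing."""


def column_starts(header: str) -> tuple[int, int, int]:
    """Return the character offsets of the three header anchors."""
    starts = []
    for anchor in ("Payments", "Deductions", "Year to Date"):
        pos = header.find(anchor)
        if pos < 0:
            raise ParserError(f"missing header anchor: {anchor!r}")
        starts.append(pos)
    return starts[0], starts[1], starts[2]


def slice_row(row: str, starts: tuple[int, int, int]) -> list[Optional[Decimal]]:
    """Cut a data row into 3 cells and keep the last amount found in each."""
    # The first cell runs from column 0 so that row labels are never clipped.
    bounds = [0, starts[1], starts[2], None]
    amounts: list[Optional[Decimal]] = []
    for lo, hi in zip(bounds, bounds[1:]):
        tokens = AMOUNT_RE.findall(row[lo:hi])
        amounts.append(Decimal(tokens[-1].replace(",", "")) if tokens else None)
    return amounts


# Hypothetical pdftotext-style lines, padded so the columns line up.
header = "Payments".ljust(30) + "Deductions".ljust(30) + "Year to Date"
row = (
    "Salary  1.00  5,000.00".ljust(30)
    + "PAYE Tax  1,200.00".ljust(30)
    + "Taxable Pay  48,000.00"
)
cells = slice_row(row, column_starts(header))
```

Taking the last token per cell is what lets a units column such as `1.00` sit in front of the amount without being mistaken for it.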
"""Regex-based Meta UK payslip parser.
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)

## Context

The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely: Claude fallback hit errors on them and the rows never landed.

Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides: net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold.
- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` matches all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)`; variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical; only the RSU-label set and date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom.
- RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C).

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE + re-run backfill: expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows.
2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00

Meta UK payslips come in three layout variants across 2019-2026:

- **Variant A** (2019 to mid-2022, seen as `Facebook UK Ltd`):
  Single-column `Description | This Period | This Year` layout. Parenthesized
  negatives: `(152.90)` = -152.90. Date format `Date : 31 Aug 2021`. RSU
  labels: `RSU Gain Taxable`, `RSU Gain Nicable`, `RSU Net Cash UK`, plus a
  matching `RSU Net Gain` deduction. BIK items (Private Dental/Medical,
  EE Discount) appear as both earnings and deductions.

- **Variant C** (late 2022 to 2023, `Facebook UK Limited`):
  Side-by-side `Payments | Deductions | Year To Date` (capital "To"). Date
  format `Pay Date : 30.11.2022` (dots). `Company Name : Facebook UK Limited`
  preamble. RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and
  still have the `RSU Net Gain` offset.

- **Variant B** (2024+, `Facebook UK Limited`):
  Side-by-side `Payments | Deductions | Year to Date` (lowercase "to"). Date
  format `Pay Date: 27/02/2026` (slashes). RSU labels are `RSU Tax Offset`
  + `RSU Excs Refund`; there is NO matching offset deduction, so the vest
  grosses up Taxable Pay and PAYE is charged on the grossed-up figure.

The parser returns `ExtractedPayslip`. On any structural miss it raises
`ParserError` so the caller falls back to ClaudeExtractor.
"""
import re
from datetime import date, datetime
from decimal import Decimal
from payslip_ingest.schema import ExtractedPayslip


class ParserError(ValueError):
    """Raised when the Meta UK template cannot be matched."""
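On the caller side, the "raise `ParserError`, then fall back" contract might be wired up roughly as follows. This is a sketch under assumptions: the real `processor.py` is not shown here, so `extract_with_fallback` and the two callables are hypothetical, and `ParserError` is redefined only to keep the snippet self-contained. The `"meta_uk_regex"` / `"claude"` labels are the ones the commit message gives for the `extractor` field.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ParserError(ValueError):
    """Stand-in for the parser's error type, repeated so the sketch runs alone."""


def extract_with_fallback(
    text: str,
    regex_parser: Callable[[str], T],
    fallback: Callable[[str], T],
) -> tuple[T, str]:
    """Try the regex parser first; on ParserError use the fallback extractor.

    Returns the extracted value plus a label naming which path produced it.
    """
    try:
        return regex_parser(text), "meta_uk_regex"
    except ParserError:
        # Only a structural miss triggers the fallback; other exceptions propagate.
        return fallback(text), "claude"


def _always_fails(text: str) -> dict:
    """Hypothetical parser that never recognises its input."""
    raise ParserError("employer not found")


result, extractor = extract_with_fallback(
    "not a Meta payslip", _always_fails, lambda t: {"source": "llm"}
)
```

Catching only `ParserError` keeps genuine bugs (a `KeyError`, say) visible instead of silently routing them to the slower path.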
# Two amount notations:
# "-1,234.56" (leading minus, slash-date era) and "(1,234.56)" (variant A, parenthesized)
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")
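The commit message says parenthesized negatives are normalized in `_to_decimal`, whose body is not visible in this part of the file. A minimal sketch of that normalization, using a hypothetical `to_decimal` helper and a made-up sample line, might be:

```python
import re
from decimal import Decimal

# Same pattern as above, repeated so the sketch is self-contained.
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}|\(\d{1,3}(?:,\d{3})*\.\d{2}\)")


def to_decimal(token: str) -> Decimal:
    """Convert an AMOUNT_RE match to Decimal; "(x)" is treated as -x."""
    negative = token.startswith("(") and token.endswith(")")
    # Drop the accounting parentheses and thousands separators before parsing.
    value = Decimal(token.strip("()").replace(",", ""))
    return -value if negative else value


# Hypothetical line mixing both notations.
tokens = AMOUNT_RE.findall("AE Pension EE   (152.90)   1,234.56   -10.00")
amounts = [to_decimal(t) for t in tokens]
```

Because the pattern contains only non-capturing groups, `findall` returns the full matched tokens, parentheses included, which is what lets the helper detect the sign.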
# Pay Date / Date — three accepted formats:
# "Pay Date: 27/02/2026"
# "Pay Date : 30.11.2022"
# "Date : 31 Aug 2021"
PAY_DATE_SLASH_RE = re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})")
PAY_DATE_DOT_RE = re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})")
PAY_DATE_WORD_RE = re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})")
PERIOD_START_RE = re.compile(r"Period Start\s*:\s*(\d{2}/\d{2}/\d{4})")
PERIOD_END_RE = re.compile(r"Period End\s*:\s*(\d{2}/\d{2}/\d{4})")
EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")
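The commit message states that `_parse_date` tries the slash, dot and word formats in that order. The function itself is not shown in this excerpt, so the following is only a sketch of one way to do it: `parse_pay_date` and the `_FORMATS` table are hypothetical, and the `%b` directive assumes English month abbreviations (the default C locale).

```python
import re
from datetime import date, datetime

# Patterns repeated from above so the sketch is self-contained.
PAY_DATE_SLASH_RE = re.compile(r"Pay Date\s*:\s*(\d{2}/\d{2}/\d{4})")
PAY_DATE_DOT_RE = re.compile(r"Pay Date\s*:\s*(\d{2}\.\d{2}\.\d{4})")
PAY_DATE_WORD_RE = re.compile(r"\bDate\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})")

# Tried in order: slash (variant B), dot (variant C), word (variant A).
_FORMATS = (
    (PAY_DATE_SLASH_RE, "%d/%m/%Y"),
    (PAY_DATE_DOT_RE, "%d.%m.%Y"),
    (PAY_DATE_WORD_RE, "%d %b %Y"),
)


def parse_pay_date(text: str) -> date:
    """Return the first pay date found, or raise ValueError if none matches."""
    for pattern, fmt in _FORMATS:
        match = pattern.search(text)
        if match:
            # Collapse runs of whitespace that pdftotext may leave in the date.
            raw = " ".join(match.group(1).split())
            return datetime.strptime(raw, fmt).date()
    raise ValueError("no pay date found")


variant_b = parse_pay_date("Pay Date: 27/02/2026")
variant_c = parse_pay_date("Pay Date : 30.11.2022")
variant_a = parse_pay_date("Date : 31  Aug 2021")
```

Ordering matters because the word pattern anchors on a bare `Date`, which also occurs inside `Pay Date`; trying the two stricter patterns first keeps the intent explicit.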
def parse_meta_uk(text: str) -> ExtractedPayslip:
    if not text.strip():
        raise ParserError("empty text")
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023) ## Context The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely — Claude fallback hit errors on them and the rows never landed. Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly: - **Variant A** (2019 → mid-2022): single-column Description/This Period/ This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides — net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold. - **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions| Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble. Variant B (2024+) is unchanged. ## This change ### Parser refactor - `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches all three eras. - `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`. - `_parse_date` tries three formats in order: slash (B), dot (C), word (A). - `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical — only the RSU-label set and date format differ. 
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
    employer_match = EMPLOYER_RE.search(text)
    if not employer_match:
        raise ParserError("does not look like a Meta UK payslip")
    employer = employer_match.group(0)
    lines = text.splitlines()
    if _is_variant_b_or_c(lines):
        return _parse_variant_bc(text, lines, employer)
    if _is_variant_a(lines):
        return _parse_variant_a(text, lines, employer)
    raise ParserError("neither side-by-side nor single-column header found")
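
The commit notes for this function mention that amounts may appear as `-1,234.56` or as accounting-style `(1,234.56)`, and that dates are tried in slash, dot, then word order. A minimal sketch of that normalization is shown below. It is not the repository's implementation: `AMOUNT_SKETCH_RE`, `to_decimal_sketch`, and `parse_date_sketch` are hypothetical names, and the regex and the exact `strptime` formats are assumptions inferred from the examples in the notes (`30.11.2022`, `31 Aug 2021`). The `%b` month parsing also assumes an English locale.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Hypothetical pattern (assumption): optional wrapping parentheses or a
# leading minus, digits with optional thousands commas, two decimal places.
AMOUNT_SKETCH_RE = re.compile(r"\(?-?[\d,]+\.\d{2}\)?")


def to_decimal_sketch(token: str) -> Decimal:
    """Normalize '1,234.56', '-1,234.56' or '(1,234.56)' to a Decimal."""
    # Parentheses on both sides mark an accounting-style negative.
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace(",", "")
    value = Decimal(cleaned)
    return -value if negative else value


def parse_date_sketch(raw: str) -> date:
    """Try slash, dot, then word formats (formats are assumed, not confirmed)."""
    for fmt in ("%d/%m/%Y", "%d.%m.%Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # fall through to the next candidate format
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Raising on an unrecognized date, rather than returning a default, mirrors the parser's stated preference for failing loudly so the caller can fall back to another extractor.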


def _is_variant_b_or_c(lines: list[str]) -> bool:
    return any("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)
               for line in lines)


def _is_variant_a(lines: list[str]) -> bool:
    return any("Description" in line and "This Period" in line and "This Year" in line
               for line in lines)


def _to_decimal(s: str) -> Decimal:
    """Convert an amount string to Decimal; accounting-style `(1,234.56)` means -1234.56."""
    s = s.strip()
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return Decimal(s.replace(",", ""))
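
The normalization in `_to_decimal` can be tried in isolation. The snippet below is a minimal standalone sketch, not part of the module: `to_decimal_demo` is a hypothetical copy of the same logic, and the sample strings are invented for illustration.

```python
from decimal import Decimal


def to_decimal_demo(s: str) -> Decimal:
    """Hypothetical standalone copy of the `_to_decimal` logic, for illustration only."""
    s = s.strip()
    # Accounting-style negatives: "(152.90)" becomes "-152.90".
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    # Drop thousands separators before handing the string to Decimal.
    return Decimal(s.replace(",", ""))


# Invented sample inputs covering the plain, parenthesized, and signed forms.
samples = ["1,234.56", "(152.90)", "-1,234.56", "  0.00 "]
parsed = [to_decimal_demo(x) for x in samples]
```

Using `Decimal` rather than `float` keeps pence exact, which matters when totals are validated to within a few pence.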


def _parse_date(text: str) -> date:
    """Try each supported format in order (slash, dot, word); the first match wins."""
    m = PAY_DATE_SLASH_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d/%m/%Y").date()
    m = PAY_DATE_DOT_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d.%m.%Y").date()
    m = PAY_DATE_WORD_RE.search(text)
    if m:
        # pdftotext -layout can pad the date with extra spaces; collapse them first.
        raw = re.sub(r"\s+", " ", m.group(1)).strip()
        return datetime.strptime(raw, "%d %b %Y").date()
    raise ParserError("pay date not found")
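
The three-format fallback in `_parse_date` can be sketched standalone as follows. This is an illustration under stated assumptions: the three regexes are hypothetical stand-ins (the module's real `PAY_DATE_*_RE` patterns are defined elsewhere and may differ), the `Pay Date :` label on the slash format is a guess, and `ValueError` stands in for `ParserError` to keep the snippet self-contained.

```python
import re
from datetime import date, datetime

# Hypothetical patterns; the module's real PAY_DATE_*_RE definitions may differ.
SLASH_RE = re.compile(r"Pay Date\s*:?\s*(\d{2}/\d{2}/\d{4})")
DOT_RE = re.compile(r"Pay Date\s*:?\s*(\d{2}\.\d{2}\.\d{4})")
WORD_RE = re.compile(r"Date\s*:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})")


def parse_date_demo(text: str) -> date:
    """Try slash, then dot, then word format; raise if none matches."""
    m = SLASH_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d/%m/%Y").date()
    m = DOT_RE.search(text)
    if m:
        return datetime.strptime(m.group(1), "%d.%m.%Y").date()
    m = WORD_RE.search(text)
    if m:
        # Collapse runs of whitespace so strptime sees single spaces.
        raw = re.sub(r"\s+", " ", m.group(1)).strip()
        return datetime.strptime(raw, "%d %b %Y").date()
    raise ValueError("pay date not found")  # stand-in for ParserError


slash_date = parse_date_demo("Pay Date : 28/02/2026")
dot_date = parse_date_demo("Pay Date : 30.11.2022")
word_date = parse_date_demo("Date :   31  Aug 2021")
```

Note that `%b` depends on the process locale, so the word format assumes English month abbreviations.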
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023) ## Context The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely — Claude fallback hit errors on them and the rows never landed. Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly: - **Variant A** (2019 → mid-2022): single-column Description/This Period/ This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides — net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold. - **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions| Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble. Variant B (2024+) is unchanged. ## This change ### Parser refactor - `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches all three eras. - `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`. - `_parse_date` tries three formats in order: slash (B), dot (C), word (A). - `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical — only the RSU-label set and date format differ. 
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
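A minimal sketch of two behaviours the commit message above describes: normalizing accounting-style parenthesized negatives, and trying the three date formats in order. The real `AMOUNT_RE`, `_to_decimal` and `_parse_date` bodies are not shown in this part of the file, so every name ending in `_sketch` or `_SKETCH`, the regex itself, and the day-first slash format are assumptions made for illustration only.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Hypothetical pattern (assumption): accepts "1,234.56", "-1,234.56" and "(1,234.56)".
AMOUNT_SKETCH_RE = re.compile(r"\(\d[\d,]*\.\d{2}\)|-?\d[\d,]*\.\d{2}")

# Assumed formats: slash (variant B), dot (variant C), word (variant A).
DATE_FORMATS_SKETCH = ("%d/%m/%Y", "%d.%m.%Y", "%d %b %Y")


def to_decimal_sketch(token: str) -> Decimal:
    """Convert an amount token to Decimal; "(152.90)" becomes -152.90."""
    token = token.strip()
    negative = token.startswith("(") and token.endswith(")")
    if negative:
        token = token[1:-1]
    value = Decimal(token.replace(",", ""))
    return -value if negative else value


def parse_date_sketch(raw: str) -> Optional[date]:
    """Try each known format in order; return None if none of them match."""
    for fmt in DATE_FORMATS_SKETCH:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

In this sketch an unrecognized date returns `None` instead of raising, which would let a caller decide whether to raise `ParserError` and fall back to Claude.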
def _find_match(text: str, pattern: re.Pattern[str]) -> str | None:
    m = pattern.search(text)
    return m.group(1) if m else None


def _last_amount(segment: str) -> tuple[str, Decimal | None]:
    """Return (label, rightmost numeric amount)."""
    matches = list(AMOUNT_RE.finditer(segment))
    if not matches:
        return segment.strip(), None
    last = matches[-1]
    label = segment[:last.start()].strip()
    return label, _to_decimal(last.group())
# --------------------------------------------------------------------------
# Variant B / C — side-by-side Payments | Deductions | Year to/To Date
# --------------------------------------------------------------------------
PAYMENTS_KNOWN = {
    "Salary",
    "Perform Bonus",
    "Bonus",
    "AE Pension EE",
    "AE Pension",
    "RSU Tax Offset",
    "RSU Excs Refund",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Net Cash",
    "RSU Net Cash UK",
}

DEDUCTIONS_KNOWN = {
    "Tax paid",
    "Tax",
    "Employee NIC",
    "National Insurance",
    "Student Loans",
    "Student Loan",
    "RSU Net Gain",
}

RSU_VEST_LABELS = {
    "RSU Tax Offset",
    "RSU Excs Refund",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Net Cash",
    "RSU Net Cash UK",
}
def _parse_variant_bc(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
    header_idx, d_col, y_col = _find_bc_header(lines)
    payments, payments_order, deductions = _collect_bc_rows(lines, header_idx, d_col, y_col)
    gross_pay, net_pay = _parse_bc_totals_row(lines, header_idx, d_col, y_col)
    summary = _parse_bc_summary_block(lines)
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
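The amount and date handling described in the commit message above can be sketched as follows. This is a minimal illustration, not the parser's actual code: `AMOUNT_SKETCH_RE`, `to_decimal_sketch`, `DATE_FORMATS_SKETCH`, and `parse_date_sketch` are hypothetical names, and the real `AMOUNT_RE`, `_to_decimal`, and `_parse_date` may differ in detail. It assumes the three date shapes named in the message (slash, dot, word) and an English locale for month abbreviations.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Hypothetical pattern: accepts "-1,234.56", "1,234.56", and "(1,234.56)".
AMOUNT_SKETCH_RE = re.compile(r"\(\d[\d,]*\.\d{2}\)|-?\d[\d,]*\.\d{2}")


def to_decimal_sketch(token: str) -> Decimal:
    """Normalize an amount token; parenthesized values become negative."""
    token = token.strip()
    negative = token.startswith("(") and token.endswith(")")
    if negative:
        token = token[1:-1]
    value = Decimal(token.replace(",", ""))
    return -value if negative else value


# Assumed order from the commit message: slash (B), dot (C), word (A).
DATE_FORMATS_SKETCH = ("%d/%m/%Y", "%d.%m.%Y", "%d %b %Y")


def parse_date_sketch(raw: str) -> Optional[date]:
    """Try each known date format in order; return None if none match."""
    for fmt in DATE_FORMATS_SKETCH:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` on a miss leaves the decision to the caller, which could raise `ParserError` so the Claude fallback runs instead of storing a wrong date.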
ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")
rsu_vest = sum((payments.get(label, Decimal("0")) for label in RSU_VEST_LABELS),
start=Decimal("0"))
rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))
income_tax = deductions.get("Tax paid", deductions.get("Tax", Decimal("0")))
nic = deductions.get("Employee NIC", deductions.get("National Insurance", Decimal("0")))
student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
other_deductions = {k: v for k, v in deductions.items() if k not in DEDUCTIONS_KNOWN}
del payments_order # retained for future debugging; not used in validation
pay_date = _parse_date(text)
period_start_s = _find_match(text, PERIOD_START_RE)
period_end_s = _find_match(text, PERIOD_END_RE)
period_start = datetime.strptime(period_start_s, "%d/%m/%Y").date() if period_start_s else None
period_end = datetime.strptime(period_end_s, "%d/%m/%Y").date() if period_end_s else None
    return ExtractedPayslip(
        pay_date=pay_date,
        pay_period_start=period_start,
        pay_period_end=period_end,
        employer=employer,
        currency="GBP",
        gross_pay=gross_pay,
        income_tax=income_tax,
        national_insurance=nic,
        pension_employee=Decimal("0"),
        pension_employer=Decimal("0"),
        student_loan=student_loan,
        rsu_vest=rsu_vest,
        rsu_offset=rsu_offset,
        salary=payments.get("Salary", Decimal("0")),
        bonus=payments.get("Perform Bonus", payments.get("Bonus", Decimal("0"))),
        pension_sacrifice=pension_sacrifice,
        taxable_pay=summary.get("taxable_pay"),
        ytd_tax_paid=summary.get("ytd_tax_paid"),
        ytd_taxable_pay=summary.get("ytd_taxable_pay"),
        ytd_gross=summary.get("ytd_gross"),
        other_deductions=other_deductions,
        net_pay=net_pay,
    )


def _find_bc_header(lines: list[str]) -> tuple[int, int, int]:
    for i, line in enumerate(lines):
        if ("Payments" in line and "Deductions" in line and re.search(r"Year [Tt]o Date", line)):
            # Columns anchored on left edge of "Deductions" / "Year [Tt]o Date"
            ytd_match = re.search(r"Year [Tt]o Date", line)
            assert ytd_match is not None
            return i, line.index("Deductions"), ytd_match.start()
    raise ParserError("variant B/C header not found")
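The header detection above can be tried in isolation. The following is a minimal sketch, not the parser's own code: `header_columns`, `YTD_RE`, and the two sample header strings are hypothetical names and made-up inputs, and real `pdftotext -layout` spacing will differ.

```python
import re
from typing import Optional

# Same pattern the parser searches for; the constant name is an assumption.
YTD_RE = re.compile(r"Year [Tt]o Date")


def header_columns(line: str) -> Optional[tuple[int, int]]:
    """Return (deductions_col, ytd_col) for a B/C header line, else None."""
    ytd_match = YTD_RE.search(line)
    if "Payments" in line and "Deductions" in line and ytd_match:
        # Left edges of the two right-hand columns become the slice anchors.
        return line.index("Deductions"), ytd_match.start()
    return None


# Made-up headers: variant B uses "to", variant C uses "To".
header_b = "Payments" + " " * 32 + "Deductions" + " " * 20 + "Year to Date"
header_c = "Payments" + " " * 32 + "Deductions" + " " * 20 + "Year To Date"

print(header_columns(header_b))  # (40, 70)
print(header_columns(header_c))  # (40, 70)
print(header_columns("Description    This Period    This Year"))  # None
```

Returning `None` keeps the sketch side-effect free; the parser itself raises `ParserError` when no header line is found, so that the caller can fall back.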
def _collect_bc_rows(
    lines: list[str],
    header_idx: int,
    d_col: int,
    y_col: int,
) -> tuple[dict[str, Decimal], list[tuple[str, Decimal]], dict[str, Decimal]]:
    """Collect the rows between the header and the ``Total Payment`` line.

    Returns payments by label, payments in document order, and deductions by label.
    """
    payments: dict[str, Decimal] = {}
    order: list[tuple[str, Decimal]] = []
    deductions: dict[str, Decimal] = {}
    for i in range(header_idx + 1, len(lines)):
        line = lines[i].rstrip()
        if "Total Payment" in line:
            return payments, order, deductions
        if not line.strip():
            continue
        # Slice the row at the header-derived column anchors; the YTD cell is not used here.
        p_seg = line[:d_col] if len(line) > d_col else line
        d_seg = line[d_col:y_col] if len(line) > d_col else ""
        p_label, p_amount = _last_amount(p_seg)
        if p_label and p_amount is not None:
            payments[p_label] = p_amount
            order.append((p_label, p_amount))
        d_label, d_amount = _last_amount(d_seg)
        if d_label and d_amount is not None:
            # RSU Net Gain can show up as a negative because of its duplication
            # on the YTD side; normalize it to an absolute value on the deductions side.
            if d_label == "RSU Net Gain":
                d_amount = abs(d_amount)
            deductions[d_label] = d_amount
    return payments, order, deductions
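To make the row slicing in `_collect_bc_rows` concrete, here is a minimal sketch of cutting one side-by-side row at the column anchors and taking the last numeric token of each cell. The real `_last_amount` and `AMOUNT_RE` are not shown in this part of the file, so `last_amount`, `AMOUNT`, the sample row, and the anchor values 40 and 70 are all assumptions made for illustration.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical amount pattern: optional parentheses or minus sign,
# thousands separators, exactly two decimals.
AMOUNT = re.compile(r"\(?-?\d[\d,]*\.\d{2}\)?")


def last_amount(cell: str) -> tuple[str, Optional[Decimal]]:
    """Split a cell into (label, amount), using the last numeric token."""
    matches = list(AMOUNT.finditer(cell))
    if not matches:
        return cell.strip(), None
    m = matches[-1]
    raw = m.group()
    # Accounting-style "(152.90)" and "-152.90" both mean a negative amount.
    negative = raw.startswith("(") or "-" in raw
    value = Decimal(raw.strip("()").replace(",", "").lstrip("-"))
    return cell[:m.start()].strip(), (-value if negative else value)


# Made-up row laid out on assumed anchors d_col=40 and y_col=70.
d_col, y_col = 40, 70
row = (
    "Salary".ljust(30) + "5,000.00".rjust(10)
    + "PAYE Tax".ljust(20) + "1,200.00".rjust(10)
    + "Taxable Pay".ljust(15) + "48,000.00"
)

print(last_amount(row[:d_col]))       # payments cell
print(last_amount(row[d_col:y_col]))  # deductions cell
```

Taking the last token rather than the first means that digits inside a label do not get mistaken for the amount, which is the behaviour the commit message describes for each cell.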
def _parse_bc_totals_row(
    lines: list[str],
    header_idx: int,
    d_col: int,
    y_col: int,
) -> tuple[Decimal, Decimal]:
    del y_col  # "Net Pay:" aligns with the Amount column, not the left edge of YTD
    for i in range(header_idx + 1, len(lines)):
        line = lines[i]
        if "Total Payment" not in line:
            continue
        p_seg = line[:d_col] if len(line) > d_col else line
        _, gross_pay = _last_amount(p_seg)
        net_pay_idx = line.find("Net Pay")
        if net_pay_idx < 0:
            raise ParserError("Net Pay missing from totals row")
        _, net_pay = _last_amount(line[net_pay_idx:])
        if gross_pay is None:
            raise ParserError("Total Payment amount missing")
        if net_pay is None:
            raise ParserError("Net Pay amount missing from totals row")
        return gross_pay, net_pay
    raise ParserError("totals row not found")
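# Illustrative sketch only, not part of the parser. It mirrors the totals-row
# logic above on an invented sample row: the gross is the last amount left of
# the Deductions column, and the net is the last amount after "Net Pay".
# SKETCH_AMOUNT_RE, sketch_last_amount and sketch_totals are hypothetical
# stand-ins; the module's real AMOUNT_RE and _last_amount are not shown here
# and may behave differently.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed amount pattern: optional parentheses or minus sign, two decimals.
SKETCH_AMOUNT_RE = re.compile(r"\(?-?\d[\d,]*\.\d{2}\)?")


def sketch_last_amount(segment: str) -> Optional[Decimal]:
    """Return the last amount-looking token in segment, or None if absent."""
    tokens = SKETCH_AMOUNT_RE.findall(segment)
    if not tokens:
        return None
    raw = tokens[-1]
    # Accounting-style "(152.90)" and a leading "-" both mean negative.
    negative = raw.startswith("(") or raw.lstrip("(").startswith("-")
    digits = raw.strip("()").lstrip("-").replace(",", "")
    value = Decimal(digits)
    return -value if negative else value


def sketch_totals(line: str, d_col: int) -> tuple[Optional[Decimal], Optional[Decimal]]:
    """Gross from the Payments cell (left of d_col), net from the text after 'Net Pay'."""
    gross = sketch_last_amount(line[:d_col])
    net_idx = line.find("Net Pay")
    net = sketch_last_amount(line[net_idx:]) if net_idx >= 0 else None
    return gross, net


# Invented row; the column offset is derived from the row itself for the demo,
# whereas the parser takes it from the header row's anchor positions.
row = "Total Payment        5,000.00     Total Deduction   1,500.00    Net Pay:   3,500.00"
d_col = row.index("Total Deduction")
gross, net = sketch_totals(row, d_col)
```

# Slicing at d_col before searching keeps the deductions total out of the
# gross, which is why the real function takes the column offset as a parameter.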
def _parse_bc_summary_block(lines: list[str]) -> dict[str, Decimal]:
    """Pull Taxable Pay (this period + YTD), Tax Paid (YTD), Total Gross (YTD)."""
    result: dict[str, Decimal] = {}
    for line in lines:
        stripped = line.lstrip()
        # On each summary line the first amount is read as this-period and
        # the second as year-to-date.
        if stripped.startswith("Taxable Pay:"):
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 1:
                result["taxable_pay"] = _to_decimal(nums[0])
            if len(nums) >= 2:
                result["ytd_taxable_pay"] = _to_decimal(nums[1])
        elif stripped.startswith("Total Gross:"):
            # Only the year-to-date figure is captured here.
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 2:
                result["ytd_gross"] = _to_decimal(nums[1])
        elif stripped.startswith("Tax Paid:"):
            # Only the year-to-date figure is captured here.
            nums = AMOUNT_RE.findall(line)
            if len(nums) >= 2:
                result["ytd_tax_paid"] = _to_decimal(nums[1])
    return result
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023)

## Context

The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely: the Claude fallback hit errors on them and the rows never landed.

Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly:

- **Variant A** (2019 → mid-2022): single-column Description/This Period/This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides: net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold.
- **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions|Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble.

Variant B (2024+) is unchanged.

## This change

### Parser refactor

- `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")`: matches all three eras.
- `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)`; variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`.
- `_parse_date` tries three formats in order: slash (B), dot (C), word (A).
- `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical, and only the RSU-label set and date format differ.
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom.
- RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C).

### Fixtures switched to real pdftotext output

Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures:

- `meta_uk_2021_08_variant_a.txt` (doc_id=43)
- `meta_uk_2022_11_variant_c.txt` (doc_id=53)

The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan.

## Tests

- 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass.
- 52 total tests across the repo, all green.

## Test Plan

### Automated

```
$ poetry run pytest
============================== 52 passed in 1.66s ==============================
$ poetry run ruff check .
All checks passed!
$ poetry run mypy .
Success: no issues found in 24 source files
```

### Manual verification (after deploy)

1. TRUNCATE + re-run backfill: expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows.
2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated.

## Reproduce locally

1. `poetry run pytest tests/test_meta_uk_parser.py -v`
2. Expected: 10 passed, each fixture validates totals to within 2p.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
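The amount and date normalization described in the parser-refactor notes above can be illustrated with a minimal sketch. The real `AMOUNT_RE`, `_to_decimal`, and `_parse_date` are not shown in this part of the file, so the pattern, the helper names `to_decimal` / `parse_date`, and the `DATE_FORMATS` tuple below are assumptions for illustration only. In particular, the day-first slash format for variant B is a guess; only the dot and word formats appear verbatim in the commit message.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Hypothetical pattern (not the module's actual AMOUNT_RE): matches either an
# accounting-style "(1,234.56)" or a plain/minus-signed "-1,234.56".
AMOUNT_RE = re.compile(r"\(\d[\d,]*\.\d{2}\)|-?\d[\d,]*\.\d{2}")


def to_decimal(token: str) -> Decimal:
    """Normalize '1,234.56', '-1,234.56' or '(1,234.56)' to a Decimal."""
    # Parentheses mark a negative amount in accounting notation.
    negative = token.startswith("(") and token.endswith(")")
    value = Decimal(token.strip("()").replace(",", ""))
    return -value if negative else value


# Assumed formats, tried in the order the commit message lists them:
# slash (variant B), dot (variant C), word (variant A).
DATE_FORMATS = ("%d/%m/%Y", "%d.%m.%Y", "%d %b %Y")


def parse_date(raw: str) -> date:
    """Return the first successful parse; raise if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Trying the formats in a fixed order keeps the detector simple, and an unparseable date raises rather than yielding a silently wrong value, which fits the parser's stated preference for falling back to Claude over returning bad data. Note that `%b` depends on the process locale, so the word format assumes English month abbreviations.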
# --------------------------------------------------------------------------
# Variant A — single-column Description | This Period | This Year
# --------------------------------------------------------------------------
VARIANT_A_PAYMENTS_KNOWN = {
    "Salary",
    "Bonus",
    "Perform Bonus",
    "Relocation Bonus",
    "AE Pension EE",
    "AE Pension",
    "Laundry Expense",
    "Transportation Allowance",
    "EE Edu Assist",
    "RSU Gain Taxable",
    "RSU Gain Nicable",
    "RSU Gain Taxabl",
    "RSU Gain Nicabl",
    "RSU Net Cash",
    "RSU Net Cash UK",
    # BIK earnings mirrored on the deduction side; we exclude them from
    # bonus/other_earnings so they don't double-count.
    "Private Dental Insurance",
    "Private Medical Insurance",
    "EE Discount BIK",
}
VARIANT_A_DEDUCTIONS_KNOWN = {
"Tax",
"National Insurance",
"Student Loans",
"Student Loan",
"RSU Net Gain",
"EE Discount BIK",
}
VARIANT_A_RSU_LABELS = {
"RSU Gain Taxable",
"RSU Gain Nicable",
"RSU Gain Taxabl",
"RSU Gain Nicabl",
"RSU Net Cash",
"RSU Net Cash UK",
}
# "Taxable Pay : This Period £15323.16 : To Date £52446.53"
TAXABLE_PAY_A_RE = re.compile(r"Taxable Pay\s*:\s*This Period\s*£([\d,]+\.\d{2})")
# Trailing "Net Pay" line; the amount may carry a leading minus sign.
NET_PAY_A_RE = re.compile(r"Net Pay\s+(-?[\d,]+\.\d{2})")
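A minimal sketch of how the two variant A footer patterns above might be exercised in isolation. The taxable-pay line is the sample from the comment above; the `Net Pay` line and the `to_decimal` helper are hypothetical stand-ins for illustration, not the module's own `_to_decimal`.

```python
import re
from decimal import Decimal

# Same patterns as defined in the module.
TAXABLE_PAY_A_RE = re.compile(r"Taxable Pay\s*:\s*This Period\s*£([\d,]+\.\d{2})")
NET_PAY_A_RE = re.compile(r"Net Pay\s+(-?[\d,]+\.\d{2})")


def to_decimal(raw: str) -> Decimal:
    """Hypothetical helper: drop thousands separators, then parse exactly."""
    return Decimal(raw.replace(",", ""))


# Sample footer line taken from the comment in the source.
footer = "Taxable Pay : This Period £15323.16 : To Date £52446.53"
# Assumed shape of the trailing net-pay line (illustrative amount).
net_line = "Net Pay        8,123.45"

taxable_match = TAXABLE_PAY_A_RE.search(footer)
net_match = NET_PAY_A_RE.search(net_line)
assert taxable_match is not None and net_match is not None

# Only the "This Period" figure is captured; the "To Date" figure is ignored.
taxable_pay = to_decimal(taxable_match.group(1))
net_pay = to_decimal(net_match.group(1))
```

Using `Decimal` rather than `float` keeps pence exact, which matters when totals are validated to within a couple of pence.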
v2: regex parser for Meta UK template + accurate RSU tax attribution ## Context v1 shipped a Claude Haiku-based extractor that validated only 10/71 backfilled rows. Haiku fumbles the arithmetic on pension salary-sacrifice, conflates RSU vest with regular earnings, and occasionally misreads YTD vs this-period columns — so 86% of rows land with validated=false and the downstream dashboards under-report take-home. Meta UK uses a stable two-variant template (pre/post 2022-01-31 boundary), so a regex parser is both faster (ms vs. 30-90s + $0.01-0.05/call) and more accurate. v2 introduces that parser as the primary path, keeps Claude as the fallback for non-Meta payslips, and surfaces new fields the dashboard needs to attribute PAYE between cash salary and RSU vests correctly. ## This change ### Parser (new) `payslip_ingest/parsers/meta_uk.py` detects the layout variant by header presence: - **Variant A** (pre-2022): vertical Description/This Period/This Year. `AE Pension EE` is a positive deduction against a pre-sacrifice gross — maps to `pension_employee` for the existing validation formula to hold. - **Variant B** (post-2022): side-by-side Payments | Deductions | Year to Date. `AE Pension EE` is NEGATIVE in Payments (salary sacrifice) — maps to `pension_sacrifice` and is already netted into Total Payment. `rsu_vest = RSU Tax Offset + RSU Excs Refund` (Meta's template inflates Taxable Pay without using a matching offset deduction). Column boundaries come from the header row's anchor positions; each data row slices into 3 cells and the last numeric token per cell is the amount. Anchor misses raise ParserError so the caller falls back to Claude rather than silently returning bad data. 
### New fields Schema + DB + Claude prompt gain: - `salary`, `bonus`, `pension_sacrifice` — earnings decomposition for the dashboard's bonus-sacrifice visibility and earnings-breakdown chart - `taxable_pay`, `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross` — powers the YTD-effective-rate method of attributing cash tax vs RSU tax, which is the only method that's accurate month-to-month All new columns default to 0 / null so v1 rows continue to round-trip. ### Orchestration processor.py tries `parse_meta_uk(pdftotext(pdf))` first. On success the result goes straight to the DB — zero Claude tokens spent, extraction in milliseconds. On ParserError it falls through to ClaudeExtractor as before. ProcessResult gains an `extractor` field ("meta_uk_regex" | "claude") so backfill logs show the hit rate. ## Tests - `test_meta_uk_parser.py` — 11 tests covering variant A, variant B (standard + bonus month + bonus-sacrificed month), malformed inputs, and end-to-end totals validation for all 4 golden fixtures. - `test_processor.py` — 2 new tests proving the regex-first short-circuit and the Claude fallback on non-Meta inputs. Fixtures under `tests/fixtures/` are hand-crafted `pdftotext -layout` emulations — real Meta numbers from the plan's sample payslips for variant B, synthesized realistic variant A and bonus-sacrificed samples. 0001_initial.py reformat is yapf cleanup touched during the session's format pass; not a behavior change. ## Test Plan ### Automated ``` $ poetry run pytest ============================= test session starts ============================== collected 53 items tests/test_extractor.py ..... [ 9%] tests/test_meta_uk_parser.py ........... [ 30%] tests/test_paperless.py ...... [ 41%] tests/test_processor.py .............. [ 67%] tests/test_schema.py .... [ 75%] tests/test_tax_year.py ........ [ 90%] tests/test_webhook.py ..... [100%] ============================== 53 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! 
$ poetry run mypy . Success: no issues found in 24 source files $ poetry run yapf --style pyproject.toml --diff --recursive payslip_ingest tests (no output — all files are yapf-clean) ``` ### Manual Verification Smoke-test the parser against a real Meta payslip PDF on the deploy host: ``` # After 0003 migration applied to prod DB $ poetry run python -c " from payslip_ingest.parsers import parse_meta_uk import subprocess text = subprocess.check_output(['pdftotext', '-layout', '/path/to/real.pdf', '-']).decode() p = parse_meta_uk(text) print(p.model_dump_json(indent=2)) " ``` Expected: JSON with salary/bonus/rsu_vest/pension_sacrifice populated and `validate_totals(p)` returning True. ## Reproduce locally 1. `cd payslip-ingest && poetry install` 2. `poetry run pytest tests/test_meta_uk_parser.py -v` 3. Expected: 11 tests pass, each fixture validates totals within 2p. Closes: code-un1 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:53:52 +00:00
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023) ## Context The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely — Claude fallback hit errors on them and the rows never landed. Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly: - **Variant A** (2019 → mid-2022): single-column Description/This Period/ This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides — net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold. - **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions| Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble. Variant B (2024+) is unchanged. ## This change ### Parser refactor - `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches all three eras. - `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`. - `_parse_date` tries three formats in order: slash (B), dot (C), word (A). - `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical — only the RSU-label set and date format differ. 
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
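The amount and date handling described in the commit message above can be sketched as follows. This is a minimal illustration, not the module's actual code: the names `to_decimal_sketch`, `parse_date_sketch`, and `DATE_FORMATS_SKETCH` are hypothetical, and the day-first ordering of the slash format is an assumption, since the message only names the separators.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Assumed format strings: slash (variant B), dot (variant C), word (variant A).
# Day-first ordering for the slash form is an assumption.
DATE_FORMATS_SKETCH = ("%d/%m/%Y", "%d.%m.%Y", "%d %b %Y")


def to_decimal_sketch(token: str) -> Decimal:
    """Normalize '1,234.56', '-1,234.56' and '(1,234.56)' to a Decimal."""
    token = token.strip()
    # Accounting-style negatives are wrapped in parentheses.
    negative = token.startswith("(") and token.endswith(")")
    value = Decimal(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_date_sketch(token: str) -> Optional[date]:
    """Try each known date format in order; return None if none match."""
    for fmt in DATE_FORMATS_SKETCH:
        try:
            return datetime.strptime(token.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Note that `%b` depends on the process locale, so the word format (`31 Aug 2021`) assumes English month abbreviations. A real parser would probably raise `ParserError` rather than return `None`, so that the caller can fall back to Claude.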
def _parse_variant_a(text: str, lines: list[str], employer: str) -> ExtractedPayslip:
    header_idx = _find_variant_a_header(lines)
    payments, deductions = _collect_a_blocks(lines, header_idx)
    gross_pay = _parse_a_gross(lines, header_idx, payments)
    net_pay = _parse_a_net(text)
    # A negative AE Pension line on the payments side is a salary sacrifice;
    # record its magnitude.
    ae_pension = payments.get("AE Pension EE", payments.get("AE Pension", Decimal("0")))
    pension_sacrifice = abs(ae_pension) if ae_pension < 0 else Decimal("0")
    # rsu_vest sums every matching payment line; the offsetting "RSU Net Gain"
    # sits on the deductions side.
    rsu_vest = sum((payments.get(label, Decimal("0")) for label in VARIANT_A_RSU_LABELS),
                   start=Decimal("0"))
    rsu_offset = deductions.get("RSU Net Gain", Decimal("0"))
    income_tax = deductions.get("Tax", Decimal("0"))
    nic = deductions.get("National Insurance", Decimal("0"))
    student_loan = deductions.get("Student Loans", deductions.get("Student Loan", Decimal("0")))
    # Deduction lines outside the known set (e.g. the deduction-side BIK copies)
    # land in other_deductions.
    other_deductions = {k: v for k, v in deductions.items() if k not in VARIANT_A_DEDUCTIONS_KNOWN}
    bonus = payments.get("Perform Bonus", payments.get("Bonus", Decimal("0")))
    # taxable_pay is optional: None when the "Taxable Pay" line is absent.
    taxable_pay_s = _find_match(text, TAXABLE_PAY_A_RE)
    taxable_pay = _to_decimal(taxable_pay_s) if taxable_pay_s else None
    pay_date = _parse_date(text)
    return ExtractedPayslip(
        pay_date=pay_date,
        pay_period_start=None,
        pay_period_end=None,
        employer=employer,
        currency="GBP",
        gross_pay=gross_pay,
        income_tax=income_tax,
        national_insurance=nic,
        pension_employee=Decimal("0"),
        pension_employer=Decimal("0"),
        student_loan=student_loan,
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023) ## Context The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely — Claude fallback hit errors on them and the rows never landed. Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly: - **Variant A** (2019 → mid-2022): single-column Description/This Period/ This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides — net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold. - **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions| Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble. Variant B (2024+) is unchanged. ## This change ### Parser refactor - `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches all three eras. - `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`. - `_parse_date` tries three formats in order: slash (B), dot (C), word (A). - `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical — only the RSU-label set and date format differ. 
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
        rsu_vest=rsu_vest,
        rsu_offset=rsu_offset,
        salary=payments.get("Salary", Decimal("0")),
        bonus=bonus,
        pension_sacrifice=pension_sacrifice,
        taxable_pay=taxable_pay,
        ytd_tax_paid=None,
        ytd_taxable_pay=None,
        ytd_gross=None,
        other_deductions=other_deductions,
        net_pay=net_pay,
    )


def _find_variant_a_header(lines: list[str]) -> int:
    for i, line in enumerate(lines):
        if "Description" in line and "This Period" in line and "This Year" in line:
            return i
    raise ParserError("variant A header not found")


def _collect_a_blocks(
    lines: list[str],
    header_idx: int,
) -> tuple[dict[str, Decimal], dict[str, Decimal]]:
    """Split variant A rows into Payments vs Deductions by the two `Total` anchors.

    Layout: header payments rows `Total <gross>` deductions rows
    `Total <deductions>` `Net Pay <net>`. We collect rows into whichever
    block we're currently in.
    """
    payments: dict[str, Decimal] = {}
    deductions: dict[str, Decimal] = {}
    block = payments
    total_count = 0
    for i in range(header_idx + 1, len(lines)):
        raw = lines[i].rstrip()
        if not raw.strip():
            continue
        stripped = raw.strip()
        if stripped.startswith(("Total ", "Total\t")):
            total_count += 1
            if total_count == 1:
                block = deductions
                continue
            if total_count == 2:
                break
        if "Net Pay" in raw:
            break
meta_uk parser: add variant A (2019-2022) + variant C (2022-2023) ## Context The initial v2 parser (commit 9741816) only handled the modern template (variant B, 2024+). Of Viktor's 73 real payslips in Paperless, 30 from 2021-07 through 2023-11 failed entirely — Claude fallback hit errors on them and the rows never landed. Investigation via `kubectl exec` + pdftotext on a sample of the failing docs revealed two previously-unseen layouts that the parser needs to handle directly: - **Variant A** (2019 → mid-2022): single-column Description/This Period/ This Year. Parenthesized negatives `(152.90)`. Date format `Date : 31 Aug 2021`. Employer is `Facebook UK Ltd` (not `Limited`). RSU lines: `RSU Gain Taxable` + `RSU Gain Nicable` + `RSU Net Cash UK` on the earnings side with a matching `RSU Net Gain` on the deductions side. BIK items (Private Dental/Medical) appear on both sides — net zero in the gross, but the deduction-side copy must land in other_deductions for the validation formula to hold. - **Variant C** (late-2022 → 2023): side-by-side Payments|Deductions| Year To Date (note capital "To", vs variant B's lowercase "to"). Date format `Pay Date : 30.11.2022` (dots, not slashes). RSU labels use the abbreviated `RSU Gain Taxabl` / `Nicabl` and still include the `RSU Net Gain` offset. `Company Name : Facebook UK Limited` preamble. Variant B (2024+) is unchanged. ## This change ### Parser refactor - `EMPLOYER_RE = re.compile(r"Facebook UK (?:Limited|Ltd)\b")` — matches all three eras. - `AMOUNT_RE` now accepts both `-1,234.56` and `(1,234.56)` — variant A's accounting-style parenthesized negatives normalize to `-1234.56` in `_to_decimal`. - `_parse_date` tries three formats in order: slash (B), dot (C), word (A). - `_is_variant_b_or_c` collapses B and C into one detector (both have the side-by-side header with `Year [Tt]o Date`); their parsers share code because the column mechanics are identical — only the RSU-label set and date format differ. 
- `_parse_variant_a` is a full rewrite: single-column rows split by the two `Total ...` anchors (payments → deductions), pay_date from the header's `Date : ...`, gross from first Total, net from the trailing `Net Pay` line, taxable_pay from the `Taxable Pay : This Period £X` line at the bottom. - RSU_VEST_LABELS is a shared set covering 8 aliases; rsu_vest sums every matching payment line. rsu_offset maps to `RSU Net Gain` on the deduction side when present (absent in variant B, present in A and C). ### Fixtures switched to real pdftotext output Removed the two synthetic fixtures that no longer reflected real Meta output (`meta_uk_2019_07.txt`, `meta_uk_2024_03_bonus_sacrificed.txt`) and replaced with real pdftotext captures: - `meta_uk_2021_08_variant_a.txt` (doc_id=43) - `meta_uk_2022_11_variant_c.txt` (doc_id=53) The remaining synthetic fixtures (`2025_03`, `2026_02`) stay because they encode specific bonus/no-bonus scenarios and the numbers are derived from the real Feb-2026 sample in the plan. ## Tests - 10 parser tests: one per variant (A/B/C) + totals validation across all 4 fixtures + the existing non-Meta/empty-input guards. All pass. - 52 total tests across the repo, all green. ## Test Plan ### Automated ``` $ poetry run pytest ============================== 52 passed in 1.66s ============================== $ poetry run ruff check . All checks passed! $ poetry run mypy . Success: no issues found in 24 source files ``` ### Manual verification (after deploy) 1. TRUNCATE + re-run backfill — expect 73 real payslips to extract via regex (≥95% hit rate), 42 → 70+ validated rows. 2. Sample a row for each variant via psql: employer, rsu_vest, and taxable_pay should all be populated. ## Reproduce locally 1. `poetry run pytest tests/test_meta_uk_parser.py -v` 2. Expected: 10 passed, each fixture validates totals to within 2p. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:52:59 +00:00
        matches = list(AMOUNT_RE.finditer(raw))
        if not matches:
            continue
        label = raw[:matches[0].start()].strip()
        if not label:
            continue
        # "This Period" value is the first amount; "This Year" is the second.
        # If only one amount is present, it's a YTD-only row (e.g. Relocation
        # Bonus, which doesn't apply this period); skip it for the period totals.
        if len(matches) < 2:
            continue
        amount = _to_decimal(matches[0].group())
        block[label] = amount
    return payments, deductions
def _parse_a_gross(
    lines: list[str],
    header_idx: int,
    payments: dict[str, Decimal],
) -> Decimal:
    """Pull the first `Total <amount>` after the header; that is the gross pay."""
    for i in range(header_idx + 1, len(lines)):
        stripped = lines[i].strip()
        if stripped.startswith("Total "):
            nums = AMOUNT_RE.findall(stripped)
            if nums:
                return _to_decimal(nums[0])
    # Fallback: sum payments values if the Total line is missing.
    if payments:
        return sum(payments.values(), start=Decimal("0"))
    raise ParserError("Total (gross pay) row not found in variant A")


def _parse_a_net(text: str) -> Decimal:
    """Return net pay from variant A's trailing `Net Pay` line."""
    m = NET_PAY_A_RE.search(text)
    if not m:
        raise ParserError("Net Pay line not found in variant A")
    return _to_decimal(m.group(1))