backfill: cash_income_tax back-fill for variant-A NULL rows
Phase B of RSU tax spike fix. Vest-month spikes on the dashboard trace to variant-A slips (2019–mid-2022) where `cash_income_tax` is NULL: the dashboard's COALESCE fallback returns full PAYE, masquerading as cash tax.

Three changes:

1. Widen variant-A Taxable Pay regex. Original pattern only matched `Taxable Pay : This Period £...`; add case-insensitive variants that tolerate missing/different colons, elided "This", and uppercase labels. Covers older 2019-2020 templates that failed the previous match.

2. New `backfill_cash_income_tax` module: walks every NULL-cash-tax row with rsu_vest > 0, re-downloads the PDF from Paperless, runs the widened regex parser, falls back to Claude for taxable_pay extraction if regex still misses, and derives cash_income_tax pro-rata. Records provenance in new `cash_income_tax_source` column (regex/claude/fallback_null). Idempotent: only touches NULL rows.

3. Migration 0006 adds the `cash_income_tax_source` audit column.

CLI: `python -m payslip_ingest backfill-cash-tax [--limit N]`. Meant to run as a one-shot K8s Job after `alembic upgrade head`.

Part of: code-860
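The widened pattern itself is not visible in this part of the diff. As a rough illustration of the tolerances the message lists (optional colons, elided "This", uppercase labels), a pattern along the following lines would behave that way. `TAXABLE_PAY_RE` and `extract_taxable_pay` are hypothetical names, and the pattern is an assumption rather than the one committed.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical widened pattern (an assumption; the committed regex is not shown).
# Tolerates: any letter case, an optional colon after "Taxable Pay",
# an optional "This", and an optional colon after "Period".
TAXABLE_PAY_RE = re.compile(
    r"taxable\s+pay\s*:?\s*(?:this\s+)?period\s*:?\s*£\s*([\d,]+(?:\.\d{1,2})?)",
    re.IGNORECASE,
)


def extract_taxable_pay(text: str) -> Optional[Decimal]:
    """Return the 'Taxable Pay ... Period' amount, or None if no label matches."""
    match = TAXABLE_PAY_RE.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting to an exact decimal.
    return Decimal(match.group(1).replace(",", ""))
```

Returning `None` on a miss, instead of raising, lets a caller chain the next extraction strategy, which is the shape the regex-then-Claude fallback in the message implies.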
parent 4f70681dcb
commit 3b9c69bfd3
7 changed files with 512 additions and 4 deletions
36
alembic/versions/0006_cash_income_tax_source.py
Normal file
@@ -0,0 +1,36 @@
"""Add cash_income_tax_source audit column.

Tracks which path produced `cash_income_tax` for a given row. Back-fill
script populates this on rows it touches so the dashboard can surface how
many rows were rescued by regex vs Claude vs left NULL.

Values:
- `regex` — regex parser extracted taxable_pay and derived cash_income_tax
- `claude` — fell back to Claude for taxable_pay, then derived locally
- `fallback_null` — neither regex nor Claude could recover it; cash_income_tax
  left NULL (dashboard's COALESCE will use income_tax)

Nullable so pre-back-fill rows stay distinguishable from post-back-fill rows.
"""
import sqlalchemy as sa

from alembic import op

revision = "0006"
down_revision = "0005"
branch_labels = None
depends_on = None

SCHEMA = "payslip_ingest"


def upgrade() -> None:
    op.add_column(
        "payslip",
        sa.Column("cash_income_tax_source", sa.String(length=16), nullable=True),
        schema=SCHEMA,
    )


def downgrade() -> None:
    op.drop_column("payslip", "cash_income_tax_source", schema=SCHEMA)
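
The migration docstring says the column lets the dashboard count rows per recovery path, and that a NULL source marks rows the back-fill has not visited. A minimal sketch of both queries follows. It uses an in-memory SQLite table as a stand-in for the real `payslip_ingest.payslip` table, with invented figures and only the columns needed here. The table layout and the `not_backfilled` label are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for payslip_ingest.payslip (assumed, simplified layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payslip (
        id INTEGER PRIMARY KEY,
        income_tax REAL NOT NULL,
        cash_income_tax REAL,
        cash_income_tax_source TEXT
    )
    """
)
rows = [
    (1, 1200.0, 400.0, "regex"),          # rescued by the regex parser
    (2, 1500.0, 450.0, "claude"),         # rescued by the Claude fallback
    (3, 1800.0, None, "fallback_null"),   # visited, but not recoverable
    (4, 900.0, None, None),               # not yet visited by the back-fill
]
conn.executemany("INSERT INTO payslip VALUES (?, ?, ?, ?)", rows)

# Rows per provenance value; a NULL source means the back-fill never touched it.
source_counts = dict(
    conn.execute(
        "SELECT COALESCE(cash_income_tax_source, 'not_backfilled'), COUNT(*) "
        "FROM payslip GROUP BY cash_income_tax_source"
    ).fetchall()
)

# The dashboard-style fallback: use income_tax when cash_income_tax is NULL.
effective_tax = dict(
    conn.execute(
        "SELECT id, COALESCE(cash_income_tax, income_tax) FROM payslip"
    ).fetchall()
)
```

Rows 3 and 4 both fall back to full `income_tax` on the dashboard, but only the source column shows that row 3 was attempted and row 4 was not.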