Event-driven UK payslip ingest: Paperless-ngx webhook -> claude-agent-service extraction -> Postgres -> Grafana
Viktor Barzin 26e43b1055 parser + P60 ingest: split income_tax cash/RSU, add P60 ground-truth
Meta variant-B payslips gross up Taxable Pay for RSU and compute PAYE on
the grossed-up figure, so `income_tax` on the slip is the total PAYE
(cash + RSU-attributed). Dashboards that stacked the raw figure made
vest-month tax look ~2x higher than "cash tax paid". Introduce
`cash_income_tax = income_tax * (gross_pay - pension_sacrifice) /
taxable_pay` as a derived column alongside the raw figure. Dashboards
can now stack cash vs RSU-attributed tax as separate segments.
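
A minimal sketch of the derivation above, for illustration only. The function name, the zero-guard on `taxable_pay`, the rounding to pence, and the sample figures are assumptions, not taken from the parser:

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def cash_income_tax(
    income_tax: Decimal,
    gross_pay: Decimal,
    pension_sacrifice: Decimal,
    taxable_pay: Decimal,
) -> Optional[Decimal]:
    """Pro-rate total PAYE by the cash share of taxable pay."""
    if taxable_pay <= 0:
        # Assumption: no sensible ratio exists, and the column is nullable.
        return None
    cash_taxable = gross_pay - pension_sacrifice
    share = cash_taxable / taxable_pay
    # Assumption: round to pence, half up.
    return (income_tax * share).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustrative vest-month figures (invented, not from a real payslip).
total_paye = Decimal("6000.00")
cash = cash_income_tax(
    income_tax=total_paye,
    gross_pay=Decimal("8000.00"),
    pension_sacrifice=Decimal("500.00"),
    taxable_pay=Decimal("15000.00"),
)
rsu_attributed = total_paye - cash
print(cash, rsu_attributed)
```

With these figures the cash share of taxable pay is one half, so the two stacked segments are equal.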

Also capture YTD column values of `RSU Tax Offset` and `RSU Excs Refund`
from the Payments grid — needed for reconciliation against HMRC annual
figures.

P60 ingest: new parser under `parsers/p60.py` anchoring on statutory
HMRC line labels (`Tax year to 5 April YYYY`, `Employer PAYE reference`,
`In this employment` pay/tax row, NI letter bands). Processor routes
documents carrying the `p60` Paperless tag to `_handle_p60` which
writes to the new `payslip_ingest.p60_reference` table (one row per
tax_year+employer). App lifespan resolves the tag id at startup; missing
tag disables dispatch without breaking payslip ingest. Paperless tag
creation + webhook config are manual follow-ups.
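
One way label anchoring could look, as a rough sketch rather than the contents of `parsers/p60.py`. The regexes assume each value follows its label on the same line of the extracted text; the function name, the returned keys, and the sample text are all hypothetical:

```python
import re
from decimal import Decimal
from typing import Optional

# Assumption: values follow their statutory labels in the extracted text.
TAX_YEAR_RE = re.compile(r"Tax year to 5 April\s+(\d{4})")
PAYE_REF_RE = re.compile(r"Employer PAYE reference\s+(\S+)")
EMPLOYMENT_RE = re.compile(
    r"In this employment\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})"
)


def _money(raw: str) -> Decimal:
    """Convert '85,000.00' to Decimal('85000.00')."""
    return Decimal(raw.replace(",", ""))


def parse_p60(text: str) -> Optional[dict]:
    """Return the anchored fields, or None if any label is missing."""
    year = TAX_YEAR_RE.search(text)
    ref = PAYE_REF_RE.search(text)
    row = EMPLOYMENT_RE.search(text)
    if not (year and ref and row):
        return None
    return {
        "tax_year_end": int(year.group(1)),
        "paye_reference": ref.group(1),
        "pay": _money(row.group(1)),
        "tax": _money(row.group(2)),
    }


# Invented sample text; real P60 layouts may differ.
sample = """\
Tax year to 5 April 2025
Employer PAYE reference 123/AB456
In this employment 85,000.00 21,432.00
"""
parsed = parse_p60(sample)
print(parsed)
```

Returning `None` when a label is missing keeps a malformed document from producing a partial row.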

Migrations:
- 0004 — cash_income_tax + ytd_rsu_tax_offset + ytd_rsu_excs_refund on
  payslip, all nullable.
- 0005 — p60_reference table with (tax_year, employer) unique +
  paperless_doc_id unique for idempotent re-uploads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00
alembic parser + P60 ingest: split income_tax cash/RSU, add P60 ground-truth 2026-04-19 15:23:05 +00:00
payslip_ingest parser + P60 ingest: split income_tax cash/RSU, add P60 ground-truth 2026-04-19 15:23:05 +00:00
tests parser + P60 ingest: split income_tax cash/RSU, add P60 ground-truth 2026-04-19 15:23:05 +00:00
.gitignore Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
.woodpecker.yml Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
alembic.ini Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
Dockerfile extractor: preextract PDF text with pdftotext before calling Claude 2026-04-18 22:48:04 +00:00
poetry.lock Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
pyproject.toml Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
README.md Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00

payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document is tagged payslip; this service fetches the PDF, calls claude-agent-service to extract structured fields, and upserts into Postgres keyed by paperless_doc_id (idempotent). A CLI backfill mode enumerates every existing payslip in Paperless for initial population.
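
To illustrate why keying on paperless_doc_id makes re-delivery harmless, here is a small sketch of an upsert. It uses the standard-library sqlite3 module as a stand-in so it runs anywhere; the service itself targets Postgres, and the single `net_pay` column and the helper name are assumptions made for the example:

```python
import sqlite3

# sqlite3 is a stand-in here; the ON CONFLICT clause has the same shape in Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payslip ("
    " paperless_doc_id INTEGER PRIMARY KEY,"
    " net_pay TEXT NOT NULL)"  # hypothetical column
)

UPSERT = """
INSERT INTO payslip (paperless_doc_id, net_pay)
VALUES (?, ?)
ON CONFLICT (paperless_doc_id) DO UPDATE SET net_pay = excluded.net_pay
"""


def upsert_payslip(db: sqlite3.Connection, doc_id: int, net_pay: str) -> None:
    """Insert the row, or overwrite it if the document was seen before."""
    with db:  # commits on success, rolls back on error
        db.execute(UPSERT, (doc_id, net_pay))


# Delivering the same document twice leaves exactly one row.
upsert_payslip(conn, 42, "3100.00")
upsert_payslip(conn, 42, "3150.00")
rows = conn.execute("SELECT paperless_doc_id, net_pay FROM payslip").fetchall()
print(rows)
```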

Local dev

poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf

Env vars

Variable                   Purpose
PAPERLESS_URL              Paperless-ngx base URL (e.g. https://paperless.viktorbarzin.me)
PAPERLESS_API_TOKEN        Paperless API token (User → My Profile → API Auth Token)
CLAUDE_AGENT_URL           claude-agent-service URL (http://claude-agent-service.claude-agent.svc.cluster.local:8080)
CLAUDE_AGENT_BEARER_TOKEN  Vault secret/claude-agent-service, key api_bearer_token
DB_CONNECTION_STRING       SQLAlchemy async URL: postgresql+asyncpg://user:pass@host/db
WEBHOOK_BEARER_TOKEN       Shared secret Paperless sends in Authorization: Bearer ...

Paperless workflow configuration

In Paperless-ngx, create a workflow:

  • Name: payslip-ingest
  • Trigger: Document Added, matching tag payslip
  • Action type: Webhook
  • URL: http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook
  • Method: POST
  • Headers:
    • Authorization: Bearer <WEBHOOK_BEARER_TOKEN>
    • Content-Type: application/json
  • Body (template):
    {"document_id": {{ document_id }}}
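
For testing the endpoint without Paperless, the same request can be built by hand. This is a sketch using only the standard library; the helper name and the example document id and token are placeholders:

```python
import json
import urllib.request

WEBHOOK_URL = "http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook"


def build_webhook_request(document_id: int, bearer_token: str) -> urllib.request.Request:
    """Build the POST that the Paperless workflow is configured to send."""
    body = json.dumps({"document_id": document_id}).encode("utf-8")
    return urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
    )


req = build_webhook_request(1234, "example-token")  # placeholder values

# Sending it requires network access to the service:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```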
    

Deployment

Ship to the payslip-ingest namespace. The service serializes incoming webhooks onto an in-process queue so it never collides with claude-agent-service's single-job lock.

Run the initial backfill once the deployment is live:

kubectl -n payslip-ingest create job \
  --from=deployment/payslip-ingest \
  payslip-backfill-$(date +%s) \
  -- python -m payslip_ingest backfill --all

Architecture notes

  • extract-one never touches the DB, so it is safe for ad-hoc re-extraction of a PDF on disk.
  • The backfill command is idempotent (skips rows whose paperless_doc_id already exists) so it can be re-run freely.
  • Totals validation is a best-effort sanity check; mismatches are stored with validated=false and raw_extraction retained for manual review, rather than rejected.
  • The agent service is single-threaded. The webhook handler enqueues the document and returns 202; a single background worker drains the queue one document at a time and retries with backoff whenever the agent responds 409 (busy).
  • The new agent prompt lives at .claude/agents/payslip-extractor in the infra repo; it is a separate deliverable (see TODOs).
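
The enqueue-and-drain pattern described in the notes above can be sketched with asyncio. This is not the service's code: `AgentBusy`, `extract_with_backoff`, `worker`, the retry limits, and the fake agent in `demo` are all hypothetical names and values chosen for the example:

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class AgentBusy(Exception):
    """Stands in for an HTTP 409 from the agent (hypothetical client error)."""


async def extract_with_backoff(
    call_agent: Callable[[int], Awaitable[Any]],
    doc_id: int,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Call the agent, doubling the wait after each busy response."""
    for attempt in range(max_attempts):
        try:
            return await call_agent(doc_id)
        except AgentBusy:
            if attempt == max_attempts - 1:
                raise
            await sleep(base_delay * 2**attempt)


async def worker(queue: asyncio.Queue, call_agent, results: List[Any], sleep=asyncio.sleep) -> None:
    """Single consumer: at most one extraction is in flight at any time."""
    while True:
        doc_id = await queue.get()
        try:
            results.append(await extract_with_backoff(call_agent, doc_id, sleep=sleep))
        except AgentBusy:
            pass  # retries exhausted; a real service would log and move on
        finally:
            queue.task_done()


async def demo() -> List[Any]:
    calls = {"n": 0}

    async def flaky_agent(doc_id: int) -> dict:
        calls["n"] += 1
        if calls["n"] == 1:
            raise AgentBusy()  # first call simulates a 409
        return {"document_id": doc_id}

    async def no_sleep(_: float) -> None:
        return None  # skip real waiting in the demo

    queue: asyncio.Queue = asyncio.Queue()
    results: List[Any] = []
    task = asyncio.create_task(worker(queue, flaky_agent, results, sleep=no_sleep))
    for doc_id in (1, 2):
        queue.put_nowait(doc_id)  # what a webhook handler would do before returning 202
    await queue.join()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because only one worker consumes the queue, documents are processed in arrival order and the agent never sees two concurrent jobs from this service.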