Event-driven UK payslip ingest: Paperless-ngx webhook -> claude-agent-service extraction -> Postgres -> Grafana
Viktor Barzin 9105b6b79d extractor: track rsu_vest + rsu_offset separately from cash pay
UK payslips for equity-comp employees report RSU vests as notional pay
for HMRC only. A paired same-magnitude deduction (Shares Retained /
Stock Tax Withholding / RSU Offset) nets it back out of cash. The UK
payslip's income_tax line shows tax on the grossed-up total, but the
actual RSU tax is handled by Schwab (US broker) via share sale. No
cash flows through UK payroll for RSU.

Previously the extractor folded RSU notional into gross_pay and
income_tax, which inflated the dashboard numbers — a payslip with
£25k RSU vest looked like 2x salary with 80% tax rate.

Changes:
- schema: add rsu_vest + rsu_offset fields (default 0).
- db + alembic 0002: add two new NUMERIC(12,2) columns with server
  default 0 (backward-compatible; existing rows get 0).
- validate_totals: include rsu_offset in deductions sum so the
  gross + rsu_vest inflation is properly netted out.
- extraction prompt: explicit rules for identifying RSU lines by the
  common Meta/Sage/Workday labels, and to NOT put them in
  other_deductions.

Dashboards in a follow-up commit: cash_gross = gross_pay - rsu_vest,
effective tax rate based on cash metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:37:25 +00:00
Name             Last commit                                                          Date
alembic          extractor: track rsu_vest + rsu_offset separately from cash pay      2026-04-18 23:37:25 +00:00
payslip_ingest   extractor: track rsu_vest + rsu_offset separately from cash pay      2026-04-18 23:37:25 +00:00
tests            extractor: track rsu_vest + rsu_offset separately from cash pay      2026-04-18 23:37:25 +00:00
.gitignore       Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00
.woodpecker.yml  Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00
alembic.ini      Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00
Dockerfile       extractor: preextract PDF text with pdftotext before calling Claude  2026-04-18 22:48:04 +00:00
poetry.lock      Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00
pyproject.toml   Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00
README.md        Initial commit: event-driven UK payslip ingest service               2026-04-18 22:10:23 +00:00

payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document is tagged payslip; this service fetches the PDF, calls claude-agent-service to extract structured fields, and upserts into Postgres keyed by paperless_doc_id (idempotent). A CLI backfill mode enumerates every existing payslip in Paperless for initial population.
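
As a rough illustration of why keying on paperless_doc_id makes the write idempotent, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere; the table name `payslip`, the `net_pay` column, and the helper names are hypothetical and not taken from this repo. The same `ON CONFLICT ... DO UPDATE` clause exists in Postgres, although the placeholder syntax depends on the driver.

```python
import sqlite3

# Hypothetical table and column names; the real schema lives in the alembic migrations.
UPSERT_SQL = """
INSERT INTO payslip (paperless_doc_id, net_pay)
VALUES (?, ?)
ON CONFLICT (paperless_doc_id) DO UPDATE SET net_pay = excluded.net_pay
"""


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a one-row-per-document table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payslip ("
        "paperless_doc_id INTEGER PRIMARY KEY, "
        "net_pay TEXT NOT NULL)"
    )
    return conn


def upsert_payslip(conn: sqlite3.Connection, doc_id: int, net_pay: str) -> None:
    """Insert the row, or overwrite it if this document was already ingested."""
    conn.execute(UPSERT_SQL, (doc_id, net_pay))
    conn.commit()
```

Delivering the same document twice leaves a single row holding the latest values, so a repeated webhook is harmless.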

Local dev

poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf

Env vars

Variable                   Purpose
PAPERLESS_URL              Paperless-ngx base URL (e.g. https://paperless.viktorbarzin.me)
PAPERLESS_API_TOKEN        Paperless API token (User → My Profile → API Auth Token)
CLAUDE_AGENT_URL           claude-agent-service URL (http://claude-agent-service.claude-agent.svc.cluster.local:8080)
CLAUDE_AGENT_BEARER_TOKEN  Vault secret/claude-agent-service, key api_bearer_token
DB_CONNECTION_STRING       SQLAlchemy async URL: postgresql+asyncpg://user:pass@host/db
WEBHOOK_BEARER_TOKEN       Shared secret Paperless sends in Authorization: Bearer ...
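
One way to fail fast on a missing variable is to read them all at startup. The sketch below is an assumption about how that could look, not the service's actual config code; `Settings` and `load_settings` are hypothetical names, and only the variable names come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "PAPERLESS_URL",
    "PAPERLESS_API_TOKEN",
    "CLAUDE_AGENT_URL",
    "CLAUDE_AGENT_BEARER_TOKEN",
    "DB_CONNECTION_STRING",
    "WEBHOOK_BEARER_TOKEN",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; one field per variable, lower-cased."""
    paperless_url: str
    paperless_api_token: str
    claude_agent_url: str
    claude_agent_bearer_token: str
    db_connection_string: str
    webhook_bearer_token: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read every required variable, reporting all missing ones at once."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Passing a plain dict as `env` keeps the loader testable without touching the real process environment.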

Paperless workflow configuration

In Paperless-ngx, create a workflow:

  • Name: payslip-ingest
  • Trigger: Document Added, matching tag payslip
  • Action type: Webhook
  • URL: http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook
  • Method: POST
  • Headers:
    • Authorization: Bearer <WEBHOOK_BEARER_TOKEN>
    • Content-Type: application/json
  • Body (template):
    {"document_id": {{ document_id }}}
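
On the receiving side, the request configured above needs two checks: the bearer token and the body shape. The following is a framework-independent sketch under that assumption; `parse_webhook` is a hypothetical helper, not a function from this repo, and the real handler may be structured differently.

```python
import hmac
import json
from typing import Mapping


def parse_webhook(headers: Mapping[str, str], body: bytes, expected_token: str) -> int:
    """Check the Authorization header and return the document id from the body."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    # compare_digest avoids leaking the secret through timing differences.
    if scheme != "Bearer" or not hmac.compare_digest(
        token.encode(), expected_token.encode()
    ):
        raise PermissionError("invalid bearer token")

    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("body must be a JSON object")
    doc_id = payload.get("document_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(doc_id, int) or isinstance(doc_id, bool):
        raise ValueError("document_id must be an integer")
    return doc_id
```

Because the template renders `{{ document_id }}` without quotes, the value arrives as a JSON number, which is why the sketch rejects strings.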
    

Deployment

Ship to the payslip-ingest namespace. The service serializes incoming webhooks onto an in-process queue so it never collides with claude-agent-service's single-job lock.
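
The serialization can be pictured as a single asyncio worker draining a queue and backing off while the agent is busy. This is a minimal sketch under stated assumptions, not the service's code: `AgentBusy`, `extract_with_retry`, `worker`, and the retry counts and delays are all hypothetical, and `extract` stands in for whatever coroutine calls claude-agent-service.

```python
import asyncio
from typing import Any, Awaitable, Callable

Extract = Callable[[int], Awaitable[Any]]


class AgentBusy(Exception):
    """Hypothetical error raised when the agent answers 409 (busy)."""


async def extract_with_retry(
    extract: Extract, doc_id: int, retries: int = 5, base_delay: float = 1.0
) -> Any:
    """Call extract, sleeping with exponential backoff while the agent is busy."""
    for attempt in range(retries):
        try:
            return await extract(doc_id)
        except AgentBusy:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def worker(
    queue: "asyncio.Queue[int]",
    extract: Extract,
    on_result: Callable[[int, Any], None],
    base_delay: float = 1.0,
) -> None:
    """Drain the queue one document at a time so jobs never overlap."""
    while True:
        doc_id = await queue.get()
        try:
            result = await extract_with_retry(extract, doc_id, base_delay=base_delay)
            on_result(doc_id, result)
        except AgentBusy:
            on_result(doc_id, None)  # gave up; a real service would log or re-queue
        finally:
            queue.task_done()


async def demo() -> list:
    """Run two jobs through the worker; the first call hits a busy agent once."""
    calls = {"n": 0}

    async def flaky_extract(doc_id: int) -> dict:
        calls["n"] += 1
        if calls["n"] == 1:
            raise AgentBusy()
        return {"document_id": doc_id}

    queue: "asyncio.Queue[int]" = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(
        worker(queue, flaky_extract, lambda d, r: results.append((d, r)), base_delay=0.01)
    )
    for doc_id in (1, 2):
        queue.put_nowait(doc_id)
    await queue.join()
    task.cancel()
    return results
```

Because only one worker task exists, at most one extraction is in flight at any time, which is the property the single-job lock requires.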

Run the initial backfill once the deployment is live:

kubectl -n payslip-ingest create job \
  --from=deployment/payslip-ingest \
  payslip-backfill-$(date +%s) \
  -- python -m payslip_ingest backfill --all

Architecture notes

  • extract-one never touches the DB, so it is safe for ad-hoc re-extraction of a PDF on disk.
  • The backfill command is idempotent (skips rows whose paperless_doc_id already exists) so it can be re-run freely.
  • Totals validation is a best-effort sanity check; mismatches are stored with validated=false and raw_extraction retained for manual review, rather than rejected.
  • The agent service is single-threaded. The webhook handler enqueues the document and returns 202; a single background worker drains the queue one job at a time and retries with backoff whenever the agent responds 409 (busy).
  • The new agent prompt lives at .claude/agents/payslip-extractor in the infra repo; it is a separate deliverable (see TODOs).
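
The totals check mentioned in the notes above can be sketched roughly as follows, assuming gross_pay includes the notional RSU vest and the paired rsu_offset is counted among the deductions. The `Payslip` class, the `net_pay` field, and the one-penny tolerance are hypothetical; only `gross_pay`, `income_tax`, `other_deductions`, `rsu_vest`, `rsu_offset`, and the name `validate_totals` appear in this repo's commit history, and the real function body may differ.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Payslip:
    """Hypothetical subset of the extracted fields."""
    gross_pay: Decimal
    income_tax: Decimal
    other_deductions: Decimal
    net_pay: Decimal  # assumed field name
    rsu_vest: Decimal = Decimal("0")
    rsu_offset: Decimal = Decimal("0")


def validate_totals(p: Payslip, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when gross minus all deductions matches net within tolerance."""
    deductions = p.income_tax + p.other_deductions + p.rsu_offset
    return abs(p.gross_pay - deductions - p.net_pay) <= tolerance


def cash_gross(p: Payslip) -> Decimal:
    """Gross pay with the notional RSU vest removed."""
    return p.gross_pay - p.rsu_vest


# Illustrative numbers only, not taken from a real payslip.
example = Payslip(
    gross_pay=Decimal("30000.00"),
    income_tax=Decimal("1500.00"),
    other_deductions=Decimal("500.00"),
    net_pay=Decimal("3000.00"),
    rsu_vest=Decimal("25000.00"),
    rsu_offset=Decimal("25000.00"),
)
```

A False result would map to validated=false on the stored row rather than to a rejected document.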