Event-driven UK payslip ingest: Paperless-ngx webhook -> claude-agent-service extraction -> Postgres -> Grafana
Viktor Barzin 86cac65572 processor: skip non-payslip docs by title pattern
The Paperless 'payslip' tag has been applied over the years to P60 annual
summaries, performance/year-end letters, Compensation_EMEA/PSC letters,
comp-review letters, and RSU grant agreements. These are legitimate
financial docs but not monthly payslips, and including them pollutes
the dashboards (a P60 amount is ~12x a single month).

Filter by title regex before hitting Claude so we skip cheaply and
don't burn extraction credit on them. Status returned is
'skipped_non_payslip' to distinguish from the 'already-ingested' skip.

Covers: P60*, *performance*(letter|year-end)*, compensation_emea,
*psc*, comp-letter, rsu grant*. New parameterized tests cover both
the exclude list and representative real payslip titles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:32:17 +00:00
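The title filter described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `_NON_PAYSLIP_TITLE` and `is_non_payslip_title` are hypothetical, and the regex is one reading of the listed patterns (leading-wildcard patterns become substring matches, `P60*` and `rsu grant*` are anchored at the start of the title, everything is case-insensitive).

```python
import re

# Hypothetical regex derived from the patterns named in the commit message.
_NON_PAYSLIP_TITLE = re.compile(
    r"""
      ^p60                               # P60*
    | performance.*(letter|year-end)     # *performance*(letter|year-end)*
    | compensation_emea                  # compensation_emea
    | psc                                # *psc*
    | comp-letter                        # comp-letter
    | ^rsu\ grant                        # rsu grant*
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_non_payslip_title(title: str) -> bool:
    """Return True when the title looks like a non-payslip financial document."""
    return _NON_PAYSLIP_TITLE.search(title.strip()) is not None


def status_for(title: str) -> str:
    """Decide cheaply, before any extraction call is made."""
    return "skipped_non_payslip" if is_non_payslip_title(title) else "extract"
```

Because the check runs on the title alone, a skipped document never reaches the PDF download or the Claude call.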
alembic alembic: create schema before initializing version table 2026-04-18 22:23:30 +00:00
payslip_ingest processor: skip non-payslip docs by title pattern 2026-04-18 23:32:17 +00:00
tests processor: skip non-payslip docs by title pattern 2026-04-18 23:32:17 +00:00
.gitignore Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
.woodpecker.yml Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
alembic.ini Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
Dockerfile extractor: preextract PDF text with pdftotext before calling Claude 2026-04-18 22:48:04 +00:00
poetry.lock Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
pyproject.toml Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
README.md Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00

payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document is tagged payslip; this service fetches the PDF, calls claude-agent-service to extract structured fields, and upserts into Postgres keyed by paperless_doc_id (idempotent). A CLI backfill mode enumerates every existing payslip in Paperless for initial population.
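The idempotent write keyed by paperless_doc_id could look something like the sketch below. It uses an in-memory SQLite database purely so the example runs on its own; the service itself targets Postgres, which accepts the same `ON CONFLICT ... DO UPDATE` clause. The table name `payslip`, the column `net_pay`, and the function `upsert_payslip` are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only paperless_doc_id comes from the README.
UPSERT_SQL = """
INSERT INTO payslip (paperless_doc_id, net_pay)
VALUES (?, ?)
ON CONFLICT (paperless_doc_id) DO UPDATE SET net_pay = excluded.net_pay
"""


def upsert_payslip(conn: sqlite3.Connection, doc_id: int, net_pay: str) -> None:
    """Insert the row, or overwrite it if this document was already ingested."""
    conn.execute(UPSERT_SQL, (doc_id, net_pay))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payslip (paperless_doc_id INTEGER PRIMARY KEY, net_pay TEXT NOT NULL)"
)

# Delivering the same webhook twice leaves exactly one row.
upsert_payslip(conn, 42, "3100.00")
upsert_payslip(conn, 42, "3100.00")
rows = conn.execute("SELECT paperless_doc_id, net_pay FROM payslip").fetchall()
```

Keying on the Paperless document ID means a retried or duplicated webhook cannot create a second row for the same payslip.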

Local dev

poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf

Env vars

Variable                    Purpose
PAPERLESS_URL               Paperless-ngx base URL (e.g. https://paperless.viktorbarzin.me)
PAPERLESS_API_TOKEN         Paperless API token (User → My Profile → API Auth Token)
CLAUDE_AGENT_URL            claude-agent-service URL (http://claude-agent-service.claude-agent.svc.cluster.local:8080)
CLAUDE_AGENT_BEARER_TOKEN   Vault secret/claude-agent-service, key api_bearer_token
DB_CONNECTION_STRING        SQLAlchemy async URL: postgresql+asyncpg://user:pass@host/db
WEBHOOK_BEARER_TOKEN        Shared secret Paperless sends in Authorization: Bearer ...
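One way to load these variables is to fail fast at startup when any is missing. The sketch below is an assumption about how that could be done, not the service's actual configuration code: `Settings` and `load_settings` are hypothetical names, and treating all six variables as required is a simplification (the extract-one smoke test only sets the two CLAUDE_AGENT_* variables, so a real loader presumably relaxes this per mode).

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = (
    "PAPERLESS_URL",
    "PAPERLESS_API_TOKEN",
    "CLAUDE_AGENT_URL",
    "CLAUDE_AGENT_BEARER_TOKEN",
    "DB_CONNECTION_STRING",
    "WEBHOOK_BEARER_TOKEN",
)


@dataclass(frozen=True)
class Settings:
    paperless_url: str
    paperless_api_token: str
    claude_agent_url: str
    claude_agent_bearer_token: str
    db_connection_string: str
    webhook_bearer_token: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ); raise if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Accepting an explicit mapping keeps the loader easy to exercise in tests without touching the process environment.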

Paperless workflow configuration

In Paperless-ngx, create a workflow:

  • Name: payslip-ingest
  • Trigger: Document Added, matching tag payslip
  • Action type: Webhook
  • URL: http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook
  • Method: POST
  • Headers:
    • Authorization: Bearer <WEBHOOK_BEARER_TOKEN>
    • Content-Type: application/json
  • Body (template):
    {"document_id": {{ document_id }}}
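On the receiving side, the webhook configured above needs two checks: that the Authorization header carries the shared secret, and that the body contains an integer document_id. A minimal, framework-independent sketch follows; the function names `is_authorized` and `parse_document_id` are hypothetical, and the real handler may be structured differently.

```python
import hmac
import json
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


def parse_document_id(raw_body: bytes) -> int:
    """Extract document_id from the JSON body; raise ValueError on anything else."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    doc_id = payload.get("document_id") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        raise ValueError("document_id must be an integer")
    return doc_id
```

`hmac.compare_digest` avoids leaking how many leading characters of the token matched through response timing.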
    

Deployment

Ship to the payslip-ingest namespace. The service serializes incoming webhooks onto an in-process queue so it never collides with claude-agent-service's single-job lock.
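The queueing behaviour can be sketched with an `asyncio.Queue` drained by one worker. Everything below is an illustrative assumption rather than the service's code: `AgentBusy` is a hypothetical exception standing in for a busy response from the agent, `call_agent` is whatever coroutine performs the extraction, and the attempt count and delays are made-up defaults.

```python
import asyncio


class AgentBusy(Exception):
    """Hypothetical signal that the agent is busy (its single-job lock is held)."""


async def extract_with_backoff(call_agent, doc_id, max_attempts=5, base_delay=1.0):
    """Call the agent, sleeping base_delay * 2**attempt after each busy response."""
    for attempt in range(max_attempts):
        try:
            return await call_agent(doc_id)
        except AgentBusy:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            await asyncio.sleep(base_delay * 2 ** attempt)


async def worker(queue, call_agent, results, base_delay=1.0):
    """Single consumer: handles one document at a time, in arrival order."""
    while True:
        doc_id = await queue.get()
        try:
            results[doc_id] = await extract_with_backoff(
                call_agent, doc_id, base_delay=base_delay
            )
        except AgentBusy:
            results[doc_id] = None  # still busy after every retry
        finally:
            queue.task_done()
```

In this arrangement a webhook handler would only call `queue.put_nowait(document_id)` and return immediately, so Paperless never waits on an extraction.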

Run the initial backfill once the deployment is live:

kubectl -n payslip-ingest create job \
  --from=deployment/payslip-ingest \
  payslip-backfill-$(date +%s) \
  -- python -m payslip_ingest backfill --all

Architecture notes

  • extract-one never touches the DB, so it is safe for ad-hoc re-extraction of PDFs on disk.
  • The backfill command is idempotent (skips rows whose paperless_doc_id already exists) so it can be re-run freely.
  • Totals validation is a best-effort sanity check; mismatches are stored with validated=false and raw_extraction retained for manual review, rather than rejected.
  • The agent service is single-threaded. The webhook handler enqueues the document and returns 202; a single background worker drains the queue one job at a time and absorbs 409-busy responses from the agent with retry-with-backoff.
  • New agent prompt lives at .claude/agents/payslip-extractor in the infra repo — this is a separate deliverable (see TODOs).
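The totals validation note can be illustrated with a small sketch. The rule used here (gross pay minus the sum of deductions should equal net pay, within a penny) is an assumption about what "totals" means; the field names `gross_pay`, `deductions`, and `net_pay` and the helpers `totals_match` and `build_row` are hypothetical. Only `validated` and `raw_extraction` come from the notes above.

```python
from decimal import Decimal


def totals_match(gross, deductions, net, tolerance=Decimal("0.01")) -> bool:
    """Assumed rule: gross - sum(deductions) should equal net within tolerance."""
    expected_net = gross - sum(deductions, Decimal("0"))
    return abs(expected_net - net) <= tolerance


def build_row(raw_extraction: dict) -> dict:
    """Keep the row either way; a mismatch only flips the validated flag."""
    gross = Decimal(raw_extraction["gross_pay"])
    net = Decimal(raw_extraction["net_pay"])
    deductions = [Decimal(d) for d in raw_extraction["deductions"]]
    return {
        "gross_pay": gross,
        "net_pay": net,
        "validated": totals_match(gross, deductions, net),
        "raw_extraction": raw_extraction,  # retained for manual review
    }
```

Using `Decimal` rather than `float` keeps the comparison exact for currency amounts parsed from strings.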