Event-driven UK payslip ingest: Paperless-ngx webhook -> claude-agent-service extraction -> Postgres -> Grafana
Viktor Barzin 80eea276df
Phase 4: drop registry.viktorbarzin.me push, Forgejo only
2026-05-07 22:32:00 +00:00
alembic rsu_vest_events: schema + ORM for Schwab vest ground truth (Phase D) 2026-04-19 18:27:41 +00:00
payslip_ingest rsu_vest_events: schema + ORM for Schwab vest ground truth (Phase D) 2026-04-19 18:27:41 +00:00
tests sync: ActualBudget Meta deposit overlay (Phase C) 2026-04-19 18:20:50 +00:00
.gitignore Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
.woodpecker.yml Phase 4: drop registry.viktorbarzin.me push, Forgejo only 2026-05-07 22:32:00 +00:00
alembic.ini Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
Dockerfile extractor: preextract PDF text with pdftotext before calling Claude 2026-04-18 22:48:04 +00:00
poetry.lock Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
pyproject.toml Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
README.md Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00

payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document is tagged payslip; this service fetches the PDF, calls claude-agent-service to extract structured fields, and upserts into Postgres keyed by paperless_doc_id (idempotent). A CLI backfill mode enumerates every existing payslip in Paperless for initial population.
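The idempotent upsert described above can be sketched as follows. This is a minimal illustration, not the service's actual code: it uses an in-memory SQLite database instead of Postgres so that it runs standalone, and the table layout, the net_pay column, and the upsert_payslip helper are assumptions made for the example. The real schema lives in the alembic migrations.

```python
import json
import sqlite3

# Hypothetical table layout, for illustration only.
SCHEMA = """
CREATE TABLE payslips (
    paperless_doc_id INTEGER PRIMARY KEY,
    net_pay TEXT NOT NULL,
    raw_extraction TEXT NOT NULL
)
"""

# Insert a new row, or overwrite the existing row for the same document id.
UPSERT = """
INSERT INTO payslips (paperless_doc_id, net_pay, raw_extraction)
VALUES (:doc_id, :net_pay, :raw)
ON CONFLICT (paperless_doc_id) DO UPDATE SET
    net_pay = excluded.net_pay,
    raw_extraction = excluded.raw_extraction
"""


def upsert_payslip(conn: sqlite3.Connection, doc_id: int, fields: dict) -> None:
    """Store one extraction; repeating the call for a doc_id never adds a row."""
    conn.execute(
        UPSERT,
        {"doc_id": doc_id, "net_pay": fields["net_pay"], "raw": json.dumps(fields)},
    )
    conn.commit()


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM payslips").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_payslip(conn, 42, {"net_pay": "3100.00"})
upsert_payslip(conn, 42, {"net_pay": "3150.00"})  # same document delivered again
```

Because the conflict target is paperless_doc_id, a webhook that fires twice for one document updates the existing row rather than creating a duplicate.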

Local dev

poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf

Env vars

Variable                   Purpose
PAPERLESS_URL              Paperless-ngx base URL (e.g. https://paperless.viktorbarzin.me)
PAPERLESS_API_TOKEN        Paperless API token (User → My Profile → API Auth Token)
CLAUDE_AGENT_URL           claude-agent-service URL (http://claude-agent-service.claude-agent.svc.cluster.local:8080)
CLAUDE_AGENT_BEARER_TOKEN  Bearer token for claude-agent-service (Vault secret/claude-agent-service, key api_bearer_token)
DB_CONNECTION_STRING       SQLAlchemy async URL: postgresql+asyncpg://user:pass@host/db
WEBHOOK_BEARER_TOKEN       Shared secret Paperless sends in Authorization: Bearer ...
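
One way to fail fast when a variable from the table above is missing is to load them all into a single settings object at startup. The sketch below is an assumption about how that could look; the Settings class and load_settings function are hypothetical names, not part of the payslip_ingest package as documented here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The six variables listed in the table above.
REQUIRED_VARS = (
    "PAPERLESS_URL",
    "PAPERLESS_API_TOKEN",
    "CLAUDE_AGENT_URL",
    "CLAUDE_AGENT_BEARER_TOKEN",
    "DB_CONNECTION_STRING",
    "WEBHOOK_BEARER_TOKEN",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; field names are the variable names lowercased."""
    paperless_url: str
    paperless_api_token: str
    claude_agent_url: str
    claude_agent_bearer_token: str
    db_connection_string: str
    webhook_bearer_token: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read every required variable, reporting all missing ones at once."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})


# Placeholder values for demonstration; never hard-code real secrets.
example_env = {name: f"value-for-{name.lower()}" for name in REQUIRED_VARS}
settings = load_settings(example_env)
```

Passing a mapping instead of reading os.environ directly keeps the loader easy to exercise in tests.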

Paperless workflow configuration

In Paperless-ngx, create a workflow:

  • Name: payslip-ingest
  • Trigger: Document Added, matching tag payslip
  • Action type: Webhook
  • URL: http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook
  • Method: POST
  • Headers:
    • Authorization: Bearer <WEBHOOK_BEARER_TOKEN>
    • Content-Type: application/json
  • Body (template):
    {"document_id": {{ document_id }}}
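
On the receiving side, the request configured above carries a bearer token and a one-field JSON body. The following framework-free sketch shows one way to check both; parse_webhook is a hypothetical helper, and it assumes the web framework hands over headers as a dict keyed by canonical header names.

```python
import hmac
import json
from typing import Optional, Tuple


def parse_webhook(
    headers: dict, body: bytes, expected_token: str
) -> Tuple[int, Optional[int]]:
    """Return (HTTP status, document_id) for one incoming webhook request."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {expected_token}"
    # Constant-time comparison avoids leaking the token through timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, None
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and invalid UTF-8
        return 400, None
    doc_id = payload.get("document_id") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(doc_id, int) or isinstance(doc_id, bool):
        return 400, None
    return 202, doc_id


status, doc_id = parse_webhook(
    {"Authorization": "Bearer s3cret"}, b'{"document_id": 17}', "s3cret"
)
```

Returning 202 matches the accepted-for-later-processing behaviour of the webhook handler; the 401 and 400 codes are this sketch's own choice.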
    

Deployment

Ship to the payslip-ingest namespace. The service serializes incoming webhooks onto an in-process queue so it never collides with claude-agent-service's single-job lock.
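
The serialization described above can be sketched with an asyncio queue and a single worker task. This is an illustration under stated assumptions, not the service's code: AgentBusy is a hypothetical exception standing in for a busy response from the agent, extract is any coroutine that performs one extraction, and the backoff parameters are arbitrary.

```python
import asyncio


class AgentBusy(Exception):
    """Hypothetical signal that the agent is already running another job."""


async def worker(queue: asyncio.Queue, extract, base_delay=1.0, max_delay=60.0):
    """Drain the queue one document at a time, retrying while the agent is busy."""
    while True:
        doc_id = await queue.get()
        delay = base_delay
        try:
            while True:
                try:
                    await extract(doc_id)
                    break
                except AgentBusy:
                    # Exponential backoff, capped at max_delay seconds.
                    await asyncio.sleep(delay)
                    delay = min(delay * 2, max_delay)
        finally:
            queue.task_done()


async def demo():
    calls = []

    async def fake_extract(doc_id):
        calls.append(doc_id)
        # Simulate a busy agent on the first attempt for document 1 only.
        if doc_id == 1 and calls.count(1) == 1:
            raise AgentBusy

    queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue, fake_extract, base_delay=0.01))
    for doc_id in (1, 2):
        queue.put_nowait(doc_id)  # what a webhook handler would do before replying
    await queue.join()
    task.cancel()
    return calls


calls = asyncio.run(demo())
print(calls)  # [1, 1, 2]
```

With exactly one worker, at most one extraction request is in flight at any time, which is what keeps the service from competing with itself for the agent's lock.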

Run the initial backfill once the deployment is live:

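# kubectl create job --from accepts only a CronJob as its source and cannot be
# combined with a command override, so run the backfill in the running pod:
kubectl -n payslip-ingest exec deploy/payslip-ingest -- \
  python -m payslip_ingest backfill --all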

Architecture notes

  • extract-one never touches the DB, so it is safe for ad-hoc re-extraction of a PDF that is already on disk.
  • The backfill command is idempotent (it skips documents whose paperless_doc_id already has a row) so it can be re-run freely.
  • Totals validation is a best-effort sanity check; mismatches are stored with validated=false and raw_extraction retained for manual review, rather than rejected.
  • The agent service is single-threaded. The webhook handler enqueues and returns 202; a single background worker drains the queue one at a time and absorbs 409-busy responses from the agent with retry-with-backoff.
  • New agent prompt lives at .claude/agents/payslip-extractor in the infra repo — this is a separate deliverable (see TODOs).
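
The totals check from the notes above might look like the sketch below. The field names gross_pay, net_pay, and deductions, the one-penny tolerance, and the helpers validate_totals and to_row are all assumptions for illustration; only the validated flag and the retained raw_extraction come from the notes.

```python
from decimal import Decimal

# Assumed tolerance: allow one penny of rounding drift.
TOLERANCE = Decimal("0.01")


def validate_totals(fields: dict) -> bool:
    """Check that gross pay minus all deductions equals net pay."""
    gross = Decimal(fields["gross_pay"])
    net = Decimal(fields["net_pay"])
    deductions = sum(
        (Decimal(value) for value in fields["deductions"].values()), Decimal("0")
    )
    return abs(gross - deductions - net) <= TOLERANCE


def to_row(doc_id: int, fields: dict) -> dict:
    """Build the row to store; a mismatch is flagged, never rejected."""
    return {
        "paperless_doc_id": doc_id,
        "validated": validate_totals(fields),
        "raw_extraction": fields,  # kept so a human can review mismatches
    }


good = {
    "gross_pay": "5000.00",
    "net_pay": "3400.00",
    "deductions": {
        "income_tax": "1000.00",
        "national_insurance": "350.00",
        "pension": "250.00",
    },
}
bad = dict(good, net_pay="3500.00")  # does not add up: off by 100.00

good_row = to_row(1, good)
bad_row = to_row(2, bad)
```

Decimal is used rather than float so that sums of pence compare exactly.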