# payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document is tagged `payslip`; this service fetches the PDF, calls `claude-agent-service` to extract structured fields, and upserts into Postgres keyed by `paperless_doc_id` (idempotent). A CLI `backfill` mode enumerates every existing payslip in Paperless for initial population.

## Local dev

```bash
poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf
```

## Env vars

| Variable | Purpose |
|---|---|
| `PAPERLESS_URL` | Paperless-ngx base URL (e.g. `https://paperless.viktorbarzin.me`) |
| `PAPERLESS_API_TOKEN` | Paperless API token (User → My Profile → API Auth Token) |
| `CLAUDE_AGENT_URL` | claude-agent-service URL (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`) |
| `CLAUDE_AGENT_BEARER_TOKEN` | Vault `secret/claude-agent-service` → `api_bearer_token` |
| `DB_CONNECTION_STRING` | SQLAlchemy async URL: `postgresql+asyncpg://user:pass@host/db` |
| `WEBHOOK_BEARER_TOKEN` | Shared secret Paperless sends in `Authorization: Bearer ...` |

## Paperless workflow configuration

In Paperless-ngx, create a workflow:

- **Name**: `payslip-ingest`
- **Trigger**: Document Added, matching tag `payslip`
- **Action type**: Webhook
- **URL**: `http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook`
- **Method**: `POST`
- **Headers**:
  - `Authorization: Bearer `
  - `Content-Type: application/json`
- **Body** (template):

  ```json
  {"document_id": {{ document_id }}}
  ```

## Deployment

Ship to the `payslip-ingest` namespace. The service serializes incoming webhooks onto an in-process queue so it never collides with `claude-agent-service`'s single-job lock.

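
The serialization described above can be pictured with a small `asyncio` sketch. It is an illustration under stated assumptions, not the service's actual code: `handle_webhook`, `worker`, `process_document`, and `demo` are hypothetical names, and the real handler presumably sits behind an HTTP framework rather than being called directly.

```python
import asyncio
import contextlib


async def process_document(document_id: int, processed: list[int]) -> None:
    """Hypothetical stand-in for: fetch the PDF, call the agent, upsert the row."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    processed.append(document_id)


async def handle_webhook(queue: asyncio.Queue, payload: dict) -> int:
    """Enqueue the document ID and return immediately with 202 Accepted."""
    await queue.put(int(payload["document_id"]))
    return 202


async def worker(queue: asyncio.Queue, processed: list[int]) -> None:
    """Single consumer: at most one document is in flight at any time."""
    while True:
        document_id = await queue.get()
        try:
            await process_document(document_id, processed)
        finally:
            queue.task_done()


async def demo() -> tuple[list[int], list[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    processed: list[int] = []
    task = asyncio.create_task(worker(queue, processed))

    # Three webhooks arrive back to back; each is acknowledged before any work runs.
    statuses = [
        await handle_webhook(queue, {"document_id": doc_id})
        for doc_id in (101, 102, 103)
    ]

    await queue.join()  # block until the worker has drained the queue
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return statuses, processed
```

Running `asyncio.run(demo())` returns three `202` statuses and the document IDs in arrival order. Because only one worker task consumes the queue, the agent never sees two concurrent jobs from this service, whatever the rate of incoming webhooks.
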
Run the initial backfill once the deployment is live:

```bash
kubectl -n payslip-ingest create job \
  --from=deployment/payslip-ingest \
  payslip-backfill-$(date +%s) \
  -- python -m payslip_ingest backfill --all
```

## Architecture notes

- `extract-one` never touches the DB, so it is safe for ad-hoc re-extraction on disk.
- The `backfill` command is idempotent (it skips rows whose `paperless_doc_id` already exists), so it can be re-run freely.
- Totals validation is a best-effort sanity check; mismatches are not rejected but stored with `validated=false`, and `raw_extraction` is retained for manual review.
- The agent service is **single-threaded**. The webhook handler enqueues and returns 202; a single background worker drains the queue one job at a time and absorbs 409-busy responses from the agent by retrying with backoff.
- The new agent prompt lives at `.claude/agents/payslip-extractor` in the `infra` repo; this is a separate deliverable (see TODOs).
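
The 409-busy handling in the notes above could look roughly like the following. This is a minimal sketch, not the worker's real implementation: `submit_with_backoff`, the `submit` callable, the attempt limit, and the delay values are all assumptions chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical: one request to the agent, returning (HTTP status, parsed body).
Submit = Callable[[], Awaitable[tuple[int, dict]]]


async def submit_with_backoff(
    submit: Submit,
    max_attempts: int = 5,    # assumed limit, not taken from the service
    base_delay: float = 1.0,  # assumed initial delay in seconds
    max_delay: float = 60.0,  # assumed cap on the delay
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> dict:
    """Retry while the agent answers 409 (busy), doubling the delay each time."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = await submit()
        if status != 409:
            if status >= 400:
                # Other errors are not retried in this sketch.
                raise RuntimeError(f"agent returned HTTP {status}")
            return body
        if attempt < max_attempts:
            await sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"agent still busy after {max_attempts} attempts")
```

The `sleep` parameter exists only so the delay can be replaced in tests. A production version would likely add jitter and log each retry.
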