payslip-ingest/README.md
Viktor Barzin 57484619c1 Initial commit: event-driven UK payslip ingest service
Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI
can watch it. Identical content to /home/wizard/code commit e426028.

See README.md for overview, env vars, and Paperless workflow config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:10:23 +00:00


# payslip-ingest
Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document
is tagged `payslip`; this service fetches the PDF, calls `claude-agent-service`
to extract structured fields, and upserts into Postgres keyed by
`paperless_doc_id` (idempotent). A CLI `backfill` mode enumerates every
existing payslip in Paperless for initial population.
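
The idempotent upsert described above can be sketched as follows. This is only an illustration: it uses the standard-library `sqlite3` module instead of Postgres so it runs anywhere, and the `payslips` table and `net_pay` column are hypothetical names (only `paperless_doc_id` comes from this README). Postgres accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form, although the placeholder syntax depends on the driver.

```python
import sqlite3

# Hypothetical schema: only paperless_doc_id is named in this README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS payslips (
    paperless_doc_id INTEGER PRIMARY KEY,
    net_pay TEXT NOT NULL
)
"""

# A second ingest of the same document overwrites the row instead of duplicating it.
UPSERT = """
INSERT INTO payslips (paperless_doc_id, net_pay)
VALUES (?, ?)
ON CONFLICT (paperless_doc_id) DO UPDATE SET net_pay = excluded.net_pay
"""


def upsert_payslip(conn: sqlite3.Connection, doc_id: int, net_pay: str) -> None:
    """Insert the row, or update it if doc_id was already ingested."""
    conn.execute(UPSERT, (doc_id, net_pay))
    conn.commit()
```

Because the statement is keyed on `paperless_doc_id`, a webhook that Paperless delivers twice for the same document leaves exactly one row behind.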
## Local dev
```bash
poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .
# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf
```
## Env vars
| Variable | Purpose |
|---|---|
| `PAPERLESS_URL` | Paperless-ngx base URL (e.g. `https://paperless.viktorbarzin.me`) |
| `PAPERLESS_API_TOKEN` | Paperless API token (User → My Profile → API Auth Token) |
| `CLAUDE_AGENT_URL` | claude-agent-service URL (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`) |
| `CLAUDE_AGENT_BEARER_TOKEN` | Vault `secret/claude-agent-service`, key `api_bearer_token` |
| `DB_CONNECTION_STRING` | SQLAlchemy async URL: `postgresql+asyncpg://user:pass@host/db` |
| `WEBHOOK_BEARER_TOKEN` | Shared secret Paperless sends in `Authorization: Bearer ...` |
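
A fail-fast loader for the variables in the table might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical names, not the service's actual code, and treating all six variables as required is an assumption (`extract-one`, for example, only needs the two `CLAUDE_AGENT_*` values).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names are taken from the table above.
REQUIRED_VARS = (
    "PAPERLESS_URL",
    "PAPERLESS_API_TOKEN",
    "CLAUDE_AGENT_URL",
    "CLAUDE_AGENT_BEARER_TOKEN",
    "DB_CONNECTION_STRING",
    "WEBHOOK_BEARER_TOKEN",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are the lowercased variables."""

    paperless_url: str
    paperless_api_token: str
    claude_agent_url: str
    claude_agent_bearer_token: str
    db_connection_string: str
    webhook_bearer_token: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from env, listing every missing or empty variable at once."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Reporting all missing variables in one error avoids a fix-one-restart-repeat loop when a new deployment is misconfigured.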
## Paperless workflow configuration
In Paperless-ngx, create a workflow:
- **Name**: `payslip-ingest`
- **Trigger**: Document Added, matching tag `payslip`
- **Action type**: Webhook
- **URL**: `http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook`
- **Method**: `POST`
- **Headers**:
  - `Authorization: Bearer <WEBHOOK_BEARER_TOKEN>`
  - `Content-Type: application/json`
- **Body** (template):
```json
{"document_id": {{ document_id }}}
```
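
On the receiving side, the handler has to check the bearer token and pull `document_id` out of the body. The sketch below shows one way to do that with the standard library only; `authorized` and `parse_document_id` are hypothetical helpers, and the header lookup assumes the web framework hands over a mapping keyed by `Authorization` exactly as written.

```python
import hmac
import json
from typing import Mapping, Optional


def authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Check the Authorization: Bearer header using a constant-time comparison."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode(), expected_token.encode()
    )


def parse_document_id(body: bytes) -> Optional[int]:
    """Return the integer document_id from the JSON body, or None if malformed."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    doc_id = payload.get("document_id") if isinstance(payload, dict) else None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(doc_id, int) and not isinstance(doc_id, bool):
        return doc_id
    return None
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how much of the token matched.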
## Deployment
Ship to the `payslip-ingest` namespace. The service serializes incoming
webhooks onto an in-process queue so it never collides with
`claude-agent-service`'s single-job lock.
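
The queueing behaviour can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not the service's code: `submit` is a hypothetical coroutine that posts one document to the agent and returns the HTTP status, the delay values are made up, and any status other than 409 is treated as finished (a real worker would also handle failures).

```python
import asyncio
from typing import Awaitable, Callable

BUSY = 409  # status the agent returns while its single-job lock is held


def enqueue(queue: "asyncio.Queue[int]", doc_id: int) -> int:
    """Webhook side: queue the document and answer 202 Accepted immediately."""
    queue.put_nowait(doc_id)
    return 202


async def worker(
    queue: "asyncio.Queue[int]",
    submit: Callable[[int], Awaitable[int]],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> None:
    """Drain the queue one document at a time, backing off while the agent is busy."""
    while True:
        doc_id = await queue.get()
        delay = base_delay
        try:
            # Retry the same document until the agent stops answering 409.
            while await submit(doc_id) == BUSY:
                await asyncio.sleep(delay)
                delay = min(delay * 2, max_delay)  # exponential backoff, capped
        finally:
            queue.task_done()
```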
Run the initial backfill once the deployment is live:
```bash
# `kubectl create job --from` only accepts a CronJob as its source and cannot
# be combined with a command, so run the backfill inside the live pod:
kubectl -n payslip-ingest exec deploy/payslip-ingest -- \
  python -m payslip_ingest backfill --all
```
## Architecture notes
- `extract-one` never touches the DB, so it is safe for ad-hoc re-extraction of PDFs on disk.
- The `backfill` command is idempotent (skips rows whose `paperless_doc_id`
already exists) so it can be re-run freely.
- Totals validation is a best-effort sanity check; mismatches are stored with
`validated=false` and `raw_extraction` retained for manual review, rather
than rejected.
- The agent service is **single-threaded**. The webhook handler enqueues and
returns 202; a single background worker drains the queue one at a time and
absorbs 409-busy responses from the agent with retry-with-backoff.
- New agent prompt lives at `.claude/agents/payslip-extractor` in the `infra`
repo — this is a separate deliverable (see TODOs).
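
The totals-validation note above can be illustrated with a short sketch. The field names `gross_pay`, `total_deductions`, and `net_pay`, the arithmetic rule, and the one-penny tolerance are all assumptions made for the example; only `validated` and `raw_extraction` are named in this README.

```python
from decimal import Decimal
from typing import Any, Dict

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance, not from the service


def build_row(doc_id: int, extraction: Dict[str, Any]) -> Dict[str, Any]:
    """Turn an extraction into a row, flagging (not rejecting) mismatched totals."""
    gross = Decimal(extraction["gross_pay"])
    deductions = Decimal(extraction["total_deductions"])
    net = Decimal(extraction["net_pay"])
    # Best-effort sanity check: gross minus deductions should equal net pay.
    validated = abs(gross - deductions - net) <= TOLERANCE
    return {
        "paperless_doc_id": doc_id,
        "net_pay": net,
        "validated": validated,
        # Kept verbatim so a failed check can be reviewed by hand later.
        "raw_extraction": extraction,
    }
```

Storing the row with `validated` set to false keeps the document visible in the database, whereas rejecting it would silently drop a payslip that may only have an unusual layout.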