Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI can watch it. Identical content to /home/wizard/code commit e426028. See README.md for overview, env vars, and Paperless workflow config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# payslip-ingest

Event-driven UK payslip ingest: Paperless-ngx fires a webhook when a document
is tagged `payslip`; this service fetches the PDF, calls `claude-agent-service`
to extract structured fields, and upserts into Postgres keyed by
`paperless_doc_id` (idempotent). A CLI `backfill` mode enumerates every
existing payslip in Paperless for initial population.

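
The idempotent upsert described above can be sketched as follows. This is a minimal illustration, not the service's actual code: it uses the standard-library `sqlite3` module as a stand-in for Postgres (both accept the same `ON CONFLICT ... DO UPDATE` form), and the `payslips` table and `net_pay` column are hypothetical. Only `paperless_doc_id` comes from this README.

```python
import sqlite3

# Hypothetical schema: only paperless_doc_id is named in this README.
SCHEMA = """
CREATE TABLE payslips (
    paperless_doc_id INTEGER PRIMARY KEY,
    net_pay TEXT NOT NULL
)
"""

# Insert a new row, or overwrite the existing one for the same document.
UPSERT = """
INSERT INTO payslips (paperless_doc_id, net_pay)
VALUES (?, ?)
ON CONFLICT (paperless_doc_id) DO UPDATE SET net_pay = excluded.net_pay
"""


def upsert_payslip(conn: sqlite3.Connection, doc_id: int, net_pay: str) -> None:
    """Write one payslip row; repeating the call for a doc_id never duplicates it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (doc_id, net_pay))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_payslip(conn, 42, "2100.00")
upsert_payslip(conn, 42, "2150.00")  # same document delivered a second time
rows = conn.execute("SELECT paperless_doc_id, net_pay FROM payslips").fetchall()
print(rows)
```

Because the key is the Paperless document id, a webhook that fires twice for the same document leaves one row holding the latest extraction.
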
## Local dev

```bash
poetry install
poetry run pytest -q
poetry run mypy .
poetry run ruff check .

# Smoke-test extraction against a real PDF (no DB writes):
export CLAUDE_AGENT_URL=http://claude-agent-service.claude-agent.svc.cluster.local:8080
export CLAUDE_AGENT_BEARER_TOKEN=...
poetry run python -m payslip_ingest extract-one /tmp/sample.pdf
```

## Env vars

| Variable | Purpose |
|---|---|
| `PAPERLESS_URL` | Paperless-ngx base URL (e.g. `https://paperless.viktorbarzin.me`) |
| `PAPERLESS_API_TOKEN` | Paperless API token (User → My Profile → API Auth Token) |
| `CLAUDE_AGENT_URL` | claude-agent-service URL (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`) |
| `CLAUDE_AGENT_BEARER_TOKEN` | Vault `secret/claude-agent-service` → `api_bearer_token` |
| `DB_CONNECTION_STRING` | SQLAlchemy async URL: `postgresql+asyncpg://user:pass@host/db` |
| `WEBHOOK_BEARER_TOKEN` | Shared secret Paperless sends in `Authorization: Bearer ...` |

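
One way to consume these variables is a small loader that fails fast when any are missing. The sketch below is an assumption about structure, not the service's real settings module: `Settings`, `load_settings`, and `REQUIRED_VARS` are hypothetical names, and treating all six variables as required is a simplification (the `extract-one` smoke test above sets only the two agent variables).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the table above.
REQUIRED_VARS = (
    "PAPERLESS_URL",
    "PAPERLESS_API_TOKEN",
    "CLAUDE_AGENT_URL",
    "CLAUDE_AGENT_BEARER_TOKEN",
    "DB_CONNECTION_STRING",
    "WEBHOOK_BEARER_TOKEN",
)


@dataclass(frozen=True)
class Settings:
    paperless_url: str
    paperless_api_token: str
    claude_agent_url: str
    claude_agent_bearer_token: str
    db_connection_string: str
    webhook_bearer_token: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, naming every missing variable."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})


# Usage with an explicit mapping, so the example does not depend on the shell.
settings = load_settings({name: "x" for name in REQUIRED_VARS})
```

Reporting all missing names at once avoids the fix-one-rerun-fix-another loop when a new deployment is missing several secrets.
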
## Paperless workflow configuration

In Paperless-ngx, create a workflow:

- **Name**: `payslip-ingest`
- **Trigger**: Document Added, matching tag `payslip`
- **Action type**: Webhook
- **URL**: `http://payslip-ingest.payslip-ingest.svc.cluster.local:8080/webhook`
- **Method**: `POST`
- **Headers**:
  - `Authorization: Bearer <WEBHOOK_BEARER_TOKEN>`
  - `Content-Type: application/json`
- **Body** (template):
  ```json
  {"document_id": {{ document_id }}}
  ```

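
On the receiving side, the request configured above needs two checks: the bearer token and the JSON body. The function below is a framework-free sketch of those checks. `handle_webhook` is a hypothetical name, the 401 and 400 status codes are assumptions, and it assumes the caller passes headers with the canonical `Authorization` key. The 202 matches what this README says the handler returns.

```python
import hmac
import json
from typing import Mapping, Optional, Tuple


def handle_webhook(
    headers: Mapping[str, str], body: bytes, expected_token: str
) -> Tuple[int, Optional[int]]:
    """Return (HTTP status, document id) for one webhook delivery."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {expected_token}"
    # Constant-time comparison avoids leaking the token through timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, None
    try:
        doc_id = json.loads(body)["document_id"]
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, missing key, or a body that is not a JSON object.
        return 400, None
    # The template renders the id unquoted, so anything but an int is malformed.
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        return 400, None
    return 202, doc_id


auth = {"Authorization": "Bearer s3cret"}
print(handle_webhook(auth, b'{"document_id": 123}', "s3cret"))
```
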
## Deployment

Ship to the `payslip-ingest` namespace. The service serializes incoming
webhooks onto an in-process queue so it never collides with
`claude-agent-service`'s single-job lock.

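
The serialization described above might look roughly like the following `asyncio` sketch: one worker drains a queue and backs off while the agent reports busy. Everything here is illustrative. `worker`, `fake_agent`, and the delay values are hypothetical, and `fake_agent` stands in for the real HTTP call by returning 409 on the first attempt for each document.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

AGENT_BUSY = 409  # status the agent returns while its single-job lock is held


async def worker(
    queue: "asyncio.Queue[int]",
    call_agent: Callable[[int], Awaitable[int]],
    done: List[int],
    base_delay: float = 0.01,
    max_delay: float = 1.0,
) -> None:
    """Drain the queue one document at a time, retrying while the agent is busy."""
    while True:
        doc_id = await queue.get()
        delay = base_delay
        try:
            while await call_agent(doc_id) == AGENT_BUSY:
                await asyncio.sleep(delay)
                delay = min(delay * 2, max_delay)  # exponential backoff, capped
            done.append(doc_id)
        finally:
            queue.task_done()


async def demo() -> List[int]:
    attempts: Dict[int, int] = {}

    async def fake_agent(doc_id: int) -> int:
        # Stand-in for claude-agent-service: busy on the first call per document.
        attempts[doc_id] = attempts.get(doc_id, 0) + 1
        return AGENT_BUSY if attempts[doc_id] == 1 else 200

    queue: "asyncio.Queue[int]" = asyncio.Queue()
    done: List[int] = []
    task = asyncio.create_task(worker(queue, fake_agent, done))
    for doc_id in (1, 2, 3):
        queue.put_nowait(doc_id)  # what the webhook handler would do before replying
    await queue.join()  # wait until every queued document has been processed
    task.cancel()
    return done


processed = asyncio.run(demo())
print(processed)
```

The sketch retries indefinitely on 409; a production worker would probably cap the number of attempts, which this README does not specify.
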
Run the initial backfill once the deployment is live:

```bash
# `kubectl create job --from=...` accepts only a CronJob as its source, so run
# the backfill inside a pod of the live deployment instead, where the env vars
# listed above are already set.
kubectl -n payslip-ingest exec deploy/payslip-ingest \
  -- python -m payslip_ingest backfill --all
```

## Architecture notes

- `extract-one` never touches the DB, so it is safe for ad-hoc re-extraction
  of a PDF on disk.
- The `backfill` command is idempotent (skips rows whose `paperless_doc_id`
  already exists) so it can be re-run freely.
- Totals validation is a best-effort sanity check; mismatches are stored with
  `validated=false` and `raw_extraction` retained for manual review, rather
  than rejected.
- The agent service is **single-threaded**. The webhook handler enqueues and
  returns 202; a single background worker drains the queue one job at a time
  and absorbs 409-busy responses from the agent with retry-with-backoff.
- The new agent prompt lives at `.claude/agents/payslip-extractor` in the
  `infra` repo; it is a separate deliverable (see TODOs).

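
The flag-rather-than-reject behaviour of totals validation can be illustrated with a short sketch. The rule used here (gross minus deductions equals net) and the field names `gross_pay`, `total_deductions`, and `net_pay` are assumptions chosen for illustration; the README names only `validated` and `raw_extraction`.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict


@dataclass
class PayslipRecord:
    paperless_doc_id: int
    validated: bool
    raw_extraction: Dict[str, Any]  # always kept, so a human can review it later


def build_record(doc_id: int, extraction: Dict[str, Any]) -> PayslipRecord:
    """Flag, rather than reject, an extraction whose totals do not add up."""
    try:
        # Hypothetical field names; Decimal avoids float rounding on money.
        gross = Decimal(str(extraction["gross_pay"]))
        deductions = Decimal(str(extraction["total_deductions"]))
        net = Decimal(str(extraction["net_pay"]))
        validated = gross - deductions == net
    except (KeyError, ArithmeticError):
        validated = False  # missing or unparseable totals also go to manual review
    return PayslipRecord(doc_id, validated, extraction)


good = build_record(
    1, {"gross_pay": "3000.00", "total_deductions": "850.50", "net_pay": "2149.50"}
)
bad = build_record(
    2, {"gross_pay": "3000.00", "total_deductions": "850.50", "net_pay": "2100.00"}
)
print(good.validated, bad.validated)
```

Storing the mismatching row keeps the ingest pipeline moving and leaves the decision to whoever reviews the `validated=false` records.
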