Without this, each extraction took 5-10 minutes because the base64'd PDF
expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext, which
turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku)
then processes the text in seconds.
- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if
pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path; PDF_BASE64 is still
  handled as the fallback.
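The fast path plus fallback could be sketched as follows (illustrative shape only; the repo's actual _build_prompt signature and return type may differ):

```python
import base64
import shutil
import subprocess

def build_prompt(pdf_path: str) -> dict:
    """Prefer pdftotext -layout; fall back to base64 when the tool is
    missing (local dev) or the PDF yields no text (scanned image)."""
    if shutil.which("pdftotext"):
        try:
            result = subprocess.run(
                ["pdftotext", "-layout", pdf_path, "-"],  # "-" = stdout
                capture_output=True, text=True, check=True, timeout=30,
            )
            text = result.stdout.strip()
            if text:  # empty output usually means a scanned image
                return {"PAYSLIP_TEXT": text}
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            pass  # unreadable PDF: fall through to base64
    with open(pdf_path, "rb") as fh:
        return {"PDF_BASE64": base64.b64encode(fh.read()).decode("ascii")}
```

Either branch failing leaves the old base64 behaviour intact, so the change is purely an optimisation.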
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real UK payslip extractions routinely take 5-10min end-to-end (Haiku
processing 100-300KB base64'd PDFs). With 10 retries × 5s = 50s we'd
abort while another extraction was still in flight. Bump to 90 retries
× 10s = 900s of waiting, enough to cover the server-side
timeout_seconds=600 plus some slack.
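The budget arithmetic above can be sketched as a simple poller (fetch is a hypothetical stand-in for the real status call):

```python
import time

MAX_RETRIES = 90       # was 10
POLL_INTERVAL_S = 10   # was 5

def wait_for_result(fetch, retries=MAX_RETRIES, interval=POLL_INTERVAL_S):
    """Poll fetch() until it returns a non-None result or the budget
    (retries * interval, 900s with the defaults) is exhausted."""
    for _ in range(retries):
        result = fetch()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"no result after {retries * interval}s")
```

With the old defaults (10 × 5s) the loop gave up at 50s, well inside the server's own 600s budget; the new defaults outlast it.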
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real UK payslip PDFs are 100-200KB base64'd, which means ~300-500KB of prompt
tokens. Claude (even Haiku) takes 1-5 minutes to process and emit structured JSON.
The original 120s ceiling timed out before extraction could finish.
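The size inflation itself is mechanical: base64 encodes every 3 input bytes as 4 output characters, so payloads grow by roughly a third before any tokenisation. A quick check:

```python
import base64

def base64_size(n_bytes: int) -> int:
    """Length in characters of the base64 encoding of n_bytes of input
    (4 chars per 3-byte block, padded)."""
    return len(base64.b64encode(b"\x00" * n_bytes))

# a 200 KB PDF becomes roughly 267 KB of base64 text
```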
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The payslip_ingest schema must exist before Alembic creates its
alembic_version tracking table inside it. Add CREATE SCHEMA IF NOT EXISTS
at the top of do_run_migrations.
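A minimal sketch of the helper, assuming SQLAlchemy's text() and that do_run_migrations in env.py calls it before context.configure() (names other than payslip_ingest are illustrative):

```python
from sqlalchemy import text

SCHEMA = "payslip_ingest"  # must match version_table_schema in env.py

def ensure_schema(connection, schema: str = SCHEMA) -> None:
    """Create the schema before Alembic writes alembic_version into it;
    on a fresh database the schema does not exist yet and configure()
    with version_table_schema would otherwise fail."""
    connection.execute(text(f'CREATE SCHEMA IF NOT EXISTS "{schema}"'))
```

CREATE SCHEMA IF NOT EXISTS is idempotent, so running it on every migration pass is harmless.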
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracted from /home/wizard/code monorepo into its own repo so Woodpecker CI
can watch it. Identical content to /home/wizard/code commit e426028.
See README.md for overview, env vars, and Paperless workflow config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>