payslip-ingest

Viktor Barzin 3da24fdf7a	2026-04-18 22:48:04 +00:00

extractor: preextract PDF text with pdftotext before calling Claude

Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds.

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path; it still handles PDF_BASE64 for fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
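The fallback described in the commit message could look roughly like the sketch below. It is not the actual contents of extractor.py: the function name `build_prompt_payload`, its `pdftotext_bin` parameter, the 30-second timeout, and the `(label, payload)` return shape are all assumptions made for illustration. Only the `pdftotext -layout` call, the base64 fallback, and the `PAYSLIP_TEXT` / `PDF_BASE64` labels come from the commit message.

```python
import base64
import shutil
import subprocess
import tempfile


def build_prompt_payload(pdf_bytes: bytes, pdftotext_bin: str = "pdftotext") -> tuple[str, str]:
    """Return (label, payload): extracted text when possible, else base64.

    Hypothetical helper; the real _build_prompt() may differ.
    """
    exe = shutil.which(pdftotext_bin)  # None when poppler-utils is not installed
    if exe is not None:
        with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
            tmp.write(pdf_bytes)
            tmp.flush()
            try:
                # "-" as the output file makes pdftotext write to stdout.
                result = subprocess.run(
                    [exe, "-layout", tmp.name, "-"],
                    capture_output=True,
                    text=True,
                    errors="replace",
                    timeout=30,  # assumed limit, not taken from the source
                    check=False,
                )
            except (OSError, subprocess.TimeoutExpired):
                result = None
        # A scanned-image PDF typically yields no text, so an empty result
        # is treated the same as a failed run.
        if result is not None and result.returncode == 0 and result.stdout.strip():
            return "PAYSLIP_TEXT", result.stdout
    # Fallback: ship the raw PDF as base64, as before the change.
    return "PDF_BASE64", base64.b64encode(pdf_bytes).decode("ascii")
```

Calling `build_prompt_payload(pdf_bytes)` on a machine without poppler-utils returns the `PDF_BASE64` form, which matches the local-dev case the commit mentions. Treating empty output as a failure is one way to route scanned PDFs to the fallback; the real code may detect that case differently.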
..
__init__.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
__main__.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
app.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
db.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
extractor.py	extractor: preextract PDF text with pdftotext before calling Claude	2026-04-18 22:48:04 +00:00
paperless.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
processor.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
schema.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00
tax_year.py	Initial commit: event-driven UK payslip ingest service	2026-04-18 22:10:23 +00:00