payslip-ingest/payslip_ingest
Viktor Barzin 3da24fdf7a extractor: preextract PDF text with pdftotext before calling Claude
Without this, each extraction took 5-10 minutes because the base64-encoded
PDF expanded to ~300KB of prompt input. poppler-utils ships pdftotext, which
turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku)
then processes the text in seconds.

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if
  pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path; it still handles
  PDF_BASE64 as the fallback.
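
The fallback order described in the list above can be sketched roughly as
follows. This is a minimal illustration, not the code in extractor.py: the
function names extract_text() and build_prompt(), the 30-second timeout, and
the exact prompt layout are assumptions. Only the pdftotext -layout call and
the PAYSLIP_TEXT / PDF_BASE64 labels come from the commit message.

```python
import base64
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def extract_text(pdf_path: Path, timeout: float = 30.0) -> Optional[str]:
    """Return layout-preserved text, or None if no usable text was produced.

    Hypothetical helper: None covers both "pdftotext is not installed"
    (e.g. local dev) and "the PDF has no text layer" (e.g. a scanned image).
    """
    exe = shutil.which("pdftotext")
    if exe is None:
        return None  # poppler-utils is not on PATH
    try:
        # "-" as the output file makes pdftotext write the text to stdout.
        result = subprocess.run(
            [exe, "-layout", str(pdf_path), "-"],
            capture_output=True,
            check=True,
            timeout=timeout,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None  # unreadable or corrupt PDF, or the tool hung
    text = result.stdout.decode("utf-8", errors="replace").strip()
    return text or None  # empty output is treated as "unreadable"


def build_prompt(pdf_path: Path) -> str:
    """Prefer the small text payload; fall back to the base64 PDF."""
    text = extract_text(pdf_path)
    if text is not None:
        return f"PAYSLIP_TEXT:\n{text}"
    encoded = base64.b64encode(pdf_path.read_bytes()).decode("ascii")
    return f"PDF_BASE64:\n{encoded}"
```

Treating empty pdftotext output the same as a failure keeps scanned payslips
on the slower base64 path instead of sending Claude an empty prompt.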

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:48:04 +00:00
__init__.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
__main__.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
app.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
db.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
extractor.py extractor: preextract PDF text with pdftotext before calling Claude 2026-04-18 22:48:04 +00:00
paperless.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
processor.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
schema.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00
tax_year.py Initial commit: event-driven UK payslip ingest service 2026-04-18 22:10:23 +00:00