extractor: pre-extract PDF text with pdftotext before calling Claude

Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt text. poppler-utils ships pdftotext, which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds.

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path; it still handles PDF_BASE64 as the fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:48:04 +00:00
commit 3da24fdf7a
parent 693ec4a5d4
2 changed files with 35 additions and 2 deletions
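
The fast path and fallback described in the commit message could look roughly like the sketch below. This is an illustration, not the code from this commit: the helper names pdf_to_text and build_prompt, the injectable extract parameter, the 10-second timeout, and the exact "PAYSLIP_TEXT:" / "PDF_BASE64:" prompt layout are all assumptions. Only the pdftotext -layout call and the two marker names come from the message itself.

```python
import base64
import shutil
import subprocess
import tempfile
from typing import Callable, Optional


def pdf_to_text(pdf_bytes: bytes, timeout: float = 10.0) -> Optional[str]:
    """Return the PDF's text layer, or None if it cannot be extracted.

    Hypothetical helper; assumes a POSIX system where a named temporary
    file can be opened by another process while it is still open here.
    """
    if shutil.which("pdftotext") is None:
        return None  # poppler-utils not installed (e.g. local dev)
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        try:
            # Passing "-" as the output file makes pdftotext write to stdout.
            result = subprocess.run(
                ["pdftotext", "-layout", tmp.name, "-"],
                capture_output=True,
                timeout=timeout,
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired):
            return None
    if result.returncode != 0:
        return None  # not a readable PDF
    text = result.stdout.decode("utf-8", errors="replace").strip()
    return text or None  # scanned images yield no text layer


def build_prompt(
    pdf_bytes: bytes,
    extract: Callable[[bytes], Optional[str]] = pdf_to_text,
) -> str:
    """Prefer the small plain-text payload; fall back to base64."""
    text = extract(pdf_bytes)
    if text is not None:
        return "PAYSLIP_TEXT:\n" + text  # assumed prompt layout
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return "PDF_BASE64:\n" + encoded  # assumed prompt layout
```

Making the extractor a parameter keeps the fallback decision testable on machines without poppler-utils, which is the same situation the local-dev fallback is meant to cover.
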
--- a/4
+++ b/4
@@ -20,6 +20,10 @@ FROM python:3.12-slim

 WORKDIR /app

+RUN apt-get update \
+    && apt-get install -y --no-install-recommends poppler-utils \
+    && rm -rf /var/lib/apt/lists/*
+
 RUN useradd --system --uid 10001 --home /app --shell /usr/sbin/nologin payslip

 COPY --from=builder --chown=payslip:payslip /app /app