extractor: preextract PDF text with pdftotext before calling Claude

Without this, each extraction took 5-10 minutes because the base64'd PDF
expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which
turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku)
then processes the text in seconds.

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if
  pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path — still handles
  PDF_BASE64 for fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-04-18 22:48:04 +00:00
parent 693ec4a5d4
commit 3da24fdf7a
2 changed files with 35 additions and 2 deletions

View file

@ -20,6 +20,10 @@ FROM python:3.12-slim
WORKDIR /app
RUN apt-get update \
&& apt-get install -y --no-install-recommends poppler-utils \
&& rm -rf /var/lib/apt/lists/*
RUN useradd --system --uid 10001 --home /app --shell /usr/sbin/nologin payslip
COPY --from=builder --chown=payslip:payslip /app /app