extractor: preextract PDF text with pdftotext before calling Claude
Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds. - Dockerfile installs poppler-utils in the runtime stage (one-liner). - _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image). - Agent file documents the PAYSLIP_TEXT fast path — still handles PDF_BASE64 for fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
693ec4a5d4
commit
3da24fdf7a
2 changed files with 35 additions and 2 deletions
|
|
@ -20,6 +20,10 @@ FROM python:3.12-slim
|
|||
|
||||
WORKDIR /app
|
||||
|
||||
RUN apt-get update \
|
||||
&& apt-get install -y --no-install-recommends poppler-utils \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN useradd --system --uid 10001 --home /app --shell /usr/sbin/nologin payslip
|
||||
|
||||
COPY --from=builder --chown=payslip:payslip /app /app
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue