Without this, each extraction took 5-10 minutes because the base64'd PDF expanded to ~300KB of prompt tokens. poppler-utils ships pdftotext, which turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku) then processes the text in seconds.

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path; it still handles PDF_BASE64 as a fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
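
For illustration, the pdftotext-first, base64-fallback behaviour described in this commit message could look roughly like the sketch below. The real body of _build_prompt() is not shown here, so the function names (extract_text, build_prompt), the binary parameter, the timeout, and the exact prompt layout are all assumptions; only the pdftotext -layout call and the PAYSLIP_TEXT / PDF_BASE64 labels come from the message.

```python
import base64
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def extract_text(pdf_path: Path, binary: str = "pdftotext",
                 timeout: float = 30.0) -> Optional[str]:
    """Return the PDF's text via `pdftotext -layout`, or None if that fails.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if shutil.which(binary) is None:
        # Binary not installed (e.g. local dev without poppler-utils).
        return None
    try:
        # "-" as the output file makes pdftotext write to stdout.
        result = subprocess.run(
            [binary, "-layout", str(pdf_path), "-"],
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Non-zero exit, timeout, or failure to launch: treat as unreadable.
        return None
    text = result.stdout.decode("utf-8", errors="replace").strip()
    # A scanned-image PDF usually yields no text, so report it as unreadable.
    return text or None


def build_prompt(pdf_path: Path, binary: str = "pdftotext") -> str:
    """Prefer the small plain-text payload; fall back to base64 of the PDF."""
    text = extract_text(pdf_path, binary=binary)
    if text is not None:
        return "PAYSLIP_TEXT:\n" + text  # assumed prompt layout
    encoded = base64.b64encode(pdf_path.read_bytes()).decode("ascii")
    return "PDF_BASE64:\n" + encoded  # assumed prompt layout
```

Returning None for both "binary missing" and "no text extracted" keeps the fallback decision in one place, so the caller only has to branch once.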
37 lines
851 B
Docker
FROM python:3.12-slim AS builder

ENV POETRY_VERSION=1.8.4 \
    POETRY_VIRTUALENVS_IN_PROJECT=true \
    PIP_NO_CACHE_DIR=1

RUN pip install --no-cache-dir "poetry==${POETRY_VERSION}"

WORKDIR /app
COPY pyproject.toml poetry.lock* README.md ./
RUN poetry install --only main --no-root

COPY payslip_ingest ./payslip_ingest
COPY alembic ./alembic
COPY alembic.ini ./alembic.ini
RUN poetry install --only main


FROM python:3.12-slim

WORKDIR /app

RUN apt-get update \
    && apt-get install -y --no-install-recommends poppler-utils \
    && rm -rf /var/lib/apt/lists/*

RUN useradd --system --uid 10001 --home /app --shell /usr/sbin/nologin payslip

COPY --from=builder --chown=payslip:payslip /app /app

ENV PATH="/app/.venv/bin:${PATH}" \
    PYTHONUNBUFFERED=1

EXPOSE 8080
USER payslip
ENTRYPOINT ["python", "-m", "payslip_ingest"]
CMD ["serve"]