payslip-ingest/Dockerfile
Viktor Barzin 3da24fdf7a extractor: pre-extract PDF text with pdftotext before calling Claude
Without this, each extraction took 5-10 minutes because the base64-encoded
PDF expanded to ~300KB of prompt input. poppler-utils ships pdftotext, which
turns a 200KB PDF into ~3KB of plain text in milliseconds. Claude (Haiku)
then processes the text in seconds.
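
The size figures above follow from how base64 works: every 3 input bytes become 4 output characters. A minimal sketch of that arithmetic, where the helper name `base64_size` is hypothetical and the 200KB input is a zero-filled stand-in rather than a real payslip:

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# Stand-in for a 200KB PDF; the content does not affect the encoded length.
pdf_bytes = bytes(200 * 1024)
encoded = base64.b64encode(pdf_bytes)

# The formula agrees with the real encoder: roughly a one-third increase.
assert len(encoded) == base64_size(len(pdf_bytes))
print(len(pdf_bytes), "->", len(encoded))
```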

- Dockerfile installs poppler-utils in the runtime stage (one-liner).
- _build_prompt() tries pdftotext -layout first; falls back to base64 if
  pdftotext is missing (local dev) or the PDF is unreadable (scanned image).
- Agent file documents the PAYSLIP_TEXT fast path and still handles
  PDF_BASE64 as the fallback.
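
The fallback logic described in the list above might look roughly like the sketch below. It is not the code from this commit: the function names `extract_text` and `build_prompt`, the 30-second timeout, and the exact prompt layout are assumptions for illustration. Only `pdftotext -layout` and the `PAYSLIP_TEXT` / `PDF_BASE64` markers come from the commit message.

```python
import base64
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def extract_text(pdf_path: Path, timeout: float = 30.0) -> Optional[str]:
    """Return pdftotext -layout output, or None if the fast path is unavailable."""
    if shutil.which("pdftotext") is None:
        return None  # binary not installed, e.g. local dev without poppler-utils
    try:
        # "-" as the output file makes pdftotext write to stdout.
        result = subprocess.run(
            ["pdftotext", "-layout", str(pdf_path), "-"],
            capture_output=True,
            check=True,
            timeout=timeout,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None  # unreadable or corrupt PDF
    text = result.stdout.decode("utf-8", errors="replace").strip()
    return text or None  # scanned-image PDFs contain no text layer


def build_prompt(pdf_path: Path) -> str:
    """Prefer extracted text; fall back to the base64-encoded PDF."""
    text = extract_text(pdf_path)
    if text is not None:
        return f"PAYSLIP_TEXT:\n{text}"
    encoded = base64.b64encode(pdf_path.read_bytes()).decode("ascii")
    return f"PDF_BASE64:\n{encoded}"
```

Treating an empty extraction as a failure keeps scanned payslips on the base64 path instead of sending Claude an empty prompt.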

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:48:04 +00:00


FROM python:3.12-slim AS builder
ENV POETRY_VERSION=1.8.4 \
    POETRY_VIRTUALENVS_IN_PROJECT=true \
    PIP_NO_CACHE_DIR=1
RUN pip install --no-cache-dir "poetry==${POETRY_VERSION}"
WORKDIR /app
COPY pyproject.toml poetry.lock* README.md ./
RUN poetry install --only main --no-root
COPY payslip_ingest ./payslip_ingest
COPY alembic ./alembic
COPY alembic.ini ./alembic.ini
RUN poetry install --only main
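
# Runtime stage
FROM python:3.12-slim
WORKDIR /app
# poppler-utils provides pdftotext, which the extractor uses to pre-extract PDF text.
RUN apt-get update \
    && apt-get install -y --no-install-recommends poppler-utils \
    && rm -rf /var/lib/apt/lists/*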
RUN useradd --system --uid 10001 --home /app --shell /usr/sbin/nologin payslip
COPY --from=builder --chown=payslip:payslip /app /app
ENV PATH="/app/.venv/bin:${PATH}" \
    PYTHONUNBUFFERED=1
EXPOSE 8080
USER payslip
ENTRYPOINT ["python", "-m", "payslip_ingest"]
CMD ["serve"]