paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import

Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
  office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
  OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
  (requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip the redundant archive file for
  born-digital/office docs that already carry a text layer (big IO saver for
  the work-doc set); scanned docs still OCR+archive.
Guard + etcd watch stay in place; revert to defaults after the import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-27 18:45:25 +00:00
parent d6bd9486e3
commit 2cb37d51d4

@@ -225,11 +225,18 @@ resource "kubernetes_deployment" "paperless-ngx" {
           # it degrades. Revert both to defaults once the import is done.
           env {
             name = "PAPERLESS_TASK_WORKERS"
-            value = "2"
+            value = "4"
           }
           env {
             name = "PAPERLESS_THREADS_PER_WORKER"
-            value = "2"
+            value = "1"
+          }
+          # Skip the redundant OCR'd archive PDF for inputs that already carry a
+          # text layer (born-digital PDFs + office->PDF via Gotenberg). Big
+          # speed/IO saver for emo's work-doc set; scanned docs still OCR+archive.
+          env {
+            name = "PAPERLESS_OCR_SKIP_ARCHIVE_FILE"
+            value = "with_text"
           }
           volume_mount {
             name = "data"
@@ -242,7 +249,7 @@ resource "kubernetes_deployment" "paperless-ngx" {
               memory = "2Gi"
             }
             limits = {
-              memory = "4Gi"
+              memory = "8Gi"
             }
           }
@@ -299,7 +306,9 @@ resource "kubernetes_service" "paperless-ngx" {
 # --- Tika + Gotenberg: Office/email -> text/PDF conversion for paperless ---
 # Apache Tika extracts text+metadata; Gotenberg renders Office formats to PDF.
 # Paperless routes Office/email docs through these (PAPERLESS_TIKA_* above).
-# Stateless (no PVC), pinned images, single replica bulk import is serial.
+# Stateless (no PVC), pinned images. 3 replicas during the bulk import: a
+# single LibreOffice instance 503s under concurrent paperless workers; the
+# Service load-balances office conversions across the replicas.
 resource "kubernetes_deployment" "gotenberg" {
   metadata {
     name = "gotenberg"
@@ -310,7 +319,7 @@ resource "kubernetes_deployment" "gotenberg" {
     }
   }
   spec {
-    replicas = 1
+    replicas = 3
     selector {
       match_labels = {
         app = "gotenberg"
@@ -395,7 +404,7 @@ resource "kubernetes_deployment" "tika" {
     }
   }
   spec {
-    replicas = 1
+    replicas = 2
     selector {
       match_labels = {
         app = "tika"
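
The commit message asks for these values to be reverted once the import is done. One possible way to make that revert a one-line change is sketched below. It is not part of this commit: the variable `bulk_import` and every `local.*` name are hypothetical, and the "off" values are simply the ones this diff replaced, which may not be the upstream defaults.

```hcl
# Hypothetical toggle, not present in the repository: flip to true for the
# duration of a bulk import, back to false afterwards.
variable "bulk_import" {
  description = "Apply the temporary bulk-import tuning (assumed name)."
  type        = bool
  default     = false
}

locals {
  # true  -> values introduced by this commit
  # false -> values this commit replaced
  gotenberg_replicas = var.bulk_import ? 3 : 1
  tika_replicas      = var.bulk_import ? 2 : 1
  task_workers       = var.bulk_import ? "4" : "2"
  threads_per_worker = var.bulk_import ? "1" : "2"
  memory_limit       = var.bulk_import ? "8Gi" : "4Gi"
}
```

Under that assumption the deployments would reference `local.gotenberg_replicas`, `local.tika_replicas`, and so on instead of literals. The `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` block would still need separate handling, since it is added rather than changed.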