diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index b4c9486d..30ec710b 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -124,6 +124,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
 - **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
+- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
 - **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
 - **Traefik LB IP = `10.0.20.203`, `externalTrafficPolicy: Local`** (dedicated, NOT the shared `.200`). Moved off the shared `.200` on 2026-05-30 so direct/non-proxied apps preserve the **real client IP for CrowdSec** (ETP=Cluster SNAT'd them to the node IP) and so QUIC works. **The shared `10.0.20.200` keeps the other 10 LB services** (PG state-backend `postgresql-lb`, headscale, wireguard, coturn, xray, etc. — all ETP=Cluster; MetalLB forbids mixed ETP on a shared IP, hence Traefik's own IP). **cloudflared targets the in-cluster Traefik Service** (`https://traefik.traefik.svc.cluster.local:443`, remote/dashboard tunnel config — edit via CF Global API Key in `secret/platform`), so proxied apps are decoupled from the LB IP. pfSense WAN 443 (tcp+udp) NAT → alias `traefik_lb` (`.203`). Internal split-horizon apex `viktorbarzin.me A` → `.203`. Full runbook + post-mortem: `docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*`.
 - **IPv6 ingress** = HE 6in4 tunnel (`2001:470:6e:43d::2`) → **standalone HAProxy on pfSense** (`/usr/local/etc/ipv6-haproxy.cfg`, NOT the HAProxy package) using `send-proxy-v2` → Traefik `.203` (web 443/80) + mail NodePorts `30125-30128` (25/465/587/993) — so **real IPv6 client IPs reach CrowdSec**. Traefik trusts PROXY-v2 **only from `10.0.20.1`** (`entryPoints.web/websecure.proxyProtocol.trustedIPs`); real IPv4 clients (own source IP) unaffected. **No QUIC over IPv6** (bridge is TCP/h2). Replaced socat 2026-05-30 (socat masked every v6 client as `10.0.20.1`). Boot/persistence: config.xml `` → `ipv6_proxy.sh` (patches nginx off `[::]:443/:80` to free the tunnel IPv6, then `service ipv6proxy onestart`); `rc.d/ipv6proxy` manages HAProxy. Backends use **no health `check`** (a plain TCP check false-DOWNs the PROXY-expecting listeners). As-built: `docs/architecture/networking.md` → "IPv6 Ingress".
diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md
index 43b45e4f..7b8935e5 100644
--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@@ -252,6 +252,18 @@ Additional middleware:
 - **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
 - **HTTP/3 (QUIC)**: Enabled globally on Traefik.
 
+### Entrypoint Transport Timeouts
+
+The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/traefik/main.tf`:
+
+| Timeout | Value | Bounds |
+|---|---|---|
+| `readTimeout` | `3600s` | Total time to read one request incl. body → **max upload duration** |
+| `writeTimeout` | `0s` (disabled) | Total time to write the response → **max download duration (0 = unlimited)** |
+| `idleTimeout` | `600s` | Keep-alive idle between requests (does *not* apply to active transfers) |
+
+**Gotcha — these are HARD caps on total duration, not idle timeouts** (unlike nginx `proxy_*_timeout`, which reset on every read). A finite `writeTimeout` truncates *any* download that runs longer than it, regardless of progress. A prior `writeTimeout=60s` silently cut large Immich video downloads at the 60s mark (HTTP/2 stream reset). `writeTimeout=0` (Traefik's default) is required for unlimited-size downloads — Immich's own Traefik reverse-proxy guidance assumes it and never sets `writeTimeout`. `readTimeout` is kept finite (not 0) because an unbounded request read is the slow-loris vector; 3600s passes multi-GB uploads while keeping a backstop (Immich has no resumable upload, so the window must exceed real upload times). Single-asset downloads (`GET /api/assets/{id}/original`) serve `206 Partial Content`, so they are also resumable on a dropped connection; on-the-fly ZIP "download all" is not (no stable byte offsets).
+
 ### MetalLB & Load Balancing
 
 MetalLB v0.15.3 allocates IPs from the range 10.0.20.200-10.0.20.220 in **Layer 2 mode**. Most LoadBalancer services share **10.0.20.200** using the `metallb.io/allow-shared-ip: shared` annotation. Two services have **dedicated IPs** with `externalTrafficPolicy: Local` to preserve real client source IPs: **Traefik (10.0.20.203)** — so CrowdSec sees real public IPs on the direct-ingress path and QUIC/HTTP3 works (a shared IP forbids the mixed ETP that QUIC's UDP listener needs) — and **Technitium DNS (10.0.20.201)** for query logging.
@@ -298,6 +310,8 @@ The web path works because Traefik trusts PROXY-v2 **only from `10.0.20.1`** (`e
 
 **No QUIC over IPv6** — the bridge is TCP/h2 only; IPv4 carries QUIC/HTTP3.
 
+The bridge's HAProxy uses `timeout client 1h` / `timeout server 1h`, which are **inactivity** timeouts (reset on every byte), *not* total-transfer caps — so steady large downloads/uploads over IPv6 are not limited by the bridge. The download-duration cap was solely Traefik's `writeTimeout` (see Entrypoint Transport Timeouts above), now `0`.
+
 pfSense files (out-of-band, **not Terraform**):
 - `/usr/local/etc/ipv6-haproxy.cfg` — the 6-frontend bridge config above.
 - `/usr/local/etc/rc.d/ipv6proxy` — service wrapper (`service ipv6proxy {start,stop,status}`); `start` does a graceful `-sf` reload.
@@ -499,6 +513,14 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
 
 **Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min.
 
+### Large Downloads or Uploads Truncate / Fail Partway
+
+**Symptoms**: Large file transfers (e.g. Immich videos, Nextcloud sync) fail at a consistent wall-clock point regardless of file — a download stops at exactly N seconds × throughput bytes; an upload fails ~1 min in. Browser shows "network error"; `curl` exits 18/92 (truncated / HTTP/2 stream reset).
+
+**Diagnosis**: Check the `websecure` entrypoint `respondingTimeouts` (see Entrypoint Transport Timeouts). These are **hard total-duration caps**, not idle timeouts — a finite `writeTimeout` cuts downloads, a finite `readTimeout` cuts uploads, both regardless of progress. Reproduce deterministically: `curl --limit-rate 6M` a file large enough to exceed the cap; it dies at the cap.
+
+**Fix**: `writeTimeout=0` (unlimited downloads), `readTimeout` ≥ longest expected upload (currently `3600s`). Not Cloudflare (Immich is non-proxied) and not the pfSense IPv6 bridge (its 1h timeouts are inactivity-based).
+
 ## Related
 
 - **Runbooks**:
diff --git a/stacks/traefik/modules/traefik/main.tf b/stacks/traefik/modules/traefik/main.tf
index cd5a97ab..1ed2ac41 100644
--- a/stacks/traefik/modules/traefik/main.tf
+++ b/stacks/traefik/modules/traefik/main.tf
@@ -234,9 +234,17 @@ resource "helm_release" "traefik" {
       "--serversTransport.forwardingTimeouts.idleConnTimeout=90s",
       # Increase backend connection pool (default maxIdleConnsPerHost=2 is too low)
       "--serversTransport.maxIdleConnsPerHost=100",
-      # Explicit entrypoint timeouts to bound tail latency from slow clients
-      "--entryPoints.websecure.transport.respondingTimeouts.readTimeout=60s",
-      "--entryPoints.websecure.transport.respondingTimeouts.writeTimeout=60s",
+      # Entrypoint transport timeouts. NOTE: Traefik respondingTimeouts are HARD caps on
+      # total request/response duration (unlike nginx proxy_*_timeout, which reset per read).
+      # A finite writeTimeout therefore caps total *download* time regardless of progress —
+      # a prior writeTimeout=60s silently truncated large downloads at 60s (HTTP/2 reset).
+      # writeTimeout=0 -> unlimited download size/duration (Traefik's own default; Immich's
+      # reverse-proxy guidance assumes it — it never sets writeTimeout).
+      # readTimeout=3600s -> one upload may take up to 1h. NOT 0: an unbounded request read
+      # is the slow-loris vector (hence Traefik's 60s default). Immich has
+      # no resumable upload, so the window must exceed real upload times.
+      "--entryPoints.websecure.transport.respondingTimeouts.readTimeout=3600s",
+      "--entryPoints.websecure.transport.respondingTimeouts.writeTimeout=0s",
       "--entryPoints.websecure.transport.respondingTimeouts.idleTimeout=600s",
       # Use forwarded headers from trusted proxies
       "--entryPoints.websecure.forwardedHeaders.insecure=false",
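The truncation arithmetic in the troubleshooting entry (a download stops at roughly N seconds × throughput bytes) can be sketched in a few lines of Python. This is only an illustrative model, not part of the change: the helper names `max_bytes_under_cap` and `survives` are hypothetical, it assumes a perfectly steady transfer rate, and it assumes `curl --limit-rate 6M` means 6 MiB/s. The 10 Mbit/s uplink mentioned in the note after the block is likewise a made-up figure.

```python
from typing import Optional

MIB = 1024 * 1024  # assumption: curl's "M" suffix is 1024-based


def max_bytes_under_cap(rate_bytes_per_s: int, cap_seconds: int) -> Optional[int]:
    """Bytes a steady transfer can move before a hard total-duration cap fires.

    Returns None when the cap is 0, which the entrypoint treats as disabled.
    """
    if cap_seconds == 0:
        return None
    return rate_bytes_per_s * cap_seconds


def survives(size_bytes: int, rate_bytes_per_s: int, cap_seconds: int) -> bool:
    """True if a transfer of size_bytes finishes before the cap (or there is no cap)."""
    limit = max_bytes_under_cap(rate_bytes_per_s, cap_seconds)
    return limit is None or size_bytes <= limit


if __name__ == "__main__":
    # Old writeTimeout=60s with the throttled reproduction rate from the diagnosis step.
    print(max_bytes_under_cap(6 * MIB, 60))
    # New writeTimeout=0: no limit at any size.
    print(survives(10_000 * MIB, 6 * MIB, 0))
```

Under these assumptions, any test file larger than about 360 MiB would be cut by the old `writeTimeout=60s` at the throttled rate, so that is a reasonable lower bound when picking a file for the reproduction. The same function applied to `readTimeout=3600s` at a hypothetical 10 Mbit/s uplink (1,250,000 bytes/s) gives about 4.5 GB as the largest single upload that fits the window.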