wrongmove/docs/plans/2026-02-22-time-to-first-property-design.md
Viktor Barzin 6d653dba63
Add time-to-first-property performance optimization design
Measured baseline: 877ms TTFP cold, 334ms warm, with CPU throttling
confirmed (500m limit, 10-12 throttle events per streaming request).
Plan covers: CPU limit bump, frontend waterfall elimination, adaptive
first batch, deferred decision filtering, POI decoupling, and
Server-Timing headers.
2026-02-22 13:14:01 +00:00


Time-to-First-Property Performance Optimization

Problem

Initial page load takes 5+ seconds before the first property appears on screen, even when the Redis GeoJSON cache is warm. The target is sub-500ms time-to-first-property.

Measured Baseline (2026-02-22)

10,000 RENT listings, streaming endpoint via port-forward (no ingress overhead):

Metric                  Cold (DB)    Warm (Cache)
Time to first property  877ms        334ms
Total stream time       13,411ms     2,060ms
Throughput              745 feat/s   4,854 feat/s

Real-world user experience adds: Traefik ingress + TLS (~50-100ms), frontend auth waterfall (~100-300ms), POI fetch blocking stream start (~200-500ms), and browser NDJSON parsing + rendering. Total: 5+ seconds.

Root Causes

1. CPU Throttling (High Impact)

The API pods have no explicit resource declarations. The namespace LimitRange (created 2026-02-21) silently injects defaults:

  • CPU limit: 500m (0.5 cores)
  • Memory limit: 1Gi

Measured throttling during a single streaming request:

Path          Throttle events  CPU time stolen
Cold (DB)     +12              +240ms
Warm (cache)  +10              +462ms

Cgroup proof: cpu.max = 50000/100000 (a quota of 50ms of CPU time per 100ms period, i.e. 0.5 cores), and nr_throttled increases by 10-12 per request.

Current pod usage at idle is only 3m CPU / 134-159Mi memory, so the limit only bites during streaming bursts when the pod needs to serialize thousands of GeoJSON features.
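
The cgroup figures above can be checked with a few lines of Python. This is a minimal sketch, assuming cgroup v2, where cpu.max holds "<quota> <period>" in microseconds (space-separated in the file itself) and cpu.stat holds "key value" lines including nr_throttled and throttled_usec. The function names are hypothetical and not part of the codebase.

```python
from typing import Dict, Optional


def cpu_limit_millicores(cpu_max: str) -> Optional[int]:
    """Convert the contents of a cgroup v2 cpu.max file to millicores.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set (None is returned in that case).
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) * 1000 // int(period)


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the "key value" lines of a cgroup v2 cpu.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_delta(before: str, after: str) -> Dict[str, int]:
    """Compare two cpu.stat snapshots taken around a single request."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    return {
        "throttle_events": a["nr_throttled"] - b["nr_throttled"],
        "throttled_ms": (a["throttled_usec"] - b["throttled_usec"]) // 1000,
    }
```

Inside a container on cgroup v2 these files are typically found at /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/cpu.stat. Reading cpu.stat before and after one streaming request would give per-request deltas like those in the table.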

2. Frontend Waterfall (Medium Impact)

The request chain is sequential:

Page load -> Auth check -> user state -> POI fetch (blocks) -> loadListings()

loadListings() reads userPOIs.length to set includePoiDistances=true, so it cannot fire until the POI fetch resolves. When POI distances are requested, the Redis cache is bypassed entirely, forcing a DB path every time.

3. Pre-Stream Server Work (Medium Impact)

Before the first byte is sent, stream_listing_geojson() synchronously:

  1. Fetches disliked/liked IDs from DB (decision_service.get_disliked_ids)
  2. Checks Redis cache count
  3. If POIs enabled: builds full POI distances lookup table

Steps 1 and 3 are blocking DB queries that delay the metadata message.

4. Batch Size (Low Impact)

The first batch is not yielded until 50 features have accumulated. At 500m CPU, building 50 GeoJSON features from cache (Redis LRANGE + JSON parse + re-serialize) takes measurable time.

Design

Change 1: Set Explicit API Pod Resources

Add explicit resource requests/limits to the realestate-crawler-api deployment, overriding the LimitRange defaults.

Proposed values:

  • Requests: 50m CPU, 128Mi memory
  • Limits: 2000m CPU, 1Gi memory

This gives the pod 4x the current CPU headroom during streaming bursts while keeping requests modest for scheduling. The celery-beat pod should also get explicit resources since it only needs minimal CPU.

Implementation: K8s deployment manifest stored in k8s/api-deployment.yaml and applied via kubectl.

Change 2: Decouple POI Distances from Listing Stream

Split POI distance fetching into a separate request so the listing stream can fire immediately and hit the cache.

  • loadListings() always sends includePoiDistances=false
  • New endpoint GET /api/listing_poi_distances?listing_type=RENT returns {<listing_id>: [distances...]} for the user's configured POIs
  • Frontend fetches POI distances in parallel, merges into existing features when both resolve
  • Redis cache is no longer bypassed when POIs are configured

Trade-off: POI-based distance filters won't apply until the POI distances request completes (~200-300ms after first property renders). Acceptable since the user sees data immediately.
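
The merge step is simple enough to sketch. The real merge would live in the frontend; the Python below only illustrates the logic, assuming each GeoJSON feature carries its listing ID at properties.id and that the merged values go into a properties.poi_distances field. Both field names and the function name are hypothetical.

```python
from typing import Any, Dict, List


def merge_poi_distances(
    features: List[Dict[str, Any]],
    distances_by_id: Dict[str, List[float]],
) -> int:
    """Attach POI distances to already-loaded features, in place.

    distances_by_id mirrors the proposed response shape
    {<listing_id>: [distances...]}; JSON object keys are strings, so
    listing IDs are stringified before lookup. Returns the number of
    features that received distances.
    """
    merged = 0
    for feature in features:
        listing_id = str(feature["properties"]["id"])
        distances = distances_by_id.get(listing_id)
        if distances is not None:
            feature["properties"]["poi_distances"] = distances
            merged += 1
    return merged
```

Features with no entry in the response are left untouched, so the merge can safely run while later batches are still arriving and be repeated once the stream completes.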

Change 3: Adaptive First Batch Size

Send a small primer batch (5 features) immediately, then switch to normal batch size (50) for throughput.

Backend change in _stream_from_cache() and _stream_from_db():

  • First yield: after 5 features
  • Subsequent yields: after batch_size features

Expected improvement: the first batch needs 10x fewer features (5 instead of 50), so the server-side time to build it should shrink roughly in proportion.
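
A minimal sketch of the batching rule, written as a standalone generator. The name adaptive_batches and its parameters are hypothetical; the actual change would be made inside _stream_from_cache() and _stream_from_db().

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def adaptive_batches(
    items: Iterable[T],
    first_size: int = 5,
    batch_size: int = 50,
) -> Iterator[List[T]]:
    """Yield a small primer batch first, then full-size batches."""
    if first_size < 1 or batch_size < 1:
        raise ValueError("batch sizes must be positive")
    target = first_size
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= target:
            yield batch
            batch = []
            # Every batch after the primer uses the normal size.
            target = batch_size
    if batch:
        # Flush whatever is left at the end of the stream.
        yield batch
```

With 120 features this would yield batches of 5, 50, 50, and 15, so only the first yield is affected and throughput for the rest of the stream stays the same.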

Change 4: Defer Decision ID Fetch

Move the decision ID lookup (disliked/liked sets) out of the blocking pre-stream path:

  • Send metadata message immediately
  • Fetch decision IDs concurrently with the first (small) batch
  • Apply decision filtering starting from the second batch
  • The first 5 features may include a disliked listing, but client-side filtering in processedListingData already handles this
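
The ordering described above can be sketched with asyncio. This is an illustration under assumptions, not the existing stream_listing_geojson() code: the function name, the fetch_disliked_ids callable, the top-level id key, and the message dictionaries are all hypothetical, and only the three message types (metadata, batch, complete) come from the current protocol.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, List, Set

Feature = Dict[str, Any]


async def stream_with_deferred_decisions(
    features: List[Feature],
    fetch_disliked_ids: Callable[[], Awaitable[Set[int]]],
    first_size: int = 5,
    batch_size: int = 50,
) -> AsyncIterator[Dict[str, Any]]:
    # Start the decision lookup in the background instead of awaiting it.
    decisions_task = asyncio.ensure_future(fetch_disliked_ids())
    try:
        # Metadata and the primer batch go out without waiting for the DB.
        yield {"type": "metadata"}
        first = features[:first_size]
        if first:
            yield {"type": "batch", "features": first}

        # Decision filtering applies from the second batch onward.
        disliked = await decisions_task
        remaining = [f for f in features[first_size:] if f["id"] not in disliked]
        for start in range(0, len(remaining), batch_size):
            yield {"type": "batch", "features": remaining[start:start + batch_size]}
        yield {"type": "complete"}
    finally:
        # Do not leave the lookup running if the client disconnects early.
        if not decisions_task.done():
            decisions_task.cancel()
```

In this sketch the primer batch is sent unfiltered, which matches the design's reliance on client-side filtering in processedListingData to hide any disliked listing among the first 5 features.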

Change 5: Eliminate Frontend POI Waterfall

Current: useEffect(user) -> fetchPOIs() -> loadListings()

Proposed: Fire both in parallel on user auth:

useEffect(user) -> fetchPOIs()          (async, no blocking)
                -> loadListings()        (fires immediately)

POI distances arrive separately (Change 2), so the stream doesn't need to wait.

Change 6: Server-Timing Headers

Add Server-Timing headers to the streaming response for ongoing observability:

Server-Timing: auth;dur=12, decisions;dur=85, cache_check;dur=3, first_batch;dur=42

Visible in browser DevTools Network tab without any frontend code changes.
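
A header value in this format could be assembled with a small helper. The sketch below is framework-agnostic; PhaseTimer and format_server_timing are hypothetical names, and the phase names are taken from the example header above.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


def format_server_timing(durations_ms: Dict[str, float]) -> str:
    """Render {"auth": 12, ...} as "auth;dur=12, ..."."""
    parts = []
    for name, ms in durations_ms.items():
        # One decimal place at most, without a trailing ".0".
        value = f"{ms:.1f}".rstrip("0").rstrip(".")
        parts.append(f"{name};dur={value}")
    return ", ".join(parts)


class PhaseTimer:
    """Collect wall-clock durations (in ms) for named request phases."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def measure(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations_ms[name] = (time.perf_counter() - start) * 1000.0

    def header_value(self) -> str:
        return format_server_timing(self.durations_ms)
```

One design point to keep in mind: response headers are sent before the streamed body, so only phases that finish before the first byte can be reported this way. Including first_batch would mean building the first batch before the headers are flushed.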

Expected Impact

Change                           TTFP improvement                        Effort
CPU limit bump (500m -> 2000m)   ~200-400ms (eliminates throttling)      Low
Eliminate POI waterfall          ~200-500ms (parallel fetch)             Medium
Adaptive first batch (50 -> 5)   ~150-300ms (fewer features to build)    Low
Defer decision IDs               ~100-200ms (no pre-stream DB query)     Low
Decouple POI distances endpoint  Enables cache hits with POIs            Medium
Server-Timing headers            Observability, no direct improvement    Low

Conservative estimate: warm cache TTFP drops from 334ms to ~100-150ms server-side, and real-world user experience drops from 5s+ to under 1s.

What Does NOT Change

  • NDJSON protocol (same 3 message types: metadata, batch, complete)
  • Redis cache structure, TTL, or key format
  • Database schema or repository layer
  • Client-side decision filtering in processedListingData
  • Total data delivered (all features still streamed)

Files Modified

  • k8s/api-deployment.yaml (new) - explicit API pod resources
  • api/app.py - adaptive batch, deferred decisions, Server-Timing, POI distances endpoint
  • services/listing_cache.py - no changes (listed to confirm the cache layer is untouched)
  • frontend/src/App.tsx - parallel fetch, POI merge
  • frontend/src/services/streamingService.ts - remove POI coupling
  • frontend/src/services/listingService.ts - new POI distances fetch function