Add time-to-first-property performance optimization design
Measured baseline: 877ms TTFP cold, 334ms warm, with CPU throttling confirmed (500m limit, 10-12 throttle events per streaming request). Plan covers: CPU limit bump, frontend waterfall elimination, adaptive first batch, deferred decision filtering, POI decoupling, and Server-Timing headers.
Parent: 9dc011754b
Commit: 6d653dba63
1 changed file with 157 additions and 0 deletions
docs/plans/2026-02-22-time-to-first-property-design.md (normal file)
@@ -0,0 +1,157 @@

# Time-to-First-Property Performance Optimization

## Problem

Initial page load takes 5+ seconds before the first property appears on screen, even when the Redis GeoJSON cache is warm. The target is a sub-500ms time-to-first-property (TTFP).

## Measured Baseline (2026-02-22)

10,000 RENT listings, streaming endpoint via port-forward (no ingress overhead):

| Metric | Cold (DB) | Warm (Cache) |
|--------|-----------|--------------|
| Time to first property | 877ms | 334ms |
| Total stream time | 13,411ms | 2,060ms |
| Throughput | 745 feat/s | 4,854 feat/s |

Real-world user experience adds: Traefik ingress + TLS (~50-100ms), frontend auth waterfall (~100-300ms), POI fetch blocking stream start (~200-500ms), and browser NDJSON parsing + rendering. Observed end-to-end total: **5+ seconds**.

## Root Causes

### 1. CPU Throttling (High Impact)

The API pods have no explicit resource declarations. The namespace `LimitRange` (created 2026-02-21) silently injects defaults:

- **CPU limit: 500m (0.5 cores)**
- Memory limit: 1Gi

Measured throttling during a single streaming request:

| Path | Throttle events | CPU time stolen |
|------|----------------|-----------------|
| Cold (DB) | +12 | +240ms |
| Warm (cache) | +10 | +462ms |

Cgroup proof: `cpu.max = 50000/100000`, `nr_throttled` increases by 10-12 per request.

Idle pod usage is only 3m CPU and 134-159Mi memory, so the limit bites only during streaming bursts, when the pod must serialize thousands of GeoJSON features.

### 2. Frontend Waterfall (Medium Impact)

The request chain is sequential:

```
Page load -> Auth check -> user state -> POI fetch (blocks) -> loadListings()
```

`loadListings()` reads `userPOIs.length` to set `includePoiDistances=true`, so it cannot fire until the POI fetch resolves. When POI distances are requested, the Redis cache is bypassed entirely, forcing a DB path every time.

### 3. Pre-Stream Server Work (Medium Impact)

Before the first byte is sent, `stream_listing_geojson()` synchronously:

1. Fetches disliked/liked IDs from DB (`decision_service.get_disliked_ids`)
2. Checks Redis cache count
3. If POIs enabled: builds full POI distances lookup table

Steps 1 and 3 are blocking DB queries that delay the metadata message.

### 4. Batch Size (Low Impact)

The first batch is not yielded until 50 features have accumulated. At 500m CPU, building 50 GeoJSON features from cache (Redis LRANGE + JSON parse + re-serialize) takes measurable time.

## Design

### Change 1: Set Explicit API Pod Resources

Add explicit resource requests/limits to the `realestate-crawler-api` deployment, overriding the `LimitRange` defaults.

Proposed values:

- **Requests**: 50m CPU, 128Mi memory
- **Limits**: 2000m CPU, 1Gi memory

This raises the CPU limit to 4x its current value (2000m vs. 500m) for streaming bursts while keeping requests modest for scheduling. The celery-beat pod should also get explicit resources since it only needs minimal CPU.

Implementation: K8s deployment manifest stored in `k8s/api-deployment.yaml` and applied via kubectl.

### Change 2: Decouple POI Distances from Listing Stream

Split POI distance fetching into a separate request so the listing stream can fire immediately and hit the cache.

- `loadListings()` always sends `includePoiDistances=false`
- New endpoint `GET /api/listing_poi_distances?listing_type=RENT` returns `{<listing_id>: [distances...]}` for the user's configured POIs
- Frontend fetches POI distances in parallel and merges them into the existing features once both requests resolve
- Redis cache is no longer bypassed when POIs are configured

Trade-off: POI-based distance filters won't apply until the POI distances request completes (~200-300ms after the first property renders). This is acceptable since the user sees data immediately.

### Change 3: Adaptive First Batch Size

Send a small primer batch (5 features) immediately, then switch to normal batch size (50) for throughput.

Backend change in `_stream_from_cache()` and `_stream_from_db()`:

- First yield: after 5 features
- Subsequent yields: after `batch_size` features

Expected improvement: the first yield needs 10x fewer features (5 instead of 50), so the server-side time spent building the first batch should drop by roughly the same factor.
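
The batching rule above can be sketched as a small generator. This is an illustration only: `adaptive_batches` and its parameter names are hypothetical, and the real change would live inside `_stream_from_cache()` and `_stream_from_db()` rather than in a standalone helper.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def adaptive_batches(
    features: Iterable[T], first_batch_size: int = 5, batch_size: int = 50
) -> Iterator[list[T]]:
    """Yield a small primer batch first, then full-size batches."""
    it = iter(features)
    size = first_batch_size
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
        size = batch_size  # every batch after the first uses the normal size


sizes = [len(b) for b in adaptive_batches(range(120))]
print(sizes)  # [5, 50, 50, 15]
```

Because the generator pulls from an iterator, it works the same way whether the features come from a Redis range read or a DB cursor.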

### Change 4: Defer Decision ID Fetch

Move the decision ID lookup (disliked/liked sets) out of the blocking pre-stream path:

- Send `metadata` message immediately
- Fetch decision IDs concurrently with the first (small) batch
- Apply decision filtering starting from the second batch
- The first 5 features may include a disliked listing, but client-side filtering in `processedListingData` already handles this
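
One way to realize this ordering is to start the lookup as a background task and await it only after the primer batch has been yielded. The sketch below assumes an `asyncio`-based streaming generator. `fetch_disliked_ids` is a stand-in for `decision_service.get_disliked_ids`, and the message fields (`type`, `total`, `features`) are assumed for illustration rather than taken from the actual NDJSON schema.

```python
import asyncio
import json
from typing import AsyncIterator


async def fetch_disliked_ids() -> set[int]:
    """Hypothetical stand-in for the decision lookup."""
    await asyncio.sleep(0.01)  # simulate DB latency
    return {2, 7}


async def stream_with_deferred_decisions(
    features: list[dict], first_batch_size: int = 5, batch_size: int = 50
) -> AsyncIterator[str]:
    # Start the lookup without awaiting it, so it overlaps with streaming.
    decisions = asyncio.create_task(fetch_disliked_ids())

    yield json.dumps({"type": "metadata", "total": len(features)}) + "\n"

    # The primer batch goes out unfiltered; the client filters it.
    primer = features[:first_batch_size]
    yield json.dumps({"type": "batch", "features": primer}) + "\n"

    disliked = await decisions  # already resolved, or awaited here
    for start in range(first_batch_size, len(features), batch_size):
        chunk = features[start : start + batch_size]
        kept = [f for f in chunk if f["id"] not in disliked]
        yield json.dumps({"type": "batch", "features": kept}) + "\n"

    yield json.dumps({"type": "complete"}) + "\n"


async def collect() -> list[dict]:
    features = [{"id": i} for i in range(10)]
    return [json.loads(line) async for line in stream_with_deferred_decisions(features)]


messages = asyncio.run(collect())
print([m["type"] for m in messages])  # ['metadata', 'batch', 'batch', 'complete']
```

In this run the primer batch still contains listing 2, while listing 7 is dropped from the second batch, which matches the behavior described in the last bullet.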

### Change 5: Eliminate Frontend POI Waterfall

Current: `useEffect(user) -> fetchPOIs() -> loadListings()`

Proposed: Fire both in parallel on user auth:

```
useEffect(user) -> fetchPOIs()     (async, no blocking)
                -> loadListings()  (fires immediately)
```

POI distances arrive separately (Change 2), so the stream doesn't need to wait.

### Change 6: Server-Timing Headers

Add `Server-Timing` headers to the streaming response for ongoing observability:

```
Server-Timing: auth;dur=12, decisions;dur=85, cache_check;dur=3, first_batch;dur=42
```

Visible in browser DevTools Network tab without any frontend code changes.
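
A header value in this format could be assembled with a small timing collector. The sketch below is illustrative: the `ServerTiming` class and its `phase` context manager are hypothetical helpers, not existing code in `api/app.py`, and durations are rounded to whole milliseconds to match the example.

```python
import time
from contextlib import contextmanager


class ServerTiming:
    """Collect named phase durations and render a Server-Timing header value."""

    def __init__(self) -> None:
        self.durations_ms: dict[str, float] = {}

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed wall-clock time even if the phase raises.
            self.durations_ms[name] = (time.perf_counter() - start) * 1000

    def header_value(self) -> str:
        return ", ".join(
            f"{name};dur={dur:.0f}" for name, dur in self.durations_ms.items()
        )


timing = ServerTiming()
with timing.phase("cache_check"):
    pass  # the real Redis count check would run here

# Fixed values reproduce the example header from the text.
timing.durations_ms = {"auth": 12, "decisions": 85, "cache_check": 3, "first_batch": 42}
print(timing.header_value())
# auth;dur=12, decisions;dur=85, cache_check;dur=3, first_batch;dur=42
```

Since HTTP response headers are sent before the body, a header built this way can only carry phases that finish before the first byte is written. A phase measured later in the stream would need another channel, such as a trailer.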

## Expected Impact

| Change | TTFP improvement | Effort |
|--------|-----------------|--------|
| CPU limit bump (500m -> 2000m) | ~200-400ms (eliminates throttling) | Low |
| Eliminate POI waterfall | ~200-500ms (parallel fetch) | Medium |
| Adaptive first batch (50 -> 5) | ~150-300ms (fewer features to build) | Low |
| Defer decision IDs | ~100-200ms (no pre-stream DB query) | Low |
| Decouple POI distances endpoint | Enables cache hits with POIs | Medium |
| Server-Timing headers | Observability, no direct improvement | Low |

Conservative estimate: **warm cache TTFP drops from 334ms to ~100-150ms server-side**, and real-world user experience drops from **5s+ to under 1s**.

## What Does NOT Change

- NDJSON protocol (same 3 message types: metadata, batch, complete)
- Redis cache structure, TTL, or key format
- Database schema or repository layer
- Client-side decision filtering in `processedListingData`
- Total data delivered (all features still streamed)

## Files Modified

- `k8s/api-deployment.yaml` (new) - explicit API pod resources
- `api/app.py` - adaptive batch, deferred decisions, Server-Timing, POI distances endpoint
- `services/listing_cache.py` - no changes
- `frontend/src/App.tsx` - parallel fetch, POI merge
- `frontend/src/services/streamingService.ts` - remove POI coupling
- `frontend/src/services/listingService.ts` - new POI distances fetch function