trading/.planning/codebase/CONCERNS.md
2026-02-23 20:04:05 +00:00

20 KiB

Codebase Concerns

Analysis Date: 2026-02-23

Tech Debt

Incomplete API Portfolio Metrics:

  • Issue: Portfolio metrics endpoint returns hardcoded placeholder values instead of computed analytics
  • Files: services/api_gateway/routes/portfolio.py (lines 90-92, 170, 172)
  • Impact: Dashboard displays incorrect P&L, drawdown, and hold duration metrics; client decisions based on false data
  • Fix approach: Implement cumulative P&L tracking from first portfolio snapshot, compute max drawdown from historical snapshots, calculate average hold duration from trade outcomes table, implement Redis trading pause flag integration

Missing Portfolio Sync Credentials Handling:

  • Issue: Portfolio sync task silently disables when Alpaca credentials are missing, with no alerting mechanism
  • Files: services/api_gateway/tasks/portfolio_sync.py (lines 125-129)
  • Impact: Dashboard shows stale or zero portfolio data without user awareness; critical data source failure goes unnoticed
  • Fix approach: Add explicit warning log at startup, implement health check endpoint that reports sync status, consider sending alerts to dashboard via WebSocket

Broad Exception Handling:

  • Issue: Multiple services catch bare Exception without specific error classification
  • Files: services/sentiment_analyzer/main.py (lines 150, 207, 226), services/trade_executor/main.py (lines 136, 203, 222), services/signal_generator/main.py (lines 87, 178, 194, 257), services/learning_engine/main.py (lines 304), services/market_data/main.py (lines 108, 158, 253), services/api_gateway/tasks/portfolio_sync.py (lines 152)
  • Impact: Errors silently logged without retry logic, potential data loss, operational issues invisible to monitoring
  • Fix approach: Implement specific exception handling (IntegrityError, APIError, ValidationError, etc.), add telemetry counters for error categories, implement exponential backoff for transient failures

Implicit Database Persistence Fallback:

  • Issue: Multiple services (sentiment-analyzer, trade-executor) silently continue if DB initialization fails
  • Files: services/sentiment_analyzer/main.py (lines 203-208), services/trade_executor/main.py (lines 199-204), services/signal_generator/main.py (initial DB setup)
  • Impact: Services run without persistence layer — no audit trail, no dashboard data, potential duplicate processing if service restarts
  • Fix approach: Fail fast on DB connection errors, or implement persistent message queue (file-based) as fallback to ensure no trades are lost

Learning Engine Position State In Redis Only:

  • Issue: Opening trades stored in Redis positions:history hash with no persistence — loss of intermediate state if service crashes
  • Files: services/learning_engine/main.py (lines 65-82, 121-138)
  • Impact: Position close detection fails after service restart; partial trades with no close detected = unrealistic P&L
  • Fix approach: Move position history to database with persistent transaction log, implement consistent read-modify-write with transactions

Known Bugs

Portfolio Sync N+1 Query Pattern:

  • Symptoms: One SELECT per position during portfolio sync; if 100 positions, 100 database queries per cycle
  • Files: services/api_gateway/tasks/portfolio_sync.py (lines 74-92)
  • Trigger: Any portfolio with multiple open positions
  • Workaround: None; will degrade performance as position count grows
  • Fix: Use SELECT * WHERE ticker IN (...) to fetch all positions in one query, then update in-memory dict

Daily P&L Calculation Bug:

  • Symptoms: Portfolio endpoint returns same value for daily_pnl_pct and total_pnl_pct
  • Files: services/api_gateway/routes/portfolio.py (lines 90-91)
  • Trigger: Every portfolio request
  • Root cause: Code reuses daily_pnl_pct for total_pnl_pct instead of computing cumulative metric
  • Workaround: Manually inspect database portfolio_snapshots for correct metrics
  • Fix: Compute cumulative P&L from first snapshot timestamp to now

Position Unrealized P&L Division by Zero:

  • Symptoms: Potential division by zero when p.qty == 0 (closed position)
  • Files: services/api_gateway/routes/portfolio.py (lines 116-117, 120-121)
  • Trigger: When position is exactly 0 shares (edge case in reconciliation)
  • Workaround: Guards in place (if p.qty) but logic is error-prone
  • Fix: Refactor to use if p.qty and p.avg_entry guard early, simplify arithmetic

Signal Sentiment Context Missing Price Field:

  • Symptoms: Risk manager reads sentiment_context["current_price"] but signal may not include it
  • Files: services/trade_executor/risk_manager.py (lines 118-120)
  • Trigger: If signal is generated without market snapshot or price is omitted
  • Workaround: Falls back to 0.0, position sizing becomes incorrect
  • Fix: Ensure signal generator always includes current_price in sentiment_context, add assertion in executor

Security Considerations

JWT Secret Key Not Enforced in Development:

  • Risk: Default development deployments may use weak or missing JWT secret keys
  • Files: services/api_gateway/config.py (lines 13-15), services/api_gateway/auth/jwt.py (lines 41, 96)
  • Current mitigation: Config requires jwt_secret_key but does not validate length/entropy
  • Recommendations:
    • Add validation in ApiGatewayConfig that secret is >= 32 bytes
    • Fail startup if secret is not set (don't use default)
    • Add warning log if secret is weak (< 256 bits)
    • Document minimum secret requirements in README

Test Keys Insufficient for Production:

  • Risk: Test suites use hardcoded short keys that would be cryptographically weak in production
  • Files: Multiple test files use test keys like "test-secret-key"
  • Current mitigation: Only in test environment
  • Recommendations:
    • Use pytest fixture to generate strong keys per test
    • Document that tests should not be deployed as-is
    • Consider environment variable override in CI

CORS Configuration Hardcoded for Localhost:

  • Risk: Default CORS allows only http://localhost:5173, but production deployment requires environment override
  • Files: services/api_gateway/config.py (line 26)
  • Current mitigation: CORS is at least restricted (not *), but easy to miss in production deploy
  • Recommendations:
    • Make CORS_ORIGINS a required config variable (no default)
    • Add validation that origins use HTTPS in production
    • Log warning if origin is localhost in production environment

Passkey Registration Not Rate-Limited:

  • Risk: WebAuthn registration endpoint (if present) has no rate limiting, allowing brute-force attacks
  • Files: Not fully reviewed; check services/api_gateway/auth/routes.py
  • Current mitigation: Unknown
  • Recommendations:
    • Add rate limiting (redis-backed) on registration endpoints
    • Implement CAPTCHA or email verification for new accounts
    • Log failed registration attempts for monitoring

No HTTPS Enforcement in API Gateway:

  • Risk: JWT tokens transmitted over plaintext HTTP in development config
  • Files: services/api_gateway/config.py (line 31 uses http://localhost:5173)
  • Current mitigation: Only affects localhost development
  • Recommendations:
    • Add enforce_https config flag; reject insecure origins in production
    • Set secure flag on JWT cookies
    • Use HSTS headers in FastAPI

Performance Bottlenecks

Sentiment Scores Stored in Unbounded Deque:

  • Problem: Signal generator stores up to 50 sentiment scores per ticker in memory (never evicted from running process)
  • Files: services/signal_generator/main.py (line 35, lines 107-111)
  • Cause: deques are bounded but not persisted; high-frequency tickers accumulate memory
  • Current: 50 * ~5000 tickers = ~250K floats in memory during high volume
  • Improvement path:
    • Store sentiment history in database (timeseries table) instead of memory
    • Query last N scores on-demand per signal generation
    • Add TTL/expiration for old scores

Market Data Manager Holds Full OHLCV History:

  • Problem: MarketDataManager buffers all bars received for each ticker in memory
  • Files: services/signal_generator/main.py (referenced from main)
  • Cause: No bounded retention; running service accumulates bars indefinitely
  • Impact: Memory leak over weeks of operation
  • Improvement path:
    • Implement sliding window (keep last 100 bars only)
    • Persist bars to database for strategy backtesting
    • Use TimescaleDB hypertable compression for efficient historical queries

Portfolio Sync Deletes and Re-inserts All Positions:

  • Problem: Every sync cycle deletes all local positions and re-inserts from broker
  • Files: services/api_gateway/tasks/portfolio_sync.py (lines 94-101)
  • Cause: Simplistic approach; doesn't diff changes
  • Impact: Every 60s delete cascade on Position table (triggers cascade on related trades?)
  • Improvement path:
    • Implement diff-based upsert: only update changed tickers
    • Delete positions not in broker snapshot
    • Reduces churn, avoids potential foreign key cascade issues

No Connection Pooling Tuning:

  • Problem: AsyncPG default pool size may be insufficient for 7 services + API load
  • Files: shared/db.py (line 13-16)
  • Cause: create_async_engine uses default pool_size
  • Impact: Connection starvation under load, query timeouts
  • Improvement path:
    • Add pool_size and max_overflow config to BaseConfig
    • Monitor pool utilization (add telemetry)
    • Right-size pools per environment

Alpaca API Calls Not Cached:

  • Problem: Every trade executor and portfolio sync makes synchronous Alpaca API calls for account/positions
  • Files: services/trade_executor/main.py (line 74), services/api_gateway/tasks/portfolio_sync.py (line 64)
  • Cause: No request deduplication or caching layer
  • Impact: Slow order submission, API rate limit risks
  • Improvement path:
    • Cache account snapshot in Redis (TTL 5s)
    • Batch position queries where possible
    • Implement exponential backoff for API failures

Fragile Areas

Sentiment Analyzer Ollama Fallback Untested:

  • Files: services/sentiment_analyzer/analyzers/ollama_analyzer.py, services/sentiment_analyzer/main.py (lines 71-79)
  • Why fragile: Fallback path (Ollama) rarely exercised if FinBERT always has high confidence
  • Safe modification: Add integration test that mocks FinBERT confidence < threshold, verifies Ollama called
  • Test coverage: No explicit test for fallback path
  • Impact: If FinBERT fails completely, Ollama fallback could silently degrade or crash

Signal Generator Ensemble Evaluation:

  • Files: services/signal_generator/main.py (line 149), services/signal_generator/ensemble.py
  • Why fragile: Three strategies weighted by Redis weights; if weights become NaN or negative, ensemble breaks silently
  • Safe modification: Add assertions that weights are valid (sum~1, all in [0,1]), implement weight validation on load
  • Test coverage: No explicit test for weight anomalies (e.g., all weights corrupted)
  • Impact: Invalid weights could produce signals with undefined strength

Risk Manager Market Hours Check Time Zone:

  • Files: services/trade_executor/risk_manager.py (lines 19-25, 59-61)
  • Why fragile: Hard-coded ET timezone; DST transitions not handled explicitly
  • Safe modification: Use ZoneInfo with explicit DST handling (Python 3.9+), add tests for DST boundaries
  • Test coverage: No explicit DST test
  • Impact: Orders placed 1 hour off during DST transition

Learning Engine Position Matching Logic:

  • Files: services/learning_engine/main.py (lines 85-98, 121-138)
  • Why fragile: String comparison of side enum (opening_side == OrderSide.BUY.value)
  • Safe modification: Use enum comparison directly, add type hints, add unit tests for all side combinations
  • Test coverage: Tests exist but may not cover edge cases (e.g., SELL -> BUY -> SELL)
  • Impact: Position close detection fails, P&L not recorded, weights never adjusted

WebAuthn Credential Validation:

  • Files: services/api_gateway/auth/routes.py (check WebAuthn implementation)
  • Why fragile: WebAuthn implementation may not validate all required fields (origin, user verification, etc.)
  • Safe modification: Review against OWASP WebAuthn checklist, add explicit assertion tests
  • Test coverage: Limited test coverage for security properties
  • Impact: Weak authentication allows account takeover

Scaling Limits

Redis Single Node No Persistence:

  • Current capacity: In-memory Redis instance; loses all data on restart
  • Limit: All stream data (news, signals, trades) lost if Redis crashes
  • Blocks: Reliable message delivery, audit trails, service recovery
  • Scaling path:
    • Enable AOF (append-only file) persistence
    • Consider Redis Cluster for HA
    • Alternatively, add database-backed message queue (e.g., TimescaleDB FIFO table)

PostgreSQL Shared Memory Not Tuned:

  • Current capacity: Default shared_buffers, work_mem, etc.
  • Limit: Query performance degrades with 7 concurrent services
  • Scaling path:
    • Tune PostgreSQL config: shared_buffers, work_mem, effective_cache_size per machine
    • Add read replicas for reporting queries
    • Partition portfolio snapshots by date (TimescaleDB hypertables)

Dashboard WebSocket No Backpressure:

  • Current capacity: Single WebSocket connection broadcasts all events
  • Limit: Slow clients blocked, queue unbounded memory growth
  • Scaling path:
    • Implement client subscription topics (only send relevant tickers)
    • Add message batching (send updates every 100ms)
    • Implement client-side connection pooling

No Circuit Breaker for Alpaca API:

  • Current capacity: Alpaca API calls block entire trade executor on timeout
  • Limit: One slow Alpaca response stalls all signal processing
  • Scaling path:
    • Implement circuit breaker pattern (fail-open after N failures)
    • Add request timeout with retry logic
    • Queue unprocessable signals for retry after recovery

Dependencies at Risk

Ollama Local LLM Not Containerized with Stack:

  • Risk: Ollama runs separately; must be manually started for sentiment analyzer to work
  • Files: docker-compose.yml (check if Ollama service present)
  • Impact: Sentiment analyzer fails silently if Ollama not running; fallback depends on FinBERT
  • Migration plan: Add Ollama to docker-compose.yml, or make Ollama truly optional with better error handling

FinBERT Model Download Not Cached:

  • Risk: FinBERT downloads model from HuggingFace on first run (~600MB)
  • Files: services/sentiment_analyzer/analyzers/finbert.py
  • Impact: Long startup time, may timeout in resource-constrained environments
  • Migration plan: Pre-download model into Docker image, add model cache invalidation strategy

Alpaca SDK Direct HTTP Calls Not Mocked in Integration Tests:

  • Risk: Integration tests mock Alpaca SDK, so real behavior never tested
  • Files: tests/services/test_trade_executor.py (check mocks), tests/integration/
  • Impact: Order submission bugs discovered only in production
  • Migration plan: Implement live integration tests against Alpaca paper trading, verify in CI/CD

SQLAlchemy 2.0 Async Mode Requires Specific Dialect:

  • Risk: AsyncPG optional dependency; falls back to sync driver if missing
  • Files: shared/db.py (line 13 uses postgresql+asyncpg://)
  • Impact: If asyncpg not installed, services run in sync mode, defeating async benefits
  • Migration plan: Add explicit check in create_db() to fail if asyncpg missing

Missing Critical Features

No Duplicate Trade Detection:

  • Problem: If signal generator processes same article twice (due to Redis consumer group lag), executor may place duplicate trades
  • Blocks: Multi-service reliability, audit accuracy
  • Scope: Implement idempotency key in TradeExecution, check signal_id before placing order

No Circuit Break / Trading Pause Mechanism:

  • Problem: If models degrade, bot continues trading losses
  • Blocks: Risk mitigation, manual emergency stop
  • Current: TRADING_PAUSED_KEY in Redis (services/api_gateway/routes/controls.py), but not enforced in executor
  • Scope: Check executor consumes trading pause flag before submitting orders

No Backtester Trade Slippage Model:

  • Problem: Backtester assumes fills at exact signal price; unrealistic
  • Blocks: Accurate backtest results, risk estimation
  • Scope: Add configurable slippage model (fixed basis points + volume-based)

No Strategy Parameter Sensitivity Analysis:

  • Problem: Can't understand which parameters drive P&L
  • Blocks: Scientific strategy optimization, risk understanding
  • Scope: Implement parameter sweep in backtester, record sensitivity metrics

No Live Trade Reconciliation:

  • Problem: If Alpaca and local DB diverge, no auto-correction
  • Blocks: Accurate position tracking, tax reporting
  • Scope: Implement daily reconciliation job that compares broker vs. DB, alerts on mismatch

Test Coverage Gaps

Dashboard Components Not Unit Tested:

  • What's not tested: React components (portfolio display, trade table, charts, alerts)
  • Files: dashboard/src/components/, dashboard/src/pages/
  • Risk: Component bugs, broken UI layouts only caught in E2E (if at all)
  • Priority: High — dashboard is user-facing; regressions impact usability

API Routes Happy Path Only:

  • What's not tested: Edge cases (missing fields, invalid enum values, concurrent requests)
  • Files: tests/services/test_api_routes.py (13 tests covering basic CRUD)
  • Risk: API crashes on edge inputs, poor error messages
  • Priority: Medium — most users use dashboard, not direct API

Market Data Manager Not Unit Tested:

  • What's not tested: Deque overflow behavior, snapshot generation with missing bars
  • Files: services/signal_generator/market_data.py (no test file)
  • Risk: Ensemble generates signals with incomplete data
  • Priority: High — directly impacts signal quality

Learning Engine Weight Adjustment Edge Cases:

  • What's not tested: Weight convergence, NaN/inf propagation, simultaneous rewards for same strategy
  • Files: services/learning_engine/weight_adjuster.py (test coverage unknown)
  • Risk: Weights diverge to extremes, strategies disabled unintentionally
  • Priority: High — learning is core feature

WebAuthn Registration and Verification:

  • What's not tested: Invalid registration data, replay attacks, malformed challenge responses
  • Files: services/api_gateway/auth/routes.py (WebAuthn handlers)
  • Risk: Authentication bypass, account takeover
  • Priority: Critical — security sensitive

Redis Streams Consumer Group Edge Cases:

  • What's not tested: Consumer group recovery after crash, message redelivery, offset management
  • Files: shared/redis_streams.py (5 tests total)
  • Risk: Messages lost or duplicated after service restart
  • Priority: High — data reliability depends on it

Concurrent Signal and Trade Execution:

  • What's not tested: Race conditions when multiple signals processed simultaneously
  • Files: Integration tests mock this with single thread
  • Risk: Phantom positions, incorrect risk calculations
  • Priority: High — real production has concurrent load

Recommendations for Production Readiness

Before going live, address at minimum:

  1. Critical (blocks production):

    • Implement portfolio metrics computation (cumulative P&L, max drawdown)
    • Fix portfolio sync query performance (N+1 issue)
    • Validate JWT secret key entropy at startup
    • Add rate limiting to auth endpoints
    • Implement circuit breaker for Alpaca API calls
    • Add duplicate trade detection (idempotency)
  2. High (should address soon):

    • Implement trade reconciliation job
    • Move position state from Redis to database
    • Add specific exception handling (not bare Exception)
    • Implement backpressure in WebSocket broadcasting
    • Test Ollama fallback path end-to-end
    • Add comprehensive dashboard unit tests
  3. Medium (can defer):

    • Optimize portfolio sync with diff-based approach
    • Persist market data and sentiment to TimescaleDB
    • Tune PostgreSQL connection pool
    • Add DST test coverage for market hours
    • Implement parameter sensitivity analysis in backtester

Concerns audit: 2026-02-23