20 KiB
Codebase Concerns
Analysis Date: 2026-02-23
Tech Debt
Incomplete API Portfolio Metrics:
- Issue: Portfolio metrics endpoint returns hardcoded placeholder values instead of computed analytics
- Files:
services/api_gateway/routes/portfolio.py(lines 90-92, 170, 172) - Impact: Dashboard displays incorrect P&L, drawdown, and hold duration metrics; client decisions based on false data
- Fix approach: Implement cumulative P&L tracking from first portfolio snapshot, compute max drawdown from historical snapshots, calculate average hold duration from trade outcomes table, implement Redis trading pause flag integration
Missing Portfolio Sync Credentials Handling:
- Issue: Portfolio sync task silently disables when Alpaca credentials are missing, with no alerting mechanism
- Files:
services/api_gateway/tasks/portfolio_sync.py(lines 125-129) - Impact: Dashboard shows stale or zero portfolio data without user awareness; critical data source failure goes unnoticed
- Fix approach: Add explicit warning log at startup, implement health check endpoint that reports sync status, consider sending alerts to dashboard via WebSocket
Broad Exception Handling:
- Issue: Multiple services catch bare
Exceptionwithout specific error classification - Files:
services/sentiment_analyzer/main.py(lines 150, 207, 226),services/trade_executor/main.py(lines 136, 203, 222),services/signal_generator/main.py(lines 87, 178, 194, 257),services/learning_engine/main.py(lines 304),services/market_data/main.py(lines 108, 158, 253),services/api_gateway/tasks/portfolio_sync.py(lines 152) - Impact: Errors silently logged without retry logic, potential data loss, operational issues invisible to monitoring
- Fix approach: Implement specific exception handling (IntegrityError, APIError, ValidationError, etc.), add telemetry counters for error categories, implement exponential backoff for transient failures
Implicit Database Persistence Fallback:
- Issue: Multiple services (sentiment-analyzer, trade-executor) silently continue if DB initialization fails
- Files:
services/sentiment_analyzer/main.py(lines 203-208),services/trade_executor/main.py(lines 199-204),services/signal_generator/main.py(initial DB setup) - Impact: Services run without persistence layer — no audit trail, no dashboard data, potential duplicate processing if service restarts
- Fix approach: Fail fast on DB connection errors, or implement persistent message queue (file-based) as fallback to ensure no trades are lost
Learning Engine Position State In Redis Only:
- Issue: Opening trades stored in Redis
positions:historyhash with no persistence — loss of intermediate state if service crashes - Files:
services/learning_engine/main.py(lines 65-82, 121-138) - Impact: Position close detection fails after service restart; partial trades with no close detected = unrealistic P&L
- Fix approach: Move position history to database with persistent transaction log, implement consistent read-modify-write with transactions
Known Bugs
Portfolio Sync N+1 Query Pattern:
- Symptoms: One SELECT per position during portfolio sync; if 100 positions, 100 database queries per cycle
- Files:
services/api_gateway/tasks/portfolio_sync.py(lines 74-92) - Trigger: Any portfolio with multiple open positions
- Workaround: None; will degrade performance as position count grows
- Fix: Use
SELECT * WHERE ticker IN (...)to fetch all positions in one query, then update in-memory dict
Daily P&L Calculation Bug:
- Symptoms: Portfolio endpoint returns same value for
daily_pnl_pctandtotal_pnl_pct - Files:
services/api_gateway/routes/portfolio.py(lines 90-91) - Trigger: Every portfolio request
- Root cause: Code reuses
daily_pnl_pctfortotal_pnl_pctinstead of computing cumulative metric - Workaround: Manually inspect database portfolio_snapshots for correct metrics
- Fix: Compute cumulative P&L from first snapshot timestamp to now
Position Unrealized P&L Division by Zero:
- Symptoms: Potential division by zero when
p.qty == 0(closed position) - Files:
services/api_gateway/routes/portfolio.py(lines 116-117, 120-121) - Trigger: When position is exactly 0 shares (edge case in reconciliation)
- Workaround: Guards in place (
if p.qty) but logic is error-prone - Fix: Refactor to use
if p.qty and p.avg_entryguard early, simplify arithmetic
Signal Sentiment Context Missing Price Field:
- Symptoms: Risk manager reads
sentiment_context["current_price"]but signal may not include it - Files:
services/trade_executor/risk_manager.py(lines 118-120) - Trigger: If signal is generated without market snapshot or price is omitted
- Workaround: Falls back to 0.0, position sizing becomes incorrect
- Fix: Ensure signal generator always includes current_price in sentiment_context, add assertion in executor
Security Considerations
JWT Secret Key Not Enforced in Development:
- Risk: Default development deployments may use weak or missing JWT secret keys
- Files:
services/api_gateway/config.py(lines 13-15),services/api_gateway/auth/jwt.py(lines 41, 96) - Current mitigation: Config requires
jwt_secret_keybut does not validate length/entropy - Recommendations:
- Add validation in ApiGatewayConfig that secret is >= 32 bytes
- Fail startup if secret is not set (don't use default)
- Add warning log if secret is weak (< 256 bits)
- Document minimum secret requirements in README
Test Keys Insufficient for Production:
- Risk: Test suites use hardcoded short keys that would be cryptographically weak in production
- Files: Multiple test files use test keys like
"test-secret-key" - Current mitigation: Only in test environment
- Recommendations:
- Use pytest fixture to generate strong keys per test
- Document that tests should not be deployed as-is
- Consider environment variable override in CI
CORS Configuration Hardcoded for Localhost:
- Risk: Default CORS allows only
http://localhost:5173, but production deployment requires environment override - Files:
services/api_gateway/config.py(line 26) - Current mitigation: CORS is at least restricted (not
*), but easy to miss in production deploy - Recommendations:
- Make CORS_ORIGINS a required config variable (no default)
- Add validation that origins use HTTPS in production
- Log warning if origin is localhost in production environment
Passkey Registration Not Rate-Limited:
- Risk: WebAuthn registration endpoint (if present) has no rate limiting, allowing brute-force attacks
- Files: Not fully reviewed; check
services/api_gateway/auth/routes.py - Current mitigation: Unknown
- Recommendations:
- Add rate limiting (redis-backed) on registration endpoints
- Implement CAPTCHA or email verification for new accounts
- Log failed registration attempts for monitoring
No HTTPS Enforcement in API Gateway:
- Risk: JWT tokens transmitted over plaintext HTTP in development config
- Files:
services/api_gateway/config.py(line 31 useshttp://localhost:5173) - Current mitigation: Only affects localhost development
- Recommendations:
- Add
enforce_httpsconfig flag; reject insecure origins in production - Set
secureflag on JWT cookies - Use HSTS headers in FastAPI
- Add
Performance Bottlenecks
Sentiment Scores Stored in Unbounded Deque:
- Problem: Signal generator stores up to 50 sentiment scores per ticker in memory (never evicted from running process)
- Files:
services/signal_generator/main.py(line 35, lines 107-111) - Cause: deques are bounded but not persisted; high-frequency tickers accumulate memory
- Current: 50 * ~5000 tickers = ~250K floats in memory during high volume
- Improvement path:
- Store sentiment history in database (timeseries table) instead of memory
- Query last N scores on-demand per signal generation
- Add TTL/expiration for old scores
Market Data Manager Holds Full OHLCV History:
- Problem:
MarketDataManagerbuffers all bars received for each ticker in memory - Files:
services/signal_generator/main.py(referenced from main) - Cause: No bounded retention; running service accumulates bars indefinitely
- Impact: Memory leak over weeks of operation
- Improvement path:
- Implement sliding window (keep last 100 bars only)
- Persist bars to database for strategy backtesting
- Use TimescaleDB hypertable compression for efficient historical queries
Portfolio Sync Deletes and Re-inserts All Positions:
- Problem: Every sync cycle deletes all local positions and re-inserts from broker
- Files:
services/api_gateway/tasks/portfolio_sync.py(lines 94-101) - Cause: Simplistic approach; doesn't diff changes
- Impact: Every 60s delete cascade on Position table (triggers cascade on related trades?)
- Improvement path:
- Implement diff-based upsert: only update changed tickers
- Delete positions not in broker snapshot
- Reduces churn, avoids potential foreign key cascade issues
No Connection Pooling Tuning:
- Problem: AsyncPG default pool size may be insufficient for 7 services + API load
- Files:
shared/db.py(line 13-16) - Cause:
create_async_engineuses default pool_size - Impact: Connection starvation under load, query timeouts
- Improvement path:
- Add
pool_sizeandmax_overflowconfig to BaseConfig - Monitor pool utilization (add telemetry)
- Right-size pools per environment
- Add
Alpaca API Calls Not Cached:
- Problem: Every trade executor and portfolio sync makes synchronous Alpaca API calls for account/positions
- Files:
services/trade_executor/main.py(line 74),services/api_gateway/tasks/portfolio_sync.py(line 64) - Cause: No request deduplication or caching layer
- Impact: Slow order submission, API rate limit risks
- Improvement path:
- Cache account snapshot in Redis (TTL 5s)
- Batch position queries where possible
- Implement exponential backoff for API failures
Fragile Areas
Sentiment Analyzer Ollama Fallback Untested:
- Files:
services/sentiment_analyzer/analyzers/ollama_analyzer.py,services/sentiment_analyzer/main.py(lines 71-79) - Why fragile: Fallback path (Ollama) rarely exercised if FinBERT always has high confidence
- Safe modification: Add integration test that mocks FinBERT confidence < threshold, verifies Ollama called
- Test coverage: No explicit test for fallback path
- Impact: If FinBERT fails completely, Ollama fallback could silently degrade or crash
Signal Generator Ensemble Evaluation:
- Files:
services/signal_generator/main.py(line 149),services/signal_generator/ensemble.py - Why fragile: Three strategies weighted by Redis weights; if weights become NaN or negative, ensemble breaks silently
- Safe modification: Add assertions that weights are valid (sum~1, all in [0,1]), implement weight validation on load
- Test coverage: No explicit test for weight anomalies (e.g., all weights corrupted)
- Impact: Invalid weights could produce signals with undefined strength
Risk Manager Market Hours Check Time Zone:
- Files:
services/trade_executor/risk_manager.py(lines 19-25, 59-61) - Why fragile: Hard-coded ET timezone; DST transitions not handled explicitly
- Safe modification: Use ZoneInfo with explicit DST handling (Python 3.9+), add tests for DST boundaries
- Test coverage: No explicit DST test
- Impact: Orders placed 1 hour off during DST transition
Learning Engine Position Matching Logic:
- Files:
services/learning_engine/main.py(lines 85-98, 121-138) - Why fragile: String comparison of side enum (
opening_side == OrderSide.BUY.value) - Safe modification: Use enum comparison directly, add type hints, add unit tests for all side combinations
- Test coverage: Tests exist but may not cover edge cases (e.g., SELL -> BUY -> SELL)
- Impact: Position close detection fails, P&L not recorded, weights never adjusted
WebAuthn Credential Validation:
- Files:
services/api_gateway/auth/routes.py(check WebAuthn implementation) - Why fragile: WebAuthn implementation may not validate all required fields (origin, user verification, etc.)
- Safe modification: Review against OWASP WebAuthn checklist, add explicit assertion tests
- Test coverage: Limited test coverage for security properties
- Impact: Weak authentication allows account takeover
Scaling Limits
Redis Single Node No Persistence:
- Current capacity: In-memory Redis instance; loses all data on restart
- Limit: All stream data (news, signals, trades) lost if Redis crashes
- Blocks: Reliable message delivery, audit trails, service recovery
- Scaling path:
- Enable AOF (append-only file) persistence
- Consider Redis Cluster for HA
- Alternatively, add database-backed message queue (e.g., TimescaleDB FIFO table)
PostgreSQL Shared Memory Not Tuned:
- Current capacity: Default shared_buffers, work_mem, etc.
- Limit: Query performance degrades with 7 concurrent services
- Scaling path:
- Tune PostgreSQL config: shared_buffers, work_mem, effective_cache_size per machine
- Add read replicas for reporting queries
- Partition portfolio snapshots by date (TimescaleDB hypertables)
Dashboard WebSocket No Backpressure:
- Current capacity: Single WebSocket connection broadcasts all events
- Limit: Slow clients blocked, queue unbounded memory growth
- Scaling path:
- Implement client subscription topics (only send relevant tickers)
- Add message batching (send updates every 100ms)
- Implement client-side connection pooling
No Circuit Breaker for Alpaca API:
- Current capacity: Alpaca API calls block entire trade executor on timeout
- Limit: One slow Alpaca response stalls all signal processing
- Scaling path:
- Implement circuit breaker pattern (fail-open after N failures)
- Add request timeout with retry logic
- Queue unprocessable signals for retry after recovery
Dependencies at Risk
Ollama Local LLM Not Containerized with Stack:
- Risk: Ollama runs separately; must be manually started for sentiment analyzer to work
- Files:
docker-compose.yml(check if Ollama service present) - Impact: Sentiment analyzer fails silently if Ollama not running; fallback depends on FinBERT
- Migration plan: Add Ollama to docker-compose.yml, or make Ollama truly optional with better error handling
FinBERT Model Download Not Cached:
- Risk: FinBERT downloads model from HuggingFace on first run (~600MB)
- Files:
services/sentiment_analyzer/analyzers/finbert.py - Impact: Long startup time, may timeout in resource-constrained environments
- Migration plan: Pre-download model into Docker image, add model cache invalidation strategy
Alpaca SDK Direct HTTP Calls Not Mocked in Integration Tests:
- Risk: Integration tests mock Alpaca SDK, so real behavior never tested
- Files:
tests/services/test_trade_executor.py(check mocks),tests/integration/ - Impact: Order submission bugs discovered only in production
- Migration plan: Implement live integration tests against Alpaca paper trading, verify in CI/CD
SQLAlchemy 2.0 Async Mode Requires Specific Dialect:
- Risk: AsyncPG optional dependency; falls back to sync driver if missing
- Files:
shared/db.py(line 13 usespostgresql+asyncpg://) - Impact: If asyncpg not installed, services run in sync mode, defeating async benefits
- Migration plan: Add explicit check in
create_db()to fail if asyncpg missing
Missing Critical Features
No Duplicate Trade Detection:
- Problem: If signal generator processes same article twice (due to Redis consumer group lag), executor may place duplicate trades
- Blocks: Multi-service reliability, audit accuracy
- Scope: Implement idempotency key in TradeExecution, check signal_id before placing order
No Circuit Break / Trading Pause Mechanism:
- Problem: If models degrade, bot continues trading losses
- Blocks: Risk mitigation, manual emergency stop
- Current:
TRADING_PAUSED_KEYin Redis (services/api_gateway/routes/controls.py), but not enforced in executor - Scope: Check executor consumes trading pause flag before submitting orders
No Backtester Trade Slippage Model:
- Problem: Backtester assumes fills at exact signal price; unrealistic
- Blocks: Accurate backtest results, risk estimation
- Scope: Add configurable slippage model (fixed basis points + volume-based)
No Strategy Parameter Sensitivity Analysis:
- Problem: Can't understand which parameters drive P&L
- Blocks: Scientific strategy optimization, risk understanding
- Scope: Implement parameter sweep in backtester, record sensitivity metrics
No Live Trade Reconciliation:
- Problem: If Alpaca and local DB diverge, no auto-correction
- Blocks: Accurate position tracking, tax reporting
- Scope: Implement daily reconciliation job that compares broker vs. DB, alerts on mismatch
Test Coverage Gaps
Dashboard Components Not Unit Tested:
- What's not tested: React components (portfolio display, trade table, charts, alerts)
- Files:
dashboard/src/components/,dashboard/src/pages/ - Risk: Component bugs, broken UI layouts only caught in E2E (if at all)
- Priority: High — dashboard is user-facing; regressions impact usability
API Routes Happy Path Only:
- What's not tested: Edge cases (missing fields, invalid enum values, concurrent requests)
- Files:
tests/services/test_api_routes.py(13 tests covering basic CRUD) - Risk: API crashes on edge inputs, poor error messages
- Priority: Medium — most users use dashboard, not direct API
Market Data Manager Not Unit Tested:
- What's not tested: Deque overflow behavior, snapshot generation with missing bars
- Files:
services/signal_generator/market_data.py(no test file) - Risk: Ensemble generates signals with incomplete data
- Priority: High — directly impacts signal quality
Learning Engine Weight Adjustment Edge Cases:
- What's not tested: Weight convergence, NaN/inf propagation, simultaneous rewards for same strategy
- Files:
services/learning_engine/weight_adjuster.py(test coverage unknown) - Risk: Weights diverge to extremes, strategies disabled unintentionally
- Priority: High — learning is core feature
WebAuthn Registration and Verification:
- What's not tested: Invalid registration data, replay attacks, malformed challenge responses
- Files:
services/api_gateway/auth/routes.py(WebAuthn handlers) - Risk: Authentication bypass, account takeover
- Priority: Critical — security sensitive
Redis Streams Consumer Group Edge Cases:
- What's not tested: Consumer group recovery after crash, message redelivery, offset management
- Files:
shared/redis_streams.py(5 tests total) - Risk: Messages lost or duplicated after service restart
- Priority: High — data reliability depends on it
Concurrent Signal and Trade Execution:
- What's not tested: Race conditions when multiple signals processed simultaneously
- Files: Integration tests mock this with single thread
- Risk: Phantom positions, incorrect risk calculations
- Priority: High — real production has concurrent load
Recommendations for Production Readiness
Before going live, address at minimum:
-
Critical (blocks production):
- Implement portfolio metrics computation (cumulative P&L, max drawdown)
- Fix portfolio sync query performance (N+1 issue)
- Validate JWT secret key entropy at startup
- Add rate limiting to auth endpoints
- Implement circuit breaker for Alpaca API calls
- Add duplicate trade detection (idempotency)
-
High (should address soon):
- Implement trade reconciliation job
- Move position state from Redis to database
- Add specific exception handling (not bare Exception)
- Implement backpressure in WebSocket broadcasting
- Test Ollama fallback path end-to-end
- Add comprehensive dashboard unit tests
-
Medium (can defer):
- Optimize portfolio sync with diff-based approach
- Persist market data and sentiment to TimescaleDB
- Tune PostgreSQL connection pool
- Add DST test coverage for market hours
- Implement parameter sensitivity analysis in backtester
Concerns audit: 2026-02-23