infra/.planning/codebase/CONCERNS.md
2026-02-23 20:54:27 +00:00

# Codebase Concerns
**Analysis Date:** 2026-02-23
## Tech Debt
**MySQL Backup Rotation Not Implemented:**
- Issue: Backup rotation logic exists (comment at `stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size noted as 11MB, rotation deferred.
- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206)
- Impact: Backup directory could grow unbounded; no automated retention policy enforced. Manual cleanup required.
- Fix approach: Implement a full rotation schedule using `find -mtime +N`, or migrate to an external backup tool (e.g., Velero; pgBackRest covers only the PostgreSQL side of dbaas, not MySQL). Run the rotation from a CronJob with an explicit retention period (e.g., 14 days).
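
The `find -mtime +N` style of rotation can also be sketched in Python, which makes the retention logic easy to test before it runs from a CronJob. This is a minimal sketch, not the repository's implementation: the `*.sql.gz` file pattern, the function names, and the 14-day default are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from pathlib import Path


def select_expired(backup_dir: Path, retention_days: int, now: datetime) -> list[Path]:
    """Return backup files last modified before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    # Assumption: backups are written as *.sql.gz files directly in backup_dir.
    for path in sorted(backup_dir.glob("*.sql.gz")):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < cutoff:
            expired.append(path)
    return expired


def prune(backup_dir: Path, retention_days: int = 14, dry_run: bool = True) -> list[Path]:
    """Delete expired backups, or only report them when dry_run is True."""
    expired = select_expired(backup_dir, retention_days, datetime.now())
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Calling `prune(some_dir, dry_run=True)` first lists what would be removed without deleting anything, which is a cautious default for a job that destroys data.
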
**PostgreSQL Major Version Upgrade Blocked:**
- Issue: Comment at `stacks/platform/modules/dbaas/main.tf:718` indicates that moving to PostgreSQL 17.2 requires running `pg_upgrade` on the data directory, which is not implemented.
- Files: `stacks/platform/modules/dbaas/main.tf` (line 718)
- Impact: PostgreSQL cannot be upgraded from the current `16-master` image to 17.2. When the upgrade is needed, it requires a manual `pg_upgrade` procedure with high downtime risk.
- Fix approach: Implement the upgrade as a one-off Job or StatefulSet init container that runs `pg_upgrade` in place (a recurring CronJob is a poor fit for a one-time migration). `pg_upgrade` needs both the old and the new server binaries available. Test the migration path against a restored backup first.
**TP-Link Gateway Reverse Proxy Not Functional:**
- Issue: Reverse proxy module for TP-Link gateway marked as "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`.
- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95)
- Impact: HTTP/HTTPS access to the gateway through the reverse proxy is non-functional. The impact on dependent services is unknown.
- Fix approach: Either complete reverse proxy implementation (Traefik/Nginx config) or document why it's disabled. Clarify if gateway is still accessible via HTTP or LAN-only.
**WireGuard Firewall Rules Incomplete:**
- Issue: Client firewall restrictions not written at `terraform.tfvars:430`. Only placeholder exists.
- Files: `terraform.tfvars` (lines 430-434)
- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). Unless firewall rules are enforced at the kernel level, every connected client can reach the cluster APIs.
- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"). Test with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist.
## Known Bugs & Issues
**Immich Database Compatibility Mismatch:**
- Symptoms: Custom PostgreSQL image version mismatch between Immich PostgreSQL pod and dbaas PostgreSQL. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is 16-master with PostGIS/PgVector mix.
- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717)
- Trigger: If Immich database is migrated to shared dbaas PostgreSQL, extension version incompatibility will cause failures.
- Workaround: Keep Immich on isolated PostgreSQL 15 with Immich-specific extensions. If consolidation needed, test extension compatibility first.
**Realestate-Crawler Latest Image Tag Ignores Updates:**
- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`.
- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82)
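- Trigger: Because of `ignore_changes`, Terraform never updates the image field, and a mutable `:latest` tag gives it nothing to diff in any case. A new `immoweb:latest` reaches the cluster only when the pod is recreated and the image is pulled again, which currently has to be triggered manually.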
- Workaround: Use Diun annotations to track image updates. Consider using version-pinned tags instead of `:latest`. Remove `ignore_changes` if auto-updates desired.
## Security Considerations
**OpenClaw Has Cluster-Admin Permissions:**
- Risk: OpenClaw ServiceAccount granted unrestricted `cluster-admin` ClusterRoleBinding at `stacks/openclaw/main.tf:41-54`.
- Files: `stacks/openclaw/main.tf` (lines 34-55)
- Current mitigation: None effective. `dangerouslyDisableDeviceAuth = true` in config (line 89) turns device auth off, so protection relies entirely on network access control.
- Recommendations:
  - Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g., `get/list/watch pods`, `exec into pods`, `apply resources in specific namespaces`).
  - Re-enable device auth or implement mTLS between OpenClaw and operators.
  - Audit OpenClaw logs for unauthorized API calls (enable API server audit logs).
**Git-Crypt Key Mounted as ConfigMap:**
- Risk: git-crypt key at `stacks/openclaw/main.tf:68-76` is stored as plain-text ConfigMap data. Any user or ServiceAccount with read access to ConfigMaps in that namespace can retrieve it, and ConfigMaps are conventionally treated as non-sensitive, so read access is often granted far more broadly than for Secrets.
- Files: `stacks/openclaw/main.tf` (lines 68-76)
- Current mitigation: None; nothing beyond ordinary namespace RBAC restricts reads of the ConfigMap.
- Recommendations:
  - Move git-crypt key to Kubernetes Secret instead of ConfigMap.
  - Add RBAC policy restricting reads of that Secret to the openclaw namespace only.
  - Consider external secret management (e.g., Sealed Secrets, or an external secret store synced into the cluster).
**SSH Private Key Stored as Secret:**
- Risk: SSH private key for OpenClaw stored at `stacks/openclaw/main.tf:57-66` as unencrypted Secret. Readable by any pod with secret access.
- Files: `stacks/openclaw/main.tf` (lines 57-66)
- Current mitigation: Secret only readable by openclaw namespace (if RBAC enforced); encryption at rest not confirmed.
- Recommendations:
  - Rotate SSH key regularly; consider using ed25519 keys (shorter, stronger).
  - Audit Secret access via Kubernetes audit logs.
  - Use external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.
**WireGuard VPN Clients Unrestricted:**
- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules defined. No per-client segmentation.
- Files: `terraform.tfvars` (lines 430-434)
- Current mitigation: None; the attempted iptables rules are commented out and not enforced.
- Recommendations:
  - Define explicit client restrictions in WireGuard firewall script (uncomment/complete lines 433-434).
  - Implement deny-by-default firewall (drop all, then allow specific routes).
  - Consider separate WireGuard interfaces for different trust levels (admin vs. guest).
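
A deny-by-default rule set per client can be generated and checked before it is written into the firewall script. The sketch below is illustrative only: the `wg0` interface name, the client address, and the helper names are hypothetical, and only the `10.32.0.0/12` range comes from this document. It does not handle return traffic for established connections, which a real rule set would need.

```python
import ipaddress

# Cluster-internal range named in this document.
CLUSTER_CIDR = ipaddress.ip_network("10.32.0.0/12")


def forward_rules(client_ip: str, allowed: list[str], iface: str = "wg0") -> list[str]:
    """Build FORWARD rules for one VPN client: explicit allows, then a catch-all drop."""
    src = ipaddress.ip_address(client_ip)  # raises ValueError on a malformed address
    rules = []
    for dest in allowed:
        net = ipaddress.ip_network(dest)  # raises ValueError on a malformed CIDR
        rules.append(f"iptables -A FORWARD -i {iface} -s {src} -d {net} -j ACCEPT")
    # Anything from this client that was not explicitly allowed is dropped.
    rules.append(f"iptables -A FORWARD -i {iface} -s {src} -j DROP")
    return rules


def reaches_cluster(allowed: list[str]) -> bool:
    """True if any allowed destination overlaps the cluster-internal range."""
    return any(ipaddress.ip_network(d).overlaps(CLUSTER_CIDR) for d in allowed)
```

`reaches_cluster` is meant as a review aid: a guest-level client whose allow list returns `True` is probably misconfigured.
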
**Multiple `:latest` Image Tags in Production:**
- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others).
- Files: Multiple stacks (full list not reproduced here; it can be regenerated by searching `stacks/` for `:latest`).
- Current mitigation: Diun annotations track updates but don't auto-pull. `:latest` is a mutable tag, so the image actually running can differ between pulls and cannot be traced back to a version.
- Recommendations:
  - Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`).
  - Use Diun to track new releases and trigger automated testing in staging.
  - Update CI/CD pipeline to require version tags for production deployments.
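
A CI check for floating tags could look roughly like the following. It is a sketch under assumptions: it only recognizes literal `image = "..."` assignments in `.tf` files, the function names are made up for illustration, and an image built entirely from a variable interpolation (for example `"${var.image}"`) would be reported as untagged.

```python
import re
from pathlib import Path

# Matches literal assignments such as: image = "redis/redis-stack:latest"
IMAGE_RE = re.compile(r'image\s*=\s*"([^"]+)"')


def is_floating(image: str) -> bool:
    """True when the reference has no tag or uses :latest."""
    # Look only at the last path segment so a registry port is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if "@" in name:  # digest-pinned reference
        return False
    if ":" not in name:  # no tag resolves to :latest implicitly
        return True
    return name.rsplit(":", 1)[1] == "latest"


def find_floating_images(root: Path) -> list[tuple[str, int, str]]:
    """Return (file, line number, image) for every floating image under root."""
    results = []
    for tf in sorted(root.rglob("*.tf")):
        for lineno, line in enumerate(tf.read_text().splitlines(), start=1):
            match = IMAGE_RE.search(line)
            if match and is_floating(match.group(1)):
                results.append((str(tf), lineno, match.group(1)))
    return results
```

A pipeline step could fail the build whenever `find_floating_images(Path("stacks"))` returns a non-empty list.
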
## Performance Bottlenecks
**Insufficient Health Probes on Critical Services:**
- Problem: Only 14 services have liveness/readiness probes out of 70+ services. Missing probes on databases (MySQL, PostgreSQL, Redis), ingress, auth.
- Files: All stacks (identified via grep: 14 instances of liveness/readiness out of 70+ services).
- Cause: Without probes, Kubernetes restarts a container only when its process exits. A hung or degraded service keeps receiving traffic, so cascading failures go unnoticed.
- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers). Use TCP/HTTP probes appropriate to each service.
**Pod Disruption Budgets Missing:**
- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
- Files: All stacks (need comprehensive PodDisruptionBudget coverage).
- Cause: PDBs are optional; many assume single-replica stateless services won't need them.
- Improvement path: Add PDB with `minAvailable: 1` to all services with `replicas > 1`. For single-replica services, ensure they're marked as non-critical (lower PriorityClass) or accept downtime during node maintenance.
**Resource Requests Sparse, Limits Missing:**
- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults but CPU limits often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`).
- Cause: Request/limit tuning requires load testing; defaults used instead.
- Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage, limits to P99. Monitor via Prometheus and adjust quarterly.
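
The "requests at P50, limits at P99" rule can be turned into a small calculation over usage samples exported from Prometheus. The sketch below uses the nearest-rank percentile definition and hypothetical sample values; the function names and the millicore formatting are illustrative assumptions, not part of the existing tooling.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def suggest_cpu(cpu_millicores: list[float]) -> dict[str, str]:
    """Map observed CPU usage (in millicores) to a request (P50) and a limit (P99)."""
    return {
        "request": f"{round(percentile(cpu_millicores, 50))}m",
        "limit": f"{round(percentile(cpu_millicores, 99))}m",
    }


# Hypothetical samples: mostly idle with one inference burst.
print(suggest_cpu([100, 200, 300, 400, 1000]))
```

With few samples the P99 is simply the maximum observed value, so the suggestion is only as good as the load test that produced the data.
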
**Large Terraform Modules (900+ lines):**
- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others > 450 lines.
- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines).
- Cause: Monolithic resource definitions; hard to navigate and test.
- Improvement path: Split large modules into sub-modules (e.g., `dbaas/` into `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration.
## Fragile Areas
**Immich Machine Learning GPU Dependency:**
- Files: `stacks/immich/main.tf` (lines 380-450).
- Why fragile: GPU workload (`immich-machine-learning-cuda`) requires Tesla T4 on k8s-node1. If GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently (no fallback). Single GPU point of failure.
- Safe modification: Add `nodeAffinity` to prefer GPU but allow non-GPU fallback (degraded mode). Implement health checks on GPU availability (`nvidia-smi` probe). Test GPU failure scenario before production use.
- Test coverage: No tests for GPU unavailability; assumes GPU always available.
**Nextcloud Backup/Restore Procedures Manual:**
- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps).
- Why fragile: Backup/restore scripts are ConfigMap-based; no automation. Restoration requires manual `kubectl exec` and script execution. No tested recovery procedure.
- Safe modification: Implement automated backup via Velero or CSI snapshots. Test restore procedure monthly via staged environment.
- Test coverage: No automated backup validation; scripts untested.
**NFS Dependency for Data Persistence:**
- Files: 126 references to NFS volumes across all stacks.
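- Why fragile: All stateful data depends on the NFS server at `10.0.10.15`. If NFS becomes unavailable, every service immediately loses access to its data (no local caches) and I/O hangs or fails. No fallback storage.
- Safe modification: The NFS mount options `ac,acregmin=3600` cache only file attributes, not file data, so they do not help during a server outage. For client-side data caching, evaluate FS-Cache (the `fsc` mount option with `cachefilesd`); it speeds up reads but still does not keep services running while the server is down. Monitor NFS availability via Prometheus alerts (mount point offline). Test the NFS failover procedure (if a replica NFS exists).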
- Test coverage: No chaos engineering tests for NFS unavailability.
**Istio Injection Disabled Cluster-Wide:**
- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` on namespace labels.
- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production.
- Test coverage: No mTLS validation; assumes all pods on same network are trusted.
**PostgreSQL Custom Image Not Tracked:**
- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`.
- Why fragile: Custom build on Docker Hub with PostGIS + PgVector extensions. The `16-master` tag carries no release version and is mutable. Upstream extension versions unknown.
- Safe modification: Pin to semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with Dockerfile tracked in git. Test extension versions against application requirements.
- Test coverage: No tests for extension availability or version compatibility.
## Scaling Limits
**Single-Replica Critical Services:**
- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
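- Limit: Node failure causes immediate service outage. With default settings, Kubernetes waits roughly 5 minutes (the default 300-second toleration for not-ready/unreachable nodes, plus detection time) before evicting and rescheduling the pod.
- Scaling path: Increase replicas of critical stateless services to 3 and add pod anti-affinity to spread them across nodes. PostgreSQL and Redis need replication with failover configured before additional replicas help. Implement PodDisruptionBudget with `minAvailable: 2`.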
**GPU Capacity Bottleneck:**
- Current capacity: 1 Tesla T4 GPU on k8s-node1.
- Limit: Immich ML + Ebook2Audiobook + Ollama all compete for single GPU. Queue time 10+ minutes for GPU-bound inference tasks.
- Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).
**NFS Storage Capacity:**
- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
- Limit: Immich (image library), Calibre (ebooks), Dawarich (location history) grow unbounded. When storage full, writes fail; services degrade.
- Scaling path: Monitor NFS capacity monthly (`df -h`). Set up Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB Immich/month).
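
The 80% threshold and the growth-based planning above reduce to simple arithmetic. The following sketch shows it with hypothetical figures (a 4 TB pool with 3 TB used, growing 100 GB per month); the real pool size is unknown, and the function names are illustrative. `shutil.disk_usage` is from the Python standard library.

```python
import shutil

GB = 10**9
TB = 10**12


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Fraction of capacity currently in use."""
    return used_bytes / total_bytes


def months_until_full(total_bytes: int, used_bytes: int, growth_bytes_per_month: int) -> float:
    """Months of headroom left at a constant monthly growth rate."""
    if growth_bytes_per_month <= 0:
        raise ValueError("growth must be positive")
    return max(total_bytes - used_bytes, 0) / growth_bytes_per_month


def over_threshold(path: str, threshold: float = 0.80) -> bool:
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return used_fraction(usage.total, usage.used) >= threshold


# Hypothetical numbers: 4 TB pool, 3 TB used, 100 GB/month growth.
print(used_fraction(4 * TB, 3 * TB))
print(months_until_full(4 * TB, 3 * TB, 100 * GB))
```

Under those assumed numbers the pool is 75% full with ten months of headroom, so the 80% alert would fire about two months from now, well before writes start failing.
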
**MySQL/PostgreSQL Connection Pool:**
- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling for PostgreSQL only; no pooler is documented for MySQL. The configured pool size was not verified (PgBouncer's stock defaults are `default_pool_size = 20` and `max_client_conn = 100`).
- Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed pool. New connections queue or fail.
- Scaling path: Monitor PgBouncer pool utilization (e.g., Prometheus metric `pgbouncer_pools_used_connections`; the exact metric name depends on the exporter in use). Increase pool size if utilization exceeds 80%. Consider read replicas for read-heavy workloads.
**API Rate Limiting & Bandwidth:**
- Current capacity: Services exposed via Traefik ingress. No global rate limiting documented.
- Limit: External tools (Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior possible.
- Scaling path: Implement Traefik rate limiting middleware (Prometheus-aware). Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
## Dependencies at Risk
**Redis Stack `:latest` Tag:**
- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed; breaking changes possible.
- Impact: Unexpected version upgrade could introduce incompatibilities with clients expecting specific command set or module versions.
- Migration plan: Pin to a specific stable Redis Stack release tag instead of `:latest`, and avoid release-candidate tags such as `:7.2-rc1`. Test version upgrades in staging first. Monitor Redis logs for deprecated command warnings.
**Immich Version Pinning and Upgrades:**
- Risk: `stacks/immich/main.tf` is pinned to `v2.5.6` (not `:latest`), but Immich releases frequently and upgrades can include database migrations that cause downtime.
- Impact: If Immich version upgrades without testing, database migrations could fail or hang (no rollback mechanism).
- Migration plan: Keep pinning to full patch versions (e.g., `v2.5.6`, not `v2.5`). Test each Immich upgrade in staging first and take a backup before upgrading.
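**MySQL 9.2.0 Innovation Release:**
- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.2 is an Innovation release, not a release candidate: it is generally available, but it receives fixes only until the next Innovation release, and newer 9.x releases have since superseded it.
- Impact: No further bug or security fixes for 9.2, so unpatched CVEs accumulate. No long-term support.
- Migration plan: Move to an LTS series (MySQL 8.4 LTS) or commit to tracking each new Innovation release. Going from 9.2 to 8.4 is a downgrade, so plan for a logical dump and restore rather than an in-place change. Test data migration first. Plan for gradual rollout.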
**Python Timeouts in Monitoring Scripts:**
- Risk: `stacks/platform/modules/nvidia/main.tf` uses a hardcoded `timeout=10` for HTTP requests and subprocess calls. Under slow network conditions these calls time out and fail.
- Impact: GPU monitoring will fail if network is slow or unavailable. Silent failures possible.
- Migration plan: Implement exponential backoff and retry logic (e.g., `tenacity` library). Increase timeout to 30s for unreliable networks. Log timeouts for debugging.
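
The retry-with-backoff idea can be shown with the standard library alone; the `tenacity` library mentioned above offers the same behaviour through decorators. This is a sketch, not the monitoring script's actual code: the function name, the defaults, and the injectable `sleep` parameter (which exists so the logic can be tested without waiting) are all illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error instead of hiding it
            delay = min(base_delay * 2**attempt, max_delay)
            # Log every timeout so failures are visible rather than silent.
            print(f"attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

Re-raising the final exception keeps a persistent outage visible to whatever supervises the script, which addresses the silent-failure concern.
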
## Missing Critical Features
**No Disaster Recovery Plan:**
- Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster.
- Blocks: If cluster data lost, recovery would be manual and time-consuming. No RTO/RPO defined.
- Impact: Risk of data loss; recovery is estimated to take more than 24 hours.
**No Secrets Rotation Policy:**
- Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule.
- Blocks: If key leaked, manual intervention required to rotate across all services.
- Impact: Leaked credentials persist until discovery.
**No Cross-Cluster Failover:**
- Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
- Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage.
- Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision).
## Test Coverage Gaps
**No Infrastructure Testing:**
- What's not tested: Terraform applies, Helm charts, manifests only validated via `terraform plan`. No `terratest`, no functional tests of deployed services.
- Files: All stacks (no test files found).
- Risk: Typos, variable misconfigurations, missing dependencies not caught until production apply.
- Priority: High — add `terratest` to validate Terraform. Test critical paths (database connection, ingress routing).
**No Chaos Engineering Tests:**
- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
- Files: All stacks (no chaos tests found).
- Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested.
- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.
**No Backup Restoration Tests:**
- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but never executed.
- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`.
- Risk: Backups may prove corrupt or unusable when needed. Effective RPO exceeds 24 hours if the corruption is discovered late.
- Priority: High — monthly restore-to-staging test. Automate backup validation.
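
As a first, cheap layer of automated backup validation, a job can at least confirm that each dump decompresses cleanly and is not empty. The sketch below assumes the dumps are gzip-compressed, which this document does not confirm, and the function name is hypothetical. It does not replace a real restore-to-staging test, since a file can decompress correctly and still contain an incomplete dump.

```python
import gzip
import zlib
from pathlib import Path


def is_valid_gzip_dump(path: Path, min_bytes: int = 1) -> bool:
    """True if the file decompresses fully and yields at least min_bytes of data."""
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            # Read in 1 MiB chunks so large dumps are not loaded into memory at once.
            while chunk := fh.read(1 << 20):
                total += len(chunk)
    except (OSError, EOFError, zlib.error):
        # OSError covers non-gzip input; EOFError covers truncated files.
        return False
    return total >= min_bytes
```
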
**No Security Scanning for Vulnerabilities:**
- What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
- Files: All stacks, all container images.
- Risk: Known vulnerabilities deployed to production. No supply chain security.
- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.
---
*Concerns audit: 2026-02-23*