infra/.planning/codebase/CONCERNS.md

Codebase Concerns

Analysis Date: 2026-02-23

Tech Debt

MySQL Backup Rotation Not Implemented:

  • Issue: Backup rotation logic exists (comment at stacks/platform/modules/dbaas/main.tf:196) but is incomplete. Backup size noted as 11MB, rotation deferred.
  • Files: stacks/platform/modules/dbaas/main.tf (lines 196-206)
  • Impact: Backup directory could grow unbounded; no automated retention policy enforced. Manual cleanup required.
  • Fix approach: Implement a full rotation schedule using find -mtime +N or migrate to an external backup solution (Velero, pgBackRest). Set up a CronJob with a defined retention window (e.g., 14 days); a sketch follows this item.
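
A minimal sketch of such a retention CronJob, assuming the dbaas namespace and that dumps land on the NFS share under a path like /mnt/main/mysql-backups (the path and filename pattern are assumptions to adjust to the existing backup script):

```hcl
resource "kubernetes_cron_job_v1" "mysql_backup_rotation" {
  metadata {
    name      = "mysql-backup-rotation"
    namespace = "dbaas"
  }

  spec {
    schedule           = "0 3 * * *" # daily, after the backup job has run
    concurrency_policy = "Forbid"

    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"

            container {
              name  = "rotate"
              image = "busybox:1.36"
              # Delete backups older than 14 days; the glob is a placeholder for
              # whatever naming scheme the backup script actually uses.
              command = ["/bin/sh", "-c", "find /backups -type f -name '*.sql.gz' -mtime +14 -print -delete"]

              volume_mount {
                name       = "backups"
                mount_path = "/backups"
              }
            }

            volume {
              name = "backups"
              nfs {
                server = "10.0.10.15"
                path   = "/mnt/main/mysql-backups" # hypothetical backup export path
              }
            }
          }
        }
      }
    }
  }
}
```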

PostgreSQL Major Version Upgrade Blocked:

  • Issue: Comment at stacks/platform/modules/dbaas/main.tf:718 indicates PostgreSQL 17.2 requires pg_upgrade to data directory but is not implemented.
  • Files: stacks/platform/modules/dbaas/main.tf (line 718)
  • Impact: PostgreSQL cannot be upgraded from 16-master to 17.2. When the upgrade is needed, a manual pg_upgrade procedure will be required, with high downtime risk.
  • Fix approach: Implement a one-shot pg_upgrade Job or a StatefulSet init container that performs the in-place upgrade (a hedged sketch follows this item). Test the migration path against a backup first.
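
A hedged sketch of a one-shot upgrade Job, assuming the pairwise tianon/postgres-upgrade image (which bundles both 16 and 17 binaries) and hypothetical PVC names; the existing PostgreSQL workload must be scaled down before running it:

```hcl
resource "kubernetes_job_v1" "postgres_pg_upgrade" {
  metadata {
    name      = "postgres-pg-upgrade"
    namespace = "dbaas"
  }

  wait_for_completion = false

  spec {
    backoff_limit = 0
    template {
      metadata {}
      spec {
        restart_policy = "Never"

        container {
          name = "pg-upgrade"
          # Assumption: a 16-to-17 pairwise image exists; verify that required
          # extensions (PostGIS, pgvector) are available in it, or build a custom variant.
          image = "tianon/postgres-upgrade:16-to-17"

          env {
            name  = "PGDATAOLD"
            value = "/var/lib/postgresql/16/data"
          }
          env {
            name  = "PGDATANEW"
            value = "/var/lib/postgresql/17/data"
          }

          volume_mount {
            name       = "data-old"
            mount_path = "/var/lib/postgresql/16/data"
          }
          volume_mount {
            name       = "data-new"
            mount_path = "/var/lib/postgresql/17/data"
          }
        }

        volume {
          name = "data-old"
          persistent_volume_claim {
            claim_name = "postgres-data" # hypothetical: existing 16-master data PVC
          }
        }
        volume {
          name = "data-new"
          persistent_volume_claim {
            claim_name = "postgres-17-data" # hypothetical: new, empty PVC for the upgraded cluster
          }
        }
      }
    }
  }
}
```

Scale the PostgreSQL workload to zero before applying, run the Job, point the new server at the new data directory, and keep the old PVC until the upgrade is verified.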

TP-Link Gateway Reverse Proxy Not Functional:

  • Issue: Reverse proxy module for TP-Link gateway marked as "Not working yet" at stacks/platform/modules/reverse_proxy/main.tf:91.
  • Files: stacks/platform/modules/reverse_proxy/main.tf (lines 91-95)
  • Impact: Reverse-proxied access to the gateway (HTTPS or routed HTTP) is non-functional. The scope of impact on dependent services is unknown.
  • Fix approach: Either complete the reverse proxy implementation (Traefik/Nginx config; a sketch of one approach follows this item) or document why it is disabled. Clarify whether the gateway remains accessible via plain HTTP or LAN only.
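
One possible shape for the missing piece, assuming the gateway should be reachable through the cluster ingress: a selector-less Service backed by manual Endpoints pointing at the gateway's LAN IP. The namespace, IP, and hostname below are placeholders:

```hcl
resource "kubernetes_service" "tplink_gateway" {
  metadata {
    name      = "tplink-gateway"
    namespace = "reverse-proxy"
  }
  spec {
    # No selector: traffic goes to the manually managed Endpoints below.
    port {
      port        = 80
      target_port = 80
    }
  }
}

resource "kubernetes_endpoints" "tplink_gateway" {
  metadata {
    name      = "tplink-gateway" # must match the Service name
    namespace = "reverse-proxy"
  }
  subset {
    address {
      ip = "10.0.20.1" # hypothetical LAN IP of the gateway
    }
    port {
      port = 80
    }
  }
}

resource "kubernetes_ingress_v1" "tplink_gateway" {
  metadata {
    name      = "tplink-gateway"
    namespace = "reverse-proxy"
  }
  spec {
    ingress_class_name = "traefik"
    rule {
      host = "gateway.example.com" # hypothetical internal hostname
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = kubernetes_service.tplink_gateway.metadata[0].name
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}
```

If the gateway only speaks self-signed HTTPS, the backend protocol and certificate handling still need to be addressed in the Traefik configuration.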

WireGuard Firewall Rules Incomplete:

  • Issue: Client firewall restrictions not written at terraform.tfvars:430. Only placeholder exists.
  • Files: terraform.tfvars (lines 430-434)
  • Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can access cluster APIs if firewall rules not enforced at kernel level.
  • Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"); a sketch follows this item. Verify with iptables -L -v. Consider VPN network segmentation if multiple trust levels exist.
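
A minimal sketch of what per-client restrictions could look like, expressed here as wg-quick PostUp/PostDown rules held in a local value; the interface name (wg0) and the DNS-only policy are assumptions, and 10.32.0.0/12 is the cluster-internal range noted above:

```hcl
locals {
  wireguard_firewall_rules = <<-EOT
    # Allow DNS towards the cluster range, then default-deny everything else from the VPN.
    PostUp   = iptables -A FORWARD -i wg0 -d 10.32.0.0/12 -p udp --dport 53 -j ACCEPT
    PostUp   = iptables -A FORWARD -i wg0 -d 10.32.0.0/12 -j DROP
    PostDown = iptables -D FORWARD -i wg0 -d 10.32.0.0/12 -p udp --dport 53 -j ACCEPT
    PostDown = iptables -D FORWARD -i wg0 -d 10.32.0.0/12 -j DROP
  EOT
}
```

Per-client (rather than per-interface) policies need additional rules keyed on each client's /32 WireGuard address, or separate interfaces per trust level as suggested under Security Considerations.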

Known Bugs & Issues

Immich Database Compatibility Mismatch:

  • Symptoms: Custom PostgreSQL image version mismatch between Immich PostgreSQL pod and dbaas PostgreSQL. Immich uses ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0, while dbaas PostgreSQL is 16-master with PostGIS/PgVector mix.
  • Files: stacks/immich/main.tf (lines 76-77, 276), stacks/platform/modules/dbaas/main.tf (line 717)
  • Trigger: If Immich database is migrated to shared dbaas PostgreSQL, extension version incompatibility will cause failures.
  • Workaround: Keep Immich on isolated PostgreSQL 15 with Immich-specific extensions. If consolidation needed, test extension compatibility first.

Realestate-Crawler Latest Image Tag Ignores Updates:

  • Symptoms: realestate-crawler-ui uses image = "viktorbarzin/immoweb:latest" with lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }.
  • Files: stacks/real-estate-crawler/main.tf (lines 64, 79-82)
  • Trigger: New versions of immoweb:latest are never deployed: Terraform ignores image changes, so a manual image pull and pod restart is required.
  • Workaround: Use Diun annotations to track image updates. Consider using version-pinned tags instead of :latest. Remove ignore_changes if auto-updates desired.

Security Considerations

OpenClaw Has Cluster-Admin Permissions:

  • Risk: OpenClaw ServiceAccount granted unrestricted cluster-admin ClusterRoleBinding at stacks/openclaw/main.tf:41-54.
  • Files: stacks/openclaw/main.tf (lines 34-55)
  • Current mitigation: Effectively none; dangerouslyDisableDeviceAuth = true in the config (line 89) disables device auth, leaving only network access control.
  • Recommendations:
    • Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g., get/list/watch pods, exec into pods, apply resources in specific namespaces).
    • Re-enable device auth or implement mTLS between OpenClaw and operators.
    • Audit OpenClaw logs for unauthorized API calls (enable API server audit logs). A namespace-scoped RBAC sketch follows this list.
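
A minimal sketch of namespace-scoped RBAC to replace the cluster-admin binding, assuming an openclaw namespace and ServiceAccount; the verb and resource lists are placeholders to match what skill execution actually needs:

```hcl
resource "kubernetes_role" "openclaw_runner" {
  metadata {
    name      = "openclaw-runner"
    namespace = "openclaw"
  }

  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log"]
    verbs      = ["get", "list", "watch"]
  }

  rule {
    api_groups = [""]
    resources  = ["pods/exec"]
    verbs      = ["create"]
  }
}

resource "kubernetes_role_binding" "openclaw_runner" {
  metadata {
    name      = "openclaw-runner"
    namespace = "openclaw"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.openclaw_runner.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = "openclaw" # assumption: the existing ServiceAccount name
    namespace = "openclaw"
  }
}
```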

Git-Crypt Key Mounted as ConfigMap:

  • Risk: The git-crypt key at stacks/openclaw/main.tf:68-76 is stored as plain-text ConfigMap data. Any workload or user with ConfigMap read access in the namespace can retrieve it, and RBAC rules that restrict only Secrets do not cover it.
  • Files: stacks/openclaw/main.tf (lines 68-76)
  • Current mitigation: None; ConfigMap is world-readable by default.
  • Recommendations:
    • Move git-crypt key to Kubernetes Secret instead of ConfigMap.
    • Add RBAC policy restricting secret read to openclaw namespace only.
    • Consider external secret management (Authentik-backed secret injection, Sealed Secrets). A minimal Secret sketch follows this list.
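
A minimal sketch of the ConfigMap-to-Secret move, assuming the key is available as a local file at apply time (the path is a placeholder):

```hcl
resource "kubernetes_secret" "git_crypt_key" {
  metadata {
    name      = "git-crypt-key"
    namespace = "openclaw"
  }

  type = "Opaque"

  binary_data = {
    # git-crypt keys are binary; filebase64 avoids UTF-8 mangling.
    "git-crypt-key" = filebase64("${path.module}/secrets/git-crypt-key") # hypothetical local path
  }
}
```

The pod then mounts the Secret instead of the ConfigMap; note the value still lands in Terraform state, so state encryption (or an external secret store) remains relevant.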

SSH Private Key Stored as Secret:

  • Risk: SSH private key for OpenClaw stored at stacks/openclaw/main.tf:57-66 as unencrypted Secret. Readable by any pod with secret access.
  • Files: stacks/openclaw/main.tf (lines 57-66)
  • Current mitigation: Secret only readable by openclaw namespace (if RBAC enforced); encryption at rest not confirmed.
  • Recommendations:
    • Rotate SSH key regularly; consider using ed25519 keys (shorter, stronger).
    • Audit Secret access via Kubernetes audit logs.
    • Use external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.

WireGuard VPN Clients Unrestricted:

  • Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules defined. No per-client segmentation.
  • Files: terraform.tfvars (lines 430-434)
  • Current mitigation: Attempted iptables rules commented out; not enforced.
  • Recommendations:
    • Define explicit client restrictions in the WireGuard firewall script (uncomment/complete lines 433-434; see the iptables sketch under "WireGuard Firewall Rules Incomplete" above).
    • Implement deny-by-default firewall (drop all, then allow specific routes).
    • Consider separate WireGuard interfaces for different trust levels (admin vs. guest).

Multiple :latest Image Tags in Production:

  • Risk: 17 services use :latest tags (e.g., nextcloud, kms, calibre, speedtest, rybbit, wealthfolio, cyberchef, coturn, immich-frame, health, others).
  • Files: Multiple stacks (identified via grep for :latest across stacks/).
  • Current mitigation: Diun annotations track updates but do not auto-pull; running pods stay on whatever digest was last pulled, so deployments are effectively frozen yet carry no version information.
  • Recommendations:
    • Pin all production images to specific semantic versions (e.g., ghcr.io/foo/bar:v1.2.3, not :latest).
    • Use Diun to track new releases and trigger automated testing in staging.
    • Update the CI/CD pipeline to require version tags for production deployments. A pinned-tag plus Diun annotation sketch follows this list.
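
A sketch of the pod-template stanza inside an existing kubernetes_deployment_v1, combining a pinned tag with Diun annotations so new releases are reported rather than silently adopted; the image, labels, and annotation values are illustrative:

```hcl
template {
  metadata {
    labels = {
      app = "example"
    }
    annotations = {
      "diun.enable"     = "true"
      "diun.watch_repo" = "true" # report new tags, not just changes to the pinned one
    }
  }

  spec {
    container {
      name  = "example"
      image = "ghcr.io/foo/bar:v1.2.3" # pinned release instead of :latest
    }
  }
}
```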

Performance Bottlenecks

Insufficient Health Probes on Critical Services:

  • Problem: Only 14 services have liveness/readiness probes out of 70+ services. Missing probes on databases (MySQL, PostgreSQL, Redis), ingress, auth.
  • Files: All stacks (identified via grep: 14 instances of liveness/readiness out of 70+ services).
  • Cause: Without probes, Kubernetes neither restarts unhealthy pods nor removes them from Service endpoints, so cascading failures stay silent.
  • Improvement path: Add livenessProbe, readinessProbe, and startupProbe to all stateful services (databases, message queues, auth providers), using TCP/HTTP/exec probes appropriate to each service; a sketch follows this item.
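
A sketch of the probe stanzas to add inside the container block of an existing kubernetes_deployment_v1 or StatefulSet; the port and command assume a PostgreSQL-style service and should be swapped for HTTP probes on web services:

```hcl
liveness_probe {
  tcp_socket {
    port = 5432
  }
  initial_delay_seconds = 30
  period_seconds        = 10
  failure_threshold     = 3
}

readiness_probe {
  exec {
    command = ["pg_isready", "-U", "postgres"] # placeholder; use http_get for web services
  }
  period_seconds = 10
}

startup_probe {
  tcp_socket {
    port = 5432
  }
  # Allow up to 5 minutes for slow cold starts before liveness checks take over.
  failure_threshold = 30
  period_seconds    = 10
}
```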

Pod Disruption Budgets Missing:

  • Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
  • Files: All stacks (need comprehensive PodDisruptionBudget coverage).
  • Cause: PDBs are optional; many assume single-replica stateless services won't need them.
  • Improvement path: Add a PDB with minAvailable: 1 to all services with replicas > 1 (a sketch follows this item). For single-replica services, either mark them as non-critical (lower PriorityClass) or accept downtime during node maintenance.
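
A minimal PDB sketch, assuming a multi-replica Deployment labeled app = traefik; the selector and threshold are placeholders per service:

```hcl
resource "kubernetes_pod_disruption_budget_v1" "traefik" {
  metadata {
    name      = "traefik"
    namespace = "traefik"
  }

  spec {
    min_available = 1 # raise to 2 once the service runs 3 replicas

    selector {
      match_labels = {
        app = "traefik"
      }
    }
  }
}
```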

Resource Requests Sparse, Limits Missing:

  • Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults but CPU limits often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
  • Files: Multiple stacks (e.g., stacks/immich/main.tf, stacks/ebook2audiobook/main.tf, stacks/ollama/main.tf).
  • Cause: Request/limit tuning requires load testing; defaults used instead.
  • Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory use. Set requests to P50 usage and limits to P99 (a sketch follows this item). Monitor via Prometheus and adjust quarterly.
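
A sketch of the resources stanza inside the container block; the numbers are placeholders to be replaced with measured P50 (requests) and P99 (limits) figures:

```hcl
resources {
  requests = {
    cpu    = "500m"
    memory = "1Gi"
  }
  limits = {
    cpu    = "2"
    memory = "4Gi"
    # GPU workloads additionally set "nvidia.com/gpu" = "1" in both maps.
  }
}
```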

Large Terraform Modules (900+ lines):

  • Problem: stacks/platform/modules/dbaas/main.tf is 916 lines; stacks/immich/main.tf is 660 lines; others > 450 lines.
  • Files: stacks/platform/modules/dbaas/main.tf (916 lines), stacks/platform/modules/nvidia/main.tf (658 lines), stacks/platform/modules/kyverno/resource-governance.tf (809 lines).
  • Cause: Monolithic resource definitions; hard to navigate and test.
  • Improvement path: Split large modules into sub-modules (e.g., dbaas/mysql/, postgresql/, pgadmin/, backups/); a sketch follows this item. Use Terraform workspaces for per-database configuration.
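
A sketch of how the platform stack could consume the split, assuming a future sub-module layout (paths, variables, and outputs are hypothetical):

```hcl
module "mysql" {
  source    = "./modules/dbaas/mysql"
  namespace = "dbaas"
}

module "postgresql" {
  source    = "./modules/dbaas/postgresql"
  namespace = "dbaas"
}

module "pgadmin" {
  source        = "./modules/dbaas/pgadmin"
  namespace     = "dbaas"
  postgres_host = module.postgresql.service_name # hypothetical output of the postgresql sub-module
}
```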

Fragile Areas

Immich Machine Learning GPU Dependency:

  • Files: stacks/immich/main.tf (lines 380-450).
  • Why fragile: GPU workload (immich-machine-learning-cuda) requires Tesla T4 on k8s-node1. If GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently (no fallback). Single GPU point of failure.
  • Safe modification: Add nodeAffinity that prefers the GPU node but allows a non-GPU fallback (degraded mode); a sketch follows this entry. Implement health checks on GPU availability (an nvidia-smi probe). Test the GPU-failure scenario before relying on it in production.
  • Test coverage: No tests for GPU unavailability; assumes GPU always available.
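
A sketch of the preferred (rather than required) node affinity stanza for the pod template spec; the label key assumes GPU feature discovery labels the node, and true CPU fallback also requires making the nvidia.com/gpu resource request optional:

```hcl
affinity {
  node_affinity {
    preferred_during_scheduling_ignored_during_execution {
      weight = 100
      preference {
        match_expressions {
          key      = "nvidia.com/gpu.present" # assumption: set by GPU feature discovery on k8s-node1
          operator = "In"
          values   = ["true"]
        }
      }
    }
  }
}
```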

Nextcloud Backup/Restore Procedures Manual:

  • Files: stacks/nextcloud/main.tf (backup.sh and restore.sh ConfigMaps).
  • Why fragile: Backup/restore scripts are ConfigMap-based; no automation. Restoration requires manual kubectl exec and script execution. No tested recovery procedure.
  • Safe modification: Implement automated backup via Velero or CSI snapshots. Test restore procedure monthly via staged environment.
  • Test coverage: No automated backup validation; scripts untested.

NFS Dependency for Data Persistence:

  • Files: 126 references to NFS volumes across all stacks.
  • Why fragile: All stateful data depends on the NFS server at 10.0.10.15. If NFS becomes unavailable, every service immediately loses access to its data (no local caches), and there is no fallback storage.
  • Safe modification: Enable NFS client-side attribute caching (Linux NFS mount options ac,acregmin=3600); a sketch follows this entry. Monitor NFS availability via Prometheus alerts (mount point offline). Test the NFS failover procedure (if a replica NFS server exists).
  • Test coverage: No chaos engineering tests for NFS unavailability.
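
A minimal sketch of attribute caching via mount options on an NFS PersistentVolume; the capacity, path, and name are placeholders, and caching only softens metadata load — it does not protect against the server going away:

```hcl
resource "kubernetes_persistent_volume" "example_nfs" {
  metadata {
    name = "example-nfs"
  }

  spec {
    capacity = {
      storage = "10Gi"
    }
    access_modes  = ["ReadWriteMany"]
    mount_options = ["ac", "acregmin=3600"] # cache file attributes for up to an hour

    persistent_volume_source {
      nfs {
        server = "10.0.10.15"
        path   = "/mnt/main/example" # placeholder export path
      }
    }
  }
}
```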

Istio Injection Disabled Cluster-Wide:

  • Files: stacks/real-estate-crawler/main.tf (line 19): "istio-injection" : "disabled" on namespace labels.
  • Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
  • Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production.
  • Test coverage: No mTLS validation; assumes all pods on same network are trusted.

PostgreSQL Custom Image Not Tracked:

  • Files: stacks/platform/modules/dbaas/main.tf (line 717): image = "viktorbarzin/postgres:16-master".
  • Why fragile: Custom build at Docker Hub with PostGIS + PgVector extensions. No version tag; :master tag is mutable. Upstream extension versions unknown.
  • Safe modification: Pin to semantic version (e.g., :16.4-postgis3.4-pgvector0.8). Build images locally with Dockerfile tracked in git. Test extension versions against application requirements.
  • Test coverage: No tests for extension availability or version compatibility.

Scaling Limits

Single-Replica Critical Services:

  • Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
  • Limit: Node failure causes immediate service outage. Kubernetes default takes 5+ minutes to reschedule pod.
  • Scaling path: Increase critical service replicas to 3 (quorum). Add pod anti-affinity to spread replicas across nodes (a sketch follows this entry). Implement a PodDisruptionBudget with minAvailable: 2.
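
A sketch of the anti-affinity stanza for the pod template spec of a multi-replica service; the app label is a placeholder:

```hcl
affinity {
  pod_anti_affinity {
    required_during_scheduling_ignored_during_execution {
      label_selector {
        match_labels = {
          app = "traefik"
        }
      }
      topology_key = "kubernetes.io/hostname" # at most one replica per node
    }
  }
}
```

Use preferred_during_scheduling_ignored_during_execution instead of required if the replica count can exceed the node count.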

GPU Capacity Bottleneck:

  • Current capacity: 1 Tesla T4 GPU on k8s-node1.
  • Limit: Immich ML + Ebook2Audiobook + Ollama all compete for single GPU. Queue time 10+ minutes for CPU-bound inference tasks.
  • Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).

NFS Storage Capacity:

  • Current capacity: /mnt/main/ mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
  • Limit: Immich (image library), Calibre (ebooks), Dawarich (location history) grow unbounded. When storage full, writes fail; services degrade.
  • Scaling path: Monitor NFS capacity monthly (df -h) and alert via Prometheus at 80% capacity (a sketch follows this entry). Plan for annual storage growth based on user behavior (e.g., 100 GB of Immich uploads per month).
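
A sketch of the 80%-capacity alert as a PrometheusRule via kubernetes_manifest, assuming kube-prometheus-stack is installed and a node-exporter (or equivalent) exposes filesystem metrics for the NFS mount; the mountpoint label and namespace are assumptions:

```hcl
resource "kubernetes_manifest" "nfs_capacity_alert" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "nfs-capacity"
      namespace = "monitoring"
    }
    spec = {
      groups = [{
        name = "nfs-capacity"
        rules = [{
          alert = "NFSVolumeAlmostFull"
          expr  = "node_filesystem_avail_bytes{mountpoint=\"/mnt/main\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/main\"} < 0.20"
          "for" = "15m"
          labels = {
            severity = "warning"
          }
          annotations = {
            summary = "NFS volume /mnt/main is over 80% full"
          }
        }]
      }]
    }
  }
}
```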

MySQL/PostgreSQL Connection Pool:

  • Current capacity: PgBouncer at dbaas/pgbouncer provides connection pooling. Default pool size likely 100-200 connections.
  • Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed pool. New connections queue or fail.
  • Scaling path: Monitor PgBouncer pool utilization (Prometheus metric pgbouncer_pools_used_connections). Increase pool size if > 80% utilization. Consider read replicas for read-heavy workloads.

API Rate Limiting & Bandwidth:

  • Current capacity: Services exposed via Traefik ingress. No global rate limiting documented.
  • Limit: External tools (Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior possible.
  • Scaling path: Implement Traefik rate-limiting middleware and monitor it via Traefik's Prometheus metrics (a sketch follows this entry). Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
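
A sketch of a Traefik rateLimit middleware via its CRD; the apiVersion depends on the Traefik version (older releases use traefik.containo.us/v1alpha1), and the limits and namespace are placeholders:

```hcl
resource "kubernetes_manifest" "rate_limit" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "rate-limit"
      namespace = "traefik"
    }
    spec = {
      rateLimit = {
        average = 100 # sustained requests per second per source
        burst   = 200
      }
    }
  }
}
```

The middleware must then be attached to the relevant routers, either on an IngressRoute or via the traefik.ingress.kubernetes.io/router.middlewares annotation on the Ingress.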

Dependencies at Risk

Redis Stack :latest Tag:

  • Risk: stacks/platform/modules/redis/main.tf uses image = "redis/redis-stack:latest". Redis Stack is actively developed; breaking changes possible.
  • Impact: Unexpected version upgrade could introduce incompatibilities with clients expecting specific command set or module versions.
  • Migration plan: Pin to a specific Redis Stack release tag rather than :latest. Test version upgrades in staging first. Monitor Redis logs for deprecated-command warnings.

Immich Upgrade and Migration Risk:

  • Risk: stacks/immich/main.tf pins Immich to v2.5.6, but Immich releases patch versions frequently, and upgrades run database migrations that can cause downtime.
  • Impact: If Immich version upgrades without testing, database migrations could fail or hang (no rollback mechanism).
  • Migration plan: Pin to specific patch versions (e.g., v2.5.6, not v2.5). Test Immich upgrades in staging first. Maintain backup before upgrading.

Unsupported MySQL 9.2.0:

  • Risk: stacks/platform/modules/dbaas/main.tf specifies image = "mysql:9.2.0". MySQL 9.x releases are short-lived Innovation releases, not LTS.
  • Impact: Innovation releases receive only short-term support before being superseded; stability regressions and unpatched CVEs become possible once support lapses. No long-term support.
  • Migration plan: Migrate to MySQL 8.4 LTS (or deliberately track the latest Innovation release). Test data migration first. Plan for a gradual rollout.

Python Timeouts in Monitoring Scripts:

  • Risk: stacks/platform/modules/nvidia/main.tf uses a hardcoded timeout=10 for HTTP requests and subprocess calls; slow network conditions cause these calls to fail.
  • Impact: GPU monitoring fails when the network is slow or unavailable, and the failures can be silent.
  • Migration plan: Implement exponential backoff and retry logic (e.g., tenacity library). Increase timeout to 30s for unreliable networks. Log timeouts for debugging.

Missing Critical Features

No Disaster Recovery Plan:

  • Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster.
  • Blocks: If cluster data were lost, recovery would be manual and time-consuming. No RTO/RPO is defined.
  • Impact: Recovery would likely take more than 24 hours, with accompanying risk of data loss.

No Secrets Rotation Policy:

  • Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule.
  • Blocks: If key leaked, manual intervention required to rotate across all services.
  • Impact: Leaked credentials persist until discovery.

No Cross-Cluster Failover:

  • Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
  • Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage.
  • Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision).

Test Coverage Gaps

No Infrastructure Testing:

  • What's not tested: Terraform applies, Helm charts, and manifests are validated only via terraform plan. There is no Terratest and no functional testing of deployed services.
  • Files: All stacks (no test files found).
  • Risk: Typos, variable misconfigurations, missing dependencies not caught until production apply.
  • Priority: High — add terratest to validate Terraform. Test critical paths (database connection, ingress routing).

No Chaos Engineering Tests:

  • What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
  • Files: All stacks (no chaos tests found).
  • Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested.
  • Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.

No Backup Restoration Tests:

  • What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but never executed.
  • Files: stacks/nextcloud/main.tf, stacks/platform/modules/dbaas/main.tf.
  • Risk: Backups corrupt or unusable when needed. RPO > 24 hours if discovery slow.
  • Priority: High — monthly restore-to-staging test. Automate backup validation.

No Security Scanning for Vulnerabilities:

  • What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
  • Files: All stacks, all container images.
  • Risk: Known vulnerabilities deployed to production. No supply chain security.
  • Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.

Concerns audit: 2026-02-23