Codebase Concerns
Analysis Date: 2026-02-23
Tech Debt
MySQL Backup Rotation Not Implemented:
- Issue: Backup rotation logic exists (comment at `stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size noted as 11MB; rotation deferred.
- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206)
- Impact: Backup directory can grow unbounded; no automated retention policy is enforced. Manual cleanup required.
- Fix approach: Implement a full rotation schedule using `find -mtime +N`, or migrate to an external backup solution (Velero, pgBackRest). Set up a CronJob with a proper retention policy (e.g., 14-day retention), as sketched below.
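A minimal sketch of such a CronJob via the Terraform kubernetes provider, assuming a `dbaas` namespace and a `mysql-backups` PVC (both names hypothetical):

```hcl
# Hypothetical sketch: prune MySQL dumps older than 14 days on the backup volume.
resource "kubernetes_cron_job_v1" "mysql_backup_rotation" {
  metadata {
    name      = "mysql-backup-rotation" # hypothetical name
    namespace = "dbaas"                 # assumed namespace
  }
  spec {
    schedule = "0 3 * * *" # daily at 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name    = "rotate"
              image   = "busybox:1.36"
              command = ["sh", "-c", "find /backups -name '*.sql.gz' -mtime +14 -delete"]
              volume_mount {
                name       = "backups"
                mount_path = "/backups"
              }
            }
            volume {
              name = "backups"
              persistent_volume_claim {
                claim_name = "mysql-backups" # assumed PVC name
              }
            }
          }
        }
      }
    }
  }
}
```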
PostgreSQL Major Version Upgrade Blocked:
- Issue: Comment at `stacks/platform/modules/dbaas/main.tf:718` indicates PostgreSQL 17.2 requires `pg_upgrade` on the data directory, but this is not implemented.
- Files: `stacks/platform/modules/dbaas/main.tf` (line 718)
- Impact: Cannot upgrade PostgreSQL from `16-master` to 17.2. When an upgrade is needed, it requires a manual `pg_upgrade` procedure with high downtime risk.
- Fix approach: Implement a `pg_upgrade` CronJob or a StatefulSet init container that performs the in-place upgrade (see the sketch below). Test the migration path against a backup first.
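A heavily hedged sketch of the init-container approach. `tianon/postgres-upgrade` is a community image whose entrypoint runs `initdb` plus `pg_upgrade` against per-version data directories; it does not ship PostGIS or pgvector, so the custom extensions here would need a matching custom build. Nothing below is validated against this cluster:

```hcl
# Hypothetical sketch: a one-shot init container on the PostgreSQL StatefulSet.
init_container {
  name  = "pg-upgrade"
  image = "tianon/postgres-upgrade:16-to-17" # community image; lacks PostGIS/pgvector
  # The image expects the old cluster at /var/lib/postgresql/16/data and writes
  # the new one to /var/lib/postgresql/17/data. Guard externally (e.g., check for
  # an existing 17/data/PG_VERSION) so the upgrade only ever runs once.
  volume_mount {
    name       = "data" # assumed volume name for the shared data directory
    mount_path = "/var/lib/postgresql"
  }
}
```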
TP-Link Gateway Reverse Proxy Not Functional:
- Issue: The reverse proxy module for the TP-Link gateway is marked "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`.
- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95)
- Impact: Gateway access over HTTPS or HTTP routing is non-functional. The scope of impact on dependent services is unknown.
- Fix approach: Either complete the reverse proxy implementation (Traefik/Nginx config) or document why it is disabled. Clarify whether the gateway is still accessible via HTTP or LAN-only.
WireGuard Firewall Rules Incomplete:
- Issue: Client firewall restrictions are not written at `terraform.tfvars:430`; only a placeholder exists.
- Files: `terraform.tfvars` (lines 430-434)
- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can reach cluster APIs unless firewall rules are enforced at the kernel level.
- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"), as sketched below. Verify with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist.
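An illustrative tfvars shape for per-client rules. The variable name, client names, and tunnel IPs are hypothetical, not the repo's actual schema; the point is deny-by-default toward the cluster CIDR, with narrower allows evaluated first:

```hcl
# Hypothetical terraform.tfvars sketch of per-client WireGuard PostUp rules.
wireguard_client_firewall_rules = {
  "guest-phone" = [
    "iptables -A FORWARD -s 10.100.0.10/32 -p udp --dport 53 -j ACCEPT", # DNS only
    "iptables -A FORWARD -s 10.100.0.10/32 -d 10.32.0.0/12 -j DROP",     # no cluster access
  ]
  "admin-laptop" = [
    "iptables -A FORWARD -s 10.100.0.2/32 -d 10.32.0.0/12 -j ACCEPT",    # full cluster access
  ]
}
```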
Known Bugs & Issues
Immich Database Compatibility Mismatch:
- Symptoms: The Immich PostgreSQL pod and the dbaas PostgreSQL run mismatched custom images. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is `16-master` with a PostGIS/pgvector mix.
- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717)
- Trigger: If the Immich database is migrated to the shared dbaas PostgreSQL, extension version incompatibility will cause failures.
- Workaround: Keep Immich on its isolated PostgreSQL 15 with Immich-specific extensions. If consolidation is needed, test extension compatibility first.
Realestate-Crawler Latest Image Tag Ignores Updates:
- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`.
- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82)
- Trigger: New versions of `immoweb:latest` will never be deployed; Terraform ignores image updates, so a manual image pull/push is required.
- Workaround: Use Diun annotations to track image updates. Consider version-pinned tags instead of `:latest`, and remove `ignore_changes` if auto-updates are desired (see the sketch below).
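A sketch of the pinned-tag variant with the lifecycle ignore dropped, so Terraform reconciles image bumps. The tag `v1.4.0` and the namespace are placeholders, not real values from the repo:

```hcl
resource "kubernetes_deployment" "realestate_crawler_ui" {
  metadata {
    name      = "realestate-crawler-ui"
    namespace = "real-estate-crawler" # assumed namespace
  }
  spec {
    selector {
      match_labels = { app = "realestate-crawler-ui" }
    }
    template {
      metadata {
        labels = { app = "realestate-crawler-ui" }
      }
      spec {
        container {
          name  = "ui"
          image = "viktorbarzin/immoweb:v1.4.0" # placeholder pinned tag; was :latest
        }
      }
    }
  }
  # lifecycle { ignore_changes = [...] } removed so new tags roll out via Terraform.
}
```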
Security Considerations
OpenClaw Has Cluster-Admin Permissions:
- Risk: OpenClaw ServiceAccount granted unrestricted
cluster-adminClusterRoleBinding atstacks/openclaw/main.tf:41-54. - Files:
stacks/openclaw/main.tf(lines 34-55) - Current mitigation:
dangerouslyDisableDeviceAuth = truein config (line 89) disables device auth but relies on network access control. - Recommendations:
- Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g.,
get/list/watch pods,exec into pods,apply resources in specific namespaces). - Re-enable device auth or implement mTLS between OpenClaw and operators.
- Audit OpenClaw logs for unauthorized API calls (enable API server audit logs).
- Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g.,
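A sketch of a namespaced replacement for the cluster-admin binding, assuming the ServiceAccount is named `openclaw` in an `openclaw` namespace (both assumptions):

```hcl
# Sketch: a namespaced Role covering only the verbs listed above.
resource "kubernetes_role" "openclaw" {
  metadata {
    name      = "openclaw-runner" # hypothetical name
    namespace = "openclaw"
  }
  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log"]
    verbs      = ["get", "list", "watch"]
  }
  rule {
    api_groups = [""]
    resources  = ["pods/exec"]
    verbs      = ["create"] # exec sessions are created, not read
  }
}

resource "kubernetes_role_binding" "openclaw" {
  metadata {
    name      = "openclaw-runner"
    namespace = "openclaw"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.openclaw.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
    name      = "openclaw" # assumed ServiceAccount name
    namespace = "openclaw"
  }
}
```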
Git-Crypt Key Mounted as ConfigMap:
- Risk: The git-crypt key at `stacks/openclaw/main.tf:68-76` is stored as plain-text ConfigMap data. Any pod on the cluster can read it (unless RBAC enforces secrets-only access).
- Files: `stacks/openclaw/main.tf` (lines 68-76)
- Current mitigation: None; the ConfigMap is world-readable by default.
- Recommendations:
  - Move the git-crypt key to a Kubernetes Secret instead of a ConfigMap (see the sketch below).
  - Add an RBAC policy restricting secret reads to the openclaw namespace only.
  - Consider external secret management (Authentik-backed secret injection, Sealed Secrets).
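A minimal sketch of the Secret variant; the key file path and Secret name are illustrative, not the repo's actual values:

```hcl
# Sketch: store the git-crypt key as a Secret rather than a ConfigMap.
resource "kubernetes_secret" "git_crypt_key" {
  metadata {
    name      = "git-crypt-key" # hypothetical name
    namespace = "openclaw"      # assumed namespace
  }
  binary_data = {
    "git-crypt-key" = filebase64("${path.module}/git-crypt-key") # hypothetical source path
  }
  type = "Opaque"
}
```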
SSH Private Key Stored as Secret:
- Risk: The SSH private key for OpenClaw is stored at `stacks/openclaw/main.tf:57-66` as an unencrypted Secret, readable by any pod with secret access.
- Files: `stacks/openclaw/main.tf` (lines 57-66)
- Current mitigation: The Secret is only readable by the openclaw namespace (if RBAC is enforced); encryption at rest is not confirmed.
- Recommendations:
  - Rotate the SSH key regularly; consider ed25519 keys (shorter, stronger).
  - Audit Secret access via Kubernetes audit logs.
  - Use an external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.
WireGuard VPN Clients Unrestricted:
- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules are defined. No per-client segmentation.
- Files: `terraform.tfvars` (lines 430-434)
- Current mitigation: Attempted iptables rules are commented out; nothing is enforced.
- Recommendations:
  - Define explicit client restrictions in the WireGuard firewall script (uncomment/complete lines 433-434).
  - Implement a deny-by-default firewall (drop all, then allow specific routes).
  - Consider separate WireGuard interfaces for different trust levels (admin vs. guest).
Multiple :latest Image Tags in Production:
- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others).
- Files: Multiple stacks (identified via grep for `:latest`).
- Current mitigation: Diun annotations track updates but don't auto-pull; images are immutable but unversioned.
- Recommendations:
  - Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`).
  - Use Diun to track new releases and trigger automated testing in staging.
  - Update the CI/CD pipeline to require version tags for production deployments.
Performance Bottlenecks
Insufficient Health Probes on Critical Services:
- Problem: Only 14 of 70+ services have liveness/readiness probes. Probes are missing on databases (MySQL, PostgreSQL, Redis), ingress, and auth.
- Files: All stacks (identified via grep: 14 probe definitions across 70+ services).
- Cause: Without probes, Kubernetes never restarts unhealthy pods or pulls them from Service endpoints, so cascading failures stay silent.
- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers), using TCP/HTTP probes appropriate to each service; a sketch follows.
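A sketch of the three probe blocks inside a deployment's `container {}` definition. The Redis-flavoured values (port 6379, `redis-cli ping`) are stand-ins to adapt per service:

```hcl
container {
  name  = "redis"
  image = "redis:7.4" # stand-in service

  # Startup probe gives slow starters time before liveness kicks in.
  startup_probe {
    tcp_socket {
      port = 6379
    }
    failure_threshold = 30
    period_seconds    = 5
  }

  # Liveness probe restarts the container if the server stops answering.
  liveness_probe {
    exec {
      command = ["redis-cli", "ping"]
    }
    period_seconds = 10
  }

  # Readiness probe removes the pod from Service endpoints while not ready.
  readiness_probe {
    tcp_socket {
      port = 6379
    }
    period_seconds = 5
  }
}
```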
Pod Disruption Budgets Missing:
- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
- Files: All stacks (comprehensive PodDisruptionBudget coverage needed).
- Cause: PDBs are optional, and single-replica stateless services are often assumed not to need them.
- Improvement path: Add a PDB with `minAvailable: 1` to all services with `replicas > 1` (see the sketch below). For single-replica services, either mark them as non-critical (lower PriorityClass) or accept downtime during node maintenance.
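A sketch of one such budget, using Traefik as a stand-in; the namespace and labels are assumptions:

```hcl
# Sketch: keep at least one replica running through voluntary disruptions.
resource "kubernetes_pod_disruption_budget_v1" "traefik" {
  metadata {
    name      = "traefik"
    namespace = "traefik" # assumed namespace
  }
  spec {
    min_available = "1"
    selector {
      match_labels = { app = "traefik" } # assumed label
    }
  }
}
```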
Resource Requests Sparse, Limits Missing:
- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults, but the CPU limits are often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`).
- Cause: Request/limit tuning requires load testing; defaults are used instead.
- Improvement path: Run load tests on the GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage and limits to P99 (sketched below). Monitor via Prometheus and adjust quarterly.
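A sketch of the `resources {}` block inside a container definition; every number is a placeholder pending the load tests described above:

```hcl
# Sketch: requests at observed P50 usage, limits at P99, for a bursty GPU workload.
resources {
  requests = {
    cpu    = "500m" # placeholder: measured P50
    memory = "2Gi"
  }
  limits = {
    cpu              = "4" # placeholder: measured P99
    memory           = "8Gi"
    "nvidia.com/gpu" = "1" # only on the GPU variant
  }
}
```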
Large Terraform Modules (900+ lines):
- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others exceed 450 lines.
- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines).
- Cause: Monolithic resource definitions; hard to navigate and test.
- Improvement path: Split large modules into sub-modules (e.g., `dbaas/` → `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration.
Fragile Areas
Immich Machine Learning GPU Dependency:
- Files: `stacks/immich/main.tf` (lines 380-450).
- Why fragile: The GPU workload (`immich-machine-learning-cuda`) requires the Tesla T4 on k8s-node1. If the GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently with no fallback; the single GPU is a point of failure.
- Safe modification: Add `nodeAffinity` to prefer the GPU node but allow a non-GPU fallback (degraded mode), as sketched below. Implement health checks on GPU availability (an `nvidia-smi` probe). Test the GPU failure scenario before relying on it in production.
- Test coverage: No tests for GPU unavailability; the stack assumes the GPU is always available.
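A sketch of the soft affinity inside the pod spec, assuming the GPU Operator's `nvidia.com/gpu.present` node label (an assumption about this cluster's labels). Note that the pod must not also hard-request `nvidia.com/gpu` in its resource limits, or it can never schedule onto a non-GPU node:

```hcl
# Sketch: prefer (not require) the GPU node so the pod can fall back to
# CPU-only nodes in degraded mode.
affinity {
  node_affinity {
    preferred_during_scheduling_ignored_during_execution {
      weight = 100
      preference {
        match_expressions {
          key      = "nvidia.com/gpu.present" # assumed GPU Operator label
          operator = "In"
          values   = ["true"]
        }
      }
    }
  }
}
```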
Nextcloud Backup/Restore Procedures Manual:
- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps).
- Why fragile: Backup/restore scripts are ConfigMap-based with no automation. Restoration requires manual `kubectl exec` and script execution; there is no tested recovery procedure.
- Safe modification: Implement automated backup via Velero or CSI snapshots. Test the restore procedure monthly in a staging environment.
- Test coverage: No automated backup validation; the scripts are untested.
NFS Dependency for Data Persistence:
- Files: 126 references to NFS volumes across all stacks.
- Why fragile: All stateful data depends on the NFS server at 10.0.10.15. If NFS becomes unavailable, every service loses access to its data immediately (no local caches), with no fallback storage.
- Safe modification: Enable NFS client-side attribute caching (mount options `ac`, `acregmin=3600`), as sketched below. Monitor NFS availability via Prometheus alerts (mount point offline). Test the NFS failover procedure, if a replica NFS server exists.
- Test coverage: No chaos engineering tests for NFS unavailability.
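A sketch of a PersistentVolume carrying those mount options; the PV name and export path are hypothetical, while the server address is the one cited above:

```hcl
# Sketch: NFS PV with client-side attribute caching to soften brief NFS blips.
resource "kubernetes_persistent_volume" "example_nfs" {
  metadata {
    name = "example-nfs" # hypothetical
  }
  spec {
    capacity      = { storage = "10Gi" }
    access_modes  = ["ReadWriteMany"]
    mount_options = ["ac", "acregmin=3600"] # cache file attributes for up to 1h
    persistent_volume_source {
      nfs {
        server = "10.0.10.15"
        path   = "/mnt/main/example" # hypothetical export path
      }
    }
  }
}
```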
Istio Injection Disabled Cluster-Wide:
- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` in the namespace labels.
- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor the resource overhead, then gradually roll out to production.
- Test coverage: No mTLS validation; all pods on the same network are implicitly trusted.
PostgreSQL Custom Image Not Tracked:
- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`.
- Why fragile: A custom Docker Hub build with PostGIS + pgvector extensions. There is no version pin; the `:16-master` tag is mutable, and the upstream extension versions are unknown.
- Safe modification: Pin to a semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with a Dockerfile tracked in git. Test extension versions against application requirements.
- Test coverage: No tests for extension availability or version compatibility.
Scaling Limits
Single-Replica Critical Services:
- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
- Limit: A node failure causes an immediate service outage; Kubernetes typically takes 5+ minutes to reschedule the pod.
- Scaling path: Increase critical services to 3 replicas (quorum). Add pod anti-affinity to spread them across nodes. Implement a PodDisruptionBudget with `minAvailable: 2`.
GPU Capacity Bottleneck:
- Current capacity: 1 Tesla T4 GPU on k8s-node1.
- Limit: Immich ML, Ebook2Audiobook, and Ollama all compete for the single GPU; queue times reach 10+ minutes for CPU-bound inference tasks.
- Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).
NFS Storage Capacity:
- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
- Limit: Immich (image library), Calibre (ebooks), and Dawarich (location history) grow unbounded. When storage fills, writes fail and services degrade.
- Scaling path: Monitor NFS capacity monthly (`df -h`). Set a Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB/month for Immich).
MySQL/PostgreSQL Connection Pool:
- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling; the default pool size is likely 100-200 connections.
- Limit: Many simultaneous clients (Nextcloud, Affine, Gramps Web, Authentik) can exceed the pool, so new connections queue or fail.
- Scaling path: Monitor PgBouncer pool utilization (Prometheus metric `pgbouncer_pools_used_connections`). Increase the pool size above 80% utilization. Consider read replicas for read-heavy workloads.
API Rate Limiting & Bandwidth:
- Current capacity: Services are exposed via Traefik ingress; no global rate limiting is documented.
- Limit: External tools (the Immich mobile app, ebook2audiobook processing) can spike bandwidth, making DoS-like behavior possible.
- Scaling path: Implement Traefik rate-limiting middleware (Prometheus-aware), as sketched below. Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
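A sketch of a Traefik `rateLimit` middleware applied via `kubernetes_manifest`; the thresholds, name, and namespace are placeholders, and the `apiVersion` should be adjusted to the installed Traefik CRD version:

```hcl
# Sketch: a global rate-limit middleware to attach to IngressRoutes.
resource "kubernetes_manifest" "ratelimit" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1" # older installs use traefik.containo.us/v1alpha1
    kind       = "Middleware"
    metadata = {
      name      = "global-ratelimit" # hypothetical name
      namespace = "traefik"          # assumed namespace
    }
    spec = {
      rateLimit = {
        average = 100 # placeholder: sustained requests/second per source
        burst   = 200 # placeholder: short-burst allowance
      }
    }
  }
}
```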
Dependencies at Risk
Redis Stack :latest Tag:
- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed, so breaking changes are possible.
- Impact: An unexpected version upgrade could introduce incompatibilities with clients expecting a specific command set or module versions.
- Migration plan: Pin to a specific Redis Stack version (e.g., `:7.2-rc1`). Test version upgrades in staging first. Monitor Redis logs for deprecated-command warnings.
Immich Version Upgrade Risk:
- Risk: `stacks/immich/main.tf` pins to `v2.5.6`, but Immich frequently releases patch versions, and its database migrations can cause downtime.
- Impact: If the Immich version is upgraded without testing, database migrations could fail or hang (there is no rollback mechanism).
- Migration plan: Pin to specific patch versions (e.g., `v2.5.6`, not `v2.5`). Test Immich upgrades in staging first. Take a backup before upgrading.
Unsupported MySQL 9.2.0:
- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.2 is an Innovation-track release with a short support window (as of Feb 2026), not an LTS.
- Impact: Short-lived releases are not recommended for production: stability issues and unpatched CVEs are possible once the support window lapses, and there is no long-term support.
- Migration plan: Migrate to MySQL 8.4 LTS (or a later LTS release). Test the data migration first. Plan a gradual rollout.
Python Timeouts in Monitoring Scripts:
- Risk: `stacks/platform/modules/nvidia/main.tf` uses a hardcoded `timeout=10` for HTTP requests and subprocess calls, which fails under slow network conditions.
- Impact: GPU monitoring fails when the network is slow or unavailable; silent failures are possible.
- Migration plan: Implement exponential backoff and retry logic (e.g., the `tenacity` library). Increase the timeout to 30s for unreliable networks. Log timeouts for debugging.
Missing Critical Features
No Disaster Recovery Plan:
- Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster.
- Blocks: If cluster data lost, recovery would be manual and time-consuming. No RTO/RPO defined.
- Impact: High data-loss risk; recovery could take more than 24 hours.
No Secrets Rotation Policy:
- Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule.
- Blocks: If key leaked, manual intervention required to rotate across all services.
- Impact: Leaked credentials persist until discovery.
No Cross-Cluster Failover:
- Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
- Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage.
- Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision).
Test Coverage Gaps
No Infrastructure Testing:
- What's not tested: Terraform applies, Helm charts, and manifests are only validated via `terraform plan`. No Terratest, no functional tests of deployed services.
- Files: All stacks (no test files found).
- Risk: Typos, variable misconfigurations, and missing dependencies are not caught until a production apply.
- Priority: High — add Terratest to validate Terraform. Test critical paths (database connections, ingress routing).
No Chaos Engineering Tests:
- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
- Files: All stacks (no chaos tests found).
- Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested.
- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.
No Backup Restoration Tests:
- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but have never been executed.
- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`.
- Risk: Backups may be corrupt or unusable when needed; RPO exceeds 24 hours if discovery is slow.
- Priority: High — run a monthly restore-to-staging test. Automate backup validation.
No Security Scanning for Vulnerabilities:
- What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
- Files: All stacks, all container images.
- Risk: Known vulnerabilities deployed to production. No supply chain security.
- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.
Concerns audit: 2026-02-23