diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md new file mode 100644 index 00000000..74b2d0dd --- /dev/null +++ b/.planning/codebase/ARCHITECTURE.md @@ -0,0 +1,165 @@ +# Architecture + +**Analysis Date:** 2026-02-23 + +## Pattern Overview + +**Overall:** Terragrunt-based IaC with per-service state isolation, using Kubernetes as the primary platform and Proxmox for VM infrastructure. + +**Key Characteristics:** +- Monorepo containing ~70 service stacks with independent state files +- Declarative, GitOps-driven infrastructure using Terraform + Terragrunt +- DRY provider/backend configuration via root `terragrunt.hcl` +- Clear layering: platform (core/cluster services) → application stacks → shared modules +- Service decoupling with explicit dependencies via `dependency` blocks +- Resource governance through Kubernetes tier system (0-core through 4-aux) + +## Layers + +**Platform Layer (`stacks/platform/main.tf`):** +- Purpose: Core infrastructure services that enable all application stacks (22 modules) +- Location: `stacks/platform/` +- Contains: MetalLB, DBaaS, Redis, Traefik, Technitium DNS, Headscale VPN, Authentik SSO, RBAC, CrowdSec, Prometheus/Grafana/Loki monitoring, nginx reverse proxy, mailserver, GPU node configuration, Kyverno policy engine +- Depends on: Kubernetes cluster (declared via `stacks/infra` dependency), External secrets in `terraform.tfvars` +- Used by: All application stacks declare `dependency "platform"` to ensure platform is applied first + +**Infrastructure Layer (`stacks/infra/main.tf`):** +- Purpose: VM template provisioning and Proxmox resource management +- Location: `stacks/infra/` +- Contains: K8s node templates via cloud-init, docker-registry VM, Proxmox VM lifecycle +- Depends on: Proxmox API credentials +- Used by: Platform stack depends on it to ensure infrastructure is ready + +**Application Stacks (~70 services):** +- Purpose: User-facing and supplementary services (Nextcloud, Immich, 
Matrix, Ollama, etc.) +- Location: `stacks//main.tf` (102 total stacks) +- Contains: Kubernetes namespaces, Helm releases, raw Kubernetes resources (Deployments, StatefulSets, Services, PersistentVolumes) +- Depends on: Platform stack, shared TLS secret via `modules/kubernetes/setup_tls_secret`, optional NFS volumes +- Used by: Self-contained; declared dependencies control execution order + +**Shared Modules:** +- **Kubernetes utilities** (`modules/kubernetes/`): + - `ingress_factory/`: Reusable Traefik ingress + service template with anti-AI scraping, CrowdSec integration, rate limiting, authentication support + - `setup_tls_secret/`: TLS certificate secret setup in namespaces +- **Terraform modules** (`modules/`): + - `create-template-vm/`: Ubuntu cloud-init template VM provisioning (K8s and non-K8s variants) + - `create-vm/`: VM instance creation from templates + - `docker-registry/`: Docker registry pull-through cache configuration + +## Data Flow + +**Infrastructure Provisioning Flow:** + +1. **Initialize**: Root `terragrunt.hcl` loads `terraform.tfvars` globally, generates provider/backend configs +2. **Infra Stack Apply**: `stacks/infra/` creates/updates Proxmox VMs and Kubernetes node templates +3. **Platform Apply**: `stacks/platform/` applies all ~22 core services (depends on infra stack) +4. **Service Apply**: Individual `stacks//` apply their resources (depend on platform stack) + +Example dependency chain for Nextcloud: +``` +stacks/infra/main.tf (VMs) + ↓ (dependency) +stacks/platform/main.tf (Traefik, Redis, DBaaS, etc.) 
+ ↓ (dependency) +stacks/nextcloud/main.tf (Nextcloud Helm chart + storage) +``` + +**State Management:** +- Each stack has isolated state at `state/stacks//terraform.tfstate` +- Root `terragrunt.hcl` defines local backend: `path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"` +- Variables flow from `terraform.tfvars` → each stack's `terraform` block → Terraform execution +- Unused variables are silently ignored (Terraform 1.x behavior) + +**Configuration Flow:** +1. User edits `terraform.tfvars` (encrypted via git-crypt) +2. Each stack includes root terragrunt config: `include "root" { path = find_in_parent_folders() }` +3. Root config injects `terraform.tfvars` as `required_var_files` +4. Stack-specific `main.tf` declares which variables it uses + +## Key Abstractions + +**Tier System:** +- Purpose: Resource governance via Kubernetes PriorityClasses, LimitRanges, ResourceQuotas +- Tiers: `0-core` (critical: ingress, DNS, auth) → `4-aux` (optional workloads) +- Applied via: Kyverno policy engine in `stacks/platform/modules/kyverno/` +- Usage: Every namespace/pod gets labeled with tier; Kyverno generates corresponding LimitRange + ResourceQuota + +**Service Factory Pattern:** +- Purpose: Multi-tenant/multi-instance services (Actual Budget, Freedify) +- Pattern: Parent stack (`stacks//main.tf`) creates namespace + TLS secret, then calls `factory/` module multiple times +- Examples: `stacks/actualbudget/main.tf` calls `factory/` for viktor, anca, emo instances +- Each instance: Separate pod, service, NFS share, Cloudflare DNS entry + +**Ingress Factory (`modules/kubernetes/ingress_factory/`):** +- Purpose: DRY, opinionated Traefik ingress pattern with security defaults +- Variables: `name`, `namespace`, `port`, `host`, `protected`, `anti_ai_scraping` (default true) +- Provides: Service, Ingress, CrowdSec exemptions, rate limiting, Authentik ForwardAuth integration, anti-AI middleware +- Anti-AI layers: Bot blocking → X-Robots-Tag → Trap 
links → Tarpit → Poison content cache + +**NFS Volume Pattern:** +- Purpose: Persistent storage for stateful services +- Pattern: Inline NFS volumes in pod specs (preferred over PV/PVC) +- Server: `10.0.10.15` (TrueNAS) +- Paths: `/mnt/main/` or `/mnt/main//` +- Used by: ~60 services; registered in `secrets/nfs_directories.txt` (git-crypt encrypted) + +## Entry Points + +**Terragrunt Root (`terragrunt.hcl`):** +- Location: `/Users/viktorbarzin/code/infra/terragrunt.hcl` +- Triggers: `cd stacks/ && terragrunt plan/apply --non-interactive` +- Responsibilities: Load providers, backend, `terraform.tfvars`, set kube config path + +**Platform Stack (`stacks/platform/main.tf`):** +- Location: `stacks/platform/main.tf` (1000+ lines) +- Triggers: Applied before any service stack to ensure platform services exist +- Responsibilities: 22 module instantiations, tier definition, variable collection from tfvars + +**Service Stacks (`stacks//main.tf`):** +- Location: `stacks//main.tf` (27–456 lines, avg ~130) +- Triggers: `terragrunt apply --non-interactive` in service directory +- Responsibilities: Create namespace, setup TLS, instantiate Helm charts or raw K8s resources, configure storage + +**Proxmox/Infra Stack (`stacks/infra/main.tf`):** +- Location: `stacks/infra/main.tf` (200+ lines) +- Triggers: Applied first to ensure VM infrastructure is available +- Responsibilities: VM template creation, VM instance lifecycle, containerd mirror config + +**Factory Module (`stacks//factory/main.tf`):** +- Location: `stacks/actualbudget/factory/main.tf`, `stacks/freedify/factory/main.tf` +- Triggers: Called multiple times from parent `main.tf` with different `name` parameter +- Responsibilities: Single-instance deployment (pod, service, NFS share, ingress) + +## Error Handling + +**Strategy:** Declarative state reconciliation (Terraform/Kubernetes watch loop). No imperative error recovery. 
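The wiring behind this strategy is small and uniform across stacks; a minimal sketch using a hypothetical `example` stack (stack name, chart, and namespace are illustrative, not taken from the repo):

```hcl
# stacks/example/terragrunt.hcl — hypothetical stack following the repo's pattern
include "root" {
  path = find_in_parent_folders() # injects providers, backend, terraform.tfvars
}

dependency "platform" {
  config_path = "../platform" # applied after ../platform in run-all ordering
}

# stacks/example/main.tf
resource "helm_release" "example" {
  name      = "example"
  chart     = "example"
  namespace = "example"
  atomic    = true # a failed install/upgrade rolls back instead of half-applying
}
```

Because the loop is declarative, recovery from a failed apply is simply re-running `terragrunt apply` until the plan converges, not resuming a partial script.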
+ +**Patterns:** +- **Helm deployments**: `atomic = true` for rollback on failure +- **Terraform apply**: `--non-interactive` to prevent hanging on prompts +- **Cloud-init VM provisioning**: Embedded error logging in scripts; check `/var/log/cloud-init-output.log` on VM +- **Dependencies**: Explicit `dependency` blocks prevent applying child before parent +- **Validation**: `terraform plan` executed by CI before apply +- **Secrets**: git-crypt locking ensures encrypted state checked into repo; no accidental plaintext commits + +## Cross-Cutting Concerns + +**Logging:** Loki + Alloy (DaemonSet collects container logs) configured in `stacks/platform/modules/monitoring/` + +**Validation:** +- Terraform validation: `terraform validate` in CI/CD pipeline +- HCL formatting: `terraform fmt -recursive` +- Kyverno policies: Enforce resource requests, tier labels, pod security standards + +**Authentication:** +- **Kubernetes API**: OIDC via Authentik (issuer: `https://authentik.viktorbarzin.me/application/o/kubernetes/`) +- **Traefik Ingress**: Authentik ForwardAuth when `protected = true` in ingress_factory +- **TLS**: Shared secret injected into all namespaces via `setup_tls_secret` module + +**Rate Limiting:** Traefik middleware `default-rate-limit` (applied by ingress_factory unless `skip_default_rate_limit = true`) + +**Anti-AI Scraping:** 5-layer defense (bot blocking → headers → trap links → tarpit → poison content) applied via `anti_ai_scraping = true` (default) in ingress_factory; disable per-service with `anti_ai_scraping = false` + +--- + +*Architecture analysis: 2026-02-23* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md new file mode 100644 index 00000000..31665f3a --- /dev/null +++ b/.planning/codebase/CONCERNS.md @@ -0,0 +1,244 @@ +# Codebase Concerns + +**Analysis Date:** 2026-02-23 + +## Tech Debt + +**MySQL Backup Rotation Not Implemented:** +- Issue: Backup rotation logic exists (comment at 
`stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size noted as 11MB, rotation deferred. +- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206) +- Impact: Backup directory could grow unbounded; no automated retention policy enforced. Manual cleanup required. +- Fix approach: Implement full rotation schedule using `find -mtime +N` or migrate to external backup solution (Velero, pgbackrest). Set up CronJob with proper retention (e.g., 14-day backups). + +**PostgreSQL Major Version Upgrade Blocked:** +- Issue: Comment at `stacks/platform/modules/dbaas/main.tf:718` indicates PostgreSQL 17.2 requires `pg_upgrade` to data directory but is not implemented. +- Files: `stacks/platform/modules/dbaas/main.tf` (line 718) +- Impact: Cannot upgrade PostgreSQL from 16-master to 17.2. When upgrade is needed, requires manual pg_upgrade procedure; high downtime risk. +- Fix approach: Implement pg_upgrade CronJob or StatefulSet init container that performs in-place upgrade. Test migration path with backup first. + +**TP-Link Gateway Reverse Proxy Not Functional:** +- Issue: Reverse proxy module for TP-Link gateway marked as "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`. +- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95) +- Impact: Gateway access over HTTPS or HTTP routing non-functional. Unknown scope of impact on dependent services. +- Fix approach: Either complete reverse proxy implementation (Traefik/Nginx config) or document why it's disabled. Clarify if gateway is still accessible via HTTP or LAN-only. + +**WireGuard Firewall Rules Incomplete:** +- Issue: Client firewall restrictions not written at `terraform.tfvars:430`. Only placeholder exists. +- Files: `terraform.tfvars` (lines 430-434) +- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can access cluster APIs if firewall rules not enforced at kernel level. 
+- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"). Test with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist. + +## Known Bugs & Issues + +**Immich Database Compatibility Mismatch:** +- Symptoms: Custom PostgreSQL image version mismatch between Immich PostgreSQL pod and dbaas PostgreSQL. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is 16-master with PostGIS/PgVector mix. +- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717) +- Trigger: If Immich database is migrated to shared dbaas PostgreSQL, extension version incompatibility will cause failures. +- Workaround: Keep Immich on isolated PostgreSQL 15 with Immich-specific extensions. If consolidation needed, test extension compatibility first. + +**Realestate-Crawler Latest Image Tag Ignores Updates:** +- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`. +- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82) +- Trigger: New versions of `immoweb:latest` will never be deployed. Terraform ignores image updates; manual image pull/push required. +- Workaround: Use Diun annotations to track image updates. Consider using version-pinned tags instead of `:latest`. Remove `ignore_changes` if auto-updates desired. + +## Security Considerations + +**OpenClaw Has Cluster-Admin Permissions:** +- Risk: OpenClaw ServiceAccount granted unrestricted `cluster-admin` ClusterRoleBinding at `stacks/openclaw/main.tf:41-54`. +- Files: `stacks/openclaw/main.tf` (lines 34-55) +- Current mitigation: `dangerouslyDisableDeviceAuth = true` in config (line 89) disables device auth but relies on network access control. 
+- Recommendations: + - Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g., `get/list/watch pods`, `exec into pods`, `apply resources in specific namespaces`). + - Re-enable device auth or implement mTLS between OpenClaw and operators. + - Audit OpenClaw logs for unauthorized API calls (enable API server audit logs). + +**Git-Crypt Key Mounted as ConfigMap:** +- Risk: git-crypt key at `stacks/openclaw/main.tf:68-76` stored as plain-text ConfigMap data. Any pod on cluster can read it (unless RBAC enforces secrets-only access). +- Files: `stacks/openclaw/main.tf` (lines 68-76) +- Current mitigation: None; ConfigMap is world-readable by default. +- Recommendations: + - Move git-crypt key to Kubernetes Secret instead of ConfigMap. + - Add RBAC policy restricting secret read to openclaw namespace only. + - Consider external secret management (Authentik-backed secret injection, Sealed Secrets). + +**SSH Private Key Stored as Secret:** +- Risk: SSH private key for OpenClaw stored at `stacks/openclaw/main.tf:57-66` as unencrypted Secret. Readable by any pod with secret access. +- Files: `stacks/openclaw/main.tf` (lines 57-66) +- Current mitigation: Secret only readable by openclaw namespace (if RBAC enforced); encryption at rest not confirmed. +- Recommendations: + - Rotate SSH key regularly; consider using ed25519 keys (shorter, stronger). + - Audit Secret access via Kubernetes audit logs. + - Use external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets. + +**WireGuard VPN Clients Unrestricted:** +- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules defined. No per-client segmentation. +- Files: `terraform.tfvars` (lines 430-434) +- Current mitigation: Attempted iptables rules commented out; not enforced. +- Recommendations: + - Define explicit client restrictions in WireGuard firewall script (uncomment/complete lines 433-434). 
+ - Implement deny-by-default firewall (drop all, then allow specific routes). + - Consider separate WireGuard interfaces for different trust levels (admin vs. guest). + +**Multiple `:latest` Image Tags in Production:** +- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others). +- Files: Multiple stacks (see full list in grep output above). +- Current mitigation: Diun annotations track updates but don't auto-pull; images are immutable but unversioned. +- Recommendations: + - Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`). + - Use Diun to track new releases and trigger automated testing in staging. + - Update CI/CD pipeline to require version tags for production deployments. + +## Performance Bottlenecks + +**Insufficient Health Probes on Critical Services:** +- Problem: Only 14 services have liveness/readiness probes out of 70+ services. Missing probes on databases (MySQL, PostgreSQL, Redis), ingress, auth. +- Files: All stacks (identified via grep: 14 instances of liveness/readiness out of 70+ services). +- Cause: Default Kubernetes behavior is to not restart unhealthy pods without probes; cascading failures silent. +- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers). Use TCP/HTTP probes appropriate to each service. + +**Pod Disruption Budgets Missing:** +- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation. +- Files: All stacks (need comprehensive PodDisruptionBudget coverage). +- Cause: PDBs are optional; many assume single-replica stateless services won't need them. +- Improvement path: Add PDB with `minAvailable: 1` to all services with `replicas > 1`. 
For single-replica services, ensure they're marked as non-critical (lower PriorityClass) or accept downtime during node maintenance. + +**Resource Requests Sparse, Limits Missing:** +- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults but CPU limits often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook). +- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`). +- Cause: Request/limit tuning requires load testing; defaults used instead. +- Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage, limits to P99. Monitor via Prometheus and adjust quarterly. + +**Large Terraform Modules (900+ lines):** +- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others > 450 lines. +- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines). +- Cause: Monolithic resource definitions; hard to navigate and test. +- Improvement path: Split large modules into sub-modules (e.g., `dbaas/` → `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration. + +## Fragile Areas + +**Immich Machine Learning GPU Dependency:** +- Files: `stacks/immich/main.tf` (lines 380-450). +- Why fragile: GPU workload (`immich-machine-learning-cuda`) requires Tesla T4 on k8s-node1. If GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently (no fallback). Single GPU point of failure. +- Safe modification: Add `nodeAffinity` to prefer GPU but allow non-GPU fallback (degraded mode). Implement health checks on GPU availability (`nvidia-smi` probe). Test GPU failure scenario before production use. 
+- Test coverage: No tests for GPU unavailability; assumes GPU always available. + +**Nextcloud Backup/Restore Procedures Manual:** +- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps). +- Why fragile: Backup/restore scripts are ConfigMap-based; no automation. Restoration requires manual `kubectl exec` and script execution. No tested recovery procedure. +- Safe modification: Implement automated backup via Velero or CSI snapshots. Test restore procedure monthly via staged environment. +- Test coverage: No automated backup validation; scripts untested. + +**NFS Dependency for Data Persistence:** +- Files: 126 references to NFS volumes across all stacks. +- Why fragile: All stateful data depends on NFS server at `10.0.10.15`. If NFS becomes unavailable, all services lose data immediately (no local caches). No fallback storage. +- Safe modification: Implement NFS client-side read caching (Linux NFS mount options `ac,acregmin=3600`). Monitor NFS availability via Prometheus alerts (Mount point offline). Test NFS failover procedure (if replica NFS exists). +- Test coverage: No chaos engineering tests for NFS unavailability. + +**Istio Injection Disabled Cluster-Wide:** +- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` on namespace labels. +- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services. +- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production. +- Test coverage: No mTLS validation; assumes all pods on same network are trusted. + +**PostgreSQL Custom Image Not Tracked:** +- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`. +- Why fragile: Custom build at Docker Hub with PostGIS + PgVector extensions. No version tag; `:master` tag is mutable. 
Upstream extension versions unknown. +- Safe modification: Pin to semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with Dockerfile tracked in git. Test extension versions against application requirements. +- Test coverage: No tests for extension availability or version compatibility. + +## Scaling Limits + +**Single-Replica Critical Services:** +- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies). +- Limit: Node failure causes immediate service outage. Kubernetes default takes 5+ minutes to reschedule pod. +- Scaling path: Increase critical service replicas to 3 (quorum). Add pod anti-affinity to spread across nodes. Implement PodDisruptionBudget with `minAvailable: 2`. + +**GPU Capacity Bottleneck:** +- Current capacity: 1 Tesla T4 GPU on k8s-node1. +- Limit: Immich ML + Ebook2Audiobook + Ollama all compete for single GPU. Queue time 10+ minutes for CPU-bound inference tasks. +- Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%). + +**NFS Storage Capacity:** +- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups). +- Limit: Immich (image library), Calibre (ebooks), Dawarich (location history) grow unbounded. When storage full, writes fail; services degrade. +- Scaling path: Monitor NFS capacity monthly (`df -h`). Set up Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB Immich/month). + +**MySQL/PostgreSQL Connection Pool:** +- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling. Default pool size likely 100-200 connections. +- Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed pool. New connections queue or fail. 
+- Scaling path: Monitor PgBouncer pool utilization (Prometheus metric `pgbouncer_pools_used_connections`). Increase pool size if > 80% utilization. Consider read replicas for read-heavy workloads. + +**API Rate Limiting & Bandwidth:** +- Current capacity: Services exposed via Traefik ingress. No global rate limiting documented. +- Limit: External tools (Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior possible. +- Scaling path: Implement Traefik rate limiting middleware (Prometheus-aware). Add Cloudflare rate limiting on public domains. Monitor egress bandwidth. + +## Dependencies at Risk + +**Redis Stack `:latest` Tag:** +- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed; breaking changes possible. +- Impact: Unexpected version upgrade could introduce incompatibilities with clients expecting specific command set or module versions. +- Migration plan: Pin to specific Redis Stack version (e.g., `:7.2-rc1`). Test version upgrades in staging first. Monitor Redis logs for deprecated command warnings. + +**Immich `:latest` or Floating Tag:** +- Risk: `stacks/immich/main.tf` pins to `v2.5.6` but Immich frequently releases patch versions. Database migrations can cause downtime. +- Impact: If Immich version upgrades without testing, database migrations could fail or hang (no rollback mechanism). +- Migration plan: Pin to specific patch versions (e.g., `v2.5.6`, not `v2.5`). Test Immich upgrades in staging first. Maintain backup before upgrading. + +**Unsupported MySQL 9.2.0:** +- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.2 is a development version (RC status as of Feb 2026). +- Impact: RC versions not recommended for production. Stability issues, CVEs possible. No long-term support. +- Migration plan: Migrate to MySQL 8.4 LTS or 9.0 GA (stable). Test data migration first. Plan for gradual rollout. 
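The migration plans above share one mechanical step: replacing floating tags with pinned ones so upgrades surface as explicit plan diffs. A hedged sketch — the pinned tag values below are placeholders to verify against upstream release notes, not confirmed versions:

```hcl
locals {
  # Floating tags flagged above:
  #   redis/redis-stack:latest
  #   mysql:9.2.0 (pre-GA line)
  #
  # Pinned alternatives (versions are illustrative — verify before adopting):
  pinned_images = {
    redis = "redis/redis-stack:7.4.0" # hypothetical pinned release
    mysql = "mysql:8.4"               # LTS line recommended above
  }
}

# Referencing local.pinned_images.mysql in a container spec means every
# version bump shows up as a reviewable `terraform plan` diff instead of
# an implicit image pull on pod restart.
```

Diun annotations can keep tracking newer upstream tags while the running version stays pinned.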
+ +**Python Timeouts in Monitoring Scripts:** +- Risk: `stacks/platform/modules/nvidia/main.tf` uses hardcoded `timeout=10` for HTTP requests and subprocess calls. Slow network conditions will fail. +- Impact: GPU monitoring will fail if network is slow or unavailable. Silent failures possible. +- Migration plan: Implement exponential backoff and retry logic (e.g., `tenacity` library). Increase timeout to 30s for unreliable networks. Log timeouts for debugging. + +## Missing Critical Features + +**No Disaster Recovery Plan:** +- Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster. +- Blocks: If cluster data lost, recovery would be manual and time-consuming. No RTO/RPO defined. +- Impact: Data loss risk > 24 hours to recover. + +**No Secrets Rotation Policy:** +- Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule. +- Blocks: If key leaked, manual intervention required to rotate across all services. +- Impact: Leaked credentials persist until discovery. + +**No Cross-Cluster Failover:** +- Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster. +- Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage. +- Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision). + +## Test Coverage Gaps + +**No Infrastructure Testing:** +- What's not tested: Terraform applies, Helm charts, manifests only validated via `terraform plan`. No `terratest`, no functional tests of deployed services. +- Files: All stacks (no test files found). +- Risk: Typos, variable misconfigurations, missing dependencies not caught until production apply. +- Priority: High — add `terratest` to validate Terraform. Test critical paths (database connection, ingress routing). + +**No Chaos Engineering Tests:** +- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions. 
+- Files: All stacks (no chaos tests found). +- Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested. +- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures. + +**No Backup Restoration Tests:** +- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but never executed. +- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`. +- Risk: Backups corrupt or unusable when needed. RPO > 24 hours if discovery slow. +- Priority: High — monthly restore-to-staging test. Automate backup validation. + +**No Security Scanning for Vulnerabilities:** +- What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC). +- Files: All stacks, all container images. +- Risk: Known vulnerabilities deployed to production. No supply chain security. +- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs. 
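For the backup gaps above, the 14-day retention CronJob suggested under Tech Debt could take roughly this shape — the resource name, namespace, schedule, and NFS path are all hypothetical:

```hcl
resource "kubernetes_cron_job_v1" "mysql_backup_rotation" {
  metadata {
    name      = "mysql-backup-rotation"
    namespace = "dbaas" # assumed namespace
  }
  spec {
    schedule = "0 3 * * *" # daily at 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name    = "rotate"
              image   = "busybox:1.36"
              # Delete backup files older than 14 days
              command = ["sh", "-c", "find /backups -type f -mtime +14 -delete"]
              volume_mount {
                name       = "backups"
                mount_path = "/backups"
              }
            }
            volume {
              name = "backups"
              nfs {
                server = "10.0.10.15"
                path   = "/mnt/main/dbaas-backups" # hypothetical share
              }
            }
          }
        }
      }
    }
  }
}
```

A monthly restore-validation job could follow the same shape: restore the newest dump into a scratch database and fail loudly if it cannot.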
+ +--- + +*Concerns audit: 2026-02-23* diff --git a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md new file mode 100644 index 00000000..f1598d19 --- /dev/null +++ b/.planning/codebase/CONVENTIONS.md @@ -0,0 +1,192 @@ +# Coding Conventions + +**Analysis Date:** 2026-02-23 + +## Naming Patterns + +**Terraform Files:** +- `main.tf` - Primary resource definitions and module calls +- `terragrunt.hcl` - Stack-specific Terragrunt configuration +- `variables.tf` - Variable declarations for a stack +- `providers.tf` - Generated by Terragrunt root `terragrunt.hcl` +- `backend.tf` - Generated by Terragrunt for state backend configuration + +**Terraform Variables:** +- snake_case for variable names: `var.tls_secret_name`, `var.dbaas_root_password` +- snake_case for resource names: `resource "kubernetes_namespace" "nextcloud"` +- snake_case for local values: `local.tiers` +- UPPERCASE for environment-like globals in shell: `KUBECONFIG_PATH`, `PASS_COUNT` + +**Resource/Module Names:** +- kebab-case for Kubernetes resources: `nextcloud`, `whiteboard`, `kms-web-page` +- Leading underscore marks module-internal (private) resource names +- Descriptive names matching functionality: `kubernetes_namespace`, `kubernetes_deployment`, `helm_release` + +**Shell Functions:** +- snake_case for function names: `parse_args()`, `count_lines()`, `check_nodes()` +- UPPERCASE for utility color variables: `RED`, `GREEN`, `YELLOW`, `BLUE`, `BOLD`, `NC` + +**Go Package/Test Names:** +- Package-level test functions: `TestContainsVideoMarkers()`, `TestIsDirectVideoContentType()` +- Table-driven test pattern with struct fields: `name`, `body`, `ct`, `want` + +## Code Style + +**Terraform Formatting:** +- Use `terraform fmt -recursive` for consistent formatting +- No explicit linter/formatter config file (tflint/terraform-lint not present) +- Indentation: 2 spaces (standard Terraform convention) +- Multi-line strings use heredoc syntax: `<<-EOT ... EOT` + +## Error Handling + +**Bash:** +- Failures captured and reported inline: `output=$(cmd 2>&1) || { fail "message"; return 0; }` +- `set -euo pipefail` prevents silent failures and undefined var issues +- Error status captured: `$?` implicit via `||` pattern +- Graceful degradation with fallback values or skip-able steps + +**Go:** +- Standard testing error reporting: `t.Errorf()` with formatted messages +- Table-driven test pattern allows multiple related test cases +- Error messages include actual vs expected: `got = %v, want = %v` + +## Logging + +**Framework:** Not formally configured; uses `echo` and `echo -e` for output + +**Bash Logging Patterns:** +- Color-coded output with status prefixes: `${BLUE}[INFO]${NC}`, `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}` +- Helper functions: `info()`, `pass()`, `warn()`, `fail()` - each increments counters and respects `--quiet` flag +- Section headers: `section()` for verbose output, `section_always()` for always-shown sections +- Conditional logging: functions check `$JSON`, `$QUIET` flags and skip output as needed +- JSON output option available via `json_add()` for machine-readable logging +- Detail strings accumulated in variables for final reporting + +**Terraform Logging:** +- Relies on Terraform's built-in CLI output +- Human-readable variable values in descriptions (Terraform renders these on errors) + +## Comments + +**When to Comment:** + +Terraform: +- Section dividers: Major logical groups separated by `# =============================================================================` +- Feature group headers: `# --- Feature Name ---` before variable/module blocks +- Commented-out code: Temporarily disabled resources/modules include explanation (e.g., "Do not use until issue #X is solved") +- Clarifying arbitrary choices: `# anything secret is fine` explains non-obvious variable usage + +Bash: +- Function-level comments: Each check function has purpose on first line +- Complex logic: Comments before conditional blocks explain intent +- Inline comments for edge
cases: `# Skip nodes where metrics are not yet available` +- Header comments: Scripts include usage documentation at top + +**JSDoc/TSDoc:** +- Not used in this codebase (Terraform, Bash, Go only) + +## Function Design + +**Size:** +- Terraform modules typically 20-50 lines for simple services, variable declaration blocks 30-100+ lines +- Bash functions average 20-40 lines, check functions 10-30 lines +- Go test functions 10-60 lines (table + loop) + +**Parameters:** +- Terraform: via `variable` declarations and module input variables +- Bash: positional parameters passed via `$1`, `$2`, etc. with validation in `parse_args()` +- Go: test functions accept `*testing.T` parameter + +**Return Values:** +- Terraform: no explicit returns; resource state is the "return" +- Bash: `return 0` for success, implicit via `echo` output for values, status codes for error handling +- Go: functions tested for boolean returns or calculated values + +**Variables:** +- Terraform: module variables, locals, and resource attributes (computed values) +- Bash: Global state tracked via counters (`PASS_COUNT`, `WARN_COUNT`, `FAIL_COUNT`), local variables in functions with `local` keyword +- Go: table-driven tests use struct fields (no getter/setter pattern) + +## Module Design + +**Exports:** +- Terraform: outputs typically omitted unless another stack depends on them (implicit via dependency blocks) +- Modules called with `source = "./modules/"` or `source = "../../modules/kubernetes/"` +- Module version pinning used for Terraform registry modules: `version = "0.1.5"` + +**Barrel Files:** +- Not applicable (no aggregating re-exports in this codebase) +- Directories: `stacks//` is a unit, `stacks/platform/modules//` groups related modules + +**Module Organization:** +- Single responsibility per module directory +- Each module typically contains: `main.tf` (resources) and optional `variables.tf` for input variables +- Shared Kubernetes utility modules in `modules/kubernetes/`: 
`ingress_factory/`, `setup_tls_secret/` +- Platform services grouped in `stacks/platform/modules/<module>/` + +## Special Patterns + +**Locals for Configuration:** +- Tier definitions centralized as map in locals (each service defines same tiers locally) + ```hcl + locals { + tiers = { + core = "0-core" + cluster = "1-cluster" + gpu = "2-gpu" + edge = "3-edge" + aux = "4-aux" + } + } + ``` +- Tier applied to `kubernetes_namespace` labels and `priority_class_name` for resource governance + +**Inline Config Blocks:** +- YAML/config data stored inline in `main.tf` heredoc strings + +**Internal Service DNS:** +- Naming pattern: `<service>.<namespace>.svc.cluster.local` +- Examples: `mysql.dbaas.svc.cluster.local`, `redis.redis.svc.cluster.local`, `postgresql.dbaas.svc.cluster.local` + +**External Service Access:** +- Cloudflare Tunnel: Provides public HTTPS for services (no direct internet access needed) +- Traefik Ingress: Routes external traffic to internal K8s services +- Technitium (internal DNS) for `.lan` domain resolution + +--- + +*Integration audit: 2026-02-23* diff --git a/.planning/codebase/STACK.md b/.planning/codebase/STACK.md new file mode 100644 index 00000000..417e97a0 --- /dev/null +++ b/.planning/codebase/STACK.md @@ -0,0 +1,129 @@ +# Technology Stack + +**Analysis Date:** 2026-02-23 + +## Languages + +**Primary:** +- HCL (HashiCorp Configuration Language) - Terraform/Terragrunt infrastructure definitions +- Bash - Scripting and cluster management (`scripts/` directory) +- YAML - Kubernetes resource definitions and configuration +- Python - Monitoring and utility scripts in `stacks/platform/modules/` +- TypeScript/JavaScript - k8s-portal frontend and webhook-handler (`stacks/platform/modules/k8s-portal/`, `stacks/webhook_handler/`) + +**Secondary:** +- Go - Service utilities (e.g., the `stacks/f1-stream` scraper) +- Dockerfile - Container image definitions across stacks + +## Runtime + +**Environment:** +- Kubernetes v1.34.2 (5 nodes: k8s-master + k8s-node1-4) +- Linux (Ubuntu cloud images on Proxmox VMs) +- Bash shell for automation + +**Package Manager:** +- npm (Node.js) - for k8s-portal web
UI development + - Lockfile: `package-lock.json` present +- pip (Python) - for utility scripts +- Terraform/Terragrunt - manages all infrastructure dependencies + +## Frameworks + +**Core:** +- Terraform 1.x - Infrastructure-as-Code orchestration +- Terragrunt - State isolation wrapper around Terraform (`terragrunt.hcl` in each stack) +- Kubernetes - Container orchestration (kubectl, Helm, kustomize patterns) + +**Testing:** +- Playwright ^1.58.2 - E2E testing framework (root `package.json`) + +**Build/Dev:** +- Helm - Kubernetes package manager (managed through the hashicorp/helm Terraform provider, v3.1.1) +- Svelte - Frontend framework for k8s-portal (`stacks/platform/modules/k8s-portal/files/` Node.js project) + +## Key Dependencies + +**Critical:** +- hashicorp/kubernetes (3.0.1) - Kubernetes API provider +- hashicorp/helm (3.1.1) - Helm release management +- telmate/proxmox (3.0.2-rc07) - Proxmox VM management (`stacks/infra/`) +- cloudflare/cloudflare (4.52.5) - DNS and tunnel management (`stacks/platform/modules/cloudflared/`) +- hashicorp/null (3.2.4) - Utility provider for local operations +- hashicorp/random (3.8.1) - Random value generation + +**Infrastructure:** +- MySQL 9.2.0 - Relational database (`stacks/platform/modules/dbaas/`) +- PostgreSQL 16.4-bullseye - Primary database with PostGIS/PGVector (`stacks/platform/modules/dbaas/`) +- redis/redis-stack (latest) - In-memory cache and broker (`stacks/platform/modules/redis/`) +- Headscale 0.23.0 - WireGuard control plane (`stacks/platform/modules/headscale/`) + +**Observability:** +- Prometheus - Metrics collection and alerting +- Grafana - Metrics visualization and dashboards +- Loki 3.6.5 - Log aggregation +- Alloy v1.13.0 - Log collector + +**API Gateway & Ingress:** +- Traefik 3.x - Ingress controller and reverse proxy (`stacks/platform/modules/traefik/`) +- MetalLB - Load balancer for Kubernetes service IPs (`stacks/platform/modules/metallb/`) + +**Security:** +- 
Authentik - Identity Provider/OIDC (`stacks/platform/modules/authentik/`) +- Vaultwarden 1.35.2 - Password manager (`stacks/platform/modules/vaultwarden/`) +- CrowdSec - Intrusion detection and IP reputation (`stacks/platform/modules/crowdsec/`) +- Kyverno - Policy enforcement and governance (`stacks/platform/modules/kyverno/`) + +**Container Images Registry:** +- docker.io - Docker Hub public images +- ghcr.io - GitHub Container Registry (Headscale UI, Immich, etc.) +- quay.io - Quay.io registry (inferred from mirror config) +- registry.k8s.io - Kubernetes images +- Local pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040) + +## Configuration + +**Environment:** +- `terraform.tfvars` (git-crypt encrypted) - All secrets, API keys, DNS records, passwords +- Environment variables injected into Kubernetes pods via ConfigMap/Secret +- Kubeconfig: `config` file in repo root (referenced as `$PWD/config` in terragrunt) + +**Build:** +- `terragrunt.hcl` (root) - DRY Terraform provider and backend configuration +- `stacks//terragrunt.hcl` - Per-stack overrides +- `stacks//main.tf` - Kubernetes/Proxmox resource definitions +- `.terraform.lock.hcl` - Provider version lock (Terraform 1.x) +- `.terraform/` - Downloaded providers cached locally + +**Secrets:** +- `secrets/` directory (git-crypt encrypted) +- TLS certificates and keys in `secrets/` (symlinked from stacks) +- OpenDKIM keys for mailserver +- NFS export configuration in `secrets/nfs_directories.txt` + +## Platform Requirements + +**Development:** +- Terraform 1.x CLI +- Terragrunt CLI (uses `terragrunt apply --non-interactive`) +- kubectl configured with kubeconfig at `$PWD/config` +- git-crypt for secret decryption +- curl, bash, standard Unix utilities + +**Production:** +- Kubernetes 1.34.2+ cluster (5 nodes, 192 GB+ total memory) +- Proxmox 8.x hypervisor (`stacks/infra/` provisions VMs) +- NFS storage: TrueNAS at `10.0.10.15` with exports at `/mnt/main/` +- Docker registry pull-through cache at 
`10.0.20.10` +- Cloudflare DNS (public domain `viktorbarzin.me`) +- Technitium DNS (internal domain `viktorbarzin.lan`) + +**Networking:** +- Kubernetes pod CIDR: managed by cluster +- Service IPs: 10.0.20.200-10.0.20.220 (MetalLB layer 2) +- Internal DNS: Technitium at cluster IP +- External DNS: Cloudflare tunnel + traditional DNS records + +--- + +*Stack analysis: 2026-02-23* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md new file mode 100644 index 00000000..e2f5d7be --- /dev/null +++ b/.planning/codebase/STRUCTURE.md @@ -0,0 +1,255 @@ +# Codebase Structure + +**Analysis Date:** 2026-02-23 + +## Directory Layout + +``` +/Users/viktorbarzin/code/infra/ +├── .claude/ # Project-level Claude knowledge (skills, reference docs) +├── .git/ # Git repository metadata +├── .git-crypt/ # git-crypt encryption keys +├── .planning/codebase/ # GSD codebase analysis documents +├── .terraform/ # Terraform cache (gitignored) +├── .woodpecker/ # CI/CD pipeline definitions +├── cli/ # Custom CLI tools (bash/python scripts) +├── diagram/ # Infrastructure diagram sources +├── docs/ # Documentation (deployment guides, design docs) +├── modules/ # Shared Terraform modules (Proxmox, K8s utilities) +├── playbooks/ # Ansible playbooks (infrastructure setup) +├── scripts/ # Maintenance scripts (healthcheck, DNS updates, etc.) 
+├── secrets/ # git-crypt encrypted files (NFS dirs, TLS certs, SSH keys) +├── stacks/ # Terragrunt stacks (platform + ~70 service stacks) +├── state/ # Terraform state files (local backend, gitignored) +├── terragrunt.hcl # Root Terragrunt config (DRY provider/backend setup) +├── terraform.tfvars # All variables + secrets (git-crypt encrypted, ~48KB) +├── config # Kubernetes config (kubeconfig file) +├── README.md # Project overview +└── package.json # Node.js deps (minimal; mostly for cli tools) +``` + +## Directory Purposes + +**`.claude/`:** +- Purpose: Project-level Claude knowledge and execution skills +- Contains: `skills/` (setup-project, authentik workflows), `reference/` (inventory tables, API patterns) +- Key files: `CLAUDE.md` (this file's counterpart with full infrastructure context) + +**`.planning/codebase/`:** +- Purpose: GSD codebase analysis output directory +- Contains: `ARCHITECTURE.md`, `STRUCTURE.md` (this file), and focus-specific docs +- Auto-generated: Yes (by /gsd:map-codebase) + +**`modules/`:** +- Purpose: Reusable Terraform modules for VM creation and Kubernetes utilities +- Contains: + - `create-template-vm/`: Cloud-init Ubuntu template VM provisioning (K8s + non-K8s) + - `create-vm/`: VM instance creation from templates with cloud-init injection + - `docker-registry/`: Docker registry pull-through cache setup + - `kubernetes/`: K8s-specific utilities (ingress_factory, setup_tls_secret) + +**`stacks/`:** +- Purpose: Terragrunt stacks with isolated state and per-service configuration +- Contains: 1 platform stack + ~70 application stacks +- Structure: Each stack is a directory with `terragrunt.hcl` + `main.tf` + optional `factory/` (for multi-instance services) + +**`stacks/platform/`:** +- Purpose: Core infrastructure services (22 modules) +- Contains: Modules for MetalLB, DBaaS, Redis, Traefik, DNS, VPN, auth, monitoring, security +- Key subdirs: `modules/` (platform-specific modules like traefik, authentik, monitoring) + 
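+**State Isolation (sketch):**
+
+Each directory under `stacks/` gets its own Terraform state file, wired up by the root `terragrunt.hcl`. The repo's actual root config is not reproduced here; the following is a hedged sketch of the standard Terragrunt pattern (backend type and exact paths are assumptions inferred from the `state/` layout):
+
+```hcl
+# Illustrative root terragrunt.hcl fragment: every stack that includes this
+# root gets a state file under state/<relative-stack-path>/terraform.tfstate.
+remote_state {
+  backend = "local"
+  generate = {
+    path      = "backend.tf"
+    if_exists = "overwrite_terragrunt"
+  }
+  config = {
+    path = "${get_parent_terragrunt_dir()}/state/${path_relative_to_include()}/terraform.tfstate"
+  }
+}
+```
+
+With this wiring, `terragrunt apply` in `stacks/nextcloud/` touches only `state/stacks/nextcloud/terraform.tfstate`, so a failed apply in one service cannot corrupt another service's state.
+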
+**`stacks/infra/`:** +- Purpose: Proxmox VM template and instance provisioning +- Contains: K8s node templates, docker-registry VM, Proxmox provider configuration + +**`stacks//`:** +- Purpose: Single application stack with isolated state +- Pattern: `terragrunt.hcl` (includes root, declares dependencies) + `main.tf` (resources) + optional `factory/` + optional `chart_values.yaml` +- Examples: `nextcloud/`, `immich/`, `matrix/`, `actualbudget/` (multi-tenant), etc. + +**`secrets/`:** +- Purpose: git-crypt encrypted sensitive files +- Contains: TLS certificates/keys, NFS export list, SSH keys, Dkim keys, Postfix config +- Key files: + - `nfs_directories.txt`: List of NFS shares (sorted); regenerate exports with `nfs_exports.sh` + - `tls/`: TLS certificate chain and keys + - `mailserver/`: OpenDKIM keys, Postfix SASL creds + +**`scripts/`:** +- Purpose: Operational and maintenance automation +- Key scripts: + - `cluster_healthcheck.sh`: 24-point cluster health status + - `renew2.sh`: TLS certificate renewal via certbot + Cloudflare + - `setup_certs.sh`: Initial certificate setup + - `pve_*`: Proxmox management scripts + - `ha_*`: Home Assistant integration scripts + +**`docs/`:** +- Purpose: Design and deployment documentation +- Contains: High-level architecture diagrams, deployment guides, troubleshooting + +**`cli/`:** +- Purpose: Custom CLI utilities +- Contains: Python/bash scripts for common operations (DNS management, NFS, etc.) + +## Key File Locations + +**Entry Points:** +- `terragrunt.hcl`: Root Terragrunt config; invoked by `terragrunt apply` in any stack directory +- `stacks/platform/main.tf`: Platform stack; applies 22 core modules +- `stacks/infra/main.tf`: Infrastructure stack; creates VM templates and docker-registry VM + +**Configuration:** +- `terraform.tfvars`: Central variables file (~48KB, git-crypt encrypted). Used by all stacks. 
Contains: Cloudflare credentials, DNS records, service secrets, TLS secret name +- `stacks/<service>/terragrunt.hcl`: Stack-specific Terragrunt config (includes root, declares `dependency` blocks) +- `stacks/platform/modules/<module>/main.tf`: Platform module implementation (22 modules) + +**Core Logic:** +- `stacks/platform/main.tf`: 1000+ lines; instantiates all 22 platform modules +- `stacks/<service>/main.tf`: 30–450 lines; creates namespaces, Helm releases, Kubernetes resources +- `stacks/<service>/factory/main.tf`: Multi-instance service pattern; called multiple times with different parameters +- `modules/kubernetes/ingress_factory/main.tf`: Traefik ingress + service template with security defaults + +**Testing & Validation:** +- `.woodpecker/`: CI/CD pipeline (pushes platform apply on merge) +- `scripts/cluster_healthcheck.sh`: Manual cluster health validation + +**Kubernetes & Cluster Config:** +- `config`: Kubeconfig file for cluster access +- Namespace pattern: One namespace per service stack +- TLS secret: `tls-secret` injected into all namespaces via `setup_tls_secret` module + +## Naming Conventions + +**Files:** +- `main.tf`: Primary Terraform resource file per stack +- `terragrunt.hcl`: Terragrunt-specific configuration (includes root, dependencies) +- `terraform.tfvars`: Global variables (git-crypt encrypted) +- `chart_values.yaml`: Helm chart values template (uses templatefile for variable substitution) +- `*_values.tpl`: Helm values template (evaluated with templatefile) +- `.terraform.lock.hcl`: Provider lock file (one per stack) + +**Directories:** +- `stacks/<service>/`: Kebab-case service names (e.g., `real-estate-crawler`, `k8s-dashboard`) +- `stacks/platform/modules/<module>/`: Kebab-case module names +- `state/stacks/<service>/`: Mirrored state directory structure +- `secrets/`: Single top-level directory for all encrypted files +- `modules/kubernetes/`, `modules/create-template-vm/`: Category-based grouping + +**Terraform Resources:** +- **Kubernetes**: `kubernetes_*` (namespace, deployment,
service, configmap, etc.) +- **Helm**: `helm_release` (Helm chart deployments) +- **Local files**: `local_file` (for generated scripts and configs) +- **Module calls**: `module "<name>"` (e.g., `module "traefik"`, `module "redis"`) + +**Variables:** +- Snake_case: `tls_secret_name`, `crowdsec_api_key`, `nextcloud_db_password` +- Service-prefixed: `<service>_<name>` (e.g., `authentik_secret_key`, `mailserver_accounts`) + +## Where to Add New Code + +**New Service Stack:** +1. Create `stacks/<service>/` directory +2. Add `terragrunt.hcl`: + ```hcl + include "root" { + path = find_in_parent_folders() + } + dependency "platform" { + config_path = "../platform" + skip_outputs = true + } + ``` +3. Create `main.tf` with: + - Variable declarations for required inputs from `terraform.tfvars` + - `locals { tiers = { ... } }` (copy from existing stack) + - `kubernetes_namespace` resource with tier label + - `module "tls_secret"` call to `../../modules/kubernetes/setup_tls_secret` + - Service-specific resources (Helm releases, Deployments, etc.) +4. Add Cloudflare DNS records in `terraform.tfvars` if needed +5. Create optional `secrets/` symlink: `ln -s ../../secrets secrets` +6. Apply: `cd stacks/<service> && terragrunt apply --non-interactive` + +**Multi-Tenant Service (using Factory Pattern):** +1. Create parent stack: `stacks/<service>/main.tf` with namespace + TLS setup +2. Create `stacks/<service>/factory/main.tf` with single-instance logic +3. In parent, call factory multiple times: + ```hcl + module "instance1" { + source = "./factory" + name = "instance1" + # ... other params + } + ``` +4. Example: `stacks/actualbudget/` has factory instantiated for viktor, anca, emo + +**New Platform Module:** +1. Create `stacks/platform/modules/<module>/` directory +2. Add `main.tf` with resources (Helm chart, namespace, ConfigMaps, etc.) +3. Add `variables.tf` or declare variables in `main.tf` +4. In `stacks/platform/main.tf`, add module call: + ```hcl + module "<module>" { + source = "./modules/<module>" + tier = local.tiers.<tier> + # ... 
pass required variables + } + ``` +5. Add variable declarations in `stacks/platform/main.tf` + +**New Shared Module:** +1. Create `modules/kubernetes//` or `modules/terraform//` +2. Add `main.tf` with reusable resources +3. Declare clear variable inputs and output any useful values +4. Call from service stacks: `module "" { source = "../../modules/kubernetes/" ... }` + +**Utilities & Scripts:** +- Shared helpers: `scripts/` directory +- Custom CLI tools: `cli/` directory +- CI/CD pipelines: `.woodpecker/` + +## Special Directories + +**`state/`:** +- Purpose: Terraform state files (local backend) +- Generated: Yes (automatically by Terragrunt) +- Committed: No (gitignored; backed up separately) +- Structure: `state/stacks//terraform.tfstate` + +**`secrets/`:** +- Purpose: git-crypt encrypted secrets and sensitive config +- Generated: No (managed manually or via scripts) +- Committed: Yes (encrypted via git-crypt) +- Contents: TLS certs, SSH keys, NFS export list, mailserver config, Dkim keys + +**`.terraform/`:** +- Purpose: Terraform provider cache +- Generated: Yes (by Terraform during init) +- Committed: No (gitignored) + +**`node_modules/`:** +- Purpose: Node.js dependencies for CLI tools +- Generated: Yes (by npm install) +- Committed: No (gitignored; use lockfile) + +## File Patterns & Imports + +**Terragrunt Patterns:** +- Include root: `include "root" { path = find_in_parent_folders() }` +- Declare dependencies: `dependency "platform" { config_path = "../platform"; skip_outputs = true }` +- Variable access: `var.` in `main.tf` (variables sourced from `terraform.tfvars`) + +**Kubernetes Resource Patterns:** +- Namespace per service: `kubernetes_namespace.` with tier label +- Helm releases: `helm_release.` with `templatefile` for values +- Inline NFS volumes: `volume { name = "data"; nfs { server = "10.0.10.15"; path = "/mnt/main/" } }` +- TLS injection: Every stack calls `module "tls_secret"` to populate namespace secret + +**Module Call Pattern:** +- 
Standard: `module "" { source = "./modules/" ... }` +- Platform modules: `source = "./modules/"` +- Shared modules: `source = "../../modules/kubernetes/"` + +--- + +*Structure analysis: 2026-02-23* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md new file mode 100644 index 00000000..6513ef9f --- /dev/null +++ b/.planning/codebase/TESTING.md @@ -0,0 +1,279 @@ +# Testing Patterns + +**Analysis Date:** 2026-02-23 + +## Test Framework + +**Language-Specific Runners:** + +**Go:** +- Runner: `go test` (standard library `testing` package) +- Config: No config file (uses built-in conventions) +- Run Commands: + ```bash + go test ./... # Run all tests + go test -v ./... # Verbose output + go test -run TestContains ./... # Run specific test + go test -cover ./... # Show coverage + ``` + +**Bash:** +- Runner: Custom shell scripts in `scripts/` +- No formal test framework; uses `set -euo pipefail` for error handling +- Manual health checks via `bash scripts/cluster_healthcheck.sh` + +**Terraform:** +- Framework: No automated testing detected (no terraform test files, no tftest.hcl) +- Validation: Manual `terraform validate`, `terraform plan`, visual inspection +- Integration: Terragrunt applies validate before execution + +## Test File Organization + +**Location:** +- Go tests: Co-located with source code: `/files/internal/scraper/validate_test.go` +- Shell/Infrastructure: No test files (manual validation/health checks only) + +**Naming:** +- Go: `*_test.go` suffix +- Script tests: `.sh` for check/validation scripts + +**Structure:** +``` +stacks/f1-stream/files/internal/scraper/ +├── main.go +├── validate.go +└── validate_test.go # Test file co-located +``` + +## Test Structure + +**Go Table-Driven Tests:** + +```golang +func TestContainsVideoMarkers(t *testing.T) { + tests := []struct { + name string + body string + want bool + }{ + { + name: "video tag", + body: `
<video></video>`, + want: true, + }, + // ... more test cases + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := containsVideoMarkers(tt.body) + if got != tt.want { + t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want) + } + }) + } +} +``` + +**Patterns:** +- Slice of anonymous structs with `name`, input fields, and `want` for expected result +- Loop with `t.Run(tt.name, ...)` for individual test case execution and reporting +- Descriptive test case names: `"video tag"`, `"HLS manifest reference"`, `"empty string"` +- Separate positive cases (upper) and negative cases (lower) with comments + +**Bash Health Check Structure:** +```bash +check_nodes() { + section 1 "Node Status" + local nodes not_ready versions unique_versions detail="" + + nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; json_add "node_status" "FAIL" "Cannot reach cluster"; return 0; } + # ... processing + if [[ -n "$not_ready" ]]; then + fail "NotReady nodes: $not_ready" + json_add "node_status" "FAIL" "$detail" + elif [[ "$unique_versions" -gt 1 ]]; then + warn "Version mismatch..." + json_add "node_status" "WARN" "$detail" + else + pass "All nodes Ready..." 
+ json_add "node_status" "PASS" "$detail" + fi +} +``` + +**Patterns:** +- Each check function follows same structure: setup → validation → status reporting +- Status reported via `pass()`, `warn()`, `fail()` helper functions +- JSON output optional via `json_add()` for programmatic consumption +- Error handling inline with `||` fallback and graceful degradation + +## Mocking + +**Framework:** +- Go: No mocking framework detected (table-driven tests use real function calls) +- Bash: External commands mocked implicitly (KUBECONFIG override, kubectl invocation through `$KUBECTL` variable) + +**Patterns (Go):** +- No mock objects or stubs +- Real function behavior tested directly +- Test data provided as input in struct fields + +**Patterns (Bash):** +```bash +# Kubeconfig override allows testing against different clusters +KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH" +nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; return 0; } +``` + +**What NOT to Mock:** +- Core functionality being tested (test actual behavior) +- Standard library functions (test integration) + +**What to Mock (Bash):** +- External kubectl calls via variable indirection: allows `KUBECONFIG` override +- Conditional output by flag: `--json`, `--quiet` flags change output, not behavior + +## Fixtures and Factories + +**Test Data (Go):** +- Inline strings in struct fields: HTML content, MIME types +- Examples from `validate_test.go`: + ```golang + { + name: "HLS manifest reference", + body: `var url = "https://cdn.example.com/live.m3u8";`, + want: true, + }, + ``` + +**Location:** +- Embedded directly in test file as struct field values +- No separate fixture files or factories + +**Bash Fixtures:** +- Real cluster fixtures: tests run against actual Kubernetes cluster +- No data files; tests fetch live state via kubectl + +## Coverage + +**Requirements:** None enforced (no coverage thresholds, targets, or CI/CD gates detected) + +**View Coverage (Go):** +```bash +go 
test -cover ./... # Show coverage percentages +go test -coverprofile=coverage.out ./... +go tool cover -html=coverage.out # Open HTML report +``` + +**Note:** Coverage tools not integrated into CI/CD pipeline; manual check only. + +## Test Types + +**Unit Tests (Go):** +- Scope: Single function validation +- Approach: Table-driven with parameterized inputs +- Example: `TestContainsVideoMarkers()` tests HTML content detection +- Example: `TestIsDirectVideoContentType()` tests MIME type classification +- In file: `stacks/f1-stream/files/internal/scraper/validate_test.go` + +**Integration Tests:** +- Bash health checks (`scripts/cluster_healthcheck.sh`) serve as integration tests +- Tests 24 separate checks against live Kubernetes cluster: + - Node status and readiness + - Node resource utilization + - Container metrics + - Pod crash loops + - Persistent volume health + - DNS resolution + - Networking + - RBAC + - Logs aggregation +- Can run with `--fix` flag for auto-remediation +- Can output JSON for CI integration + +**E2E Tests:** +- Not formally implemented +- Manual validation via Terragrunt apply → cluster state verification + +**Infrastructure Testing:** +- Terraform: `terraform validate` and `terraform plan` provide syntax/logic validation +- Application health: Manual checks via scripts and cluster_healthcheck.sh +- No automated test suite for infrastructure code + +## Common Patterns + +**Async Testing (Go):** +- Not applicable (synchronous function testing only) + +**Error Testing (Go):** +```golang +{ + name: "empty string", + body: "", + want: false, +}, +``` +- Negative test cases included in same table +- Error/edge cases named descriptively: `"empty string"`, `"reddit link page"` +- Expected failure behavior verified: `want: false` for invalid inputs + +**Error Reporting (Go):** +```golang +t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want) +``` +- Formatted message includes: function name, input (truncated), 
actual, expected +- Test name automatically prefixed by `t.Run(tt.name, ...)` + +**Status Reporting (Bash):** +- Color-coded status: `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}` +- Counter incremented per status +- Optional quiet mode (`--quiet`) suppresses PASS output +- Optional JSON output (`--json`) for CI integration +- Summary printed at end: `$PASS_COUNT/$WARN_COUNT/$FAIL_COUNT` + +## Running Tests + +**Go Tests:** +```bash +# From service directory containing *_test.go +go test -v ./... +``` + +**Bash Health Checks:** +```bash +# Comprehensive checks +bash scripts/cluster_healthcheck.sh + +# Quiet mode (WARN/FAIL only) +bash scripts/cluster_healthcheck.sh --quiet + +# Auto-fix mode +bash scripts/cluster_healthcheck.sh --fix + +# JSON output +bash scripts/cluster_healthcheck.sh --json + +# Custom kubeconfig +bash scripts/cluster_healthcheck.sh --kubeconfig /path/to/config +``` + +**Terraform Validation:** +```bash +# Format check +terraform fmt -recursive + +# Syntax validation +terraform validate + +# Plan without apply +terraform plan + +# From stack directory +cd stacks/ && terragrunt plan +cd stacks/ && terragrunt apply --non-interactive +``` + +--- + +*Testing analysis: 2026-02-23* diff --git a/stacks/platform/modules/mailserver/roundcubemail.tf b/stacks/platform/modules/mailserver/roundcubemail.tf index 80bb6e18..ce77f0d2 100644 --- a/stacks/platform/modules/mailserver/roundcubemail.tf +++ b/stacks/platform/modules/mailserver/roundcubemail.tf @@ -57,7 +57,7 @@ resource "kubernetes_deployment" "roundcubemail" { spec { container { name = "roundcube" - image = "roundcube/roundcubemail:1.6-apache" + image = "roundcube/roundcubemail:1.6.13-apache" # Uncomment me to mount additional settings # volume_mount { # name = "imap-config"