docs: map existing codebase

parent 275eb5aec8
commit c8e9c41afc
8 changed files with 1475 additions and 1 deletion

.planning/codebase/ARCHITECTURE.md (new file, 165 lines)

# Architecture

**Analysis Date:** 2026-02-23

## Pattern Overview

**Overall:** Terragrunt-based IaC with per-service state isolation, using Kubernetes as the primary platform and Proxmox for VM infrastructure.

**Key Characteristics:**

- Monorepo containing ~70 service stacks with independent state files
- Declarative, GitOps-driven infrastructure using Terraform + Terragrunt
- DRY provider/backend configuration via the root `terragrunt.hcl`
- Clear layering: platform (core/cluster services) → application stacks → shared modules
- Service decoupling with explicit dependencies via `dependency` blocks
- Resource governance through a Kubernetes tier system (`0-core` through `4-aux`)
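The `dependency` blocks mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual file; the `config_path` and mock-output settings are assumptions:

```hcl
# Hypothetical stacks/<service>/terragrunt.hcl sketch — shape only,
# not copied from the repository.
include "root" {
  path = find_in_parent_folders()
}

# Ensure the platform stack is applied before this service.
dependency "platform" {
  config_path = "../platform"

  # Allow `plan`/`validate` to run before platform has ever been applied.
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
  mock_outputs                            = {}
}
```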
## Layers

**Platform Layer (`stacks/platform/main.tf`):**

- Purpose: Core infrastructure services that enable all application stacks (22 modules)
- Location: `stacks/platform/`
- Contains: MetalLB, DBaaS, Redis, Traefik, Technitium DNS, Headscale VPN, Authentik SSO, RBAC, CrowdSec, Prometheus/Grafana/Loki monitoring, nginx reverse proxy, mailserver, GPU node configuration, Kyverno policy engine
- Depends on: Kubernetes cluster (declared via `stacks/infra` dependency), external secrets in `terraform.tfvars`
- Used by: All application stacks declare `dependency "platform"` to ensure the platform is applied first

**Infrastructure Layer (`stacks/infra/main.tf`):**

- Purpose: VM template provisioning and Proxmox resource management
- Location: `stacks/infra/`
- Contains: K8s node templates via cloud-init, docker-registry VM, Proxmox VM lifecycle
- Depends on: Proxmox API credentials
- Used by: The platform stack depends on it to ensure infrastructure is ready

**Application Stacks (~70 services):**

- Purpose: User-facing and supplementary services (Nextcloud, Immich, Matrix, Ollama, etc.)
- Location: `stacks/<service>/main.tf` (102 total stacks)
- Contains: Kubernetes namespaces, Helm releases, raw Kubernetes resources (Deployments, StatefulSets, Services, PersistentVolumes)
- Depends on: Platform stack, shared TLS secret via `modules/kubernetes/setup_tls_secret`, optional NFS volumes
- Used by: Self-contained; declared dependencies control execution order

**Shared Modules:**

- **Kubernetes utilities** (`modules/kubernetes/`):
  - `ingress_factory/`: Reusable Traefik ingress + service template with anti-AI scraping, CrowdSec integration, rate limiting, and authentication support
  - `setup_tls_secret/`: TLS certificate secret setup in namespaces
- **Terraform modules** (`modules/`):
  - `create-template-vm/`: Ubuntu cloud-init template VM provisioning (K8s and non-K8s variants)
  - `create-vm/`: VM instance creation from templates
  - `docker-registry/`: Docker registry pull-through cache configuration
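A typical application stack ties these layers together. A minimal, hypothetical `stacks/<service>/main.tf` skeleton might look like the following (resource names and the `setup_tls_secret` variable names are assumptions, not taken from the repo):

```hcl
# Hypothetical service stack skeleton — names are illustrative.
resource "kubernetes_namespace" "svc" {
  metadata {
    name = "example-service"
  }
}

# Inject the shared TLS secret into the namespace.
module "tls" {
  source    = "../../modules/kubernetes/setup_tls_secret"
  namespace = kubernetes_namespace.svc.metadata[0].name
}

# The workload itself: a Helm release or raw Kubernetes resources.
resource "helm_release" "svc" {
  name      = "example-service"
  namespace = kubernetes_namespace.svc.metadata[0].name
  chart     = "example-chart"
  atomic    = true
}
```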
## Data Flow

**Infrastructure Provisioning Flow:**

1. **Initialize**: Root `terragrunt.hcl` loads `terraform.tfvars` globally and generates provider/backend configs
2. **Infra Stack Apply**: `stacks/infra/` creates/updates Proxmox VMs and Kubernetes node templates
3. **Platform Apply**: `stacks/platform/` applies all ~22 core services (depends on the infra stack)
4. **Service Apply**: Individual `stacks/<service>/` stacks apply their resources (depend on the platform stack)

Example dependency chain for Nextcloud:

```
stacks/infra/main.tf (VMs)
    ↓ (dependency)
stacks/platform/main.tf (Traefik, Redis, DBaaS, etc.)
    ↓ (dependency)
stacks/nextcloud/main.tf (Nextcloud Helm chart + storage)
```
**State Management:**

- Each stack has isolated state at `state/stacks/<service>/terraform.tfstate`
- Root `terragrunt.hcl` defines a local backend: `path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"`
- Variables flow from `terraform.tfvars` → each stack's `terraform` block → Terraform execution
- Unused variables are silently ignored (Terraform 1.x behavior)

**Configuration Flow:**

1. User edits `terraform.tfvars` (encrypted via git-crypt)
2. Each stack includes the root Terragrunt config: `include "root" { path = find_in_parent_folders() }`
3. Root config injects `terraform.tfvars` via `required_var_files`
4. Stack-specific `main.tf` declares which variables it uses
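The root configuration described above might be sketched like this. It is a hedged reconstruction from the documented backend path and var-file behavior, not the actual file:

```hcl
# Hypothetical root terragrunt.hcl sketch, reconstructed from the notes above.
remote_state {
  backend = "local"
  config = {
    path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
}

terraform {
  extra_arguments "vars" {
    commands           = ["plan", "apply", "destroy"]
    required_var_files = ["${get_repo_root()}/terraform.tfvars"]
  }
}
```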
## Key Abstractions

**Tier System:**

- Purpose: Resource governance via Kubernetes PriorityClasses, LimitRanges, and ResourceQuotas
- Tiers: `0-core` (critical: ingress, DNS, auth) → `4-aux` (optional workloads)
- Applied via: Kyverno policy engine in `stacks/platform/modules/kyverno/`
- Usage: Every namespace/pod is labeled with a tier; Kyverno generates the corresponding LimitRange + ResourceQuota
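Opting a namespace into a tier might look like the following sketch. The label key `tier` is an assumption; the repo's actual key may differ:

```hcl
# Hypothetical namespace with a tier label; Kyverno would then generate
# the matching LimitRange and ResourceQuota. The label key is an assumption.
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example-service"
    labels = {
      tier = "4-aux" # optional workload; "0-core" is reserved for critical services
    }
  }
}
```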
**Service Factory Pattern:**

- Purpose: Multi-tenant/multi-instance services (Actual Budget, Freedify)
- Pattern: Parent stack (`stacks/<service>/main.tf`) creates the namespace + TLS secret, then calls the `factory/` module multiple times
- Examples: `stacks/actualbudget/main.tf` calls `factory/` for the viktor, anca, and emo instances
- Each instance: Separate pod, service, NFS share, and Cloudflare DNS entry
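The repeated factory calls can be sketched as follows (a hedged illustration; the factory module likely takes more variables than shown):

```hcl
# Sketch of the factory pattern in a parent stack — only the documented
# `name` parameter is shown; other variables are omitted.
module "viktor" {
  source = "./factory"
  name   = "viktor"
}

module "anca" {
  source = "./factory"
  name   = "anca"
}
```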
**Ingress Factory (`modules/kubernetes/ingress_factory/`):**

- Purpose: DRY, opinionated Traefik ingress pattern with security defaults
- Variables: `name`, `namespace`, `port`, `host`, `protected`, `anti_ai_scraping` (default true)
- Provides: Service, Ingress, CrowdSec exemptions, rate limiting, Authentik ForwardAuth integration, anti-AI middleware
- Anti-AI layers: Bot blocking → X-Robots-Tag → Trap links → Tarpit → Poison content cache
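A call using the documented variables might look like this sketch (exact value formats, e.g. for `host`, are assumptions):

```hcl
# Sketch of an ingress_factory call — variable names come from the list
# above; the values are placeholders.
module "ingress" {
  source    = "../../modules/kubernetes/ingress_factory"
  name      = "example-service"
  namespace = "example-service"
  port      = 8080
  host      = "example.viktorbarzin.me"
  protected = true # enable Authentik ForwardAuth

  # anti_ai_scraping defaults to true; set to false to opt out per-service.
}
```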
**NFS Volume Pattern:**

- Purpose: Persistent storage for stateful services
- Pattern: Inline NFS volumes in pod specs (preferred over PV/PVC)
- Server: `10.0.10.15` (TrueNAS)
- Paths: `/mnt/main/<service>` or `/mnt/main/<service>/<instance>`
- Used by: ~60 services; registered in `secrets/nfs_directories.txt` (git-crypt encrypted)
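In the Terraform Kubernetes provider, the inline pattern looks roughly like this fragment of a `kubernetes_deployment` pod spec (volume/mount names and the service path are illustrative):

```hcl
# Fragment of a pod template spec — not a complete resource.
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/example-service"
  }
}

# ...and the matching mount inside the container block:
volume_mount {
  name       = "data"
  mount_path = "/data"
}
```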
## Entry Points

**Terragrunt Root (`terragrunt.hcl`):**

- Location: `/Users/viktorbarzin/code/infra/terragrunt.hcl`
- Triggers: `cd stacks/<service> && terragrunt plan/apply --non-interactive`
- Responsibilities: Load providers, backend, and `terraform.tfvars`; set the kube config path

**Platform Stack (`stacks/platform/main.tf`):**

- Location: `stacks/platform/main.tf` (1000+ lines)
- Triggers: Applied before any service stack to ensure platform services exist
- Responsibilities: 22 module instantiations, tier definition, variable collection from tfvars

**Service Stacks (`stacks/<service>/main.tf`):**

- Location: `stacks/<service>/main.tf` (27–456 lines, avg ~130)
- Triggers: `terragrunt apply --non-interactive` in the service directory
- Responsibilities: Create the namespace, set up TLS, instantiate Helm charts or raw K8s resources, configure storage

**Proxmox/Infra Stack (`stacks/infra/main.tf`):**

- Location: `stacks/infra/main.tf` (200+ lines)
- Triggers: Applied first to ensure VM infrastructure is available
- Responsibilities: VM template creation, VM instance lifecycle, containerd mirror config

**Factory Module (`stacks/<service>/factory/main.tf`):**

- Location: `stacks/actualbudget/factory/main.tf`, `stacks/freedify/factory/main.tf`
- Triggers: Called multiple times from the parent `main.tf` with a different `name` parameter
- Responsibilities: Single-instance deployment (pod, service, NFS share, ingress)
## Error Handling

**Strategy:** Declarative state reconciliation (Terraform/Kubernetes watch loop). No imperative error recovery.

**Patterns:**

- **Helm deployments**: `atomic = true` for rollback on failure
- **Terraform apply**: `--non-interactive` to prevent hanging on prompts
- **Cloud-init VM provisioning**: Embedded error logging in scripts; check `/var/log/cloud-init-output.log` on the VM
- **Dependencies**: Explicit `dependency` blocks prevent applying a child before its parent
- **Validation**: `terraform plan` executed by CI before apply
- **Secrets**: git-crypt locking ensures encrypted state is checked into the repo; no accidental plaintext commits
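The atomic-rollback pattern for Helm deployments can be sketched as (chart and repository values are placeholders):

```hcl
# Sketch of the `atomic = true` pattern — names/URLs are placeholders.
resource "helm_release" "example" {
  name       = "example"
  namespace  = "example"
  repository = "https://charts.example.org"
  chart      = "example"

  # Roll the release back automatically if any resource fails to deploy.
  atomic = true
}
```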
## Cross-Cutting Concerns

**Logging:** Loki + Alloy (a DaemonSet collects container logs), configured in `stacks/platform/modules/monitoring/`

**Validation:**

- Terraform validation: `terraform validate` in the CI/CD pipeline
- HCL formatting: `terraform fmt -recursive`
- Kyverno policies: Enforce resource requests, tier labels, and pod security standards

**Authentication:**

- **Kubernetes API**: OIDC via Authentik (issuer: `https://authentik.viktorbarzin.me/application/o/kubernetes/`)
- **Traefik Ingress**: Authentik ForwardAuth when `protected = true` in ingress_factory
- **TLS**: Shared secret injected into all namespaces via the `setup_tls_secret` module

**Rate Limiting:** Traefik middleware `default-rate-limit` (applied by ingress_factory unless `skip_default_rate_limit = true`)

**Anti-AI Scraping:** 5-layer defense (bot blocking → headers → trap links → tarpit → poison content) applied via `anti_ai_scraping = true` (the default) in ingress_factory; disable per-service with `anti_ai_scraping = false`

---

*Architecture analysis: 2026-02-23*

.planning/codebase/CONCERNS.md (new file, 244 lines)

# Codebase Concerns

**Analysis Date:** 2026-02-23

## Tech Debt

**MySQL Backup Rotation Not Implemented:**

- Issue: Backup rotation logic exists (comment at `stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size is noted as 11MB; rotation was deferred.
- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206)
- Impact: The backup directory can grow unbounded; no automated retention policy is enforced. Manual cleanup required.
- Fix approach: Implement a full rotation schedule using `find -mtime +N` or migrate to an external backup solution (Velero, pgBackRest). Set up a CronJob with proper retention (e.g., 14-day backups).
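The `find -mtime +N` approach can be sketched as a small helper (a hedged example; the dump suffix and 14-day retention default are assumptions, not the module's actual layout):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical rotation helper: delete dumps older than the retention window.
# The *.sql suffix and 14-day default are assumptions.
rotate_backups() {
  local dir="$1" days="${2:-14}"
  find "$dir" -name '*.sql' -type f -mtime "+${days}" -delete
}
```

Wired into a CronJob, this keeps the backup directory bounded without external tooling.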
**PostgreSQL Major Version Upgrade Blocked:**

- Issue: A comment at `stacks/platform/modules/dbaas/main.tf:718` indicates PostgreSQL 17.2 requires `pg_upgrade` against the data directory, which is not implemented.
- Files: `stacks/platform/modules/dbaas/main.tf` (line 718)
- Impact: Cannot upgrade PostgreSQL from 16-master to 17.2. When an upgrade is needed, it requires a manual `pg_upgrade` procedure; high downtime risk.
- Fix approach: Implement a `pg_upgrade` CronJob or StatefulSet init container that performs an in-place upgrade. Test the migration path with a backup first.

**TP-Link Gateway Reverse Proxy Not Functional:**

- Issue: The reverse proxy module for the TP-Link gateway is marked "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`.
- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95)
- Impact: Gateway access over HTTPS or HTTP routing is non-functional. The scope of impact on dependent services is unknown.
- Fix approach: Either complete the reverse proxy implementation (Traefik/Nginx config) or document why it is disabled. Clarify whether the gateway is still accessible via HTTP or LAN-only.

**WireGuard Firewall Rules Incomplete:**

- Issue: Client firewall restrictions are not written at `terraform.tfvars:430`. Only a placeholder exists.
- Files: `terraform.tfvars` (lines 430-434)
- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can access cluster APIs if firewall rules are not enforced at the kernel level.
- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"). Test with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist.
## Known Bugs & Issues

**Immich Database Compatibility Mismatch:**

- Symptoms: Custom PostgreSQL image version mismatch between the Immich PostgreSQL pod and the dbaas PostgreSQL. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is 16-master with a PostGIS/PgVector mix.
- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717)
- Trigger: If the Immich database is migrated to the shared dbaas PostgreSQL, extension version incompatibility will cause failures.
- Workaround: Keep Immich on its isolated PostgreSQL 15 with Immich-specific extensions. If consolidation is needed, test extension compatibility first.

**Realestate-Crawler Latest Image Tag Ignores Updates:**

- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`.
- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82)
- Trigger: New versions of `immoweb:latest` will never be deployed. Terraform ignores image updates; a manual image pull/push is required.
- Workaround: Use Diun annotations to track image updates. Consider version-pinned tags instead of `:latest`. Remove `ignore_changes` if auto-updates are desired.
## Security Considerations

**OpenClaw Has Cluster-Admin Permissions:**

- Risk: The OpenClaw ServiceAccount is granted an unrestricted `cluster-admin` ClusterRoleBinding at `stacks/openclaw/main.tf:41-54`.
- Files: `stacks/openclaw/main.tf` (lines 34-55)
- Current mitigation: `dangerouslyDisableDeviceAuth = true` in the config (line 89) disables device auth and relies on network access control.
- Recommendations:
  - Scope OpenClaw RBAC to the specific namespaces/resources needed for skill execution (e.g., `get/list/watch pods`, exec into pods, apply resources in specific namespaces).
  - Re-enable device auth or implement mTLS between OpenClaw and operators.
  - Audit OpenClaw logs for unauthorized API calls (enable API server audit logs).

**Git-Crypt Key Mounted as ConfigMap:**

- Risk: The git-crypt key at `stacks/openclaw/main.tf:68-76` is stored as plain-text ConfigMap data. Any pod on the cluster can read it (unless RBAC enforces secrets-only access).
- Files: `stacks/openclaw/main.tf` (lines 68-76)
- Current mitigation: None; the ConfigMap is world-readable by default.
- Recommendations:
  - Move the git-crypt key to a Kubernetes Secret instead of a ConfigMap.
  - Add an RBAC policy restricting secret reads to the openclaw namespace only.
  - Consider external secret management (Authentik-backed secret injection, Sealed Secrets).

**SSH Private Key Stored as Secret:**

- Risk: The SSH private key for OpenClaw is stored at `stacks/openclaw/main.tf:57-66` as an unencrypted Secret, readable by any pod with secret access.
- Files: `stacks/openclaw/main.tf` (lines 57-66)
- Current mitigation: The Secret is only readable by the openclaw namespace (if RBAC is enforced); encryption at rest is not confirmed.
- Recommendations:
  - Rotate the SSH key regularly; consider ed25519 keys (shorter, stronger).
  - Audit Secret access via Kubernetes audit logs.
  - Use an external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.

**WireGuard VPN Clients Unrestricted:**

- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules are defined. No per-client segmentation.
- Files: `terraform.tfvars` (lines 430-434)
- Current mitigation: Attempted iptables rules are commented out; not enforced.
- Recommendations:
  - Define explicit client restrictions in the WireGuard firewall script (uncomment/complete lines 433-434).
  - Implement a deny-by-default firewall (drop all, then allow specific routes).
  - Consider separate WireGuard interfaces for different trust levels (admin vs. guest).

**Multiple `:latest` Image Tags in Production:**

- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others).
- Files: Multiple stacks.
- Current mitigation: Diun annotations track updates but don't auto-pull; images are immutable but unversioned.
- Recommendations:
  - Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`).
  - Use Diun to track new releases and trigger automated testing in staging.
  - Update the CI/CD pipeline to require version tags for production deployments.
## Performance Bottlenecks

**Insufficient Health Probes on Critical Services:**

- Problem: Only 14 of 70+ services have liveness/readiness probes. Probes are missing on databases (MySQL, PostgreSQL, Redis), ingress, and auth.
- Files: All stacks (identified via grep: 14 instances of liveness/readiness across 70+ services).
- Cause: Without probes, Kubernetes does not restart unhealthy pods; cascading failures stay silent.
- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers). Use TCP/HTTP probes appropriate to each service.
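In the Terraform Kubernetes provider, such probes sit inside a container block; a sketch (fragment; port and path are placeholders):

```hcl
# Probe fragment for a container block in a kubernetes_deployment —
# not a complete resource; path and port are placeholders.
liveness_probe {
  http_get {
    path = "/healthz"
    port = 8080
  }
  initial_delay_seconds = 10
  period_seconds        = 15
}

readiness_probe {
  tcp_socket {
    port = 8080
  }
  period_seconds = 5
}
```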
**Pod Disruption Budgets Missing:**

- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
- Files: All stacks (need comprehensive PodDisruptionBudget coverage).
- Cause: PDBs are optional; many assume single-replica stateless services won't need them.
- Improvement path: Add a PDB with `minAvailable: 1` to all services with `replicas > 1`. For single-replica services, ensure they're marked as non-critical (lower PriorityClass) or accept downtime during node maintenance.
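A minimal PDB in Terraform might look like this sketch (selector labels and names are placeholders):

```hcl
# PDB sketch — names and selector labels are placeholders.
resource "kubernetes_pod_disruption_budget_v1" "example" {
  metadata {
    name      = "example"
    namespace = "example"
  }
  spec {
    min_available = 1
    selector {
      match_labels = {
        app = "example"
      }
    }
  }
}
```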
**Resource Requests Sparse, Limits Missing:**

- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults, but the default CPU limits are often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`).
- Cause: Request/limit tuning requires load testing; defaults are used instead.
- Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage and limits to P99. Monitor via Prometheus and adjust quarterly.

**Large Terraform Modules (900+ lines):**

- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others exceed 450 lines.
- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines).
- Cause: Monolithic resource definitions; hard to navigate and test.
- Improvement path: Split large modules into sub-modules (e.g., `dbaas/` → `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration.
## Fragile Areas

**Immich Machine Learning GPU Dependency:**

- Files: `stacks/immich/main.tf` (lines 380-450).
- Why fragile: The GPU workload (`immich-machine-learning-cuda`) requires the Tesla T4 on k8s-node1. If the GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently with no fallback. Single GPU point of failure.
- Safe modification: Add `nodeAffinity` to prefer the GPU but allow a non-GPU fallback (degraded mode). Implement health checks on GPU availability (an `nvidia-smi` probe). Test the GPU failure scenario before production use.
- Test coverage: No tests for GPU unavailability; assumes the GPU is always available.

**Nextcloud Backup/Restore Procedures Manual:**

- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps).
- Why fragile: Backup/restore scripts are ConfigMap-based with no automation. Restoration requires a manual `kubectl exec` and script execution. No tested recovery procedure.
- Safe modification: Implement automated backup via Velero or CSI snapshots. Test the restore procedure monthly in a staged environment.
- Test coverage: No automated backup validation; scripts untested.

**NFS Dependency for Data Persistence:**

- Files: 126 references to NFS volumes across all stacks.
- Why fragile: All stateful data depends on the NFS server at `10.0.10.15`. If NFS becomes unavailable, all services lose data access immediately (no local caches). No fallback storage.
- Safe modification: Enable NFS client-side attribute caching (Linux NFS mount options such as `ac,acregmin=3600`). Monitor NFS availability via Prometheus alerts (mount point offline). Test the NFS failover procedure (if a replica NFS exists).
- Test coverage: No chaos engineering tests for NFS unavailability.

**Istio Injection Disabled Cluster-Wide:**

- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` on namespace labels.
- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production.
- Test coverage: No mTLS validation; assumes all pods on the same network are trusted.

**PostgreSQL Custom Image Not Tracked:**

- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`.
- Why fragile: A custom build on Docker Hub with PostGIS + PgVector extensions. No version pin; the `:16-master` tag is mutable. Upstream extension versions are unknown.
- Safe modification: Pin to a semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with a Dockerfile tracked in git. Test extension versions against application requirements.
- Test coverage: No tests for extension availability or version compatibility.
## Scaling Limits

**Single-Replica Critical Services:**

- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
- Limit: A node failure causes an immediate service outage. Kubernetes can take 5+ minutes to reschedule the pod.
- Scaling path: Increase critical service replicas to 3 (quorum). Add pod anti-affinity to spread across nodes. Implement a PodDisruptionBudget with `minAvailable: 2`.

**GPU Capacity Bottleneck:**

- Current capacity: 1 Tesla T4 GPU on k8s-node1.
- Limit: Immich ML, Ebook2Audiobook, and Ollama all compete for the single GPU. Queue times of 10+ minutes for inference tasks that fall back to CPU.
- Scaling path: Add a second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via the NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).

**NFS Storage Capacity:**

- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
- Limit: Immich (image library), Calibre (ebooks), and Dawarich (location history) grow unbounded. When storage fills, writes fail and services degrade.
- Scaling path: Monitor NFS capacity monthly (`df -h`). Set up a Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB of Immich data per month).

**MySQL/PostgreSQL Connection Pool:**

- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling. Default pool size is likely 100-200 connections.
- Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed the pool. New connections queue or fail.
- Scaling path: Monitor PgBouncer pool utilization (Prometheus metric `pgbouncer_pools_used_connections`). Increase the pool size at >80% utilization. Consider read replicas for read-heavy workloads.

**API Rate Limiting & Bandwidth:**

- Current capacity: Services exposed via Traefik ingress. No global rate limiting documented.
- Limit: External tools (the Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior is possible.
- Scaling path: Implement Traefik rate-limiting middleware. Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
## Dependencies at Risk

**Redis Stack `:latest` Tag:**

- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed; breaking changes are possible.
- Impact: An unexpected version upgrade could introduce incompatibilities with clients expecting a specific command set or module versions.
- Migration plan: Pin to a specific Redis Stack release. Test version upgrades in staging first. Monitor Redis logs for deprecated-command warnings.

**Immich Floating Patch Versions:**

- Risk: `stacks/immich/main.tf` pins to `v2.5.6`, but Immich frequently releases patch versions, and its database migrations can cause downtime.
- Impact: If the Immich version is upgraded without testing, database migrations could fail or hang (no rollback mechanism).
- Migration plan: Pin to specific patch versions (e.g., `v2.5.6`, not `v2.5`). Test Immich upgrades in staging first. Maintain a backup before upgrading.

**Non-LTS MySQL 9.2.0:**

- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.x releases are innovation-track, not LTS.
- Impact: Innovation releases have short support windows; stability issues and CVEs are possible, with no long-term support.
- Migration plan: Migrate to MySQL 8.4 LTS. Test the data migration first. Plan for a gradual rollout.

**Python Timeouts in Monitoring Scripts:**

- Risk: `stacks/platform/modules/nvidia/main.tf` uses a hardcoded `timeout=10` for HTTP requests and subprocess calls. Slow network conditions will cause failures.
- Impact: GPU monitoring will fail if the network is slow or unavailable. Silent failures are possible.
- Migration plan: Implement exponential backoff and retry logic (e.g., the `tenacity` library). Increase the timeout to 30s for unreliable networks. Log timeouts for debugging.
## Missing Critical Features

**No Disaster Recovery Plan:**

- Problem: Backup procedures exist (Nextcloud, MySQL) but there is no tested recovery procedure and no runbook for a cluster disaster.
- Blocks: If cluster data is lost, recovery would be manual and time-consuming. No RTO/RPO defined.
- Impact: On data loss, recovery would likely take more than 24 hours.

**No Secrets Rotation Policy:**

- Problem: SSH keys, API tokens, and database passwords are stored in git-crypt and tfvars. No automated rotation schedule.
- Blocks: If a key leaks, manual intervention is required to rotate it across all services.
- Impact: Leaked credentials persist until discovery.

**No Cross-Cluster Failover:**

- Problem: A single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
- Blocks: A cluster-wide failure (network partition, hypervisor crash) causes a total outage.
- Impact: RTO > 1 hour (manual intervention to restart the hypervisor or re-provision).
## Test Coverage Gaps

**No Infrastructure Testing:**

- What's not tested: Terraform applies, Helm charts, and manifests are only validated via `terraform plan`. No `terratest`, no functional tests of deployed services.
- Files: All stacks (no test files found).
- Risk: Typos, variable misconfigurations, and missing dependencies are not caught until a production apply.
- Priority: High — add `terratest` to validate Terraform. Test critical paths (database connection, ingress routing).

**No Chaos Engineering Tests:**

- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
- Files: All stacks (no chaos tests found).
- Risk: Cascading failures and data loss scenarios are not validated. Assumptions about resilience are untested.
- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.

**No Backup Restoration Tests:**

- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but have never been executed.
- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`.
- Risk: Backups may be corrupt or unusable when needed. RPO > 24 hours if discovery is slow.
- Priority: High — run a monthly restore-to-staging test. Automate backup validation.

**No Security Scanning for Vulnerabilities:**

- What's not tested: Container images for CVEs; Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
- Files: All stacks, all container images.
- Risk: Known vulnerabilities deployed to production. No supply chain security.
- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.

---

*Concerns audit: 2026-02-23*

.planning/codebase/CONVENTIONS.md (new file, 192 lines)

# Coding Conventions

**Analysis Date:** 2026-02-23

## Naming Patterns

**Terraform Files:**

- `main.tf` - Primary resource definitions and module calls
- `terragrunt.hcl` - Stack-specific Terragrunt configuration
- `variables.tf` - Variable declarations for a stack
- `providers.tf` - Generated by the Terragrunt root `terragrunt.hcl`
- `backend.tf` - Generated by Terragrunt for state backend configuration

**Terraform Variables:**

- snake_case for variable names: `var.tls_secret_name`, `var.dbaas_root_password`
- snake_case for resource names: `resource "kubernetes_namespace" "nextcloud"`
- snake_case for local values: `local.tiers`
- UPPERCASE for environment-like globals in shell: `KUBECONFIG_PATH`, `PASS_COUNT`

**Resource/Module Names:**

- kebab-case for Kubernetes resources: `nextcloud`, `whiteboard`, `kms-web-page`
- Underscore-prefixed resource names indicate a module-internal (private) pattern
- Descriptive names matching functionality: `kubernetes_namespace`, `kubernetes_deployment`, `helm_release`

**Shell Functions:**

- snake_case for function names: `parse_args()`, `count_lines()`, `check_nodes()`
- UPPERCASE for utility color variables: `RED`, `GREEN`, `YELLOW`, `BLUE`, `BOLD`, `NC`

**Go Package/Test Names:**

- Package-level test functions: `TestContainsVideoMarkers()`, `TestIsDirectVideoContentType()`
- Table-driven test pattern with struct fields: `name`, `body`, `ct`, `want`
## Code Style
|
||||
|
||||
**Terraform Formatting:**
|
||||
- Use `terraform fmt -recursive` for consistent formatting
|
||||
- No explicit linter/formatter config file (tflint/terraform-lint not present)
|
||||
- Indentation: 2 spaces (standard Terraform convention)
|
||||
- Multi-line strings use heredoc syntax: `<<EOT ... EOT` for YAML/config blocks
|
||||
|
||||
**Bash Script Style:**
|
||||
- Shebang: `#!/usr/bin/env bash`
|
||||
- Safety flags: `set -euo pipefail` (exit on error, undefined vars, pipe failures)
|
||||
- Comments use `# ---` separator for section dividers
|
||||
- Comments use `# ---` for grouping related variables/functions
|
||||
- One-liner functions defined as: `function_name() { [[ condition ]] && action; }`
|
||||
- Multiline functions use explicit function body with local keyword for variables
|
||||
|
||||
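A minimal script skeleton following these Bash conventions (the helper names mirror the document's examples; the function bodies are illustrative assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail  # exit on error, undefined vars, pipe failures

# --- Colors ---
RED='\033[0;31m'; GREEN='\033[0;32m'; NC='\033[0m'

# --- Counters ---
PASS_COUNT=0
FAIL_COUNT=0

# One-liner form: condition && action
is_count() { [[ $1 =~ ^[0-9]+$ ]] && echo valid; }

# Multiline form: explicit body, local for function-scoped variables
count_lines() {
  local file=$1
  wc -l < "$file" | tr -d ' '
}
```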
**Terraform Style:**
- Major sections delimited by `# =============================================================================` comment rules
- Subsections delimited by shorter dash rules (`# ---------------`)
- Inline comments explain why, not what: `# anything secret is fine` (explaining an arbitrary choice)
- Module calls include a comment above describing purpose: `# --- Core ---`, `# --- dbaas ---`
## Import Organization

**Terraform:**
- Locals (tier definitions) defined at the top of `main.tf`
- Variables declared in order: core/required first, then by feature area (dbaas, traefik, etc.)
- Modules called after variables, grouped by functional area with comment headers
- Resources defined after modules

**Go:**
- Standard imports from the `testing` package
- No grouping (single simple import)

**Bash:**
- Source definitions at top (colors, globals, helper functions)
- Argument parsing in a dedicated `parse_args()` function
- Main logic organized by check sections with `section()` calls

## Error Handling

**Terraform:**
- No explicit error handling (declarative; errors cause apply failure)
- Dependency management via `depends_on` for explicit ordering
- `dependency` blocks in Terragrunt for cross-stack dependencies
- `skip_outputs = true` used when only ordering is needed, not outputs
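A typical stack-level `terragrunt.hcl` using this ordering-only pattern (paths follow the repo layout described here):

```hcl
include "root" {
  path = find_in_parent_folders()
}

# Apply the platform stack first, but read none of its outputs
dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```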
**Bash:**
- Inline error checks: `out=$(command 2>&1) || { fail "message"; return 0; }`
- `set -euo pipefail` prevents silent failures and undefined-variable issues
- Error status captured implicitly through the `||` pattern rather than explicit `$?` checks
- Graceful degradation with fallback values or skippable steps

**Go:**
- Standard testing error reporting: `t.Errorf()` with formatted messages
- Table-driven test pattern allows multiple related test cases
- Error messages include actual vs expected: `got = %v, want = %v`
## Logging

**Framework:** Not formally configured; uses `echo` and `echo -e` for output

**Bash Logging Patterns:**
- Color-coded output with status prefixes: `${BLUE}[INFO]${NC}`, `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
- Helper functions `info()`, `pass()`, `warn()`, `fail()` each increment counters and respect the `--quiet` flag
- Section headers: `section()` for verbose output, `section_always()` for always-shown sections
- Conditional logging: functions check the `$JSON` and `$QUIET` flags and skip output as needed
- JSON output option available via `json_add()` for machine-readable logging
- Detail strings accumulated in variables for final reporting
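A sketch of these logging helpers under the stated conventions (flag and counter names follow the document; the exact bodies are assumptions):

```shell
GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'; NC='\033[0m'
PASS_COUNT=0; WARN_COUNT=0; FAIL_COUNT=0
QUIET=0

# Print unless --quiet was requested
log() { [[ $QUIET -eq 1 ]] || echo -e "$1"; }

# Each helper increments its counter and emits a color-coded prefix
pass() { ((PASS_COUNT++)) || true; log "${GREEN}[PASS]${NC} $1"; }
warn() { ((WARN_COUNT++)) || true; log "${YELLOW}[WARN]${NC} $1"; }
fail() { ((FAIL_COUNT++)) || true; log "${RED}[FAIL]${NC} $1"; }
```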
**Terraform Logging:**
- Relies on Terraform's built-in CLI output
- Human-readable variable descriptions (Terraform surfaces these in error output)
## Comments

**When to Comment:**

Terraform:
- Section dividers: major logical groups separated by `# =============================================================================`
- Feature group headers: `# --- Feature Name ---` before variable/module blocks
- Commented-out code: temporarily disabled resources/modules include an explanation (e.g., "Do not use until issue #X is solved")
- Clarifying arbitrary choices: `# anything secret is fine` explains non-obvious variable usage

Bash:
- Function-level comments: each check function states its purpose on the first line
- Complex logic: comments before conditional blocks explain intent
- Inline comments for edge cases: `# Skip nodes where metrics are not yet available`
- Header comments: scripts include usage documentation at the top

**JSDoc/TSDoc:**
- Not used in this codebase (Terraform, Bash, Go only)
## Function Design

**Size:**
- Terraform modules typically 20-50 lines for simple services; variable declaration blocks 30-100+ lines
- Bash functions average 20-40 lines; check functions 10-30 lines
- Go test functions 10-60 lines (table + loop)

**Parameters:**
- Terraform: via `variable` declarations and module input variables
- Bash: positional parameters passed via `$1`, `$2`, etc., with validation in `parse_args()`
- Go: test functions accept a `*testing.T` parameter

**Return Values:**
- Terraform: no explicit returns; resource state is the "return"
- Bash: `return 0` for success, values via `echo` output, status codes for error handling
- Go: functions tested for boolean returns or calculated values

**Variables:**
- Terraform: module variables, locals, and resource attributes (computed values)
- Bash: global state tracked via counters (`PASS_COUNT`, `WARN_COUNT`, `FAIL_COUNT`); function-scoped variables use the `local` keyword
- Go: table-driven tests use struct fields (no getter/setter pattern)
## Module Design

**Exports:**
- Terraform: outputs typically omitted unless another stack depends on them (implicit via dependency blocks)
- Modules called with `source = "./modules/<name>"` or `source = "../../modules/kubernetes/<name>"`
- Module version pinning used for Terraform registry modules: `version = "0.1.5"`

**Barrel Files:**
- Not applicable (no aggregating re-exports in this codebase)
- Directories: `stacks/<service>/` is a unit; `stacks/platform/modules/<service>/` groups related modules

**Module Organization:**
- Single responsibility per module directory
- Each module typically contains `main.tf` (resources) and an optional `variables.tf` for input variables
- Shared Kubernetes utility modules in `modules/kubernetes/`: `ingress_factory/`, `setup_tls_secret/`
- Platform services grouped in `stacks/platform/modules/<service>/`
## Special Patterns

**Locals for Configuration:**
- Tier definitions centralized as a map in locals (each service defines the same tiers locally):

```hcl
locals {
  tiers = {
    core    = "0-core"
    cluster = "1-cluster"
    gpu     = "2-gpu"
    edge    = "3-edge"
    aux     = "4-aux"
  }
}
```

- Tier applied to `kubernetes_namespace` labels and `priority_class_name` for resource governance
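Applied to a namespace, the tier-label pattern looks roughly like this (the service name and tier choice are illustrative):

```hcl
resource "kubernetes_namespace" "nextcloud" {
  metadata {
    name = "nextcloud"
    labels = {
      tier = local.tiers["edge"] # "3-edge"
    }
  }
}
```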
**Inline Config Blocks:**
- YAML/config data stored in `<<EOT ... EOT` heredoc blocks within `data` maps
- Example: MetalLB address pool config in ConfigMap data

**File Inclusion:**
- `templatefile()` used for dynamic YAML values: `templatefile("${path.module}/chart_values.yaml", { var1 = value })`
- `file()` used for static file content in ConfigMap data
---

*Convention analysis: 2026-02-23*

210  .planning/codebase/INTEGRATIONS.md  Normal file

@@ -0,0 +1,210 @@
# External Integrations

**Analysis Date:** 2026-02-23

## APIs & External Services

**Cloudflare:**
- DNS management (public domain `viktorbarzin.me`)
- Tunnel for public HTTPS access
- Account ID: `cloudflare_account_id` in tfvars
- SDK/Client: `cloudflare/cloudflare` Terraform provider v4.52.5
- Auth: API token stored in `cloudflare_api_key`, email in `cloudflare_email`, zone ID in `cloudflare_zone_id`, tunnel ID in `cloudflare_tunnel_id`
- Implementation: `stacks/platform/modules/cloudflared/` deploys the Cloudflare tunnel daemon

**GitHub:**
- Git repository hosting and CI/CD webhook source
- Webhook endpoint: `https://webhook.viktorbarzin.me/` (handled by `stacks/webhook_handler/`)
- Auth: Git token in `webhook_handler_git_token` (terraform.tfvars)
- User: `webhook_handler_git_user` (terraform.tfvars)
- SSH key: `webhook_handler_ssh_key` for Git operations (secret in K8s)

**Facebook Messenger:**
- Chatbot integration via webhook
- Webhook endpoint: `https://webhook.viktorbarzin.me/` (receives `webhook_handler_fb_*` events)
- Auth tokens: `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret` (all in tfvars)

**Slack:**
- Alert routing and notifications
- Webhook URL: `alertmanager_slack_api_url` (terraform.tfvars)
- Integration: Alertmanager alerts from `stacks/platform/modules/monitoring/` sent to Slack
- CrowdSec integration: security events to Slack via `stacks/platform/modules/crowdsec/`

**Hetrix Tools:**
- Uptime monitoring service
- Status page redirects: `https://hetrixtools.com/r/38981b548b5d38b052aca8d01285a3f3/` and `https://hetrixtools.com/r/2ba9d7a5e017794db0fd91f0115a8b3b/`
- Implementation: Traefik middleware redirect in `stacks/platform/modules/monitoring/main.tf`

**Tiny Tuya:**
- Smart device control via tuya-bridge
- Auth: `tiny_tuya_service_secret` (terraform.tfvars)

**Mailgun:**
- SMTP relay for outgoing mail (primary relay host)
- Relay: `[smtp.eu.mailgun.org]:587` (Postfix `DEFAULT_RELAY_HOST`)
- Auth: SASL credentials in `sasl_passwd` (mailserver config)
- Alternative: SendGrid (commented out, previously used)

**Home Assistant:**
- Home automation integration
- API token: `haos_api_token` (terraform.tfvars)
- Access: `https://ha-london.viktorbarzin.me`, `https://ha-sofia.viktorbarzin.me`

**Proxmox:**
- Virtualization platform for VM provisioning
- Host: `192.168.1.127:8006` (`proxmox_pm_api_url`)
- Auth: API token ID `terraform-prov@pve!terrform-prov`, secret in tfvars
- Provider: `telmate/proxmox` v3.0.2-rc07
- Access: iDRAC credentials for physical server monitoring (`idrac_host`, `idrac_username`, `idrac_password`)
## Data Storage

**Databases:**
- MySQL 9.2.0
  - Connection: `mysql.dbaas.svc.cluster.local:3306` (K8s internal)
  - Client: direct port access (no ORM in core infrastructure)
  - Root password: `dbaas_root_password` (tfvars)
  - Storage: NFS PV at `/mnt/main/mysql`
- PostgreSQL 16.4-bullseye (with PostGIS + PGVector)
  - Connection: `postgresql.dbaas:5432` (K8s internal)
  - Connection via PgBouncer: `pgbouncer.authentik:6432` (Authentik only)
  - Root password: `dbaas_postgresql_root_password` (tfvars)
  - PgBouncer root password: `pgbouncer_root_password` (tfvars)
  - Admin UI: PgAdmin at `pma.viktorbarzin.me`
  - PgAdmin password: `dbaas_pgadmin_password` (tfvars)
  - Storage: NFS PV at `/mnt/main/postgresql`

**File Storage:**
- NFS (primary)
  - Host: `10.0.10.15` (TrueNAS)
  - Mount path: `/mnt/main/`
  - Subdirectories: per-service (e.g., `/mnt/main/immich/`, `/mnt/main/affine/`, `/mnt/main/mailserver/`, etc.)
  - Configuration: `secrets/nfs_directories.txt` (git-crypt encrypted)
  - Export script: `secrets/nfs_exports.sh` (updates TrueNAS exports)

**Caching:**
- Redis (`redis/redis-stack:latest`)
  - Connection: `redis.redis.svc.cluster.local` (K8s internal; no explicit port in code)
  - Databases: DB 2 (Gramps Web broker), DB 3 (Gramps Web rate limiting)
  - Storage: persistent volume for data durability
  - Implementation: `stacks/platform/modules/redis/main.tf`
## Authentication & Identity

**Auth Provider:**
- Authentik (self-hosted OIDC/OAuth2 identity provider)
  - URL: `https://authentik.viktorbarzin.me`
  - API: `/api/v3/` endpoint
  - Token: `authentik_api_token` (terraform.tfvars)
  - Database: PostgreSQL via `postgresql.dbaas:5432` (also PgBouncer at `pgbouncer.authentik:6432`)
  - Secret key: `authentik_secret_key` (terraform.tfvars)
  - Postgres password: `authentik_postgres_password` (terraform.tfvars)
  - K8s OIDC: issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
  - Implementation: `stacks/platform/modules/authentik/main.tf` + Helm chart
  - Traefik integration: forward auth via `protected = true` in `ingress_factory`

**RBAC:**
- Kubernetes API auth via Authentik OIDC
- SSH keys: `ssh_private_key` (terraform.tfvars)
- Implementation: `stacks/platform/modules/rbac/` + `stacks/platform/modules/k8s-portal/`
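From a consuming stack, the forward-auth toggle looks roughly like this (argument names other than `protected` and `source` are assumptions about the factory's inputs):

```hcl
module "ingress" {
  source    = "../../modules/kubernetes/ingress_factory"
  name      = "nextcloud" # illustrative service
  namespace = "nextcloud"
  protected = true # Traefik forward-auth via Authentik
}
```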
## Monitoring & Observability

**Error Tracking:**
- None detected; alerts are routed to Slack instead

**Metrics:**
- Prometheus (time-series database)
  - Scrape endpoints: cluster nodes, services, Proxmox iDRAC, Tuya devices, Home Assistant
  - Implementation: `stacks/platform/modules/monitoring/`
  - Health check: a CronJob monitors the prometheus-server pod and alerts to `https://webhook.viktorbarzin.me/fb/message-viktor` if it is down

**Logs:**
- Loki 3.6.5 (single binary) + Alloy v1.13.0 (DaemonSet collector)
  - Retention: 7 days
  - Storage: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
  - Alerting: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)

**Visualization:**
- Grafana
  - Database: PostgreSQL via dbaas
  - Admin password: `grafana_admin_password` (tfvars)
  - DB password: `grafana_db_password` (tfvars)

**Status Pages:**
- Hetrix Tools (external uptime monitoring)
- Uptime Kuma (self-hosted, `stacks/platform/modules/uptime-kuma/`)
## CI/CD & Deployment

**Hosting:**
- Proxmox 8.x (hypervisor)
- Kubernetes 1.34.2 (application platform)
- Cloudflare Tunnel (public ingress)

**CI Pipeline:**
- Woodpecker CI (self-hosted, `stacks/woodpecker/`)
- Hosted at: `https://ci.viktorbarzin.me`
- Config: `.woodpecker/` in repo root
- Triggers: Git push, scheduled jobs
- Applies the platform stack automatically on merge to master

**GitOps:**
- Webhook-handler service: receives GitHub webhooks, triggers deployments
- Endpoint: `https://webhook.viktorbarzin.me/`
- Auth: secret token `webhook_handler_secret` (tfvars)
- Can update K8s deployments via RBAC
- Implementation: `stacks/webhook_handler/main.tf`, image `viktorbarzin/webhook-handler:latest`
## Environment Configuration

**Required variables (`terraform.tfvars`, git-crypt encrypted):**
- `cloudflare_api_key`, `cloudflare_email`, `cloudflare_zone_id`, `cloudflare_tunnel_id`, `cloudflare_tunnel_token`
- `dbaas_root_password`, `dbaas_postgresql_root_password`, `dbaas_pgadmin_password`
- `authentik_secret_key`, `authentik_postgres_password`, `authentik_api_token`
- `proxmox_pm_api_url`, `proxmox_pm_api_token_id`, `proxmox_pm_api_token_secret`
- `alertmanager_slack_api_url`, `alertmanager_account_password`
- `webhook_handler_secret`, `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret`, `webhook_handler_git_token`, `webhook_handler_git_user`, `webhook_handler_ssh_key`
- `vaultwarden_smtp_password`, `mailserver_accounts`, `postfix_account_aliases`, `sasl_passwd`
- `crowdsec_enroll_key`, `crowdsec_db_password`, `crowdsec_dash_api_key`, `crowdsec_dash_machine_id`, `crowdsec_dash_machine_password`
- `headscale_config`, `headscale_acl`
- `monitoring_idrac_username`, `monitoring_idrac_password`, `tiny_tuya_service_secret`, `haos_api_token`, `pve_password`, `grafana_admin_password`, `grafana_db_password`
- `k8s_users` (map of SSH keys for K8s RBAC)

**Secrets location:**
- Primary: `terraform.tfvars` (git-crypt encrypted at rest, decrypted during `terragrunt apply`)
- K8s Secrets: created by Terraform from tfvars into namespaces (see `stacks/platform/modules/*/main.tf`)
- TLS certificates: `secrets/` directory (symlinked into stacks as `secrets/` → `../../secrets`)
## Webhooks & Callbacks

**Incoming (webhook endpoints):**
- GitHub webhooks: `https://webhook.viktorbarzin.me/` (deployment triggers)
- Facebook Messenger webhooks: `https://webhook.viktorbarzin.me/` (chatbot messages)
- Health alerts: a CronJob sends to `https://webhook.viktorbarzin.me/fb/message-viktor` if Prometheus is down

**Outgoing:**
- Alertmanager → Slack webhook: `alertmanager_slack_api_url`
- CrowdSec → Slack webhook: same as Alertmanager
- Hetrix Tools status pages: redirect middleware rather than a direct integration
## Integration Patterns

**Terraform Secrets Injection:**
- Template pattern: `templatefile("${path.module}/values.yaml", { var1 = var.value1, ... })`
- Direct env injection: K8s ConfigMap/Secret created from tfvars variables
- Example: `stacks/platform/modules/crowdsec/main.tf` renders Helm values with interpolated secrets
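A minimal sketch of that injection pattern (the chart reference and the template's variable names are illustrative, not the module's actual code):

```hcl
resource "helm_release" "crowdsec" {
  name      = "crowdsec"
  chart     = "crowdsec" # repository argument omitted for brevity
  namespace = "crowdsec"

  # Render values.yaml with secrets interpolated from terraform.tfvars
  values = [
    templatefile("${path.module}/values.yaml", {
      enroll_key  = var.crowdsec_enroll_key
      db_password = var.crowdsec_db_password
    })
  ]
}
```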
**Internal Service Discovery:**
- DNS: services accessible via `<name>.<namespace>.svc.cluster.local`
- Examples: `mysql.dbaas.svc.cluster.local`, `redis.redis.svc.cluster.local`, `postgresql.dbaas.svc.cluster.local`

**External Service Access:**
- Cloudflare Tunnel: provides public HTTPS for services (no direct internet exposure needed)
- Traefik Ingress: routes external traffic to internal K8s services
- Technitium (internal DNS) resolves the `.lan` domain
---

*Integration audit: 2026-02-23*

129  .planning/codebase/STACK.md  Normal file

@@ -0,0 +1,129 @@
# Technology Stack

**Analysis Date:** 2026-02-23

## Languages

**Primary:**
- HCL (HashiCorp Configuration Language) - Terraform/Terragrunt infrastructure definitions
- Bash - Scripting and cluster management (`scripts/` directory)
- YAML - Kubernetes resource definitions and configuration
- Python - Monitoring and utility scripts in `stacks/platform/modules/`
- TypeScript/JavaScript - k8s-portal frontend and webhook-handler (`stacks/platform/modules/k8s-portal/`, `stacks/webhook_handler/`)

**Secondary:**
- Go - Various utilities
- Dockerfile - Container image definitions across stacks

## Runtime

**Environment:**
- Kubernetes v1.34.2 (5 nodes: k8s-master + k8s-node1-4)
- Linux (Ubuntu cloud images on Proxmox VMs)
- Bash shell for automation

**Package Manager:**
- npm (Node.js) - for k8s-portal web UI development
  - Lockfile: `package-lock.json` present
- pip (Python) - for utility scripts
- Terraform/Terragrunt - manages all infrastructure dependencies

## Frameworks

**Core:**
- Terraform 1.x - Infrastructure-as-Code orchestration
- Terragrunt - State-isolation wrapper around Terraform (`terragrunt.hcl` in each stack)
- Kubernetes - Container orchestration (kubectl, Helm, kustomize patterns)

**Testing:**
- Playwright ^1.58.2 - E2E testing framework (root `package.json`)

**Build/Dev:**
- Helm 3.1.1 - Kubernetes package manager (provider version via Terraform)
- Svelte - Frontend framework for k8s-portal (`stacks/platform/modules/k8s-portal/files/` Node.js project)
## Key Dependencies

**Critical:**
- hashicorp/kubernetes (3.0.1) - Kubernetes API provider
- hashicorp/helm (3.1.1) - Helm release management
- telmate/proxmox (3.0.2-rc07) - Proxmox VM management (`stacks/infra/`)
- cloudflare/cloudflare (4.52.5) - DNS and tunnel management (`stacks/platform/modules/cloudflared/`)
- hashicorp/null (3.2.4) - Utility provider for local operations
- hashicorp/random (3.8.1) - Random value generation

**Infrastructure:**
- MySQL 9.2.0 - Relational database (`stacks/platform/modules/dbaas/`)
- PostgreSQL 16.4-bullseye - Primary database with PostGIS/PGVector (`stacks/platform/modules/dbaas/`)
- Redis (`redis/redis-stack:latest`) - In-memory cache and broker (`stacks/platform/modules/redis/`)
- Headscale 0.23.0 - WireGuard control plane (`stacks/platform/modules/headscale/`)

**Observability:**
- Prometheus - Metrics collection and alerting
- Grafana - Metrics visualization and dashboards
- Loki 3.6.5 - Log aggregation (from user instructions)
- Alloy v1.13.0 - Log collector (from user instructions)

**API Gateway & Ingress:**
- Traefik 3.x - Ingress controller and reverse proxy (`stacks/platform/modules/traefik/`)
- MetalLB - Load balancer for Kubernetes service IPs (`stacks/platform/modules/metallb/`)

**Security:**
- Authentik - Identity provider/OIDC (`stacks/platform/modules/authentik/`)
- Vaultwarden 1.35.2 - Password manager (`stacks/platform/modules/vaultwarden/`)
- CrowdSec - Intrusion detection and IP reputation (`stacks/platform/modules/crowdsec/`)
- Kyverno - Policy enforcement and governance (`stacks/platform/modules/kyverno/`)

**Container Image Registries:**
- docker.io - Docker Hub public images
- ghcr.io - GitHub Container Registry (Headscale UI, Immich, etc.)
- quay.io - Quay registry (inferred from mirror config)
- registry.k8s.io - Kubernetes images
- Local pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
## Configuration

**Environment:**
- `terraform.tfvars` (git-crypt encrypted) - All secrets, API keys, DNS records, passwords
- Environment variables injected into Kubernetes pods via ConfigMap/Secret
- Kubeconfig: `config` file in repo root (referenced as `$PWD/config` in Terragrunt)

**Build:**
- `terragrunt.hcl` (root) - DRY Terraform provider and backend configuration
- `stacks/<service>/terragrunt.hcl` - Per-stack overrides
- `stacks/<service>/main.tf` - Kubernetes/Proxmox resource definitions
- `.terraform.lock.hcl` - Provider version lock (Terraform 1.x)
- `.terraform/` - Downloaded providers cached locally

**Secrets:**
- `secrets/` directory (git-crypt encrypted)
- TLS certificates and keys in `secrets/` (symlinked from stacks)
- OpenDKIM keys for the mailserver
- NFS export configuration in `secrets/nfs_directories.txt`
## Platform Requirements

**Development:**
- Terraform 1.x CLI
- Terragrunt CLI (uses `terragrunt apply --non-interactive`)
- kubectl configured with the kubeconfig at `$PWD/config`
- git-crypt for secret decryption
- curl, bash, standard Unix utilities

**Production:**
- Kubernetes 1.34.2+ cluster (5 nodes, 192 GB+ total memory)
- Proxmox 8.x hypervisor (`stacks/infra/` provisions VMs)
- NFS storage: TrueNAS at `10.0.10.15` with exports at `/mnt/main/`
- Docker registry pull-through cache at `10.0.20.10`
- Cloudflare DNS (public domain `viktorbarzin.me`)
- Technitium DNS (internal domain `viktorbarzin.lan`)

**Networking:**
- Kubernetes pod CIDR: managed by the cluster
- Service IPs: 10.0.20.200-10.0.20.220 (MetalLB layer 2)
- Internal DNS: Technitium at a cluster IP
- External DNS: Cloudflare tunnel + traditional DNS records

---

*Stack analysis: 2026-02-23*

255  .planning/codebase/STRUCTURE.md  Normal file

@@ -0,0 +1,255 @@
# Codebase Structure

**Analysis Date:** 2026-02-23

## Directory Layout

```
/Users/viktorbarzin/code/infra/
├── .claude/             # Project-level Claude knowledge (skills, reference docs)
├── .git/                # Git repository metadata
├── .git-crypt/          # git-crypt encryption keys
├── .planning/codebase/  # GSD codebase analysis documents
├── .terraform/          # Terraform cache (gitignored)
├── .woodpecker/         # CI/CD pipeline definitions
├── cli/                 # Custom CLI tools (bash/python scripts)
├── diagram/             # Infrastructure diagram sources
├── docs/                # Documentation (deployment guides, design docs)
├── modules/             # Shared Terraform modules (Proxmox, K8s utilities)
├── playbooks/           # Ansible playbooks (infrastructure setup)
├── scripts/             # Maintenance scripts (healthcheck, DNS updates, etc.)
├── secrets/             # git-crypt encrypted files (NFS dirs, TLS certs, SSH keys)
├── stacks/              # Terragrunt stacks (platform + ~70 service stacks)
├── state/               # Terraform state files (local backend, gitignored)
├── terragrunt.hcl       # Root Terragrunt config (DRY provider/backend setup)
├── terraform.tfvars     # All variables + secrets (git-crypt encrypted, ~48KB)
├── config               # Kubernetes config (kubeconfig file)
├── README.md            # Project overview
└── package.json         # Node.js deps (minimal; mostly for cli tools)
```
## Directory Purposes

**`.claude/`:**
- Purpose: Project-level Claude knowledge and execution skills
- Contains: `skills/` (setup-project, authentik workflows), `reference/` (inventory tables, API patterns)
- Key files: `CLAUDE.md` (this file's counterpart with full infrastructure context)

**`.planning/codebase/`:**
- Purpose: GSD codebase analysis output directory
- Contains: `ARCHITECTURE.md`, `STRUCTURE.md` (this file), and focus-specific docs
- Auto-generated: Yes (by /gsd:map-codebase)

**`modules/`:**
- Purpose: Reusable Terraform modules for VM creation and Kubernetes utilities
- Contains:
  - `create-template-vm/`: Cloud-init Ubuntu template VM provisioning (K8s + non-K8s)
  - `create-vm/`: VM instance creation from templates with cloud-init injection
  - `docker-registry/`: Docker registry pull-through cache setup
  - `kubernetes/`: K8s-specific utilities (`ingress_factory`, `setup_tls_secret`)

**`stacks/`:**
- Purpose: Terragrunt stacks with isolated state and per-service configuration
- Contains: 1 platform stack + ~70 application stacks
- Structure: Each stack is a directory with `terragrunt.hcl` + `main.tf` + optional `factory/` (for multi-instance services)

**`stacks/platform/`:**
- Purpose: Core infrastructure services (22 modules)
- Contains: Modules for MetalLB, DBaaS, Redis, Traefik, DNS, VPN, auth, monitoring, security
- Key subdirs: `modules/` (platform-specific modules like traefik, authentik, monitoring)

**`stacks/infra/`:**
- Purpose: Proxmox VM template and instance provisioning
- Contains: K8s node templates, docker-registry VM, Proxmox provider configuration

**`stacks/<service>/`:**
- Purpose: Single application stack with isolated state
- Pattern: `terragrunt.hcl` (includes root, declares dependencies) + `main.tf` (resources) + optional `factory/` + optional `chart_values.yaml`
- Examples: `nextcloud/`, `immich/`, `matrix/`, `actualbudget/` (multi-tenant), etc.

**`secrets/`:**
- Purpose: git-crypt encrypted sensitive files
- Contains: TLS certificates/keys, NFS export list, SSH keys, DKIM keys, Postfix config
- Key files:
  - `nfs_directories.txt`: List of NFS shares (sorted); regenerate exports with `nfs_exports.sh`
  - `tls/`: TLS certificate chain and keys
  - `mailserver/`: OpenDKIM keys, Postfix SASL creds

**`scripts/`:**
- Purpose: Operational and maintenance automation
- Key scripts:
  - `cluster_healthcheck.sh`: 24-point cluster health status
  - `renew2.sh`: TLS certificate renewal via certbot + Cloudflare
  - `setup_certs.sh`: Initial certificate setup
  - `pve_*`: Proxmox management scripts
  - `ha_*`: Home Assistant integration scripts

**`docs/`:**
- Purpose: Design and deployment documentation
- Contains: High-level architecture diagrams, deployment guides, troubleshooting

**`cli/`:**
- Purpose: Custom CLI utilities
- Contains: Python/bash scripts for common operations (DNS management, NFS, etc.)
## Key File Locations

**Entry Points:**
- `terragrunt.hcl`: Root Terragrunt config; invoked by `terragrunt apply` in any stack directory
- `stacks/platform/main.tf`: Platform stack; applies 22 core modules
- `stacks/infra/main.tf`: Infrastructure stack; creates VM templates and the docker-registry VM

**Configuration:**
- `terraform.tfvars`: Central variables file (~48KB, git-crypt encrypted). Used by all stacks. Contains: Cloudflare credentials, DNS records, service secrets, TLS secret name
- `stacks/<service>/terragrunt.hcl`: Stack-specific Terragrunt config (includes root, declares `dependency` blocks)
- `stacks/platform/modules/<service>/main.tf`: Platform module implementation (22 modules)

**Core Logic:**
- `stacks/platform/main.tf`: 1000+ lines; instantiates all 22 platform modules
- `stacks/<service>/main.tf`: 30-450 lines; creates namespaces, Helm releases, Kubernetes resources
- `stacks/<service>/factory/main.tf`: Multi-instance service pattern; called multiple times with different parameters
- `modules/kubernetes/ingress_factory/main.tf`: Traefik ingress + service template with security defaults

**Testing & Validation:**
- `.woodpecker/`: CI/CD pipeline (applies the platform stack on merge)
- `scripts/cluster_healthcheck.sh`: Manual cluster health validation

**Kubernetes & Cluster Config:**
- `config`: Kubeconfig file for cluster access
- Namespace pattern: one namespace per service stack
- TLS secret: `tls-secret` injected into all namespaces via the `setup_tls_secret` module
## Naming Conventions

**Files:**
- `main.tf`: Primary Terraform resource file per stack
- `terragrunt.hcl`: Terragrunt-specific configuration (includes root, dependencies)
- `terraform.tfvars`: Global variables (git-crypt encrypted)
- `chart_values.yaml`: Helm chart values template (uses `templatefile` for variable substitution)
- `*_values.tpl`: Helm values template (evaluated with `templatefile`)
- `.terraform.lock.hcl`: Provider lock file (one per stack)

**Directories:**
- `stacks/<service>/`: Kebab-case service names (e.g., `real-estate-crawler`, `k8s-dashboard`)
- `stacks/platform/modules/<service>/`: Kebab-case module names
- `state/stacks/<service>/`: Mirrored state directory structure
- `secrets/`: Single top-level directory for all encrypted files
- `modules/kubernetes/`, `modules/create-template-vm/`: Category-based grouping

**Terraform Resources:**
- **Kubernetes**: `kubernetes_*` (namespace, deployment, service, configmap, etc.)
- **Helm**: `helm_release` (Helm chart deployments)
- **Local files**: `local_file` (for generated scripts and configs)
- **Module calls**: `module "<short-name>"` (e.g., `module "traefik"`, `module "redis"`)

**Variables:**
- snake_case: `tls_secret_name`, `crowdsec_api_key`, `nextcloud_db_password`
- Service-prefixed: `<service>_<attribute>` (e.g., `authentik_secret_key`, `mailserver_accounts`)
## Where to Add New Code

**New Service Stack:**
1. Create the `stacks/<service>/` directory
2. Add `terragrunt.hcl`:
   ```hcl
   include "root" {
     path = find_in_parent_folders()
   }

   dependency "platform" {
     config_path  = "../platform"
     skip_outputs = true
   }
   ```
3. Create `main.tf` with:
   - Variable declarations for required inputs from `terraform.tfvars`
   - `locals { tiers = { ... } }` (copy from an existing stack)
   - A `kubernetes_namespace` resource with a tier label
   - A `module "tls_secret"` call to `../../modules/kubernetes/setup_tls_secret`
   - Service-specific resources (Helm releases, Deployments, etc.)
4. Add Cloudflare DNS records in `terraform.tfvars` if needed
5. Optionally create the `secrets/` symlink: `ln -s ../../secrets secrets`
6. Apply: `cd stacks/<service> && terragrunt apply --non-interactive`
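Steps 1, 2, and 5 can be sketched as a short shell session. The service name `myservice` is a placeholder, and the apply in step 6 needs a configured Terragrunt environment, so it is left commented:

```shell
# Sketch of scaffolding a hypothetical stack "myservice" (steps 1, 2, 5).
mkdir -p stacks/myservice

cat > stacks/myservice/terragrunt.hcl <<'EOF'
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
EOF

# Optional secrets symlink (step 5)
ln -s ../../secrets stacks/myservice/secrets

# Step 6 (requires Terragrunt and the repo's root config):
# cd stacks/myservice && terragrunt apply --non-interactive
```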

**Multi-Tenant Service (using the Factory Pattern):**
1. Create the parent stack: `stacks/<service>/main.tf` with namespace + TLS setup
2. Create `stacks/<service>/factory/main.tf` with single-instance logic
3. In the parent, call the factory multiple times:
   ```hcl
   module "instance1" {
     source = "./factory"
     name   = "instance1"
     # ... other params
   }
   ```
4. Example: `stacks/actualbudget/` instantiates the factory for viktor, anca, and emo

**New Platform Module:**
1. Create the `stacks/platform/modules/<service>/` directory
2. Add `main.tf` with resources (Helm chart, namespace, ConfigMaps, etc.)
3. Add `variables.tf` or declare variables in `main.tf`
4. In `stacks/platform/main.tf`, add the module call:
   ```hcl
   module "<service>" {
     source = "./modules/<service>"
     tier   = local.tiers.<tier>
     # ... pass required variables
   }
   ```
5. Add variable declarations in `stacks/platform/main.tf`

**New Shared Module:**
1. Create `modules/kubernetes/<module_name>/` or `modules/terraform/<module_name>/`
2. Add `main.tf` with reusable resources
3. Declare clear variable inputs and output any useful values
4. Call from service stacks: `module "<name>" { source = "../../modules/kubernetes/<module_name>" ... }`

**Utilities & Scripts:**
- Shared helpers: `scripts/` directory
- Custom CLI tools: `cli/` directory
- CI/CD pipelines: `.woodpecker/`

## Special Directories

**`state/`:**
- Purpose: Terraform state files (local backend)
- Generated: Yes (automatically by Terragrunt)
- Committed: No (gitignored; backed up separately)
- Structure: `state/stacks/<service>/terraform.tfstate`

**`secrets/`:**
- Purpose: git-crypt encrypted secrets and sensitive config
- Generated: No (managed manually or via scripts)
- Committed: Yes (encrypted via git-crypt)
- Contents: TLS certs, SSH keys, NFS export list, mailserver config, DKIM keys

**`.terraform/`:**
- Purpose: Terraform provider cache
- Generated: Yes (by Terraform during init)
- Committed: No (gitignored)

**`node_modules/`:**
- Purpose: Node.js dependencies for CLI tools
- Generated: Yes (by `npm install`)
- Committed: No (gitignored; use lockfile)

## File Patterns & Imports

**Terragrunt Patterns:**
- Include root: `include "root" { path = find_in_parent_folders() }`
- Declare dependencies: `dependency "platform" { config_path = "../platform"; skip_outputs = true }`
- Variable access: `var.<name>` in `main.tf` (variables sourced from `terraform.tfvars`)

**Kubernetes Resource Patterns:**
- Namespace per service: `kubernetes_namespace.<service>` with a tier label
- Helm releases: `helm_release.<chart_name>` with `templatefile` for values
- Inline NFS volumes: `volume { name = "data"; nfs { server = "10.0.10.15"; path = "/mnt/main/<service>" } }`
- TLS injection: every stack calls `module "tls_secret"` to populate the namespace secret

**Module Call Pattern:**
- Standard: `module "<name>" { source = "./modules/<module>" ... }`
- Platform modules: `source = "./modules/<service>"`
- Shared modules: `source = "../../modules/kubernetes/<module>"`

---

*Structure analysis: 2026-02-23*

279 .planning/codebase/TESTING.md Normal file

@@ -0,0 +1,279 @@

# Testing Patterns

**Analysis Date:** 2026-02-23

## Test Framework

**Language-Specific Runners:**

**Go:**
- Runner: `go test` (standard library `testing` package)
- Config: No config file (uses built-in conventions)
- Run Commands:
  ```bash
  go test ./...                    # Run all tests
  go test -v ./...                 # Verbose output
  go test -run TestContains ./...  # Run a specific test
  go test -cover ./...             # Show coverage
  ```

**Bash:**
- Runner: Custom shell scripts in `scripts/`
- No formal test framework; uses `set -euo pipefail` for error handling
- Manual health checks via `bash scripts/cluster_healthcheck.sh`

**Terraform:**
- Framework: No automated testing detected (no `terraform test` files, no `tftest.hcl`)
- Validation: Manual `terraform validate`, `terraform plan`, visual inspection
- Integration: Terragrunt validates configuration before applying

## Test File Organization

**Location:**
- Go tests: co-located with source code: `<service>/files/internal/scraper/validate_test.go`
- Shell/Infrastructure: no test files (manual validation/health checks only)

**Naming:**
- Go: `*_test.go` suffix
- Script tests: `.sh` for check/validation scripts

**Structure:**
```
stacks/f1-stream/files/internal/scraper/
├── main.go
├── validate.go
└── validate_test.go   # Test file co-located
```

## Test Structure

**Go Table-Driven Tests:**

```golang
func TestContainsVideoMarkers(t *testing.T) {
	tests := []struct {
		name string
		body string
		want bool
	}{
		{
			name: "video tag",
			body: `<div><video src="stream.mp4"></video></div>`,
			want: true,
		},
		// ... more test cases
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := containsVideoMarkers(tt.body)
			if got != tt.want {
				t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
			}
		})
	}
}
```

**Patterns:**
- Slice of anonymous structs with `name`, input fields, and `want` for the expected result
- Loop with `t.Run(tt.name, ...)` for individual test case execution and reporting
- Descriptive test case names: `"video tag"`, `"HLS manifest reference"`, `"empty string"`
- Positive cases (top) and negative cases (bottom) separated with comments

**Bash Health Check Structure:**
```bash
check_nodes() {
  section 1 "Node Status"
  local nodes not_ready versions unique_versions detail=""

  nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; json_add "node_status" "FAIL" "Cannot reach cluster"; return 0; }
  # ... processing
  if [[ -n "$not_ready" ]]; then
    fail "NotReady nodes: $not_ready"
    json_add "node_status" "FAIL" "$detail"
  elif [[ "$unique_versions" -gt 1 ]]; then
    warn "Version mismatch..."
    json_add "node_status" "WARN" "$detail"
  else
    pass "All nodes Ready..."
    json_add "node_status" "PASS" "$detail"
  fi
}
```

**Patterns:**
- Each check function follows the same structure: setup → validation → status reporting
- Status reported via `pass()`, `warn()`, and `fail()` helper functions
- Optional JSON output via `json_add()` for programmatic consumption
- Inline error handling with `||` fallbacks and graceful degradation

## Mocking

**Framework:**
- Go: No mocking framework detected (table-driven tests use real function calls)
- Bash: External commands mocked implicitly (`KUBECONFIG` override, kubectl invoked through the `$KUBECTL` variable)

**Patterns (Go):**
- No mock objects or stubs
- Real function behavior tested directly
- Test data provided as input in struct fields

**Patterns (Bash):**
```bash
# Kubeconfig override allows testing against different clusters
KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; return 0; }
```

**What NOT to Mock:**
- Core functionality being tested (test actual behavior)
- Standard library functions (test integration)

**What to Mock (Bash):**
- External kubectl calls via variable indirection: allows a `KUBECONFIG` override
- Conditional output by flag: `--json` and `--quiet` change output, not behavior
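As a hedged sketch of the variable-indirection idea (the `get_nodes` function and the stub are hypothetical, not taken from the repo's scripts), a test can shadow `kubectl` with a shell function so `$KUBECTL` resolves to canned output:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: because KUBECTL is a variable, a test can point it
# at a stub instead of the real binary.
get_nodes() {
  $KUBECTL get nodes --no-headers 2>&1
}

# Stub: a shell function shadowing kubectl, returning canned output.
kubectl() { echo "node-1   Ready   v1.30.0"; }
KUBECTL="kubectl"

get_nodes   # prints "node-1   Ready   v1.30.0"
```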

## Fixtures and Factories

**Test Data (Go):**
- Inline strings in struct fields: HTML content, MIME types
- Example from `validate_test.go`:

```golang
{
	name: "HLS manifest reference",
	body: `var url = "https://cdn.example.com/live.m3u8";`,
	want: true,
},
```

**Location:**
- Embedded directly in the test file as struct field values
- No separate fixture files or factories

**Bash Fixtures:**
- Real cluster as fixture: tests run against the actual Kubernetes cluster
- No data files; tests fetch live state via kubectl

## Coverage

**Requirements:** None enforced (no coverage thresholds, targets, or CI/CD gates detected)

**View Coverage (Go):**
```bash
go test -cover ./...                      # Show coverage percentages
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out          # Open HTML report
```

**Note:** Coverage tools are not integrated into the CI/CD pipeline; manual checks only.

## Test Types

**Unit Tests (Go):**
- Scope: single-function validation
- Approach: table-driven with parameterized inputs
- Example: `TestContainsVideoMarkers()` tests HTML content detection
- Example: `TestIsDirectVideoContentType()` tests MIME type classification
- In file: `stacks/f1-stream/files/internal/scraper/validate_test.go`

**Integration Tests:**
- Bash health checks (`scripts/cluster_healthcheck.sh`) serve as integration tests
- Runs 24 separate checks against the live Kubernetes cluster:
  - Node status and readiness
  - Node resource utilization
  - Container metrics
  - Pod crash loops
  - Persistent volume health
  - DNS resolution
  - Networking
  - RBAC
  - Log aggregation
- Can run with the `--fix` flag for auto-remediation
- Can output JSON for CI integration

**E2E Tests:**
- Not formally implemented
- Manual validation via Terragrunt apply → cluster state verification

**Infrastructure Testing:**
- Terraform: `terraform validate` and `terraform plan` provide syntax/logic validation
- Application health: manual checks via scripts and `cluster_healthcheck.sh`
- No automated test suite for infrastructure code

## Common Patterns

**Async Testing (Go):**
- Not applicable (synchronous function testing only)

**Error Testing (Go):**
```golang
{
	name: "empty string",
	body: "",
	want: false,
},
```
- Negative test cases included in the same table
- Error/edge cases named descriptively: `"empty string"`, `"reddit link page"`
- Expected failure behavior verified: `want: false` for invalid inputs

**Error Reporting (Go):**
```golang
t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
```
- Formatted message includes: function name, input (truncated), actual, expected
- Test name automatically prefixed by `t.Run(tt.name, ...)`

**Status Reporting (Bash):**
- Color-coded status: `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
- Counters incremented per status
- Optional quiet mode (`--quiet`) suppresses PASS output
- Optional JSON output (`--json`) for CI integration
- Summary printed at the end: `$PASS_COUNT/$WARN_COUNT/$FAIL_COUNT`
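A minimal sketch of those helpers, assuming the counter and function names from the bullets above (the real script additionally handles colors and JSON output):

```shell
#!/usr/bin/env bash
# Minimal sketch of the pass/warn/fail counters and summary line.
PASS_COUNT=0; WARN_COUNT=0; FAIL_COUNT=0
QUIET=0  # --quiet would set this to 1 and suppress PASS lines

pass() { PASS_COUNT=$((PASS_COUNT + 1)); [[ "$QUIET" -eq 1 ]] || echo "[PASS] $1"; }
warn() { WARN_COUNT=$((WARN_COUNT + 1)); echo "[WARN] $1"; }
fail() { FAIL_COUNT=$((FAIL_COUNT + 1)); echo "[FAIL] $1"; }

pass "All nodes Ready"
warn "Version mismatch across nodes"
echo "Summary: $PASS_COUNT/$WARN_COUNT/$FAIL_COUNT"   # prints "Summary: 1/1/0"
```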

## Running Tests

**Go Tests:**
```bash
# From the service directory containing *_test.go
go test -v ./...
```

**Bash Health Checks:**
```bash
# Comprehensive checks
bash scripts/cluster_healthcheck.sh

# Quiet mode (WARN/FAIL only)
bash scripts/cluster_healthcheck.sh --quiet

# Auto-fix mode
bash scripts/cluster_healthcheck.sh --fix

# JSON output
bash scripts/cluster_healthcheck.sh --json

# Custom kubeconfig
bash scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```

**Terraform Validation:**
```bash
# Format check
terraform fmt -recursive

# Syntax validation
terraform validate

# Plan without apply
terraform plan

# From a stack directory
cd stacks/<service> && terragrunt plan
cd stacks/<service> && terragrunt apply --non-interactive
```

---

*Testing analysis: 2026-02-23*

@@ -57,7 +57,7 @@ resource "kubernetes_deployment" "roundcubemail" {
        spec {
          container {
            name  = "roundcube"
-           image = "roundcube/roundcubemail:1.6-apache"
+           image = "roundcube/roundcubemail:1.6.13-apache"
            # Uncomment me to mount additional settings
            # volume_mount {
            #   name = "imap-config"