docs: map existing codebase

This commit is contained in:
Viktor Barzin 2026-02-23 20:51:20 +00:00
parent 275eb5aec8
commit c8e9c41afc
8 changed files with 1475 additions and 1 deletions


@@ -0,0 +1,165 @@
# Architecture
**Analysis Date:** 2026-02-23
## Pattern Overview
**Overall:** Terragrunt-based IaC with per-service state isolation, using Kubernetes as the primary platform and Proxmox for VM infrastructure.
**Key Characteristics:**
- Monorepo containing ~70 service stacks with independent state files
- Declarative, GitOps-driven infrastructure using Terraform + Terragrunt
- DRY provider/backend configuration via root `terragrunt.hcl`
- Clear layering: platform (core/cluster services) → application stacks → shared modules
- Service decoupling with explicit dependencies via `dependency` blocks
- Resource governance through Kubernetes tier system (0-core through 4-aux)
## Layers
**Platform Layer (`stacks/platform/main.tf`):**
- Purpose: Core infrastructure services that enable all application stacks (22 modules)
- Location: `stacks/platform/`
- Contains: MetalLB, DBaaS, Redis, Traefik, Technitium DNS, Headscale VPN, Authentik SSO, RBAC, CrowdSec, Prometheus/Grafana/Loki monitoring, nginx reverse proxy, mailserver, GPU node configuration, Kyverno policy engine
- Depends on: Kubernetes cluster (declared via `stacks/infra` dependency), External secrets in `terraform.tfvars`
- Used by: All application stacks declare `dependency "platform"` to ensure platform is applied first
**Infrastructure Layer (`stacks/infra/main.tf`):**
- Purpose: VM template provisioning and Proxmox resource management
- Location: `stacks/infra/`
- Contains: K8s node templates via cloud-init, docker-registry VM, Proxmox VM lifecycle
- Depends on: Proxmox API credentials
- Used by: Platform stack depends on it to ensure infrastructure is ready
**Application Stacks (~70 services):**
- Purpose: User-facing and supplementary services (Nextcloud, Immich, Matrix, Ollama, etc.)
- Location: `stacks/<service>/main.tf` (102 total stacks)
- Contains: Kubernetes namespaces, Helm releases, raw Kubernetes resources (Deployments, StatefulSets, Services, PersistentVolumes)
- Depends on: Platform stack, shared TLS secret via `modules/kubernetes/setup_tls_secret`, optional NFS volumes
- Used by: Self-contained; declared dependencies control execution order
**Shared Modules:**
- **Kubernetes utilities** (`modules/kubernetes/`):
  - `ingress_factory/`: Reusable Traefik ingress + service template with anti-AI scraping, CrowdSec integration, rate limiting, authentication support
  - `setup_tls_secret/`: TLS certificate secret setup in namespaces
- **Terraform modules** (`modules/`):
  - `create-template-vm/`: Ubuntu cloud-init template VM provisioning (K8s and non-K8s variants)
  - `create-vm/`: VM instance creation from templates
  - `docker-registry/`: Docker registry pull-through cache configuration
## Data Flow
**Infrastructure Provisioning Flow:**
1. **Initialize**: Root `terragrunt.hcl` loads `terraform.tfvars` globally, generates provider/backend configs
2. **Infra Stack Apply**: `stacks/infra/` creates/updates Proxmox VMs and Kubernetes node templates
3. **Platform Apply**: `stacks/platform/` applies all ~22 core services (depends on infra stack)
4. **Service Apply**: Individual `stacks/<service>/` apply their resources (depend on platform stack)
Example dependency chain for Nextcloud:
```
stacks/infra/main.tf (VMs)
↓ (dependency)
stacks/platform/main.tf (Traefik, Redis, DBaaS, etc.)
↓ (dependency)
stacks/nextcloud/main.tf (Nextcloud Helm chart + storage)
```
**State Management:**
- Each stack has isolated state at `state/stacks/<service>/terraform.tfstate`
- Root `terragrunt.hcl` defines local backend: `path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"`
- Variables flow from `terraform.tfvars` → each stack's `terraform` block → Terraform execution
- Unused variables are silently ignored (Terraform 1.x behavior)
**Configuration Flow:**
1. User edits `terraform.tfvars` (encrypted via git-crypt)
2. Each stack includes root terragrunt config: `include "root" { path = find_in_parent_folders() }`
3. Root config injects `terraform.tfvars` as `required_var_files`
4. Stack-specific `main.tf` declares which variables it uses
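A minimal sketch of this root wiring, assuming the standard Terragrunt `remote_state` and `extra_arguments` mechanisms (the real root file also generates provider config and sets the kube config path):
```hcl
# Sketch only: backend path and var-file injection as described above.
remote_state {
  backend = "local"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
  }
}

terraform {
  extra_arguments "global_vars" {
    commands           = get_terraform_commands_that_need_vars()
    required_var_files = ["${get_repo_root()}/terraform.tfvars"]
  }
}
```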
## Key Abstractions
**Tier System:**
- Purpose: Resource governance via Kubernetes PriorityClasses, LimitRanges, ResourceQuotas
- Tiers: `0-core` (critical: ingress, DNS, auth) → `4-aux` (optional workloads)
- Applied via: Kyverno policy engine in `stacks/platform/modules/kyverno/`
- Usage: Every namespace/pod gets labeled with tier; Kyverno generates corresponding LimitRange + ResourceQuota
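A sketch of the convention (resource name and tier value are illustrative, assuming the per-stack `local.tiers` map):
```hcl
# Namespace opting into the tier system; Kyverno matches the label and
# generates the corresponding LimitRange + ResourceQuota.
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
    labels = {
      tier = local.tiers.aux # "4-aux"
    }
  }
}
```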
**Service Factory Pattern:**
- Purpose: Multi-tenant/multi-instance services (Actual Budget, Freedify)
- Pattern: Parent stack (`stacks/<service>/main.tf`) creates namespace + TLS secret, then calls `factory/` module multiple times
- Examples: `stacks/actualbudget/main.tf` calls `factory/` for viktor, anca, emo instances
- Each instance: Separate pod, service, NFS share, Cloudflare DNS entry
**Ingress Factory (`modules/kubernetes/ingress_factory/`):**
- Purpose: DRY, opinionated Traefik ingress pattern with security defaults
- Variables: `name`, `namespace`, `port`, `host`, `protected`, `anti_ai_scraping` (default true)
- Provides: Service, Ingress, CrowdSec exemptions, rate limiting, Authentik ForwardAuth integration, anti-AI middleware
- Anti-AI layers: Bot blocking → X-Robots-Tag → Trap links → Tarpit → Poison content cache
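A hedged sketch of a typical call, using only the variables listed above (values are illustrative):
```hcl
module "ingress" {
  source    = "../../modules/kubernetes/ingress_factory"
  name      = "example"
  namespace = "example"
  port      = 8080
  host      = "example.viktorbarzin.me"
  protected = true # put Authentik ForwardAuth in front of the service
  # anti_ai_scraping defaults to true; set to false to opt out per service
}
```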
**NFS Volume Pattern:**
- Purpose: Persistent storage for stateful services
- Pattern: Inline NFS volumes in pod specs (preferred over PV/PVC)
- Server: `10.0.10.15` (TrueNAS)
- Paths: `/mnt/main/<service>` or `/mnt/main/<service>/<instance>`
- Used by: ~60 services; registered in `secrets/nfs_directories.txt` (git-crypt encrypted)
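A sketch of the inline pattern as it would appear inside a `kubernetes_deployment` pod spec (service path hypothetical):
```hcl
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"            # TrueNAS
    path   = "/mnt/main/example-svc" # register in secrets/nfs_directories.txt
  }
}
```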
## Entry Points
**Terragrunt Root (`terragrunt.hcl`):**
- Location: `/Users/viktorbarzin/code/infra/terragrunt.hcl`
- Triggers: `cd stacks/<service> && terragrunt plan/apply --non-interactive`
- Responsibilities: Load providers, backend, `terraform.tfvars`, set kube config path
**Platform Stack (`stacks/platform/main.tf`):**
- Location: `stacks/platform/main.tf` (1000+ lines)
- Triggers: Applied before any service stack to ensure platform services exist
- Responsibilities: 22 module instantiations, tier definition, variable collection from tfvars
**Service Stacks (`stacks/<service>/main.tf`):**
- Location: `stacks/<service>/main.tf` (27,456 lines total across stacks, avg ~130 per stack)
- Triggers: `terragrunt apply --non-interactive` in service directory
- Responsibilities: Create namespace, setup TLS, instantiate Helm charts or raw K8s resources, configure storage
**Proxmox/Infra Stack (`stacks/infra/main.tf`):**
- Location: `stacks/infra/main.tf` (200+ lines)
- Triggers: Applied first to ensure VM infrastructure is available
- Responsibilities: VM template creation, VM instance lifecycle, containerd mirror config
**Factory Module (`stacks/<service>/factory/main.tf`):**
- Location: `stacks/actualbudget/factory/main.tf`, `stacks/freedify/factory/main.tf`
- Triggers: Called multiple times from parent `main.tf` with different `name` parameter
- Responsibilities: Single-instance deployment (pod, service, NFS share, ingress)
## Error Handling
**Strategy:** Declarative state reconciliation (Terraform/Kubernetes watch loop). No imperative error recovery.
**Patterns:**
- **Helm deployments**: `atomic = true` for rollback on failure
- **Terraform apply**: `--non-interactive` to prevent hanging on prompts
- **Cloud-init VM provisioning**: Embedded error logging in scripts; check `/var/log/cloud-init-output.log` on VM
- **Dependencies**: Explicit `dependency` blocks prevent applying child before parent
- **Validation**: `terraform plan` executed by CI before apply
- **Secrets**: git-crypt keeps `terraform.tfvars` and `secrets/` encrypted in the repo; locking prevents accidental plaintext commits
## Cross-Cutting Concerns
**Logging:** Loki + Alloy (DaemonSet collects container logs) configured in `stacks/platform/modules/monitoring/`
**Validation:**
- Terraform validation: `terraform validate` in CI/CD pipeline
- HCL formatting: `terraform fmt -recursive`
- Kyverno policies: Enforce resource requests, tier labels, pod security standards
**Authentication:**
- **Kubernetes API**: OIDC via Authentik (issuer: `https://authentik.viktorbarzin.me/application/o/kubernetes/`)
- **Traefik Ingress**: Authentik ForwardAuth when `protected = true` in ingress_factory
- **TLS**: Shared secret injected into all namespaces via `setup_tls_secret` module
**Rate Limiting:** Traefik middleware `default-rate-limit` (applied by ingress_factory unless `skip_default_rate_limit = true`)
**Anti-AI Scraping:** 5-layer defense (bot blocking → headers → trap links → tarpit → poison content) applied via `anti_ai_scraping = true` (default) in ingress_factory; disable per-service with `anti_ai_scraping = false`
---
*Architecture analysis: 2026-02-23*


@@ -0,0 +1,244 @@
# Codebase Concerns
**Analysis Date:** 2026-02-23
## Tech Debt
**MySQL Backup Rotation Not Implemented:**
- Issue: Backup rotation logic exists (comment at `stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size noted as 11MB, rotation deferred.
- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206)
- Impact: Backup directory could grow unbounded; no automated retention policy enforced. Manual cleanup required.
- Fix approach: Implement full rotation schedule using `find -mtime +N` or migrate to external backup solution (Velero, pgbackrest). Set up CronJob with proper retention (e.g., 14-day backups).
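One possible shape for that CronJob, sketched with the Terraform Kubernetes provider; the name, schedule, retention window, and backup path are assumptions, not existing resources:
```hcl
resource "kubernetes_cron_job_v1" "mysql_backup_rotation" {
  metadata {
    name      = "mysql-backup-rotation"
    namespace = "dbaas"
  }
  spec {
    schedule = "0 3 * * *" # daily at 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name    = "rotate"
              image   = "busybox:1.36"
              # Delete backups older than 14 days (assumed retention).
              command = ["sh", "-c", "find /backups -type f -mtime +14 -delete"]
              volume_mount {
                name       = "backups"
                mount_path = "/backups"
              }
            }
            volume {
              name = "backups"
              nfs {
                server = "10.0.10.15"
                path   = "/mnt/main/mysql"
              }
            }
          }
        }
      }
    }
  }
}
```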
**PostgreSQL Major Version Upgrade Blocked:**
- Issue: Comment at `stacks/platform/modules/dbaas/main.tf:718` indicates PostgreSQL 17.2 requires `pg_upgrade` to data directory but is not implemented.
- Files: `stacks/platform/modules/dbaas/main.tf` (line 718)
- Impact: Cannot upgrade PostgreSQL from 16-master to 17.2. When upgrade is needed, requires manual pg_upgrade procedure; high downtime risk.
- Fix approach: Implement pg_upgrade CronJob or StatefulSet init container that performs in-place upgrade. Test migration path with backup first.
**TP-Link Gateway Reverse Proxy Not Functional:**
- Issue: Reverse proxy module for TP-Link gateway marked as "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`.
- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95)
- Impact: Gateway access over HTTPS or HTTP routing non-functional. Unknown scope of impact on dependent services.
- Fix approach: Either complete reverse proxy implementation (Traefik/Nginx config) or document why it's disabled. Clarify if gateway is still accessible via HTTP or LAN-only.
**WireGuard Firewall Rules Incomplete:**
- Issue: Client firewall restrictions not written at `terraform.tfvars:430`. Only placeholder exists.
- Files: `terraform.tfvars` (lines 430-434)
- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can access cluster APIs if firewall rules not enforced at kernel level.
- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"). Test with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist.
## Known Bugs & Issues
**Immich Database Compatibility Mismatch:**
- Symptoms: Custom PostgreSQL image version mismatch between Immich PostgreSQL pod and dbaas PostgreSQL. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is 16-master with PostGIS/PgVector mix.
- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717)
- Trigger: If Immich database is migrated to shared dbaas PostgreSQL, extension version incompatibility will cause failures.
- Workaround: Keep Immich on isolated PostgreSQL 15 with Immich-specific extensions. If consolidation needed, test extension compatibility first.
**Realestate-Crawler Latest Image Tag Ignores Updates:**
- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`.
- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82)
- Trigger: New versions of `immoweb:latest` will never be deployed. Terraform ignores image updates; manual image pull/push required.
- Workaround: Use Diun annotations to track image updates. Consider using version-pinned tags instead of `:latest`. Remove `ignore_changes` if auto-updates desired.
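For reference, the current pattern next to a pinned alternative (the version tag shown is hypothetical):
```hcl
# Current: Terraform ignores image drift, so new :latest builds never roll out.
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].container[0].image]
}

# Alternative: pin a version and drop ignore_changes so Terraform rolls out updates.
# image = "viktorbarzin/immoweb:v1.2.3" # hypothetical tag
```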
## Security Considerations
**OpenClaw Has Cluster-Admin Permissions:**
- Risk: OpenClaw ServiceAccount granted unrestricted `cluster-admin` ClusterRoleBinding at `stacks/openclaw/main.tf:41-54`.
- Files: `stacks/openclaw/main.tf` (lines 34-55)
- Current mitigation: `dangerouslyDisableDeviceAuth = true` in config (line 89) disables device auth, so protection relies solely on network access control.
- Recommendations:
- Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g., `get/list/watch pods`, `exec into pods`, `apply resources in specific namespaces`).
- Re-enable device auth or implement mTLS between OpenClaw and operators.
- Audit OpenClaw logs for unauthorized API calls (enable API server audit logs).
**Git-Crypt Key Mounted as ConfigMap:**
- Risk: git-crypt key at `stacks/openclaw/main.tf:68-76` stored as plain-text ConfigMap data. Any pod on cluster can read it (unless RBAC enforces secrets-only access).
- Files: `stacks/openclaw/main.tf` (lines 68-76)
- Current mitigation: None; ConfigMap is world-readable by default.
- Recommendations:
- Move git-crypt key to Kubernetes Secret instead of ConfigMap.
- Add RBAC policy restricting secret read to openclaw namespace only.
- Consider external secret management (Authentik-backed secret injection, Sealed Secrets).
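A sketch of the first recommendation, assuming a sensitive `git_crypt_key` variable (all names illustrative):
```hcl
resource "kubernetes_secret" "git_crypt_key" {
  metadata {
    name      = "git-crypt-key"
    namespace = "openclaw"
  }
  # The provider base64-encodes data values automatically.
  data = {
    "git-crypt.key" = var.git_crypt_key
  }
  type = "Opaque"
}
```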
**SSH Private Key Stored as Secret:**
- Risk: SSH private key for OpenClaw stored at `stacks/openclaw/main.tf:57-66` as unencrypted Secret. Readable by any pod with secret access.
- Files: `stacks/openclaw/main.tf` (lines 57-66)
- Current mitigation: Secret only readable by openclaw namespace (if RBAC enforced); encryption at rest not confirmed.
- Recommendations:
- Rotate SSH key regularly; consider using ed25519 keys (shorter, stronger).
- Audit Secret access via Kubernetes audit logs.
- Use external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.
**WireGuard VPN Clients Unrestricted:**
- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules defined. No per-client segmentation.
- Files: `terraform.tfvars` (lines 430-434)
- Current mitigation: Attempted iptables rules commented out; not enforced.
- Recommendations:
- Define explicit client restrictions in WireGuard firewall script (uncomment/complete lines 433-434).
- Implement deny-by-default firewall (drop all, then allow specific routes).
- Consider separate WireGuard interfaces for different trust levels (admin vs. guest).
**Multiple `:latest` Image Tags in Production:**
- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others).
- Files: Multiple stacks (identified via a repo-wide grep for `:latest`).
- Current mitigation: Diun annotations track updates but don't auto-pull; images are immutable but unversioned.
- Recommendations:
- Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`).
- Use Diun to track new releases and trigger automated testing in staging.
- Update CI/CD pipeline to require version tags for production deployments.
## Performance Bottlenecks
**Insufficient Health Probes on Critical Services:**
- Problem: Only 14 services have liveness/readiness probes out of 70+ services. Missing probes on databases (MySQL, PostgreSQL, Redis), ingress, auth.
- Files: All stacks (identified via grep: 14 instances of liveness/readiness out of 70+ services).
- Cause: Without probes, Kubernetes never restarts unhealthy pods, so cascading failures stay silent.
- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers). Use TCP/HTTP probes appropriate to each service.
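Sketch of probe blocks as they would sit inside a `container` block of a `kubernetes_deployment`; ports, paths, and timings are illustrative and must be tuned per service:
```hcl
liveness_probe {
  tcp_socket {
    port = 5432 # e.g., a database accepting TCP connections
  }
  initial_delay_seconds = 30
  period_seconds        = 10
}

readiness_probe {
  http_get {
    path = "/healthz" # hypothetical health endpoint
    port = 8080
  }
  period_seconds = 5
}
```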
**Pod Disruption Budgets Missing:**
- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
- Files: All stacks (need comprehensive PodDisruptionBudget coverage).
- Cause: PDBs are optional; many assume single-replica stateless services won't need them.
- Improvement path: Add PDB with `minAvailable: 1` to all services with `replicas > 1`. For single-replica services, ensure they're marked as non-critical (lower PriorityClass) or accept downtime during node maintenance.
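A minimal PDB sketch for a multi-replica service (names and labels hypothetical):
```hcl
resource "kubernetes_pod_disruption_budget_v1" "example" {
  metadata {
    name      = "example"
    namespace = "example"
  }
  spec {
    min_available = "1" # keep at least one pod during voluntary disruptions
    selector {
      match_labels = {
        app = "example"
      }
    }
  }
}
```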
**Resource Requests Sparse, Limits Missing:**
- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults but CPU limits often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`).
- Cause: Request/limit tuning requires load testing; defaults used instead.
- Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage, limits to P99. Monitor via Prometheus and adjust quarterly.
**Large Terraform Modules (900+ lines):**
- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others > 450 lines.
- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines).
- Cause: Monolithic resource definitions; hard to navigate and test.
- Improvement path: Split large modules into sub-modules (e.g., split `dbaas/` into `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration.
## Fragile Areas
**Immich Machine Learning GPU Dependency:**
- Files: `stacks/immich/main.tf` (lines 380-450).
- Why fragile: GPU workload (`immich-machine-learning-cuda`) requires Tesla T4 on k8s-node1. If GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently (no fallback). Single GPU point of failure.
- Safe modification: Add `nodeAffinity` to prefer GPU but allow non-GPU fallback (degraded mode). Implement health checks on GPU availability (`nvidia-smi` probe). Test GPU failure scenario before production use.
- Test coverage: No tests for GPU unavailability; assumes GPU always available.
**Nextcloud Backup/Restore Procedures Manual:**
- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps).
- Why fragile: Backup/restore scripts are ConfigMap-based; no automation. Restoration requires manual `kubectl exec` and script execution. No tested recovery procedure.
- Safe modification: Implement automated backup via Velero or CSI snapshots. Test restore procedure monthly via staged environment.
- Test coverage: No automated backup validation; scripts untested.
**NFS Dependency for Data Persistence:**
- Files: 126 references to NFS volumes across all stacks.
- Why fragile: All stateful data depends on NFS server at `10.0.10.15`. If NFS becomes unavailable, all services lose data immediately (no local caches). No fallback storage.
- Safe modification: Implement NFS client-side read caching (Linux NFS mount options `ac,acregmin=3600`). Monitor NFS availability via Prometheus alerts (Mount point offline). Test NFS failover procedure (if replica NFS exists).
- Test coverage: No chaos engineering tests for NFS unavailability.
**Istio Injection Disabled Cluster-Wide:**
- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` on namespace labels.
- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production.
- Test coverage: No mTLS validation; assumes all pods on same network are trusted.
**PostgreSQL Custom Image Not Tracked:**
- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`.
- Why fragile: Custom build at Docker Hub with PostGIS + PgVector extensions. No version tag; `:master` tag is mutable. Upstream extension versions unknown.
- Safe modification: Pin to semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with Dockerfile tracked in git. Test extension versions against application requirements.
- Test coverage: No tests for extension availability or version compatibility.
## Scaling Limits
**Single-Replica Critical Services:**
- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
- Limit: Node failure causes immediate service outage; with default eviction timeouts, Kubernetes takes 5+ minutes to reschedule the pod.
- Scaling path: Increase critical service replicas to 3 (quorum). Add pod anti-affinity to spread across nodes. Implement PodDisruptionBudget with `minAvailable: 2`.
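Sketch of the anti-affinity piece as it would sit in a deployment's pod spec (label values hypothetical):
```hcl
affinity {
  pod_anti_affinity {
    required_during_scheduling_ignored_during_execution {
      label_selector {
        match_labels = {
          app = "example"
        }
      }
      # Spread replicas across distinct nodes.
      topology_key = "kubernetes.io/hostname"
    }
  }
}
```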
**GPU Capacity Bottleneck:**
- Current capacity: 1 Tesla T4 GPU on k8s-node1.
- Limit: Immich ML + Ebook2Audiobook + Ollama all compete for single GPU. Queue time 10+ minutes for CPU-bound inference tasks.
- Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).
**NFS Storage Capacity:**
- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
- Limit: Immich (image library), Calibre (ebooks), Dawarich (location history) grow unbounded. When storage full, writes fail; services degrade.
- Scaling path: Monitor NFS capacity monthly (`df -h`). Set up Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB Immich/month).
**MySQL/PostgreSQL Connection Pool:**
- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling. Default pool size likely 100-200 connections.
- Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed pool. New connections queue or fail.
- Scaling path: Monitor PgBouncer pool utilization (Prometheus metric `pgbouncer_pools_used_connections`). Increase pool size if > 80% utilization. Consider read replicas for read-heavy workloads.
**API Rate Limiting & Bandwidth:**
- Current capacity: Services exposed via Traefik ingress. No global rate limiting documented.
- Limit: External tools (Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior possible.
- Scaling path: Implement Traefik rate limiting middleware (Prometheus-aware). Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
## Dependencies at Risk
**Redis Stack `:latest` Tag:**
- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed; breaking changes possible.
- Impact: Unexpected version upgrade could introduce incompatibilities with clients expecting specific command set or module versions.
- Migration plan: Pin to specific Redis Stack version (e.g., `:7.2-rc1`). Test version upgrades in staging first. Monitor Redis logs for deprecated command warnings.
**Immich Version Upgrade Risk:**
- Risk: `stacks/immich/main.tf` pins to `v2.5.6` but Immich frequently releases patch versions. Database migrations can cause downtime.
- Impact: If Immich version upgrades without testing, database migrations could fail or hang (no rollback mechanism).
- Migration plan: Pin to specific patch versions (e.g., `v2.5.6`, not `v2.5`). Test Immich upgrades in staging first. Maintain backup before upgrading.
**Unsupported MySQL 9.2.0:**
- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.2 is an Innovation release with a short support window, not an LTS.
- Impact: Innovation releases are not recommended for long-lived production databases: short support lifecycle, aggressive deprecations, no long-term support.
- Migration plan: Migrate to MySQL 8.4 LTS or 9.0 GA (stable). Test data migration first. Plan for gradual rollout.
**Python Timeouts in Monitoring Scripts:**
- Risk: `stacks/platform/modules/nvidia/main.tf` uses hardcoded `timeout=10` for HTTP requests and subprocess calls. Slow network conditions will fail.
- Impact: GPU monitoring will fail if network is slow or unavailable. Silent failures possible.
- Migration plan: Implement exponential backoff and retry logic (e.g., `tenacity` library). Increase timeout to 30s for unreliable networks. Log timeouts for debugging.
## Missing Critical Features
**No Disaster Recovery Plan:**
- Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster.
- Blocks: If cluster data lost, recovery would be manual and time-consuming. No RTO/RPO defined.
- Impact: Data loss risk > 24 hours to recover.
**No Secrets Rotation Policy:**
- Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule.
- Blocks: If key leaked, manual intervention required to rotate across all services.
- Impact: Leaked credentials persist until discovery.
**No Cross-Cluster Failover:**
- Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
- Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage.
- Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision).
## Test Coverage Gaps
**No Infrastructure Testing:**
- What's not tested: Terraform applies, Helm charts, manifests only validated via `terraform plan`. No `terratest`, no functional tests of deployed services.
- Files: All stacks (no test files found).
- Risk: Typos, variable misconfigurations, missing dependencies not caught until production apply.
- Priority: High — add `terratest` to validate Terraform. Test critical paths (database connection, ingress routing).
**No Chaos Engineering Tests:**
- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
- Files: All stacks (no chaos tests found).
- Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested.
- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.
**No Backup Restoration Tests:**
- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but never executed.
- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`.
- Risk: Backups corrupt or unusable when needed. RPO > 24 hours if discovery slow.
- Priority: High — monthly restore-to-staging test. Automate backup validation.
**No Security Scanning for Vulnerabilities:**
- What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
- Files: All stacks, all container images.
- Risk: Known vulnerabilities deployed to production. No supply chain security.
- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.
---
*Concerns audit: 2026-02-23*


@@ -0,0 +1,192 @@
# Coding Conventions
**Analysis Date:** 2026-02-23
## Naming Patterns
**Terraform Files:**
- `main.tf` - Primary resource definitions and module calls
- `terragrunt.hcl` - Stack-specific Terragrunt configuration
- `variables.tf` - Variable declarations for a stack
- `providers.tf` - Generated by Terragrunt root `terragrunt.hcl`
- `backend.tf` - Generated by Terragrunt for state backend configuration
**Terraform Variables:**
- snake_case for variable names: `var.tls_secret_name`, `var.dbaas_root_password`
- snake_case for resource names: `resource "kubernetes_namespace" "nextcloud"`
- snake_case for local values: `local.tiers`
- UPPERCASE for environment-like globals in shell: `KUBECONFIG_PATH`, `PASS_COUNT`
**Resource/Module Names:**
- kebab-case for Kubernetes resources: `nextcloud`, `whiteboard`, `kms-web-page`
- Underscores in resource names mark module-internal (private) resources
- Descriptive names matching functionality: `kubernetes_namespace`, `kubernetes_deployment`, `helm_release`
**Shell Functions:**
- snake_case for function names: `parse_args()`, `count_lines()`, `check_nodes()`
- UPPERCASE for utility color constants: `RED`, `GREEN`, `YELLOW`, `BLUE`, `BOLD`, `NC`
**Go Package/Test Names:**
- Package-level test functions: `TestContainsVideoMarkers()`, `TestIsDirectVideoContentType()`
- Table-driven test pattern with struct fields: `name`, `body`, `ct`, `want`
## Code Style
**Terraform Formatting:**
- Use `terraform fmt -recursive` for consistent formatting
- No explicit linter/formatter config file (tflint/terraform-lint not present)
- Indentation: 2 spaces (standard Terraform convention)
- Multi-line strings use heredoc syntax: `<<EOT ... EOT` for YAML/config blocks
**Bash Script Style:**
- Shebang: `#!/usr/bin/env bash`
- Safety flags: `set -euo pipefail` (exit on error, undefined vars, pipe failures)
- Comments use `# ---` separators for section dividers and for grouping related variables/functions
- One-liner functions defined as: `function_name() { [[ condition ]] && action; }`
- Multiline functions use explicit function body with local keyword for variables
**Terraform Style:**
- Comments for major sections use `# =============================================================================`
- Comments for subsections use shorter `# ---` dash dividers
- Inline comments explain why, not what: `# anything secret is fine` (explaining arbitrary choice)
- Module calls include comments above describing purpose: `# --- Core ---`, `# --- dbaas ---`
## Import Organization
**Terraform:**
- Locals (tier definitions) defined at top of main.tf
- Variables declared in order: core/required first, then by feature area (dbaas, traefik, etc.)
- Modules called after variables, grouped by functional area with comment headers
- Resources defined after modules
**Go:**
- Standard imports from `testing` package
- No grouping (single simple import)
**Bash:**
- Source definitions at top (colors, globals, helper functions)
- Argument parsing in dedicated `parse_args()` function
- Main logic organized by check sections with `section()` calls
## Error Handling
**Terraform:**
- No explicit error handling (declarative; errors cause apply failure)
- Dependency management via `depends_on` for explicit ordering
- `dependency` blocks in terragrunt for cross-stack dependencies
- `skip_outputs = true` used when only needing ordering, not outputs
**Bash:**
- Inline error checks: `output=$(command 2>&1) || { fail "message"; return 0; }`
- `set -euo pipefail` prevents silent failures and undefined var issues
- Error status captured: `$?` implicit via `||` pattern
- Graceful degradation with fallback values or skippable steps
**Go:**
- Standard testing error reporting: `t.Errorf()` with formatted messages
- Table-driven test pattern allows multiple related test cases
- Error messages include actual vs expected: `got = %v, want = %v`
## Logging
**Framework:** Not formally configured; uses `echo` and `echo -e` for output
**Bash Logging Patterns:**
- Color-coded output with status prefixes: `${BLUE}[INFO]${NC}`, `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
- Helper functions: `info()`, `pass()`, `warn()`, `fail()` - each increments counters and respects `--quiet` flag
- Section headers: `section()` for verbose output, `section_always()` for always-shown sections
- Conditional logging: functions check `$JSON`, `$QUIET` flags and skip output as needed
- JSON output option available via `json_add()` for machine-readable logging
- Detail strings accumulated in variables for final reporting
**Terraform Logging:**
- Relies on Terraform's built-in CLI output
- Human-readable variable values in descriptions (Terraform renders these on errors)
## Comments
**When to Comment:**
Terraform:
- Section dividers: Major logical groups separated by `# =============================================================================`
- Feature group headers: `# --- Feature Name ---` before variable/module blocks
- Commented-out code: Temporarily disabled resources/modules include explanation (e.g., "Do not use until issue #X is solved")
- Clarifying arbitrary choices: `# anything secret is fine` explains non-obvious variable usage
Bash:
- Function-level comments: Each check function has purpose on first line
- Complex logic: Comments before conditional blocks explain intent
- Inline comments for edge cases: `# Skip nodes where metrics are not yet available`
- Header comments: Scripts include usage documentation at top
**JSDoc/TSDoc:**
- Not used in this codebase (Terraform, Bash, Go only)
## Function Design
**Size:**
- Terraform modules typically 20-50 lines for simple services, variable declaration blocks 30-100+ lines
- Bash functions average 20-40 lines, check functions 10-30 lines
- Go test functions 10-60 lines (table + loop)
**Parameters:**
- Terraform: via `variable` declarations and module input variables
- Bash: positional parameters passed via `$1`, `$2`, etc. with validation in `parse_args()`
- Go: test functions accept `*testing.T` parameter
**Return Values:**
- Terraform: no explicit returns; resource state is the "return"
- Bash: `return 0` for success; values returned implicitly via `echo` output; status codes for error handling
- Go: functions tested for boolean returns or calculated values
**Variables:**
- Terraform: module variables, locals, and resource attributes (computed values)
- Bash: Global state tracked via counters (`PASS_COUNT`, `WARN_COUNT`, `FAIL_COUNT`), local variables in functions with `local` keyword
- Go: table-driven tests use struct fields (no getter/setter pattern)
## Module Design
**Exports:**
- Terraform: outputs typically omitted unless another stack depends on them (implicit via dependency blocks)
- Modules called with `source = "./modules/<name>"` or `source = "../../modules/kubernetes/<name>"`
- Module version pinning used for Terraform registry modules: `version = "0.1.5"`
**Barrel Files:**
- Not applicable (no aggregating re-exports in this codebase)
- Directories: `stacks/<service>/` is a unit, `stacks/platform/modules/<service>/` groups related modules
**Module Organization:**
- Single responsibility per module directory
- Each module typically contains: `main.tf` (resources) and optional `variables.tf` for input variables
- Shared Kubernetes utility modules in `modules/kubernetes/`: `ingress_factory/`, `setup_tls_secret/`
- Platform services grouped in `stacks/platform/modules/<service>/`
## Special Patterns
**Locals for Configuration:**
- Tier definitions expressed as a locals map; each service stack duplicates the same map locally:
```hcl
locals {
  tiers = {
    core    = "0-core"
    cluster = "1-cluster"
    gpu     = "2-gpu"
    edge    = "3-edge"
    aux     = "4-aux"
  }
}
```
- Tier applied to `kubernetes_namespace` labels and `priority_class_name` for resource governance
**Inline Config Blocks:**
- YAML/config data stored in `<<EOT ... EOT` heredoc blocks within `data` maps
- Example: MetalLB address pool config in ConfigMap data
**File Inclusion:**
- `templatefile()` used for dynamic YAML values: `templatefile("${path.module}/chart_values.yaml", { var1 = value })`
- `file()` used for static file content in ConfigMap data
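A sketch combining both patterns (file names and resource names assumed):
```hcl
resource "helm_release" "example" {
  name  = "example"
  chart = "example"
  values = [
    # Render chart values with interpolated variables.
    templatefile("${path.module}/chart_values.yaml", {
      host = "example.viktorbarzin.me"
    })
  ]
}

resource "kubernetes_config_map" "static" {
  metadata {
    name      = "static-config"
    namespace = "example"
  }
  data = {
    # Static file content, no interpolation.
    "config.yaml" = file("${path.module}/config.yaml")
  }
}
```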
---
*Convention analysis: 2026-02-23*


@@ -0,0 +1,210 @@
# External Integrations
**Analysis Date:** 2026-02-23
## APIs & External Services
**Cloudflare:**
- DNS management (public domain `viktorbarzin.me`)
- Tunnel for public HTTPS access
- Account ID: `cloudflare_account_id` in tfvars
- SDK/Client: `cloudflare/cloudflare` Terraform provider v4.52.5
- Auth: API token stored in `cloudflare_api_key`, email in `cloudflare_email`, zone ID in `cloudflare_zone_id`, tunnel ID in `cloudflare_tunnel_id`
- Implementation: `stacks/platform/modules/cloudflared/` deploys Cloudflare tunnel daemon
**GitHub:**
- Git repository hosting and CI/CD webhook source
- Webhook endpoint: `https://webhook.viktorbarzin.me/` (handled by `stacks/webhook_handler/`)
- Auth: Git token in `webhook_handler_git_token` (terraform.tfvars)
- User: `webhook_handler_git_user` (terraform.tfvars)
- SSH key: `webhook_handler_ssh_key` for Git operations (secret in K8s)
**Facebook Messenger:**
- Chatbot integration via webhook
- Webhook endpoint: `https://webhook.viktorbarzin.me/` (receives webhook_handler_fb_*)
- Auth tokens: `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret` (all in tfvars)
**Slack:**
- Alert routing and notifications
- Webhook URL: `alertmanager_slack_api_url` (terraform.tfvars)
- Integration: Alertmanager alerts from `stacks/platform/modules/monitoring/` sent to Slack
- CrowdSec integration: Security events to Slack via `stacks/platform/modules/crowdsec/`
**Hetrix Tools:**
- Uptime monitoring service
- Status page redirects: `https://hetrixtools.com/r/38981b548b5d38b052aca8d01285a3f3/` and `https://hetrixtools.com/r/2ba9d7a5e017794db0fd91f0115a8b3b/`
- Implementation: Traefik middleware redirect in `stacks/platform/modules/monitoring/main.tf`
**Tiny Tuya:**
- Smart device control via tuya-bridge
- Auth: `tiny_tuya_service_secret` (terraform.tfvars)
**Mailgun:**
- SMTP relay for outgoing mail (primary relay host)
- Relay: `[smtp.eu.mailgun.org]:587` (Postfix DEFAULT_RELAY_HOST)
- Auth: SASL credentials in `sasl_passwd` (mailserver config)
- Alternative: SendGrid (commented out, previously used)
**Home Assistant:**
- Home automation integration
- API token: `haos_api_token` (terraform.tfvars)
- Access: `https://ha-london.viktorbarzin.me`, `https://ha-sofia.viktorbarzin.me`
**Proxmox:**
- Virtualization platform for VM provisioning
- Host: `192.168.1.127:8006` (`proxmox_pm_api_url`)
- Auth: API token ID `terraform-prov@pve!terrform-prov`, secret in tfvars
- Provider: `telmate/proxmox` v3.0.2-rc07
- Access: IDRAC credentials for physical server monitoring (`idrac_host`, `idrac_username`, `idrac_password`)
## Data Storage
**Databases:**
- MySQL 9.2.0
  - Connection: `mysql.dbaas.svc.cluster.local:3306` (K8s internal)
  - Client: Direct port access (no ORM in core infrastructure)
  - Root password: `dbaas_root_password` (tfvars)
  - Storage: NFS PV at `/mnt/main/mysql`
- PostgreSQL 16.4-bullseye (with PostGIS + PGVector)
  - Connection: `postgresql.dbaas:5432` (K8s internal)
  - Connection via PgBouncer: `pgbouncer.authentik:6432` (Authentik only)
  - Root password: `dbaas_postgresql_root_password` (tfvars)
  - Root password for PgBouncer: `pgbouncer_root_password` (tfvars)
  - Admin UI: PgAdmin at `pma.viktorbarzin.me`
  - PgAdmin password: `dbaas_pgadmin_password` (tfvars)
  - Storage: NFS PV at `/mnt/main/postgresql`
**File Storage:**
- NFS (Primary)
  - Host: `10.0.10.15` (TrueNAS)
  - Mount path: `/mnt/main/`
  - Subdirectories: per-service (e.g., `/mnt/main/immich/`, `/mnt/main/affine/`, `/mnt/main/mailserver/`, etc.)
  - Configuration: `secrets/nfs_directories.txt` (git-crypt encrypted)
  - Export script: `secrets/nfs_exports.sh` (updates TrueNAS exports)
**Caching:**
- Redis (`redis/redis-stack:latest`)
  - Connection: `redis.redis.svc.cluster.local` (K8s internal; no explicit port in code)
  - Databases: DB 2 (Gramps Web broker), DB 3 (Gramps Web rate limiting)
  - Storage: Persistent volume for data durability
  - Implementation: `stacks/platform/modules/redis/main.tf`
## Authentication & Identity
**Auth Provider:**
- Authentik (self-hosted OIDC/OAuth2 identity provider)
  - URL: `https://authentik.viktorbarzin.me`
  - API: `/api/v3/` endpoint
  - Token: `authentik_api_token` (terraform.tfvars)
  - Database: PostgreSQL via `postgresql.dbaas:5432` (also PgBouncer at `pgbouncer.authentik:6432`)
  - Secret key: `authentik_secret_key` (terraform.tfvars)
  - Postgres password: `authentik_postgres_password` (terraform.tfvars)
  - K8s OIDC: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
  - Implementation: `stacks/platform/modules/authentik/main.tf` + Helm chart
  - Traefik integration: ForwardAuth via `protected = true` in ingress_factory
**RBAC:**
- Kubernetes API auth via Authentik OIDC
- SSH keys: `ssh_private_key` (terraform.tfvars)
- Implementation: `stacks/platform/modules/rbac/` + `stacks/platform/modules/k8s-portal/`
## Monitoring & Observability
**Error Tracking:**
- None detected - alerts routed to Slack instead
**Metrics:**
- Prometheus - Time series database
  - Scrape endpoints: cluster nodes, services, Proxmox IDRAC, Tuya devices, Home Assistant
  - Implementation: `stacks/platform/modules/monitoring/`
  - Health check: CronJob monitors the prometheus-server pod and alerts to `https://webhook.viktorbarzin.me/fb/message-viktor` if down
**Logs:**
- Loki 3.6.5 (single binary) + Alloy v1.13.0 (DaemonSet collector)
  - Retention: 7 days
  - Storage: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
  - Alerting: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)
**Visualization:**
- Grafana
  - Database: PostgreSQL via dbaas
  - Admin password: `grafana_admin_password` (tfvars)
  - DB password: `grafana_db_password` (tfvars)
**Status Pages:**
- Hetrix Tools (external uptime monitoring)
- Uptime Kuma (self-hosted, `stacks/platform/modules/uptime-kuma/`)
## CI/CD & Deployment
**Hosting:**
- Proxmox 8.x (hypervisor)
- Kubernetes 1.34.2 (application platform)
- Cloudflare Tunnel (public ingress)
**CI Pipeline:**
- Woodpecker CI (self-hosted, `stacks/woodpecker/`)
  - Hosted at: `https://ci.viktorbarzin.me`
  - Config: `.woodpecker/` in repo root
  - Triggers: Git push, scheduled jobs
  - Applies the platform stack automatically on merge to master
**GitOps:**
- Webhook-handler service: receives GitHub webhooks, triggers deployments
  - Endpoint: `https://webhook.viktorbarzin.me/`
  - Auth: Secret token `webhook_handler_secret` (tfvars)
  - Can update K8s deployments via RBAC
  - Implementation: `stacks/webhook_handler/main.tf`, image `viktorbarzin/webhook-handler:latest`
## Environment Configuration
**Required env vars (terraform.tfvars - git-crypt encrypted):**
- `cloudflare_api_key`, `cloudflare_email`, `cloudflare_zone_id`, `cloudflare_tunnel_id`, `cloudflare_tunnel_token`
- `dbaas_root_password`, `dbaas_postgresql_root_password`, `dbaas_pgadmin_password`
- `authentik_secret_key`, `authentik_postgres_password`, `authentik_api_token`
- `proxmox_pm_api_url`, `proxmox_pm_api_token_id`, `proxmox_pm_api_token_secret`
- `alertmanager_slack_api_url`, `alertmanager_account_password`
- `webhook_handler_secret`, `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret`, `webhook_handler_git_token`, `webhook_handler_git_user`, `webhook_handler_ssh_key`
- `vaultwarden_smtp_password`, `mailserver_accounts`, `postfix_account_aliases`, `sasl_passwd`
- `crowdsec_enroll_key`, `crowdsec_db_password`, `crowdsec_dash_api_key`, `crowdsec_dash_machine_id`, `crowdsec_dash_machine_password`
- `headscale_config`, `headscale_acl`
- `monitoring_idrac_username`, `monitoring_idrac_password`, `tiny_tuya_service_secret`, `haos_api_token`, `pve_password`, `grafana_admin_password`, `grafana_db_password`
- `k8s_users` (map of SSH keys for K8s RBAC)
**Secrets location:**
- Primary: `terraform.tfvars` (git-crypt encrypted at rest, decrypted during `terragrunt apply`)
- K8s Secrets: Created by Terraform from tfvars into namespaces (see `stacks/platform/modules/*/main.tf`)
- TLS certificates: `secrets/` directory (symlinked into stacks via `ln -s ../../secrets secrets`)
## Webhooks & Callbacks
**Incoming (Webhook endpoints):**
- GitHub webhooks: `https://webhook.viktorbarzin.me/` (deployment triggers)
- Facebook Messenger webhooks: `https://webhook.viktorbarzin.me/` (chatbot messages)
- Health alerts: CronJob sends to `https://webhook.viktorbarzin.me/fb/message-viktor` if Prometheus is down
**Outgoing:**
- Alertmanager → Slack webhook: `alertmanager_slack_api_url`
- CrowdSec → Slack webhook: same as alertmanager
- Hetrix Tools status pages: redirect middleware instead of direct integration
## Integration Patterns
**Terraform Secrets Injection:**
- Template pattern: `templatefile("${path.module}/values.yaml", { var1 = var.value1, ... })`
- Direct env injection: K8s ConfigMap/Secret created from tfvars variables
- Example: `stacks/platform/modules/crowdsec/main.tf` renders Helm values with interpolated secrets
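A sketch of the direct-injection variant (variable and resource names are assumptions):
```hcl
variable "example_api_token" {
  type      = string
  sensitive = true # sourced from git-crypt-encrypted terraform.tfvars
}

resource "kubernetes_secret" "example" {
  metadata {
    name      = "example-credentials"
    namespace = "example"
  }
  # The provider base64-encodes data values automatically.
  data = {
    API_TOKEN = var.example_api_token
  }
}
```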
**Internal Service Discovery:**
- DNS: Services accessible via `<name>.<namespace>.svc.cluster.local`
- Examples: `mysql.dbaas.svc.cluster.local`, `redis.redis.svc.cluster.local`, `postgresql.dbaas.svc.cluster.local`
**External Service Access:**
- Cloudflare Tunnel: Provides public HTTPS for services (no direct internet access needed)
- Traefik Ingress: Routes external traffic to internal K8s services
- Technitium (internal DNS) for `.lan` domain resolution
---
*Integration audit: 2026-02-23*


@@ -0,0 +1,129 @@
# Technology Stack
**Analysis Date:** 2026-02-23
## Languages
**Primary:**
- HCL (HashiCorp Configuration Language) - Terraform/Terragrunt infrastructure definitions
- Bash - Scripting and cluster management (`scripts/` directory)
- YAML - Kubernetes resource definitions and configuration
- Python - Monitoring and utility scripts in `stacks/platform/modules/`
- TypeScript/JavaScript - k8s-portal frontend and webhook-handler (`stacks/platform/modules/k8s-portal/`, `stacks/webhook_handler/`)
**Secondary:**
- Go - Various utilities
- Dockerfile - Container image definitions across stacks
## Runtime
**Environment:**
- Kubernetes v1.34.2 (5 nodes: k8s-master + k8s-node1-4)
- Linux (Ubuntu cloud images on Proxmox VMs)
- Bash shell for automation
**Package Manager:**
- npm (Node.js) - for k8s-portal web UI development
  - Lockfile: `package-lock.json` present
- pip (Python) - for utility scripts
- Terraform/Terragrunt - manages all infrastructure dependencies
## Frameworks
**Core:**
- Terraform 1.x - Infrastructure-as-Code orchestration
- Terragrunt - State isolation wrapper around Terraform (`terragrunt.hcl` in each stack)
- Kubernetes - Container orchestration (kubectl, Helm, kustomize patterns)
**Testing:**
- Playwright ^1.58.2 - E2E testing framework (root `package.json`)
**Build/Dev:**
- Helm 3.1.1 - Kubernetes package manager (provider version via Terraform)
- Svelte - Frontend framework for k8s-portal (`stacks/platform/modules/k8s-portal/files/` Node.js project)
## Key Dependencies
**Critical:**
- hashicorp/kubernetes (3.0.1) - Kubernetes API provider
- hashicorp/helm (3.1.1) - Helm release management
- telmate/proxmox (3.0.2-rc07) - Proxmox VM management (`stacks/infra/`)
- cloudflare/cloudflare (4.52.5) - DNS and tunnel management (`stacks/platform/modules/cloudflared/`)
- hashicorp/null (3.2.4) - Utility provider for local operations
- hashicorp/random (3.8.1) - Random value generation
**Infrastructure:**
- MySQL 9.2.0 - Relational database (`stacks/platform/modules/dbaas/`)
- PostgreSQL 16.4-bullseye - Primary database with PostGIS/PGVector (`stacks/platform/modules/dbaas/`)
- Redis/redis-stack:latest - In-memory cache and broker (`stacks/platform/modules/redis/`)
- Headscale 0.23.0 - WireGuard control plane (`stacks/platform/modules/headscale/`)
**Observability:**
- Prometheus - Metrics collection and alerting
- Grafana - Metrics visualization and dashboards
- Loki 3.6.5 - Log aggregation
- Alloy v1.13.0 - Log collector (DaemonSet)
**API Gateway & Ingress:**
- Traefik 3.x - Ingress controller and reverse proxy (`stacks/platform/modules/traefik/`)
- MetalLB - Load balancer for Kubernetes service IPs (`stacks/platform/modules/metallb/`)
**Security:**
- Authentik - Identity Provider/OIDC (`stacks/platform/modules/authentik/`)
- Vaultwarden 1.35.2 - Password manager (`stacks/platform/modules/vaultwarden/`)
- CrowdSec - Intrusion detection and IP reputation (`stacks/platform/modules/crowdsec/`)
- Kyverno - Policy enforcement and governance (`stacks/platform/modules/kyverno/`)
**Container Images Registry:**
- docker.io - Docker Hub public images
- ghcr.io - GitHub Container Registry (Headscale UI, Immich, etc.)
- quay.io - Quay.io registry (inferred from mirror config)
- registry.k8s.io - Kubernetes images
- Local pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
## Configuration
**Environment:**
- `terraform.tfvars` (git-crypt encrypted) - All secrets, API keys, DNS records, passwords
- Environment variables injected into Kubernetes pods via ConfigMap/Secret
- Kubeconfig: `config` file in repo root (referenced as `$PWD/config` in terragrunt)
**Build:**
- `terragrunt.hcl` (root) - DRY Terraform provider and backend configuration
- `stacks/<service>/terragrunt.hcl` - Per-stack overrides
- `stacks/<service>/main.tf` - Kubernetes/Proxmox resource definitions
- `.terraform.lock.hcl` - Provider version lock (Terraform 1.x)
- `.terraform/` - Downloaded providers cached locally
**Secrets:**
- `secrets/` directory (git-crypt encrypted)
- TLS certificates and keys in `secrets/` (symlinked from stacks)
- OpenDKIM keys for mailserver
- NFS export configuration in `secrets/nfs_directories.txt`
## Platform Requirements
**Development:**
- Terraform 1.x CLI
- Terragrunt CLI (uses `terragrunt apply --non-interactive`)
- kubectl configured with kubeconfig at `$PWD/config`
- git-crypt for secret decryption
- curl, bash, standard Unix utilities
**Production:**
- Kubernetes 1.34.2+ cluster (5 nodes, 192 GB+ total memory)
- Proxmox 8.x hypervisor (`stacks/infra/` provisions VMs)
- NFS storage: TrueNAS at `10.0.10.15` with exports at `/mnt/main/`
- Docker registry pull-through cache at `10.0.20.10`
- Cloudflare DNS (public domain `viktorbarzin.me`)
- Technitium DNS (internal domain `viktorbarzin.lan`)
**Networking:**
- Kubernetes pod CIDR: managed by cluster
- Service IPs: 10.0.20.200-10.0.20.220 (MetalLB layer 2)
- Internal DNS: Technitium at cluster IP
- External DNS: Cloudflare tunnel + traditional DNS records
---
*Stack analysis: 2026-02-23*


@ -0,0 +1,255 @@
# Codebase Structure
**Analysis Date:** 2026-02-23
## Directory Layout
```
/Users/viktorbarzin/code/infra/
├── .claude/ # Project-level Claude knowledge (skills, reference docs)
├── .git/ # Git repository metadata
├── .git-crypt/ # git-crypt encryption keys
├── .planning/codebase/ # GSD codebase analysis documents
├── .terraform/ # Terraform cache (gitignored)
├── .woodpecker/ # CI/CD pipeline definitions
├── cli/ # Custom CLI tools (bash/python scripts)
├── diagram/ # Infrastructure diagram sources
├── docs/ # Documentation (deployment guides, design docs)
├── modules/ # Shared Terraform modules (Proxmox, K8s utilities)
├── playbooks/ # Ansible playbooks (infrastructure setup)
├── scripts/ # Maintenance scripts (healthcheck, DNS updates, etc.)
├── secrets/ # git-crypt encrypted files (NFS dirs, TLS certs, SSH keys)
├── stacks/ # Terragrunt stacks (platform + ~70 service stacks)
├── state/ # Terraform state files (local backend, gitignored)
├── terragrunt.hcl # Root Terragrunt config (DRY provider/backend setup)
├── terraform.tfvars # All variables + secrets (git-crypt encrypted, ~48KB)
├── config # Kubernetes config (kubeconfig file)
├── README.md # Project overview
└── package.json # Node.js deps (minimal; mostly for cli tools)
```
## Directory Purposes
**`.claude/`:**
- Purpose: Project-level Claude knowledge and execution skills
- Contains: `skills/` (setup-project, authentik workflows), `reference/` (inventory tables, API patterns)
- Key files: `CLAUDE.md` (this file's counterpart with full infrastructure context)
**`.planning/codebase/`:**
- Purpose: GSD codebase analysis output directory
- Contains: `ARCHITECTURE.md`, `STRUCTURE.md` (this file), and focus-specific docs
- Auto-generated: Yes (by /gsd:map-codebase)
**`modules/`:**
- Purpose: Reusable Terraform modules for VM creation and Kubernetes utilities
- Contains:
  - `create-template-vm/`: Cloud-init Ubuntu template VM provisioning (K8s + non-K8s)
  - `create-vm/`: VM instance creation from templates with cloud-init injection
  - `docker-registry/`: Docker registry pull-through cache setup
  - `kubernetes/`: K8s-specific utilities (ingress_factory, setup_tls_secret)
**`stacks/`:**
- Purpose: Terragrunt stacks with isolated state and per-service configuration
- Contains: 1 platform stack + ~70 application stacks
- Structure: Each stack is a directory with `terragrunt.hcl` + `main.tf` + optional `factory/` (for multi-instance services)
**`stacks/platform/`:**
- Purpose: Core infrastructure services (22 modules)
- Contains: Modules for MetalLB, DBaaS, Redis, Traefik, DNS, VPN, auth, monitoring, security
- Key subdirs: `modules/` (platform-specific modules like traefik, authentik, monitoring)
**`stacks/infra/`:**
- Purpose: Proxmox VM template and instance provisioning
- Contains: K8s node templates, docker-registry VM, Proxmox provider configuration
**`stacks/<service>/`:**
- Purpose: Single application stack with isolated state
- Pattern: `terragrunt.hcl` (includes root, declares dependencies) + `main.tf` (resources) + optional `factory/` + optional `chart_values.yaml`
- Examples: `nextcloud/`, `immich/`, `matrix/`, `actualbudget/` (multi-tenant), etc.
**`secrets/`:**
- Purpose: git-crypt encrypted sensitive files
- Contains: TLS certificates/keys, NFS export list, SSH keys, DKIM keys, Postfix config
- Key files:
  - `nfs_directories.txt`: List of NFS shares (sorted); regenerate exports with `nfs_exports.sh`
  - `tls/`: TLS certificate chain and keys
  - `mailserver/`: OpenDKIM keys, Postfix SASL creds
**`scripts/`:**
- Purpose: Operational and maintenance automation
- Key scripts:
  - `cluster_healthcheck.sh`: 24-point cluster health status
  - `renew2.sh`: TLS certificate renewal via certbot + Cloudflare
  - `setup_certs.sh`: Initial certificate setup
  - `pve_*`: Proxmox management scripts
  - `ha_*`: Home Assistant integration scripts
**`docs/`:**
- Purpose: Design and deployment documentation
- Contains: High-level architecture diagrams, deployment guides, troubleshooting
**`cli/`:**
- Purpose: Custom CLI utilities
- Contains: Python/bash scripts for common operations (DNS management, NFS, etc.)
## Key File Locations
**Entry Points:**
- `terragrunt.hcl`: Root Terragrunt config; invoked by `terragrunt apply` in any stack directory
- `stacks/platform/main.tf`: Platform stack; applies 22 core modules
- `stacks/infra/main.tf`: Infrastructure stack; creates VM templates and docker-registry VM
**Configuration:**
- `terraform.tfvars`: Central variables file (~48KB, git-crypt encrypted). Used by all stacks. Contains: Cloudflare credentials, DNS records, service secrets, TLS secret name
- `stacks/<service>/terragrunt.hcl`: Stack-specific Terragrunt config (includes root, declares `dependency` blocks)
- `stacks/platform/modules/<service>/main.tf`: Platform module implementation (22 modules)
**Core Logic:**
- `stacks/platform/main.tf`: 1000+ lines; instantiates all 22 platform modules
- `stacks/<service>/main.tf`: 30,450 lines total across all service stacks; creates namespaces, Helm releases, Kubernetes resources
- `stacks/<service>/factory/main.tf`: Multi-instance service pattern; called multiple times with different parameters
- `modules/kubernetes/ingress_factory/main.tf`: Traefik ingress + service template with security defaults
**Testing & Validation:**
- `.woodpecker/`: CI/CD pipeline (pushes platform apply on merge)
- `scripts/cluster_healthcheck.sh`: Manual cluster health validation
**Kubernetes & Cluster Config:**
- `config`: Kubeconfig file for cluster access
- Namespace pattern: One namespace per service stack
- TLS secret: `tls-secret` injected into all namespaces via `setup_tls_secret` module
## Naming Conventions
**Files:**
- `main.tf`: Primary Terraform resource file per stack
- `terragrunt.hcl`: Terragrunt-specific configuration (includes root, dependencies)
- `terraform.tfvars`: Global variables (git-crypt encrypted)
- `chart_values.yaml`: Helm chart values template (uses templatefile for variable substitution)
- `*_values.tpl`: Helm values template (evaluated with templatefile)
- `.terraform.lock.hcl`: Provider lock file (one per stack)
**Directories:**
- `stacks/<service>/`: Kebab-case service names (e.g., `real-estate-crawler`, `k8s-dashboard`)
- `stacks/platform/modules/<service>/`: Kebab-case module names
- `state/stacks/<service>/`: Mirrored state directory structure
- `secrets/`: Single top-level directory for all encrypted files
- `modules/kubernetes/`, `modules/create-template-vm/`: Category-based grouping
**Terraform Resources:**
- **Kubernetes**: `kubernetes_*` (namespace, deployment, service, configmap, etc.)
- **Helm**: `helm_release` (Helm chart deployments)
- **Local files**: `local_file` (for generated scripts and configs)
- **Module calls**: `module "<short-name>"` (e.g., `module "traefik"`, `module "redis"`)
**Variables:**
- snake_case: `tls_secret_name`, `crowdsec_api_key`, `nextcloud_db_password`
- Service-prefixed: `<service>_<attribute>` (e.g., `authentik_secret_key`, `mailserver_accounts`)
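For illustration, a hypothetical declaration following both conventions:
```hcl
variable "myservice_db_password" { # service-prefixed, snake_case (hypothetical name)
  type      = string
  sensitive = true
}
```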
## Where to Add New Code
**New Service Stack:**
1. Create `stacks/<service>/` directory
2. Add `terragrunt.hcl`:
```hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
```
3. Create `main.tf` with (a hedged skeleton follows this list):
- Variable declarations for required inputs from `terraform.tfvars`
- `locals { tiers = { ... } }` (copy from existing stack)
- `kubernetes_namespace` resource with tier label
- `module "tls_secret"` call to `../../modules/kubernetes/setup_tls_secret`
- Service-specific resources (Helm releases, Deployments, etc.)
4. Add Cloudflare DNS records in `terraform.tfvars` if needed
5. Create optional `secrets/` symlink: `ln -s ../../secrets secrets`
6. Apply: `cd stacks/<service> && terragrunt apply --non-interactive`
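A minimal `main.tf` skeleton for step 3; the tier value, variable, and resource names are assumptions to be replaced by copying from an existing stack:
```hcl
variable "tls_secret_name" {
  type = string
}

locals {
  tiers = { aux = "4-aux" } # hypothetical subset; copy the full map from an existing stack
}

resource "kubernetes_namespace" "myservice" {
  metadata {
    name   = "myservice"
    labels = { tier = local.tiers.aux }
  }
}

# ...plus the module "tls_secret" call shown under Key File Locations,
# followed by service-specific Helm releases and Deployments.
```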
**Multi-Tenant Service (using Factory Pattern):**
1. Create parent stack: `stacks/<service>/main.tf` with namespace + TLS setup
2. Create `stacks/<service>/factory/main.tf` with single-instance logic
3. In parent, call factory multiple times:
```hcl
module "instance1" {
source = "./factory"
name = "instance1"
# ... other params
}
```
4. Example: `stacks/actualbudget/` has factory instantiated for viktor, anca, emo
**New Platform Module:**
1. Create `stacks/platform/modules/<service>/` directory
2. Add `main.tf` with resources (Helm chart, namespace, ConfigMaps, etc.)
3. Add `variables.tf` or declare variables in `main.tf`
4. In `stacks/platform/main.tf`, add module call:
```hcl
module "<service>" {
source = "./modules/<service>"
tier = local.tiers.<tier>
# ... pass required variables
}
```
5. Add variable declarations in `stacks/platform/main.tf`
**New Shared Module:**
1. Create `modules/kubernetes/<module_name>/` or `modules/terraform/<module_name>/`
2. Add `main.tf` with reusable resources (see the sketch after these steps)
3. Declare clear variable inputs and output any useful values
4. Call from service stacks: `module "<name>" { source = "../../modules/kubernetes/<module_name>" ... }`
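A hedged skeleton for a new shared module; the resource, variable, and output names are illustrative, not an existing module's interface:
```hcl
# modules/kubernetes/<module_name>/main.tf
variable "namespace" {
  type = string
}

resource "kubernetes_config_map" "this" {
  metadata {
    name      = "example-config" # hypothetical resource
    namespace = var.namespace
  }
  data = { "key" = "value" }
}

output "config_map_name" {
  value = kubernetes_config_map.this.metadata[0].name
}
```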
**Utilities & Scripts:**
- Shared helpers: `scripts/` directory
- Custom CLI tools: `cli/` directory
- CI/CD pipelines: `.woodpecker/`
## Special Directories
**`state/`:**
- Purpose: Terraform state files (local backend)
- Generated: Yes (automatically by Terragrunt)
- Committed: No (gitignored; backed up separately)
- Structure: `state/stacks/<service>/terraform.tfstate`
**`secrets/`:**
- Purpose: git-crypt encrypted secrets and sensitive config
- Generated: No (managed manually or via scripts)
- Committed: Yes (encrypted via git-crypt)
- Contents: TLS certs, SSH keys, NFS export list, mailserver config, DKIM keys
**`.terraform/`:**
- Purpose: Terraform provider cache
- Generated: Yes (by Terraform during init)
- Committed: No (gitignored)
**`node_modules/`:**
- Purpose: Node.js dependencies for CLI tools
- Generated: Yes (by npm install)
- Committed: No (gitignored; use lockfile)
## File Patterns & Imports
**Terragrunt Patterns:**
- Include root: `include "root" { path = find_in_parent_folders() }`
- Declare dependencies: `dependency "platform" { config_path = "../platform"; skip_outputs = true }`
- Variable access: `var.<name>` in `main.tf` (variables sourced from `terraform.tfvars`)
**Kubernetes Resource Patterns:**
- Namespace per service: `kubernetes_namespace.<service>` with tier label
- Helm releases: `helm_release.<chart_name>` with `templatefile` for values
- Inline NFS volumes (expanded in the sketch after this list): `volume { name = "data"; nfs { server = "10.0.10.15"; path = "/mnt/main/<service>" } }`
- TLS injection: Every stack calls `module "tls_secret"` to populate namespace secret
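The inline NFS shorthand above corresponds to a pod-spec fragment along these lines (the service directory is a placeholder):
```hcl
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/myservice" # hypothetical service directory
  }
}
```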
**Module Call Pattern:**
- Standard: `module "<name>" { source = "./modules/<module>" ... }`
- Platform modules: `source = "./modules/<service>"`
- Shared modules: `source = "../../modules/kubernetes/<module>"`
---
*Structure analysis: 2026-02-23*


@ -0,0 +1,279 @@
# Testing Patterns
**Analysis Date:** 2026-02-23
## Test Framework
**Language-Specific Runners:**
**Go:**
- Runner: `go test` (standard library `testing` package)
- Config: No config file (uses built-in conventions)
- Run Commands:
```bash
go test ./... # Run all tests
go test -v ./... # Verbose output
go test -run TestContains ./... # Run specific test
go test -cover ./... # Show coverage
```
**Bash:**
- Runner: Custom shell scripts in `scripts/`
- No formal test framework; uses `set -euo pipefail` for error handling
- Manual health checks via `bash scripts/cluster_healthcheck.sh`
**Terraform:**
- Framework: No automated testing detected (no `terraform test` files, no `*.tftest.hcl`)
- Validation: Manual `terraform validate`, `terraform plan`, visual inspection
- Integration: Terragrunt validates configuration before applying
## Test File Organization
**Location:**
- Go tests: Co-located with source code: `<service>/files/internal/scraper/validate_test.go`
- Shell/Infrastructure: No test files (manual validation/health checks only)
**Naming:**
- Go: `*_test.go` suffix
- Script tests: `.sh` for check/validation scripts
**Structure:**
```
stacks/f1-stream/files/internal/scraper/
├── main.go
├── validate.go
└── validate_test.go # Test file co-located
```
## Test Structure
**Go Table-Driven Tests:**
```golang
func TestContainsVideoMarkers(t *testing.T) {
tests := []struct {
name string
body string
want bool
}{
{
name: "video tag",
body: `<div><video src="stream.mp4"></video></div>`,
want: true,
},
// ... more test cases
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := containsVideoMarkers(tt.body)
if got != tt.want {
t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
}
})
}
}
```
**Patterns:**
- Slice of anonymous structs with `name`, input fields, and `want` for expected result
- Loop with `t.Run(tt.name, ...)` for individual test case execution and reporting
- Descriptive test case names: `"video tag"`, `"HLS manifest reference"`, `"empty string"`
- Positive cases grouped at the top of the table, negative cases at the bottom, separated by comments
**Bash Health Check Structure:**
```bash
check_nodes() {
section 1 "Node Status"
local nodes not_ready versions unique_versions detail=""
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; json_add "node_status" "FAIL" "Cannot reach cluster"; return 0; }
# ... processing
if [[ -n "$not_ready" ]]; then
fail "NotReady nodes: $not_ready"
json_add "node_status" "FAIL" "$detail"
elif [[ "$unique_versions" -gt 1 ]]; then
warn "Version mismatch..."
json_add "node_status" "WARN" "$detail"
else
pass "All nodes Ready..."
json_add "node_status" "PASS" "$detail"
fi
}
```
**Patterns:**
- Each check function follows same structure: setup → validation → status reporting
- Status reported via `pass()`, `warn()`, `fail()` helper functions
- JSON output optional via `json_add()` for programmatic consumption
- Error handling inline with `||` fallback and graceful degradation
## Mocking
**Framework:**
- Go: No mocking framework detected (table-driven tests use real function calls)
- Bash: External commands mocked implicitly (KUBECONFIG override, kubectl invocation through `$KUBECTL` variable)
**Patterns (Go):**
- No mock objects or stubs
- Real function behavior tested directly
- Test data provided as input in struct fields
**Patterns (Bash):**
```bash
# Kubeconfig override allows testing against different clusters
KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; return 0; }
```
**What NOT to Mock:**
- Core functionality being tested (test actual behavior)
- Standard library functions (test integration)
**What to Mock (Bash):**
- External kubectl calls via variable indirection: allows `KUBECONFIG` override
- Conditional output by flag: `--json`, `--quiet` flags change output, not behavior
## Fixtures and Factories
**Test Data (Go):**
- Inline strings in struct fields: HTML content, MIME types
- Examples from `validate_test.go`:
```golang
{
name: "HLS manifest reference",
body: `var url = "https://cdn.example.com/live.m3u8";`,
want: true,
},
```
**Location:**
- Embedded directly in test file as struct field values
- No separate fixture files or factories
**Bash Fixtures:**
- Real cluster fixtures: tests run against actual Kubernetes cluster
- No data files; tests fetch live state via kubectl
## Coverage
**Requirements:** None enforced (no coverage thresholds, targets, or CI/CD gates detected)
**View Coverage (Go):**
```bash
go test -cover ./... # Show coverage percentages
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out # Open HTML report
```
**Note:** Coverage tools not integrated into CI/CD pipeline; manual check only.
## Test Types
**Unit Tests (Go):**
- Scope: Single function validation
- Approach: Table-driven with parameterized inputs
- Example: `TestContainsVideoMarkers()` tests HTML content detection
- Example: `TestIsDirectVideoContentType()` tests MIME type classification (hedged sketch below)
- In file: `stacks/f1-stream/files/internal/scraper/validate_test.go`
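A minimal sketch of what that MIME-type test plausibly looks like, assuming an unexported `isDirectVideoContentType(string) bool` helper; the specific cases are illustrative, not copied from the repo:
```golang
func TestIsDirectVideoContentType(t *testing.T) {
	tests := []struct {
		name        string
		contentType string
		want        bool
	}{
		{name: "mp4", contentType: "video/mp4", want: true},
		{name: "HLS manifest", contentType: "application/vnd.apple.mpegurl", want: true},
		// Negative case: an HTML page is not direct video content.
		{name: "html page", contentType: "text/html; charset=utf-8", want: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isDirectVideoContentType(tt.contentType); got != tt.want {
				t.Errorf("isDirectVideoContentType(%q) = %v, want %v", tt.contentType, got, tt.want)
			}
		})
	}
}
```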
**Integration Tests:**
- Bash health checks (`scripts/cluster_healthcheck.sh`) serve as integration tests
- Runs 24 separate checks against the live Kubernetes cluster:
- Node status and readiness
- Node resource utilization
- Container metrics
- Pod crash loops
- Persistent volume health
- DNS resolution
- Networking
- RBAC
- Logs aggregation
- Can run with `--fix` flag for auto-remediation
- Can output JSON for CI integration
**E2E Tests:**
- Not formally implemented
- Manual validation via Terragrunt apply → cluster state verification
**Infrastructure Testing:**
- Terraform: `terraform validate` and `terraform plan` provide syntax/logic validation
- Application health: Manual checks via scripts and cluster_healthcheck.sh
- No automated test suite for infrastructure code
## Common Patterns
**Async Testing (Go):**
- Not applicable (synchronous function testing only)
**Error Testing (Go):**
```golang
{
name: "empty string",
body: "",
want: false,
},
```
- Negative test cases included in same table
- Error/edge cases named descriptively: `"empty string"`, `"reddit link page"`
- Expected failure behavior verified: `want: false` for invalid inputs
**Error Reporting (Go):**
```golang
t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
```
- Formatted message includes: function name, input (truncated), actual, expected
- Test name automatically prefixed by `t.Run(tt.name, ...)`
**Status Reporting (Bash):**
- Color-coded status via `pass()`/`warn()`/`fail()` helpers (sketched after this list): `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
- Counter incremented per status
- Optional quiet mode (`--quiet`) suppresses PASS output
- Optional JSON output (`--json`) for CI integration
- Summary printed at end: `$PASS_COUNT/$WARN_COUNT/$FAIL_COUNT`
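A hedged sketch of the helper trio described above; the names match the conventions in the script, but the bodies are assumptions (note the `set -e`-safe counter increments):
```bash
GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'; NC='\033[0m'
PASS_COUNT=0; WARN_COUNT=0; FAIL_COUNT=0
QUIET=${QUIET:-false}  # assumed flag variable set by --quiet parsing

pass() {
  [[ "$QUIET" == "true" ]] || echo -e "${GREEN}[PASS]${NC} $1"
  PASS_COUNT=$((PASS_COUNT + 1))  # avoids ((x++)) tripping set -e when x is 0
}
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; WARN_COUNT=$((WARN_COUNT + 1)); }
fail() { echo -e "${RED}[FAIL]${NC} $1"; FAIL_COUNT=$((FAIL_COUNT + 1)); }
```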
## Running Tests
**Go Tests:**
```bash
# From service directory containing *_test.go
go test -v ./...
```
**Bash Health Checks:**
```bash
# Comprehensive checks
bash scripts/cluster_healthcheck.sh
# Quiet mode (WARN/FAIL only)
bash scripts/cluster_healthcheck.sh --quiet
# Auto-fix mode
bash scripts/cluster_healthcheck.sh --fix
# JSON output
bash scripts/cluster_healthcheck.sh --json
# Custom kubeconfig
bash scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
**Terraform Validation:**
```bash
# Format check
terraform fmt -recursive
# Syntax validation
terraform validate
# Plan without apply
terraform plan
# From stack directory
cd stacks/<service> && terragrunt plan
cd stacks/<service> && terragrunt apply --non-interactive
```
---
*Testing analysis: 2026-02-23*


@ -57,7 +57,7 @@ resource "kubernetes_deployment" "roundcubemail" {
spec {
container {
name = "roundcube"
image = "roundcube/roundcubemail:1.6-apache"
image = "roundcube/roundcubemail:1.6.13-apache"
# Uncomment me to mount additional settings
# volume_mount {
# name = "imap-config"