Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
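Homepage auto-discovery reads `gethomepage.dev/*` annotations off each ingress. A sketch of what one annotated ingress looks like; host, group, icon, and widget values here are illustrative, not the real config (`{{HOMEPAGE_VAR_...}}` is Homepage's variable-substitution syntax for secrets):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: authentik
  annotations:
    gethomepage.dev/enabled: "true"
    gethomepage.dev/name: "Authentik"
    gethomepage.dev/group: "Infrastructure"
    gethomepage.dev/icon: "authentik.png"
    gethomepage.dev/widget.type: "authentik"
    gethomepage.dev/widget.url: "https://authentik.example.internal"
    gethomepage.dev/widget.key: "{{HOMEPAGE_VAR_AUTHENTIK_TOKEN}}"
spec:
  rules:
    - host: authentik.example.internal
```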
1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
to SOPS+age. Covers JSON format requirement, race condition avoidance,
CI integration, complex types, and migration sequence.
2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
with parallel security + implementation subagents. 2-3 iterations to
zero CRITICALs. Used successfully for the SOPS migration design.
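The JSON format requirement in the first skill follows from Terraform auto-loading only files named `*.auto.tfvars.json`: SOPS encrypts the values of a JSON document while leaving keys and structure visible for diffs. A hypothetical plaintext file before encryption (variable names are placeholders):

```json
{
  "authentik_api_token": "REPLACE_ME",
  "shlink_api_key": "REPLACE_ME"
}
```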
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.
E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
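Sketched in Woodpecker syntax (images and step names are assumptions; only the sops_age_key secret name and the scoped paths come from this change):

```yaml
steps:
  prepare:
    image: ghcr.io/getsops/sops:v3.9.0
    environment:
      SOPS_AGE_KEY:
        from_secret: sops_age_key
    commands:
      - sops --decrypt secrets/secrets.sops.json > secrets.auto.tfvars.json
  commit:
    image: alpine/git
    commands:
      # scope the commit to known paths instead of `git add .`
      - git add stacks/ state/ .woodpecker/
      - git commit -m "ci state update" || true
  cleanup:
    image: alpine
    commands:
      - rm -f secrets.auto.tfvars.json
    when:
      status: [success, failure]
```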
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
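Per variable, the change is a single attribute (variable name hypothetical):

```hcl
variable "authentik_api_token" {
  type      = string
  sensitive = true # value is redacted in plan/apply output
}
```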
Note: CI SOPS decryption requires the sops_age_key Woodpecker secret to be
created first. Until then, the old terraform.tfvars path continues to
function.
terragrunt.hcl now loads:
- config.tfvars (required, plaintext)
- terraform.tfvars (optional, git-crypt — backward compat)
- secrets.auto.tfvars.json (optional, SOPS-decrypted)
before_hook checks that at least one secrets source exists.
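A sketch of how terragrunt.hcl can express this (the hook command is illustrative; Terraform also auto-loads secrets.auto.tfvars.json on its own):

```hcl
terraform {
  extra_arguments "var_files" {
    commands           = get_terraform_commands_that_need_vars()
    required_var_files = ["${get_terragrunt_dir()}/config.tfvars"]
    optional_var_files = [
      "${get_terragrunt_dir()}/terraform.tfvars",
      "${get_terragrunt_dir()}/secrets.auto.tfvars.json",
    ]
  }

  before_hook "require_secrets_source" {
    commands = ["plan", "apply"]
    execute = [
      "sh", "-c",
      "test -f terraform.tfvars || test -f secrets.auto.tfvars.json",
    ]
  }
}
```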
Use `scripts/tg` wrapper for SOPS-based workflow.
Old terraform.tfvars kept for reference and backward compatibility.
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json
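The wrapper can be as small as this sketch (the encrypted filename and the cleanup trap are assumptions, not the literal scripts/tg):

```shell
#!/bin/sh
# Decrypt SOPS secrets to the path Terraform auto-loads, run terragrunt,
# and always remove the plaintext afterwards (it is gitignored as well).
set -eu
trap 'rm -f secrets.auto.tfvars.json' EXIT
sops --decrypt secrets.sops.json > secrets.auto.tfvars.json
terragrunt "$@"
```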
No functional change — terraform.tfvars still works as before.
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
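The threshold logic can be sketched over `kubectl top nodes` style output; sample data is inlined below, and the real check would pipe the live command in instead:

```shell
# Flag nodes whose real usage crosses WARN (>80%) / FAIL (>90%).
report=$(awk 'NR > 1 {
  cpu = $3 + 0; mem = $5 + 0          # "82%" -> 82 via numeric coercion
  status = "OK"
  if (cpu > 80 || mem > 80) status = "WARN"
  if (cpu > 90 || mem > 90) status = "FAIL"
  print $1, status
}' <<'EOF'
NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
w1    3200m       82%   21Gi           65%
w2    1800m       45%   12Gi           38%
w3    3700m       93%   88Gi           88%
EOF
)
echo "$report"
```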
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
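The shape of the fix, sketched as a standalone Terraform resource rather than the actual nfs_volume module; the server/share values and the exact option set are placeholders:

```hcl
resource "kubernetes_persistent_volume" "example" {
  metadata {
    name = "example-nfs"
  }
  spec {
    capacity           = { storage = "10Gi" }
    access_modes       = ["ReadWriteMany"]
    storage_class_name = "nfs-csi"

    # StorageClass mountOptions are ignored for static PVs, so the
    # options must live on the PV itself.
    mount_options = ["soft", "timeo=150", "retrans=3"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = "example-nfs"
        volume_attributes = {
          server = "10.0.0.5"
          share  = "/export/example"
        }
      }
    }
  }
}
```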
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.
Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.
Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images
The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
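Assuming the nodes use containerd's `certs.d` layout, the surviving docker.io mirror entry would look roughly like this (`/etc/containerd/certs.d/docker.io/hosts.toml`; the path and mechanism are assumptions about how the cloud-init template wires mirrors):

```toml
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
```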
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables
and background merges. Covers config.d mount crash (exit code 36), CronJob
truncation workaround, and diagnostic commands.
Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery
via liveness probe pattern (nvidia-smi + app health check).
When the GPU becomes unavailable (overload, CUDA context corruption),
Frigate silently falls back to CPU detection, burning 4 cores with no
automatic recovery. Add a liveness probe checking nvidia-smi + API health
every 60s (3 failures = restart), and a startup probe allowing up to 5min
for TensorRT model loading.
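The probe pair sketched in pod-spec terms; the Frigate health endpoint and timings mirror the description above but are assumptions, not the committed manifest:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - nvidia-smi > /dev/null && curl -sf http://localhost:5000/api/version
  periodSeconds: 60
  failureThreshold: 3 # 3 misses = restart
startupProbe:
  exec:
    command: ["/bin/sh", "-c", "curl -sf http://localhost:5000/api/version"]
  periodSeconds: 10
  failureThreshold: 30 # up to 5 min for TensorRT model loading
```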
ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were
growing without bound on NFS (~10GiB, 1.3B rows) with no TTL, causing
continuous background merge operations that burned ~920m CPU. Mounting
custom config.d XML files crashes ClickHouse (exit code 36), so instead
add a CronJob that truncates the tables via the HTTP API every 6 hours.
Also removed the broken ConfigMap/volume mount that was causing crashes.
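A sketch of the truncation CronJob; the service name, image tag, table list, and unauthenticated access are assumptions. ClickHouse's HTTP interface (port 8123) accepts the statement as the POST body:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-log-truncate
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: truncate
              image: curlimages/curl:8.7.1
              command:
                - /bin/sh
                - -c
                - |
                  for t in metric_log trace_log text_log asynchronous_metric_log; do
                    curl -sf "http://clickhouse:8123/" \
                      --data-binary "TRUNCATE TABLE IF EXISTS system.$t"
                  done
```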
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
for non-core). Terraform is now sole authority for container resources.
Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
openclaw, trading-bot (now handled globally by Kyverno policy).
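Assuming the policy drives Goldilocks via its namespace label (the label name is Goldilocks' documented convention; the policy name and match are illustrative), the global opt-out can be sketched as:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: vpa-mode-off
spec:
  rules:
    - name: set-vpa-update-mode
      match:
        any:
          - resources:
              kinds: ["Namespace"]
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              goldilocks.fauxpas.io/vpa-update-mode: "off"
```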