1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
to SOPS+age. Covers JSON format requirement, race condition avoidance,
CI integration, complex types, and migration sequence.
2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
with parallel security + implementation subagents. 2-3 iterations to
zero CRITICALs. Used successfully for the SOPS migration design.
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.
E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
terragrunt.hcl now loads:
- config.tfvars (required, plaintext)
- terraform.tfvars (optional, git-crypt — backward compat)
- secrets.auto.tfvars.json (optional, SOPS-decrypted)
before_hook checks that at least one secrets source exists.
Use `scripts/tg` wrapper for SOPS-based workflow.
Old terraform.tfvars kept for reference and backward compatibility.
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json
No functional change — terraform.tfvars still works as before.
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.
Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.
Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images
The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables
and background merges. Covers config.d mount crash (exit code 36), CronJob
truncation workaround, and diagnostic commands.
Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery
via liveness probe pattern (nvidia-smi + app health check).
When the GPU becomes unavailable (overloaded, CUDA context corruption),
Frigate silently falls back to CPU detection burning 4 cores with no
automatic recovery. Add liveness probe checking nvidia-smi + API health
every 60s (3 failures = restart), and startup probe allowing up to 5min
for TensorRT model loading.
ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were
growing unboundedly on NFS (~10GiB, 1.3B rows) with no TTL, causing
continuous background merge operations that burned ~920m CPU. Mounting
custom config.d XML files crashes ClickHouse (exit code 36) so instead
add a CronJob that truncates the tables via the HTTP API every 6 hours.
Also removed the broken ConfigMap/volume mount that was causing crashes.
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
for non-core). Terraform is now sole authority for container resources.
Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
openclaw, trading-bot (now handled globally by Kyverno policy).
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
Persist font cache (159MB) and theme images (10MB) to NFS volume.
Set GENERATE_FONTS=false to skip regeneration on startup since cache
is warm. Startup time: ~3 min -> 5 seconds.
Startup was throttled by allthemesgen and font generation hitting 2 CPU
ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU)
and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.