SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.
- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim (sketch below)
- Excluded 12.5GB nextcloud.log and corrupted DB from migration
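A minimal sketch of the two pieces, using the claim name and storage class from
the bullets above; the namespace and the values key layout are assumptions:

```yaml
# PVC on the iSCSI storage class (namespace is an assumption)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-data-iscsi
  namespace: nextcloud
spec:
  accessModes:
    - ReadWriteOnce            # block volume: single-node attach
  storageClassName: iscsi-truenas
  resources:
    requests:
      storage: 20Gi
---
# Helm values excerpt, assuming the chart's persistence.existingClaim knob
persistence:
  enabled: true
  existingClaim: nextcloud-data-iscsi
```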
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed from git tracking and added to .gitignore.
File stays on disk locally for reference.
config.tfvars + secrets.auto.tfvars.json are the active var sources.
[ci skip]
- Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to
prevent recurring migration breakage from rolling nightly builds
- Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes,
relieving memory pressure on k8s-node4
- Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory)
- Add GPU toleration Kyverno policy to Terraform using patchesJson6902 instead
of patchStrategicMerge, fixing the tolerations array being overwritten (which
left the caretta DaemonSet pod unable to schedule on k8s-master)
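A minimal sketch of the patchesJson6902 form; the match selector and the
nvidia.com/gpu taint key are assumptions. The point is that a JSON-patch add on
/spec/tolerations/- appends to the existing array, whereas the old
strategic-merge patch replaced the whole tolerations list:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-gpu-toleration
spec:
  rules:
    - name: append-gpu-toleration
      match:
        any:
          - resources:
              kinds: ["Pod"]            # real policy may match more narrowly
      mutate:
        patchesJson6902: |-
          - op: add
            path: "/spec/tolerations/-"   # append, don't replace
            value:
              key: "nvidia.com/gpu"       # taint key is an assumption
              operator: "Exists"
              effect: "NoSchedule"
```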
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MaxRequestWorkers 25→50: with only 25 workers, SQLite locks could tie up ALL
  of them, making liveness probes fail even faster (131 restarts vs 50 before).
  50 is a compromise: enough workers left over to answer the probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed by enabling GIPK,
  generated invisible primary keys), emoji in calendar/filecache rows (stripped),
  and pre-existing SQLite corruption from the crash-looping. Migration rolled
  back, Nextcloud restored to SQLite.
The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested under
the nextcloud: key. Also mount the config into mods-available/ (the real file),
not mods-enabled/ (which only holds a symlink to it).
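A values excerpt showing both points; the ConfigMap name and the mpm file name
are assumptions:

```yaml
nextcloud:
  extraVolumes:
    - name: apache-mpm-config
      configMap:
        name: nextcloud-apache-mpm          # ConfigMap name is an assumption
  extraVolumeMounts:
    - name: apache-mpm-config
      # mount over the real file; mods-enabled/mpm_prefork.conf is a symlink,
      # so mounting there would not change what Apache loads
      mountPath: /etc/apache2/mods-available/mpm_prefork.conf
      subPath: mpm_prefork.conf
```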
Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.
- JobFailed: only alert on jobs started within the last hour, so stale
failed CronJob runs don't keep firing after subsequent runs succeed (sketch below)
- Mail server alert: renamed to MailServerDown, now targets the specific
mailserver deployment instead of all deployments in the namespace
(was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
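A sketch of the time-bounded JobFailed expression, assuming kube-state-metrics
metric names; the real rule's threshold, labels, and `for:` may differ:

```yaml
- alert: JobFailed
  # only jobs started within the last hour count, so a stale failed run
  # stops firing once newer runs have succeeded
  expr: |
    kube_job_status_failed > 0
    and on (namespace, job_name)
    (time() - kube_job_status_start_time) < 3600
  for: 5m
  labels:
    severity: warning
```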
Adds Sealed Secrets (Bitnami) to the platform stack so cluster users can
encrypt secrets with a public key and commit SealedSecret YAMLs to git.
The in-cluster controller decrypts them into regular K8s Secrets (example below).
- New module: sealed-secrets (namespace + Helm chart v2.18.3, cluster tier)
- k8s-portal setup script: adds kubeseal CLI install for Linux and Mac
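A minimal example of the flow: kubeseal encrypts a plain Secret against the
controller's public key, and the resulting SealedSecret is safe to commit. Names
are illustrative and the encrypted blob is a placeholder:

```yaml
# produced with: kubeseal --format yaml < secret.yaml > sealed-secret.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: example-credentials
  namespace: example
spec:
  encryptedData:
    password: AgBy3i4OJSWK...     # placeholder ciphertext, safe in git
  template:
    metadata:
      name: example-credentials
      namespace: example
```

The controller unseals this in-cluster into a normal Secret of the same name.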
Previously manualStartOnBoot=true and exitStateAction=ABORT_SERVER meant
any ungraceful shutdown required manual rebootClusterFromCompleteOutage().
New settings:
- group_replication_start_on_boot=ON: auto-start GR after crash
- autorejoin_tries=2016: retry rejoining every 5 minutes, for up to ~1 week
- exit_state_action=OFFLINE_MODE: stay alive on expulsion (don't abort)
- member_expel_timeout=30s: tolerate brief unresponsiveness
- unreachable_majority_timeout=60s: leave group cleanly if majority lost
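A sketch of how these could land in the server config via the MySQL Operator's
InnoDBCluster CR; the CR name, instance count, and the use of spec.mycnf (rather
than AdminAPI cluster options) are assumptions:

```yaml
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mysql-innodbcluster          # name is an assumption
spec:
  instances: 3
  mycnf: |
    [mysqld]
    # rejoin automatically instead of waiting for a manual
    # rebootClusterFromCompleteOutage() after an ungraceful shutdown
    group_replication_start_on_boot = ON
    group_replication_autorejoin_tries = 2016
    group_replication_exit_state_action = OFFLINE_MODE
    group_replication_member_expel_timeout = 30
    group_replication_unreachable_majority_timeout = 60
```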
LimitRange defaults had a 4-8x limit/request ratio causing the scheduler
to overpack nodes. When pods burst, nodes OOM-thrashed and became
unresponsive (k8s-node3 and k8s-node4 both went down today).
Changes:
- Increase default memory requests across all tiers (ratio now 2x):
- core/cluster: 64Mi → 256Mi request (512Mi limit)
- gpu: 256Mi → 1Gi request (2Gi limit)
- edge/aux/fallback: 64Mi → 128Mi request (256Mi limit)
- Add kubelet memory reservation and eviction thresholds:
- systemReserved: 512Mi, kubeReserved: 512Mi
- evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset)
- Applied to all nodes and future node template
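A sketch of the two pieces for the core/cluster tier; the object names, the
namespaces the LimitRange lands in, and how the kubelet config is delivered are
assumptions:

```yaml
# 2x limit/request ratio instead of the old 4-8x
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-defaults
  namespace: core-example            # applied per namespace; name illustrative
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi                # was 64Mi; scheduler overpacked nodes
      default:
        memory: 512Mi
---
# kubelet excerpt: reserve memory and evict earlier
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: 512Mi
kubeReserved:
  memory: 512Mi
evictionHard:
  memory.available: 500Mi
evictionSoft:
  memory.available: 1Gi
evictionSoftGracePeriod:
  memory.available: 90s              # required with evictionSoft; value assumed
```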
Use correct dashboard-icons names where available (changedetection,
gramps-web), Material Design Icons for custom apps (city-guesser,
plotting-book, resume, tuya-bridge, trading-bot, poison-fountain),
and Simple Icons for F1 Stream.
A stale-while-revalidate cache in front of Homepage reduces first-paint latency
by serving cached /api/ responses immediately while refreshing them from the
upstream in the background. Non-API paths pass through uncached.
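One way to express this with nginx (the actual proxy in use is an assumption):
proxy_cache with `use_stale updating` plus background refresh, scoped to /api/:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: homepage-cache-nginx                  # name and upstream are assumptions
data:
  default.conf: |
    proxy_cache_path /var/cache/nginx keys_zone=homepage:10m inactive=10m;
    server {
      listen 80;
      # /api/: answer from cache immediately, refresh upstream in the background
      location /api/ {
        proxy_pass http://homepage.homepage.svc.cluster.local:80;
        proxy_cache homepage;
        proxy_cache_valid 200 30s;
        proxy_cache_use_stale updating error timeout;
        proxy_cache_background_update on;
      }
      # everything else passes through uncached
      location / {
        proxy_pass http://homepage.homepage.svc.cluster.local:80;
      }
    }
```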
Deploy GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and GOOGLE_CALLBACK_URL
to the plotting-book container. Update CSP to allow accounts.google.com
for connect-src and form-action directives.
Prevents stale Redis connections from silently breaking file uploads.
The old Node.js Redis client doesn't auto-reconnect after Redis restarts,
causing all files to appear expired.
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (internal rejects untrusted Host header)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
Services expose port 80 via ClusterIP, but the widgets were using the container
target ports (5000, 3001, 4533, 3000), and Calibre was using the external URL
through Authentik. All widgets now use the correct internal service URLs.
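The resulting widget pattern for one service (group, namespace, and credential
wiring are illustrative); whether it lives in services.yaml or in the ingress
annotations, the URL points at the ClusterIP Service port, not the container
port:

```yaml
- Downloads:                                   # group name illustrative
    - qBittorrent:
        icon: qbittorrent.png
        widget:
          type: qbittorrent
          # Service port 80, not the container port 8080
          url: http://qbittorrent.downloads.svc.cluster.local:80
          username: "{{HOMEPAGE_VAR_QBT_USERNAME}}"   # placeholder env wiring
          password: "{{HOMEPAGE_VAR_QBT_PASSWORD}}"
```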
Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
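The per-ingress annotation shape Homepage's Kubernetes auto-discovery reads
(values illustrative; discovery is what the serviceAccount/RBAC above enables):

```yaml
metadata:
  annotations:
    gethomepage.dev/enabled: "true"
    gethomepage.dev/name: "Grafana"
    gethomepage.dev/group: "Monitoring"        # one of the 11 groups
    gethomepage.dev/icon: "grafana.png"
    gethomepage.dev/description: "Dashboards"
    gethomepage.dev/widget.type: "grafana"
    gethomepage.dev/widget.url: "http://grafana.monitoring.svc.cluster.local:80"
```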
1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
to SOPS+age. Covers JSON format requirement, race condition avoidance,
CI integration, complex types, and migration sequence.
2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
with parallel security + implementation subagents. 2-3 iterations to
zero CRITICALs. Used successfully for the SOPS migration design.
Fixed architecture and services pages to wrap table rows in <thead>/<tbody>
as required by Svelte 5's strict HTML validation.
E2E test passed: clean Alpine container → setup script → kubectl installed →
CA cert verified against API server → TLS SUCCESS
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt to the prepare step, change `git add .` to
  specific paths (stacks/ state/ .woodpecker/), run cleanup on success and failure
- renew-tls.yml: change `git add .` to `git add secrets/ state/`
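A sketch of the decrypt step in default.yml; the image, file names, and exact
step layout are assumptions, and it relies on the sops_age_key Woodpecker secret
described in the note below:

```yaml
steps:
  prepare:
    image: ghcr.io/getsops/sops:v3.9.0          # image/tag illustrative
    environment:
      SOPS_AGE_KEY:
        from_secret: sops_age_key
    commands:
      # decrypt the committed SOPS file into the gitignored plaintext
      # auto-tfvars that terragrunt loads (source file name assumed)
      - sops --decrypt secrets.sops.json > secrets.auto.tfvars.json
```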
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires the sops_age_key Woodpecker secret to
be created first; until then, the old terraform.tfvars path continues to function.
terragrunt.hcl now loads:
- config.tfvars (required, plaintext)
- terraform.tfvars (optional, git-crypt — backward compat)
- secrets.auto.tfvars.json (optional, SOPS-decrypted)
before_hook checks that at least one secrets source exists.
Use `scripts/tg` wrapper for SOPS-based workflow.
Old terraform.tfvars kept for reference and backward compatibility.
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI; sketch below)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json
No functional change — terraform.tfvars still works as before.
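A sketch of the .sops.yaml shape; the path_regex and the age public keys are
placeholders:

```yaml
creation_rules:
  - path_regex: secrets\.sops\.json$           # target file pattern is assumed
    # two recipients: Viktor's key and the CI key (placeholder values)
    age: "age1viktorpubkeyplaceholder,age1cipubkeyplaceholder"
```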