infra/stacks/nextcloud/chart_values.yaml

nextcloud:
  host: nextcloud.viktorbarzin.me
  trustedDomains:
    - nextcloud.viktorbarzin.me
  # mail:
  #   enabled: true
  #   # the user we send email as
  #   fromAddress: nextcloud@viktorbarzin.me
  #   # the domain we send email from
  #   domain: viktorbarzin.me
  #   smtp:
  #     host: mail.viktorbarzin.me
  #     secure: starttls
  #     port: 587
  #     authtype: LOGIN
  #     name: nextcloud@viktorbarzin.me
  #     password:
  extraEnv:
    - name: TRUSTED_PROXIES
      value: "10.0.0.0/8"
    - name: PHP_MEMORY_LIMIT
      value: "512M"
    - name: PHP_UPLOAD_LIMIT
      value: "16G"
    # - name: mail_smtpdebug
    #   value: "true"
    # - name: loglevel
    #   value: "0"
  configs:
    zzz-redis.config.php: |
      <?php
      // Redis via HAProxy master-only service. HAProxy (3 replicas, PDB
      // minAvailable=2) health-checks all v2 pods via `INFO replication` and
      // routes to the current role:master. Sentinel failover takes <30s, and
      // HAProxy detects the new master via its 1s tcp-check interval and
      // starts routing within ~3s of detection. Removed the old in-process
      // sentinel-query loop on 2026-04-19 after the Redis rework — see
      // beads code-v2b and infra/docs/architecture/databases.md.
      $CONFIG = array(
        'memcache.distributed' => '\\OC\\Memcache\\Redis',
        'memcache.locking' => '\\OC\\Memcache\\Redis',
        'redis' => array(
          'host' => 'redis-master.redis.svc.cluster.local',
          'port' => 6379,
          'password' => '',
          'timeout' => 1.5,
          'read_timeout' => 1.5,
        ),
      );
    performance.config.php: |
      <?php
      $CONFIG = array(
        'loglevel' => 2,
        'mail_smtpdebug' => false,
      );
    zzz-mysql.config.php: |
      <?php
      $CONFIG = array(
        'mysql.utf8mb4' => true,
      );
  phpConfigs:
    zzz-custom.ini: |
      max_execution_time = 300
      max_input_time = 300
      default_socket_timeout = 300
      opcache.enable_file_override = 1
      apc.shm_size = 128M
  extraVolumes:
    - name: apache-tuning
      configMap:
        name: nextcloud-apache-tuning
    - name: db-password-sync
      configMap:
        name: nextcloud-db-password-sync
        defaultMode: 0755
  extraVolumeMounts:
    - name: apache-tuning
      mountPath: /etc/apache2/mods-available/mpm_prefork.conf
      subPath: mpm_prefork.conf
    - name: db-password-sync
      mountPath: /docker-entrypoint-hooks.d/before-starting
internalDatabase:
  enabled: false
externalRedis:
  enabled: false
externalDatabase:
  enabled: true
  type: mysql
  host: ${mysql_host}
  user: nextcloud
  database: nextcloud
  existingSecret:
    secretName: nextcloud-db-creds
    usernameKey: db-username
    passwordKey: DB_PASSWORD
persistence:
  enabled: true
  existingClaim: nextcloud-data-encrypted
  accessMode: ReadWriteOnce
  size: 20Gi
startupProbe:
  enabled: true
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 30
  failureThreshold: 60
  successThreshold: 1
livenessProbe:
  enabled: true
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 10
  successThreshold: 1
readinessProbe:
  enabled: true
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
  successThreshold: 1
podAnnotations:
  diun.enable: "true"
  diun.include_tags: "^[0-9]+(?:.[0-9]+)?(?:.[0-9]+)?.*"
  dependency.kyverno.io/wait-for: "mysql.dbaas:3306,redis-master.redis:6379"
  secret.reloader.stakater.com/reload: "nextcloud-db-creds"
collabora:
  enabled: false # Using onlyoffice instead
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 50m
    memory: 256Mi
cronjob:
  enabled: true
  resources:
    limits:
      memory: 384Mi
    requests:
      cpu: 25m
      memory: 384Mi