2024-09-20 02:26:43 +00:00
|
|
|
nextcloud:
|
2024-09-28 20:10:44 +00:00
|
|
|
host: nextcloud.viktorbarzin.me
|
2024-09-20 02:26:43 +00:00
|
|
|
trustedDomains:
|
2024-09-28 20:10:44 +00:00
|
|
|
- nextcloud.viktorbarzin.me
|
|
|
|
|
# mail:
|
|
|
|
|
# enabled: true
|
|
|
|
|
# # the user we send email as
|
|
|
|
|
# fromAddress: nextcloud@viktorbarzin.me
|
|
|
|
|
# # the domain we send email from
|
|
|
|
|
# domain: viktorbarzin.me
|
|
|
|
|
# smtp:
|
|
|
|
|
# host: mail.viktorbarzin.me
|
|
|
|
|
# secure: starttls
|
|
|
|
|
# port: 587
|
|
|
|
|
# authtype: LOGIN
|
|
|
|
|
# name: nextcloud@viktorbarzin.me
|
|
|
|
|
# password:
|
2024-09-20 02:26:43 +00:00
|
|
|
extraEnv:
|
|
|
|
|
- name: TRUSTED_PROXIES
|
2024-09-28 20:10:44 +00:00
|
|
|
value: "10.0.0.0/8"
|
2026-03-08 21:33:27 +00:00
|
|
|
- name: PHP_MEMORY_LIMIT
|
|
|
|
|
value: "512M"
|
|
|
|
|
- name: PHP_UPLOAD_LIMIT
|
|
|
|
|
value: "16G"
|
2024-09-28 20:10:44 +00:00
|
|
|
# - name: mail_smtpdebug
|
|
|
|
|
# value: "true"
|
|
|
|
|
# - name: loglevel
|
|
|
|
|
# value: "0"
|
2026-03-14 08:20:51 +00:00
|
|
|
configs:
|
2026-04-06 11:57:44 +03:00
|
|
|
zzz-redis.config.php: |
|
|
|
|
|
<?php
|
[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
- Discovered the v2 cluster was running redis:7.4-alpine, but the
Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
the 7.4 replicas rejected the stream with "Can't handle RDB format
version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
restore PSYNC compatibility.
- Discovered that sentinel on BOTH v2 and old Bitnami clusters
auto-discovered the cross-cluster replication chain when v2-0
REPLICAOF'd the old master, triggering a failover that reparented
old-master to a v2 replica and took HAProxy's backend offline.
Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
clusters) during the REPLICAOF surgery, then re-MONITOR after
cutover. This must be done on the OLD sentinels too, not just v2 —
they're the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
BullMQ queues and `_kombu.*` Celery queues — the user-stated
must-survive data class.
Phase 4 — HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at
`redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
ConfigMap apply so HAProxy's 1s health-check interval found a
role:master within a few seconds. Cutover disruption on HAProxy
rollout was brief; old clients naturally moved to new HAProxy pods
within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
+ `announce-hostnames yes` were active — this ensures sentinel
stores the hostname (not resolved IP) in its rewritten config, so
pod-IP churn on restart doesn't break failover.
Phase 5 — chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the
sentinel IP-storage issue (stored 10.10.107.222, went stale on
restart) — ~12s probe disruption. Fixed hostname persistence and
re-MONITORed.
- Round 2: killed new master v2-2 with hostnames correctly stored.
Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
60s — target <3s of actual user-visible disruption.
Phase 6 — Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process —
just points at `redis-master.redis.svc.cluster.local`. Removed 20
lines of PHP. HAProxy handles master tracking transparently now
that it's scaled to 3 + PDB minAvailable=2.
Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient —
TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
preserved as cold rollback.
Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality
and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.
The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
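The role-based routing that Phase 4 and Phase 6 rely on can be illustrated with a minimal Python sketch. It only parses the text an `INFO replication` call returns and picks the node that reports `role:master`. The function names (`parse_role`, `pick_master`) and the sample replies are hypothetical, and the real check runs inside HAProxy's tcp-check rather than in Python.

```python
from typing import Dict, Optional


def parse_role(info_replication: str) -> Optional[str]:
    """Return the value of the `role:` field from an INFO replication reply."""
    for line in info_replication.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return None


def pick_master(replies: Dict[str, str]) -> Optional[str]:
    """Return the single host whose reply reports role:master, else None."""
    masters = [host for host, reply in replies.items()
               if parse_role(reply) == "master"]
    # Refuse to choose when zero or several nodes claim the master role.
    return masters[0] if len(masters) == 1 else None


# Hypothetical replies, shaped like INFO replication output.
sample = {
    "redis-v2-0": "# Replication\r\nrole:master\r\nconnected_slaves:2\r\n",
    "redis-v2-1": "# Replication\r\nrole:slave\r\nmaster_link_status:up\r\n",
    "redis-v2-2": "# Replication\r\nrole:slave\r\nmaster_link_status:up\r\n",
}
print(pick_master(sample))  # redis-v2-0
```

Returning `None` when the answer is ambiguous is a design choice of this sketch; how the HAProxy backend behaves in that situation is not stated in the commit message.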
|
|
|
// Redis via HAProxy master-only service. HAProxy (3 replicas, PDB
|
|
|
|
|
// minAvailable=2) health-checks all v2 pods via `INFO replication` and
|
|
|
|
|
// routes to the current role:master. Sentinel failover takes <30s, and
|
|
|
|
|
// HAProxy detects the new master via its 1s tcp-check interval and
|
|
|
|
|
// starts routing within ~3s of detection. Removed the old in-process
|
|
|
|
|
// sentinel-query loop on 2026-04-19 after the Redis rework — see
|
|
|
|
|
// beads code-v2b and infra/docs/architecture/databases.md.
|
2026-04-06 11:57:44 +03:00
|
|
|
$CONFIG = array(
|
|
|
|
|
'memcache.distributed' => '\\OC\\Memcache\\Redis',
|
|
|
|
|
'memcache.locking' => '\\OC\\Memcache\\Redis',
|
|
|
|
|
'redis' => array(
|
2026-04-19 16:13:43 +00:00
|
|
|
'host' => 'redis-master.redis.svc.cluster.local',
|
|
|
|
|
'port' => 6379,
|
2026-04-06 11:57:44 +03:00
|
|
|
'password' => '',
|
|
|
|
|
'timeout' => 1.5,
|
|
|
|
|
'read_timeout' => 1.5,
|
|
|
|
|
),
|
|
|
|
|
);
|
2026-03-14 08:20:51 +00:00
|
|
|
performance.config.php: |
|
|
|
|
|
<?php
|
|
|
|
|
$CONFIG = array(
|
|
|
|
|
'loglevel' => 2,
|
|
|
|
|
'mail_smtpdebug' => false,
|
|
|
|
|
);
|
2026-04-10 09:16:29 +00:00
|
|
|
zzz-mysql.config.php: |
|
|
|
|
|
<?php
|
|
|
|
|
$CONFIG = array(
|
|
|
|
|
'mysql.utf8mb4' => true,
|
|
|
|
|
);
|
2026-03-08 21:33:27 +00:00
|
|
|
phpConfigs:
|
|
|
|
|
zzz-custom.ini: |
|
|
|
|
|
max_execution_time = 300
|
|
|
|
|
max_input_time = 300
|
|
|
|
|
default_socket_timeout = 300
|
2026-03-14 08:20:51 +00:00
|
|
|
opcache.enable_file_override = 1
|
|
|
|
|
apc.shm_size = 128M
|
2026-03-08 21:37:39 +00:00
|
|
|
extraVolumes:
|
|
|
|
|
- name: apache-tuning
|
|
|
|
|
configMap:
|
|
|
|
|
name: nextcloud-apache-tuning
|
2026-04-10 22:23:41 +01:00
|
|
|
- name: db-password-sync
|
|
|
|
|
configMap:
|
|
|
|
|
name: nextcloud-db-password-sync
|
|
|
|
|
defaultMode: 0755
|
2026-03-08 21:37:39 +00:00
|
|
|
extraVolumeMounts:
|
|
|
|
|
- name: apache-tuning
|
|
|
|
|
mountPath: /etc/apache2/mods-available/mpm_prefork.conf
|
|
|
|
|
subPath: mpm_prefork.conf
|
2026-04-10 22:23:41 +01:00
|
|
|
- name: db-password-sync
|
|
|
|
|
mountPath: /docker-entrypoint-hooks.d/before-starting
|
2024-09-28 20:10:44 +00:00
|
|
|
|
2026-03-12 23:27:12 +00:00
|
|
|
internalDatabase:
|
|
|
|
|
enabled: false
|
2026-01-17 20:50:29 +00:00
|
|
|
|
|
|
|
|
externalRedis:
|
2026-04-06 11:57:44 +03:00
|
|
|
enabled: false
|
2026-01-17 20:50:29 +00:00
|
|
|
|
|
|
|
|
externalDatabase:
|
2026-03-12 23:27:12 +00:00
|
|
|
enabled: true
|
2024-09-28 20:10:44 +00:00
|
|
|
type: mysql
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
host: ${mysql_host}
|
2024-09-28 20:10:44 +00:00
|
|
|
user: nextcloud
|
2026-03-12 23:27:12 +00:00
|
|
|
database: nextcloud
|
2026-03-17 07:39:29 +00:00
|
|
|
existingSecret:
|
|
|
|
|
secretName: nextcloud-db-creds
|
2026-03-22 02:50:32 +02:00
|
|
|
usernameKey: db-username
|
2026-03-17 07:39:29 +00:00
|
|
|
passwordKey: DB_PASSWORD
|
2024-09-28 20:10:44 +00:00
|
|
|
|
|
|
|
|
persistence:
|
|
|
|
|
enabled: true
|
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
|
|
|
existingClaim: nextcloud-data-encrypted
|
2024-09-28 20:10:44 +00:00
|
|
|
|
|
|
|
|
accessMode: ReadWriteOnce
|
2026-03-11 23:23:37 +00:00
|
|
|
size: 20Gi
|
2024-09-28 20:10:44 +00:00
|
|
|
|
|
|
|
|
startupProbe:
|
|
|
|
|
enabled: true
|
2026-02-28 23:32:28 +00:00
|
|
|
initialDelaySeconds: 30
|
2024-09-28 20:10:44 +00:00
|
|
|
periodSeconds: 10
|
2026-03-12 10:01:20 +00:00
|
|
|
timeoutSeconds: 30
|
2026-02-28 23:32:28 +00:00
|
|
|
failureThreshold: 60
|
2024-09-28 20:10:44 +00:00
|
|
|
successThreshold: 1
|
2024-12-30 14:01:38 +00:00
|
|
|
|
2026-03-08 21:33:27 +00:00
|
|
|
livenessProbe:
|
|
|
|
|
enabled: true
|
2026-03-12 10:01:20 +00:00
|
|
|
initialDelaySeconds: 30
|
|
|
|
|
periodSeconds: 60
|
|
|
|
|
timeoutSeconds: 30
|
|
|
|
|
failureThreshold: 10
|
2026-03-08 21:33:27 +00:00
|
|
|
successThreshold: 1
|
|
|
|
|
|
|
|
|
|
readinessProbe:
|
|
|
|
|
enabled: true
|
2026-03-12 10:01:20 +00:00
|
|
|
initialDelaySeconds: 30
|
|
|
|
|
periodSeconds: 60
|
|
|
|
|
timeoutSeconds: 30
|
|
|
|
|
failureThreshold: 5
|
2026-03-08 21:33:27 +00:00
|
|
|
successThreshold: 1
|
|
|
|
|
|
2024-12-30 14:01:38 +00:00
|
|
|
podAnnotations:
|
|
|
|
|
diun.enable: "true"
|
|
|
|
|
diun.include_tags: '^[0-9]+(?:\.[0-9]+)?(?:\.[0-9]+)?.*'
|
2026-04-15 06:41:56 +00:00
|
|
|
dependency.kyverno.io/wait-for: "mysql.dbaas:3306,redis-master.redis:6379"
|
2026-04-10 22:23:41 +01:00
|
|
|
secret.reloader.stakater.com/reload: "nextcloud-db-creds"
|
2025-08-17 19:27:34 +00:00
|
|
|
|
|
|
|
|
collabora:
|
2026-02-15 17:20:47 +00:00
|
|
|
enabled: false # Using onlyoffice instead
|
2025-08-17 19:27:34 +00:00
|
|
|
|
2026-02-28 16:26:19 +00:00
|
|
|
resources:
|
|
|
|
|
limits:
|
2026-03-14 21:46:49 +00:00
|
|
|
memory: 8Gi
|
2026-02-28 16:26:19 +00:00
|
|
|
requests:
|
2026-03-13 19:16:06 +00:00
|
|
|
cpu: 50m
|
|
|
|
|
memory: 256Mi
|
2026-02-28 16:26:19 +00:00
|
|
|
|
2025-08-17 19:27:34 +00:00
|
|
|
cronjob:
|
|
|
|
|
enabled: true
|
2026-03-14 21:46:49 +00:00
|
|
|
resources:
|
|
|
|
|
limits:
|
|
|
|
|
memory: 384Mi
|
|
|
|
|
requests:
|
|
|
|
|
cpu: 25m
|
|
|
|
|
memory: 384Mi
|
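The probe settings in this file translate into time budgets that are easy to misread. As a rough sketch, assuming the usual Kubernetes behaviour that a probe gives up after `failureThreshold` consecutive failures spaced `periodSeconds` apart (and ignoring the per-attempt `timeoutSeconds`), the helpers below compute those budgets. The function names are hypothetical and not part of any chart or API.

```python
def failure_window_seconds(period: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive failures tolerated by a probe."""
    return period * failure_threshold


def startup_budget_seconds(initial_delay: int, period: int,
                           failure_threshold: int) -> int:
    """Approximate seconds a container gets to start before it is killed."""
    return initial_delay + failure_window_seconds(period, failure_threshold)


# Values copied from the startupProbe, livenessProbe and readinessProbe blocks.
startup = startup_budget_seconds(initial_delay=30, period=10, failure_threshold=60)
liveness = failure_window_seconds(period=60, failure_threshold=10)
readiness = failure_window_seconds(period=60, failure_threshold=5)

print(startup)    # 630
print(liveness)   # 600
print(readiness)  # 300
```

Under these assumptions the pod has roughly ten and a half minutes to start, is restarted after about ten minutes of failed liveness checks, and is taken out of rotation after about five minutes of failed readiness checks.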