infra

Author	SHA1	Message	Date
OpenClaw	28cc7aea1f	fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion ISSUE RESOLVED: - Root cause: Loki's 15GB iSCSI PVC was completely full - Symptom: 'no space left on device' errors during TSDB operations - Impact: Loki service completely down, logging unavailable - Side effects: Contributed to node2 containerd corruption incident SOLUTION APPLIED: - Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch - Triggered pod restart to complete filesystem resize - Verified successful expansion and service recovery CURRENT STATUS: ✅ PVC: 50Gi capacity (iscsi-truenas storage class) ✅ Loki StatefulSet: 1/1 ready ✅ Loki Pod: 2/2 containers running ✅ Service: Successfully processing log streams ✅ No storage errors in recent logs TERRAFORM ALIGNED: - Updated loki.yaml persistence.size to match actual PVC - Infrastructure code now reflects deployed state [ci skip] - Emergency fix applied locally first due to service outage	2026-03-17 16:51:02 +00:00
Viktor Barzin	b813b63c3e	fix OOM kills: tune MySQL memory, reduce Nextcloud workers, increase Uptime Kuma limit MySQL (3 OOM kills): - Cap group_replication_message_cache_size to 128MB (default 1GB caused OOM) - Reduce innodb_log_buffer_size from 64MB to 16MB - Lower max_connections from 151 to 80 (peak usage ~40) - Increase memory limit from 3Gi to 4Gi for headroom Nextcloud (30+ apache2 OOM kills per incident): - Reduce MaxRequestWorkers from 50 to 10 to prevent fork bomb when SQLite locks cause request pileup - Lower StartServers/MinSpare/MaxSpare proportionally Uptime Kuma (Node.js memory leak): - Increase memory limit from 256Mi to 512Mi - Increase CPU limit from 200m to 500m Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-12 07:26:08 +00:00
Viktor Barzin	457d29dd3d	fix nvidia quota: use custom quota (32 CPU) instead of Kyverno-generated (16 CPU) The GPU operator needs ~19 CPU limits across 16 pods (NFD, device plugin, driver, validators, exporters). The Kyverno auto-generated quota of 16 CPU was insufficient, blocking NFD worker and GC pods from scheduling. - Add custom-quota label to nvidia namespace to exempt from Kyverno generation - Add explicit ResourceQuota with limits.cpu=32, limits.memory=48Gi - Fix: nvidia namespace tier label was missing after CI re-apply, causing Kyverno to use fallback LimitRange instead of tier-2-gpu specific one Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-12 07:04:34 +00:00
Viktor Barzin	ccbbd4bc19	fix cluster health: pin actualbudget, spread MySQL, scale grampsweb, fix GPU toleration - Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to prevent recurring migration breakage from rolling nightly builds - Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes, relieving memory pressure on k8s-node4 - Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory) - Add GPU toleration Kyverno policy to Terraform using patchesJson6902 instead of patchStrategicMerge to fix toleration array being overwritten (caused caretta DaemonSet pod to be unable to schedule on k8s-master) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-11 11:43:34 +00:00
Viktor Barzin	6b494b70dd	revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert - MaxRequestWorkers 25→50: too few workers caused ALL workers to block on SQLite locks, making liveness probes fail even faster (131 restarts vs 50 before). 50 is a compromise — enough workers for probes. - Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue) - MySQL migration attempted but hit: GR error 3100 (fixed with GIPK), emoji in calendar/filecache (stripped), SQLite corruption (pre-existing from crash-looping). Migration rolled back, Nextcloud restored to SQLite.	2026-03-09 22:05:20 +00:00
Viktor Barzin	eb8657dcd6	exclude nextcloud from HighServiceErrorRate alert Nextcloud has chronic 5xx errors due to SQLite lock contention causing Apache worker exhaustion. Excluding from alert until MySQL migration.	2026-03-09 20:26:30 +00:00
Viktor Barzin	914c7061c0	fix noisy JobFailed and duplicate mail server alerts - JobFailed: only alert on jobs started within the last hour, so stale failed CronJob runs don't keep firing after subsequent runs succeed - Mail server alert: renamed to MailServerDown, now targets the specific mailserver deployment instead of all deployments in the namespace (was falsely triggering on roundcubemail going down) - Updated inhibition rule to use new MailServerDown alert name	2026-03-08 21:22:43 +00:00
Viktor Barzin	e6c0c39ae7	reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts - NodeDown now suppresses workload and service alerts (PodCrashLooping, DeploymentReplicasMismatch, StatefulSetReplicasMismatch, etc.) - NFSServerUnresponsive suppresses pod-level alerts - Increased for durations on transient alerts (e.g. 15m→30m for replica mismatches) - NodeDown for: 1m→3m to avoid flapping - Removed all 3 Loki log-based alerts (duplicated Prometheus alerts) - Downgraded HeadscaleDown critical→warning, mail server page→warning	2026-03-08 21:13:16 +00:00
Viktor Barzin	41886dd731	deploy Sealed Secrets controller for encrypted secret management Adds Sealed Secrets (Bitnami) to the platform stack so cluster users can encrypt secrets with a public key and commit SealedSecret YAMLs to git. The in-cluster controller decrypts them into regular K8s Secrets. - New module: sealed-secrets (namespace + Helm chart v2.18.3, cluster tier) - k8s-portal setup script: adds kubeseal CLI install for Linux and Mac	2026-03-08 19:49:48 +00:00
Viktor Barzin	407b33abd6	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	8e58c5676d	enable MySQL InnoDB Cluster auto-recovery after crashes Previously manualStartOnBoot=true and exitStateAction=ABORT_SERVER meant any ungraceful shutdown required manual rebootClusterFromCompleteOutage(). New settings: - group_replication_start_on_boot=ON: auto-start GR after crash - autorejoin_tries=2016: retry rejoining for ~28 minutes - exit_state_action=OFFLINE_MODE: stay alive on expulsion (don't abort) - member_expel_timeout=30s: tolerate brief unresponsiveness - unreachable_majority_timeout=60s: leave group cleanly if majority lost	2026-03-08 17:13:03 +00:00
Viktor Barzin	4d3c6fcd79	fix node OOM: reduce memory overcommit ratio and add kubelet eviction thresholds LimitRange defaults had a 4-8x limit/request ratio causing the scheduler to overpack nodes. When pods burst, nodes OOM-thrashed and became unresponsive (k8s-node3 and k8s-node4 both went down today). Changes: - Increase default memory requests across all tiers (ratio now 2x): - core/cluster: 64Mi → 256Mi request (512Mi limit) - gpu: 256Mi → 1Gi request (2Gi limit) - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit) - Add kubelet memory reservation and eviction thresholds: - systemReserved: 512Mi, kubeReserved: 512Mi - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset) - Applied to all nodes and future node template	2026-03-08 10:33:38 +00:00
Viktor Barzin	b58ab599d6	[ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks tandoor.png → tandoor-recipes.png (dashboard-icons), podcast.png → mdi-podcast, networking.png → mdi-lan, goldilocks.png → mdi-scale-balance	2026-03-07 21:29:51 +00:00
Viktor Barzin	04bed0188e	[ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0	2026-03-07 20:39:56 +00:00
Viktor Barzin	c09e1a0b65	[ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint) Replace wanStatus with temp field. Remove wan interface param since the pfSense REST API v2 package doesn't expose /status/gateway.	2026-03-07 20:39:56 +00:00
Viktor Barzin	8583cd8578	[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains - qBittorrent: use service port 80 (not container port 8080) - Immich: add version=2 for new API endpoints (/api/server/*) - Nextcloud: use external URL (internal rejects untrusted Host header) - HA London: remove widget (token expired, needs manual regeneration) - Headscale: remove widget (requires nodeId param, not overview)	2026-03-07 20:39:56 +00:00
Viktor Barzin	98b6839dd6	[ci skip] fix widget URLs: use correct k8s service ports Services expose port 80 via ClusterIP but widgets were using container target ports (5000, 3001, 4533, 3000). Calibre was using external URL through Authentik. All now use correct internal service URLs.	2026-03-07 20:39:56 +00:00
Viktor Barzin	2ebbf364d1	[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma Add API credentials to SOPS and wire homepage_credentials through stacks. Re-add Uptime Kuma widget with new "infra" status page slug.	2026-03-07 20:39:55 +00:00
Viktor Barzin	d573e07674	[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale Wire homepage_credentials through servarr parent stack for prowlarr. Fix paperless-ngx widget to use internal service URL.	2026-03-07 20:39:55 +00:00
Viktor Barzin	ce41f6841f	[ci skip] fix broken Homepage widgets + add service API tokens to SOPS - Grafana: fix service URL (grafana not monitoring-grafana) - Uptime Kuma: remove widget (no status page configured) - Speedtest/Frigate/Immich: use internal k8s service URLs (external goes through Authentik forward auth, blocking API calls) - pfSense: clean up annotations - SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens	2026-03-07 20:39:55 +00:00
Viktor Barzin	c3df3393a7	[ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant Wire homepage_credentials tokens through platform stack to enable live widgets for Authentik, Shlink (URL shortener), and Home Assistant London. Update SOPS with new credential entries.	2026-03-07 20:39:54 +00:00
Viktor Barzin	af74aa297d	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
Viktor Barzin	ed6a2780fb	[ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag Fixed architecture and services pages to wrap table rows in <thead>/<tbody> as required by Svelte 5's strict HTML validation. E2E test passed: clean Alpine container → setup script → kubectl installed → CA cert verified against API server → TLS SUCCESS	2026-03-07 15:34:32 +00:00
Viktor Barzin	15f7114c4e	[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages) Bug fixes: - CA cert now populated in ConfigMap (was empty → TLS failures) - Remove useless heredoc quote escaping in setup script - Fix homepage: VPN callout, correct verification command (get namespaces) - Fix false-positive sensitive=true on ingress_path, tls_secret_name, truenas_host, ollama_host, client_certificate_secret_name New pages (direct Svelte, no mdsvex dependency): - /onboarding: step-by-step guide (VPN, kubectl, git, first PR) - /architecture: cluster topology, storage, networking, tiers - /services: catalog of 70+ services with URLs - /contributing: PR workflow, what you can/can't change, NEVER list - /troubleshooting: common issues and fixes Navigation bar added to layout. All pages use consistent docs styling. Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files && docker build -t viktorbarzin/k8s-portal:latest . && docker push	2026-03-07 15:06:26 +00:00
Viktor Barzin	db68067925	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	1824d2be67	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead. PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB. Historical TSDB data restored from NFS (39.8 GB total including all blocks).	2026-03-06 23:16:32 +00:00
Viktor Barzin	a7f3d432ee	[ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore	2026-03-06 22:51:38 +00:00
Viktor Barzin	63fb6201c8	[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI - Redis: local-path → iscsi-truenas (master + replica persistence) - Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data) - Loki: NFS PV → dynamic iSCSI via storageClass in Helm values - Deleted 2 orphaned Released iSCSI PVs (31Gi freed)	2026-03-06 20:50:55 +00:00
Viktor Barzin	94dcf22db4	[ci skip] exclude linkwarden from HighService4xxRate alert	2026-03-06 20:15:58 +00:00
Viktor Barzin	1d80c49201	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	a8e07ad930	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	31f3fc0773	[ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi) - Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default) - Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default) - Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)	2026-03-02 21:39:14 +00:00
Viktor Barzin	51d77369de	[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) Critical fix: StorageClass mountOptions only apply during dynamic provisioning. Our static PVs (created by Terraform) were missing mount_options, so all NFS mounts defaulted to hard,timeo=600 — the exact stale mount behavior we were trying to eliminate. Adds mount_options directly to the nfs_volume module PV spec and to the monitoring PVs (prometheus, loki, alertmanager). Requires re-applying all stacks to propagate to existing PVs.	2026-03-02 20:23:36 +00:00
Viktor Barzin	0e324df545	[ci skip] complete NFS CSI migration: complex stacks + platform modules Migrate remaining multi-volume stacks and all platform modules from inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options). Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols), nextcloud (2 vols + old PV replaced), rybbit (1 vol) Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing, real-estate-crawler Platform modules: monitoring (prometheus, loki, alertmanager PVs converted from native NFS to CSI), redis, dbaas, technitium, headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance	2026-03-02 01:24:07 +00:00
Viktor Barzin	481e4fa46e	[ci skip] add NFS CSI driver + nfs_volume shared module - Deploy csi-driver-nfs Helm chart as platform module (nfs-csi) - Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options - Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)	2026-03-01 23:38:58 +00:00
Viktor Barzin	ca648ff9bb	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	32762a0916	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	79af6fff47	[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory - dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD list/watch needed by kopf framework in sidecar containers - kyverno: restrict inject-priority-class-from-tier to CREATE operations only (was blocking pod patches with immutable spec error) - kyverno: add resource-governance/custom-limitrange label opt-out to LimitRange generation policy (mirrors existing custom-quota) - nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange with 8Gi max, opt out of Kyverno-managed LimitRange	2026-03-01 17:16:03 +00:00
Viktor Barzin	a8da2e3790	[ci skip] redis: pin service to master pod to fix read-only errors The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas). Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only replicas, causing write failures. Pin the service to redis-node-0 (master).	2026-03-01 17:13:25 +00:00
Viktor Barzin	d02d6ad356	[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster	2026-03-01 15:47:11 +00:00
Viktor Barzin	82d63a10ef	[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition	2026-03-01 15:05:57 +00:00
Viktor Barzin	00d3bb2fd1	[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain	2026-03-01 14:35:53 +00:00
Viktor Barzin	e36e192dd0	[ci skip] add Authentik PDB (minAvailable=2)	2026-03-01 14:24:47 +00:00
Viktor Barzin	a1ba1e0d37	[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout	2026-03-01 14:18:54 +00:00
Viktor Barzin	ecbee75275	[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down	2026-03-01 14:13:05 +00:00
Viktor Barzin	6ffc714683	[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down	2026-03-01 14:05:41 +00:00
Viktor Barzin	51fb96e906	[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster lacks quorum. Changed service selector from mysqlrouter to mysqld with publishNotReadyAddresses=true to bypass the operator's readiness gate. Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.	2026-03-01 12:16:28 +00:00
Viktor Barzin	a071f08dc8	[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator - MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane) - InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage - Group Replication with automatic failover via MySQL Router - Compatibility service: mysql.dbaas:3306 → Router port 6446 - Images from container-registry.oracle.com (not Docker Hub) - Init containers are slow (~20 min) due to mysqlsh plugin loading - Data restore from mysqldump pending after cluster is ONLINE	2026-03-01 03:00:21 +00:00
Viktor Barzin	a76c72042e	[ci skip] add graceful degradation to CrowdSec bouncer middleware P0: Set updateMaxFailure=-1 (fail-open) Previously defaulted to 0 which blocked ALL traffic on first LAPI failure. Now serves from cached decisions when LAPI is unreachable. P1: Enable Redis cache for CrowdSec decisions Decisions are now shared across all 3 Traefik replicas and survive pod restarts. redisCacheUnreachableBlock=false prevents Redis from becoming another SPOF. P1: Add clientTrustedIPs for internal cluster traffic Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass CrowdSec entirely, preventing internal cascade failures.	2026-03-01 02:36:53 +00:00
Viktor Barzin	d6731d857d	[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub Bitnami MySQL images can't be pulled (not found on Docker Hub, likely moved to a different registry). Reverted MySQL to single instance on NFS as the known-working state. MySQL replication to be revisited once image availability is resolved. PostgreSQL and Redis remain on local disk with replication.	2026-02-28 22:53:33 +00:00
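
The overcommit figures quoted in commit 4d3c6fcd79 ("4-8x" before, "2x" after) can be checked with a few lines of arithmetic. The sketch below is illustrative only. It uses the request and limit values listed in that commit message and assumes the limits were unchanged by the commit, which the message implies but does not state. The names `TIERS`, `overcommit_ratio`, and `ratios` are hypothetical and do not come from the repository.

```python
# Default memory values per tier in Mi, as quoted in commit 4d3c6fcd79.
# Assumption: only the requests were raised; the limits stayed the same.
TIERS = {
    "core/cluster": {"old_request": 64, "new_request": 256, "limit": 512},
    "gpu": {"old_request": 256, "new_request": 1024, "limit": 2048},
    "edge/aux/fallback": {"old_request": 64, "new_request": 128, "limit": 256},
}


def overcommit_ratio(limit_mi: int, request_mi: int) -> float:
    """Return how many times larger the limit is than the request."""
    return limit_mi / request_mi


def ratios(tiers: dict) -> dict:
    """Map each tier name to its (before, after) limit/request ratio."""
    return {
        name: (
            overcommit_ratio(t["limit"], t["old_request"]),
            overcommit_ratio(t["limit"], t["new_request"]),
        )
        for name, t in tiers.items()
    }


if __name__ == "__main__":
    for name, (before, after) in ratios(TIERS).items():
        print(f"{name}: {before:.0f}x -> {after:.0f}x")
```

The scheduler places pods by their requests, not their limits, so a lower ratio means less memory can be claimed beyond what a node was scheduled for when pods burst.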


97 commits