Commit graph

104 commits

Author SHA1 Message Date
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
dcc96f465e docs(storage): add encrypted LVM documentation
Update storage docs to reflect the 2026-04-15 migration of all sensitive
services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow
documentation, updated architecture diagram, and storage class decision
rules.

Files updated:
- .claude/CLAUDE.md: storage decision table, encrypted PVC template
- docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths
- AGENTS.md: storage section with encrypted SC as default for sensitive data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:00:37 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
4498f61402 fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with
  detailed comments explaining fsid=0 danger and NFSv3 disable rationale.
  Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
- scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts.
  Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running,
  no active exports. Warns but doesn't abort (block-storage PVC backups can still run).
- .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory,
  fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:08:24 +00:00
Viktor Barzin
1ef40daeec docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip] 2026-04-13 23:05:46 +01:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
6ba4878f3a docs: update storage architecture for NFS migration to Proxmox host [ci skip] 2026-04-11 17:00:10 +01:00
Viktor Barzin
eec6af6aef docs: add IPAM/DDNS architecture diagram and update docs
- networking.md: Add mermaid diagram showing full device discovery pipeline
  (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync)
- networking.md: Add data flow table, DHCP coverage table
- networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM
  (passive import replaces fping), Technitium (192.168.1.2 in ACL)
- CLAUDE.md: Update phpIPAM and networking descriptions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:42:10 +00:00
Viktor Barzin
8cd8743140 docs: add phpIPAM, Kea DDNS, and DNS sync documentation
- networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones,
  Technitium dynamic update policy
- CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section
- service-catalog.md: Add phpipam, mark netbox as disabled/replaced

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:01:32 +00:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
64c378d158 add critical instruction to update docs with every infra change [ci skip] 2026-04-06 13:21:49 +03:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
ad7c0d7fc8 docs: add critical "Terraform Only" rule to CLAUDE.md
All infrastructure changes must go through Terraform/Terragrunt.
kubectl is read-only except for temporary migration steps.
If a resource isn't in Terraform, evaluate adding it before
making manual changes.
2026-04-05 19:46:07 +03:00
Viktor Barzin
2d5c55f7b1 docs: add storage class decision rule to CLAUDE.md
Default to proxmox-lvm for all new services. NFS only for RWX,
backup destinations, or shared media libraries. Updated iSCSI
backup section to reflect proxmox-lvm migration.
2026-04-04 16:35:12 +03:00
Viktor Barzin
10f22350c5 exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip]
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.
2026-03-29 13:44:32 +03:00
Viktor Barzin
78dec8f0ad add e2e email roundtrip monitoring
CronJob (every 30 min) sends test email via Mailgun API to
smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all,
deletes test email, pushes metrics to Pushgateway + Uptime Kuma.

Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale,
EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.
2026-03-25 22:50:22 +02:00
Viktor Barzin
1639910043 ingress latency: add histogram buckets, fix restarts, right-size memory
- Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99
- Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts
- Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi)
- Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi)
- ActualBudget: add explicit resources to prevent silent LimitRange defaults
- Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)
2026-03-23 10:52:43 +02:00
Viktor Barzin
813f523170 docs: add private registry usage to infra CLAUDE.md [ci skip] 2026-03-23 01:08:57 +02:00
Viktor Barzin
c111799831 remove duplicated agents, update CLAUDE.md references [ci skip]
All agents now live globally in ~/.claude/agents/ (shared via dotfiles).
Deleted 11 duplicates, moved sev-*/deploy-app to global scope.
2026-03-22 23:44:27 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
fd130971aa feat(provision): automated user provisioning via Authentik webhook
- Expand CI Vault policy: write secret/data/platform + Transit SOPS keys
- Add Woodpecker provision-user.yml pipeline (manual event, API-triggered)
- Add env vars to webhook-handler deployment for Woodpecker/Authentik integration
- Update add-user skill with automated flow documentation
- Update Woodpecker repo ID list in CLAUDE.md
2026-03-17 23:56:30 +00:00
Viktor Barzin
6239e07dd5 docs: add plotting-book to GHA-migrated list and repo IDs [ci skip] 2026-03-17 23:07:32 +00:00
Viktor Barzin
88abbef7c3 update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip] 2026-03-16 07:10:45 +00:00
Viktor Barzin
b87ba5e778 update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip] 2026-03-15 23:21:52 +00:00
Viktor Barzin
c8069f53c8 update claude knowledge: final ESO migration state [ci skip] 2026-03-15 22:32:46 +00:00
Viktor Barzin
23dfaa1ac8 update claude knowledge: vault-native secrets migration decisions [ci skip] 2026-03-15 21:00:07 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
944d6d3b22 update claude knowledge: resource management learnings from right-sizing session [ci skip] 2026-03-15 15:38:37 +00:00
Viktor Barzin
307b7f6819 update claude knowledge: infra operational learnings from commit history [ci skip]
Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
2026-03-15 10:46:45 +00:00
Viktor Barzin
0a69af618d update claude knowledge: vault KV secrets migration [ci skip] 2026-03-15 03:22:07 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
Viktor Barzin
5f71a53b08 add memory-tool instructions to project CLAUDE.md [ci skip]
OpenClaw agents read the project-level CLAUDE.md from the workspace.
Adding explicit memory-tool CLI instructions here ensures the agent
uses exec to call memory-tool instead of looking for non-existent
MCP tools (memory_store, memory_recall).
2026-03-15 02:16:03 +00:00
Viktor Barzin
456e2777f5 update claude knowledge: LinuxServer.io container optimization learnings [ci skip] 2026-03-15 02:04:04 +00:00
Viktor Barzin
916aa6c6cb update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip] 2026-03-14 23:42:17 +00:00
Viktor Barzin
4635d3b826 remember: CrowdSec Helm upgrade timeout [ci skip] 2026-03-14 12:04:07 +00:00
Viktor Barzin
2fa8ba2038 [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
2026-03-08 20:03:50 +00:00
Viktor Barzin
98f4920af1 [ci skip] remember: update kubelet thresholds when changing node memory 2026-03-08 10:34:17 +00:00
Viktor Barzin
9f2ac0fd1a [ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline
AGENTS.md: added SOPS secrets management section, scripts/tg usage,
contributor onboarding steps, pull-through cache bypass notes.

CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder,
versioned tag guidance for pull-through cache.

CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys
the k8s portal when files under stacks/platform/modules/k8s-portal/files/
change on master push. Uses buildx for linux/amd64.
2026-03-07 15:37:19 +00:00
Viktor Barzin
8d3db35b5e [ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00
Viktor Barzin
c170351e77 [ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.

Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).

Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
bcbe8b23b4 [ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent
- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense,
  home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash)
  for autonomous health checks without consuming main context
2026-03-06 23:17:40 +00:00
Viktor Barzin
e6234d4683 [ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki 2026-03-06 21:05:21 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
61a532054e [ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module 2026-03-02 02:04:47 +00:00
Viktor Barzin
de598996f1 [ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00
Viktor Barzin
ccf0b2232f [ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00