Viktor Barzin
407b33abd6
resource quota review: fix OOM risks, close quota gaps, add HA protections
...
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
(outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
explicit resources)
Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification
Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16
Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template
Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
f9efa902ef
[ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info
2026-03-07 20:39:55 +00:00
Viktor Barzin
f02f2a5a4d
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
...
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
6e1586342c
[ci skip] expand k8s worker nodes to 256G, update inventory and extend script
...
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
0eababf212
[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
...
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
ff66adbe9e
[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
...
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)
Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00