Commit graph

9 commits

Author SHA1 Message Date
Viktor Barzin
ff83ec3325 add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts
Agents: devops-engineer, dba, security-engineer, sre, network-engineer,
platform-engineer, observability-engineer, home-automation-engineer.
Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status,
authentik-audit, oom-investigator, resource-report, dns-check, network-health,
nfs-health, truenas-status, platform-status, monitoring-health.
Also: known-issues.md suppression list, cluster-health-checker port-forward fix.
2026-03-15 02:01:07 +00:00
Viktor Barzin
d05ff57b11 authentik: auto-assign invitation group via expression policy [ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
2026-03-13 22:21:10 +00:00
Viktor Barzin
160fda882f authentik: cleanup unused resources + add invitation enrollment flow [ci skip]
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)

Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
  identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill
2026-03-13 22:21:10 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
7027c49fef [ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info 2026-03-07 20:39:55 +00:00
Viktor Barzin
c170351e77 [ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.

Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).

Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
c8de2c4803 [ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
abe89c926e [ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00