Commit graph

5 commits

Author SHA1 Message Date
Viktor Barzin
28ac1382d1 Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-18 08:03:58 +00:00
Viktor Barzin
b45688646d
Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP]
- Switch from custom clone override to woodpeckerci/plugin-git built-in clone
  (handles auth automatically via netrc from GitHub OAuth token)
- Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense
  (fixes intermittent DNS timeouts causing clone failures)
- Fix missing comma after heredoc in audit-policy.tf (syntax error)
2026-02-23 00:08:42 +00:00
Viktor Barzin
d870a63130
[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug
Change cluster-healthcheck CronJob from every 30min to every 8h.
Replace fragile sed-based audit config in apiserver manifest with
idempotent Python script that deduplicates by name/mountPath,
preventing the duplicate volume entries that crashed the API server.
2026-02-22 23:18:30 +00:00
Viktor Barzin
cf67e02135
[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
e225e81ebf
[ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00