Commit graph

1592 commits

Author SHA1 Message Date
Viktor Barzin
f8b07b3bb9
[ci skip] Add anca as namespace-owner for plotting-book
- Add ancaelena98@gmail.com as namespace-owner for plotting-book namespace
- Fix RBAC module: don't create namespaces (they're managed by service modules)
- RoleBinding to built-in admin ClusterRole + cluster-wide read-only access
- ResourceQuota: 2 CPU / 4Gi mem requests, 4 CPU / 8Gi limits, 20 pods
2026-02-17 22:18:37 +00:00
Viktor Barzin
84d9a3f926
[ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes 2026-02-17 22:16:46 +00:00
Viktor Barzin
14ab6f115f
[ci skip] Fix multi-user k8s access: redirect URIs, email scope, image ref
- Change portal image to viktorbarzin/k8s-portal:latest (Docker Hub)
- Add k8s-portal to cloudflare_non_proxied_names
- Add k8s_users with viktor admin entry to terraform.tfvars
2026-02-17 22:15:20 +00:00
Viktor Barzin
03651080bb
[ci skip] Update Authentik API token reference to terraform.tfvars 2026-02-17 22:03:55 +00:00
Viktor Barzin
79ce0db11c
[ci skip] Pass skill secrets to moltbot container and fix Python env
- Add skill_secrets variable to moltbot module with HA tokens and
  Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
  in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of ~/.venvs/claude
  venv path that doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
2026-02-17 21:53:32 +00:00
Viktor Barzin
d0b39f1987
[ci skip] Implement multi-user Kubernetes access with OIDC
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
  and namespace-owner roles, API server OIDC flags, and audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
  with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
2026-02-17 21:42:39 +00:00
Viktor Barzin
72ff14e2df
[ci skip] Add Authentik API management knowledge 2026-02-17 21:10:40 +00:00
Viktor Barzin
6a8efa69c4
[ci skip] Import Claude skills into OpenClaw moltbot
- Convert setup-project and extend-vm-storage from standalone .md
  to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
  at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
2026-02-17 21:09:12 +00:00
Viktor Barzin
734c173f78
[ci skip] Add multi-user Kubernetes access implementation plan 2026-02-17 20:49:14 +00:00
Viktor Barzin
d9913cde2a
[ci skip] Add multi-user Kubernetes access design document 2026-02-17 20:44:23 +00:00
Viktor Barzin
587b649650
[ci skip] Increase drone namespace memory limits with custom ResourceQuota 2026-02-17 20:40:40 +00:00
Viktor Barzin
6f3395fbf5
[ci skip] Add Smart Home (ha-sofia) section to Cluster Health Overview dashboard 2026-02-17 19:48:02 +00:00
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
1802b4c86d
[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0 2026-02-16 22:30:57 +00:00
Viktor Barzin
a268b9107f
[ci skip] Replace specific CoreDNS catch-all blocks with generic template regex
Single template regex in the viktorbarzin.lan block catches ALL search
domain expansion junk (*.com.viktorbarzin.lan, *.cluster.local.viktorbarzin.lan,
etc.) instead of needing separate server blocks per pattern. Legitimate
single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.
2026-02-16 21:49:03 +00:00
Viktor Barzin
19136c21f1
[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries
Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to
return NXDOMAIN immediately, preventing search domain expansion junk
queries from reaching Technitium. Add trailing dots to Prometheus
scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.
2026-02-16 21:38:38 +00:00
Viktor Barzin
c08bab7da7
[ci skip] Update preference: always use cluster_healthcheck.sh for health checks 2026-02-16 21:19:49 +00:00
Viktor Barzin
205eb2704b
[ci skip] Fix Technitium DNS client IP logging: bypass Traefik L4 proxy
DNS queries were going through Traefik's IngressRouteUDP, replacing
real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs.
Changed Technitium DNS service from NodePort to LoadBalancer with
externalTrafficPolicy: Local, removed dns-udp entrypoint and
IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan
queries to Technitium's LoadBalancer IP directly.
2026-02-16 21:16:16 +00:00
Viktor Barzin
3d4cdf3203
[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict
- Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were
  OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
  conflict between Kyverno priority injection and default priority: 0
2026-02-16 20:09:53 +00:00
Viktor Barzin
32a7c04deb
[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks 2026-02-16 19:45:31 +00:00
Viktor Barzin
5c7efbdb48
[ci skip] Fix docker-registry VM: add SSH key, remove hourly restart cron
- Set explicit devvm SSH public key for cloud-init (was empty, breaking SSH access)
- Remove hourly cron that restarted all registry containers, which wiped the
  in-memory blobdescriptor cache and caused low pull-through cache hit rates
2026-02-15 22:16:41 +00:00
Viktor Barzin
7854f89add
[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood
Documents how Kubernetes ndots:5 search domain expansion floods external
DNS with NxDomain queries, and the CoreDNS template block fix.
2026-02-15 21:52:27 +00:00
Viktor Barzin
2d015c1cb4
Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip] 2026-02-15 21:51:41 +00:00
Viktor Barzin
a8f42d7fc0
[ci skip] Manage CoreDNS Corefile in Terraform and block junk NxDomain queries
Add kubernetes_config_map for CoreDNS to the technitium module, with a
template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN
immediately. This prevents ndots:5 search domain expansion from flooding
Technitium with ~66k/day junk queries (e.g.
redis.redis.svc.cluster.local.viktorbarzin.lan).

Also enabled saveCache on Technitium so the DNS cache persists across
pod restarts.
2026-02-15 21:51:12 +00:00
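Note on the commit above: a minimal sketch of how a glibc-style stub resolver with `ndots:5` expands a name through the search list, which is how queries such as redis.redis.svc.cluster.local.viktorbarzin.lan arise. The function name `candidate_queries`, the `monitoring` namespace, and the exact search list are illustrative assumptions, not taken from the repository; other resolvers (for example musl) behave differently.

```python
from typing import List

# Assumed pod search list: the three Kubernetes defaults for a pod in a
# hypothetical "monitoring" namespace, plus the node's own search domain.
SEARCH = [
    "monitoring.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "viktorbarzin.lan",
]


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Return the FQDNs a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: every search domain is tried before the name itself.
    return expanded + [absolute]


if __name__ == "__main__":
    for fqdn in candidate_queries("redis.redis.svc.cluster.local", SEARCH):
        print(fqdn)
```

With this search list, a lookup for `redis.redis.svc.cluster.local` (four dots, below `ndots:5`) produces four search-domain candidates before the real name. The last of those ends in `.viktorbarzin.lan.` and is the kind of query that leaves the cluster. A name written with a trailing dot yields a single query.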
Viktor Barzin
a2b44c8ff7
Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip] 2026-02-15 21:24:08 +00:00
Viktor Barzin
8d38e1474b
[ci skip] Document Terraform state splitting plan for future implementation 2026-02-15 21:10:40 +00:00
Viktor Barzin
f447e45ee1
Add Cluster Health Overview Grafana dashboard [ci skip] 2026-02-15 19:38:28 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
7ef23470cd
Add Uptime Kuma monitor check to cluster health script [ci skip]
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.
2026-02-15 17:49:40 +00:00
Viktor Barzin
df840a0078
[ci skip] remember: spawn subagent to monitor pods instead of sleeping 2026-02-15 17:48:42 +00:00
Viktor Barzin
8867769a75
Add cluster health check script with 13 diagnostic sections [ci skip] 2026-02-15 17:34:22 +00:00
Viktor Barzin
349fffc124
Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip]
- Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance)
- Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead)
- Increase gpu-pod-exporter liveness probe timeout from 1s to 5s
- Add osm-routing NFS exports (osrm-data, otp-data)
2026-02-15 17:20:47 +00:00
Viktor Barzin
2d475c8496
[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure 2026-02-15 17:18:17 +00:00
Viktor Barzin
f68ed73434
[ci skip] Strengthen Terraform-only change policy in project instructions 2026-02-15 15:10:11 +00:00
Viktor Barzin
c05614e4b8
update the scrape schedule for wrongmove [ci skip] 2026-02-15 14:40:05 +00:00
Viktor Barzin
0da86577fb
[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404 2026-02-15 14:36:50 +00:00
Viktor Barzin
dca2b0cabd
[ci skip] Add uptime-kuma management skill with tiered monitoring 2026-02-15 14:35:53 +00:00
Viktor Barzin
36d32b49e7
[ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
163d6a728d
Drone CI Update TLS Certificates Commit 2026-02-15 00:05:36 +00:00
Viktor Barzin
22fdb8fbf0
[ci skip] Add pfSense firewall management skill 2026-02-14 12:42:10 +00:00
Viktor Barzin
2b6bcae77f
[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup 2026-02-13 23:47:45 +00:00
Viktor Barzin
ecebf401bf
[ci skip] Update knowledge base with Loki + Alloy service notes 2026-02-13 23:46:01 +00:00
Viktor Barzin
cd52fc400c
[ci skip] Sync tfstate after Loki + Alloy deployment 2026-02-13 23:44:17 +00:00
Viktor Barzin
7644c419a4
[ci skip] Update Loki dashboard to use correct datasource UID 2026-02-13 23:41:40 +00:00
Viktor Barzin
cd2d13d949
[ci skip] Fix compactor/ruler paths to use writable /var/loki mount 2026-02-13 23:22:13 +00:00
Viktor Barzin
d906513f09
[ci skip] Re-enable lokiCanary (required by Helm chart validation) 2026-02-13 23:18:13 +00:00
Viktor Barzin
a38c3d3dc7
[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy 2026-02-13 23:17:32 +00:00
Viktor Barzin
f013c0a139
[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
c7236f09f1
[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet 2026-02-13 23:03:40 +00:00
Viktor Barzin
c330648b7b
[ci skip] Deploy MoltBot (OpenClaw) AI agent gateway
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
2026-02-13 22:57:36 +00:00