infra

Author	SHA1	Message	Date
Viktor Barzin	4366a8b413	[ci skip] Add one-command setup scripts to k8s-portal - Add /setup/script?os=mac and /setup/script?os=linux endpoints - Scripts install kubectl, kubelogin, write kubeconfig, update shell rc - Unprotected ingress for /setup/script (curl-able without auth) - Fix kubeconfig to include --oidc-extra-scope for email/profile/groups	2026-02-17 22:22:41 +00:00
Viktor Barzin	9dad07618d	[ci skip] Add anca as namespace-owner for plotting-book - Add ancaelena98@gmail.com as namespace-owner for plotting-book namespace - Fix RBAC module: don't create namespaces (they're managed by service modules) - RoleBinding to built-in admin ClusterRole + cluster-wide read-only access - ResourceQuota: 2 CPU / 4Gi mem requests, 4 CPU / 8Gi limits, 20 pods	2026-02-17 22:18:37 +00:00
Viktor Barzin	aa433d0750	[ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes	2026-02-17 22:16:46 +00:00
Viktor Barzin	c3840574a8	[ci skip] Update Authentik API token reference to terraform.tfvars	2026-02-17 22:03:55 +00:00
Viktor Barzin	7e3286e572	[ci skip] Pass skill secrets to moltbot container and fix Python env - Add skill_secrets variable to moltbot module with HA tokens and Uptime Kuma password as container env vars - Install Python packages (requests, caldav, icalendar, uptime-kuma-api) in init container with PYTHONPATH for main container access - Update all skills to use python3 directly instead of ~/.venvs/claude venv path that doesn't exist in the container - Remove hardcoded Uptime Kuma password from skill, use env var	2026-02-17 21:53:32 +00:00
Viktor Barzin	9bcdb9e59f	[ci skip] Implement multi-user Kubernetes access with OIDC - Add RBAC module (modules/kubernetes/rbac/) with admin, power-user, and namespace-owner roles, API server OIDC flags, and audit logging - Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app with kubeconfig download and setup instructions - Configure Alloy to collect audit logs from kube-apiserver - Add Grafana dashboard for Kubernetes audit log visualization - Configure Authentik OIDC provider with groups scope mapping - Wire up k8s_users and ssh_private_key variables through module chain	2026-02-17 21:42:39 +00:00
Viktor Barzin	9853b5edf7	[ci skip] Add Authentik API management knowledge	2026-02-17 21:10:40 +00:00
Viktor Barzin	5a2803736d	[ci skip] Import Claude skills into OpenClaw moltbot - Convert setup-project and extend-vm-storage from standalone .md to directory-based SKILL.md format with YAML frontmatter - Add symlink in moltbot init container to expose Claude skills at ~/.openclaw/skills/ for auto-discovery by OpenClaw - Update CLAUDE.md skill path references	2026-02-17 21:09:12 +00:00
Viktor Barzin	85581923f6	[ci skip] Add multi-user Kubernetes access implementation plan	2026-02-17 20:49:14 +00:00
Viktor Barzin	cf146f5980	[ci skip] Add multi-user Kubernetes access design document	2026-02-17 20:44:23 +00:00
Viktor Barzin	5fd1ee0a9d	[ci skip] Increase drone namespace memory limits with custom ResourceQuota	2026-02-17 20:40:40 +00:00
Viktor Barzin	0545576335	[ci skip] Add Smart Home (ha-sofia) section to Cluster Health Overview dashboard	2026-02-17 19:48:02 +00:00
Viktor Barzin	039f8559c9	[ci skip] Add Grafana dashboard for Technitium DNS query logs Add MySQL datasource and 15-panel dashboard for DNS analytics: queries over time, response codes, top domains/clients, response times, blocked/NxDomain domains. Enable Grafana dashboard sidecar for auto-provisioning dashboards from ConfigMaps.	2026-02-16 23:06:41 +00:00
Viktor Barzin	80ea818476	[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0	2026-02-16 22:30:57 +00:00
Viktor Barzin	530986e3c6	[ci skip] Replace specific CoreDNS catch-all blocks with generic template regex Single template regex in the viktorbarzin.lan block catches ALL search domain expansion junk (.com.viktorbarzin.lan, .cluster.local.viktorbarzin.lan, etc.) instead of needing separate server blocks per pattern. Legitimate single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.	2026-02-16 21:49:03 +00:00
Viktor Barzin	f06b3ac0e4	[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to return NXDOMAIN immediately, preventing search domain expansion junk queries from reaching Technitium. Add trailing dots to Prometheus scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.	2026-02-16 21:38:38 +00:00
Viktor Barzin	800b5db3b3	[ci skip] Update preference: always use cluster_healthcheck.sh for health checks	2026-02-16 21:19:49 +00:00
Viktor Barzin	8107e5273c	[ci skip] Fix Technitium DNS client IP logging: bypass Traefik L4 proxy DNS queries were going through Traefik's IngressRouteUDP, replacing real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs. Changed Technitium DNS service from NodePort to LoadBalancer with externalTrafficPolicy: Local, removed dns-udp entrypoint and IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan queries to Technitium's LoadBalancer IP directly.	2026-02-16 21:16:16 +00:00
Viktor Barzin	0eb5eb738a	[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict - Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node - iDRAC Redfish Exporter: add explicit priority_class_name to resolve conflict between Kyverno priority injection and default priority: 0	2026-02-16 20:09:53 +00:00
Viktor Barzin	d8b3922b62	[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks	2026-02-16 19:45:31 +00:00
Viktor Barzin	0eac3d6de6	[ci skip] Fix docker-registry VM: add SSH key, remove hourly restart cron - Set explicit devvm SSH public key for cloud-init (was empty, breaking SSH access) - Remove hourly cron that restarted all registry containers, which wiped the in-memory blobdescriptor cache and caused low pull-through cache hit rates	2026-02-15 22:16:41 +00:00
Viktor Barzin	6f33c3008f	[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood Documents how Kubernetes ndots:5 search domain expansion floods external DNS with NxDomain queries, and the CoreDNS template block fix.	2026-02-15 21:52:27 +00:00
Viktor Barzin	5cb4bda289	Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip]	2026-02-15 21:51:41 +00:00
Viktor Barzin	c0a18c9c57	[ci skip] Manage CoreDNS Corefile in Terraform and block junk NxDomain queries Add kubernetes_config_map for CoreDNS to the technitium module, with a template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN immediately. This prevents ndots:5 search domain expansion from flooding Technitium with ~66k/day junk queries (e.g. redis.redis.svc.cluster.local.viktorbarzin.lan). Also enabled saveCache on Technitium so the DNS cache persists across pod restarts.	2026-02-15 21:51:12 +00:00
Viktor Barzin	2db6e96115	Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip]	2026-02-15 21:24:08 +00:00
Viktor Barzin	e76a80eb72	[ci skip] Document Terraform state splitting plan for future implementation	2026-02-15 21:10:40 +00:00
Viktor Barzin	608d5ab636	Add Cluster Health Overview Grafana dashboard [ci skip]	2026-02-15 19:38:28 +00:00
Viktor Barzin	4d9b8242e8	Add tier-based resource governance via Kyverno [ci skip] Four layers of noisy-neighbor protection using existing tier system: - PriorityClasses (tier-0-core through tier-4-aux) - LimitRange defaults auto-generated per namespace tier - ResourceQuotas auto-generated per namespace tier - PriorityClassName injection on pods via Kyverno mutate Custom quota overrides for monitoring and crowdsec namespaces which exceed the default tier quotas.	2026-02-15 18:48:33 +00:00
Viktor Barzin	2bae6ccce3	Add Uptime Kuma monitor check to cluster health script [ci skip] Adds check #14 that queries Uptime Kuma API for application-level monitor status, complementing the kubectl-level checks with HTTP/ping health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.	2026-02-15 17:49:40 +00:00
Viktor Barzin	719e3c6244	[ci skip] remember: spawn subagent to monitor pods instead of sleeping	2026-02-15 17:48:42 +00:00
Viktor Barzin	9c4ff21d58	Add cluster health check script with 13 diagnostic sections [ci skip]	2026-02-15 17:34:22 +00:00
Viktor Barzin	a73f3fcb6b	Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip] - Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance) - Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead) - Increase gpu-pod-exporter liveness probe timeout from 1s to 5s - Add osm-routing NFS exports (osrm-data, otp-data)	2026-02-15 17:20:47 +00:00
Viktor Barzin	3da35166ab	[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure	2026-02-15 17:18:17 +00:00
Viktor Barzin	95013c9056	[ci skip] Strengthen Terraform-only change policy in project instructions	2026-02-15 15:10:11 +00:00
Viktor Barzin	606a79078e	[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404	2026-02-15 14:36:50 +00:00
Viktor Barzin	a7f2d6b9e6	[ci skip] Add uptime-kuma management skill with tiered monitoring	2026-02-15 14:35:53 +00:00
Viktor Barzin	a67a6f350e	[ci skip] Fix pull-through cache for all registries Replace deprecated wildcard containerd mirror with per-registry config_path approach. Add proxy containers for ghcr.io, quay.io, registry.k8s.io, and reg.kyverno.io on the docker-registry VM. Set static IP for docker-registry VM to avoid DHCP issues.	2026-02-15 14:35:52 +00:00
Viktor Barzin	5a37c26e9b	Drone CI Update TLS Certificates Commit	2026-02-15 00:05:36 +00:00
Viktor Barzin	c473663b98	[ci skip] Add pfSense firewall management skill	2026-02-14 12:42:10 +00:00
Viktor Barzin	ca43b97fa0	[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup	2026-02-13 23:47:45 +00:00
Viktor Barzin	a5b240629c	[ci skip] Update knowledge base with Loki + Alloy service notes	2026-02-13 23:46:01 +00:00
Viktor Barzin	c4a7c5df8e	[ci skip] Update Loki dashboard to use correct datasource UID	2026-02-13 23:41:40 +00:00
Viktor Barzin	fabece6370	[ci skip] Fix compactor/ruler paths to use writable /var/loki mount	2026-02-13 23:22:13 +00:00
Viktor Barzin	0d3acec82c	[ci skip] Re-enable lokiCanary (required by Helm chart validation)	2026-02-13 23:18:13 +00:00
Viktor Barzin	fea7c6cbb1	[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy	2026-02-13 23:17:32 +00:00
Viktor Barzin	69aae2ec9d	[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc	2026-02-13 23:08:44 +00:00
Viktor Barzin	71ff803978	[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet	2026-02-13 23:03:40 +00:00
Viktor Barzin	a44dfac721	[ci skip] Deploy MoltBot (OpenClaw) AI agent gateway Add new Kubernetes service for OpenClaw gateway connected to in-cluster Ollama, with kubectl/terraform/git access for infrastructure management. Protected behind Authentik SSO.	2026-02-13 22:57:36 +00:00
Viktor Barzin	08ea489fe0	[ci skip] Add extend-vm-storage script and skills - Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon) - Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts) - Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)	2026-02-13 22:08:46 +00:00
Viktor Barzin	04dd438b01	[ci skip] Add centralized log collection implementation plan	2026-02-13 21:54:55 +00:00

1254 commits