- Enable size-based TSDB retention (45GB) to clean up old blocks
(including 2021-era blocks with failed compaction)
- Increase the monitoring namespace quota from 64/128Gi to 80/160Gi
CPU/memory limits so Grafana rolling updates have headroom to schedule
a surge pod
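A minimal sketch of both changes, assuming the kube-prometheus-stack
chart; the quota reading (CPU limit 80, memory limit 160Gi) and object
names are assumptions:

```yaml
# Prometheus: size-based retention; oldest TSDB blocks are deleted first
prometheus:
  prometheusSpec:
    retentionSize: 45GB
---
# monitoring namespace quota after the bump (names assumed)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: monitoring-quota
  namespace: monitoring
spec:
  hard:
    limits.cpu: "80"
    limits.memory: 160Gi
```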
Kyverno injects priorityClassName: tier-1-cluster on pods in the crowdsec
namespace, but the pods had no explicit priorityClassName set, so their
priority defaulted to 0. The admission controller rejected the mismatch
(0 vs 800000). Set priorityClassName explicitly on the LAPI and agent
(Helm values) and on crowdsec-web (Terraform deployment).
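A hedged values sketch; the exact key names depend on the crowdsec
chart version:

```yaml
# crowdsec Helm values (key names assumed); must match what Kyverno injects
lapi:
  priorityClassName: tier-1-cluster
agent:
  priorityClassName: tier-1-cluster
```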
- Deploy coturn on k8s with a MetalLB shared IP (10.0.20.200); Service
sketch below
- Normal pod networking (no hostNetwork), runs on any node
- 100 relay ports (49152-49252), port 3478 for STUN/TURN signaling
- Shared secret auth for time-limited TURN credentials
- For F1 streaming WebRTC NAT traversal
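Roughly, the UDP Service looks like this; MetalLB's allow-shared-ip
annotation lets the TCP and UDP Services share 10.0.20.200 (names are
illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: coturn-udp
  annotations:
    # every Service sharing the IP must carry the same value here
    metallb.universe.tf/allow-shared-ip: coturn
spec:
  type: LoadBalancer
  loadBalancerIP: 10.0.20.200
  ports:
    - name: stun-turn
      port: 3478
      protocol: UDP
    # relay ports 49152-49252 are enumerated the same way
  selector:
    app: coturn
```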
- UI and API: 1 → 2 replicas for zero-downtime during restarts/crashes
- Celery worker: Recreate → RollingUpdate strategy
- Celery beat: unchanged (Recreate, singleton scheduler)
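The worker's strategy change as a Deployment excerpt (surge values are
an assumption):

```yaml
spec:
  strategy:
    type: RollingUpdate   # was Recreate; a new pod starts before the old one exits
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```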
- Move f1 from Cloudflare proxied to non-proxied DNS
- Set WEBAUTHN_RPID/ORIGIN for f1.viktorbarzin.me domain
- Add NFS volume at /mnt/main/f1-stream for persistent session/stream data
- Enable headless browser extraction (HEADLESS_EXTRACT_ENABLED=true)
- Reduce replicas to 1 (file-based sessions don't work across replicas)
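A sketch of the env and volume wiring; the NFS server address, mount
path, and ORIGIN value are placeholders:

```yaml
spec:
  containers:
    - name: f1-app              # name assumed
      env:
        - name: WEBAUTHN_RPID
          value: f1.viktorbarzin.me
        - name: WEBAUTHN_ORIGIN
          value: https://f1.viktorbarzin.me   # assumed form
        - name: HEADLESS_EXTRACT_ENABLED
          value: "true"
      volumeMounts:
        - name: f1-stream
          mountPath: /data      # placeholder path
  volumes:
    - name: f1-stream
      nfs:
        server: 10.0.10.5       # placeholder NAS address
        path: /mnt/main/f1-stream
```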
New skill: music-assistant-librespot-wrong-account
- Documents the fix for Spotify playback failing with "librespot does not
support free accounts" when cached credentials point to the wrong
Spotify account
- Includes step-by-step solution: find container, inspect cache, clear and restart
Updated: home-assistant skill with Music Assistant addon details for ha-sofia
Fixes the HA mobile app's 403 when embedding Grafana dashboards: the
webview blocks the third-party cookies Authentik forward auth needs.
Grafana already has anonymous Viewer access enabled, so forward auth is
not needed there. Also adds a grafana_admin_password variable and
explicit resource limits to prevent ResourceQuota issues during rolling
updates.
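Sketch of the Grafana Helm values involved, assuming the standalone
Grafana chart layout; the anonymous block is what makes forward auth
unnecessary, and the resource numbers are placeholders:

```yaml
grafana.ini:
  auth.anonymous:
    enabled: true
    org_role: Viewer    # read-only dashboards without cookies
adminPassword: ${grafana_admin_password}   # wired from the new variable
resources:
  requests:
    cpu: 100m           # placeholder numbers; explicit values keep the
    memory: 256Mi       # ResourceQuota math stable during rolling updates
  limits:
    cpu: 500m
    memory: 512Mi
```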
Two skills extracted from multi-user k8s access implementation:
- authentik-oidc-kubernetes: 6 gotchas for Authentik OIDC + kube-apiserver
- kubelet-static-pod-manifest-update: full restart cycle for static pod changes
- Add skill_secrets variable to moltbot module with HA tokens and
Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
in an init container, with PYTHONPATH so the main container can import
them (sketch below)
- Update all skills to use python3 directly instead of the
~/.venvs/claude venv path, which doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
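A pod-spec sketch of the init-container approach (image, paths, and
Secret names are assumptions):

```yaml
initContainers:
  - name: install-python-deps
    image: python:3.12-slim        # image assumed
    command:
      - sh
      - -c
      - pip install --target=/opt/pydeps requests caldav icalendar uptime-kuma-api
    volumeMounts:
      - name: pydeps
        mountPath: /opt/pydeps
containers:
  - name: moltbot
    env:
      - name: PYTHONPATH
        value: /opt/pydeps         # lets plain python3 import the packages
      - name: UPTIME_KUMA_PASSWORD # replaces the hardcoded password in the skill
        valueFrom:
          secretKeyRef:
            name: moltbot-skill-secrets   # Secret name assumed
            key: uptime-kuma-password
    volumeMounts:
      - name: pydeps
        mountPath: /opt/pydeps
volumes:
  - name: pydeps
    emptyDir: {}
```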
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
and namespace-owner roles, API server OIDC flags (excerpt below), and
audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
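The kube-apiserver side boils down to a handful of flags; the issuer
URL, claim names, and file paths below are assumptions:

```yaml
# Static pod manifest excerpt (/etc/kubernetes/manifests/kube-apiserver.yaml)
- --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
- --oidc-client-id=kubernetes
- --oidc-username-claim=email
- --oidc-groups-claim=groups   # needs the groups scope mapping in Authentik
- --audit-policy-file=/etc/kubernetes/audit-policy.yaml
- --audit-log-path=/var/log/kubernetes/audit.log   # the file Alloy tails
```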
- Convert setup-project and extend-vm-storage from standalone .md
to directory-based SKILL.md format with YAML frontmatter (example below)
- Add symlink in moltbot init container to expose Claude skills
at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
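After the conversion each skill is a directory whose SKILL.md opens
with frontmatter along these lines (description text is illustrative):

```yaml
---
name: extend-vm-storage
description: Illustrative one-liner used for skill auto-discovery
---
```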
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, and blocked/NXDOMAIN domains. Enable the Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
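With the sidecar on, any ConfigMap carrying the chart's default label
is auto-loaded (dashboard JSON trimmed to a stub):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-analytics-dashboard
  labels:
    grafana_dashboard: "1"   # default sidecar label in the Grafana chart
data:
  dns-analytics.json: |
    {"title": "DNS Analytics", "panels": []}
```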
Single template regex in the viktorbarzin.lan block catches ALL search
domain expansion junk (*.com.viktorbarzin.lan, *.cluster.local.viktorbarzin.lan,
etc.) instead of needing separate server blocks per pattern. Legitimate
single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.
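A sketch of the template block; the regex is an approximation of the
actual pattern, and fallthrough is what lets single-label names reach
the forward:

```
viktorbarzin.lan {
    template IN ANY viktorbarzin.lan {
        # two or more labels in front of the zone = search expansion junk
        match "^(.+)\.(.+)\.viktorbarzin\.lan\.$"
        rcode NXDOMAIN
        fallthrough   # idrac.viktorbarzin.lan has one label, so it falls through
    }
    forward . 10.0.20.53   # Technitium LoadBalancer IP (placeholder)
}
```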
DNS queries were going through Traefik's IngressRouteUDP, replacing
real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs.
Changed Technitium DNS service from NodePort to LoadBalancer with
externalTrafficPolicy: Local, removed dns-udp entrypoint and
IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan
queries to Technitium's LoadBalancer IP directly.
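The Service change in YAML form (metadata names assumed); with
externalTrafficPolicy: Local, kube-proxy skips the SNAT hop so
Technitium sees the real source address:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: technitium-dns
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserves client IPs; traffic only hits local pods
  ports:
    - name: dns-udp
      port: 53
      protocol: UDP
    - name: dns-tcp
      port: 53
      protocol: TCP
  selector:
    app: technitium
```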
- Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi; pods were
OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node (values
sketch below)
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
conflict between Kyverno priority injection and default priority: 0
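Values sketch for the Alloy bump, assuming the grafana/alloy chart's
key layout:

```yaml
alloy:
  resources:
    requests:
      memory: 256Mi
    limits:
      memory: 768Mi   # headroom over the ~400-450Mi steady state per node
```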
- Set explicit devvm SSH public key for cloud-init (was empty, breaking SSH access)
- Remove hourly cron that restarted all registry containers, which wiped the
in-memory blobdescriptor cache and caused low pull-through cache hit rates
Add kubernetes_config_map for CoreDNS to the technitium module, with a
template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN
immediately. This prevents ndots:5 search domain expansion from flooding
Technitium with ~66k/day junk queries (e.g.
redis.redis.svc.cluster.local.viktorbarzin.lan).
Also enabled saveCache on Technitium so the DNS cache persists across
pod restarts.
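The shape of the change as the rendered ConfigMap; a template with no
match directive matches everything in its zone, so these queries get
NXDOMAIN without ever leaving CoreDNS (Corefile trimmed to the new
block):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    cluster.local.viktorbarzin.lan {
        template IN ANY cluster.local.viktorbarzin.lan {
            rcode NXDOMAIN
        }
    }
    # ...existing zones unchanged...
```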