Add a sandbox attribute to prevent proxied pages from setting
top.location or replacing the parent page body. Allows scripts,
same-origin, forms, popups, and presentation, but blocks
top-navigation.
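A sketch of the viewer-side markup; the src is illustrative, and allow-top-navigation is deliberately omitted:

```html
<!-- without allow-top-navigation the embedded page cannot redirect
     the parent window or replace its body -->
<iframe src="https://proxy.example/stream"
        sandbox="allow-scripts allow-same-origin allow-forms allow-popups allow-presentation">
</iframe>
```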
Replace CPU-intensive headless Chrome + WebRTC pipeline with a
lightweight Go reverse proxy that strips anti-framing headers
(X-Frame-Options, CSP) and embeds streaming sites in iframes.
- New internal/proxy package with URL rewriting for HTML/CSS
- JS shim injection to intercept fetch/XHR/WebSocket/createElement
- Referer reconstruction for correct cross-origin auth (HLS streams)
- Inline iframe viewer preserving site navigation (not fullscreen overlay)
- Add priority_class_name to nextcloud whiteboard deployment to match
Kyverno-injected tier-3-edge priority class
- Add explicit resource limits (4Gi memory) for OnlyOffice document
server to prevent OOMKill during font generation
- Enable size-based TSDB retention (45GB) to clean up old blocks
(including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota limits from 64 CPU / 128Gi memory
  to 80 CPU / 160Gi to allow Grafana rolling updates
Kyverno injects priorityClassName tier-1-cluster on pods in the crowdsec
namespace, but pods had no explicit priorityClassName set, defaulting
priority to 0. Admission controller rejected the mismatch (0 vs 800000).
Set priorityClassName on LAPI, agent (Helm values) and crowdsec-web
(Terraform deployment).
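The fix amounts to an explicit priorityClassName on each pod spec matching what Kyverno injects, e.g.:

```yaml
spec:
  # must equal the Kyverno-injected value, otherwise the admission
  # controller rejects the mutation as a priority mismatch (0 vs 800000)
  priorityClassName: tier-1-cluster
```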
- Deploy coturn on k8s with MetalLB shared IP (10.0.20.200)
- Normal pod networking (no hostNetwork), runs on any node
- 100 relay ports (49152-49252), port 3478 for STUN/TURN signaling
- Shared secret auth for time-limited TURN credentials
- For F1 streaming WebRTC NAT traversal
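A sketch of the Service side using MetalLB's shared-IP annotation; the Service name and the single-port listing are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: coturn-udp
  annotations:
    # Services carrying the same sharing key may bind the same external IP
    metallb.universe.tf/allow-shared-ip: coturn
spec:
  type: LoadBalancer
  loadBalancerIP: 10.0.20.200
  ports:
    - name: stun-turn
      port: 3478
      protocol: UDP
```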
- UI and API: 1 → 2 replicas for zero-downtime during restarts/crashes
- Celery worker: Recreate → RollingUpdate strategy
- Celery beat: unchanged (Recreate, singleton scheduler)
- Move f1 from Cloudflare proxied to non-proxied DNS
- Set WEBAUTHN_RPID/ORIGIN for f1.viktorbarzin.me domain
- Add NFS volume at /mnt/main/f1-stream for persistent session/stream data
- Enable headless browser extraction (HEADLESS_EXTRACT_ENABLED=true)
- Reduce replicas to 1 (file-based sessions don't work across replicas)
Fixes HA mobile app 403 when embedding Grafana dashboards - the webview
blocks third-party cookies needed by Authentik forward auth. Grafana
already has anonymous Viewer access enabled, so forward auth is not
needed. Also adds grafana_admin_password variable and explicit resource
limits to prevent ResourceQuota issues during rolling updates.
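For reference, anonymous Viewer access in Grafana corresponds to ini settings like these (or the matching GF_AUTH_ANONYMOUS_* env vars):

```ini
[auth.anonymous]
enabled = true
org_role = Viewer
```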
- Add skill_secrets variable to moltbot module with HA tokens and
Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of the
  ~/.venvs/claude venv path, which doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
and namespace-owner roles, API server OIDC flags, and audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
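The API server flags are the standard upstream OIDC ones; the issuer and client values below are illustrative:

```
--oidc-issuer-url=https://authentik.example/application/o/kubernetes/
--oidc-client-id=kubernetes
--oidc-username-claim=email
--oidc-groups-claim=groups
```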
- Convert setup-project and extend-vm-storage from standalone .md
to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
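With the sidecar enabled, a dashboard ships as a labeled ConfigMap; grafana_dashboard is the chart's default discovery label, while the ConfigMap name and JSON body here are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-analytics-dashboard
  labels:
    grafana_dashboard: "1"   # the sidecar watches for this label
data:
  dns-analytics.json: |
    {"title": "DNS Analytics", "panels": []}
```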
Single template regex in the viktorbarzin.lan block catches ALL search
domain expansion junk (*.com.viktorbarzin.lan, *.cluster.local.viktorbarzin.lan,
etc.) instead of needing separate server blocks per pattern. Legitimate
single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.
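A sketch of that Corefile block, relying on the CoreDNS template plugin's match/fallthrough semantics; the forward target is illustrative:

```
viktorbarzin.lan {
    template IN ANY viktorbarzin.lan {
        # two or more labels under the zone => search-domain expansion junk
        match "^.+\..+\.viktorbarzin\.lan\.$"
        rcode NXDOMAIN
        # single-label names fall through to the forward below
        fallthrough
    }
    forward . 10.0.20.201   # Technitium LoadBalancer IP (illustrative)
}
```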
DNS queries were going through Traefik's IngressRouteUDP, replacing
real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs.
Changed Technitium DNS service from NodePort to LoadBalancer with
externalTrafficPolicy: Local, removed dns-udp entrypoint and
IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan
queries to Technitium's LoadBalancer IP directly.
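The relevant Service change, sketched:

```yaml
spec:
  type: LoadBalancer          # was NodePort behind IngressRouteUDP
  # Local avoids the extra SNAT hop, so Technitium logs real client IPs;
  # traffic is only answered by nodes running a Technitium pod
  externalTrafficPolicy: Local
```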
- Alloy: bump memory request/limit from 64Mi/128Mi to 256Mi/768Mi — pods were
  OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
conflict between Kyverno priority injection and default priority: 0
Add kubernetes_config_map for CoreDNS to the technitium module, with a
template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN
immediately. This prevents ndots:5 search domain expansion from flooding
Technitium with ~66k/day junk queries (e.g.
redis.redis.svc.cluster.local.viktorbarzin.lan).
Also enabled saveCache on Technitium so the DNS cache persists across
pod restarts.
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate
Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
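The injection can be sketched as a Kyverno mutate rule; the `+( )` anchor adds the field only when the pod doesn't already set it (policy name, tier value, and match scope are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-priority-class
spec:
  rules:
    - name: tier-priority
      match:
        any:
          - resources:
              kinds: [Pod]
      mutate:
        patchStrategicMerge:
          spec:
            # "+()" = add only if absent, so explicit values win
            +(priorityClassName): tier-3-edge
```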
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
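As a Prometheus rule this is roughly the following; the alert and job names are illustrative:

```yaml
groups:
  - name: home-assistant
    rules:
      - alert: HomeAssistantUnreachable
        expr: up{job="haos"} == 0
        for: 5m
        annotations:
          summary: "haos scrape target unreachable for 5m"
```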
- meshcentral: rename port from "https" to "http" — MeshCentral serves
plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
— nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
79°C), lower registry cache threshold 50%→25%, add minimum traffic
floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
on low-traffic services
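The traffic floor gates the error-ratio expression on absolute request rate, roughly as below; the metric name and 5% ratio threshold are illustrative:

```
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
  / sum(rate(traefik_service_requests_total[5m])) by (service) > 0.05
and
sum(rate(traefik_service_requests_total[5m])) by (service) > 0.1
```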
- Switch acquisition from ingress-nginx to traefik namespace/pods
- Change collection from crowdsecurity/nginx to crowdsecurity/traefik
- Add Slack notification plugin for ban/captcha decisions
- Wire alertmanager_slack_api_url through to CrowdSec module
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
The packruler/rewrite-body plugin (used for rybbit analytics injection)
fails to decompress gzip responses with "flate: corrupt input before
offset 5", corrupting the response body. This broke HA Companion app's
external_auth flow and WebSocket connections on ha-sofia.
Fix: add a strip-accept-encoding middleware that removes Accept-Encoding
from requests when rybbit is active, forcing backends to send uncompressed
responses that the plugin can safely process.
Also add extra_middlewares variable to reverse_proxy factory for
extensibility.
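In Traefik, deleting a request header is done by setting it to an empty value, so the middleware is roughly:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: strip-accept-encoding
spec:
  headers:
    customRequestHeaders:
      # an empty value removes the header, so backends reply
      # uncompressed and rewrite-body can rewrite safely
      Accept-Encoding: ""
```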