Commit graph

1076 commits

Author SHA1 Message Date
Viktor Barzin
d6cdbeaabe
[ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
Viktor Barzin
bb44b4e18e
[ci skip] Fix JS shim rw() routing non-proxy paths through proxy prefix
When upstream JS constructs URLs via location.origin + '/path', the rw()
function stripped the origin but returned bare '/path' which hit our
server's HTML index. Now correctly prefixes with /proxy/{b64origin} so
XHR/fetch requests for scripts reach the upstream via proxy.
Bump image to v1.2.7
2026-02-21 23:16:09 +00:00
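
The routing fix described in bb44b4e18e can be sketched as follows. This is a minimal Python illustration of the rewrite rule only, not the shim's actual JavaScript: the function signature, the `b64origin` helper, and the unpadded URL-safe base64 encoding are assumptions, since the commit message does not show them.

```python
import base64


def b64origin(origin: str) -> str:
    # Assumed encoding: URL-safe base64 without padding. The encoding the
    # real proxy uses is not stated in the commit message.
    return base64.urlsafe_b64encode(origin.encode()).decode().rstrip("=")


def rw(url: str, upstream_origin: str, page_origin: str) -> str:
    """Rewrite a URL so the request reaches the upstream through the proxy."""
    prefix = f"/proxy/{b64origin(upstream_origin)}"
    if url == prefix or url.startswith(prefix + "/"):
        return url  # already routed through the proxy
    # Upstream JS built the URL as location.origin + '/path', so it carries
    # the proxy server's own origin. Strip it to get the bare path.
    if url.startswith(page_origin + "/"):
        url = url[len(page_origin):]
    # The fix: a bare '/path' must get the proxy prefix, otherwise it hits
    # the proxy server's own HTML index instead of the upstream.
    if url.startswith("/") and not url.startswith("//"):
        return prefix + url
    return url
```

With this rule, both `https://<proxy host>/js/app.js` and `/js/app.js` resolve to the same `/proxy/{b64origin}/js/app.js` path.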
Viktor Barzin
fab25e68ad
[ci skip] Fix video playback and comprehensive anti-debug neutralization
Video:
- Add allow="autoplay; encrypted-media; fullscreen" to iframe for media playback

Anti-debug:
- Strip ad/popup scripts (acscdn, popunder) and context menu blockers from HTML
- Strip debugger statements from inline HTML scripts and proxied JS responses
- Intercept setTimeout (not just setInterval) for debugger-based detection
- Override eval() and Function() constructor to strip debugger statements
- Bump image to v1.2.6
2026-02-21 23:12:11 +00:00
Viktor Barzin
a0394f4bef
[ci skip] Fix Kyverno priority injection to remove default priority/preemptionPolicy
The priority injection policy was setting priorityClassName on pods but
Kubernetes had already defaulted priority=0 and preemptionPolicy=PreemptLowerPriority
on those pods, causing admission controller to reject the mismatch.

Switch from patchStrategicMerge to patchesJson6902 to explicitly remove
the priority and preemptionPolicy fields before setting priorityClassName.
2026-02-21 23:11:35 +00:00
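
The effect of the JSON 6902 patch described in a0394f4bef can be sketched in Python. The operation list below is an assumption reconstructed from the commit message (the real Kyverno policy is not shown), `tier-3-edge` is used only as an example class name, and `apply_patch` is a hypothetical helper that handles just the simple object paths needed here.

```python
import copy

# Assumed shape of the patch: drop the defaulted fields, then set the class.
PATCH = [
    {"op": "remove", "path": "/spec/priority"},
    {"op": "remove", "path": "/spec/preemptionPolicy"},
    {"op": "add", "path": "/spec/priorityClassName", "value": "tier-3-edge"},
]


def apply_patch(doc: dict, ops: list) -> dict:
    """Apply add/remove operations with plain object paths to a copy of doc."""
    doc = copy.deepcopy(doc)
    for op in ops:
        *parents, leaf = op["path"].lstrip("/").split("/")
        target = doc
        for key in parents:
            target = target[key]
        if op["op"] == "remove":
            # Per RFC 6902, removing a missing member is an error
            # (surfaces here as KeyError).
            del target[leaf]
        elif op["op"] == "add":
            target[leaf] = op["value"]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return doc
```

A strategic merge patch can only set or overlay fields, which is why the explicit `remove` operations are needed to clear the values Kubernetes has already defaulted.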
Viktor Barzin
c279d453a6
[ci skip] Fix narrow iframe content and strip anti-debug scripts in proxy
- Remove flex centering from browser-viewer-content; use absolute positioning
  for iframe to fill the entire container
- Strip disable-devtool and devtools-detect script tags from proxied HTML
- Add JS shim hooks to neutralize setInterval-based debugger traps and block
  loading of anti-debug scripts via setAttribute
- Bump image to v1.2.5
2026-02-21 21:32:39 +00:00
Viktor Barzin
fd7f22d8cc
[ci skip] Sandbox proxy iframe to prevent frame-busting
Add sandbox attribute to prevent proxied pages from navigating
top.location or replacing the parent page body. Allows scripts,
same-origin, forms, popups, and presentation but blocks
top-navigation.
2026-02-21 21:25:51 +00:00
Viktor Barzin
450dfc28e4
[ci skip] Add reverse proxy mode to f1-stream
Replace CPU-intensive headless Chrome + WebRTC pipeline with a
lightweight Go reverse proxy that strips anti-framing headers
(X-Frame-Options, CSP) and embeds streaming sites in iframes.

- New internal/proxy package with URL rewriting for HTML/CSS
- JS shim injection to intercept fetch/XHR/WebSocket/createElement
- Referer reconstruction for correct cross-origin auth (HLS streams)
- Inline iframe viewer preserving site navigation (not fullscreen overlay)
2026-02-21 21:23:21 +00:00
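
The header-stripping step in 450dfc28e4 might look like the sketch below. The actual implementation is a Go package that is not shown here; this Python version only illustrates the idea, and dropping the whole `Content-Security-Policy` header (rather than editing individual directives) is an assumption.

```python
# Response headers that stop a page from being embedded in an iframe.
ANTI_FRAMING_HEADERS = {"x-frame-options", "content-security-policy"}


def strip_anti_framing(headers: dict) -> dict:
    """Return a copy of the upstream response headers without anti-framing ones."""
    # HTTP header names are case-insensitive, so compare in lower case.
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in ANTI_FRAMING_HEADERS
    }
```

All other headers, such as `Content-Type`, pass through unchanged.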
Viktor Barzin
f7710b6067
[ci skip] Fix whiteboard priority class mismatch and OnlyOffice OOMKill
- Add priority_class_name to nextcloud whiteboard deployment to match
  Kyverno-injected tier-3-edge priority class
- Add explicit resource limits (4Gi memory) for OnlyOffice document
  server to prevent OOMKill during font generation
2026-02-21 21:22:03 +00:00
Viktor Barzin
c1a18a7426
[ci skip] Fix Prometheus storage alert and Grafana quota exhaustion
- Enable size-based TSDB retention (45GB) to clean up old blocks
  (including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
  CPU/memory limits to allow Grafana rolling updates
2026-02-21 21:04:08 +00:00
Viktor Barzin
a12a81bdd5
[ci skip] Bump inotify max_user_instances from 512 to 8192
Fixes "failed to create fsnotify watcher: too many open files" in Drone
CI builds where vitest exhausts the default inotify instance limit.
2026-02-21 20:21:04 +00:00
Viktor Barzin
dea1cec3d0
[ci skip] Fix CrowdSec pods failing due to priority class mismatch
Kyverno injects priorityClassName tier-1-cluster on pods in the crowdsec
namespace, but pods had no explicit priorityClassName set, defaulting
priority to 0. Admission controller rejected the mismatch (0 vs 800000).

Set priorityClassName on LAPI, agent (Helm values) and crowdsec-web
(Terraform deployment).
2026-02-21 19:18:15 +00:00
Viktor Barzin
767a8250f6
[ci skip] Disable grampsweb service and remove family DNS record
2026-02-21 18:55:54 +00:00

Viktor Barzin
fdf374b751
[ci skip] Add coturn TURN/STUN server for WebRTC relay
- Deploy coturn on k8s with MetalLB shared IP (10.0.20.200)
- Normal pod networking (no hostNetwork), runs on any node
- 100 relay ports (49152-49252), port 3478 for STUN/TURN signaling
- Shared secret auth for time-limited TURN credentials
- For F1 streaming WebRTC NAT traversal
2026-02-21 18:08:01 +00:00
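
The "shared secret auth for time-limited TURN credentials" in fdf374b751 most likely refers to the convention coturn supports through its `use-auth-secret` option: the username carries an expiry timestamp and the password is an HMAC of that username. The commit does not show the configuration, so the sketch below is an assumption about the scheme, with hypothetical names and a hypothetical default lifetime.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional, Tuple


def turn_credentials(secret: str, user: str, ttl: int = 3600,
                     now: Optional[float] = None) -> Tuple[str, str]:
    """Derive a time-limited TURN username/password pair from a shared secret."""
    expiry = int((time.time() if now is None else now) + ttl)
    # The username embeds the Unix time after which the credential is invalid.
    username = f"{expiry}:{user}"
    # Password = base64(HMAC-SHA1(secret, username)); the TURN server holds
    # the same secret and recomputes it to verify.
    digest = hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
    return username, base64.b64encode(digest).decode()
```

Because the server only needs the secret and the clock, no per-user credential database is required.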
Viktor Barzin
8ec983e3fd
[ci skip] Real estate crawler: 2 replicas for UI/API, rolling update for celery
- UI and API: 1 → 2 replicas for zero-downtime during restarts/crashes
- Celery worker: Recreate → RollingUpdate strategy
- Celery beat: unchanged (Recreate, singleton scheduler)
- Move f1 from Cloudflare proxied to non-proxied DNS
2026-02-21 17:32:45 +00:00
Viktor Barzin
8e867cfb55
[ci skip] Use versioned image tag for f1-stream to bypass stale cache
Pull-through cache on registry VM served stale arm64-only manifest for
:latest tag. Switch to v1.0.0 tag so cache fetches the fresh amd64 image.
2026-02-21 16:07:58 +00:00
Viktor Barzin
f23d3c220c
[ci skip] Configure f1-stream: WebAuthn, NFS storage, headless browser
- Set WEBAUTHN_RPID/ORIGIN for f1.viktorbarzin.me domain
- Add NFS volume at /mnt/main/f1-stream for persistent session/stream data
- Enable headless browser extraction (HEADLESS_EXTRACT_ENABLED=true)
- Reduce replicas to 1 (file-based sessions don't work across replicas)
2026-02-21 15:57:25 +00:00
Viktor Barzin
691a41ee8b
[ci skip] Fix f1-stream port mismatch: container listens on 8080, not 80
2026-02-21 15:42:47 +00:00
Viktor Barzin
25df219c86
[ci skip] Increase Drone CI namespace resource quota
Double CPU and memory limits to give CI pipelines more headroom.
2026-02-21 14:49:16 +00:00
Viktor Barzin
500c0c2191
[ci skip] Add Kyverno policy to inject ndots:2 on all pods
Reduces NxDomain query flood caused by Kubernetes default ndots:5 search
domain expansion. 78% of DNS queries were wasted NxDomain lookups.
2026-02-20 00:21:03 +00:00
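
Why lowering `ndots` cuts NXDOMAIN traffic can be shown with a small model of resolver search-list expansion. The search list below is an assumption (a typical pod search path plus `viktorbarzin.lan`, which the other DNS commits in this log suggest is the node's search domain), and the function name is hypothetical.

```python
# Assumed search list for a pod in the "default" namespace.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "viktorbarzin.lan",
]


def candidate_queries(name: str, search: list, ndots: int) -> list:
    """Return the names a stub resolver tries, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name as fully qualified
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = [name + "."]
    # With at least `ndots` dots the name is tried as-is first;
    # otherwise every search domain is tried before the absolute name.
    if name.count(".") >= ndots:
        return absolute + expanded
    return expanded + absolute
```

For `api.github.com` (two dots), `ndots:5` sends four search-domain queries that all return NXDOMAIN before the real name is tried, while `ndots:2` tries `api.github.com.` first and stops on success.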
Viktor Barzin
dbab20995b
[ci skip] Add Modal GLM-5 model to OpenClaw, fix streaming and download reliability
- Add modal provider (GLM-5-FP8) as primary model with non-streaming mode
  (GLM-5 uses non-standard reasoning_content field incompatible with streaming)
- Add curl --retry flags to init container downloads for reliability
- Fallback chain: GLM-5 → Gemini 2.5 Flash → Llama 3.3 70B
2026-02-19 23:17:08 +00:00
Viktor Barzin
14296a3966
[ci skip] Rename moltbot to openclaw across Terraform, K8s resources, and DNS
Update terraform version in init container from 1.12.1 to 1.14.5.
2026-02-18 21:53:46 +00:00
Viktor Barzin
1206b3860b
[ci skip] Remove Authentik forward auth from Grafana, add admin password management
Fixes HA mobile app 403 when embedding Grafana dashboards - the webview
blocks third-party cookies needed by Authentik forward auth. Grafana
already has anonymous Viewer access enabled, so forward auth is not
needed. Also adds grafana_admin_password variable and explicit resource
limits to prevent ResourceQuota issues during rolling updates.
2026-02-18 21:40:32 +00:00
Viktor Barzin
bcb1e7f79f
[ci skip] Fix setup script: handle sudo-less environments, add extra scopes
2026-02-17 22:27:03 +00:00
Viktor Barzin
87719a197d
[ci skip] Add one-command setup scripts to k8s-portal
- Add /setup/script?os=mac and /setup/script?os=linux endpoints
- Scripts install kubectl, kubelogin, write kubeconfig, update shell rc
- Unprotected ingress for /setup/script (curl-able without auth)
- Fix kubeconfig to include --oidc-extra-scope for email/profile/groups
2026-02-17 22:22:41 +00:00
Viktor Barzin
f8b07b3bb9
[ci skip] Add anca as namespace-owner for plotting-book
- Add ancaelena98@gmail.com as namespace-owner for plotting-book namespace
- Fix RBAC module: don't create namespaces (they're managed by service modules)
- RoleBinding to built-in admin ClusterRole + cluster-wide read-only access
- ResourceQuota: 2 CPU / 4Gi mem requests, 4 CPU / 8Gi limits, 20 pods
2026-02-17 22:18:37 +00:00
Viktor Barzin
79ce0db11c
[ci skip] Pass skill secrets to moltbot container and fix Python env
- Add skill_secrets variable to moltbot module with HA tokens and
  Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
  in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of ~/.venvs/claude
  venv path that doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
2026-02-17 21:53:32 +00:00
Viktor Barzin
d0b39f1987
[ci skip] Implement multi-user Kubernetes access with OIDC
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
  and namespace-owner roles, API server OIDC flags, and audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
  with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
2026-02-17 21:42:39 +00:00
Viktor Barzin
6a8efa69c4
[ci skip] Import Claude skills into OpenClaw moltbot
- Convert setup-project and extend-vm-storage from standalone .md
  to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
  at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
2026-02-17 21:09:12 +00:00
Viktor Barzin
587b649650
[ci skip] Increase drone namespace memory limits with custom ResourceQuota
2026-02-17 20:40:40 +00:00
Viktor Barzin
6f3395fbf5
[ci skip] Add Smart Home (ha-sofia) section to Cluster Health Overview dashboard
2026-02-17 19:48:02 +00:00
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
a268b9107f
[ci skip] Replace specific CoreDNS catch-all blocks with generic template regex
Single template regex in the viktorbarzin.lan block catches ALL search
domain expansion junk (*.com.viktorbarzin.lan, *.cluster.local.viktorbarzin.lan,
etc.) instead of needing separate server blocks per pattern. Legitimate
single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.
2026-02-16 21:49:03 +00:00
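
The matching rule described in a268b9107f can be approximated with the pattern below. The regex actually used in the Corefile is not shown in the commit, so this one is hypothetical: it treats any name with two or more labels in front of `viktorbarzin.lan` as search-domain expansion junk, which would also catch a legitimate multi-label host if one existed.

```python
import re

# Hypothetical pattern: two or more labels before the zone name means the
# query was produced by search-domain expansion.
JUNK_RE = re.compile(r"^([^.]+\.){2,}viktorbarzin\.lan\.$")


def is_expansion_junk(qname: str) -> bool:
    """True if the fully qualified query name should get an immediate NXDOMAIN."""
    return JUNK_RE.match(qname) is not None
```

Under this pattern `github.com.viktorbarzin.lan.` and `redis.redis.svc.cluster.local.viktorbarzin.lan.` are rejected, while the single-label `idrac.viktorbarzin.lan.` does not match and is forwarded.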
Viktor Barzin
19136c21f1
[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries
Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to
return NXDOMAIN immediately, preventing search domain expansion junk
queries from reaching Technitium. Add trailing dots to Prometheus
scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.
2026-02-16 21:38:38 +00:00
Viktor Barzin
205eb2704b
[ci skip] Fix Technitium DNS client IP logging: bypass Traefik L4 proxy
DNS queries were going through Traefik's IngressRouteUDP, replacing
real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs.
Changed Technitium DNS service from NodePort to LoadBalancer with
externalTrafficPolicy: Local, removed dns-udp entrypoint and
IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan
queries to Technitium's LoadBalancer IP directly.
2026-02-16 21:16:16 +00:00
Viktor Barzin
3d4cdf3203
[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict
- Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were
  OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
  conflict between Kyverno priority injection and default priority: 0
2026-02-16 20:09:53 +00:00
Viktor Barzin
2d015c1cb4
Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip]
2026-02-15 21:51:41 +00:00
Viktor Barzin
a8f42d7fc0
[ci skip] Manage CoreDNS Corefile in Terraform and block junk NxDomain queries
Add kubernetes_config_map for CoreDNS to the technitium module, with a
template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN
immediately. This prevents ndots:5 search domain expansion from flooding
Technitium with ~66k/day junk queries (e.g.
redis.redis.svc.cluster.local.viktorbarzin.lan).

Also enabled saveCache on Technitium so the DNS cache persists across
pod restarts.
2026-02-15 21:51:12 +00:00
Viktor Barzin
a2b44c8ff7
Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip]
2026-02-15 21:24:08 +00:00
Viktor Barzin
f447e45ee1
Add Cluster Health Overview Grafana dashboard [ci skip]
2026-02-15 19:38:28 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
349fffc124
Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip]
- Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance)
- Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead)
- Increase gpu-pod-exporter liveness probe timeout from 1s to 5s
- Add osm-routing NFS exports (osrm-data, otp-data)
2026-02-15 17:20:47 +00:00
Viktor Barzin
36d32b49e7
[ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
7644c419a4
[ci skip] Update Loki dashboard to use correct datasource UID
2026-02-13 23:41:40 +00:00
Viktor Barzin
cd2d13d949
[ci skip] Fix compactor/ruler paths to use writable /var/loki mount
2026-02-13 23:22:13 +00:00
Viktor Barzin
d906513f09
[ci skip] Re-enable lokiCanary (required by Helm chart validation)
2026-02-13 23:18:13 +00:00
Viktor Barzin
a38c3d3dc7
[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy
2026-02-13 23:17:32 +00:00
Viktor Barzin
f013c0a139
[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc
2026-02-13 23:08:44 +00:00
Viktor Barzin
c7236f09f1
[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet
2026-02-13 23:03:40 +00:00
Viktor Barzin
c330648b7b
[ci skip] Deploy MoltBot (OpenClaw) AI agent gateway
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
2026-02-13 22:57:36 +00:00
Viktor Barzin
e0ff08978d
[ci skip] add vibetunnel proxy
2026-02-13 18:20:50 +00:00