1297 commits

Author SHA1 Message Date
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
cd0c030a55 [ci skip] Fix CronJob kubectl image tag to :latest 2026-02-22 00:38:33 +00:00
Viktor Barzin
f79e84c693 [ci skip] Add cluster health check CronJob to OpenClaw module 2026-02-22 00:08:51 +00:00
Viktor Barzin
8b5b389f31 [ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00
Viktor Barzin
9233276f62 [ci skip] Add cluster health check script for OpenClaw agent 2026-02-22 00:00:47 +00:00
Viktor Barzin
b925f9caf7 [ci skip] Add Slack webhook env var to OpenClaw deployment 2026-02-21 23:57:34 +00:00
Viktor Barzin
98b711ff8d [ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
846eb3bd24 [ci skip] Add custom resource quota for authentik namespace
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
2026-02-21 23:44:05 +00:00
Viktor Barzin
d345841ef2 [ci skip] Add tier labels to all namespace resources for Kyverno resource governance
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.

Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
2026-02-21 23:38:05 +00:00
Viktor Barzin
517f5d6a6c [ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
Viktor Barzin
ce31571a9f [ci skip] Fix JS shim rw() routing non-proxy paths through proxy prefix
When upstream JS constructs URLs via location.origin + '/path', the rw()
function stripped the origin but returned bare '/path' which hit our
server's HTML index. Now correctly prefixes with /proxy/{b64origin} so
XHR/fetch requests for scripts reach the upstream via proxy.
Bump image to v1.2.7
2026-02-21 23:16:09 +00:00
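The routing fix described in ce31571a9f can be sketched as follows. This is a Python illustration of the logic only; the real shim is JavaScript. The URL-safe base64 encoding, the `proxy_prefix` helper, the parameter names, and the example hostnames are assumptions, not taken from the commit.

```python
import base64


def proxy_prefix(upstream_origin: str) -> str:
    """Build the /proxy/{b64origin} prefix (the encoding variant is assumed)."""
    token = base64.urlsafe_b64encode(upstream_origin.encode()).decode()
    return f"/proxy/{token}"


def rw(url: str, server_origin: str, upstream_origin: str) -> str:
    """Rewrite a URL so that it is fetched through the proxy.

    server_origin is the origin the proxied page is served from, which is
    what location.origin evaluates to inside the iframe.
    """
    prefix = proxy_prefix(upstream_origin)
    # Upstream JS built the URL as location.origin + '/path': drop our origin.
    if url.startswith(server_origin + "/"):
        url = url[len(server_origin):]
    # Already routed through the proxy: leave unchanged.
    if url == prefix or url.startswith(prefix + "/"):
        return url
    # Bare absolute path: previously returned as-is and served our HTML index.
    if url.startswith("/") and not url.startswith("//"):
        return prefix + url
    # Anything else is left untouched in this simplified sketch.
    return url
```

With a hypothetical server origin `https://f1.example.test` and upstream `https://upstream.example.test`, a request for `https://f1.example.test/player.js` is rewritten to the proxy prefix followed by `/player.js` instead of the bare `/player.js`.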
Viktor Barzin
8562ed1b8f [ci skip] Fix video playback and comprehensive anti-debug neutralization
Video:
- Add allow="autoplay; encrypted-media; fullscreen" to iframe for media playback

Anti-debug:
- Strip ad/popup scripts (acscdn, popunder) and context menu blockers from HTML
- Strip debugger statements from inline HTML scripts and proxied JS responses
- Intercept setTimeout (not just setInterval) for debugger-based detection
- Override eval() and Function() constructor to strip debugger statements
- Bump image to v1.2.6
2026-02-21 23:12:11 +00:00
Viktor Barzin
642e774b62 [ci skip] Fix Kyverno priority injection to remove default priority/preemptionPolicy
The priority injection policy was setting priorityClassName on pods, but
Kubernetes had already defaulted priority=0 and preemptionPolicy=PreemptLowerPriority
on those pods, causing the admission controller to reject the mismatch.

Switch from patchStrategicMerge to patchesJson6902 to explicitly remove
the priority and preemptionPolicy fields before setting priorityClassName.
2026-02-21 23:11:35 +00:00
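The effect of the JSON 6902 patch in 642e774b62 can be illustrated with a minimal Python sketch. It applies `remove` and `add` operations to a plain dictionary; the `apply_patch` helper is hypothetical, handles only object keys (no arrays, no `~` escapes), and the pod shown is a made-up example using the tier-3-edge class name that appears elsewhere in this log.

```python
import copy


def apply_patch(doc: dict, ops: list) -> dict:
    """Apply a tiny subset of RFC 6902 (remove/add on object keys only)."""
    doc = copy.deepcopy(doc)
    for op in ops:
        *parents, leaf = op["path"].lstrip("/").split("/")
        target = doc
        for key in parents:
            target = target[key]
        if op["op"] == "remove":
            del target[leaf]
        elif op["op"] == "add":
            target[leaf] = op["value"]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return doc


# Hypothetical pod as it looks after Kubernetes has applied its defaults.
pod = {"spec": {"priority": 0,
                "preemptionPolicy": "PreemptLowerPriority",
                "containers": []}}

# Remove the defaulted fields first, then set the class name.
ops = [
    {"op": "remove", "path": "/spec/priority"},
    {"op": "remove", "path": "/spec/preemptionPolicy"},
    {"op": "add", "path": "/spec/priorityClassName", "value": "tier-3-edge"},
]

patched = apply_patch(pod, ops)
```

After the patch, the pod spec no longer carries the stale `priority` and `preemptionPolicy` values, so nothing conflicts with the priority class when it is resolved.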
Viktor Barzin
fc0e1c3c6e [ci skip] Fix narrow iframe content and strip anti-debug scripts in proxy
- Remove flex centering from browser-viewer-content; use absolute positioning
  for iframe to fill the entire container
- Strip disable-devtool and devtools-detect script tags from proxied HTML
- Add JS shim hooks to neutralize setInterval-based debugger traps and block
  loading of anti-debug scripts via setAttribute
- Bump image to v1.2.5
2026-02-21 21:32:39 +00:00
Viktor Barzin
0c2c48802f [ci skip] Sandbox proxy iframe to prevent frame-busting
Add sandbox attribute to prevent proxied pages from navigating
top.location or replacing the parent page body. Allows scripts,
same-origin, forms, popups, and presentation but blocks
top-navigation.
2026-02-21 21:25:51 +00:00
Viktor Barzin
7a444b43fa [ci skip] Add reverse proxy mode to f1-stream
Replace CPU-intensive headless Chrome + WebRTC pipeline with a
lightweight Go reverse proxy that strips anti-framing headers
(X-Frame-Options, CSP) and embeds streaming sites in iframes.

- New internal/proxy package with URL rewriting for HTML/CSS
- JS shim injection to intercept fetch/XHR/WebSocket/createElement
- Referer reconstruction for correct cross-origin auth (HLS streams)
- Inline iframe viewer preserving site navigation (not fullscreen overlay)
2026-02-21 21:23:21 +00:00
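The header stripping mentioned in 7a444b43fa might look roughly like the sketch below. The actual proxy is written in Go; this Python version only shows the filtering idea. Dropping the whole Content-Security-Policy header (rather than editing only its frame-ancestors directive) is an assumption, as are the function and constant names.

```python
# Response headers that stop a page from being embedded in an iframe.
ANTI_FRAMING_HEADERS = {"x-frame-options", "content-security-policy"}


def strip_anti_framing(headers):
    """Return the header list without anti-framing headers.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return [(name, value) for name, value in headers
            if name.lower() not in ANTI_FRAMING_HEADERS]


upstream_headers = [
    ("Content-Type", "text/html"),
    ("X-Frame-Options", "DENY"),
    ("content-security-policy", "frame-ancestors 'none'"),
]
forwarded = strip_anti_framing(upstream_headers)
```

Only `Content-Type` survives in `forwarded`, so the browser has no instruction left that forbids framing the proxied page.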
Viktor Barzin
2446fec1f6 [ci skip] Fix whiteboard priority class mismatch and OnlyOffice OOMKill
- Add priority_class_name to nextcloud whiteboard deployment to match
  Kyverno-injected tier-3-edge priority class
- Add explicit resource limits (4Gi memory) for OnlyOffice document
  server to prevent OOMKill during font generation
2026-02-21 21:22:03 +00:00
Viktor Barzin
26ba9ea371 [ci skip] Fix Prometheus storage alert and Grafana quota exhaustion
- Enable size-based TSDB retention (45GB) to clean up old blocks
  (including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
  CPU/memory limits to allow Grafana rolling updates
2026-02-21 21:04:08 +00:00
Viktor Barzin
dcce738641 [ci skip] Bump inotify max_user_instances from 512 to 8192
Fixes "failed to create fsnotify watcher: too many open files" in Drone
CI builds where vitest exhausts the default inotify instance limit.
2026-02-21 20:21:04 +00:00
Viktor Barzin
038d4434c4 [ci skip] Fix health check false positives for completed CronJob pods 2026-02-21 19:56:39 +00:00
Viktor Barzin
de9c0869ba [ci skip] Fix CrowdSec pods failing due to priority class mismatch
Kyverno injects priorityClassName tier-1-cluster on pods in the crowdsec
namespace, but pods had no explicit priorityClassName set, defaulting
priority to 0. Admission controller rejected the mismatch (0 vs 800000).

Set priorityClassName on LAPI, agent (Helm values) and crowdsec-web
(Terraform deployment).
2026-02-21 19:18:15 +00:00
Viktor Barzin
fd6f9166a9 [ci skip] Add GitHub & Drone CI API access documentation 2026-02-21 19:14:41 +00:00
Viktor Barzin
a9e5320427 [ci skip] Disable grampsweb service and remove family DNS record 2026-02-21 18:55:54 +00:00
Viktor Barzin
9b2ec7716e [ci skip] Add skills: pfsense-nat-rule-creation, coturn-k8s-without-hostnetwork 2026-02-21 18:29:32 +00:00
Viktor Barzin
de1a43a3c7 [ci skip] Add coturn TURN/STUN server for WebRTC relay
- Deploy coturn on k8s with MetalLB shared IP (10.0.20.200)
- Normal pod networking (no hostNetwork), runs on any node
- 100 relay ports (49152-49252), port 3478 for STUN/TURN signaling
- Shared secret auth for time-limited TURN credentials
- For F1 streaming WebRTC NAT traversal
2026-02-21 18:08:01 +00:00
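The commit does not say how the time-limited credentials are derived. A minimal sketch, assuming coturn's `use-auth-secret` mode (the TURN REST API scheme), where the username is an expiry timestamp plus a user ID and the password is the base64-encoded HMAC-SHA1 of that username under the shared secret. The function name, parameters, and sample values are illustrative.

```python
import base64
import hashlib
import hmac
import time


def turn_credentials(shared_secret, user_id, ttl_seconds=3600, now=None):
    """Derive a (username, password) pair valid for ttl_seconds."""
    issued_at = time.time() if now is None else now
    expiry = int(issued_at + ttl_seconds)
    # The username carries the expiry, so the server can reject stale logins.
    username = f"{expiry}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(),
                      hashlib.sha1).digest()
    return username, base64.b64encode(digest).decode()


# Placeholder secret and user; a fixed "now" keeps the example repeatable.
username, password = turn_credentials("example-secret", "alice", now=1_000_000)
```

The server recomputes the same HMAC from the username it receives, so no per-user password has to be stored, and a credential stops working once its embedded timestamp has passed.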
Viktor Barzin
5fe288a4e4 [ci skip] Real estate crawler: 2 replicas for UI/API, rolling update for celery
- UI and API: 1 → 2 replicas for zero-downtime during restarts/crashes
- Celery worker: Recreate → RollingUpdate strategy
- Celery beat: unchanged (Recreate, singleton scheduler)
- Move f1 from Cloudflare proxied to non-proxied DNS
2026-02-21 17:32:45 +00:00
Viktor Barzin
2298459496 [ci skip] Use versioned image tag for f1-stream to bypass stale cache
Pull-through cache on registry VM served stale arm64-only manifest for
:latest tag. Switch to v1.0.0 tag so cache fetches the fresh amd64 image.
2026-02-21 16:07:58 +00:00
Viktor Barzin
2fe7fa547c [ci skip] Configure f1-stream: WebAuthn, NFS storage, headless browser
- Set WEBAUTHN_RPID/ORIGIN for f1.viktorbarzin.me domain
- Add NFS volume at /mnt/main/f1-stream for persistent session/stream data
- Enable headless browser extraction (HEADLESS_EXTRACT_ENABLED=true)
- Reduce replicas to 1 (file-based sessions don't work across replicas)
2026-02-21 15:57:25 +00:00
Viktor Barzin
a5e0b19a3a [ci skip] Fix f1-stream port mismatch: container listens on 8080, not 80 2026-02-21 15:42:47 +00:00
Viktor Barzin
8756bcfb9a [ci skip] Increase Drone CI namespace resource quota
Double CPU and memory limits to give CI pipelines more headroom.
2026-02-21 14:49:16 +00:00
Viktor Barzin
f3361e3a47 [ci skip] Add Music Assistant librespot stale credentials skill
New skill: music-assistant-librespot-wrong-account
- Documents fix for Spotify playback failing with "librespot does not support
  free accounts" when cached credentials point to wrong Spotify account
- Includes step-by-step solution: find container, inspect cache, clear and restart

Updated: home-assistant skill with Music Assistant addon details for ha-sofia
2026-02-21 11:23:24 +00:00
Viktor Barzin
144e9b3e39 [ci skip] Add Kyverno policy to inject ndots:2 on all pods
Reduces the NXDOMAIN query flood caused by the Kubernetes default ndots:5 search
domain expansion. 78% of DNS queries were wasted NXDOMAIN lookups.
2026-02-20 00:21:03 +00:00
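Why ndots matters can be shown with a short Python sketch of the lookup order a stub resolver uses. The `candidate_queries` helper is hypothetical, and the search list is a typical one for a pod in the `default` namespace, not copied from this cluster.

```python
def candidate_queries(name, search_domains, ndots):
    """Return the names a resolver would try, in order.

    A name with fewer than ndots dots is tried against every search
    domain before it is tried as an absolute name.
    """
    if name.endswith("."):
        return [name]  # already fully qualified, no expansion
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

with_ndots_5 = candidate_queries("api.github.com", SEARCH, ndots=5)
with_ndots_2 = candidate_queries("api.github.com", SEARCH, ndots=2)
```

With ndots:5, the external name `api.github.com` (two dots) goes through all three cluster search domains first, each of which fails, before the absolute name is tried. With ndots:2 the absolute name is tried first.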
Viktor Barzin
9d7d63b970 [ci skip] Add ground rules: no secrets, CI/CD required, monitoring required 2026-02-19 23:48:44 +00:00
Viktor Barzin
5df615c31d [ci skip] Add Modal GLM-5 model to OpenClaw, fix streaming and download reliability
- Add modal provider (GLM-5-FP8) as primary model with non-streaming mode
  (GLM-5 uses non-standard reasoning_content field incompatible with streaming)
- Add curl --retry flags to init container downloads for reliability
- Fallback chain: GLM-5 → Gemini 2.5 Flash → Llama 3.3 70B
2026-02-19 23:17:08 +00:00
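The fallback chain listed in 5df615c31d follows a common pattern that can be sketched in Python. This is not OpenClaw's configuration or code; the `complete_with_fallback` helper and the stand-in provider functions are hypothetical and only show the ordering behaviour.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def failing_provider(prompt):
    """Stand-in for a provider that is currently unavailable."""
    raise TimeoutError("upstream timed out")


def echo_provider(prompt):
    """Stand-in for a provider that answers normally."""
    return f"echo: {prompt}"


chain = [
    ("GLM-5", failing_provider),
    ("Gemini 2.5 Flash", echo_provider),
    ("Llama 3.3 70B", echo_provider),
]
used, reply = complete_with_fallback("ping", chain)
```

Here the first entry fails, so the second one answers and the third is never called.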
Viktor Barzin
71d6590939 [ci skip] Update knowledge base: add OpenClaw service, rename moltbot references 2026-02-18 22:39:58 +00:00
Viktor Barzin
843b9658d5 [ci skip] Rename moltbot to openclaw across Terraform, K8s resources, and DNS
Update terraform version in init container from 1.12.1 to 1.14.5.
2026-02-18 21:53:46 +00:00
Viktor Barzin
9889728c49 [ci skip] Remove Authentik forward auth from Grafana, add admin password management
Fixes HA mobile app 403 when embedding Grafana dashboards - the webview
blocks third-party cookies needed by Authentik forward auth. Grafana
already has anonymous Viewer access enabled, so forward auth is not
needed. Also adds grafana_admin_password variable and explicit resource
limits to prevent ResourceQuota issues during rolling updates.
2026-02-18 21:40:32 +00:00
Viktor Barzin
41d3358cc1 [ci skip] Add skills: authentik-oidc-kubernetes, kubelet-static-pod-manifest-update
Two skills extracted from multi-user k8s access implementation:
- authentik-oidc-kubernetes: 6 gotchas for Authentik OIDC + kube-apiserver
- kubelet-static-pod-manifest-update: full restart cycle for static pod changes
2026-02-17 22:56:03 +00:00
Viktor Barzin
7e73965bdd [ci skip] Add Authentik management skill for API-based identity provider control 2026-02-17 22:55:41 +00:00
Viktor Barzin
6580c00979 [ci skip] Fix setup script: handle sudo-less environments, add extra scopes 2026-02-17 22:27:03 +00:00
Viktor Barzin
4366a8b413 [ci skip] Add one-command setup scripts to k8s-portal
- Add /setup/script?os=mac and /setup/script?os=linux endpoints
- Scripts install kubectl, kubelogin, write kubeconfig, update shell rc
- Unprotected ingress for /setup/script (curl-able without auth)
- Fix kubeconfig to include --oidc-extra-scope for email/profile/groups
2026-02-17 22:22:41 +00:00
Viktor Barzin
9dad07618d [ci skip] Add anca as namespace-owner for plotting-book
- Add ancaelena98@gmail.com as namespace-owner for plotting-book namespace
- Fix RBAC module: don't create namespaces (they're managed by service modules)
- RoleBinding to built-in admin ClusterRole + cluster-wide read-only access
- ResourceQuota: 2 CPU / 4Gi mem requests, 4 CPU / 8Gi limits, 20 pods
2026-02-17 22:18:37 +00:00
Viktor Barzin
aa433d0750 [ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes 2026-02-17 22:16:46 +00:00
Viktor Barzin
c3840574a8 [ci skip] Update Authentik API token reference to terraform.tfvars 2026-02-17 22:03:55 +00:00
Viktor Barzin
7e3286e572 [ci skip] Pass skill secrets to moltbot container and fix Python env
- Add skill_secrets variable to moltbot module with HA tokens and
  Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
  in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of ~/.venvs/claude
  venv path that doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
2026-02-17 21:53:32 +00:00
Viktor Barzin
9bcdb9e59f [ci skip] Implement multi-user Kubernetes access with OIDC
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
  and namespace-owner roles, API server OIDC flags, and audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
  with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
2026-02-17 21:42:39 +00:00
Viktor Barzin
9853b5edf7 [ci skip] Add Authentik API management knowledge 2026-02-17 21:10:40 +00:00