Commit graph

1321 commits

Author SHA1 Message Date
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
b0499a7f31 [ci skip] Update CLAUDE.md for module colocation
Reflect new directory structure where service modules live inside
their stack directories (stacks/<service>/module/) instead of
modules/kubernetes/<service>/. Update file paths, adding service
instructions, and stack structure documentation.
2026-02-22 14:39:22 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
7ef1a0a8bb [ci skip] Update CLAUDE.md for Terragrunt migration 2026-02-22 14:12:37 +00:00
Viktor Barzin
e2522ad9f1 [ci skip] Fix variable type mismatches in owntracks, ollama, tandoor stacks
- owntracks_credentials: string -> map(string)
- ollama_api_credentials: string -> map(string)
- tandoor_email_password: add default="" (not in tfvars)
2026-02-22 14:07:33 +00:00
Viktor Barzin
945a5f35b0 [ci skip] Fix path.root references for git-crypt key in openclaw and drone
Modules used filebase64("${path.root}/.git/git-crypt/keys/default")
which breaks with Terragrunt since path.root is now stacks/<service>/
instead of repo root. Changed to accept git_crypt_key_base64 variable
and resolve the path in the stack wrapper.
2026-02-22 14:01:02 +00:00
Viktor Barzin
71bfdc8e89 [ci skip] Phase 3: Remove migrated service modules from monolith
All 66 service modules removed from modules/kubernetes/main.tf (now
just a migration notice). The kubernetes_cluster module block removed
from root main.tf. All services now managed via stacks/<service>/.
2026-02-22 13:58:07 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00
Viktor Barzin
39ce2000cf [ci skip] Remove 22 platform services from modules/kubernetes/main.tf
Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium,
headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden,
reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard,
xray, mailserver, cloudflared, infra-maintenance.

Also removed null_resource.core_services and all depends_on references to it
from the remaining ~66 service modules.
2026-02-22 13:40:45 +00:00
Viktor Barzin
7c4d32922a [ci skip] Migrate 22 platform service states to stacks/platform
State migration for all platform services from root state to
state/stacks/platform/terraform.tfstate. Key changes:
- module.kubernetes_cluster.module.X["key"] -> module.X
- Removed null_resource.core_services from root state
- Imported traefik helm_release (was missing from state)
- Fixed helm provider syntax (kubernetes = {} not kubernetes {})
- Added secrets symlink for TLS cert file() resolution
- Platform terragrunt plan: 0 add, 24 change (cosmetic drift), 0 destroy
2026-02-22 13:35:10 +00:00
Viktor Barzin
e2fcf4df45 [ci skip] Add platform stack (core services) for Terragrunt migration
stacks/platform/ contains 22 core/cluster services: metallb, dbaas, redis,
traefik, technitium, headscale, authentik, rbac, k8s-portal, crowdsec,
monitoring, vaultwarden, reverse-proxy, metrics-server, nvidia, kyverno,
uptime-kuma, wireguard, xray, mailserver, cloudflared, infra-maintenance.

Outputs: tls_secret_name, redis_host, postgresql_host/port, mysql_host/port,
smtp_host/port — consumed by downstream service stacks via dependency blocks.
2026-02-22 13:21:09 +00:00
Viktor Barzin
8715477d39 [ci skip] Migrate infra modules (VM templates, docker-registry) to stacks/infra/
Remove k8s-node-template, non-k8s-node-template, docker-registry-template,
and docker-registry-vm modules from root main.tf. These are now managed by
the Terragrunt infra stack at stacks/infra/.

State migration verified: both root and infra stack show "No changes".
2026-02-22 13:11:16 +00:00
Viktor Barzin
f096a889d6 [ci skip] Add infra stack (Proxmox VMs) 2026-02-22 13:04:49 +00:00
Viktor Barzin
f962349465 [ci skip] Add Terragrunt directory skeleton and root config 2026-02-22 13:01:37 +00:00
Viktor Barzin
db659b1f7a [ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
  during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
  kubectl exec, since MetalLB virtual IPs aren't reachable from outside
  the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
  not mounted by kured DaemonSet)
2026-02-22 12:46:12 +00:00
Viktor Barzin
f05bf109c5 [ci skip] Increase Drone CI resource quota to handle concurrent builds
Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from
LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU /
96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU /
128Gi / 60 pods to safely support 5-6 concurrent builds.
2026-02-22 12:28:42 +00:00
Viktor Barzin
00dc78e0d2 [ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
0ff2aaec60 [ci skip] Add native HLS playback for VIPLeague/DaddyLive streams (v1.3.1)
- Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying
  segments with correct Referer/Origin headers (uses ?domain= param)
- Add playerconfig service for detecting stream types (VIPLeague,
  DaddyLive, HLS) and extracting auth params from ksohls pages
- Add VIPLeague URL resolution: extract slug from URL path, match
  against DaddyLive 24/7 channel index with token-based scoring
- Replace Clappr with direct HLS.js player for better compatibility
- Add CryptoJS CDN for DaddyLive auth module support
- Disable CrowdSec on f1-stream ingress to prevent false positives
- Bump image to v1.3.1
2026-02-22 01:30:06 +00:00
Viktor Barzin
a5f9c1595f [ci skip] Fix Python f-string quoting in Slack formatter 2026-02-22 01:23:56 +00:00
Viktor Barzin
b8029351fd [ci skip] Improve Slack message formatting with human-readable names and structured blocks 2026-02-22 01:16:47 +00:00
Viktor Barzin
a90710950b [ci skip] Upgrade cluster-health.sh to full 24-check version with Slack 2026-02-22 01:09:02 +00:00
Viktor Barzin
284fee15ca [ci skip] Fix Slack message formatting to use Block Kit mrkdwn 2026-02-22 01:00:48 +00:00
Viktor Barzin
e59928187b [ci skip] Set CronJob backoffLimit=0 to prevent duplicate Slack alerts 2026-02-22 00:59:34 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
cd0c030a55 [ci skip] Fix CronJob kubectl image tag to :latest 2026-02-22 00:38:33 +00:00
Viktor Barzin
f79e84c693 [ci skip] Add cluster health check CronJob to OpenClaw module 2026-02-22 00:08:51 +00:00
Viktor Barzin
8b5b389f31 [ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00
Viktor Barzin
9233276f62 [ci skip] Add cluster health check script for OpenClaw agent 2026-02-22 00:00:47 +00:00
Viktor Barzin
b925f9caf7 [ci skip] Add Slack webhook env var to OpenClaw deployment 2026-02-21 23:57:34 +00:00
Viktor Barzin
98b711ff8d [ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
846eb3bd24 [ci skip] Add custom resource quota for authentik namespace
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
2026-02-21 23:44:05 +00:00
Viktor Barzin
d345841ef2 [ci skip] Add tier labels to all namespace resources for Kyverno resource governance
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.

Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
2026-02-21 23:38:05 +00:00
Viktor Barzin
517f5d6a6c [ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
Viktor Barzin
ce31571a9f [ci skip] Fix JS shim rw() routing non-proxy paths through proxy prefix
When upstream JS constructs URLs via location.origin + '/path', the rw()
function stripped the origin but returned bare '/path' which hit our
server's HTML index. Now correctly prefixes with /proxy/{b64origin} so
XHR/fetch requests for scripts reach the upstream via proxy.
Bump image to v1.2.7
2026-02-21 23:16:09 +00:00
Viktor Barzin
8562ed1b8f [ci skip] Fix video playback and comprehensive anti-debug neutralization
Video:
- Add allow="autoplay; encrypted-media; fullscreen" to iframe for media playback

Anti-debug:
- Strip ad/popup scripts (acscdn, popunder) and context menu blockers from HTML
- Strip debugger statements from inline HTML scripts and proxied JS responses
- Intercept setTimeout (not just setInterval) for debugger-based detection
- Override eval() and Function() constructor to strip debugger statements
- Bump image to v1.2.6
2026-02-21 23:12:11 +00:00
Viktor Barzin
642e774b62 [ci skip] Fix Kyverno priority injection to remove default priority/preemptionPolicy
The priority injection policy was setting priorityClassName on pods but
Kubernetes had already defaulted priority=0 and preemptionPolicy=PreemptLowerPriority
on those pods, causing admission controller to reject the mismatch.

Switch from patchStrategicMerge to patchesJson6902 to explicitly remove
the priority and preemptionPolicy fields before setting priorityClassName.
2026-02-21 23:11:35 +00:00
Viktor Barzin
fc0e1c3c6e [ci skip] Fix narrow iframe content and strip anti-debug scripts in proxy
- Remove flex centering from browser-viewer-content; use absolute positioning
  for iframe to fill the entire container
- Strip disable-devtool and devtools-detect script tags from proxied HTML
- Add JS shim hooks to neutralize setInterval-based debugger traps and block
  loading of anti-debug scripts via setAttribute
- Bump image to v1.2.5
2026-02-21 21:32:39 +00:00
Viktor Barzin
0c2c48802f [ci skip] Sandbox proxy iframe to prevent frame-busting
Add sandbox attribute to prevent proxied pages from navigating
top.location or replacing the parent page body. Allows scripts,
same-origin, forms, popups, and presentation but blocks
top-navigation.
2026-02-21 21:25:51 +00:00
Viktor Barzin
7a444b43fa [ci skip] Add reverse proxy mode to f1-stream
Replace CPU-intensive headless Chrome + WebRTC pipeline with a
lightweight Go reverse proxy that strips anti-framing headers
(X-Frame-Options, CSP) and embeds streaming sites in iframes.

- New internal/proxy package with URL rewriting for HTML/CSS
- JS shim injection to intercept fetch/XHR/WebSocket/createElement
- Referer reconstruction for correct cross-origin auth (HLS streams)
- Inline iframe viewer preserving site navigation (not fullscreen overlay)
2026-02-21 21:23:21 +00:00
Viktor Barzin
2446fec1f6 [ci skip] Fix whiteboard priority class mismatch and OnlyOffice OOMKill
- Add priority_class_name to nextcloud whiteboard deployment to match
  Kyverno-injected tier-3-edge priority class
- Add explicit resource limits (4Gi memory) for OnlyOffice document
  server to prevent OOMKill during font generation
2026-02-21 21:22:03 +00:00
Viktor Barzin
26ba9ea371 [ci skip] Fix Prometheus storage alert and Grafana quota exhaustion
- Enable size-based TSDB retention (45GB) to clean up old blocks
  (including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
  CPU/memory limits to allow Grafana rolling updates
2026-02-21 21:04:08 +00:00
Viktor Barzin
dcce738641 [ci skip] Bump inotify max_user_instances from 512 to 8192
Fixes "failed to create fsnotify watcher: too many open files" in Drone
CI builds where vitest exhausts the default inotify instance limit.
2026-02-21 20:21:04 +00:00
Viktor Barzin
038d4434c4 [ci skip] Fix health check false positives for completed CronJob pods 2026-02-21 19:56:39 +00:00
Viktor Barzin
de9c0869ba [ci skip] Fix CrowdSec pods failing due to priority class mismatch
Kyverno injects priorityClassName tier-1-cluster on pods in the crowdsec
namespace, but pods had no explicit priorityClassName set, defaulting
priority to 0. Admission controller rejected the mismatch (0 vs 800000).

Set priorityClassName on LAPI, agent (Helm values) and crowdsec-web
(Terraform deployment).
2026-02-21 19:18:15 +00:00
Viktor Barzin
fd6f9166a9 [ci skip] Add GitHub & Drone CI API access documentation 2026-02-21 19:14:41 +00:00
Viktor Barzin
a9e5320427 [ci skip] Disable grampsweb service and remove family DNS record 2026-02-21 18:55:54 +00:00