Commit graph

1534 commits

Author SHA1 Message Date
Viktor Barzin
4a9fe474c6 [ci skip] Add anti-AI scraping system design doc 2026-02-22 19:37:29 +00:00
Viktor Barzin
5cfe6595cd Apply only platform stack in CI (matches old pipeline scope) 2026-02-22 18:59:02 +00:00
Viktor Barzin
9415823ab8 Use --queue-ignore-errors for CI (infra stack needs Proxmox SSH) 2026-02-22 18:29:27 +00:00
Viktor Barzin
2547a155ed Skip infra stack in CI, remove DRONE_IMAGE_CLONE setting 2026-02-22 18:21:10 +00:00
Viktor Barzin
45b4528f0f Add clone retry logic for intermittent DNS failures 2026-02-22 18:10:31 +00:00
Viktor Barzin
9dd284204e Retry CI - test DNS resolution 2026-02-22 18:07:28 +00:00
Viktor Barzin
45f4459dbc Use manual clone with alpine instead of drone/git (pull-through cache issue) 2026-02-22 18:05:53 +00:00
Viktor Barzin
8df72b545c Test CI with drone/git:linux-amd64 clone image 2026-02-22 18:02:28 +00:00
Viktor Barzin
d2446fb4a3 Test CI pipeline with fixed clone image 2026-02-22 17:54:52 +00:00
Viktor Barzin
97f73eca0d Trigger CI build to test updated Drone pipeline 2026-02-22 17:50:47 +00:00
Viktor Barzin
9ee3140b34 Update Drone CI pipeline for Terragrunt stack architecture
Default pipeline now uses terragrunt run --all to apply all stacks
instead of the broken terraform apply -target=module.kubernetes_cluster.
TLS renewal pipeline stripped of unnecessary Terraform download/init
since renew2.sh is pure shell (certbot + Cloudflare DNS).
2026-02-22 17:47:06 +00:00
Viktor Barzin
35488f4ef6 [ci skip] Fix Drone clone image: use alpine/git via DRONE_IMAGE_CLONE
The drone/git:latest image was failing to pull through the registry
cache (corrupted blobs, unexpected EOF). Set DRONE_IMAGE_CLONE on the
Kubernetes runner to use alpine/git:latest globally for all pipelines.
2026-02-22 17:35:04 +00:00
Viktor Barzin
5501b5cfbf [ci skip] Increase authentik ResourceQuota limits
Authentik is a critical auth service that was at 83% CPU/memory
quota utilization. Double all limits to prevent throttling.
2026-02-22 17:28:41 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
b0499a7f31 [ci skip] Update CLAUDE.md for module colocation
Reflect new directory structure where service modules live inside
their stack directories (stacks/<service>/module/) instead of
modules/kubernetes/<service>/. Update file paths, adding service
instructions, and stack structure documentation.
2026-02-22 14:39:22 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
7ef1a0a8bb [ci skip] Update CLAUDE.md for Terragrunt migration 2026-02-22 14:12:37 +00:00
Viktor Barzin
e2522ad9f1 [ci skip] Fix variable type mismatches in owntracks, ollama, tandoor stacks
- owntracks_credentials: string -> map(string)
- ollama_api_credentials: string -> map(string)
- tandoor_email_password: add default="" (not in tfvars)
2026-02-22 14:07:33 +00:00
Viktor Barzin
945a5f35b0 [ci skip] Fix path.root references for git-crypt key in openclaw and drone
Modules used filebase64("${path.root}/.git/git-crypt/keys/default")
which breaks with Terragrunt since path.root is now stacks/<service>/
instead of repo root. Changed to accept git_crypt_key_base64 variable
and resolve the path in the stack wrapper.
2026-02-22 14:01:02 +00:00
Viktor Barzin
71bfdc8e89 [ci skip] Phase 3: Remove migrated service modules from monolith
All 66 service modules removed from modules/kubernetes/main.tf (now
just a migration notice). The kubernetes_cluster module block removed
from root main.tf. All services now managed via stacks/<service>/.
2026-02-22 13:58:07 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00
Viktor Barzin
39ce2000cf [ci skip] Remove 22 platform services from modules/kubernetes/main.tf
Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium,
headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden,
reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard,
xray, mailserver, cloudflared, infra-maintenance.

Also removed null_resource.core_services and all depends_on references to it
from the remaining ~66 service modules.
2026-02-22 13:40:45 +00:00
Viktor Barzin
7c4d32922a [ci skip] Migrate 22 platform service states to stacks/platform
State migration for all platform services from root state to
state/stacks/platform/terraform.tfstate. Key changes:
- module.kubernetes_cluster.module.X["key"] -> module.X
- Removed null_resource.core_services from root state
- Imported traefik helm_release (was missing from state)
- Fixed helm provider syntax (kubernetes = {} not kubernetes {})
- Added secrets symlink for TLS cert file() resolution
- Platform terragrunt plan: 0 add, 24 change (cosmetic drift), 0 destroy
2026-02-22 13:35:10 +00:00
Viktor Barzin
e2fcf4df45 [ci skip] Add platform stack (core services) for Terragrunt migration
stacks/platform/ contains 22 core/cluster services: metallb, dbaas, redis,
traefik, technitium, headscale, authentik, rbac, k8s-portal, crowdsec,
monitoring, vaultwarden, reverse-proxy, metrics-server, nvidia, kyverno,
uptime-kuma, wireguard, xray, mailserver, cloudflared, infra-maintenance.

Outputs: tls_secret_name, redis_host, postgresql_host/port, mysql_host/port,
smtp_host/port — consumed by downstream service stacks via dependency blocks.
2026-02-22 13:21:09 +00:00
Viktor Barzin
8715477d39 [ci skip] Migrate infra modules (VM templates, docker-registry) to stacks/infra/
Remove k8s-node-template, non-k8s-node-template, docker-registry-template,
and docker-registry-vm modules from root main.tf. These are now managed by
the Terragrunt infra stack at stacks/infra/.

State migration verified: both root and infra stack show "No changes".
2026-02-22 13:11:16 +00:00
Viktor Barzin
f096a889d6 [ci skip] Add infra stack (Proxmox VMs) 2026-02-22 13:04:49 +00:00
Viktor Barzin
f962349465 [ci skip] Add Terragrunt directory skeleton and root config 2026-02-22 13:01:37 +00:00
Viktor Barzin
db659b1f7a [ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
  during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
  kubectl exec, since MetalLB virtual IPs aren't reachable from outside
  the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
  not mounted by kured DaemonSet)
2026-02-22 12:46:12 +00:00
Viktor Barzin
f05bf109c5 [ci skip] Increase Drone CI resource quota to handle concurrent builds
Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from
LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU /
96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU /
128Gi / 60 pods to safely support 5-6 concurrent builds.
2026-02-22 12:28:42 +00:00
Viktor Barzin
00dc78e0d2 [ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
0ff2aaec60 [ci skip] Add native HLS playback for VIPLeague/DaddyLive streams (v1.3.1)
- Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying
  segments with correct Referer/Origin headers (uses ?domain= param)
- Add playerconfig service for detecting stream types (VIPLeague,
  DaddyLive, HLS) and extracting auth params from ksohls pages
- Add VIPLeague URL resolution: extract slug from URL path, match
  against DaddyLive 24/7 channel index with token-based scoring
- Replace Clappr with direct HLS.js player for better compatibility
- Add CryptoJS CDN for DaddyLive auth module support
- Disable CrowdSec on f1-stream ingress to prevent false positives
- Bump image to v1.3.1
2026-02-22 01:30:06 +00:00
Viktor Barzin
a5f9c1595f [ci skip] Fix Python f-string quoting in Slack formatter 2026-02-22 01:23:56 +00:00
Viktor Barzin
b8029351fd [ci skip] Improve Slack message formatting with human-readable names and structured blocks 2026-02-22 01:16:47 +00:00
Viktor Barzin
a90710950b [ci skip] Upgrade cluster-health.sh to full 24-check version with Slack 2026-02-22 01:09:02 +00:00
Viktor Barzin
284fee15ca [ci skip] Fix Slack message formatting to use Block Kit mrkdwn 2026-02-22 01:00:48 +00:00
Viktor Barzin
e59928187b [ci skip] Set CronJob backoffLimit=0 to prevent duplicate Slack alerts 2026-02-22 00:59:34 +00:00
Viktor Barzin
c1ee757c6b [ci skip] Add Terragrunt migration implementation plan 2026-02-22 00:51:00 +00:00
Viktor Barzin
209355d1af [ci skip] Add Terragrunt migration design document 2026-02-22 00:46:57 +00:00
Viktor Barzin
cd0c030a55 [ci skip] Fix CronJob kubectl image tag to :latest 2026-02-22 00:38:33 +00:00
Viktor Barzin
f79e84c693 [ci skip] Add cluster health check CronJob to OpenClaw module 2026-02-22 00:08:51 +00:00
Viktor Barzin
8b5b389f31 [ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00
Viktor Barzin
9233276f62 [ci skip] Add cluster health check script for OpenClaw agent 2026-02-22 00:00:47 +00:00
Viktor Barzin
b925f9caf7 [ci skip] Add Slack webhook env var to OpenClaw deployment 2026-02-21 23:57:34 +00:00
Viktor Barzin
98b711ff8d [ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
f41e2ca969 [ci skip] Add OpenClaw cluster health agent implementation plan 2026-02-21 23:48:36 +00:00
Viktor Barzin
51cb045f12 [ci skip] Add OpenClaw cluster management agent design doc 2026-02-21 23:45:30 +00:00
Viktor Barzin
846eb3bd24 [ci skip] Add custom resource quota for authentik namespace
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
2026-02-21 23:44:05 +00:00
Viktor Barzin
d345841ef2 [ci skip] Add tier labels to all namespace resources for Kyverno resource governance
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.

Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
2026-02-21 23:38:05 +00:00
Viktor Barzin
517f5d6a6c [ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00