Commit graph

1087 commits

Author SHA1 Message Date
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the search-domain expansion failed in the
  CronJob pod's DNS config. Verified working: CronJob now logs
  `Loaded N external monitor targets (source=k8s-api)`.

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \\
    https://budget-viktor.viktorbarzin.me/
200 302 200

\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
\`\`\`

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   \`\`\`
   kubectl -n dawarich get ingress dawarich -o \\
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   \`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   \`\`\`
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   \`\`\`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
c2f9ca0d13 modules: improve create-vm with additional config options and cloud-init updates 2026-04-06 11:57:55 +03:00
Viktor Barzin
d1059d6017 registry: set proxy TTL to 0 to prevent stale :latest images
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
2026-03-30 00:02:48 +03:00
Viktor Barzin
28587c674d fix-broken-blobs: use argparse for proper flag handling
--dry-run as first arg was being parsed as the BASE directory path.
2026-03-29 22:33:33 +03:00
Viktor Barzin
dd461beb33 add registry blob integrity checker to self-heal corrupted cache
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.

fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.

Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
2026-03-29 22:31:39 +03:00
Viktor Barzin
facf959ecf fix registry healthchecks: use 127.0.0.1 instead of localhost
localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4
only), causing wget to fail with "Connection refused". The nginx
proxy had 18,462 consecutive health check failures because of this.

Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
2026-03-29 22:29:27 +03:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
8c6f238697 add default Homepage annotations to ingress_factory for auto-discovery
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
  (name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
2026-03-25 11:00:38 +02:00
Viktor Barzin
2dcb4b7fa4 fix(renew-tls): clean stale _acme-challenge TXT records before certbot
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
2026-03-23 22:32:27 +02:00
Viktor Barzin
3f0ecda737 harden pull-through cache: intercept errors, reduce lock timeout, add healthz
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
  to prevent caching truncated upstream responses (root cause of repeated
  ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
  retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
2026-03-23 11:33:06 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
2026-03-22 22:10:10 +02:00
Viktor Barzin
250a058627 feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
67d1ce453c add /sentinel dir to cloud-init for kured reboot gating
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
2026-03-19 19:57:27 +00:00
Viktor Barzin
f8a36f0621 fix pull-through cache: remove maxsize, harden nginx caching [ci skip]
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.

Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
  blobs on node3/node4
2026-03-16 07:41:11 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
7e72a10848 exclude manifest requests from nginx registry cache
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
2026-03-14 23:42:17 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
1b78e44ab6 [ci skip] fix: add mount_options to nfs_volume PV spec
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
2026-03-02 20:22:47 +00:00
Viktor Barzin
c702fd2565 [ci skip] add NFS CSI driver + nfs_volume shared module
- Deploy csi-driver-nfs Helm chart as platform module (nfs-csi)
- Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options
- Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)
2026-03-01 23:38:58 +00:00
Viktor Barzin
7ff3c61bd7 [ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain 2026-03-01 14:35:53 +00:00
Viktor Barzin
946b5b1745 [ci skip] add qemu-guest-agent to VM templates and enable agent by default 2026-03-01 01:58:46 +00:00
Viktor Barzin
09a810f8fb [ci skip] fix: use $http_host in nginx to preserve port in registry redirects 2026-02-28 20:16:03 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
925dbe39c1 [ci skip] add registry-private service to Docker Compose stack 2026-02-28 17:57:04 +00:00
Viktor Barzin
64c55a6710 [ci skip] add nginx upstream and server block for private registry on port 5050 2026-02-28 17:57:03 +00:00
Viktor Barzin
2102ffdb8b [ci skip] add private R/W registry config for CI build caching 2026-02-28 17:56:50 +00:00
Viktor Barzin
865b68ce77 [ci skip] Rebuild docker-registry with nginx serialization on all ports
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
2026-02-22 21:45:53 +00:00
Viktor Barzin
006f95337e [ci skip] Add anti_ai_scraping option to ingress_factory (default: true) 2026-02-22 19:50:07 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
945a5f35b0 [ci skip] Fix path.root references for git-crypt key in openclaw and drone
Modules used filebase64("${path.root}/.git/git-crypt/keys/default")
which breaks with Terragrunt since path.root is now stacks/<service>/
instead of repo root. Changed to accept git_crypt_key_base64 variable
and resolve the path in the stack wrapper.
2026-02-22 14:01:02 +00:00
Viktor Barzin
71bfdc8e89 [ci skip] Phase 3: Remove migrated service modules from monolith
All 66 service modules removed from modules/kubernetes/main.tf (now
just a migration notice). The kubernetes_cluster module block removed
from root main.tf. All services now managed via stacks/<service>/.
2026-02-22 13:58:07 +00:00
Viktor Barzin
39ce2000cf [ci skip] Remove 22 platform services from modules/kubernetes/main.tf
Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium,
headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden,
reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard,
xray, mailserver, cloudflared, infra-maintenance.

Also removed null_resource.core_services and all depends_on references to it
from the remaining ~66 service modules.
2026-02-22 13:40:45 +00:00
Viktor Barzin
db659b1f7a [ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
  during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
  kubectl exec, since MetalLB virtual IPs aren't reachable from outside
  the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
  not mounted by kured DaemonSet)
2026-02-22 12:46:12 +00:00
Viktor Barzin
f05bf109c5 [ci skip] Increase Drone CI resource quota to handle concurrent builds
Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from
LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU /
96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU /
128Gi / 60 pods to safely support 5-6 concurrent builds.
2026-02-22 12:28:42 +00:00
Viktor Barzin
0ff2aaec60 [ci skip] Add native HLS playback for VIPLeague/DaddyLive streams (v1.3.1)
- Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying
  segments with correct Referer/Origin headers (uses ?domain= param)
- Add playerconfig service for detecting stream types (VIPLeague,
  DaddyLive, HLS) and extracting auth params from ksohls pages
- Add VIPLeague URL resolution: extract slug from URL path, match
  against DaddyLive 24/7 channel index with token-based scoring
- Replace Clappr with direct HLS.js player for better compatibility
- Add CryptoJS CDN for DaddyLive auth module support
- Disable CrowdSec on f1-stream ingress to prevent false positives
- Bump image to v1.3.1
2026-02-22 01:30:06 +00:00
Viktor Barzin
e59928187b [ci skip] Set CronJob backoffLimit=0 to prevent duplicate Slack alerts 2026-02-22 00:59:34 +00:00
Viktor Barzin
cd0c030a55 [ci skip] Fix CronJob kubectl image tag to :latest 2026-02-22 00:38:33 +00:00
Viktor Barzin
f79e84c693 [ci skip] Add cluster health check CronJob to OpenClaw module 2026-02-22 00:08:51 +00:00
Viktor Barzin
b925f9caf7 [ci skip] Add Slack webhook env var to OpenClaw deployment 2026-02-21 23:57:34 +00:00
Viktor Barzin
846eb3bd24 [ci skip] Add custom resource quota for authentik namespace
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
2026-02-21 23:44:05 +00:00
Viktor Barzin
d345841ef2 [ci skip] Add tier labels to all namespace resources for Kyverno resource governance
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.

Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
2026-02-21 23:38:05 +00:00
Viktor Barzin
517f5d6a6c [ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00