infra

Author	SHA1	Message	Date
Viktor Barzin	42961a5f58	[registry] fix-broken-blobs.sh — check revision-link, not blob data The original index-child scan checked if the child's blob data file existed under /blobs/sha256/<child>/data. That's wrong in a subtle way: registry:2 serves a per-repo manifest via the link file at <repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision links for its index's children also disappear — but the blob data survives (GC owns that, and runs weekly). Result: blob present, link absent, API 404 on HEAD — the exact 2026-04-19 failure mode. Live proof: the registry-integrity-probe CronJob just found 38 real orphan children (including 98f718c8 from the original incident) while the previous fix-broken-blobs.sh scan reported 0. After the fix, both tools agree. The probe had been authoritative all along; the scan was a false-negative because it was asking the wrong question. Post-mortem updated to reflect the true mechanism (link-file absence, not blob deletion). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:43:35 +00:00
Viktor Barzin	9f9d7d10ff	[registry] Scope OCI-index scan to private registry only Live run on the registry VM surfaced 632 "orphaned" index children across 156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden, openclaw). These aren't bugs — pull-through caches only fetch what's been requested, so missing arm64 / arm / attestation children are normal partial state. Scanning them generates noise that would mask the real signal from the private registry (where we push full manifests ourselves and a missing child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode). Change: index-child scan is now gated on registry_name == "private". Layer-link scan still runs across all registries (missing blob under a live link is always a bug, regardless of pull-through semantics). Verified: live run now reports 0 orphans in private registry — consistent with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan still inspects 425 links across all registries and finds 0 orphans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:23:04 +00:00
Viktor Barzin	7cb44d7264	[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (`a05d63e` / `6371e75` / `c113be4`) restored green CI but left the underlying bug unaddressed. Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes. Phase 1 — Detection: - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source. - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts — RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours. - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix. Phase 2 — Prevention: - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun. - modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto-delete — deleting a published image is a conscious decision. Layer-link scan preserved. 
Phase 3 — Recovery: - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml). - docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down. - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook. Out of scope (verified healthy or intentionally deferred): - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s). - Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day). - Diun exclude for registry:2 — not applicable; Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose. Verified locally: - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children. - terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings). - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean. Closes: code-4b8 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:08:28 +00:00
Viktor Barzin	752f94ab8f	[monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730 The `external-monitor-sync` script is opt-IN by default for any *.viktorbarzin.me ingress, so a missing annotation means "monitored." Both ingress factories previously OMITTED the annotation when `external_monitor = false`, which silently left monitors in place. Fix: when the caller sets `external_monitor = false` explicitly, emit `uptime.viktorbarzin.me/external-monitor = "false"` so the sync script deletes the monitor. Keep the previous behavior (no annotation) for callers that leave external_monitor null — otherwise 19 publicly-reachable services with `dns_type="none"` would lose monitoring. Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy) to match the other two already-flagged services. Delete the r730 ingress module entirely — the Dell server has been decommissioned.	2026-04-19 15:18:27 +00:00
Viktor Barzin	8d94688dde	[infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip] ## Context Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno drift-suppression") identified a second Kyverno admission-induced drift class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression landed in `c9d221d5`. The ClusterPolicy `sync-tls-secret` runs on every `kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and stamps the following labels on the generated Secret: app.kubernetes.io/managed-by = kyverno generate.kyverno.io/policy-name = sync-tls-secret generate.kyverno.io/policy-namespace = "" generate.kyverno.io/rule-name = sync-tls-secret generate.kyverno.io/source-kind = Secret generate.kyverno.io/source-namespace = kyverno generate.kyverno.io/source-uid = <uid> generate.kyverno.io/source-version = v1 generate.kyverno.io/source-group = "" generate.kyverno.io/clone-source = "" Terraform does not manage any labels on this Secret, so every `terragrunt plan` showed all 10 labels as `-> null`. This was observed on the dawarich stack (one of the 93 callers of setup_tls_secret) and reproduces identically on any stack that consumes this module. Root cause ticket: beads `code-seq`. ## This change Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit, 93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared. The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A convention (`c9d221d5`) — the rule now stands for "any Kyverno-induced drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression" section will grow to catalog the fields ignored; this commit keeps the scope tight to the code change. 
## What is NOT in this change - Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`) — a different admission controller, different resource, different fix. Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1 stacks. - AGENTS.md documentation expansion — will land alongside the Goldilocks sweep so both patterns are catalogued together. - Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret policy is the only generate policy that produces Secrets in this repo (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference). ## Verification Dawarich stack: ``` Before: Plan: 0 to add, 2 to change, 0 to destroy. (kubernetes_namespace.dawarich — Goldilocks drift, untouched) (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift) After: Plan: 0 to add, 1 to change, 0 to destroy. (kubernetes_namespace.dawarich — Goldilocks drift, untouched) ``` Closes: code-seq (partial — tls_secret branch) Refs: code-dwx (Goldilocks follow-up) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:23:02 +00:00
Viktor Barzin	e51bdb2af8	Add broker-sync Terraform stack (#7) * [f1-stream] Remove committed cluster-admin kubeconfig ## Context A kubeconfig granting cluster-admin access was accidentally committed into the f1-stream stack's application bundle in `c7c7047f` (2026-02-22). It contained the cluster CA certificate plus the kubernetes-admin client certificate and its RSA private key. Both remotes (github.com, forgejo) are public, so the credential has been reachable for ~2 months. Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references this path; the file is a stray local artifact, likely swept in during a bulk `git add`. ## This change - git rm stacks/f1-stream/files/.config ## What is NOT in this change - Cluster-admin cert rotation on the control plane. The leaked client cert must be invalidated separately via `kubeadm certs renew admin.conf` or CA regeneration. Tracked in the broader secrets-remediation plan. - Git-history rewrite. The file is still reachable in every commit since `c7c7047f`. A `git filter-repo --path ... --invert-paths` pass against a fresh mirror is planned and will be force-pushed to both remotes. ## Test plan ### Automated No tests needed for a file removal. Sanity: $ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \ --include='*.yaml' --include='*.yml' --include='*.sh' (no output) ### Manual Verification 1. `git show HEAD --stat` shows exactly one path deleted: stacks/f1-stream/files/.config \| 19 ------------------- 2. `test ! -e stacks/f1-stream/files/.config` returns true. 3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns` fails with 401/403 once the admin cert is renewed). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> [frigate] Remove orphan config.yaml with leaked RTSP passwords ## Context A Frigate configuration file was added to modules/kubernetes/frigate/ in `bcad200a` (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked stacks, scripts, and agent configs` commit. The file contains 14 inline rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP passwords for the cameras at 192.168.1.10 (LAN-only) and valchedrym.ddns.net (confirmed reachable from public internet on port 554). Both remotes are public, so the creds have been exposed for ~2 days. Grep across the repo confirms nothing references this config.yaml — the active stacks/frigate/main.tf stack reads its configuration from a persistent volume claim named `frigate-config-encrypted`, not from this file. The file is therefore an orphan from the bulk add, with no production function. ## This change - git rm modules/kubernetes/frigate/config.yaml ## What is NOT in this change - Camera password rotation. The user does not own the cameras; rotation must be coordinated out-of-band with the camera operators. The DDNS camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked password is high-priority to rotate from the device side. - Git-history rewrite. The file plus its leaked strings remain in all commits from `bcad200a` forward. Scheduled to be purged via `git filter-repo --path modules/kubernetes/frigate/config.yaml --invert-paths --replace-text <list>` in the broader remediation pass. - Future Frigate config provisioning. If the stack is re-platformed to source config from Git rather than the PVC, the replacement should go through ExternalSecret + env-var interpolation, not an inline YAML. ## Test plan ### Automated $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \ --include='*.yaml' --include='*.yml' --include='*.sh' (no output — confirms orphan status) ### Manual Verification 1. 
`git show HEAD --stat` shows exactly one deletion: modules/kubernetes/frigate/config.yaml \| 229 --------------------------------- 2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true. 3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the PVC bound (unaffected by this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token ## Context modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old expect(1) script for manual Let's Encrypt wildcard-cert renewal via Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium API token on line 7 (as an expect variable) and line 27 (inside a certbot-cleanup heredoc). Both remotes are public, so the token has been exposed for ~2.5 years. The script is not invoked by the module's Terraform (main.tf only creates a kubernetes.io/tls Secret from PEM files); it is a standalone run-it-yourself tool. grep across the repo confirms nothing references `renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret` module, nor any CI pipeline, nor any shell wrapper. A replacement script `renew2.sh` (4 weeks old) lives alongside it. It sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the current renewal path. ## This change - git rm modules/kubernetes/setup_tls_secret/renew.sh ## What is NOT in this change - Technitium token rotation. The leaked token still works against `technitium-web.technitium.svc.cluster.local:5380` until revoked in the Technitium admin UI. Rotation is a prerequisite for the upcoming git-history scrub, which will remove the token from every commit via `git filter-repo --replace-text`. - renew2.sh is retained as-is (already env-var-sourced; clean). - The setup_tls_secret module's main.tf is not touched; 20+ consuming stacks keep working. 
## Test plan ### Automated $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \ --include='*.yaml' --include='*.yml' --include='*.sh' (no output — confirms no consumer) $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be' (no output in HEAD after this commit) ### Manual Verification 1. `git show HEAD --stat` shows exactly one deletion: modules/kubernetes/setup_tls_secret/renew.sh \| 136 --------- 2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true. 3. `renew2.sh` still exists and is executable: ls -la modules/kubernetes/setup_tls_secret/renew2.sh 4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no behavioral regression because renew.sh was never part of the automated flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds ## Context stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old shell implementation of a power-cycle watchdog that polled the Dell iDRAC on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes are public, so those credentials — and the implicit statement that 'this host has not rotated the default BMC password' — have been exposed. The current implementation is main.py in the same directory. It reads iDRAC credentials from the environment variables `idrac_user` and `idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR constants), which are populated from Vault via ExternalSecret at runtime. main.sh is not referenced by any Terraform, ConfigMap, or deploy script — grep confirms no `file()` / `templatefile()` / `filebase64()` call loads it, and no hand-rolled shell wrapper invokes it. ## This change - git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh main.py is retained unchanged. 
## What is NOT in this change - iDRAC password rotation on 192.168.1.4. The BMC should be moved off the vendor default `calvin` regardless; rotation is tracked in the broader remediation plan and in the iDRAC web UI. - A separate finding in stacks/monitoring/modules/monitoring/idrac.tf (the redfish-exporter ConfigMap has `default: username: root, password: calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT addressed here — filed as its own task so the fix (drop the default block vs. source from env) can be considered in isolation. - Git-history scrub of main.sh is pending the broader filter-repo pass. ## Test plan ### Automated $ grep -rn 'server-power-cycle/main\.sh\\|main\.sh' \ --include='*.tf' --include='*.hcl' --include='*.yaml' \ --include='*.yml' --include='*.sh' (no consumer references) ### Manual Verification 1. `git show HEAD --stat` shows only the one deletion. 2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh` 3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows the exporter running — unrelated to this file. 4. main.py continues to run its watchdog loop without regression, because it was never coupled to main.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink ## Context foolery, terminal, and claude-memory each had their own `stacks/<x>/secrets/` directory with a plaintext EC-256 private key (privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B) for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink `secrets/` → `../../secrets`, which resolves to the repo-root /secrets/ directory covered by the `secrets/* filter=git-crypt` .gitattributes rule — i.e., every other stack consumes the same git-crypt-encrypted root wildcard cert. The 3 outliers shipped their keys in plaintext because `.gitattributes` secrets/** rule matches only repo-root /secrets/, not stacks/*/secrets/. 
Both remotes are public, so the 6 plaintext PEM files have been exposed for 1–6 weeks (commits `5a988133` 2026-03-11, `a6f71fc6` 2026-03-18, `9820f2ce` 2026-04-10). Verified: - Root wildcard cert subject = CN viktorbarzin.me, SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains. - Root privkey + fullchain are a valid key pair (pubkey SHA256 match). - All 3 outlier certs have the same subject/SAN as root; different cert material but equivalent coverage. ## This change - Delete plaintext PEMs in all 3 outlier stacks (6 files total). - Replace each stacks/<x>/secrets directory with a symlink to ../../secrets, matching the fleet pattern. - Add `stacks/*/secrets/ filter=git-crypt diff=git-crypt` to .gitattributes as a regression guard — any future real file placed under stacks/<x>/secrets/ gets git-crypt-encrypted automatically. setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`, which via the symlink resolves to the root wildcard. ## What is NOT in this change - Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise` once the user's LE account is authenticated. Revocation must happen before or alongside the history-rewrite force-push to both remotes. - Git-history scrub. The leaked PEM blobs are still reachable in every commit from 2026-03-11 forward. Scheduled for removal via `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths` (and fullchain.pem for each stack) in the broader remediation pass. - cert-manager introduction. The fleet does not use cert-manager today; this commit matches the existing symlink-to-wildcard pattern rather than introducing a new component. 
## Test plan ### Automated $ readlink stacks/foolery/secrets ../../secrets (likewise for terminal, claude-memory) $ for s in foolery terminal claude-memory; do openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject done subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard) $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem stacks/foolery/secrets/fullchain.pem: filter: git-crypt (now matched by the new rule, though for the symlink target the repo-root rule already applied) ### Manual Verification 1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory shows only the K8s TLS secret being re-created with the root-wildcard material. No ingress changes. 2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with the root serial (different from the pre-change per-stack serials). 3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal, claude-memory) → cert chain presents the new serial, handshake OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add broker-sync Terraform stack (pending apply) Context ------- Part of the broker-sync rollout — see the plan at ~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the companion repo at ViktorBarzin/broker-sync. This change ----------- New stack `stacks/broker-sync/`: - `broker-sync` namespace, aux tier. - ExternalSecret pulling `secret/broker-sync` via vault-kv ClusterSecretStore. - `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted, auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio cookie, CSV archive, watermarks. - Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public DockerHub image; no pull secret): * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync version`), used to smoke-test each new image. * `broker-sync-trading212` — daily 02:00 `broker-sync trading212 --mode steady`. 
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2). * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3). * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED (Phase 1 tail). - `broker-sync-backup` — daily 04:15, snapshots /data into NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches the convention in infra/.claude/CLAUDE.md §3-2-1. NOT in this commit: - Old `wealthfolio-sync` CronJob retirement in stacks/wealthfolio/main.tf — happens in the same commit that first applies this stack, per the plan's "clean cutover" decision. - Vault seed. `secret/broker-sync` must be populated before apply; required keys documented in the ExternalSecret comment block. Test plan --------- ## Automated - `terraform fmt` — clean (ran before commit). - `terraform validate` needs `terragrunt init` first; deferred to apply time. ## Manual Verification 1. Seed Vault `secret/broker-sync/` (see comment block on the ExternalSecret in main.tf). 2. `cd stacks/broker-sync && scripts/tg apply`. 3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended. 4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`. 5. `kubectl -n broker-sync logs -l job-name=smoke` — expect `broker-sync 0.1.0`. fix(beads-server): disable Authentik + CrowdSec on Workbench Authentik forward-auth returns 400 for dolt-workbench (no Authentik application configured for this domain). CrowdSec bouncer also intermittently returns 400. Both disabled — Workbench is accessible via Cloudflare tunnel only. TODO: Create Authentik application for dolt-workbench.viktorbarzin.me Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 21:17:45 +01:00
Viktor Barzin	b034c868db	[traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping. Both plugins load without errors but never inject content. Removed: - rewrite-body plugin download (init container) and registration - strip-accept-encoding middleware (only existed for rewrite-body bug) - anti-ai-trap-links middleware (used rewrite-body for injection) - rybbit_site_id variable from ingress_factory and reverse_proxy factory - rybbit_site_id from 25 service stacks (39 instances) - Per-service rybbit-analytics middleware CRD resources Kept: - compress middleware (entrypoint-level, working correctly) - ai-bot-block middleware (ForwardAuth to bot-block-proxy) - anti-ai-headers middleware (X-Robots-Tag: noai, noimageai) - All CrowdSec, Authentik, rate-limit middleware unchanged Next: Cloudflare Workers with HTMLRewriter for edge-side injection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 12:41:17 +00:00
Viktor Barzin	66d2d9916b	[infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip] ## Context Two operational gaps surfaced during a healthcheck sweep today: 1. External monitoring coverage: Only ~13 hostnames (via `cloudflare_proxied_names` in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT registered for external probing — so outages like Immich going down externally were invisible until a user complained. 99 of ~125 public ingresses had no external monitor. 2. actualbudget stack unplannable: `count = var.budget_encryption_password != null ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the value flows from a `data.kubernetes_secret` whose contents are `(known after apply)` at plan time. Blocked CI applies and drift reconciliation. ## This change ### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory) - New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string, nullable). Default is "follow dns_type" — enabled for any public DNS record (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other direct-A records are also monitored). - Emits two annotations on the Ingress: - `uptime.viktorbarzin.me/external-monitor = "true"` - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override) ### external-monitor-sync CronJob (uptime-kuma stack) - Discovers targets from live Ingress objects via the K8s API first (filter by annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any API error (zero rollout risk). - New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving `list`/`get` on `networking.k8s.io/ingresses`. 
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s) instead of `kubernetes.default.svc` — the search-domain expansion failed in the CronJob pod's DNS config. Verified working: CronJob now logs `Loaded N external monitor targets (source=k8s-api)`. ### actualbudget count-on-unknown refactor - Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at plan; no `-target` workaround needed. - Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is unchanged — the secret is still consumed via env var. - Also aligned the factory with live state (the 3 budget-* PVCs had been migrated `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module removed. State was rm'd + re-imported with matching UIDs, so no data was moved. ## Rollout status (already partially applied in this session) - `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified - `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally - `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live - CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active (was 13 on the central list) ## Deferred (separate work) - 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory, rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade. `[ci skip]` here so those don't auto-apply; they will be fixed manually before the next CI push. - Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik, grafana, vault, forgejo) are annotated — separate PR. 
## Test plan ### Automated ``` $ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name \| tail -1) Loaded 26 external monitor targets (source=k8s-api) Sync complete: 7 created, 0 deleted, 17 unchanged $ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \ https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \ https://budget-viktor.viktorbarzin.me/ 200 302 200 $ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor deployment.apps/budget-viktor 1/1 1 1 Ready persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted ``` ### Manual Verification 1. Confirm the annotation is present on an ingress_factory ingress: ``` kubectl -n dawarich get ingress dawarich -o \ jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}' # Expected: "true" ``` 2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min (CronJob interval). For Immich specifically, it will appear after the immich stack is re-applied. 3. Verify actualbudget plan is clean: ``` cd stacks/actualbudget && scripts/tg plan --non-interactive # Expected: no "Invalid count argument" errors ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 10:34:32 +00:00
Viktor Barzin	f8facf44dd	[infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps ## Context The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI injection returned HTTP 200 with "Error 404: Not Found" body. Root cause: middleware specs referenced plugin name `rewrite-body` but Traefik registered it as `traefik-plugin-rewritebody`. Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3 which uses the correct plugin name. Also added `lastModified = true` and `methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML responses. ## This change - Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3 - Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI) - Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13) - Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts - Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule) - Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2, networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0 - MySQL standalone storage_limit 30Gi → 50Gi - beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 05:51:52 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf | module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list | cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	ea18116da9	fix: NFS outage recovery — migrate to NFSv4, add alerting NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14). All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE. Changes: - nfs_volume module: add nfsvers=4 mount option - nfs-csi StorageClass: add nfsvers=4 mount option - dbaas: MySQL serverInstances 3→1, mysql-native-password=ON - monitoring: add NFSCSINodeDown and NFSMountFailures alerts [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 10:28:27 +00:00
Viktor Barzin	6101fb99f9	Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip] - Prometheus: persist metric whitelist (keep rules) to Helm template, preventing regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w. - MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0, doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners. - etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency. - VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module. - Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress). - Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3. - Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 19:01:21 +00:00
Viktor Barzin	c2f9ca0d13	modules: improve create-vm with additional config options and cloud-init updates	2026-04-06 11:57:55 +03:00
Viktor Barzin	d1059d6017	registry: set proxy TTL to 0 to prevent stale :latest images Blob caching (content-addressed by SHA256) is unaffected — only manifest re-validation changes. Every pull now checks upstream for the current manifest digest, eliminating stale :latest tag issues.	2026-03-30 00:02:48 +03:00
Viktor Barzin	28587c674d	fix-broken-blobs: use argparse for proper flag handling --dry-run as first arg was being parsed as the BASE directory path.	2026-03-29 22:33:33 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	facf959ecf	fix registry healthchecks: use 127.0.0.1 instead of localhost localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4 only), causing wget to fail with "Connection refused". The nginx proxy had 18,462 consecutive health check failures because of this. Also cleared corrupted pull-through cache for mghee/novelapp — the registry had layer link files pointing to non-existent blob data, causing containerd to get 200 responses with 0 bytes (unexpected EOF).	2026-03-29 22:29:27 +03:00
Viktor Barzin	878b556179	state(monitoring): update encrypted state	2026-03-29 01:04:11 +02:00
Viktor Barzin	8c6f238697	add default Homepage annotations to ingress_factory for auto-discovery - ingress_factory now injects gethomepage.dev/* annotations on all ingresses (name, group, href, icon) with namespace-to-group mapping - Stacks with explicit annotations override defaults via merge order - New homepage_enabled var allows opt-out for internal-only ingresses - Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap) - Added hideErrors and quicklaunch settings for clean service directory - Result: 116/134 ingresses now discoverable (up from ~30)	2026-03-25 11:00:38 +02:00
Viktor Barzin	2dcb4b7fa4	fix(renew-tls): clean stale _acme-challenge TXT records before certbot 21+ stale TXT records accumulated from previous runs, causing certbot DNS-01 challenge to fail. Now deletes all _acme-challenge records from Cloudflare before certbot creates fresh ones.	2026-03-23 22:32:27 +02:00
Viktor Barzin	3f0ecda737	harden pull-through cache: intercept errors, reduce lock timeout, add healthz - Add proxy_intercept_errors + error_page for 502/503/504 on blob locations to prevent caching truncated upstream responses (root cause of repeated ImagePullBackOff across services) - Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd retry instead of all concurrent pulls waiting on a failed first download - Add proxy_cache_valid any 0 — never cache error responses - Add /healthz endpoints on Docker Hub and GHCR servers - Add draintimeout and proxy.ttl to registry proxy configs	2026-03-23 11:33:06 +02:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 */6 * * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	36171bcda4	add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me - Add auth.htpasswd section to config-private.yml - Mount htpasswd file in registry-private container, fix healthcheck for 401 - Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me - Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body) - Add docker to cloudflare_proxied_names (registry stays non-proxied) - Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces - Update infra provisioning to install apache2-utils and generate htpasswd from Vault	2026-03-22 22:10:10 +02:00
Viktor Barzin	250a058627	feat(traefik): add custom error pages with tarampampam/error-pages Deploy error-pages service to show themed error pages instead of raw Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1) for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.	2026-03-19 23:14:27 +00:00
Viktor Barzin	67d1ce453c	add /sentinel dir to cloud-init for kured reboot gating The kured sentinel gate DaemonSet requires /sentinel to exist on all nodes. Without it, kured pods get stuck in ContainerCreating with hostPath mount failure. Previously created manually; now provisioned automatically for new nodes.	2026-03-19 19:57:27 +00:00
Viktor Barzin	f8a36f0621	fix pull-through cache: remove maxsize, harden nginx caching [ci skip] Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to delete blob data while keeping metadata. Registry then served 200 OK with correct Content-Length but 0 bytes body. nginx cached these broken responses. Fixes: - Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC) - nginx: don't cache 206 responses, require 2 requests before caching - Wiped corrupted cache on registry VM and fixed corrupted pause container blobs on node3/node4	2026-03-16 07:41:11 +00:00
Viktor Barzin	c034adab5f	mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief	2026-03-15 17:23:39 +00:00
Viktor Barzin	7e72a10848	exclude manifest requests from nginx registry cache Split /v2/ location into two: regex match for blobs (cached 24h, immutable content-addressed by SHA256) and prefix match for everything else including manifests (proxy_cache off, mutable tags). Also remove disabled registries (quay, k8s, kyverno) whose containers/configs don't exist on the VM.	2026-03-14 23:42:17 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	1b78e44ab6	[ci skip] fix: add mount_options to nfs_volume PV spec StorageClass mountOptions only apply during dynamic provisioning. Static PVs (created by Terraform) need mount_options set explicitly. Without this, all CSI NFS mounts default to hard,timeo=600 — the exact problem we were trying to fix.	2026-03-02 20:22:47 +00:00
Viktor Barzin	c702fd2565	[ci skip] add NFS CSI driver + nfs_volume shared module - Deploy csi-driver-nfs Helm chart as platform module (nfs-csi) - Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options - Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)	2026-03-01 23:38:58 +00:00
Viktor Barzin	7ff3c61bd7	[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain	2026-03-01 14:35:53 +00:00
Viktor Barzin	946b5b1745	[ci skip] add qemu-guest-agent to VM templates and enable agent by default	2026-03-01 01:58:46 +00:00
Viktor Barzin	09a810f8fb	[ci skip] fix: use $http_host in nginx to preserve port in registry redirects	2026-02-28 20:16:03 +00:00
Viktor Barzin	96c0353c13	[ci skip] add TLS to private registry, switch to registry.viktorbarzin.me	2026-02-28 19:40:38 +00:00
Viktor Barzin	925dbe39c1	[ci skip] add registry-private service to Docker Compose stack	2026-02-28 17:57:04 +00:00
Viktor Barzin	64c55a6710	[ci skip] add nginx upstream and server block for private registry on port 5050	2026-02-28 17:57:03 +00:00
Viktor Barzin	2102ffdb8b	[ci skip] add private R/W registry config for CI build caching	2026-02-28 17:56:50 +00:00
Viktor Barzin	865b68ce77	[ci skip] Rebuild docker-registry with nginx serialization on all ports Replace individual `docker run` commands with Docker Compose stack managed by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040) with proxy_cache_lock to serialize concurrent blob pulls and prevent corrupt partial responses. Adds QEMU guest agent for remote management.	2026-02-22 21:45:53 +00:00
Viktor Barzin	006f95337e	[ci skip] Add anti_ai_scraping option to ingress_factory (default: true)	2026-02-22 19:50:07 +00:00
Viktor Barzin	116c4d9c30	[ci skip] Remove legacy files and orphaned modules Delete 20 orphaned module directories and 3 stray files from modules/kubernetes/ that are no longer referenced by any stack. Remove 7 root-level legacy files including the empty tfstate, 27MB terraform zip, commented-out main.tf, and migration notes. Clean up commented-out dockerhub_secret and oauth-proxy references in blog, travel_blog, and city-guesser stacks. Remove stale frigate config.yaml entry from .gitignore. Remove ephemeral docs/plans/ directory.	2026-02-22 15:23:27 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	945a5f35b0	[ci skip] Fix path.root references for git-crypt key in openclaw and drone Modules used filebase64("${path.root}/.git/git-crypt/keys/default") which breaks with Terragrunt since path.root is now stacks/<service>/ instead of repo root. Changed to accept git_crypt_key_base64 variable and resolve the path in the stack wrapper.	2026-02-22 14:01:02 +00:00
Viktor Barzin	71bfdc8e89	[ci skip] Phase 3: Remove migrated service modules from monolith All 66 service modules removed from modules/kubernetes/main.tf (now just a migration notice). The kubernetes_cluster module block removed from root main.tf. All services now managed via stacks/<service>/.	2026-02-22 13:58:07 +00:00
Viktor Barzin	39ce2000cf	[ci skip] Remove 22 platform services from modules/kubernetes/main.tf Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium, headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden, reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard, xray, mailserver, cloudflared, infra-maintenance. Also removed null_resource.core_services and all depends_on references to it from the remaining ~66 service modules.	2026-02-22 13:40:45 +00:00
Viktor Barzin	db659b1f7a	[ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure - Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled during webpack build on startup - Rewrite DNS healthcheck to test from inside the Technitium pod via kubectl exec, since MetalLB virtual IPs aren't reachable from outside the L2 network - Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled, not mounted by kured DaemonSet)	2026-02-22 12:46:12 +00:00
Viktor Barzin	f05bf109c5	[ci skip] Increase Drone CI resource quota to handle concurrent builds Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU / 96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU / 128Gi / 60 pods to safely support 5-6 concurrent builds.	2026-02-22 12:28:42 +00:00
Viktor Barzin	0ff2aaec60	[ci skip] Add native HLS playback for VIPLeague/DaddyLive streams (v1.3.1) - Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying segments with correct Referer/Origin headers (uses ?domain= param) - Add playerconfig service for detecting stream types (VIPLeague, DaddyLive, HLS) and extracting auth params from ksohls pages - Add VIPLeague URL resolution: extract slug from URL path, match against DaddyLive 24/7 channel index with token-based scoring - Replace Clappr with direct HLS.js player for better compatibility - Add CryptoJS CDN for DaddyLive auth module support - Disable CrowdSec on f1-stream ingress to prevent false positives - Bump image to v1.3.1	2026-02-22 01:30:06 +00:00
Viktor Barzin	e59928187b	[ci skip] Set CronJob backoffLimit=0 to prevent duplicate Slack alerts	2026-02-22 00:59:34 +00:00


1093 commits