The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
https://dawarich.viktorbarzin.me/https://nextcloud.viktorbarzin.me/ \\
https://budget-viktor.viktorbarzin.me/
200 302 200
\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
\`\`\`
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
\`\`\`
kubectl -n dawarich get ingress dawarich -o \\
jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
\`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
\`\`\`
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
\`\`\`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
(name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Modules used filebase64("${path.root}/.git/git-crypt/keys/default")
which breaks with Terragrunt since path.root is now stacks/<service>/
instead of repo root. Changed to accept git_crypt_key_base64 variable
and resolve the path in the stack wrapper.
All 66 service modules removed from modules/kubernetes/main.tf (now
just a migration notice). The kubernetes_cluster module block removed
from root main.tf. All services now managed via stacks/<service>/.
Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium,
headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden,
reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard,
xray, mailserver, cloudflared, infra-maintenance.
Also removed null_resource.core_services and all depends_on references to it
from the remaining ~66 service modules.
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
kubectl exec, since MetalLB virtual IPs aren't reachable from outside
the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
not mounted by kured DaemonSet)
Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from
LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU /
96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU /
128Gi / 60 pods to safely support 5-6 concurrent builds.
- Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying
segments with correct Referer/Origin headers (uses ?domain= param)
- Add playerconfig service for detecting stream types (VIPLeague,
DaddyLive, HLS) and extracting auth params from ksohls pages
- Add VIPLeague URL resolution: extract slug from URL path, match
against DaddyLive 24/7 channel index with token-based scoring
- Replace Clappr with direct HLS.js player for better compatibility
- Add CryptoJS CDN for DaddyLive auth module support
- Disable CrowdSec on f1-stream ingress to prevent false positives
- Bump image to v1.3.1
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.
Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
When upstream JS constructs URLs via location.origin + '/path', the rw()
function stripped the origin but returned bare '/path' which hit our
server's HTML index. Now correctly prefixes with /proxy/{b64origin} so
XHR/fetch requests for scripts reach the upstream via proxy.
Bump image to v1.2.7
Video:
- Add allow="autoplay; encrypted-media; fullscreen" to iframe for media playback
Anti-debug:
- Strip ad/popup scripts (acscdn, popunder) and context menu blockers from HTML
- Strip debugger statements from inline HTML scripts and proxied JS responses
- Intercept setTimeout (not just setInterval) for debugger-based detection
- Override eval() and Function() constructor to strip debugger statements
- Bump image to v1.2.6
The priority injection policy was setting priorityClassName on pods but
Kubernetes had already defaulted priority=0 and preemptionPolicy=PreemptLowerPriority
on those pods, causing admission controller to reject the mismatch.
Switch from patchStrategicMerge to patchesJson6902 to explicitly remove
the priority and preemptionPolicy fields before setting priorityClassName.
- Remove flex centering from browser-viewer-content; use absolute positioning
for iframe to fill the entire container
- Strip disable-devtool and devtools-detect script tags from proxied HTML
- Add JS shim hooks to neutralize setInterval-based debugger traps and block
loading of anti-debug scripts via setAttribute
- Bump image to v1.2.5
Add sandbox attribute to prevent proxied pages from navigating
top.location or replacing the parent page body. Allows scripts,
same-origin, forms, popups, and presentation but blocks
top-navigation.
Replace CPU-intensive headless Chrome + WebRTC pipeline with a
lightweight Go reverse proxy that strips anti-framing headers
(X-Frame-Options, CSP) and embeds streaming sites in iframes.
- New internal/proxy package with URL rewriting for HTML/CSS
- JS shim injection to intercept fetch/XHR/WebSocket/createElement
- Referer reconstruction for correct cross-origin auth (HLS streams)
- Inline iframe viewer preserving site navigation (not fullscreen overlay)
- Add priority_class_name to nextcloud whiteboard deployment to match
Kyverno-injected tier-3-edge priority class
- Add explicit resource limits (4Gi memory) for OnlyOffice document
server to prevent OOMKill during font generation
- Enable size-based TSDB retention (45GB) to clean up old blocks
(including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
CPU/memory limits to allow Grafana rolling updates
Kyverno injects priorityClassName tier-1-cluster on pods in the crowdsec
namespace, but pods had no explicit priorityClassName set, defaulting
priority to 0. Admission controller rejected the mismatch (0 vs 800000).
Set priorityClassName on LAPI, agent (Helm values) and crowdsec-web
(Terraform deployment).
- Deploy coturn on k8s with MetalLB shared IP (10.0.20.200)
- Normal pod networking (no hostNetwork), runs on any node
- 100 relay ports (49152-49252), port 3478 for STUN/TURN signaling
- Shared secret auth for time-limited TURN credentials
- For F1 streaming WebRTC NAT traversal
- UI and API: 1 → 2 replicas for zero-downtime during restarts/crashes
- Celery worker: Recreate → RollingUpdate strategy
- Celery beat: unchanged (Recreate, singleton scheduler)
- Move f1 from Cloudflare proxied to non-proxied DNS
- Set WEBAUTHN_RPID/ORIGIN for f1.viktorbarzin.me domain
- Add NFS volume at /mnt/main/f1-stream for persistent session/stream data
- Enable headless browser extraction (HEADLESS_EXTRACT_ENABLED=true)
- Reduce replicas to 1 (file-based sessions don't work across replicas)
Fixes HA mobile app 403 when embedding Grafana dashboards - the webview
blocks third-party cookies needed by Authentik forward auth. Grafana
already has anonymous Viewer access enabled, so forward auth is not
needed. Also adds grafana_admin_password variable and explicit resource
limits to prevent ResourceQuota issues during rolling updates.