2026-03-07 14:30:36 +00:00
|
|
|
variable "tls_secret_name" {
|
2026-03-14 08:51:45 +00:00
|
|
|
type = string
|
2026-03-07 14:30:36 +00:00
|
|
|
sensitive = true
|
|
|
|
|
}
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
variable "nfs_server" { type = string }
|
|
|
|
|
variable "redis_host" { type = string }
|
|
|
|
|
variable "mysql_host" { type = string }
|
2026-02-22 13:56:34 +00:00
|
|
|
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
resource "kubernetes_manifest" "external_secret" {
|
|
|
|
|
manifest = {
|
|
|
|
|
apiVersion = "external-secrets.io/v1beta1"
|
|
|
|
|
kind = "ExternalSecret"
|
|
|
|
|
metadata = {
|
|
|
|
|
name = "real-estate-crawler-secrets"
|
|
|
|
|
namespace = "realestate-crawler"
|
|
|
|
|
}
|
|
|
|
|
spec = {
|
|
|
|
|
refreshInterval = "15m"
|
|
|
|
|
secretStoreRef = {
|
|
|
|
|
name = "vault-kv"
|
|
|
|
|
kind = "ClusterSecretStore"
|
|
|
|
|
}
|
|
|
|
|
target = {
|
|
|
|
|
name = "real-estate-crawler-secrets"
|
|
|
|
|
}
|
|
|
|
|
dataFrom = [{
|
|
|
|
|
extract = {
|
|
|
|
|
key = "real-estate-crawler"
|
|
|
|
|
}
|
|
|
|
|
}]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
depends_on = [kubernetes_namespace.realestate-crawler]
|
|
|
|
|
}
|
|
|
|
|
|
2026-03-17 07:39:29 +00:00
|
|
|
# DB credentials from Vault database engine (rotated automatically)
|
|
|
|
|
# Provides DB_CONNECTION_STRING that auto-updates when password rotates
|
|
|
|
|
resource "kubernetes_manifest" "db_external_secret" {
|
|
|
|
|
manifest = {
|
|
|
|
|
apiVersion = "external-secrets.io/v1beta1"
|
|
|
|
|
kind = "ExternalSecret"
|
|
|
|
|
metadata = {
|
|
|
|
|
name = "realestate-crawler-db-creds"
|
|
|
|
|
namespace = "realestate-crawler"
|
|
|
|
|
}
|
|
|
|
|
spec = {
|
|
|
|
|
refreshInterval = "15m"
|
|
|
|
|
secretStoreRef = {
|
|
|
|
|
name = "vault-database"
|
|
|
|
|
kind = "ClusterSecretStore"
|
|
|
|
|
}
|
|
|
|
|
target = {
|
|
|
|
|
name = "realestate-crawler-db-creds"
|
|
|
|
|
template = {
|
|
|
|
|
data = {
|
|
|
|
|
DB_CONNECTION_STRING = "mysql://wrongmove:{{ .password }}@${var.mysql_host}:3306/wrongmove"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
data = [{
|
|
|
|
|
secretKey = "password"
|
|
|
|
|
remoteRef = {
|
|
|
|
|
key = "static-creds/mysql-wrongmove"
|
|
|
|
|
property = "password"
|
|
|
|
|
}
|
|
|
|
|
}]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
depends_on = [kubernetes_namespace.realestate-crawler]
|
|
|
|
|
}
|
|
|
|
|
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
data "kubernetes_secret" "eso_secrets" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "real-estate-crawler-secrets"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
}
|
|
|
|
|
depends_on = [kubernetes_manifest.external_secret]
|
2026-03-14 17:15:48 +00:00
|
|
|
}
|
|
|
|
|
|
2026-05-18 19:11:43 +00:00
|
|
|
# DockerHub pull-secret — image is private on DockerHub
|
|
|
|
|
# (viktorbarzin/realestatecrawler and viktorbarzin/immoweb) and Keel polls
|
|
|
|
|
# the registry HEAD hourly for digest changes. Without auth, Keel hits 401
|
|
|
|
|
# and the rollout never fires. Pods themselves were pulling fine only
|
|
|
|
|
# because the image landed in the node's containerd cache months ago — a
|
|
|
|
|
# fresh node would also fail. ESO renders the dockerconfigjson server-side
|
|
|
|
|
# (Sprig `b64enc`) so the PAT never sits in K8s in cleartext.
|
|
|
|
|
resource "kubernetes_manifest" "dockerhub_pull_secret" {
|
|
|
|
|
manifest = {
|
|
|
|
|
apiVersion = "external-secrets.io/v1beta1"
|
|
|
|
|
kind = "ExternalSecret"
|
|
|
|
|
metadata = {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
}
|
|
|
|
|
spec = {
|
|
|
|
|
refreshInterval = "1h"
|
|
|
|
|
secretStoreRef = {
|
|
|
|
|
name = "vault-kv"
|
|
|
|
|
kind = "ClusterSecretStore"
|
|
|
|
|
}
|
|
|
|
|
target = {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
template = {
|
|
|
|
|
type = "kubernetes.io/dockerconfigjson"
|
|
|
|
|
data = {
|
|
|
|
|
".dockerconfigjson" = "{\"auths\":{\"https://index.docker.io/v1/\":{\"username\":\"viktorbarzin\",\"password\":\"{{ .pat }}\",\"auth\":\"{{ printf \"viktorbarzin:%s\" .pat | b64enc }}\"}}}"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
data = [{
|
|
|
|
|
secretKey = "pat"
|
|
|
|
|
remoteRef = {
|
|
|
|
|
key = "viktor"
|
|
|
|
|
property = "dockerhub_registry_password"
|
|
|
|
|
}
|
|
|
|
|
}]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
depends_on = [kubernetes_namespace.realestate-crawler]
|
|
|
|
|
}
|
|
|
|
|
|
2026-03-14 17:15:48 +00:00
|
|
|
locals {
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
notification_settings = jsondecode(data.kubernetes_secret.eso_secrets.data["notification_settings"])
|
real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:
- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
£1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
£400k-1.2M
Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.
Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.
Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 19:26:57 +00:00
|
|
|
|
|
|
|
|
# Periodic scrape schedules consumed by celery-beat via SCRAPE_SCHEDULES env var.
|
|
|
|
|
# Schema: config/schedule_config.py:ScheduleConfig. Cron fields are UTC.
|
|
|
|
|
# Daily RENT London 1-2 bed £1900-4000 at 03:00 UTC (~04:00 BST).
|
|
|
|
|
# Weekly BUY London 1-2 bed £400k-1.2M at Sun 04:00 UTC.
|
|
|
|
|
scrape_schedules = jsonencode([
|
|
|
|
|
{
|
|
|
|
|
name = "london-rent-daily"
|
|
|
|
|
listing_type = "RENT"
|
|
|
|
|
minute = "0"
|
|
|
|
|
hour = "3"
|
|
|
|
|
day_of_week = "*"
|
|
|
|
|
min_bedrooms = 1
|
|
|
|
|
max_bedrooms = 2
|
|
|
|
|
min_price = 1900
|
|
|
|
|
max_price = 4000
|
|
|
|
|
district_names = ["London"]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
name = "london-buy-weekly"
|
|
|
|
|
listing_type = "BUY"
|
|
|
|
|
minute = "0"
|
|
|
|
|
hour = "4"
|
|
|
|
|
day_of_week = "0"
|
|
|
|
|
min_bedrooms = 1
|
|
|
|
|
max_bedrooms = 2
|
|
|
|
|
min_price = 400000
|
|
|
|
|
max_price = 1200000
|
|
|
|
|
district_names = ["London"]
|
|
|
|
|
},
|
|
|
|
|
])
|
2026-03-14 17:15:48 +00:00
|
|
|
}
|
|
|
|
|
|
2026-02-22 13:56:34 +00:00
|
|
|
|
2026-02-22 15:13:55 +00:00
|
|
|
resource "kubernetes_namespace" "realestate-crawler" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler"
|
|
|
|
|
labels = {
|
|
|
|
|
"istio-injection" : "disabled"
|
2026-05-18 19:11:43 +00:00
|
|
|
tier = local.tiers.aux
|
2026-05-16 12:41:05 +00:00
|
|
|
"keel.sh/enrolled" = "true"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
}
|
[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
|
|
|
lifecycle {
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
|
|
|
|
|
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
module "tls_secret" {
|
|
|
|
|
source = "../../modules/kubernetes/setup_tls_secret"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
}
|
|
|
|
|
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
module "nfs_data_host" {
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
source = "../../modules/kubernetes/nfs_volume"
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
name = "real-estate-crawler-data-host"
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
nfs_server = "192.168.1.127"
|
|
|
|
|
nfs_path = "/srv/nfs/real-estate-crawler"
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
}
|
|
|
|
|
|
2026-02-22 15:13:55 +00:00
|
|
|
resource "kubernetes_deployment" "realestate-crawler-ui" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-ui"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-ui"
|
|
|
|
|
tier = local.tiers.aux
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
replicas = 2
|
|
|
|
|
strategy {
|
|
|
|
|
type = "RollingUpdate"
|
|
|
|
|
rolling_update {
|
|
|
|
|
max_unavailable = 0
|
|
|
|
|
max_surge = 1
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
selector {
|
|
|
|
|
match_labels = {
|
|
|
|
|
app = "realestate-crawler-ui"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
template {
|
|
|
|
|
metadata {
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-ui"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
spec {
|
2026-05-18 19:11:43 +00:00
|
|
|
image_pull_secrets {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
container {
|
|
|
|
|
name = "realestate-crawler-ui"
|
|
|
|
|
image = "viktorbarzin/immoweb:latest"
|
|
|
|
|
port {
|
|
|
|
|
name = "http"
|
|
|
|
|
container_port = 8080
|
|
|
|
|
protocol = "TCP"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "ENV"
|
|
|
|
|
value = "prod"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
lifecycle {
|
|
|
|
|
ignore_changes = [
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
2026-02-22 15:13:55 +00:00
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "realestate-crawler-ui" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-ui"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
"app" = "realestate-crawler-ui"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
spec {
|
|
|
|
|
selector = {
|
|
|
|
|
app = "realestate-crawler-ui"
|
|
|
|
|
}
|
|
|
|
|
port {
|
|
|
|
|
port = 80
|
|
|
|
|
target_port = 8080
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_deployment" "realestate-crawler-api" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-api"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-api"
|
|
|
|
|
tier = local.tiers.aux
|
|
|
|
|
}
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
annotations = {
|
|
|
|
|
"reloader.stakater.com/auto" = "true"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
2026-03-22 03:10:12 +02:00
|
|
|
replicas = 1
|
2026-02-22 15:13:55 +00:00
|
|
|
strategy {
|
|
|
|
|
type = "RollingUpdate"
|
|
|
|
|
rolling_update {
|
|
|
|
|
max_unavailable = 0
|
|
|
|
|
max_surge = 1
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
selector {
|
|
|
|
|
match_labels = {
|
|
|
|
|
app = "realestate-crawler-api"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
template {
|
|
|
|
|
metadata {
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-api"
|
|
|
|
|
"kubernetes.io/cluster-service" = "true"
|
|
|
|
|
}
|
2026-03-15 19:17:44 +00:00
|
|
|
annotations = {
|
2026-04-19 12:44:46 +00:00
|
|
|
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
|
2026-03-15 19:17:44 +00:00
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
2026-05-18 19:11:43 +00:00
|
|
|
image_pull_secrets {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
container {
|
|
|
|
|
name = "realestate-crawler-api"
|
|
|
|
|
image = "viktorbarzin/realestatecrawler:latest"
|
|
|
|
|
image_pull_policy = "Always"
|
|
|
|
|
env {
|
|
|
|
|
name = "ENV"
|
|
|
|
|
value = "prod"
|
|
|
|
|
}
|
|
|
|
|
env {
|
2026-03-17 07:39:29 +00:00
|
|
|
name = "DB_CONNECTION_STRING"
|
|
|
|
|
value_from {
|
|
|
|
|
secret_key_ref {
|
|
|
|
|
name = "realestate-crawler-db-creds"
|
|
|
|
|
key = "DB_CONNECTION_STRING"
|
|
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_BROKER_URL"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/0"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_RESULT_BACKEND"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/1"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
env {
|
|
|
|
|
name = "UVICORN_LOG_LEVEL"
|
|
|
|
|
value = "debug"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OSRM_FOOT_URL"
|
|
|
|
|
value = "http://osrm-foot.osm-routing.svc.cluster.local:5000"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OSRM_BICYCLE_URL"
|
|
|
|
|
value = "http://osrm-bicycle.osm-routing.svc.cluster.local:5000"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OTP_URL"
|
|
|
|
|
value = "http://otp.osm-routing.svc.cluster.local:8080"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "SLACK_WEBHOOK_URL"
|
2026-03-15 21:01:24 +00:00
|
|
|
value = local.notification_settings["slack"]["webhook_url"]
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "WEBAUTHN_RP_ID"
|
|
|
|
|
value = "wrongmove.viktorbarzin.me"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "WEBAUTHN_ORIGIN"
|
|
|
|
|
value = "https://wrongmove.viktorbarzin.me"
|
|
|
|
|
}
|
|
|
|
|
port {
|
|
|
|
|
name = "http"
|
|
|
|
|
container_port = 5001
|
|
|
|
|
protocol = "TCP"
|
|
|
|
|
}
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
resources {
|
|
|
|
|
requests = {
|
[ci skip] right-size all pod resources based on VPA + live metrics audit
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.
Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi
Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
immich-postgresql, osrm-foot
GPU containers: added CPU/mem alongside GPU limits:
- ollama: removed CPU/mem limits (models vary in size), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU
Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage
Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
|
|
|
cpu = "15m"
|
2026-03-15 02:33:46 +00:00
|
|
|
memory = "256Mi"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
}
|
|
|
|
|
limits = {
|
2026-03-15 02:33:46 +00:00
|
|
|
memory = "256Mi"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
volume_mount {
|
|
|
|
|
name = "data"
|
|
|
|
|
mount_path = "/app/data"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
volume {
|
|
|
|
|
name = "data"
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
persistent_volume_claim {
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
claim_name = module.nfs_data_host.claim_name
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
lifecycle {
|
|
|
|
|
ignore_changes = [
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
2026-02-22 15:13:55 +00:00
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
resource "kubernetes_service" "realestate-crawler-api" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-api"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
"app" = "realestate-crawler-api"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
spec {
|
|
|
|
|
selector = {
|
|
|
|
|
app = "realestate-crawler-api"
|
|
|
|
|
}
|
|
|
|
|
port {
|
|
|
|
|
port = 80
|
|
|
|
|
target_port = 5001
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
anubis: fix 500 on multi-replica + roll out to 6 more public sites
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).
Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.
Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).
End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
2026-05-10 00:50:30 +00:00
|
|
|
# Anubis fronts the UI ingress only; the /api ingress (`module "ingress-api"`)
|
|
|
|
|
# stays direct so XHRs from the UI bypass the challenge.
|
|
|
|
|
module "anubis" {
|
anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:54:54 +00:00
|
|
|
source = "../../modules/kubernetes/anubis_instance"
|
|
|
|
|
name = "wrongmove"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
target_url = "http://realestate-crawler-ui.${kubernetes_namespace.realestate-crawler.metadata[0].name}.svc.cluster.local"
|
|
|
|
|
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/12"
|
anubis: fix 500 on multi-replica + roll out to 6 more public sites
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).
Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.
Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).
End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
2026-05-10 00:50:30 +00:00
|
|
|
}
|
|
|
|
|
|
2026-02-22 15:13:55 +00:00
|
|
|
module "ingress" {
|
2026-05-10 10:54:38 +00:00
|
|
|
source = "../../modules/kubernetes/ingress_factory"
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
|
2026-05-10 10:54:38 +00:00
|
|
|
dns_type = "proxied"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
name = "wrongmove"
|
|
|
|
|
service_name = module.anubis.service_name
|
|
|
|
|
port = module.anubis.service_port
|
|
|
|
|
extra_middlewares = ["traefik-x402@kubernetescrd"]
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
anti_ai_scraping = false
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
2026-03-07 16:41:36 +00:00
|
|
|
extra_annotations = {
|
|
|
|
|
"gethomepage.dev/enabled" = "true"
|
|
|
|
|
"gethomepage.dev/name" = "Wrongmove"
|
|
|
|
|
"gethomepage.dev/description" = "Property search"
|
|
|
|
|
"gethomepage.dev/icon" = "home-assistant.png"
|
|
|
|
|
"gethomepage.dev/group" = "Other"
|
|
|
|
|
"gethomepage.dev/pod-selector" = ""
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
module "ingress-api" {
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
source = "../../modules/kubernetes/ingress_factory"
|
real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:
- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
£1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
£400k-1.2M
Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.
Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.
Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 19:26:57 +00:00
|
|
|
# Wrongmove's public UI is Anubis-fronted (auth = "none" on the / path); this
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
# /api ingress serves XHRs from that public UI. Forward-auth here would
|
|
|
|
|
# break the UI.
|
real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:
- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
£1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
£400k-1.2M
Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.
Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.
Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 19:26:57 +00:00
|
|
|
# auth = "none": XHR endpoint for the Anubis-fronted public UI; forward-auth would break CORS.
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
auth = "none"
|
2026-04-16 13:45:04 +00:00
|
|
|
dns_type = "proxied"
|
2026-02-22 15:13:55 +00:00
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
name = "wrongmove-api"
|
|
|
|
|
host = "wrongmove"
|
|
|
|
|
service_name = "realestate-crawler-api"
|
|
|
|
|
ingress_path = ["/api"]
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
2026-03-07 16:41:36 +00:00
|
|
|
extra_annotations = {
|
|
|
|
|
"gethomepage.dev/enabled" = "false"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Celery worker for background task processing
|
|
|
|
|
resource "kubernetes_deployment" "realestate-crawler-celery" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-celery"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-celery"
|
|
|
|
|
tier = local.tiers.aux
|
|
|
|
|
}
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
annotations = {
|
|
|
|
|
"reloader.stakater.com/auto" = "true"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
replicas = 1
|
|
|
|
|
strategy {
|
|
|
|
|
type = "RollingUpdate"
|
|
|
|
|
rolling_update {
|
|
|
|
|
max_unavailable = 0
|
|
|
|
|
max_surge = 1
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
selector {
|
|
|
|
|
match_labels = {
|
|
|
|
|
app = "realestate-crawler-celery"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
template {
|
|
|
|
|
metadata {
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-celery"
|
|
|
|
|
}
|
2026-03-15 19:17:44 +00:00
|
|
|
annotations = {
|
2026-04-19 12:44:46 +00:00
|
|
|
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
|
2026-03-15 19:17:44 +00:00
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
2026-05-18 19:11:43 +00:00
|
|
|
image_pull_secrets {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
container {
|
|
|
|
|
name = "celery-worker"
|
|
|
|
|
image = "viktorbarzin/realestatecrawler:latest"
|
|
|
|
|
image_pull_policy = "Always"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
command = ["python", "-m", "celery", "-A", "celery_app", "worker", "--loglevel=info", "--pool=threads"]
|
anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:54:54 +00:00
|
|
|
# 512Mi OOMed during full London RENT 1-2 bed scrape (~76k existing IDs
|
|
|
|
|
# + 10k fetched into memory at concurrency=8 threads). Bumped to 1Gi.
|
resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
(outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
explicit resources)
Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification
Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16
Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template
Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
|
|
|
resources {
|
|
|
|
|
requests = {
|
Right-size CPU requests cluster-wide and remove missed CPU limits
Increase requests for under-requested pods (dashy 50m→250m, frigate 500m→1500m,
clickhouse 100m→500m, otp 100m→300m, linkwarden 25m→50m, authentik worker 50m→100m).
Reduce requests for over-requested pods (crowdsec agent/lapi 500m→25m each,
prometheus 200m→100m, dbaas mysql 1800m→100m, pg-cluster 250m→50m,
shlink-web 250m→10m, gpu-pod-exporter 50m→10m, stirling-pdf 100m→25m,
technitium 100m→25m, celery 50m→15m). Reduce crowdsec quota from 8→1 CPU.
Remove missed CPU limits in prometheus (cpu: "2") and dbaas (cpu: "3600m") tpl files.
2026-03-14 09:22:24 +00:00
|
|
|
cpu = "15m"
|
anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:54:54 +00:00
|
|
|
memory = "1Gi"
|
resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
(outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
explicit resources)
Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification
Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16
Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template
Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
|
|
|
}
|
|
|
|
|
limits = {
|
anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:54:54 +00:00
|
|
|
memory = "1Gi"
|
resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
(outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
explicit resources)
Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification
Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16
Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template
Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
port {
|
|
|
|
|
name = "metrics"
|
|
|
|
|
container_port = 9090
|
|
|
|
|
protocol = "TCP"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "ENV"
|
|
|
|
|
value = "prod"
|
|
|
|
|
}
|
|
|
|
|
env {
|
2026-03-17 07:39:29 +00:00
|
|
|
name = "DB_CONNECTION_STRING"
|
|
|
|
|
value_from {
|
|
|
|
|
secret_key_ref {
|
|
|
|
|
name = "realestate-crawler-db-creds"
|
|
|
|
|
key = "DB_CONNECTION_STRING"
|
|
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_BROKER_URL"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/0"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_RESULT_BACKEND"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/1"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "SLACK_WEBHOOK_URL"
|
2026-03-15 21:01:24 +00:00
|
|
|
value = try(local.notification_settings["slack"]["webhook_url"], "")
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OSRM_FOOT_URL"
|
|
|
|
|
value = "http://osrm-foot.osm-routing.svc.cluster.local:5000"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OSRM_BICYCLE_URL"
|
|
|
|
|
value = "http://osrm-bicycle.osm-routing.svc.cluster.local:5000"
|
|
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "OTP_URL"
|
|
|
|
|
value = "http://otp.osm-routing.svc.cluster.local:8080"
|
|
|
|
|
}
|
|
|
|
|
volume_mount {
|
|
|
|
|
name = "data"
|
|
|
|
|
mount_path = "/app/data"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
volume {
|
|
|
|
|
name = "data"
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
persistent_volume_claim {
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
claim_name = module.nfs_data_host.claim_name
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
lifecycle {
|
2026-05-16 12:41:05 +00:00
|
|
|
ignore_changes = [
|
|
|
|
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
|
|
|
|
metadata[0].annotations["keel.sh/policy"],
|
|
|
|
|
metadata[0].annotations["keel.sh/trigger"],
|
|
|
|
|
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
2026-05-28 23:09:30 +00:00
|
|
|
metadata[0].annotations["keel.sh/match-tag"],
|
|
|
|
|
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
|
|
|
|
metadata[0].annotations["kubernetes.io/change-cause"],
|
|
|
|
|
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
|
|
|
|
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
2026-05-16 12:41:05 +00:00
|
|
|
]
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "realestate-crawler-celery-metrics" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-celery-metrics"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
"app" = "realestate-crawler-celery"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
spec {
|
|
|
|
|
selector = {
|
|
|
|
|
app = "realestate-crawler-celery"
|
|
|
|
|
}
|
|
|
|
|
port {
|
|
|
|
|
port = 9090
|
|
|
|
|
target_port = 9090
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
# Celery beat for scheduled task management
|
|
|
|
|
resource "kubernetes_deployment" "realestate-crawler-celery-beat" {
|
|
|
|
|
metadata {
|
|
|
|
|
name = "realestate-crawler-celery-beat"
|
|
|
|
|
namespace = kubernetes_namespace.realestate-crawler.metadata[0].name
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-celery-beat"
|
|
|
|
|
tier = local.tiers.aux
|
|
|
|
|
}
|
migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
|
|
|
annotations = {
|
|
|
|
|
"reloader.stakater.com/auto" = "true"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
|
|
|
|
replicas = 1
|
|
|
|
|
strategy {
|
|
|
|
|
type = "Recreate" # Only one beat instance should run at a time
|
|
|
|
|
}
|
|
|
|
|
selector {
|
|
|
|
|
match_labels = {
|
|
|
|
|
app = "realestate-crawler-celery-beat"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
template {
|
|
|
|
|
metadata {
|
|
|
|
|
labels = {
|
|
|
|
|
app = "realestate-crawler-celery-beat"
|
|
|
|
|
}
|
2026-03-15 19:17:44 +00:00
|
|
|
annotations = {
|
2026-04-19 12:44:46 +00:00
|
|
|
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
|
2026-03-15 19:17:44 +00:00
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
spec {
|
2026-05-18 19:11:43 +00:00
|
|
|
image_pull_secrets {
|
|
|
|
|
name = "dockerhub-pull-secret"
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
container {
|
|
|
|
|
name = "celery-beat"
|
|
|
|
|
image = "viktorbarzin/realestatecrawler:latest"
|
|
|
|
|
command = ["python", "-m", "celery", "-A", "celery_app", "beat", "--loglevel=info"]
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
resources {
|
|
|
|
|
requests = {
|
|
|
|
|
cpu = "10m"
|
right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
|
|
|
memory = "256Mi"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
}
|
|
|
|
|
limits = {
|
|
|
|
|
memory = "256Mi"
|
|
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
env {
|
|
|
|
|
name = "ENV"
|
|
|
|
|
value = "prod"
|
|
|
|
|
}
|
|
|
|
|
env {
|
2026-03-17 07:39:29 +00:00
|
|
|
name = "DB_CONNECTION_STRING"
|
|
|
|
|
value_from {
|
|
|
|
|
secret_key_ref {
|
|
|
|
|
name = "realestate-crawler-db-creds"
|
|
|
|
|
key = "DB_CONNECTION_STRING"
|
|
|
|
|
}
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_BROKER_URL"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/0"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "CELERY_RESULT_BACKEND"
|
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
|
|
|
value = "redis://${var.redis_host}:6379/1"
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
env {
|
|
|
|
|
name = "SCRAPE_SCHEDULES"
|
real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:
- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
£1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
£400k-1.2M
Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.
Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.
Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 19:26:57 +00:00
|
|
|
value = local.scrape_schedules
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
volume_mount {
|
|
|
|
|
name = "data"
|
|
|
|
|
mount_path = "/app/data"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
volume {
|
|
|
|
|
name = "data"
|
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
|
|
|
persistent_volume_claim {
|
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
|
|
|
claim_name = module.nfs_data_host.claim_name
|
2026-02-22 15:13:55 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
lifecycle {
|
2026-05-16 12:41:05 +00:00
|
|
|
ignore_changes = [
|
|
|
|
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
|
|
|
|
metadata[0].annotations["keel.sh/policy"],
|
|
|
|
|
metadata[0].annotations["keel.sh/trigger"],
|
|
|
|
|
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
2026-05-28 23:09:30 +00:00
|
|
|
metadata[0].annotations["keel.sh/match-tag"],
|
|
|
|
|
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
|
|
|
|
metadata[0].annotations["kubernetes.io/change-cause"],
|
|
|
|
|
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
|
|
|
|
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
2026-05-16 12:41:05 +00:00
|
|
|
]
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
}
|
2026-02-22 13:56:34 +00:00
|
|
|
}
|
2026-05-16 13:42:57 +00:00
|
|
|
|
|
|
|
|
# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
|
2026-05-16 13:46:35 +00:00
|
|
|
# CI retrigger v2 2026-05-16T13:46:35+00:00
|
2026-05-16 14:06:39 +00:00
|
|
|
|
|
|
|
|
# CI retrigger v3 2026-05-16T14:06:39Z
|
2026-05-16 14:13:59 +00:00
|
|
|
|
|
|
|
|
# CI retrigger v4 2026-05-16T14:13:59Z
|
2026-05-16 23:10:38 +00:00
|
|
|
|
|
|
|
|
# CI retrigger v5 2026-05-16T23:10:38Z
|
2026-05-16 23:18:59 +00:00
|
|
|
|
|
|
|
|
# CI retrigger v6 2026-05-16T23:18:58Z
|