infra

Author	SHA1	Message	Date
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	7a649ce7eb	crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on 2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply (`b59acbc1` agent memory bump) re-rendered the deployment from chart defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI watcher-auth bug against the current api.crowdsec.net behaviour, so every fresh replica started 403'ing on startup. Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives future TF applies independently of the chart's appVersion. Verified CAPI auth succeeds on all 3 fresh pods with v1.7.8. Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…` is single-shot and was already consumed by the first replica; subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT console enrollment (separate flows). Re-enable console reporting by generating a fresh enroll key at app.crowdsec.net (procedure documented in the values.yaml comment block). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 11:11:29 +00:00
Viktor Barzin	41786b0fca	crowdsec: DISABLE_ONLINE_API=true — break the recurring 403 crashloop CAPI auth at api.crowdsec.net is rejecting watcher logins from inside the cluster within ~1h of registration, even after rotating creds via `cscli capi register`. The same login successfully authenticates from devvm but fails from cluster pods → IP-throttle or account-state issue at the central API. Until that's resolved with CrowdSec support (or the throttle window resets), running with CAPI on is just chronic crashloops on every fresh replica. `DISABLE_ONLINE_API=true` makes the chart entrypoint `conf_set 'del(.api.server.online_client)'`, removing the online_client block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off: no community blocklists. Local scenarios + bouncers continue unchanged. Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml` is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job` is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So the hook runs without the Role it needs, and atomic apply rolls back. Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret manually (the hook short-circuits when the secret already exists) and re-applied the missing Role for future re-enablement. Re-enable path documented in the comment block. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:31:03 +00:00
Viktor Barzin	b59acbc1db	crowdsec/agent: bump memory request 64Mi → 128Mi krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-requested by ~588 MiB across the cluster. Live usage around the 80-128 MiB mark for active log parsing — 64 MiB request risked eviction ahead of more-needed pods. Limit stays at 512 MiB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:16 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	d401568317	fix CrowdSec collection names and increase Helm timeout - Fix: crowdsecurity/pf → crowdsecurity/pfsense + firewallservices/pf - Move syslog acquisition to custom ConfigMap (Helm schema validation) - Increase Helm timeout to 1200s for DaemonSet rollout	2026-03-23 03:41:13 +02:00
Viktor Barzin	55246c8b5d	add network traffic monitoring and adversary detection - CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514), add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections - Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection, pushes metrics to Pushgateway) - Grafana: add "Network Traffic & Adversary Detection" dashboard (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics) pfSense changes applied live: syslog forwarding to 10.0.20.202:30514, Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled	2026-03-23 03:06:56 +02:00
Viktor Barzin	3c804aedf8	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-17 18:11:53 +00:00

9 commits