325 commits

Author SHA1 Message Date
Viktor Barzin
55246c8b5d add network traffic monitoring and adversary detection
- CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514),
  add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections
- Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection,
  pushes metrics to Pushgateway)
- Grafana: add "Network Traffic & Adversary Detection" dashboard
  (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics)

pfSense changes applied live: syslog forwarding to 10.0.20.202:30514,
Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled
2026-03-23 03:06:56 +02:00
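The commit above does not say how the DNS anomaly CronJob detects DGA-like names. One common heuristic is the Shannon entropy of the leftmost DNS label, and the sketch below assumes that approach. The function names and the 3.5-bit threshold are hypothetical and are not taken from the repository.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Return the Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_generated(domain: str, threshold: float = 3.5) -> bool:
    """Flag a domain whose leftmost label has high character entropy.

    The threshold is an illustrative assumption, not a tuned value.
    """
    first_label = domain.split(".")[0].lower()
    return shannon_entropy(first_label) > threshold
```

Entropy alone misfires on short or dictionary-based names, so a real detector would likely combine it with other signals such as label length or NXDOMAIN rate.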
Viktor Barzin
877cd15b45 fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
2026-03-23 03:04:33 +02:00
Viktor Barzin
e9311915cb add agent route to k8s-portal
2026-03-23 02:25:08 +02:00
Viktor Barzin
6bfade3013 update infra stack terraform lock file (helm/kubernetes/vault providers)
2026-03-23 02:24:47 +02:00
Viktor Barzin
2dcdc65db5 add weekly SQLite backup for plotting-book to NFS
2026-03-23 02:24:43 +02:00
Viktor Barzin
e4cf0dee83 add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach
2026-03-23 02:24:39 +02:00
Viktor Barzin
e463281205 optimize backup schedules: compress dumps, stagger to weekly, extend retention
- dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed
- infra-maintenance: etcd backup daily→weekly Sunday 1am
- redis: backup hourly→weekly Sunday 3am, retention 7→28 days
- vault: raft backup daily→weekly Sunday 2am
2026-03-23 02:24:34 +02:00
Viktor Barzin
644562454c add IPv6 connectivity via Hurricane Electric 6in4 tunnel
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
  socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules
2026-03-23 02:22:00 +02:00
Viktor Barzin
1f4e8cb278 use registry.viktorbarzin.me hostname for private images + protect ingress
- Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me
- Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template
  (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip)
- Enable Authentik protection on priority-pass ingress
2026-03-23 01:02:27 +02:00
Viktor Barzin
e9919d8fc9 fix priority-pass: bump backend memory to 512Mi (OOM with OpenCV)
2026-03-23 00:58:39 +02:00
Viktor Barzin
0674d6e538 deploy priority-pass app to cluster via private registry
- SvelteKit frontend + FastAPI backend in single pod with sidecar pattern
- Images pushed to 10.0.20.10:5050 private registry (v4/v1)
- SvelteKit server route proxies /api/transform to backend on 127.0.0.1:8000
- Exposed at priority-pass.viktorbarzin.me (Cloudflare-proxied, no auth)
- Uses imagePullSecrets for authenticated registry pulls
2026-03-23 00:55:41 +02:00
Viktor Barzin
311ff5dd9e add hourly SQLite integrity check for vaultwarden with Prometheus alerting
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning)
- Prevents a corrupted DB from going unnoticed for days
2026-03-23 00:50:15 +02:00
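The check described in the commit above can be sketched as follows. This is a minimal illustration, not the CronJob's actual script: only the PRAGMA and the metric name come from the commit message, the function names are hypothetical, and the push to the Pushgateway is left out.

```python
import sqlite3


def integrity_ok(db_path: str) -> bool:
    """Run PRAGMA integrity_check and report whether SQLite answers 'ok'."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # A file too damaged to open or read counts as a failed check.
        return False
    return rows == [("ok",)]


def render_metric(ok: bool) -> str:
    """Format the gauge in the Prometheus text exposition format."""
    return f"vaultwarden_sqlite_integrity_ok {1 if ok else 0}\n"
```

A gauge of 1 or 0 lets one alert fire on the value itself and a second alert fire when the metric has not been refreshed recently, which matches the two alerts listed in the commit.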
Viktor Barzin
3b89a7d7e4 add VaultwardenDown alert and tighten backup staleness threshold
- Add dedicated VaultwardenDown Prometheus alert (critical, 5m)
- Reduce backup staleness threshold from 8d to 24h to match 6h schedule
- Fixes monitoring gap where VW downtime went undetected
2026-03-23 00:47:00 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
ab7e18c07c fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify
- Grant kyverno-admission-controller and kyverno-background-controller
  permissions to manage Secrets (required for generate clone rules)
- Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true
  (wildcard cert doesn't cover IP SANs) — applied to all nodes + template
2026-03-22 23:47:29 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
2026-03-22 22:10:10 +02:00
Viktor Barzin
e4f478b490 switch claude-memory server to multi-user API_KEYS auth
Enables isolated memory namespaces per user (wizard/emo) by switching
from single API_KEY to API_KEYS JSON map env var.
2026-03-22 20:08:07 +02:00
Viktor Barzin
c103a1ee05 fix OOMKilled containers: bump immich/actualbudget memory, disable changedetection, cap clickhouse
- immich-server: 512Mi/1Gi → 1700Mi/1700Mi (VPA upperBound 1.39Gi, 34 OOM restarts)
- actualbudget http-api: 384Mi → 768Mi (VPA upperBound 615Mi, 3 OOM restarts)
- changedetection: replicas 1 → 0 (chronic OOM at 64Mi, not worth memory cost)
- rybbit clickhouse: add ConfigMap capping max_server_memory_usage to 800Mi (within 1Gi limit)
2026-03-22 15:22:29 +02:00
Viktor Barzin
ad689076d8 scale down non-critical services to free cluster memory
- authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1
- tuya-bridge: 3→1
- realestate-crawler-api: 2→1
- claude-memory: 2→1
- grafana: 2→1 (config only, apply pending)
- alertmanager: 2→1 (config only, apply pending)

Estimated savings: ~1.2 Gi total
2026-03-22 03:10:12 +02:00
Viktor Barzin
bd98b84ded scale grafana and alertmanager to 1 replica to free cluster memory
Grafana: 2 → 1 (saves ~312 Mi)
Alertmanager: 2 → 1 (saves ~150 Mi)
Matrix already scaled to 0 (saves ~212 Mi)
2026-03-22 03:02:17 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
2e016d7df2 fix nextcloud db-username + k8s-dashboard chart repo
- nextcloud: add db-username to ESO secret template and usernameKey
  to chart values (required by newer chart version)
- k8s-dashboard: update chart repo URL to kubernetes-retired.github.io
  (old kubernetes.github.io/dashboard returns 404)
2026-03-22 02:50:48 +02:00
Viktor Barzin
3d22599f7f fix infra stack: use overwrite strategy for provider generation
The child generate block needs if_exists="overwrite" to properly
override the root terragrunt's k8s_providers block.
2026-03-22 01:28:25 +02:00
Viktor Barzin
728fbcd3bd fix infra stack: add vault provider to terragrunt generate block
The infra stack's provider override only included proxmox but main.tf
uses data "vault_kv_secret_v2" which requires the vault provider.
2026-03-22 01:17:00 +02:00
Viktor Barzin
1d1549b8af state(descheduler): update encrypted state
2026-03-21 15:13:02 +00:00
Viktor Barzin
21cfa8c072 bump memory limits for OOM-prone services
FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi)
Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi)
n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi)
Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi)
Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)
2026-03-21 11:12:12 +00:00
Viktor Barzin
b3c9c45a17 multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip]
- Template: bump default memory from 128Mi to 256Mi (matches deploy-app skill guidance)
- ResourceQuota: add requests.storage (20Gi) and persistentvolumeclaims (5) defaults
- CONTRIBUTING.md: agent-friendly contributor guide for namespace-owners
2026-03-19 23:49:15 +00:00
Viktor Barzin
6b8ce04d44 fix(openclaw): change agent workspace from /workspace/infra to /workspace
Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.
2026-03-19 23:32:28 +00:00
Viktor Barzin
e823b795f7 fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory
- Add docker.io/library/ prefix to mysql and postgres backup images
  to satisfy Kyverno require-trusted-registries policy (both CronJobs
  were blocked for 46h, triggering MySQLBackupStale alert)
- Document mysql-operator chart ignoring resources values key — the
  LimitRange default (256Mi) was silently applied, putting the operator
  at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
  (also fixed NFS export: vault-backup was a separate ZFS dataset not
  in the TrueNAS NFS share — destroyed dataset, created directory)
2026-03-19 23:26:05 +00:00
Viktor Barzin
250a058627 feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
d95144bd05 fix(immich): bump postgres memory 512Mi → 1Gi for v2.6.1 geodata migration
v2.6.1 bulk-inserts into geodata_places on first boot, OOM-killing
postgres at 512Mi. Raise to 1Gi to accommodate the migration.
2026-03-19 22:50:36 +00:00
Viktor Barzin
da630b8869 upgrade immich v2.5.6 → v2.6.1
2026-03-19 22:45:04 +00:00
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
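The Phase 2 rotation in the commit above (timestamped files plus a retention purge) can be illustrated with a short sketch. The file naming pattern, the function name, and the parameters are assumptions made for the example. Only the 7-day Redis retention comes from the commit message.

```python
from datetime import datetime, timedelta


def expired_backups(filenames, now, retention_days=7,
                    prefix="dump-", suffix=".rdb", stamp="%Y%m%d-%H%M%S"):
    """Return the timestamped backup files older than the retention window.

    Files that do not match the naming pattern are left alone.
    """
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        if not (name.startswith(prefix) and name.endswith(suffix)):
            continue
        try:
            taken = datetime.strptime(name[len(prefix):-len(suffix)], stamp)
        except ValueError:
            continue  # unparseable timestamp: never delete on a guess
        if taken < cutoff:
            expired.append(name)
    return expired
```

Deriving the age from the filename rather than from file mtime keeps the purge stable if the files are copied or restored, at the cost of depending on the naming scheme.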
Viktor Barzin
e54bc016ba reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)
2026-03-19 20:25:36 +00:00
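The mysql-operator figure in the commit above follows from 481Mi × 1.2 = 577.2Mi, which the commit sets as 580Mi. The sketch below reproduces that arithmetic. Rounding up to a 10Mi step is an assumption inferred from the numbers, and the function name is hypothetical.

```python
import math


def limit_from_vpa(upper_bound_mi: int, headroom: float = 1.2, step_mi: int = 10) -> int:
    """Scale a VPA upperBound by a headroom factor and round up to a step."""
    return math.ceil(upper_bound_mi * headroom / step_mi) * step_mi


print(limit_from_vpa(481))  # 580
```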
Viktor Barzin
21bb3036af state(dbaas): update encrypted state
2026-03-19 20:23:59 +00:00
Viktor Barzin
01eb9dd121 fix(monitoring): patch idrac-redfish-exporter to restore PSU voltage metric
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
  mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)

Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 13:37:14 +00:00
Viktor Barzin
b05421dbb5 add comment explaining prometheus 4Gi minimum memory requirement [ci skip]
2026-03-18 21:45:26 +00:00
Viktor Barzin
9d87ce605f revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on
startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.
2026-03-18 21:44:14 +00:00
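The reasoning in the commit above comes down to one subtraction: a memory-backed tmpfs is charged to the container's cgroup, so whatever it holds is unavailable to the process heap. The sketch below works through the worst case in which the 2Gi WAL tmpfs is full. The function name is hypothetical.

```python
GI = 1024  # MiB per GiB


def heap_budget_mi(limit_mi: int, tmpfs_mi: int) -> int:
    """Memory left for the process once a memory-backed tmpfs is full."""
    return limit_mi - tmpfs_mi


print(heap_budget_mi(3 * GI, 2 * GI))  # 1024 MiB of heap at a 3Gi limit
print(heap_budget_mi(4 * GI, 2 * GI))  # 2048 MiB of heap at a 4Gi limit
```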
Viktor Barzin
410c893647 fix(provision): security hardening from code review
- Add input validation: username regex + email format check in pipeline
- Quote variables in .provision-env to prevent shell injection
- Remove dead source command (each Woodpecker command is separate shell)
- Use jq to build JSON payloads (prevents injection via group names)
- Clean up git-crypt key on failure (use ; instead of &&)
- Add Kyverno ndots lifecycle ignore to webhook-handler deployment
2026-03-18 21:25:03 +00:00
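Two of the hardening steps in the commit above, input validation and quoting values written to .provision-env, can be sketched as below. The pipeline itself runs shell commands in Woodpecker, so this Python version only illustrates the idea. Both regexes and all function names are hypothetical, since the commit does not show the actual patterns.

```python
import re
import shlex

# Hypothetical allow-list patterns; the commit does not show the real ones.
USERNAME_RE = re.compile(r"[a-z][a-z0-9-]{0,30}[a-z0-9]")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate(username: str, email: str) -> None:
    """Raise ValueError unless both fields match their allow-list pattern."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"invalid username: {username!r}")
    if not EMAIL_RE.fullmatch(email):
        raise ValueError(f"invalid email: {email!r}")


def env_line(name: str, value: str) -> str:
    """Render NAME=value with the value quoted for a POSIX shell."""
    return f"{name}={shlex.quote(value)}"
```

Validating against an allow-list rejects hostile input early, while quoting protects any value that still reaches a file later sourced by a shell. The two measures cover different failure modes, so applying both is reasonable.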
Viktor Barzin
fd130971aa feat(provision): automated user provisioning via Authentik webhook
- Expand CI Vault policy: write secret/data/platform + Transit SOPS keys
- Add Woodpecker provision-user.yml pipeline (manual event, API-triggered)
- Add env vars to webhook-handler deployment for Woodpecker/Authentik integration
- Update add-user skill with automated flow documentation
- Update Woodpecker repo ID list in CLAUDE.md
2026-03-17 23:56:30 +00:00
Viktor Barzin
0fff155f17 feat(k8s-portal): update onboarding + architecture with SOPS state docs
Onboarding (namespace-owner):
- Add steps for sops/terragrunt install, state decrypt, apply workflow
- Add flow diagram showing auth → decrypt → apply → encrypt → push
- Add architecture overview with security model table
- Add access control callout explaining per-stack Transit keys

Architecture:
- Add secrets & state encryption section with ASCII diagrams
- Add request flow diagram (Cloudflare → Traefik → pods)
- Add CI/CD pipeline diagram (GHA → Woodpecker → K8s)

[ci skip]
2026-03-17 23:17:47 +00:00
Viktor Barzin
ccbcebb670 feat(vault): automate SOPS onboarding for namespace-owners
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access

[ci skip]
2026-03-17 23:15:25 +00:00
Viktor Barzin
12a51c4ffa right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.
2026-03-17 22:35:54 +00:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00
Viktor Barzin
ae36dc253b extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-17 21:34:11 +00:00
Viktor Barzin
3c804aedf8 extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
2026-03-17 18:11:53 +00:00
Viktor Barzin
c8b42f78df fix DB password rotation desync in 5 stacks
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.

- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
2026-03-17 07:39:29 +00:00
Viktor Barzin
8d8c8db737 increase DB password rotation interval from 24h to weekly (604800s)
2026-03-16 23:17:01 +00:00
Viktor Barzin
c31ba2c50c k8s-portal: use Recreate strategy, limit revision history to 3
Prevents stale pods serving old content during rapid successive deploys.
With 1 replica + RollingUpdate, old and new pods briefly coexist.
2026-03-16 22:55:15 +00:00
Viktor Barzin
fb66676d7b post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-16 22:06:10 +00:00