Commit graph

21 commits

Author SHA1 Message Date
Viktor Barzin
3974827ea9 [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
c229f6ccab [ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
11eb256511 [ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
79775fa2cc [ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
5e177a8889 [ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
c5c3b092e5 [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
87fc11121d fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
Viktor Barzin
e8997ec430 [ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module 2026-02-28 18:30:20 +00:00
Viktor Barzin
bf3404bf6b [ci skip] add goflow2 netflow collector to monitoring module 2026-02-28 18:29:07 +00:00
Viktor Barzin
9d52acd286 [ci skip] add caretta eBPF pod topology to monitoring module 2026-02-28 18:28:09 +00:00
Viktor Barzin
c35bef2fd8 [ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
2026-02-24 22:55:58 +00:00
Viktor Barzin
4fab38da1f [ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.
2026-02-24 22:31:41 +00:00
Viktor Barzin
0a1d53b6dd [ci skip] platform: add ndots=2 dns_config to all deployment pod specs
Prevents Terraform from reverting the Kyverno inject-ndots mutation
on every apply. 23 pod specs across 19 platform module files.
2026-02-23 22:43:05 +00:00
Viktor Barzin
a0df23f565 [ci skip] monitoring: increase resource quota limits
Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide
headroom. Previous values were at 87% and 92% utilization.
2026-02-23 22:42:30 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
1b4737c90c Reorder realestate-crawler Grafana dashboard sections
Move API Performance and Per-Endpoint Latency to the top.
Move Scraping Overview, Scraping Activity, and Throttling & Errors
to the bottom. Keeps the most operationally relevant panels visible
first.
2026-02-23 22:03:27 +00:00
Viktor Barzin
5fdd9d7f04 Sync realestate-crawler Grafana dashboard with per-endpoint latency panels 2026-02-23 21:31:01 +00:00
Viktor Barzin
b0aaa7b813 [ci skip] monitoring: enable mailserver-down Prometheus alert
Uncomment the mailserver availability alert so we get paged if
the mail server pod has no available replicas for 5 minutes.
2026-02-23 20:29:33 +00:00
Viktor Barzin
65ca327ed0 Sync realestate-crawler dashboard with navigation & usage metrics panels 2026-02-23 20:28:55 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00