infra

Author	SHA1	Message	Date
Viktor Barzin	882df4cc5c	[ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart Admission controller was restarting every ~5min due to API server timeouts causing leader election loss. failurePolicy:Fail meant the webhook blocked all pod creation cluster-wide when Kyverno was unavailable.	2026-02-24 23:00:45 +00:00
Viktor Barzin	c06cca288a	[ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert - Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments - Add node_selector gpu=true to ollama deployment - Pass nfs_server variable through to actualbudget factory modules - Fix AuthentikDown alert to match actual deployment name (goauthentik-server)	2026-02-24 22:55:58 +00:00
Viktor Barzin	18a873a630	[ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate per endpoint. Fix bar gauge position to sit next to timeseries. Add sort transformation to "Top Offenders (Avg Duration)" panel.	2026-02-24 22:31:41 +00:00
Viktor Barzin	650d785dcd	[ci skip] f1-stream: use v5.0.0 tag to bypass stale pull-through cache	2026-02-24 00:28:12 +00:00
Viktor Barzin	6d88a5df1e	f1-stream: fix frontend routing with catch-all handler for SvelteKit SPA	2026-02-24 00:18:28 +00:00
Viktor Barzin	0dc3fec3f9	f1-stream: fix SvelteKit routing - add trailingSlash for static adapter	2026-02-24 00:11:44 +00:00
Viktor Barzin	fd6a8bb6f2	f1-stream: add pydantic dependency and trigger CI build	2026-02-24 00:00:02 +00:00
Viktor Barzin	a050db616e	[ci skip] f1-stream: add CDN token refresh, SvelteKit frontend, multi-stream layout (Phases 6-8) - Phase 6: CDN token lifecycle with 3-strategy URL matching and periodic refresh - Phase 7: SvelteKit 2/Svelte 5 frontend with schedule calendar and hls.js player - Phase 8: Multi-stream layout supporting up to 4 simultaneous HLS streams - Update Dockerfile to multi-stage build (Node.js frontend + Python backend) - Switch deployment to :latest tag with Always pull policy for CI-driven deploys - Update Woodpecker CI to use explicit latest tag	2026-02-23 23:59:35 +00:00
Viktor Barzin	9bf0523ea9	[ci skip] f1-stream: add stream health checker and HLS proxy (Phases 4-5) Phase 4 - Stream Health and Fallback: - StreamHealthChecker with partial GET validation of m3u8 content - Bitrate extraction from BANDWIDTH tags - Response time measurement for quality ranking - Fallback ordering: live first, fastest response time first - GET /streams now only returns health-verified streams Phase 5 - HLS Proxy Core: - GET /proxy?url= - m3u8 playlist fetch with full URI rewriting - GET /relay?url= - chunked segment relay (never buffers full segment) - m3u8 rewriter handles master, variant, and segment URIs - Base64url encoding for URL parameters - CORS middleware for browser playback - Range header forwarding for seeking support	2026-02-23 23:41:16 +00:00
Viktor Barzin	b29f5ddb06	[ci skip] f1-stream: add extractor framework with demo streams (Phase 3) - BaseExtractor ABC with health_check method - ExtractorRegistry with concurrent fan-out extraction - ExtractionService with in-memory cache and background polling - DemoExtractor with 3 public HLS test streams - Adaptive polling: 5min during live sessions, 30min otherwise - GET /streams, GET /extractors, POST /extract endpoints	2026-02-23 23:02:56 +00:00
Viktor Barzin	4fd3e2d770	[ci skip] f1-stream: add gitignore for __pycache__, remove committed .pyc	2026-02-23 22:55:38 +00:00
Viktor Barzin	d7d347de27	[ci skip] f1-stream: add F1 schedule subsystem (Phase 2) - Fetch 2026 F1 race calendar from jolpica API with all sessions (FP1-3, Qualifying, Sprint, Race) and UTC timestamps - Persist schedule to NFS as JSON, load on startup if fresh - APScheduler daily refresh at 03:00 UTC - GET /schedule endpoint with live/upcoming/past session status - POST /schedule/refresh for manual refresh trigger	2026-02-23 22:55:13 +00:00
Viktor Barzin	f423a4d60c	[ci skip] f1-stream: replace Go service with Python/FastAPI skeleton Replaces the existing Go-based f1-stream service with a new Python/FastAPI backend as the foundation for the rebuilt F1 streaming aggregation service. - New FastAPI backend with health and root endpoints - Python 3.13 slim Dockerfile (replaces Go multi-stage build) - Updated Terraform deployment (port 8000, reduced resources) - Buildx-based redeploy.sh with --platform linux/amd64 - Added Woodpecker CI pipeline for automated builds - Removed all old Go source, node_modules, static assets	2026-02-23 22:47:06 +00:00
Viktor Barzin	c665544f41	[ci skip] fix plotting-book: add SESSION_SECRET env var Session secret stored in encrypted terraform.tfvars, referenced via variable to avoid committing secrets in plain text.	2026-02-23 22:44:04 +00:00
Viktor Barzin	85f88bf167	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs Prevents Terraform from reverting the Kyverno inject-ndots mutation on every apply. 23 pod specs across 19 platform module files.	2026-02-23 22:43:05 +00:00
Viktor Barzin	a2a83d30aa	[ci skip] monitoring: increase resource quota limits Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide headroom. Previous values were at 87% and 92% utilization.	2026-02-23 22:42:30 +00:00
Viktor Barzin	e982a8ad81	[ci skip] fix redis OOMKilled: increase memory limits to 2Gi Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was insufficient for 650MB RDB dataset plus redis-stack modules.	2026-02-23 22:37:56 +00:00
Viktor Barzin	2789c0fa5c	[ci skip] add trading-bot Terraform stack	2026-02-23 22:29:59 +00:00
Viktor Barzin	2d919c4d34	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	48083bb1fd	Reorder realestate-crawler Grafana dashboard sections Move API Performance and Per-Endpoint Latency to the top. Move Scraping Overview, Scraping Activity, and Throttling & Errors to the bottom. Keeps the most operationally relevant panels visible first.	2026-02-23 22:03:27 +00:00
Viktor Barzin	449937e22e	Sync realestate-crawler Grafana dashboard with per-endpoint latency panels	2026-02-23 21:31:01 +00:00
Viktor Barzin	8985cd60cc	[ci skip] mailserver: fix Rspamd DKIM signing key path Mount DKIM private key at Rspamd-expected path (/tmp/docker-mailserver/rspamd/dkim/viktorbarzin.me/mail.private) and add dkim_signing.conf override for domain/selector config. Rspamd does not auto-detect keys from the OpenDKIM path.	2026-02-23 21:01:29 +00:00
Viktor Barzin	04db99fde2	docs: map existing codebase	2026-02-23 20:54:27 +00:00
Viktor Barzin	e95ef07b04	[ci skip] mailserver: tighten DMARC policy to quarantine Move DMARC enforcement from p=none (monitoring only) to p=quarantine so spoofed emails from viktorbarzin.me are quarantined by recipients.	2026-02-23 20:30:30 +00:00
Viktor Barzin	ce03bc25a9	[ci skip] mailserver: add Postfix rate limiting Add connection and message rate limits to protect against brute-force attacks on SMTP/IMAP ports. 10 connections and 30 messages per minute per client IP.	2026-02-23 20:29:45 +00:00
Viktor Barzin	74948a8af3	[ci skip] roundcubemail: pin to 1.6-apache, disable debug logging Pin Roundcubemail to stable 1.6-apache tag instead of :latest to prevent unexpected breakage. Disable SMTP debug and reduce debug level from 6 to 1 for production use.	2026-02-23 20:29:39 +00:00
Viktor Barzin	b7ccae69bc	[ci skip] monitoring: enable mailserver-down Prometheus alert Uncomment the mailserver availability alert so we get paged if the mail server pod has no available replicas for 5 minutes.	2026-02-23 20:29:33 +00:00
Viktor Barzin	75f5cb2001	[ci skip] mailserver: enable Rspamd, disable OpenDKIM Enable Rspamd for spam filtering and DKIM signing, replacing OpenDKIM. Rspamd reads existing DKIM keys from the same mount path.	2026-02-23 20:29:32 +00:00
Viktor Barzin	6ca4a1a081	Sync realestate-crawler dashboard with navigation & usage metrics panels	2026-02-23 20:28:55 +00:00
Viktor Barzin	c6a79e89c7	[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4	2026-02-23 20:14:30 +00:00
Viktor Barzin	0eababf212	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me. Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma monitor, dashboard entry, and all code/doc references across 18 files.	2026-02-23 19:38:55 +00:00
Viktor Barzin	b45688646d	Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP] - Switch from custom clone override to woodpeckerci/plugin-git built-in clone (handles auth automatically via netrc from GitHub OAuth token) - Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense (fixes intermittent DNS timeouts causing clone failures) - Fix missing comma after heredoc in audit-policy.tf (syntax error)	2026-02-23 00:08:42 +00:00
Viktor Barzin	d870a63130	[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug Change cluster-healthcheck CronJob from every 30min to every 8h. Replace fragile sed-based audit config in apiserver manifest with idempotent Python script that deduplicates by name/mountPath, preventing the duplicate volume entries that crashed the API server.	2026-02-22 23:18:30 +00:00
Viktor Barzin	860077a126	[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces Add resource-governance/custom-quota=true label to both namespaces so Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.	2026-02-22 23:14:53 +00:00
Viktor Barzin	cf67e02135	[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs - Add gpu=true label to Terraform (nvidia null_resource alongside taint) - Improve API server OIDC config to detect value changes, not just flag presence - Add policy_hash trigger to audit-policy so rule changes auto-reapply - Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook - Document full node rebuild procedure in CLAUDE.md - Save Talos Linux migration evaluation for future reference	2026-02-22 22:59:38 +00:00
Viktor Barzin	88960ba3a4	[ci skip] Rebuild docker-registry with nginx serialization on all ports Replace individual `docker run` commands with Docker Compose stack managed by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040) with proxy_cache_lock to serialize concurrent blob pulls and prevent corrupt partial responses. Adds QEMU guest agent for remote management.	2026-02-22 21:45:53 +00:00
Viktor Barzin	f1a27ed2f9	[ci skip] Add Woodpecker CI stack (WIP) and claude agents - Add stacks/woodpecker/ with Helm-based deployment config - Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls) - Add NFS export entry for woodpecker - Add .claude/agents/ definitions	2026-02-22 21:30:25 +00:00
Viktor Barzin	bf90abe7c9	[ci skip] Fix poison fetcher: use HTTP/1.1 for upstream (HTTP/2 hangs) The Poison Fountain upstream (rnsaffn.com/poison2/) doesn't respond properly over HTTP/2. Force HTTP/1.1 for reliable content fetching. Also fixed NFS directory permissions for non-root curl container.	2026-02-22 20:42:53 +00:00
Viktor Barzin	b6169b881e	[ci skip] Add poison-fountain Terraform stack (deployment, service, ingress, CronJob)	2026-02-22 19:50:57 +00:00
Viktor Barzin	a92fbb8ca5	[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)	2026-02-22 19:49:32 +00:00
Viktor Barzin	b7e7003e7a	[ci skip] Add poison fountain Python service and fetcher script	2026-02-22 19:46:43 +00:00
Viktor Barzin	550a682548	Use --queue-ignore-errors for CI (infra stack needs Proxmox SSH)	2026-02-22 18:29:27 +00:00
Viktor Barzin	eace95a1a0	Skip infra stack in CI, remove DRONE_IMAGE_CLONE setting	2026-02-22 18:21:10 +00:00
Viktor Barzin	bdf46cef4d	Use manual clone with alpine instead of drone/git (pull-through cache issue)	2026-02-22 18:05:53 +00:00
Viktor Barzin	91fe79de19	[ci skip] Fix Drone clone image: use alpine/git via DRONE_IMAGE_CLONE The drone/git:latest image was failing to pull through the registry cache (corrupted blobs, unexpected EOF). Set DRONE_IMAGE_CLONE on the Kubernetes runner to use alpine/git:latest globally for all pipelines.	2026-02-22 17:35:04 +00:00
Viktor Barzin	a9f96e2e53	[ci skip] Increase authentik ResourceQuota limits Authentik is a critical auth service that was at 83% CPU/memory quota utilization. Double all limits to prevent throttling.	2026-02-22 17:28:41 +00:00
Viktor Barzin	534e63c9b8	[ci skip] Remove legacy files and orphaned modules Delete 20 orphaned module directories and 3 stray files from modules/kubernetes/ that are no longer referenced by any stack. Remove 7 root-level legacy files including the empty tfstate, 27MB terraform zip, commented-out main.tf, and migration notes. Clean up commented-out dockerhub_secret and oauth-proxy references in blog, travel_blog, and city-guesser stacks. Remove stale frigate config.yaml entry from .gitignore. Remove ephemeral docs/plans/ directory.	2026-02-22 15:23:27 +00:00
Viktor Barzin	b692eb0c34	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00
Viktor Barzin	e225e81ebf	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	ae2bd9a9d8	[ci skip] Fix variable type mismatches in owntracks, ollama, tandoor stacks - owntracks_credentials: string -> map(string) - ollama_api_credentials: string -> map(string) - tandoor_email_password: add default="" (not in tfvars)	2026-02-22 14:07:33 +00:00

56 commits
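
Commit d870a63130 describes replacing a sed-based edit of the apiserver manifest with an idempotent Python script that deduplicates by name/mountPath. The script itself is not part of this log, so the following is only a minimal sketch of that idea, assuming the manifest has already been parsed into a plain dict. The function names (`dedupe`, `ensure_audit_mounts`) and the example paths are hypothetical and are not taken from the repository.

```python
def dedupe(entries, key):
    """Return entries with duplicates (same entry[key]) removed, keeping the first."""
    seen = set()
    unique = []
    for entry in entries:
        marker = entry.get(key)
        if marker in seen:
            continue  # an entry with this name/mountPath is already present
        seen.add(marker)
        unique.append(entry)
    return unique


def ensure_audit_mounts(pod_spec, volume, mount):
    """Add the audit volume and its mount to a pod spec dict, idempotently.

    Assumes the first container is the API server (hypothetical layout).
    Running this any number of times leaves exactly one copy of each entry.
    """
    volumes = pod_spec.setdefault("volumes", [])
    volumes.append(volume)
    pod_spec["volumes"] = dedupe(volumes, "name")

    container = pod_spec["containers"][0]
    mounts = container.setdefault("volumeMounts", [])
    mounts.append(mount)
    container["volumeMounts"] = dedupe(mounts, "mountPath")
    return pod_spec


if __name__ == "__main__":
    # Hypothetical example values, not the cluster's real configuration.
    spec = {"containers": [{"name": "kube-apiserver"}]}
    vol = {"name": "audit-policy", "hostPath": {"path": "/etc/kubernetes/audit-policy.yaml"}}
    mnt = {"name": "audit-policy", "mountPath": "/etc/kubernetes/audit-policy.yaml"}
    ensure_audit_mounts(spec, vol, mnt)
    ensure_audit_mounts(spec, vol, mnt)  # second run must not add duplicates
    print(len(spec["volumes"]), len(spec["containers"][0]["volumeMounts"]))
```

Keying volumes on `name` and mounts on `mountPath` matches the deduplication keys named in the commit message. Appending and then deduplicating means a re-run is a no-op, which is the property a text-substitution approach does not guarantee.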