Commit graph

1734 commits

Author SHA1 Message Date
Viktor Barzin
882df4cc5c
[ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart
Admission controller was restarting every ~5min due to API server timeouts
causing leader election loss. failurePolicy:Fail meant the webhook blocked
all pod creation cluster-wide when Kyverno was unavailable.
2026-02-24 23:00:45 +00:00
Viktor Barzin
c06cca288a
[ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
2026-02-24 22:55:58 +00:00
Viktor Barzin
18a873a630
[ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.
2026-02-24 22:31:41 +00:00
Viktor Barzin
2327c73154
[ci skip] f1-stream: update project state - all 8 phases complete 2026-02-24 00:28:54 +00:00
Viktor Barzin
650d785dcd
[ci skip] f1-stream: use v5.0.0 tag to bypass stale pull-through cache 2026-02-24 00:28:12 +00:00
Viktor Barzin
6d88a5df1e
f1-stream: fix frontend routing with catch-all handler for SvelteKit SPA 2026-02-24 00:18:28 +00:00
Viktor Barzin
0dc3fec3f9
f1-stream: fix SvelteKit routing - add trailingSlash for static adapter 2026-02-24 00:11:44 +00:00
Viktor Barzin
fd6a8bb6f2
f1-stream: add pydantic dependency and trigger CI build 2026-02-24 00:00:02 +00:00
Viktor Barzin
a050db616e
[ci skip] f1-stream: add CDN token refresh, SvelteKit frontend, multi-stream layout (Phases 6-8)
- Phase 6: CDN token lifecycle with 3-strategy URL matching and periodic refresh
- Phase 7: SvelteKit 2/Svelte 5 frontend with schedule calendar and hls.js player
- Phase 8: Multi-stream layout supporting up to 4 simultaneous HLS streams
- Update Dockerfile to multi-stage build (Node.js frontend + Python backend)
- Switch deployment to :latest tag with Always pull policy for CI-driven deploys
- Update Woodpecker CI to use explicit latest tag
2026-02-23 23:59:35 +00:00
Viktor Barzin
9bf0523ea9
[ci skip] f1-stream: add stream health checker and HLS proxy (Phases 4-5)
Phase 4 - Stream Health and Fallback:
- StreamHealthChecker with partial GET validation of m3u8 content
- Bitrate extraction from BANDWIDTH tags
- Response time measurement for quality ranking
- Fallback ordering: live first, fastest response time first
- GET /streams now only returns health-verified streams

Phase 5 - HLS Proxy Core:
- GET /proxy?url= - m3u8 playlist fetch with full URI rewriting
- GET /relay?url= - chunked segment relay (never buffers full segment)
- m3u8 rewriter handles master, variant, and segment URIs
- Base64url encoding for URL parameters
- CORS middleware for browser playback
- Range header forwarding for seeking support
2026-02-23 23:41:16 +00:00
Viktor Barzin
00d9458dec
[ci skip] trim CLAUDE.md: remove discoverable info, deduplicate 2026-02-23 23:10:13 +00:00
Viktor Barzin
b29f5ddb06
[ci skip] f1-stream: add extractor framework with demo streams (Phase 3)
- BaseExtractor ABC with health_check method
- ExtractorRegistry with concurrent fan-out extraction
- ExtractionService with in-memory cache and background polling
- DemoExtractor with 3 public HLS test streams
- Adaptive polling: 5min during live sessions, 30min otherwise
- GET /streams, GET /extractors, POST /extract endpoints
2026-02-23 23:02:56 +00:00
Viktor Barzin
4fd3e2d770
[ci skip] f1-stream: add gitignore for __pycache__, remove committed .pyc 2026-02-23 22:55:38 +00:00
Viktor Barzin
d7d347de27
[ci skip] f1-stream: add F1 schedule subsystem (Phase 2)
- Fetch 2026 F1 race calendar from jolpica API with all sessions
  (FP1-3, Qualifying, Sprint, Race) and UTC timestamps
- Persist schedule to NFS as JSON, load on startup if fresh
- APScheduler daily refresh at 03:00 UTC
- GET /schedule endpoint with live/upcoming/past session status
- POST /schedule/refresh for manual refresh trigger
2026-02-23 22:55:13 +00:00
Viktor Barzin
f423a4d60c
[ci skip] f1-stream: replace Go service with Python/FastAPI skeleton
Replaces the existing Go-based f1-stream service with a new Python/FastAPI
backend as the foundation for the rebuilt F1 streaming aggregation service.

- New FastAPI backend with health and root endpoints
- Python 3.13 slim Dockerfile (replaces Go multi-stage build)
- Updated Terraform deployment (port 8000, reduced resources)
- Buildx-based redeploy.sh with --platform linux/amd64
- Added Woodpecker CI pipeline for automated builds
- Removed all old Go source, node_modules, static assets
2026-02-23 22:47:06 +00:00
Viktor Barzin
c665544f41
[ci skip] fix plotting-book: add SESSION_SECRET env var
Session secret stored in encrypted terraform.tfvars, referenced
via variable to avoid committing secrets in plain text.
2026-02-23 22:44:04 +00:00
Viktor Barzin
85f88bf167
[ci skip] platform: add ndots=2 dns_config to all deployment pod specs
Prevents Terraform from reverting the Kyverno inject-ndots mutation
on every apply. 23 pod specs across 19 platform module files.
2026-02-23 22:43:05 +00:00
Viktor Barzin
a2a83d30aa
[ci skip] monitoring: increase resource quota limits
Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide
headroom. Previous values were at 87% and 92% utilization.
2026-02-23 22:42:30 +00:00
Viktor Barzin
e982a8ad81
[ci skip] fix redis OOMKilled: increase memory limits to 2Gi
Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was
insufficient for 650MB RDB dataset plus redis-stack modules.
2026-02-23 22:37:56 +00:00
Viktor Barzin
2789c0fa5c
[ci skip] add trading-bot Terraform stack 2026-02-23 22:29:59 +00:00
Viktor Barzin
a2a3f8027c
docs(01-infrastructure-and-deployment): create phase plan 2026-02-23 22:28:54 +00:00
Viktor Barzin
ada6d7c496
docs: create roadmap (8 phases) 2026-02-23 22:22:53 +00:00
Viktor Barzin
26e3008d98
docs: define v1 requirements 2026-02-23 22:13:53 +00:00
Viktor Barzin
c986e9bd56
[ci skip] update claude knowledge: infrastructure hardening changes
- NFS volumes now use var.nfs_server (not hardcoded IP)
- Shared infra variables documented (redis_host, postgresql_host, etc.)
- Tiers locals now generated by terragrunt.hcl, not duplicated in stacks
- Traefik security hardening documented (API, headers, rate limiting)
- Kyverno pod security policies documented (audit mode)
- Prometheus alert groups updated (Critical Services, PVPredictedFull)
- Loki retention updated to 30d, Alloy memory to 512Mi/1Gi
- Grampsweb now protected by Authentik
- MeshCentral registration disabled
2026-02-23 22:08:46 +00:00
Viktor Barzin
ab9c661cfb
docs: complete project research 2026-02-23 22:06:23 +00:00
Viktor Barzin
2d919c4d34
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
48083bb1fd
Reorder realestate-crawler Grafana dashboard sections
Move API Performance and Per-Endpoint Latency to the top.
Move Scraping Overview, Scraping Activity, and Throttling & Errors
to the bottom. Keeps the most operationally relevant panels visible
first.
2026-02-23 22:03:27 +00:00
Viktor Barzin
05ea5a6aa3
chore: add project config 2026-02-23 21:55:25 +00:00
Viktor Barzin
8e2948687b
docs: initialize project 2026-02-23 21:53:51 +00:00
Viktor Barzin
449937e22e
Sync realestate-crawler Grafana dashboard with per-endpoint latency panels 2026-02-23 21:31:01 +00:00
Viktor Barzin
8985cd60cc
[ci skip] mailserver: fix Rspamd DKIM signing key path
Mount DKIM private key at Rspamd-expected path
(/tmp/docker-mailserver/rspamd/dkim/viktorbarzin.me/mail.private)
and add dkim_signing.conf override for domain/selector config.
Rspamd does not auto-detect keys from the OpenDKIM path.
2026-02-23 21:01:29 +00:00
Viktor Barzin
04db99fde2
docs: map existing codebase 2026-02-23 20:54:27 +00:00
Viktor Barzin
59c862fe18
[ci skip] dns: remove stale SendGrid CNAME records
Remove em7107, s1._domainkey, and s2._domainkey SendGrid CNAME
records from the bind zone. These are remnants from a previous
relay setup that is no longer in use.
2026-02-23 20:31:07 +00:00
Viktor Barzin
e95ef07b04
[ci skip] mailserver: tighten DMARC policy to quarantine
Move DMARC enforcement from p=none (monitoring only) to p=quarantine
so spoofed emails from viktorbarzin.me are quarantined by recipients.
2026-02-23 20:30:30 +00:00
Viktor Barzin
ce03bc25a9
[ci skip] mailserver: add Postfix rate limiting
Add connection and message rate limits to protect against brute-force
attacks on SMTP/IMAP ports. 10 connections and 30 messages per minute
per client IP.
2026-02-23 20:29:45 +00:00
Viktor Barzin
74948a8af3
[ci skip] roundcubemail: pin to 1.6-apache, disable debug logging
Pin Roundcubemail to stable 1.6-apache tag instead of :latest to
prevent unexpected breakage. Disable SMTP debug and reduce debug
level from 6 to 1 for production use.
2026-02-23 20:29:39 +00:00
Viktor Barzin
b7ccae69bc
[ci skip] monitoring: enable mailserver-down Prometheus alert
Uncomment the mailserver availability alert so we get paged if
the mail server pod has no available replicas for 5 minutes.
2026-02-23 20:29:33 +00:00
Viktor Barzin
75f5cb2001
[ci skip] mailserver: enable Rspamd, disable OpenDKIM
Enable Rspamd for spam filtering and DKIM signing, replacing
OpenDKIM. Rspamd reads existing DKIM keys from the same mount path.
2026-02-23 20:29:32 +00:00
Viktor Barzin
6ca4a1a081
Sync realestate-crawler dashboard with navigation & usage metrics panels 2026-02-23 20:28:55 +00:00
Viktor Barzin
c6a79e89c7
[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4 2026-02-23 20:14:30 +00:00
Viktor Barzin
0eababf212
[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
b45688646d
Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP]
- Switch from custom clone override to woodpeckerci/plugin-git built-in clone
  (handles auth automatically via netrc from GitHub OAuth token)
- Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense
  (fixes intermittent DNS timeouts causing clone failures)
- Fix missing comma after heredoc in audit-policy.tf (syntax error)
2026-02-23 00:08:42 +00:00
Viktor Barzin
d870a63130
[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug
Change cluster-healthcheck CronJob from every 30min to every 8h.
Replace fragile sed-based audit config in apiserver manifest with
idempotent Python script that deduplicates by name/mountPath,
preventing the duplicate volume entries that crashed the API server.
2026-02-22 23:18:30 +00:00
Viktor Barzin
860077a126
[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces
Add resource-governance/custom-quota=true label to both namespaces so
Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.
2026-02-22 23:14:53 +00:00
Viktor Barzin
cf67e02135
[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
ff66adbe9e
[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00
Viktor Barzin
512b7d08a5
[ci skip] Merge 3 Traefik skills into traefik-helm-configuration
Consolidated traefik-http3-quic, traefik-udp-cross-namespace, and
traefik-plugin-download-failure-404 into a single skill with sections
for HTTP/3 (QUIC), UDP cross-namespace routing, and plugin download
failure troubleshooting.
2026-02-22 22:09:26 +00:00
Viktor Barzin
072642d779
[ci skip] Merge 2 rewrite-body skills into traefik-rewrite-body-troubleshooting 2026-02-22 22:09:03 +00:00
Viktor Barzin
88960ba3a4
[ci skip] Rebuild docker-registry with nginx serialization on all ports
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
2026-02-22 21:45:53 +00:00
Viktor Barzin
9488af2397
[ci skip] Add rewrite-body Accept header skill, update NFS skill
New skill: traefik-rewrite-body-accept-header — rewrite-body plugin
silently skips injection when request Accept header doesn't contain
text/html (curl default Accept: */* doesn't match).

Updated: k8s-nfs-mount-troubleshooting v1.1.0 — added variant for
non-root container UID permission denied on NFS writes.
2026-02-22 21:41:07 +00:00