infra

Author	SHA1	Message	Date
Viktor Barzin	1ef40daeec	docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip]	2026-04-13 23:05:46 +01:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	b45cee5c4a	docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:16:36 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	6ba4878f3a	docs: update storage architecture for NFS migration to Proxmox host [ci skip]	2026-04-11 17:00:10 +01:00
Viktor Barzin	eec6af6aef	docs: add IPAM/DDNS architecture diagram and update docs - networking.md: Add mermaid diagram showing full device discovery pipeline (Kea DHCP → DDNS → Technitium, pfSense import → phpIPAM → DNS sync) - networking.md: Add data flow table, DHCP coverage table - networking.md: Update pfSense (3 subnets + 42 reservations), phpIPAM (passive import replaces fping), Technitium (192.168.1.2 in ACL) - CLAUDE.md: Update phpIPAM and networking descriptions [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:42:10 +00:00
Viktor Barzin	8cd8743140	docs: add phpIPAM, Kea DDNS, and DNS sync documentation - networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones, Technitium dynamic update policy - CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section - service-catalog.md: Add phpipam, mark netbox as disabled/replaced [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 16:01:32 +00:00
Viktor Barzin	b345b086ef	update backup/DR docs and runbooks for 3-2-1 architecture - Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree). - Update storage.md: 65 proxmox-lvm PVCs, sda backup tier - Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda - Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths - New runbook: restore-pvc-from-backup.md (file-level restore from sda) - Update CLAUDE.md Storage & Backup section for 3-2-1 architecture	2026-04-06 15:06:01 +03:00
Viktor Barzin	64c378d158	add critical instruction to update docs with every infra change [ci skip]	2026-04-06 13:21:49 +03:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	ad7c0d7fc8	docs: add critical "Terraform Only" rule to CLAUDE.md All infrastructure changes must go through Terraform/Terragrunt. kubectl is read-only except for temporary migration steps. If a resource isn't in Terraform, evaluate adding it before making manual changes.	2026-04-05 19:46:07 +03:00
Viktor Barzin	2d5c55f7b1	docs: add storage class decision rule to CLAUDE.md Default to proxmox-lvm for all new services. NFS only for RWX, backup destinations, or shared media libraries. Updated iSCSI backup section to reflect proxmox-lvm migration.	2026-04-04 16:35:12 +03:00
Viktor Barzin	10f22350c5	exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip] Expanded cloud sync excludes to reduce sync time and Synology disk usage. All excluded data is either regenerable or low-value. TrueNAS Task 1 and incremental script already updated live.	2026-03-29 13:44:32 +03:00
Viktor Barzin	78dec8f0ad	add e2e email roundtrip monitoring CronJob (every 30 min) sends test email via Mailgun API to smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale, EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.	2026-03-25 22:50:22 +02:00
Viktor Barzin	1639910043	ingress latency: add histogram buckets, fix restarts, right-size memory - Traefik: add fine-grained Prometheus histogram buckets (0.01-30s) for meaningful P50/P99 - Calibre: relax liveness probe (timeout 5→10s, threshold 3→6) to stop NFS-caused restarts - Novelapp: increase memory 128Mi/256Mi → 640Mi/640Mi (confirmed OOMKilled, VPA upper 505Mi) - Forgejo: increase memory 256Mi → 384Mi (at 80% of limit, VPA upper 311Mi) - ActualBudget: add explicit resources to prevent silent LimitRange defaults - Docs: update Nextcloud note from 4Gi → 8Gi limit (Apache spike history)	2026-03-23 10:52:43 +02:00
Viktor Barzin	813f523170	docs: add private registry usage to infra CLAUDE.md [ci skip]	2026-03-23 01:08:57 +02:00
Viktor Barzin	c111799831	remove duplicated agents, update CLAUDE.md references [ci skip] All agents now live globally in ~/.claude/agents/ (shared via dotfiles). Deleted 11 duplicates, moved sev-*/deploy-app to global scope.	2026-03-22 23:44:27 +02:00
Viktor Barzin	1c13af142d	sync regenerated providers.tf + upstream changes - Terragrunt-regenerated providers.tf across stacks (vault_root_token variable removed from root generate block) - Upstream monitoring/openclaw/CLAUDE.md changes from rebase	2026-03-22 02:56:04 +02:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	6239e07dd5	docs: add plotting-book to GHA-migrated list and repo IDs [ci skip]	2026-03-17 23:07:32 +00:00
Viktor Barzin	88abbef7c3	update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip]	2026-03-16 07:10:45 +00:00
Viktor Barzin	b87ba5e778	update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip]	2026-03-15 23:21:52 +00:00
Viktor Barzin	c8069f53c8	update claude knowledge: final ESO migration state [ci skip]	2026-03-15 22:32:46 +00:00
Viktor Barzin	23dfaa1ac8	update claude knowledge: vault-native secrets migration decisions [ci skip]	2026-03-15 21:00:07 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	944d6d3b22	update claude knowledge: resource management learnings from right-sizing session [ci skip]	2026-03-15 15:38:37 +00:00
Viktor Barzin	307b7f6819	update claude knowledge: infra operational learnings from commit history [ci skip] Add resource management patterns, networking resilience, service-specific notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.	2026-03-15 10:46:45 +00:00
Viktor Barzin	0a69af618d	update claude knowledge: vault KV secrets migration [ci skip]	2026-03-15 03:22:07 +00:00
Viktor Barzin	4a27345057	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-15 03:22:07 +00:00
Viktor Barzin	5f71a53b08	add memory-tool instructions to project CLAUDE.md [ci skip] OpenClaw agents read the project-level CLAUDE.md from the workspace. Adding explicit memory-tool CLI instructions here ensures the agent uses exec to call memory-tool instead of looking for non-existent MCP tools (memory_store, memory_recall).	2026-03-15 02:16:03 +00:00
Viktor Barzin	456e2777f5	update claude knowledge: LinuxServer.io container optimization learnings [ci skip]	2026-03-15 02:04:04 +00:00
Viktor Barzin	916aa6c6cb	update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip]	2026-03-14 23:42:17 +00:00
Viktor Barzin	4635d3b826	remember: CrowdSec Helm upgrade timeout [ci skip]	2026-03-14 12:04:07 +00:00
Viktor Barzin	2fa8ba2038	[ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern - Document sealed secrets workflow in AGENTS.md and CLAUDE.md - Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference - Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform - E2E tested: seal/commit/plan/apply/decrypt cycle verified	2026-03-08 20:03:50 +00:00
Viktor Barzin	98f4920af1	[ci skip] remember: update kubelet thresholds when changing node memory	2026-03-08 10:34:17 +00:00
Viktor Barzin	9f2ac0fd1a	[ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline AGENTS.md: added SOPS secrets management section, scripts/tg usage, contributor onboarding steps, pull-through cache bypass notes. CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder, versioned tag guidance for pull-through cache. CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys the k8s portal when files under stacks/platform/modules/k8s-portal/files/ change on master push. Uses buildx for linux/amd64.	2026-03-07 15:37:19 +00:00
Viktor Barzin	8d3db35b5e	[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude, Cursor). Covers: critical rules, architecture, storage, tiers, common ops. CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences. References AGENTS.md for shared knowledge. Removed generic agents (devops-engineer, fullstack-developer).	2026-03-06 23:50:26 +00:00
Viktor Barzin	c170351e77	[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno tables, anti-AI, node rebuild) to .claude/reference/patterns.md. Kept: critical rules, quick patterns, key commands, tier overview, prefs. Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16 entries (removed all infra-specific duplicates, kept cross-project prefs). Agents: removed generic devops-engineer (885L) and fullstack-developer (234L). Kept custom cluster-health-checker (48L).	2026-03-06 23:27:46 +00:00
Viktor Barzin	bcbe8b23b4	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	e6234d4683	[ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki	2026-03-06 21:05:21 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	61a532054e	[ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module	2026-03-02 02:04:47 +00:00
Viktor Barzin	de598996f1	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	27e59a6af0	[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings	2026-03-01 18:07:04 +00:00
Viktor Barzin	99ecba46db	[ci skip] add Kyverno resource governance details to CLAUDE.md	2026-03-01 13:05:57 +00:00
Viktor Barzin	3ebf4557f5	[ci skip] update claude knowledge: never restart NFS, NFS export dir prereq	2026-02-28 12:20:36 +00:00
Viktor Barzin	a9a4ac37a2	[ci skip] trim CLAUDE.md: remove discoverable info, deduplicate	2026-02-23 23:10:13 +00:00
Viktor Barzin	c61c1744de	[ci skip] update claude knowledge: infrastructure hardening changes - NFS volumes now use var.nfs_server (not hardcoded IP) - Shared infra variables documented (redis_host, postgresql_host, etc.) - Tiers locals now generated by terragrunt.hcl, not duplicated in stacks - Traefik security hardening documented (API, headers, rate limiting) - Kyverno pod security policies documented (audit mode) - Prometheus alert groups updated (Critical Services, PVPredictedFull) - Loki retention updated to 30d, Alloy memory to 512Mi/1Gi - Grampsweb now protected by Authentik - MeshCentral registration disabled	2026-02-23 22:08:46 +00:00

99 commits