infra

Author	SHA1	Message	Date
Viktor Barzin	c8069f53c8	update claude knowledge: final ESO migration state [ci skip]	2026-03-15 22:32:46 +00:00
Viktor Barzin	6c8a42b4e3	add add-user skill for cluster onboarding Interactive skill that collects user info, updates Vault KV k8s_users, and applies vault/platform/woodpecker stacks. Includes verification checklist and auto-generated resource table.	2026-03-15 22:28:54 +00:00
Viktor Barzin	23dfaa1ac8	update claude knowledge: vault-native secrets migration decisions [ci skip]	2026-03-15 21:00:07 +00:00
Viktor Barzin	cfc30b62e8	enhance devops-engineer agent: deploy + monitor pod health [ci skip] - Upgrade model from sonnet to opus for subagent orchestration - Add Write, Edit, Agent tools for spawning monitor subagents - Add mandatory deployment workflow: pre-deploy snapshot, apply, spawn background haiku pod monitor, react to results - Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck Pending, and probe failures within 3 min timeout - Allow terragrunt apply and kubectl set image as safe operations	2026-03-15 18:44:20 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	d17a6e2fd3	fix calendar-query.py: use get_display_name(), URL-decode names, fix search API - Replace deprecated cal.name with cal_name() helper using get_display_name() - URL-decode calendar names (Formula+1 → Formula 1) - Use cal.search(event=True) instead of deprecated date_search() - Default to showing all calendars instead of filtering to Personal	2026-03-15 16:12:36 +00:00
Viktor Barzin	944d6d3b22	update claude knowledge: resource management learnings from right-sizing session [ci skip]	2026-03-15 15:38:37 +00:00
Viktor Barzin	8bac6db48f	add name/description/tools to review-loop agent frontmatter [ci skip]	2026-03-15 11:14:31 +00:00
Viktor Barzin	616370d34c	rename planner agent to review-loop [ci skip]	2026-03-15 11:12:14 +00:00
Viktor Barzin	123e996b04	add planner agent: plan-review-fix convergence loop [ci skip]	2026-03-15 10:46:53 +00:00
Viktor Barzin	307b7f6819	update claude knowledge: infra operational learnings from commit history [ci skip] Add resource management patterns, networking resilience, service-specific notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.	2026-03-15 10:46:45 +00:00
Viktor Barzin	0a69af618d	update claude knowledge: vault KV secrets migration [ci skip]	2026-03-15 03:22:07 +00:00
Viktor Barzin	4a27345057	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-15 03:22:07 +00:00
Viktor Barzin	5f71a53b08	add memory-tool instructions to project CLAUDE.md [ci skip] OpenClaw agents read the project-level CLAUDE.md from the workspace. Adding explicit memory-tool CLI instructions here ensures the agent uses exec to call memory-tool instead of looking for non-existent MCP tools (memory_store, memory_recall).	2026-03-15 02:16:03 +00:00
Viktor Barzin	456e2777f5	update claude knowledge: LinuxServer.io container optimization learnings [ci skip]	2026-03-15 02:04:04 +00:00
Viktor Barzin	ff83ec3325	add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts Agents: devops-engineer, dba, security-engineer, sre, network-engineer, platform-engineer, observability-engineer, home-automation-engineer. Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status, authentik-audit, oom-investigator, resource-report, dns-check, network-health, nfs-health, truenas-status, platform-status, monitoring-health. Also: known-issues.md suppression list, cluster-health-checker port-forward fix.	2026-03-15 02:01:07 +00:00
Viktor Barzin	916aa6c6cb	update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip]	2026-03-14 23:42:17 +00:00
Viktor Barzin	4635d3b826	remember: CrowdSec Helm upgrade timeout [ci skip]	2026-03-14 12:04:07 +00:00
Viktor Barzin	d05ff57b11	authentik: auto-assign invitation group via expression policy [ci skip] Added invitation-group-assignment expression policy bound to the enrollment-login stage. Reads group name from invitation fixed_data and auto-adds the user to the target group on enrollment. No more manual assign step needed after signup.	2026-03-13 22:21:10 +00:00
Viktor Barzin	160fda882f	authentik: cleanup unused resources + add invitation enrollment flow [ci skip] Cleanup: - Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment) - Deleted 8 orphaned stages bound only to deleted flows - Deleted authentik Read-only group and role (0 users) - Deleted 2 unbound policies (map github username, Map Google Attributes) Invitation enrollment: - Created invitation-enrollment flow with 5 stages (invitation validation, identification with social login, prompt, user write, auto-login) - Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment - New users can only sign up via single-use invitation links - Added authentik-invite.sh script for invitation management - Updated reference docs and authentik skill	2026-03-13 22:21:10 +00:00
OpenClaw	4a9bd89b11	feat(health-check): Add Prometheus-based CPU and power monitoring SECTIONS ADDED: - Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics) - Section 26: Power Monitoring (DCGM GPU power + host power) FEATURES: - 5-minute CPU usage averages (more accurate than kubectl top) - Tesla T4 GPU power consumption monitoring - CPU thresholds: 70% warn, 85% critical - GPU power thresholds: 50W active, 65W high - Maps IP addresses to friendly node names - Integrates with existing health check infrastructure CURRENT STATUS: - All nodes have healthy disk usage (~10%) - k8s-node4 flagged at 87% CPU (explains resource pressure) - GPU operating normally at 30.9W - Enhanced monitoring prevents issues like node2 containerd corruption Total health check sections: 26 (was 24) Addresses node2 incident prevention requirements	2026-03-13 07:32:36 +00:00
OpenClaw	a09967e098	feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident IMMEDIATE CHANGES (Active Now): - Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%) - More aggressive alerting to prevent containerd corruption - Enhanced cluster health check disk monitoring INFRASTRUCTURE CHANGES (Requires Terraform Apply): - Add containerd garbage collection configuration (30min intervals) - More aggressive kubelet eviction policies (15%/20% vs 10%/15%) - Enhanced disk space protection to prevent node2-type failures Root Cause: node2 disk exhaustion corrupted containerd image store Prevention: Proactive monitoring + aggressive cleanup policies [ci skip] - Infrastructure changes require SOPS access for apply	2026-03-13 07:16:56 +00:00
Viktor Barzin	bef0c073d5	fix DNS health check: use system resolver instead of hardcoded 10.0.20.101 The check was querying Technitium DNS directly at 10.0.20.101:53, which refuses connections from non-cluster hosts. Use the system resolver (no @server flag) so it works from any host or pod environment. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-12 09:08:34 +00:00
Viktor Barzin	2fa8ba2038	[ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern - Document sealed secrets workflow in AGENTS.md and CLAUDE.md - Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference - Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform - E2E tested: seal/commit/plan/apply/decrypt cycle verified	2026-03-08 20:03:50 +00:00
Viktor Barzin	d352d6e7f8	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	98f4920af1	[ci skip] remember: update kubelet thresholds when changing node memory	2026-03-08 10:34:17 +00:00
Viktor Barzin	7027c49fef	[ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info	2026-03-07 20:39:55 +00:00
Viktor Barzin	7cc7991ce6	[ci skip] claudeception: extract 2 skills from today's session 1. sops-age-secrets-migration: Complete guide for migrating from git-crypt to SOPS+age. Covers JSON format requirement, race condition avoidance, CI integration, complex types, and migration sequence. 2. iterative-plan-review-with-subagents: Design pattern for reviewing plans with parallel security + implementation subagents. 2-3 iterations to zero CRITICALs. Used successfully for the SOPS migration design.	2026-03-07 15:46:36 +00:00
Viktor Barzin	9f2ac0fd1a	[ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline AGENTS.md: added SOPS secrets management section, scripts/tg usage, contributor onboarding steps, pull-through cache bypass notes. CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder, versioned tag guidance for pull-through cache. CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys the k8s portal when files under stacks/platform/modules/k8s-portal/files/ change on master push. Uses buildx for linux/amd64.	2026-03-07 15:37:19 +00:00
Viktor Barzin	5907e50fda	[ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS) Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user, config path, and platform info to reflect HA OS on 192.168.8.103.	2026-03-07 14:34:44 +00:00
Viktor Barzin	8d3db35b5e	[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude, Cursor). Covers: critical rules, architecture, storage, tiers, common ops. CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences. References AGENTS.md for shared knowledge. Removed generic agents (devops-engineer, fullstack-developer).	2026-03-06 23:50:26 +00:00
Viktor Barzin	c170351e77	[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno tables, anti-AI, node rebuild) to .claude/reference/patterns.md. Kept: critical rules, quick patterns, key commands, tier overview, prefs. Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16 entries (removed all infra-specific duplicates, kept cross-project prefs). Agents: removed generic devops-engineer (885L) and fullstack-developer (234L). Kept custom cluster-health-checker (48L).	2026-03-06 23:27:46 +00:00
Viktor Barzin	bcbe8b23b4	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	e6234d4683	[ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki	2026-03-06 21:05:21 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	61a532054e	[ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module	2026-03-02 02:04:47 +00:00
Viktor Barzin	de598996f1	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	53be356f41	[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).	2026-03-01 21:04:19 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	f2c66f070b	[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting New skill documenting the NFSv4 idmapd UID mapping crisis where all file UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody. Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.	2026-03-01 18:14:37 +00:00
Viktor Barzin	4beadc2ca2	[ci skip] add openclaw-k8s-deployment skill from claudeception Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes: - wizard block required for Telegram, exec.host valid values, - VPA resource overrides, file permissions, startup command, - modelrelay sidecar, NFS caching strategy	2026-03-01 18:10:33 +00:00
Viktor Barzin	27e59a6af0	[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings	2026-03-01 18:07:04 +00:00
Viktor Barzin	99ecba46db	[ci skip] add Kyverno resource governance details to CLAUDE.md	2026-03-01 13:05:57 +00:00
Viktor Barzin	f7acc31d83	[ci skip] update NFS mount skill: add stale mount variant after node reboots New variant documents ghost Running pods with frozen processes after kured rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.	2026-02-28 19:38:30 +00:00
Viktor Barzin	14b1c43713	[ci skip] expand k8s worker nodes to 256G, update inventory and extend script - k8s-node2: 128G → 256G (160GB free) - k8s-node3: 128G → 256G (135GB free) - k8s-node4: 128G → 256G (127GB free) - k8s-node1: already 256G (51GB free) - extend_vm_storage.sh: increase drain timeout to 300s, add --force flag - Remove Vaultwarden from SQLite migration plan (too risky)	2026-02-28 16:00:16 +00:00
Viktor Barzin	3ebf4557f5	[ci skip] update claude knowledge: never restart NFS, NFS export dir prereq	2026-02-28 12:20:36 +00:00
Viktor Barzin	a9a4ac37a2	[ci skip] trim CLAUDE.md: remove discoverable info, deduplicate	2026-02-23 23:10:13 +00:00
Viktor Barzin	c61c1744de	[ci skip] update claude knowledge: infrastructure hardening changes - NFS volumes now use var.nfs_server (not hardcoded IP) - Shared infra variables documented (redis_host, postgresql_host, etc.) - Tiers locals now generated by terragrunt.hcl, not duplicated in stacks - Traefik security hardening documented (API, headers, rate limiting) - Kyverno pod security policies documented (audit mode) - Prometheus alert groups updated (Critical Services, PVPredictedFull) - Loki retention updated to 30d, Alloy memory to 512Mi/1Gi - Grampsweb now protected by Authentik - MeshCentral registration disabled	2026-02-23 22:08:46 +00:00
Viktor Barzin	c8de2c4803	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me. Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma monitor, dashboard entry, and all code/doc references across 18 files.	2026-02-23 19:38:55 +00:00

183 commits