infra

Author	SHA1	Message	Date
Viktor Barzin	e51c063600	docs(add-user): update skill with actual working flow (no auto TF apply)	2026-03-18 00:28:46 +00:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	ccbcebb670	feat(vault): automate SOPS onboarding for namespace-owners - Add Transit mount + per-stack Transit keys to vault stack TF - Auto-create sops-user-<name> policy scoping decrypt to owned stacks - Auto-create sops-<name> external group + alias for Authentik mapping - Add sops-admin policy to authentik-admins group - Attach sops-user policy to namespace-owner identity entities - Update add-user skill with SOPS onboarding steps and Authentik group - Adding a user to k8s_users + applying vault stack = full SOPS access [ci skip]	2026-03-17 23:15:25 +00:00
Viktor Barzin	6239e07dd5	docs: add plotting-book to GHA-migrated list and repo IDs [ci skip]	2026-03-17 23:07:32 +00:00
Viktor Barzin	d6afbe84c8	post-mortem v2: pipeline team architecture with 4-stage agents [ci skip] Split monolithic orchestrator into triage (haiku), historian (sonnet), and report-writer (opus) stages. Each stage gets its own tool budget. Added sev-context.sh for structured cluster context gathering.	2026-03-16 21:59:34 +00:00
Viktor Barzin	0abb6b83ad	add deploy-app skill and agent for automated repo→app deployment [ci skip]	2026-03-16 18:06:24 +00:00
Viktor Barzin	88abbef7c3	update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip]	2026-03-16 07:10:45 +00:00
Viktor Barzin	b87ba5e778	update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip]	2026-03-15 23:21:52 +00:00
Viktor Barzin	c8069f53c8	update claude knowledge: final ESO migration state [ci skip]	2026-03-15 22:32:46 +00:00
Viktor Barzin	6c8a42b4e3	add add-user skill for cluster onboarding Interactive skill that collects user info, updates Vault KV k8s_users, and applies vault/platform/woodpecker stacks. Includes verification checklist and auto-generated resource table.	2026-03-15 22:28:54 +00:00
Viktor Barzin	23dfaa1ac8	update claude knowledge: vault-native secrets migration decisions [ci skip]	2026-03-15 21:00:07 +00:00
Viktor Barzin	cfc30b62e8	enhance devops-engineer agent: deploy + monitor pod health [ci skip] - Upgrade model from sonnet to opus for subagent orchestration - Add Write, Edit, Agent tools for spawning monitor subagents - Add mandatory deployment workflow: pre-deploy snapshot, apply, spawn background haiku pod monitor, react to results - Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck Pending, and probe failures within 3 min timeout - Allow terragrunt apply and kubectl set image as safe operations	2026-03-15 18:44:20 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	d17a6e2fd3	fix calendar-query.py: use get_display_name(), URL-decode names, fix search API - Replace deprecated cal.name with cal_name() helper using get_display_name() - URL-decode calendar names (Formula+1 → Formula 1) - Use cal.search(event=True) instead of deprecated date_search() - Default to showing all calendars instead of filtering to Personal	2026-03-15 16:12:36 +00:00
Viktor Barzin	944d6d3b22	update claude knowledge: resource management learnings from right-sizing session [ci skip]	2026-03-15 15:38:37 +00:00
Viktor Barzin	8bac6db48f	add name/description/tools to review-loop agent frontmatter [ci skip]	2026-03-15 11:14:31 +00:00
Viktor Barzin	616370d34c	rename planner agent to review-loop [ci skip]	2026-03-15 11:12:14 +00:00
Viktor Barzin	123e996b04	add planner agent: plan-review-fix convergence loop [ci skip]	2026-03-15 10:46:53 +00:00
Viktor Barzin	307b7f6819	update claude knowledge: infra operational learnings from commit history [ci skip] Add resource management patterns, networking resilience, service-specific notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.	2026-03-15 10:46:45 +00:00
Viktor Barzin	0a69af618d	update claude knowledge: vault KV secrets migration [ci skip]	2026-03-15 03:22:07 +00:00
Viktor Barzin	4a27345057	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-15 03:22:07 +00:00
Viktor Barzin	5f71a53b08	add memory-tool instructions to project CLAUDE.md [ci skip] OpenClaw agents read the project-level CLAUDE.md from the workspace. Adding explicit memory-tool CLI instructions here ensures the agent uses exec to call memory-tool instead of looking for non-existent MCP tools (memory_store, memory_recall).	2026-03-15 02:16:03 +00:00
Viktor Barzin	456e2777f5	update claude knowledge: LinuxServer.io container optimization learnings [ci skip]	2026-03-15 02:04:04 +00:00
Viktor Barzin	ff83ec3325	add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts Agents: devops-engineer, dba, security-engineer, sre, network-engineer, platform-engineer, observability-engineer, home-automation-engineer. Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status, authentik-audit, oom-investigator, resource-report, dns-check, network-health, nfs-health, truenas-status, platform-status, monitoring-health. Also: known-issues.md suppression list, cluster-health-checker port-forward fix.	2026-03-15 02:01:07 +00:00
Viktor Barzin	916aa6c6cb	update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip]	2026-03-14 23:42:17 +00:00
Viktor Barzin	4635d3b826	remember: CrowdSec Helm upgrade timeout [ci skip]	2026-03-14 12:04:07 +00:00
Viktor Barzin	d05ff57b11	authentik: auto-assign invitation group via expression policy [ci skip] Added invitation-group-assignment expression policy bound to the enrollment-login stage. Reads group name from invitation fixed_data and auto-adds the user to the target group on enrollment. No more manual assign step needed after signup.	2026-03-13 22:21:10 +00:00
Viktor Barzin	160fda882f	authentik: cleanup unused resources + add invitation enrollment flow [ci skip] Cleanup: - Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment) - Deleted 8 orphaned stages bound only to deleted flows - Deleted authentik Read-only group and role (0 users) - Deleted 2 unbound policies (map github username, Map Google Attributes) Invitation enrollment: - Created invitation-enrollment flow with 5 stages (invitation validation, identification with social login, prompt, user write, auto-login) - Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment - New users can only sign up via single-use invitation links - Added authentik-invite.sh script for invitation management - Updated reference docs and authentik skill	2026-03-13 22:21:10 +00:00
OpenClaw	4a9bd89b11	feat(health-check): Add Prometheus-based CPU and power monitoring SECTIONS ADDED: - Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics) - Section 26: Power Monitoring (DCGM GPU power + host power) FEATURES: - 5-minute CPU usage averages (more accurate than kubectl top) - Tesla T4 GPU power consumption monitoring - CPU thresholds: 70% warn, 85% critical - GPU power thresholds: 50W active, 65W high - Maps IP addresses to friendly node names - Integrates with existing health check infrastructure CURRENT STATUS: - All nodes have healthy disk usage (~10%) - k8s-node4 flagged at 87% CPU (explains resource pressure) - GPU operating normally at 30.9W - Enhanced monitoring prevents issues like node2 containerd corruption Total health check sections: 26 (was 24) Addresses node2 incident prevention requirements	2026-03-13 07:32:36 +00:00
OpenClaw	a09967e098	feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident IMMEDIATE CHANGES (Active Now): - Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%) - More aggressive alerting to prevent containerd corruption - Enhanced cluster health check disk monitoring INFRASTRUCTURE CHANGES (Requires Terraform Apply): - Add containerd garbage collection configuration (30min intervals) - More aggressive kubelet eviction policies (15%/20% vs 10%/15%) - Enhanced disk space protection to prevent node2-type failures Root Cause: node2 disk exhaustion corrupted containerd image store Prevention: Proactive monitoring + aggressive cleanup policies [ci skip] - Infrastructure changes require SOPS access for apply	2026-03-13 07:16:56 +00:00
Viktor Barzin	bef0c073d5	fix DNS health check: use system resolver instead of hardcoded 10.0.20.101 The check was querying Technitium DNS directly at 10.0.20.101:53, which refuses connections from non-cluster hosts. Use the system resolver (no @server flag) so it works from any host or pod environment. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-12 09:08:34 +00:00
Viktor Barzin	2fa8ba2038	[ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern - Document sealed secrets workflow in AGENTS.md and CLAUDE.md - Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference - Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform - E2E tested: seal/commit/plan/apply/decrypt cycle verified	2026-03-08 20:03:50 +00:00
Viktor Barzin	d352d6e7f8	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	98f4920af1	[ci skip] remember: update kubelet thresholds when changing node memory	2026-03-08 10:34:17 +00:00
Viktor Barzin	7027c49fef	[ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info	2026-03-07 20:39:55 +00:00
Viktor Barzin	7cc7991ce6	[ci skip] claudeception: extract 2 skills from today's session 1. sops-age-secrets-migration: Complete guide for migrating from git-crypt to SOPS+age. Covers JSON format requirement, race condition avoidance, CI integration, complex types, and migration sequence. 2. iterative-plan-review-with-subagents: Design pattern for reviewing plans with parallel security + implementation subagents. 2-3 iterations to zero CRITICALs. Used successfully for the SOPS migration design.	2026-03-07 15:46:36 +00:00
Viktor Barzin	9f2ac0fd1a	[ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline AGENTS.md: added SOPS secrets management section, scripts/tg usage, contributor onboarding steps, pull-through cache bypass notes. CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder, versioned tag guidance for pull-through cache. CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys the k8s portal when files under stacks/platform/modules/k8s-portal/files/ change on master push. Uses buildx for linux/amd64.	2026-03-07 15:37:19 +00:00
Viktor Barzin	5907e50fda	[ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS) Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user, config path, and platform info to reflect HA OS on 192.168.8.103.	2026-03-07 14:34:44 +00:00
Viktor Barzin	8d3db35b5e	[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude, Cursor). Covers: critical rules, architecture, storage, tiers, common ops. CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences. References AGENTS.md for shared knowledge. Removed generic agents (devops-engineer, fullstack-developer).	2026-03-06 23:50:26 +00:00
Viktor Barzin	c170351e77	[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno tables, anti-AI, node rebuild) to .claude/reference/patterns.md. Kept: critical rules, quick patterns, key commands, tier overview, prefs. Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16 entries (removed all infra-specific duplicates, kept cross-project prefs). Agents: removed generic devops-engineer (885L) and fullstack-developer (234L). Kept custom cluster-health-checker (48L).	2026-03-06 23:27:46 +00:00
Viktor Barzin	bcbe8b23b4	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	e6234d4683	[ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki	2026-03-06 21:05:21 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	61a532054e	[ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module	2026-03-02 02:04:47 +00:00
Viktor Barzin	de598996f1	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	53be356f41	[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).	2026-03-01 21:04:19 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	f2c66f070b	[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting New skill documenting the NFSv4 idmapd UID mapping crisis where all file UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody. Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.	2026-03-01 18:14:37 +00:00
Viktor Barzin	4beadc2ca2	[ci skip] add openclaw-k8s-deployment skill from claudeception Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes: - wizard block required for Telegram, exec.host valid values, - VPA resource overrides, file permissions, startup command, - modelrelay sidecar, NFS caching strategy	2026-03-01 18:10:33 +00:00
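
Commit d17a6e2fd3 above mentions URL-decoding calendar names (Formula+1 → Formula 1). The actual calendar-query.py is not shown in this log, so the following is only a minimal sketch of that one step using the Python standard library; the function name decode_calendar_name is hypothetical and not taken from the repository.

```python
from urllib.parse import unquote_plus


def decode_calendar_name(raw_name: str) -> str:
    """Return a readable calendar name from a URL-encoded one.

    Hypothetical helper for illustration only; the real script may
    decode names differently.
    """
    # unquote_plus turns '+' into a space and decodes %XX escapes.
    return unquote_plus(raw_name)


if __name__ == "__main__":
    print(decode_calendar_name("Formula+1"))
```

unquote_plus is used here rather than unquote because the example in the commit message encodes the space as a plus sign, which plain unquote would leave untouched.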

141 commits