Commit graph

183 commits

Author SHA1 Message Date
Viktor Barzin
c8069f53c8 update claude knowledge: final ESO migration state [ci skip] 2026-03-15 22:32:46 +00:00
Viktor Barzin
6c8a42b4e3 add add-user skill for cluster onboarding
Interactive skill that collects user info, updates Vault KV k8s_users,
and applies vault/platform/woodpecker stacks. Includes verification
checklist and auto-generated resource table.
2026-03-15 22:28:54 +00:00
Viktor Barzin
23dfaa1ac8 update claude knowledge: vault-native secrets migration decisions [ci skip] 2026-03-15 21:00:07 +00:00
Viktor Barzin
cfc30b62e8 enhance devops-engineer agent: deploy + monitor pod health [ci skip]
- Upgrade model from sonnet to opus for subagent orchestration
- Add Write, Edit, Agent tools for spawning monitor subagents
- Add mandatory deployment workflow: pre-deploy snapshot, apply,
  spawn background haiku pod monitor, react to results
- Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck
  Pending, and probe failures within 3 min timeout
- Allow terragrunt apply and kubectl set image as safe operations
2026-03-15 18:44:20 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
d17a6e2fd3 fix calendar-query.py: use get_display_name(), URL-decode names, fix search API
- Replace deprecated cal.name with cal_name() helper using get_display_name()
- URL-decode calendar names (Formula+1 → Formula 1)
- Use cal.search(event=True) instead of deprecated date_search()
- Default to showing all calendars instead of filtering to Personal
2026-03-15 16:12:36 +00:00
Viktor Barzin
944d6d3b22 update claude knowledge: resource management learnings from right-sizing session [ci skip] 2026-03-15 15:38:37 +00:00
Viktor Barzin
8bac6db48f add name/description/tools to review-loop agent frontmatter [ci skip] 2026-03-15 11:14:31 +00:00
Viktor Barzin
616370d34c rename planner agent to review-loop [ci skip] 2026-03-15 11:12:14 +00:00
Viktor Barzin
123e996b04 add planner agent: plan-review-fix convergence loop [ci skip] 2026-03-15 10:46:53 +00:00
Viktor Barzin
307b7f6819 update claude knowledge: infra operational learnings from commit history [ci skip]
Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
2026-03-15 10:46:45 +00:00
Viktor Barzin
0a69af618d update claude knowledge: vault KV secrets migration [ci skip] 2026-03-15 03:22:07 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
Viktor Barzin
5f71a53b08 add memory-tool instructions to project CLAUDE.md [ci skip]
OpenClaw agents read the project-level CLAUDE.md from the workspace.
Adding explicit memory-tool CLI instructions here ensures the agent
uses exec to call memory-tool instead of looking for non-existent
MCP tools (memory_store, memory_recall).
2026-03-15 02:16:03 +00:00
Viktor Barzin
456e2777f5 update claude knowledge: LinuxServer.io container optimization learnings [ci skip] 2026-03-15 02:04:04 +00:00
Viktor Barzin
ff83ec3325 add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts
Agents: devops-engineer, dba, security-engineer, sre, network-engineer,
platform-engineer, observability-engineer, home-automation-engineer.
Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status,
authentik-audit, oom-investigator, resource-report, dns-check, network-health,
nfs-health, truenas-status, platform-status, monitoring-health.
Also: known-issues.md suppression list, cluster-health-checker port-forward fix.
2026-03-15 02:01:07 +00:00
Viktor Barzin
916aa6c6cb update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip] 2026-03-14 23:42:17 +00:00
Viktor Barzin
4635d3b826 remember: CrowdSec Helm upgrade timeout [ci skip] 2026-03-14 12:04:07 +00:00
Viktor Barzin
d05ff57b11 authentik: auto-assign invitation group via expression policy [ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
2026-03-13 22:21:10 +00:00
Viktor Barzin
160fda882f authentik: cleanup unused resources + add invitation enrollment flow [ci skip]
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)

Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
  identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill
2026-03-13 22:21:10 +00:00
OpenClaw
4a9bd89b11 feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
2026-03-13 07:32:36 +00:00
OpenClaw
a09967e098 feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-13 07:16:56 +00:00
Viktor Barzin
bef0c073d5 fix DNS health check: use system resolver instead of hardcoded 10.0.20.101
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 09:08:34 +00:00
Viktor Barzin
2fa8ba2038 [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
2026-03-08 20:03:50 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
98f4920af1 [ci skip] remember: update kubelet thresholds when changing node memory 2026-03-08 10:34:17 +00:00
Viktor Barzin
7027c49fef [ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info 2026-03-07 20:39:55 +00:00
Viktor Barzin
7cc7991ce6 [ci skip] claudeception: extract 2 skills from today's session
1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
   to SOPS+age. Covers JSON format requirement, race condition avoidance,
   CI integration, complex types, and migration sequence.

2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
   with parallel security + implementation subagents. 2-3 iterations to
   zero CRITICALs. Used successfully for the SOPS migration design.
2026-03-07 15:46:36 +00:00
Viktor Barzin
9f2ac0fd1a [ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline
AGENTS.md: added SOPS secrets management section, scripts/tg usage,
contributor onboarding steps, pull-through cache bypass notes.

CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder,
versioned tag guidance for pull-through cache.

CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys
the k8s portal when files under stacks/platform/modules/k8s-portal/files/
change on master push. Uses buildx for linux/amd64.
2026-03-07 15:37:19 +00:00
Viktor Barzin
5907e50fda [ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS)
Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user,
config path, and platform info to reflect HA OS on 192.168.8.103.
2026-03-07 14:34:44 +00:00
Viktor Barzin
8d3db35b5e [ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00
Viktor Barzin
c170351e77 [ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.

Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).

Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
bcbe8b23b4 [ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent
- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense,
  home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash)
  for autonomous health checks without consuming main context
2026-03-06 23:17:40 +00:00
Viktor Barzin
e6234d4683 [ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki 2026-03-06 21:05:21 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
61a532054e [ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module 2026-03-02 02:04:47 +00:00
Viktor Barzin
de598996f1 [ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00
Viktor Barzin
53be356f41 [ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables
and background merges. Covers config.d mount crash (exit code 36), CronJob
truncation workaround, and diagnostic commands.

Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery
via liveness probe pattern (nvidia-smi + app health check).
2026-03-01 21:04:19 +00:00
Viktor Barzin
ccf0b2232f [ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
f2c66f070b [ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
2026-03-01 18:14:37 +00:00
Viktor Barzin
4beadc2ca2 [ci skip] add openclaw-k8s-deployment skill from claudeception
Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes:
- wizard block required for Telegram, exec.host valid values,
- VPA resource overrides, file permissions, startup command,
- modelrelay sidecar, NFS caching strategy
2026-03-01 18:10:33 +00:00
Viktor Barzin
27e59a6af0 [ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings 2026-03-01 18:07:04 +00:00
Viktor Barzin
99ecba46db [ci skip] add Kyverno resource governance details to CLAUDE.md 2026-03-01 13:05:57 +00:00
Viktor Barzin
f7acc31d83 [ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
3ebf4557f5 [ci skip] update claude knowledge: never restart NFS, NFS export dir prereq 2026-02-28 12:20:36 +00:00
Viktor Barzin
a9a4ac37a2 [ci skip] trim CLAUDE.md: remove discoverable info, deduplicate 2026-02-23 23:10:13 +00:00
Viktor Barzin
c61c1744de [ci skip] update claude knowledge: infrastructure hardening changes
- NFS volumes now use var.nfs_server (not hardcoded IP)
- Shared infra variables documented (redis_host, postgresql_host, etc.)
- Tiers locals now generated by terragrunt.hcl, not duplicated in stacks
- Traefik security hardening documented (API, headers, rate limiting)
- Kyverno pod security policies documented (audit mode)
- Prometheus alert groups updated (Critical Services, PVPredictedFull)
- Loki retention updated to 30d, Alloy memory to 512Mi/1Gi
- Grampsweb now protected by Authentik
- MeshCentral registration disabled
2026-02-23 22:08:46 +00:00
Viktor Barzin
c8de2c4803 [ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00