Commit graph

113 commits

Author SHA1 Message Date
OpenClaw
f30c62ee5c feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
2026-03-17 16:51:02 +00:00
OpenClaw
2d5f44d1b3 feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-17 16:51:02 +00:00
Viktor Barzin
57d75d7322 fix DNS health check: use system resolver instead of hardcoded 10.0.20.101
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 09:08:34 +00:00
Viktor Barzin
4978804404
[ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
2026-03-08 20:03:50 +00:00
Viktor Barzin
407b33abd6
resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
4ccd78a0e7
[ci skip] remember: update kubelet thresholds when changing node memory 2026-03-08 10:34:17 +00:00
Viktor Barzin
f9efa902ef
[ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info 2026-03-07 20:39:55 +00:00
Viktor Barzin
933250bd7b
[ci skip] claudeception: extract 2 skills from today's session
1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
   to SOPS+age. Covers JSON format requirement, race condition avoidance,
   CI integration, complex types, and migration sequence.

2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
   with parallel security + implementation subagents. 2-3 iterations to
   zero CRITICALs. Used successfully for the SOPS migration design.
2026-03-07 15:46:36 +00:00
Viktor Barzin
20ec276750
[ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline
AGENTS.md: added SOPS secrets management section, scripts/tg usage,
contributor onboarding steps, pull-through cache bypass notes.

CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder,
versioned tag guidance for pull-through cache.

CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys
the k8s portal when files under stacks/platform/modules/k8s-portal/files/
change on master push. Uses buildx for linux/amd64.
2026-03-07 15:37:19 +00:00
Viktor Barzin
9ed12587a6
[ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS)
Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user,
config path, and platform info to reflect HA OS on 192.168.8.103.
2026-03-07 14:34:44 +00:00
Viktor Barzin
dc85c34069
[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00
Viktor Barzin
f02f2a5a4d
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.

Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).

Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
51f2b040a6
[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent
- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense,
  home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash)
  for autonomous health checks without consuming main context
2026-03-06 23:17:40 +00:00
Viktor Barzin
3eaac4b9db
[ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki 2026-03-06 21:05:21 +00:00
Viktor Barzin
1d80c49201
[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
a8e07ad930
[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
70dd172ba7
[ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module 2026-03-02 02:04:47 +00:00
Viktor Barzin
00197c931e
[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00
Viktor Barzin
f30ef660e1
[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables
and background merges. Covers config.d mount crash (exit code 36), CronJob
truncation workaround, and diagnostic commands.

Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery
via liveness probe pattern (nvidia-smi + app health check).
2026-03-01 21:04:19 +00:00
Viktor Barzin
32762a0916
[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
304b5e4b3d
[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
2026-03-01 18:14:37 +00:00
Viktor Barzin
7a467c75ae
[ci skip] add openclaw-k8s-deployment skill from claudeception
Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes:
- wizard block required for Telegram, exec.host valid values,
- VPA resource overrides, file permissions, startup command,
- modelrelay sidecar, NFS caching strategy
2026-03-01 18:10:33 +00:00
Viktor Barzin
b8b41a9408
[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings 2026-03-01 18:07:04 +00:00
Viktor Barzin
e25848be8f
[ci skip] add Kyverno resource governance details to CLAUDE.md 2026-03-01 13:05:57 +00:00
Viktor Barzin
e8bcf21127
[ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
Viktor Barzin
6e1586342c
[ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
098c0eb752
[ci skip] update claude knowledge: never restart NFS, NFS export dir prereq 2026-02-28 12:20:36 +00:00
Viktor Barzin
00d9458dec
[ci skip] trim CLAUDE.md: remove discoverable info, deduplicate 2026-02-23 23:10:13 +00:00
Viktor Barzin
c986e9bd56
[ci skip] update claude knowledge: infrastructure hardening changes
- NFS volumes now use var.nfs_server (not hardcoded IP)
- Shared infra variables documented (redis_host, postgresql_host, etc.)
- Tiers locals now generated by terragrunt.hcl, not duplicated in stacks
- Traefik security hardening documented (API, headers, rate limiting)
- Kyverno pod security policies documented (audit mode)
- Prometheus alert groups updated (Critical Services, PVPredictedFull)
- Loki retention updated to 30d, Alloy memory to 512Mi/1Gi
- Grampsweb now protected by Authentik
- MeshCentral registration disabled
2026-02-23 22:08:46 +00:00
Viktor Barzin
0eababf212
[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
860077a126
[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces
Add resource-governance/custom-quota=true label to both namespaces so
Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.
2026-02-22 23:14:53 +00:00
Viktor Barzin
cf67e02135
[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
ff66adbe9e
[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00
Viktor Barzin
512b7d08a5
[ci skip] Merge 3 Traefik skills into traefik-helm-configuration
Consolidated traefik-http3-quic, traefik-udp-cross-namespace, and
traefik-plugin-download-failure-404 into a single skill with sections
for HTTP/3 (QUIC), UDP cross-namespace routing, and plugin download
failure troubleshooting.
2026-02-22 22:09:26 +00:00
Viktor Barzin
072642d779
[ci skip] Merge 2 rewrite-body skills into traefik-rewrite-body-troubleshooting 2026-02-22 22:09:03 +00:00
Viktor Barzin
9488af2397
[ci skip] Add rewrite-body Accept header skill, update NFS skill
New skill: traefik-rewrite-body-accept-header — rewrite-body plugin
silently skips injection when request Accept header doesn't contain
text/html (curl default Accept: */* doesn't match).

Updated: k8s-nfs-mount-troubleshooting v1.1.0 — added variant for
non-root container UID permission denied on NFS writes.
2026-02-22 21:41:07 +00:00
Viktor Barzin
27bbfdc050
[ci skip] update claude knowledge: add anti-AI scraping & poison-fountain docs 2026-02-22 21:36:40 +00:00
Viktor Barzin
f1a27ed2f9
[ci skip] Add Woodpecker CI stack (WIP) and claude agents
- Add stacks/woodpecker/ with Helm-based deployment config
- Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls)
- Add NFS export entry for woodpecker
- Add .claude/agents/ definitions
2026-02-22 21:30:25 +00:00
Viktor Barzin
e051d45160
Apply only platform stack in CI (matches old pipeline scope) 2026-02-22 18:59:02 +00:00
Viktor Barzin
ea77b91c06
Update Drone CI pipeline for Terragrunt stack architecture
Default pipeline now uses terragrunt run --all to apply all stacks
instead of the broken terraform apply -target=module.kubernetes_cluster.
TLS renewal pipeline stripped of unnecessary Terraform download/init
since renew2.sh is pure shell (certbot + Cloudflare DNS).
2026-02-22 17:47:06 +00:00
Viktor Barzin
534e63c9b8
[ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
b692eb0c34
[ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
e4367854f4
[ci skip] Update CLAUDE.md for module colocation
Reflect new directory structure where service modules live inside
their stack directories (stacks/<service>/module/) instead of
modules/kubernetes/<service>/. Update file paths, adding service
instructions, and stack structure documentation.
2026-02-22 14:39:22 +00:00
Viktor Barzin
73cb696f12
[ci skip] Update CLAUDE.md for Terragrunt migration 2026-02-22 14:12:37 +00:00
Viktor Barzin
1742a79fb2
[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
a79c4bc062
[ci skip] Fix Python f-string quoting in Slack formatter 2026-02-22 01:23:56 +00:00
Viktor Barzin
6103445e53
[ci skip] Improve Slack message formatting with human-readable names and structured blocks 2026-02-22 01:16:47 +00:00
Viktor Barzin
86a60bcfe6
[ci skip] Upgrade cluster-health.sh to full 24-check version with Slack 2026-02-22 01:09:02 +00:00
Viktor Barzin
f6a3e15cd1
[ci skip] Fix Slack message formatting to use Block Kit mrkdwn 2026-02-22 01:00:48 +00:00
Viktor Barzin
a459533225
[ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00