add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts
Agents: devops-engineer, dba, security-engineer, sre, network-engineer, platform-engineer, observability-engineer, home-automation-engineer. Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status, authentik-audit, oom-investigator, resource-report, dns-check, network-health, nfs-health, truenas-status, platform-status, monitoring-health. Also: known-issues.md suppression list, cluster-health-checker port-forward fix.
parent fca4e02c54
commit ff83ec3325
24 changed files with 3153 additions and 1 deletions
@ -25,7 +25,7 @@ Run the cluster healthcheck script and interpret the results. If issues are foun
    - **Problematic pods**: `kubectl describe pod`, `kubectl logs --previous`
    - **Failed deployments**: check rollout status, events
    - **StatefulSet issues**: check pod readiness, GR status for MySQL
-   - **Prometheus alerts**: query via port-forward to prometheus-server
+   - **Prometheus alerts**: query via kubectl exec into prometheus-server
 4. Apply safe auto-fixes:
    - Delete evicted/failed pods: `kubectl delete pods -A --field-selector=status.phase=Failed`
    - Delete stale failed jobs: `kubectl delete jobs -n <ns> --field-selector=status.successful=0`
49
.claude/agents/dba.md
Normal file
@ -0,0 +1,49 @@
---
name: dba
description: Check database health — MySQL InnoDB Cluster, PostgreSQL (CNPG), SQLite. Monitor replication, backups, connections, and slow queries.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are a DBA for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

All databases — MySQL InnoDB Cluster (3 instances), PostgreSQL via CNPG, SQLite-on-NFS.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/db-health.sh` — MySQL GR + CNPG + connections
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/backup-verify.sh` — backup freshness
3. Investigate specific issues:
   - **MySQL InnoDB Cluster**: Group Replication status via `kubectl exec sts/mysql-cluster -n dbaas -- mysql -e 'SELECT * FROM performance_schema.replication_group_members'`
   - **CNPG PostgreSQL**: Cluster health via `kubectl get cluster,backup -A`
   - **Backups**: CNPG backup CRD timestamps, MySQL dump timestamps on NFS
   - **Connections**: Connection counts and slow queries
   - **iSCSI volumes**: Health for database PVCs
   - **SQLite**: WAL checkpoint status, integrity checks
4. Report findings with clear root cause analysis
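
The Group Replication query in step 3 can be reduced to a pass/fail signal. The sketch below is an illustration, not part of `db-health.sh`: it assumes the `mysql` client is run with `-N -B` (tab-separated rows, no header) and selects only `MEMBER_HOST` and `MEMBER_STATE`. The helper names `gr_online_count` and `gr_status` are hypothetical, and the expected size of 3 is taken from the domain description in this file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count ONLINE Group Replication members.
# Reads tab-separated "MEMBER_HOST<TAB>MEMBER_STATE" rows on stdin.
gr_online_count() {
  awk -F'\t' '$2 == "ONLINE" { n++ } END { print n + 0 }'
}

# Hypothetical helper: map the ONLINE count to ok/warn/fail.
# $1 = ONLINE members, $2 = expected cluster size (assumed default: 3).
gr_status() {
  local online="$1" expected="${2:-3}"
  if [ "$online" -ge "$expected" ]; then
    echo "ok"
  elif [ "$online" -gt $(( expected / 2 )) ]; then
    echo "warn"   # a majority is still ONLINE, but at least one member is missing
  else
    echo "fail"   # no majority of members is ONLINE
  fi
}

# Usage (not run here; needs cluster access):
#   kubectl --kubeconfig /Users/viktorbarzin/code/infra/config \
#     exec sts/mysql-cluster -n dbaas -- \
#     mysql -N -B -e 'SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members' \
#     | gr_online_count
```

With three members, two ONLINE still form a majority and map to `warn`; one ONLINE maps to `fail`.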

## Safe Auto-Fix

None — database operations are too risky for auto-fix. Advisory only.

## NEVER Do

- Never DROP/DELETE/TRUNCATE
- Never modify database configs
- Never restart database pods
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files

## Reference

- Read `.claude/reference/service-catalog.md` for which services use which database
46
.claude/agents/devops-engineer.md
Normal file
@ -0,0 +1,46 @@
---
name: devops-engineer
description: Check deployment rollouts, CI/CD builds, image pull errors, and post-deploy health. Use for stalled deployments, Woodpecker CI issues, or deploy verification.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are a DevOps Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

Deployments, CI/CD (Woodpecker), rollouts, Docker images, post-deploy verification.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/deploy-status.sh` to check deployment health
3. Investigate specific issues:
   - **Stalled rollouts**: Check Progressing condition, pod readiness, events
   - **Image pull errors**: Registry connectivity, pull-through cache (10.0.20.10), tag existence
   - **Woodpecker CI**: Build status via `kubectl exec` into woodpecker-server pod
   - **Post-deploy health**: Verify via Uptime Kuma (use `uptime-kuma` skill) and service endpoints
   - **DIUN**: Check for available image updates, report digest
4. Report findings with clear remediation steps

## Safe Auto-Fix

None — deployments are Terraform-owned.

## NEVER Do

- Never `kubectl apply/edit/patch`
- Never modify Terraform files
- Never rollback deployments
- Never push to git

## Reference

- Use `uptime-kuma` skill for Uptime Kuma integration
- Read `.claude/reference/service-catalog.md` for service inventory
61
.claude/agents/home-automation-engineer.md
Normal file
@ -0,0 +1,61 @@
---
name: home-automation-engineer
description: Check Home Assistant device health, Frigate NVR cameras, automations, and battery levels. Use for smart home diagnostics across ha-london and ha-sofia instances.
tools: Read, Bash, Grep, Glob
model: haiku
---

You are a Home Automation Engineer for a homelab with two Home Assistant instances.

## Your Domain

Home Assistant (london + sofia), Frigate NVR, device health, automations. These are external services on separate hardware, not K8s-managed.

## Environment

- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **HA London script**: `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py`
- **HA Sofia script**: `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant-sofia.py`

### Instances

| Instance | URL | Default? |
|----------|-----|----------|
| **ha-london** | `https://ha-london.viktorbarzin.me` | Yes |
| **ha-sofia** | `https://ha-sofia.viktorbarzin.me` | No |

- **Default**: ha-london (use unless user specifies "sofia" or "ha-sofia")
- **Aliases**: "ha" or "HA" = ha-london

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches (ha-london Uptime Kuma monitor is a known suppressed item)
2. Use existing Python scripts directly (no wrapper scripts needed):
   - `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py states` — all device states (ha-london)
   - `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant-sofia.py states` — all device states (ha-sofia)
   - `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py services` — available services
3. Check for issues:
   - **Device availability**: Look for `unavailable` or `unknown` state entities
   - **Frigate cameras**: 9 cameras on ha-sofia — check camera entity states
   - **Automations**: Review automation run history for failures
   - **Climate zones**: Temperature/HVAC status
   - **Alarm**: Security system status
   - **Battery levels**: All battery-powered devices — warn if <20%
   - **Energy**: Consumption monitoring
4. Report findings organized by instance
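
The availability and battery checks in step 3 can be sketched as a small filter. The output format of `home-assistant.py states` is not shown in this file, so the sketch assumes the states have already been reduced to tab-separated `entity_id<TAB>state` rows; that format, the helper name `ha_flag_issues`, and the rule that battery sensors have `battery` in their entity ID are all assumptions. The default threshold of 20 is the one named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag unavailable/unknown entities and low batteries.
# Reads "entity_id<TAB>state" rows on stdin (assumed format).
# $1 = battery warning threshold in percent (default 20).
ha_flag_issues() {
  local threshold="${1:-20}"
  awk -F'\t' -v t="$threshold" '
    # Entities that Home Assistant cannot currently reach or read
    $2 == "unavailable" || $2 == "unknown" { print "unavailable\t" $1; next }
    # Numeric battery readings strictly below the threshold
    $1 ~ /battery/ && $2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 < t {
      print "low-battery\t" $1 "\t" $2 "%"
    }
  '
}
```

Running the filter once per instance keeps the report organized by instance, as step 4 asks.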

## Safe Auto-Fix

None — home automation actions require user intent.

## NEVER Do

- Never turn off alarm system
- Never unlock doors
- Never change climate settings
- Never disable automations without explicit request
- Never expose API tokens

## Reference

- Use `home-assistant` skill for HA interaction patterns
54
.claude/agents/network-engineer.md
Normal file
@ -0,0 +1,54 @@
---
name: network-engineer
description: Check pfSense firewall, DNS (Technitium + Cloudflare), VPN (WireGuard/Headscale), routing, and MetalLB. Use for connectivity issues, DNS problems, or network diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are a Network Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

pfSense firewall, DNS (Technitium + Cloudflare), VPN (WireGuard/Headscale), routing, MetalLB.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **pfSense**: Access via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py`
- **VLANs**: 10.0.10.0/24 (storage), 10.0.20.0/24 (k8s), 192.168.1.0/24 (management)

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/dns-check.sh` — DNS resolution verification
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/network-health.sh` — pfSense + VPN + MetalLB
3. Investigate specific issues:
   - **pfSense**: System health via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py status`
   - **Firewall states**: Connection table via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py pfctl`
   - **DNS**: Resolution for all services (internal `.lan` + external `.me`)
   - **Technitium**: DNS server health and zone status
   - **WireGuard/Headscale**: Tunnel status via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py wireguard`
   - **Routing**: Between VLANs
   - **MetalLB**: L2 advertisement health
4. Report findings with clear root cause analysis

## Safe Auto-Fix

None — network changes are high-blast-radius.

## NEVER Do

- Never modify firewall rules
- Never change DNS records (Terraform-owned)
- Never modify VPN configs
- Never restart pfSense services
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files

## Reference

- Use `pfsense` skill for pfSense access patterns
- Read `k8s-ndots` skill for DNS search domain issues
49
.claude/agents/observability-engineer.md
Normal file
@ -0,0 +1,49 @@
---
name: observability-engineer
description: Check monitoring stack health (Prometheus, Grafana, Alertmanager, Uptime Kuma, SNMP exporters). Use for alert issues, monitoring problems, or dashboard diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are an Observability Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

Prometheus, Grafana, Alertmanager, Uptime Kuma, SNMP exporters. Note: Loki and Alloy are NOT deployed — log queries use `kubectl logs`.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic script:
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/monitoring-health.sh` — monitoring pod health, alerts, Grafana datasources, SNMP exporters
3. Investigate specific issues:
   - **Monitoring stack health**: Verify Prometheus (`deploy/prometheus-server`), Alertmanager (`sts/prometheus-alertmanager`), Grafana (`deploy/grafana`) pods are running and responsive
   - **Alert analysis**: Why alerts are firing or not firing — check Alertmanager routing, silences, inhibitions
   - **Grafana**: Datasource connectivity via `kubectl exec deploy/grafana -n monitoring -- curl -s 'http://localhost:3000/api/datasources'`
   - **SNMP exporters**: snmp-exporter (UPS), idrac-redfish-exporter (iDRAC), proxmox-exporter scraping status
   - **Prometheus storage**: Usage and retention
   - **Alert routing**: Receivers, matchers, inhibitions
   - **Uptime Kuma**: Use the `uptime-kuma` skill for monitor management
4. Report findings with clear root cause analysis

## Safe Auto-Fix

None — monitoring config is Terraform-owned.

## NEVER Do

- Never modify Prometheus rules, Grafana dashboards, or alert configs directly
- Never `kubectl apply/edit/patch`
- Never commit secrets
- Never push to git or modify Terraform files

## Reference

- Use `uptime-kuma` skill for Uptime Kuma management
- Use `cluster-health` skill for quick cluster triage
65
.claude/agents/platform-engineer.md
Normal file
@ -0,0 +1,65 @@
---
name: platform-engineer
description: Check K8s platform health, NFS/iSCSI storage, Proxmox VMs, Traefik, Kyverno, VPA. Use for node issues, storage problems, or platform-level diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are a Platform Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

K8s platform (Traefik, MetalLB, Kyverno, VPA), Proxmox VMs, NFS/iSCSI storage, node management.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **K8s nodes**: k8s-master (10.0.20.100), k8s-node1 (10.0.20.101), k8s-node2 (10.0.20.102), k8s-node3 (10.0.20.103), k8s-node4 (10.0.20.104) — SSH user: `wizard`
- **TrueNAS**: `ssh root@10.0.10.15`
- **Proxmox**: `ssh root@192.168.1.127`

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts to gather data:
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/nfs-health.sh` — NFS mount health across all nodes
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/truenas-status.sh` — ZFS pools, SMART, replication, iSCSI
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/platform-status.sh` — Traefik, Kyverno, VPA, pull-through cache, Proxmox
3. Investigate specific issues:
   - NFS: SSH to affected nodes, check mount status, detect stale file handles
   - TrueNAS: ZFS pool status, SMART health, replication tasks via SSH
   - PVCs: Check pending PVCs, unbound PVs, capacity usage
   - iSCSI: democratic-csi volume health
   - Traefik: IngressRoute health, middleware status
   - Kyverno: Resource governance (LimitRange + ResourceQuota per namespace)
   - VPA/Goldilocks: Status and unexpected updateMode settings
   - Proxmox: Host resources via SSH
   - Node conditions: kubelet status
   - Pull-through cache: Registry health (10.0.20.10)
4. Report findings with clear root cause analysis

## Proactive Mode

Daily NFS + TrueNAS health check — storage failures cascade across all 70+ services.

## Safe Auto-Fix

None. NFS remount via SSH can hang on dead TrueNAS; PV cleanup destroys data.

## NEVER Do

- Never restart NFS on TrueNAS
- Never delete datasets/pools/snapshots
- Never modify PVCs via kubectl
- Never delete PVs
- Never `kubectl apply/edit/patch`
- Never change Kyverno policies directly
- Never push to git or modify Terraform files

## Reference

- Read `.claude/reference/patterns.md` for governance tables
- Read `.claude/reference/proxmox-inventory.md` for VM details
- Use `extend-vm-storage` skill for storage extension workflow
61
.claude/agents/security-engineer.md
Normal file
@ -0,0 +1,61 @@
---
name: security-engineer
description: Check TLS certs, CrowdSec WAF, Authentik SSO, Kyverno policies, Snort IDS, and Cloudflare tunnel. Use for security audits, cert expiry, or access control issues.
tools: Read, Bash, Grep, Glob
model: sonnet
---

You are a Security Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

TLS certs, CrowdSec WAF, Authentik SSO, Kyverno policies, Snort IDS, Cloudflare tunnel.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **pfSense**: Access via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py`

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/tls-check.sh` — cert expiry scan
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/crowdsec-status.sh` — CrowdSec LAPI/agent health
   - `bash /Users/viktorbarzin/code/infra/.claude/scripts/authentik-audit.sh` — user/group audit
3. Investigate specific issues:
   - **TLS certs**: Check in-cluster `kubernetes.io/tls` secrets + `secrets/fullchain.pem`, alert <14 days to expiry
   - **cert-manager**: Certificate/CertificateRequest/Order CRDs for renewal failures
   - **CrowdSec**: LAPI health via `kubectl exec` + `cscli`, agent DaemonSet, recent decisions
   - **Authentik**: Users/groups via `kubectl exec deploy/goauthentik-server -n authentik`, outpost health
   - **Snort IDS**: Review alerts via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py snort`
   - **Kyverno**: Policies in expected state (Audit mode, not Enforce)
   - **Cloudflare tunnel**: Pod health
   - **Sealed-secrets**: Controller operational
4. Report findings with clear remediation steps
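
The 14-day expiry rule in step 3 can be sketched as a pure function over epoch timestamps. This is an illustration, not the contents of `tls-check.sh`: the helper name `cert_status` and its output wording are hypothetical, and it assumes the certificate's `notAfter` date has already been converted to seconds since the epoch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a certificate by time left until expiry.
# $1 = notAfter as epoch seconds
# $2 = current time as epoch seconds (defaults to now)
# $3 = warning threshold in days (default 14, the threshold named above)
cert_status() {
  local not_after_epoch="$1" now_epoch="${2:-$(date +%s)}" threshold_days="${3:-14}"
  local remaining=$(( not_after_epoch - now_epoch ))
  if [ "$remaining" -le 0 ]; then
    echo "fail expired"
  elif [ "$remaining" -lt $(( threshold_days * 86400 )) ]; then
    echo "warn $(( remaining / 86400 ))d left"   # whole days, rounded down
  else
    echo "ok $(( remaining / 86400 ))d left"
  fi
}
```

For a PEM file on disk, `openssl x509 -checkend 1209600 -noout -in secrets/fullchain.pem` gives the same 14-day answer through its exit status without any date parsing.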

## Proactive Mode

Daily TLS cert expiry check only. All other checks on-demand.

## Safe Auto-Fix

Delete stale CrowdSec machine registrations via `cscli machines delete` — only machines not seen in >7 days. Always run `cscli machines list` first and show what would be deleted before acting. Reversible — agents re-register on next heartbeat.

## NEVER Do

- Never read/expose raw secret values
- Never modify CrowdSec config (Terraform-owned)
- Never create/delete Authentik users without explicit request
- Never modify firewall rules
- Never disable security policies
- Never commit secrets
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files

## Reference

- Use `pfsense` skill for pfSense access patterns
- Read `.claude/reference/authentik-state.md` for Authentik configuration
68
.claude/agents/sre.md
Normal file
@ -0,0 +1,68 @@
---
name: sre
description: Investigate OOMKilled pods, capacity issues, and complex multi-system incidents. The escalation point when specialist agents aren't enough.
tools: Read, Bash, Grep, Glob
model: opus
---

You are an SRE / On-Call engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.

## Your Domain

Incident response, OOM investigation, capacity planning, root cause analysis. You are the escalation point when specialist agents aren't enough.

## Environment

- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **K8s nodes**: k8s-master (10.0.20.100), k8s-node1-4 (10.0.20.101-104) — SSH user: `wizard`

## Two Modes

### Mode 1 — OOM/Capacity (most common)

1. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/oom-investigator.sh` to find OOMKilled pods
2. For each OOMKilled pod:
   - Identify the container that was killed
   - Check LimitRange defaults in the namespace
   - Check actual usage vs limit
   - Read Goldilocks VPA recommendations
   - Compare to Terraform-defined resources in the stack
3. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/resource-report.sh` for cluster-wide capacity
4. Produce actionable Terraform snippets for resource fixes

### Mode 2 — Incident Response (rare, complex)

1. **Pre-check**: Verify monitoring pods are running (`kubectl get pods -n monitoring`). If monitoring is down, fall back to kubectl events/logs and SSH-based investigation.
2. Query Prometheus via `kubectl exec deploy/prometheus-server -n monitoring -- wget -qO- 'http://localhost:9090/api/v1/query?query=...'`
3. Query Alertmanager via `kubectl exec sts/prometheus-alertmanager -n monitoring -- wget -qO- 'http://localhost:9093/api/v2/...'`
4. Aggregate logs via `kubectl logs` across pods/namespaces (Loki is NOT deployed)
5. Correlate across: pod events, node conditions, pfSense logs, CrowdSec decisions
6. SSH to nodes for kubelet logs (`journalctl -u kubelet`), dmesg, systemd status
7. Produce incident reports with root cause + remediation
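
The Prometheus query in step 2 passes PromQL in a URL, so characters such as `{`, `"` and spaces need percent-encoding first. A minimal sketch follows, assuming ASCII-only queries; the helper names `urlencode` and `prom_query` are hypothetical, and `prom_query` simply wraps the `kubectl exec ... wget` command quoted in step 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode an ASCII string for use in a URL query.
urlencode() {
  local s="$1" out="" c rest
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                    # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;            # everything else: %HH
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical wrapper around the command from step 2 (needs cluster access).
# $1 = PromQL expression
prom_query() {
  kubectl --kubeconfig /Users/viktorbarzin/code/infra/config \
    exec deploy/prometheus-server -n monitoring -- \
    wget -qO- "http://localhost:9090/api/v1/query?query=$(urlencode "$1")"
}
```

For example, `prom_query 'up{job="node"} == 0'` would request `query=up%7Bjob%3D%22node%22%7D%20%3D%3D%200`; the job label here is made up for illustration.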

## Workflow

1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Determine which mode applies based on the user's request
3. Run appropriate scripts and investigations
4. Report findings with clear root cause analysis and actionable remediation

## Safe Auto-Fix

None — purely investigative.

## NEVER Do

- Never `kubectl apply/edit/patch`
- Never modify any files
- Never restart services
- Never push to git
- Never commit secrets

## Reference

- All other agents' scripts are available in `.claude/scripts/`
- Read `.claude/reference/patterns.md` for governance tables
- Read `.claude/reference/proxmox-inventory.md` for VM details
12
.claude/reference/known-issues.md
Normal file
@ -0,0 +1,12 @@
# Known Issues (suppress in all agents)

## Permanent
- ha-london Uptime Kuma monitor down — external HA on Raspberry Pi, not in this cluster
- PVFillingUp for navidrome-music — Synology NAS volume, threshold is 95%, expected

## Intermittent
- CrowdSec Helm release stuck in pending-upgrade — known issue, workaround: helm rollback
- Resource usage >80% on nodes — WARN only, overcommit is by design (2x LimitRange ratio)

## How agents consume this file
Each agent definition includes: "Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches."
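
The agents apply this suppression by reading the file, but the same rule can be sketched mechanically. The sketch below assumes one finding per line and treats a finding as suppressed when it contains, case-insensitively, the text that precedes the ` — ` separator in a bullet above. The helper name `suppress_known` and that matching rule are assumptions, not something the agent definitions specify.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop findings that match a known-issue title.
# $1 = path to known-issues.md; findings are read from stdin, one per line.
suppress_known() {
  local known_file="$1" patterns
  # Use the text between "- " and " — " of each bullet as a fixed-string pattern.
  patterns=$(sed -n 's/^- \(.*\) — .*$/\1/p' "$known_file")
  if [ -z "$patterns" ]; then
    cat    # no patterns: an empty grep pattern would match (and drop) every line
  else
    # Newline-separated patterns are treated as alternatives by grep -F.
    grep -viF -- "$patterns" || true   # grep exits 1 when every finding is suppressed
  fi
}
```

Substring matching keeps the sketch short; findings worded differently from the bullet titles would still need the agent's judgment.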
134
.claude/scripts/authentik-audit.sh
Executable file
@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
AGENT="authentik-audit"
DRY_RUN=false
NAMESPACE="authentik"

for arg in "$@"; do
  case "$arg" in
    --dry-run) DRY_RUN=true ;;
  esac
done

checks=()

add_check() {
  local name="$1" status="$2" message="$3"
  checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

find_authentik_pod() {
  local pod
  pod=$($KUBECTL get pods -n "$NAMESPACE" -l app.kubernetes.io/name=authentik,app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
    pod=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "goauthentik-server\|authentik-server" | grep "Running" | head -1 | awk '{print $1}') || true
  echo "$pod"
}

check_server_health() {
  if $DRY_RUN; then
    add_check "authentik-server" "ok" "dry-run: would check goauthentik-server pod health"
    return
  fi

  local pods
  pods=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "authentik") || {
    add_check "authentik-server" "fail" "No Authentik pods found in namespace ${NAMESPACE}"
    return
  }

  # grep -c prints 0 itself when nothing matches (and exits 1), so use "|| true";
  # "|| echo 0" would append a second "0" and break the integer test below.
  local not_running
  not_running=$(echo "$pods" | grep -v "Running" | grep -v "Completed" | grep -c "." 2>/dev/null || true)

  local total
  total=$(echo "$pods" | grep -c "." 2>/dev/null || true)

  if [ "$not_running" -gt 0 ]; then
    add_check "authentik-server" "warn" "${not_running}/${total} Authentik pod(s) not running"
  else
    add_check "authentik-server" "ok" "All ${total} Authentik pod(s) running"
  fi
}

check_outposts() {
  if $DRY_RUN; then
    add_check "authentik-outposts" "ok" "dry-run: would check Authentik outpost pods"
    return
  fi

  local outpost_pods
  outpost_pods=$($KUBECTL get pods -n "$NAMESPACE" -l app.kubernetes.io/managed-by=goauthentik.io --no-headers 2>/dev/null || true)
  # A label query that matches nothing still exits 0 with empty output,
  # so fall back to the name match on empty output rather than on failure.
  if [ -z "$outpost_pods" ]; then
    outpost_pods=$($KUBECTL get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -i "outpost" || true)
  fi

  if [ -z "$outpost_pods" ]; then
    add_check "authentik-outposts" "warn" "No outpost pods found"
    return
  fi

  # grep -c prints 0 itself on no match, so "|| true" avoids a doubled "0".
  local total not_running
  total=$(echo "$outpost_pods" | grep -c "." 2>/dev/null || true)
  not_running=$(echo "$outpost_pods" | grep -v "Running" | grep -c "." 2>/dev/null || true)

  if [ "$not_running" -gt 0 ]; then
    add_check "authentik-outposts" "warn" "${not_running}/${total} outpost pod(s) not running"
  else
    add_check "authentik-outposts" "ok" "All ${total} outpost pod(s) running"
  fi
}

check_user_count() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "authentik-users" "ok" "dry-run: would check user count via ak CLI"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local pod
|
||||||
|
pod=$(find_authentik_pod)
|
||||||
|
|
||||||
|
if [ -z "$pod" ]; then
|
||||||
|
add_check "authentik-users" "warn" "No Authentik server pod found to query users"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Use the ak CLI to get user count
|
||||||
|
local user_output
|
||||||
|
user_output=$($KUBECTL exec -n "$NAMESPACE" "$pod" -- ak user list 2>/dev/null) || {
|
||||||
|
# Fallback: try management command
|
||||||
|
user_output=$($KUBECTL exec -n "$NAMESPACE" "$pod" -- python -c "
|
||||||
|
import django; django.setup()
|
||||||
|
from authentik.core.models import User
|
||||||
|
print(f'total={User.objects.count()} active={User.objects.filter(is_active=True).count()}')
|
||||||
|
" 2>/dev/null) || {
|
||||||
|
add_check "authentik-users" "warn" "Could not query user count from Authentik"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
local user_count
|
||||||
|
if echo "$user_output" | grep -q "total="; then
|
||||||
|
user_count=$(echo "$user_output" | grep "total=" | sed 's/.*total=\([0-9]*\).*/\1/')
|
||||||
|
local active_count
|
||||||
|
active_count=$(echo "$user_output" | grep "active=" | sed 's/.*active=\([0-9]*\).*/\1/')
|
||||||
|
add_check "authentik-users" "ok" "${user_count} total users, ${active_count} active"
|
||||||
|
else
|
||||||
|
# Count lines of output as fallback
|
||||||
|
user_count=$(echo "$user_output" | wc -l | tr -d ' ')
|
||||||
|
add_check "authentik-users" "ok" "User query returned ${user_count} lines of output"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
check_server_health
|
||||||
|
check_outposts
|
||||||
|
check_user_count
|
||||||
|
|
||||||
|
# Output JSON
|
||||||
|
overall="ok"
|
||||||
|
for c in "${checks[@]}"; do
|
||||||
|
s=$(echo "$c" | jq -r '.status')
|
||||||
|
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||||
|
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||||
|
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||||
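The roll-up loop at the end of this script gives `fail` precedence over `warn`, and `warn` over `ok`. A minimal sketch of the same precedence as a standalone function, assuming the statuses are passed as plain arguments instead of being parsed out of JSON (`rollup_status` is a hypothetical name, not part of the commit):

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a list of check statuses to one overall status.
# Precedence is fail > warn > ok; an empty list is treated as ok.
rollup_status() {
  local overall="ok" s
  for s in "$@"; do
    case "$s" in
      fail) overall="fail"; break ;;  # nothing outranks fail, stop early
      warn) overall="warn" ;;         # keep scanning in case a fail follows
    esac
  done
  echo "$overall"
}
```

Called as `rollup_status ok warn ok`, it prints the single status a caller would place in the top-level `status` field.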
247
.claude/scripts/backup-verify.sh
Executable file
|
|
@ -0,0 +1,247 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||||
|
DRY_RUN=false
|
||||||
|
AGENT="backup-verify"
|
||||||
|
|
||||||
|
for arg in "$@"; do
|
||||||
|
case "$arg" in
|
||||||
|
--dry-run) DRY_RUN=true ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
CHECKS="[]"
|
||||||
|
|
||||||
|
add_check() {
|
||||||
|
local name="$1" status="$2" message="$3"
|
||||||
|
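  CHECKS=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
# Values arrive as arguments instead of being interpolated into the Python
# source, so quotes or backslashes in a message cannot break the script.
name, status, message = sys.argv[1:4]
checks.append({'name': name, 'status': status, 'message': message})
json.dump(checks, sys.stdout)
" "$name" "$status" "$message")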
|
||||||
|
}
|
||||||
|
|
||||||
|
# CNPG backup freshness via backup CRDs
|
||||||
|
check_cnpg_backups() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "cnpg-backups" "ok" "DRY RUN: would check CNPG backup CRD timestamps"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local backups
|
||||||
|
backups=$($KUBECTL get backup.postgresql.cnpg.io --all-namespaces -o json 2>/dev/null) || {
|
||||||
|
# Try scheduledbackup as well
|
||||||
|
local scheduled
|
||||||
|
scheduled=$($KUBECTL get scheduledbackup.postgresql.cnpg.io --all-namespaces --no-headers 2>/dev/null) || true
|
||||||
|
if [ -n "$scheduled" ]; then
|
||||||
|
add_check "cnpg-backups" "warn" "ScheduledBackups exist but no Backup CRDs found — backups may not have run yet"
|
||||||
|
else
|
||||||
|
add_check "cnpg-backups" "warn" "No CNPG Backup CRDs found"
|
||||||
|
fi
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local report
|
||||||
|
report=$(echo "$backups" | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
items = data.get('items', [])
|
||||||
|
if not items:
|
||||||
|
print('WARN|No CNPG backups found')
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
# Group by cluster, find latest backup per cluster
|
||||||
|
clusters = {}
|
||||||
|
for b in items:
|
||||||
|
ns = b['metadata']['namespace']
|
||||||
|
cluster = b.get('spec', {}).get('cluster', {}).get('name', 'unknown')
|
||||||
|
key = f'{ns}/{cluster}'
|
||||||
|
phase = b.get('status', {}).get('phase', 'unknown')
|
||||||
|
started = b.get('status', {}).get('startedAt', '')
|
||||||
|
stopped = b.get('status', {}).get('stoppedAt', '')
|
||||||
|
if key not in clusters or stopped > clusters[key].get('stopped', ''):
|
||||||
|
clusters[key] = {'phase': phase, 'started': started, 'stopped': stopped}
|
||||||
|
|
||||||
|
results = []
|
||||||
|
all_ok = True
|
||||||
|
now = datetime.now(timezone.utc)
|
||||||
|
for key, info in sorted(clusters.items()):
|
||||||
|
age_str = 'unknown'
|
||||||
|
if info['stopped']:
|
||||||
|
try:
|
||||||
|
stopped_dt = datetime.fromisoformat(info['stopped'].replace('Z', '+00:00'))
|
||||||
|
age = now - stopped_dt
|
||||||
|
age_hours = age.total_seconds() / 3600
|
||||||
|
age_str = f'{age_hours:.1f}h ago'
|
||||||
|
if age_hours > 48:
|
||||||
|
all_ok = False
|
||||||
|
except Exception:
|
||||||
|
age_str = info['stopped']
|
||||||
|
else:
|
||||||
|
all_ok = False
|
||||||
|
age_str = 'no completion time'
|
||||||
|
|
||||||
|
phase = info['phase']
|
||||||
|
if phase not in ('completed', 'Completed'):
|
||||||
|
all_ok = False
|
||||||
|
results.append(f'{key}: {phase} ({age_str})')
|
||||||
|
|
||||||
|
status = 'OK' if all_ok else 'WARN'
|
||||||
|
print(f'{status}|' + '; '.join(results))
|
||||||
|
" 2>/dev/null) || report="WARN|Failed to parse CNPG backups"
|
||||||
|
|
||||||
|
local status_prefix="${report%%|*}"
|
||||||
|
local detail="${report#*|}"
|
||||||
|
|
||||||
|
if [ "$status_prefix" = "OK" ]; then
|
||||||
|
add_check "cnpg-backups" "ok" "$detail"
|
||||||
|
else
|
||||||
|
add_check "cnpg-backups" "warn" "$detail"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# CNPG ScheduledBackup health
|
||||||
|
check_cnpg_scheduled() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "cnpg-scheduled-backups" "ok" "DRY RUN: would check CNPG ScheduledBackup status"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local scheduled
|
||||||
|
scheduled=$($KUBECTL get scheduledbackup.postgresql.cnpg.io --all-namespaces -o json 2>/dev/null) || {
|
||||||
|
add_check "cnpg-scheduled-backups" "ok" "No CNPG ScheduledBackups configured"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local report
|
||||||
|
report=$(echo "$scheduled" | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
items = data.get('items', [])
|
||||||
|
if not items:
|
||||||
|
print('OK|No ScheduledBackups defined')
|
||||||
|
sys.exit(0)
|
||||||
|
results = []
|
||||||
|
all_ok = True
|
||||||
|
for sb in items:
|
||||||
|
ns = sb['metadata']['namespace']
|
||||||
|
name = sb['metadata']['name']
|
||||||
|
schedule = sb.get('spec', {}).get('schedule', 'unknown')
|
||||||
|
suspend = sb.get('spec', {}).get('suspend', False)
|
||||||
|
last = sb.get('status', {}).get('lastScheduleTime', 'never')
|
||||||
|
if suspend:
|
||||||
|
all_ok = False
|
||||||
|
results.append(f'{ns}/{name}: SUSPENDED schedule={schedule}')
|
||||||
|
else:
|
||||||
|
results.append(f'{ns}/{name}: active schedule={schedule} last={last}')
|
||||||
|
status = 'OK' if all_ok else 'WARN'
|
||||||
|
print(f'{status}|' + '; '.join(results))
|
||||||
|
" 2>/dev/null) || report="WARN|Failed to parse ScheduledBackups"
|
||||||
|
|
||||||
|
local status_prefix="${report%%|*}"
|
||||||
|
local detail="${report#*|}"
|
||||||
|
|
||||||
|
if [ "$status_prefix" = "OK" ]; then
|
||||||
|
add_check "cnpg-scheduled-backups" "ok" "$detail"
|
||||||
|
else
|
||||||
|
add_check "cnpg-scheduled-backups" "warn" "$detail"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# MySQL backup file freshness on NFS
|
||||||
|
check_mysql_backups() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "mysql-backups" "ok" "DRY RUN: would check MySQL backup file timestamps"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check for MySQL backup files via a pod that has NFS mounted, or via known backup job
|
||||||
|
local backup_pods
|
||||||
|
backup_pods=$($KUBECTL get pods --all-namespaces -l app=mysql-backup -o name 2>/dev/null | head -1) || true
|
||||||
|
if [ -z "$backup_pods" ]; then
|
||||||
|
backup_pods=$($KUBECTL get cronjobs --all-namespaces --no-headers 2>/dev/null | grep -i "mysql.*backup\|backup.*mysql" | awk '{print $1"/"$2}') || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$backup_pods" ]; then
|
||||||
|
# Try checking via TrueNAS SSH for NFS backup files
|
||||||
|
local nfs_check
|
||||||
|
nfs_check=$(ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@10.0.10.15 \
|
||||||
|
"find /mnt/main -name '*.sql.gz' -o -name '*.sql' -o -name '*mysql*backup*' 2>/dev/null | head -5" 2>/dev/null) || true
|
||||||
|
|
||||||
|
if [ -n "$nfs_check" ]; then
|
||||||
|
local ages
|
||||||
|
ages=$(ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@10.0.10.15 \
|
||||||
|
"for f in $(echo "$nfs_check" | tr '\n' ' '); do stat -f '%m %N' \"\$f\" 2>/dev/null || stat -c '%Y %n' \"\$f\" 2>/dev/null; done" 2>/dev/null) || true
|
||||||
|
if [ -n "$ages" ]; then
|
||||||
|
add_check "mysql-backups" "ok" "Found MySQL backup files on NFS: $(echo "$nfs_check" | tr '\n' '; ')"
|
||||||
|
else
|
||||||
|
add_check "mysql-backups" "warn" "Found backup files but cannot determine age: $(echo "$nfs_check" | tr '\n' '; ')"
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
add_check "mysql-backups" "warn" "No MySQL backup CronJobs or backup files found"
|
||||||
|
fi
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check CronJob last successful run
|
||||||
|
local cronjob_status
|
||||||
|
cronjob_status=$($KUBECTL get cronjobs --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
results = []
|
||||||
|
for cj in data.get('items', []):
|
||||||
|
ns = cj['metadata']['namespace']
|
||||||
|
name = cj['metadata']['name']
|
||||||
|
if 'mysql' not in name.lower() and 'backup' not in name.lower():
|
||||||
|
continue
|
||||||
|
schedule = cj.get('spec', {}).get('schedule', 'unknown')
|
||||||
|
last_time = cj.get('status', {}).get('lastScheduleTime', '')
|
||||||
|
last_success = cj.get('status', {}).get('lastSuccessfulTime', '')
|
||||||
|
suspend = cj.get('spec', {}).get('suspend', False)
|
||||||
|
|
||||||
|
age_str = 'never'
|
||||||
|
if last_success:
|
||||||
|
try:
|
||||||
|
dt = datetime.fromisoformat(last_success.replace('Z', '+00:00'))
|
||||||
|
age = datetime.now(timezone.utc) - dt
|
||||||
|
age_str = f'{age.total_seconds()/3600:.1f}h ago'
|
||||||
|
except Exception:
|
||||||
|
age_str = last_success
|
||||||
|
|
||||||
|
status = 'suspended' if suspend else 'active'
|
||||||
|
results.append(f'{ns}/{name}: {status} schedule={schedule} last_success={age_str}')
|
||||||
|
|
||||||
|
if results:
|
||||||
|
print('; '.join(results))
|
||||||
|
else:
|
||||||
|
print('No MySQL/backup CronJobs found')
|
||||||
|
" 2>/dev/null) || cronjob_status="Failed to check CronJobs"
|
||||||
|
|
||||||
|
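  # Do not report "ok" when the CronJob query itself failed.
  if [ "$cronjob_status" = "Failed to check CronJobs" ]; then
    add_check "mysql-backups" "warn" "$cronjob_status"
  else
    add_check "mysql-backups" "ok" "$cronjob_status"
  fi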
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run all checks
|
||||||
|
check_cnpg_backups
|
||||||
|
check_cnpg_scheduled
|
||||||
|
check_mysql_backups
|
||||||
|
|
||||||
|
# Determine overall status
|
||||||
|
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
checks = json.load(sys.stdin)
|
||||||
|
statuses = [c['status'] for c in checks]
|
||||||
|
if 'fail' in statuses:
|
||||||
|
print('fail')
|
||||||
|
elif 'warn' in statuses:
|
||||||
|
print('warn')
|
||||||
|
else:
|
||||||
|
print('ok')
|
||||||
|
")
|
||||||
|
|
||||||
|
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||||
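The CNPG check in this script flags a backup whose completion time is more than 48 hours old. A minimal sketch of that freshness rule in plain bash, assuming both timestamps are already available as epoch seconds (`backup_age_status` and its argument order are hypothetical, not part of the commit):

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a backup by age.
# Arguments: completion time and current time as epoch seconds, plus the
# maximum acceptable age in hours (defaults to 48, as in the CNPG check).
backup_age_status() {
  local stopped_epoch="$1" now_epoch="$2" max_hours="${3:-48}"
  local age_secs=$(( now_epoch - stopped_epoch ))
  # Compare in seconds so a backup 48h and 1s old is already stale;
  # the printed hour value is truncated, not rounded.
  if [ "$age_secs" -gt $(( max_hours * 3600 )) ]; then
    echo "warn $(( age_secs / 3600 ))h"
  else
    echo "ok $(( age_secs / 3600 ))h"
  fi
}
```

A caller would pass `$(date -u +%s)` as the current time. Converting a `stoppedAt` timestamp to epoch seconds is left out here because the `date` flags for parsing differ between GNU and BSD systems.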
166
.claude/scripts/crowdsec-status.sh
Executable file
|
|
@ -0,0 +1,166 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||||
|
AGENT="crowdsec-status"
|
||||||
|
DRY_RUN=false
|
||||||
|
|
||||||
|
for arg in "$@"; do
|
||||||
|
case "$arg" in
|
||||||
|
--dry-run) DRY_RUN=true ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
checks=()
|
||||||
|
|
||||||
|
add_check() {
|
||||||
|
local name="$1" status="$2" message="$3"
|
||||||
|
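  # Build the object with jq so quotes or backslashes in a message stay valid JSON.
  checks+=("$(jq -cn --arg name "$name" --arg status "$status" --arg message "$message" '{name: $name, status: $status, message: $message}')")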
|
||||||
|
}
|
||||||
|
|
||||||
|
find_crowdsec_namespace() {
|
||||||
|
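  # kubectl exits 0 with empty output when nothing matches, so a plain "||"
  # chain never reaches the fallbacks; check for empty output explicitly.
  local ns
  ns=$($KUBECTL get pods -A -l app.kubernetes.io/name=crowdsec --no-headers 2>/dev/null | head -1 | awk '{print $1}') || true
  if [ -z "$ns" ]; then
    ns=$($KUBECTL get pods -A --no-headers 2>/dev/null | grep -i crowdsec | head -1 | awk '{print $1}') || true
  fi
  echo "${ns:-crowdsec}"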
|
||||||
|
}
|
||||||
|
|
||||||
|
check_lapi_health() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "crowdsec-lapi" "ok" "dry-run: would check CrowdSec LAPI pod health"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local ns
|
||||||
|
ns=$(find_crowdsec_namespace)
|
||||||
|
|
||||||
|
local lapi_pod
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi --no-headers 2>/dev/null | head -1) || true
|
||||||
|
|
||||||
|
if [ -z "$lapi_pod" ]; then
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1) || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$lapi_pod" ]; then
|
||||||
|
add_check "crowdsec-lapi" "fail" "No CrowdSec LAPI pod found in namespace ${ns}"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local pod_name status
|
||||||
|
pod_name=$(echo "$lapi_pod" | awk '{print $1}')
|
||||||
|
status=$(echo "$lapi_pod" | awk '{print $3}')
|
||||||
|
|
||||||
|
if [ "$status" != "Running" ]; then
|
||||||
|
add_check "crowdsec-lapi" "fail" "LAPI pod ${pod_name} is ${status}"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
add_check "crowdsec-lapi" "ok" "LAPI pod ${pod_name} is Running"
|
||||||
|
}
|
||||||
|
|
||||||
|
check_cscli_metrics() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "crowdsec-metrics" "ok" "dry-run: would run cscli metrics via kubectl exec"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local ns
|
||||||
|
ns=$(find_crowdsec_namespace)
|
||||||
|
|
||||||
|
local lapi_pod
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1 | awk '{print $1}') || true
|
||||||
|
|
||||||
|
if [ -z "$lapi_pod" ]; then
|
||||||
|
add_check "crowdsec-metrics" "warn" "No LAPI pod found to run cscli metrics"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local metrics_output
|
||||||
|
metrics_output=$($KUBECTL exec -n "$ns" "$lapi_pod" -- cscli metrics 2>/dev/null) || {
|
||||||
|
add_check "crowdsec-metrics" "warn" "Failed to run cscli metrics on ${lapi_pod}"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
add_check "crowdsec-metrics" "ok" "cscli metrics returned successfully"
|
||||||
|
}
|
||||||
|
|
||||||
|
check_decisions() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "crowdsec-decisions" "ok" "dry-run: would check cscli decisions list"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local ns
|
||||||
|
ns=$(find_crowdsec_namespace)
|
||||||
|
|
||||||
|
local lapi_pod
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/name=crowdsec,app.kubernetes.io/component=lapi -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || \
|
||||||
|
lapi_pod=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i "crowdsec.*lapi" | head -1 | awk '{print $1}') || true
|
||||||
|
|
||||||
|
if [ -z "$lapi_pod" ]; then
|
||||||
|
add_check "crowdsec-decisions" "warn" "No LAPI pod found to check decisions"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local decisions
|
||||||
|
decisions=$($KUBECTL exec -n "$ns" "$lapi_pod" -- cscli decisions list -o json 2>/dev/null) || {
|
||||||
|
    add_check "crowdsec-decisions" "warn" "Failed to query decisions on ${lapi_pod}"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local count
|
||||||
|
count=$(echo "$decisions" | jq 'if type == "array" then length else 0 end' 2>/dev/null || echo "0")
|
||||||
|
|
||||||
|
if [ "$count" -gt 0 ]; then
|
||||||
|
add_check "crowdsec-decisions" "ok" "${count} active decision(s)"
|
||||||
|
else
|
||||||
|
add_check "crowdsec-decisions" "ok" "No active decisions"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
check_agent_daemonset() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "crowdsec-agents" "ok" "dry-run: would check CrowdSec agent DaemonSet"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local ns
|
||||||
|
ns=$(find_crowdsec_namespace)
|
||||||
|
|
||||||
|
local ds_json
|
||||||
|
ds_json=$($KUBECTL get daemonset -n "$ns" -l app.kubernetes.io/name=crowdsec -o json 2>/dev/null) || {
|
||||||
|
# Fallback: search by name
|
||||||
|
ds_json=$($KUBECTL get daemonset -n "$ns" -o json 2>/dev/null | jq '{items: [.items[] | select(.metadata.name | test("crowdsec"))]}') || {
|
||||||
|
add_check "crowdsec-agents" "warn" "No CrowdSec DaemonSet found"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
local desired ready
|
||||||
|
desired=$(echo "$ds_json" | jq '[.items[].status.desiredNumberScheduled] | add // 0' 2>/dev/null || echo "0")
|
||||||
|
ready=$(echo "$ds_json" | jq '[.items[].status.numberReady] | add // 0' 2>/dev/null || echo "0")
|
||||||
|
|
||||||
|
if [ "$ready" -lt "$desired" ]; then
|
||||||
|
add_check "crowdsec-agents" "warn" "CrowdSec agents: ${ready}/${desired} ready"
|
||||||
|
elif [ "$desired" -eq 0 ]; then
|
||||||
|
add_check "crowdsec-agents" "warn" "No CrowdSec agent DaemonSet pods scheduled"
|
||||||
|
else
|
||||||
|
add_check "crowdsec-agents" "ok" "CrowdSec agents: ${ready}/${desired} ready"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
check_lapi_health
|
||||||
|
check_cscli_metrics
|
||||||
|
check_decisions
|
||||||
|
check_agent_daemonset
|
||||||
|
|
||||||
|
# Output JSON
|
||||||
|
overall="ok"
|
||||||
|
for c in "${checks[@]}"; do
|
||||||
|
s=$(echo "$c" | jq -r '.status')
|
||||||
|
if [ "$s" = "fail" ]; then overall="fail"; break; fi
|
||||||
|
if [ "$s" = "warn" ]; then overall="warn"; fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
|
||||||
|
"$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
|
||||||
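Several checks in this script look a resource up by label first and fall back to a name match. Because `kubectl get` exits 0 with empty output when a selector matches nothing, the fallback has to key off empty output, not the exit code. A minimal sketch of that pattern, assuming each lookup is passed as a command string (`first_nonempty` is a hypothetical helper, not part of the commit):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first non-empty line produced by any of the
# given commands, or a default if none of them prints anything.
# Usage: first_nonempty DEFAULT CMD [CMD...]
# Each CMD is a string run with `bash -c`, so only pass trusted input.
first_nonempty() {
  local default="$1" cmd out
  shift
  for cmd in "$@"; do
    # Ignore the exit code; only the presence of output decides.
    out=$(bash -c "$cmd" 2>/dev/null | head -1) || true
    if [ -n "$out" ]; then
      echo "$out"
      return 0
    fi
  done
  echo "$default"
}
```

With this shape, a namespace lookup would pass the label-based query first, the name-based query second, and `crowdsec` as the default.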
194
.claude/scripts/db-health.sh
Executable file
|
|
@ -0,0 +1,194 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||||
|
DRY_RUN=false
|
||||||
|
AGENT="db-health"
|
||||||
|
|
||||||
|
for arg in "$@"; do
|
||||||
|
case "$arg" in
|
||||||
|
--dry-run) DRY_RUN=true ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
CHECKS="[]"
|
||||||
|
|
||||||
|
add_check() {
|
||||||
|
local name="$1" status="$2" message="$3"
|
||||||
|
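  CHECKS=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
# Values arrive as arguments instead of being interpolated into the Python
# source, so quotes or backslashes in a message cannot break the script.
name, status, message = sys.argv[1:4]
checks.append({'name': name, 'status': status, 'message': message})
json.dump(checks, sys.stdout)
" "$name" "$status" "$message")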
|
||||||
|
}
|
||||||
|
|
||||||
|
# MySQL InnoDB Cluster - Group Replication status
|
||||||
|
check_mysql_gr() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "mysql-group-replication" "ok" "DRY RUN: would check MySQL Group Replication status"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Discover MySQL pod via labels first, fall back to known name
|
||||||
|
local mysql_pod
|
||||||
|
mysql_pod=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o name 2>/dev/null | head -1) || true
|
||||||
|
if [ -z "$mysql_pod" ]; then
|
||||||
|
mysql_pod=$($KUBECTL get pods -n dbaas -l app.kubernetes.io/name=mysql -o name 2>/dev/null | head -1) || true
|
||||||
|
fi
|
||||||
|
if [ -z "$mysql_pod" ]; then
|
||||||
|
mysql_pod="sts/mysql-cluster"
|
||||||
|
fi
|
||||||
|
|
||||||
|
local gr_status
|
||||||
|
gr_status=$($KUBECTL exec "$mysql_pod" -n dbaas -- mysql -N -e \
|
||||||
|
"SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members" 2>/dev/null) || {
|
||||||
|
add_check "mysql-group-replication" "fail" "Cannot connect to MySQL cluster to check GR status"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local member_count online_count
|
||||||
|
member_count=$(echo "$gr_status" | grep -c . || true)
|
||||||
|
online_count=$(echo "$gr_status" | grep -c "ONLINE" || true)
|
||||||
|
|
||||||
|
if [ "$online_count" -eq "$member_count" ] && [ "$member_count" -ge 3 ]; then
|
||||||
|
add_check "mysql-group-replication" "ok" "All $member_count members ONLINE: $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||||
|
elif [ "$online_count" -lt "$member_count" ]; then
|
||||||
|
add_check "mysql-group-replication" "fail" "Only $online_count/$member_count members ONLINE: $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||||
|
else
|
||||||
|
add_check "mysql-group-replication" "warn" "Cluster has $member_count members (expected 3): $(echo "$gr_status" | tr '\t' ' ' | tr '\n' '; ')"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# MySQL pod health
|
||||||
|
check_mysql_pods() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "mysql-pods" "ok" "DRY RUN: would check MySQL pod status"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local pod_status
|
||||||
|
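  # kubectl exits 0 with empty output when no pod matches the label, so check
  # for empty output explicitly; otherwise zero pods would be reported as ok.
  pod_status=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o wide --no-headers 2>/dev/null) || true
  if [ -z "$pod_status" ]; then
    pod_status=$($KUBECTL get pods -n dbaas --no-headers 2>/dev/null | grep -i mysql) || true
  fi
  if [ -z "$pod_status" ]; then
    add_check "mysql-pods" "warn" "Cannot find MySQL pods in dbaas namespace"
    return
  fi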
|
||||||
|
|
||||||
|
local not_running
|
||||||
|
not_running=$(echo "$pod_status" | grep -v "Running" | grep -v "Completed" || true)
|
||||||
|
|
||||||
|
if [ -z "$not_running" ]; then
|
||||||
|
local count
|
||||||
|
count=$(echo "$pod_status" | grep -c "Running" || true)
|
||||||
|
add_check "mysql-pods" "ok" "$count MySQL pod(s) running in dbaas namespace"
|
||||||
|
else
|
||||||
|
add_check "mysql-pods" "fail" "Unhealthy MySQL pods: $(echo "$not_running" | awk '{print $1": "$3}' | tr '\n' '; ')"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# CNPG PostgreSQL cluster health
|
||||||
|
check_cnpg() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "cnpg-clusters" "ok" "DRY RUN: would check CNPG PostgreSQL cluster health"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check if CNPG CRDs exist
|
||||||
|
local cnpg_clusters
|
||||||
|
cnpg_clusters=$($KUBECTL get cluster.postgresql.cnpg.io --all-namespaces -o json 2>/dev/null) || {
|
||||||
|
add_check "cnpg-clusters" "warn" "CNPG CRD not found or no clusters deployed"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local report
|
||||||
|
report=$(echo "$cnpg_clusters" | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
results = []
|
||||||
|
all_healthy = True
|
||||||
|
for cluster in data.get('items', []):
|
||||||
|
ns = cluster['metadata']['namespace']
|
||||||
|
name = cluster['metadata']['name']
|
||||||
|
phase = cluster.get('status', {}).get('phase', 'unknown')
|
||||||
|
ready = cluster.get('status', {}).get('readyInstances', 0)
|
||||||
|
instances = cluster.get('spec', {}).get('instances', 0)
|
||||||
|
primary = cluster.get('status', {}).get('currentPrimary', 'unknown')
|
||||||
|
if phase != 'Cluster in healthy state' and phase != 'Healthy':
|
||||||
|
all_healthy = False
|
||||||
|
if ready < instances:
|
||||||
|
all_healthy = False
|
||||||
|
results.append(f'{ns}/{name}: phase={phase} ready={ready}/{instances} primary={primary}')
|
||||||
|
print('HEALTHY' if all_healthy else 'UNHEALTHY')
|
||||||
|
print('; '.join(results))
|
||||||
|
" 2>/dev/null) || report="Failed to parse CNPG status"
|
||||||
|
|
||||||
|
local health_line
|
||||||
|
health_line=$(echo "$report" | head -1)
|
||||||
|
local detail_line
|
||||||
|
detail_line=$(echo "$report" | tail -1)
|
||||||
|
|
||||||
|
if [ "$health_line" = "HEALTHY" ]; then
|
||||||
|
add_check "cnpg-clusters" "ok" "$detail_line"
|
||||||
|
else
|
||||||
|
add_check "cnpg-clusters" "fail" "$detail_line"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Database connection counts (MySQL)
|
||||||
|
check_mysql_connections() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "mysql-connections" "ok" "DRY RUN: would check MySQL connection counts"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local mysql_pod
|
||||||
|
mysql_pod=$($KUBECTL get pods -n dbaas -l app=mysql-cluster -o name 2>/dev/null | head -1) || true
|
||||||
|
if [ -z "$mysql_pod" ]; then
|
||||||
|
mysql_pod="sts/mysql-cluster"
|
||||||
|
fi
|
||||||
|
|
||||||
|
local conn_info
|
||||||
|
conn_info=$($KUBECTL exec "$mysql_pod" -n dbaas -- mysql -N -e \
|
||||||
|
"SELECT 'threads_connected', VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Threads_connected' UNION ALL SELECT 'max_connections', VARIABLE_VALUE FROM performance_schema.global_variables WHERE VARIABLE_NAME='max_connections'" 2>/dev/null) || {
|
||||||
|
add_check "mysql-connections" "warn" "Cannot query MySQL connection info"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local threads_connected max_connections
|
||||||
|
threads_connected=$(echo "$conn_info" | grep threads_connected | awk '{print $2}') || threads_connected="unknown"
|
||||||
|
max_connections=$(echo "$conn_info" | grep max_connections | awk '{print $2}') || max_connections="unknown"
|
||||||
|
|
||||||
|
if [ "$threads_connected" != "unknown" ] && [ "$max_connections" != "unknown" ]; then
|
||||||
|
local pct=$((threads_connected * 100 / max_connections))
|
||||||
|
if [ "$pct" -gt 80 ]; then
|
||||||
|
add_check "mysql-connections" "fail" "MySQL connections at ${pct}%: $threads_connected/$max_connections"
|
||||||
|
elif [ "$pct" -gt 60 ]; then
|
||||||
|
add_check "mysql-connections" "warn" "MySQL connections at ${pct}%: $threads_connected/$max_connections"
|
||||||
|
else
|
||||||
|
add_check "mysql-connections" "ok" "MySQL connections: $threads_connected/$max_connections (${pct}%)"
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
add_check "mysql-connections" "warn" "MySQL connections: threads=$threads_connected max=$max_connections"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run all checks
|
||||||
|
check_mysql_gr
|
||||||
|
check_mysql_pods
|
||||||
|
check_cnpg
|
||||||
|
check_mysql_connections
|
||||||
|
|
||||||
|
# Determine overall status
|
||||||
|
OVERALL=$(echo "$CHECKS" | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
checks = json.load(sys.stdin)
|
||||||
|
statuses = [c['status'] for c in checks]
|
||||||
|
if 'fail' in statuses:
|
||||||
|
print('fail')
|
||||||
|
elif 'warn' in statuses:
|
||||||
|
print('warn')
|
||||||
|
else:
|
||||||
|
print('ok')
|
||||||
|
")
|
||||||
|
|
||||||
|
echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
|
||||||
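The connection check in this script computes usage as a percentage of `max_connections`, failing above 80% and warning above 60%. A minimal sketch of that classification with input guards added, assuming both values arrive as strings (`conn_usage_status` is a hypothetical name, and the guards are an addition, not the commit's behavior):

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify connection usage with the same thresholds as
# the MySQL connection check (fail above 80%, warn above 60%).
conn_usage_status() {
  local used="$1" max="$2"
  # Reject empty or non-numeric input before doing arithmetic.
  case "$used" in ''|*[!0-9]*) echo "warn unknown"; return 0 ;; esac
  case "$max"  in ''|*[!0-9]*) echo "warn unknown"; return 0 ;; esac
  # A zero maximum would cause a division error.
  if [ "$max" -eq 0 ]; then echo "warn unknown"; return 0; fi
  # 10# forces base 10 so a value with a leading zero is not read as octal.
  local pct=$(( 10#$used * 100 / 10#$max ))
  if [ "$pct" -gt 80 ]; then
    echo "fail ${pct}%"
  elif [ "$pct" -gt 60 ]; then
    echo "warn ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}
```

Keeping the thresholds in one function makes the boundary behavior explicit: exactly 80% is still a warning, and exactly 60% is still ok.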
217
.claude/scripts/deploy-status.sh
Executable file
|
|
@ -0,0 +1,217 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
|
||||||
|
DRY_RUN=false
|
||||||
|
AGENT="deploy-status"
|
||||||
|
|
||||||
|
for arg in "$@"; do
|
||||||
|
case "$arg" in
|
||||||
|
--dry-run) DRY_RUN=true ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
CHECKS="[]"
|
||||||
|
|
||||||
|
add_check() {
|
||||||
|
local name="$1" status="$2" message="$3"
|
||||||
|
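  CHECKS=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
# Values arrive as arguments instead of being interpolated into the Python
# source, so quotes or backslashes in a message cannot break the script.
name, status, message = sys.argv[1:4]
checks.append({'name': name, 'status': status, 'message': message})
json.dump(checks, sys.stdout)
" "$name" "$status" "$message")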
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check for stalled rollouts (Progressing=False) and unavailable deployments (Available=False)
|
||||||
|
check_stalled_rollouts() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "stalled-rollouts" "ok" "DRY RUN: would check for stalled deployment rollouts"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local stalled
|
||||||
|
stalled=$($KUBECTL get deployments --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
stalled = []
|
||||||
|
for dep in data.get('items', []):
|
||||||
|
ns = dep['metadata']['namespace']
|
||||||
|
name = dep['metadata']['name']
|
||||||
|
conditions = dep.get('status', {}).get('conditions', [])
|
||||||
|
for cond in conditions:
|
||||||
|
if cond.get('type') == 'Progressing' and cond.get('status') == 'False':
|
||||||
|
reason = cond.get('reason', 'unknown')
|
||||||
|
stalled.append(f'{ns}/{name}: {reason}')
|
||||||
|
elif cond.get('type') == 'Available' and cond.get('status') == 'False':
|
||||||
|
reason = cond.get('reason', 'unknown')
|
||||||
|
stalled.append(f'{ns}/{name}: unavailable ({reason})')
|
||||||
|
if stalled:
|
||||||
|
print('; '.join(stalled))
|
||||||
|
else:
|
||||||
|
print('')
|
||||||
|
" 2>/dev/null) || stalled="Failed to check deployments"
|
||||||
|
|
||||||
|
if [ -z "$stalled" ]; then
|
||||||
|
add_check "stalled-rollouts" "ok" "No stalled rollouts detected"
|
||||||
|
else
|
||||||
|
add_check "stalled-rollouts" "fail" "Stalled rollouts: $stalled"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check for unavailable replicas
|
||||||
|
check_unavailable_replicas() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "unavailable-replicas" "ok" "DRY RUN: would check for deployments with unavailable replicas"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local unavail
|
||||||
|
unavail=$($KUBECTL get deployments --all-namespaces -o json 2>/dev/null | python3 -c "
|
||||||
|
import sys, json
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
issues = []
|
||||||
|
for dep in data.get('items', []):
|
||||||
|
ns = dep['metadata']['namespace']
|
||||||
|
name = dep['metadata']['name']
|
||||||
|
spec_replicas = dep.get('spec', {}).get('replicas', 1)
|
||||||
|
ready = dep.get('status', {}).get('readyReplicas', 0) or 0
|
||||||
|
unavailable = dep.get('status', {}).get('unavailableReplicas', 0) or 0
|
||||||
|
if unavailable > 0 or ready < spec_replicas:
|
||||||
|
issues.append(f'{ns}/{name}: {ready}/{spec_replicas} ready, {unavailable} unavailable')
|
||||||
|
if issues:
|
||||||
|
print('; '.join(issues))
|
||||||
|
else:
|
||||||
|
print('')
|
||||||
|
" 2>/dev/null) || unavail="Failed to check replicas"
|
||||||
|
|
||||||
|
if [ -z "$unavail" ]; then
|
||||||
|
add_check "unavailable-replicas" "ok" "All deployments have desired replicas ready"
|
||||||
|
else
|
||||||
|
add_check "unavailable-replicas" "warn" "Unavailable replicas: $unavail"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check for image pull errors
check_image_pull_errors() {
    if $DRY_RUN; then
        add_check "image-pull-errors" "ok" "DRY RUN: would check for ImagePullBackOff/ErrImagePull pods"
        return
    fi

    local pull_errors
    pull_errors=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
errors = []
for pod in data.get('items', []):
    ns = pod['metadata']['namespace']
    name = pod['metadata']['name']
    for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
        waiting = cs.get('state', {}).get('waiting', {})
        reason = waiting.get('reason', '')
        if reason in ('ImagePullBackOff', 'ErrImagePull', 'InvalidImageName'):
            image = cs.get('image', 'unknown')
            msg = waiting.get('message', '')[:100]
            errors.append(f'{ns}/{name}: {reason} image={image} ({msg})')
if errors:
    print('; '.join(errors))
else:
    print('')
" 2>/dev/null) || pull_errors="Failed to check image pulls"

    if [ -z "$pull_errors" ]; then
        add_check "image-pull-errors" "ok" "No image pull errors found"
    else
        add_check "image-pull-errors" "fail" "Image pull errors: $pull_errors"
    fi
}

# Check for pods with high restart counts (5 or more in total; restartCount has no time window)
check_recent_restarts() {
    if $DRY_RUN; then
        add_check "recent-restarts" "ok" "DRY RUN: would check for pods with high restart counts"
        return
    fi

    local restarts
    restarts=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
high_restart = []
for pod in data.get('items', []):
    ns = pod['metadata']['namespace']
    name = pod['metadata']['name']
    for cs in pod.get('status', {}).get('containerStatuses', []):
        count = cs.get('restartCount', 0)
        if count >= 5:
            container = cs['name']
            high_restart.append(f'{ns}/{name}:{container} restarts={count}')
if high_restart:
    print('; '.join(sorted(high_restart, key=lambda x: int(x.split('=')[1]), reverse=True)[:20]))
else:
    print('')
" 2>/dev/null) || restarts="Failed to check restarts"

    if [ -z "$restarts" ]; then
        add_check "recent-restarts" "ok" "No pods with 5+ restarts"
    else
        add_check "recent-restarts" "warn" "High restart counts: $restarts"
    fi
}

# Check CrashLoopBackOff pods
check_crashloop() {
    if $DRY_RUN; then
        add_check "crashloop" "ok" "DRY RUN: would check for CrashLoopBackOff pods"
        return
    fi

    local crashloop
    crashloop=$($KUBECTL get pods --all-namespaces -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
crashes = []
for pod in data.get('items', []):
    ns = pod['metadata']['namespace']
    name = pod['metadata']['name']
    for cs in pod.get('status', {}).get('containerStatuses', []):
        waiting = cs.get('state', {}).get('waiting', {})
        if waiting.get('reason') == 'CrashLoopBackOff':
            container = cs['name']
            restarts = cs.get('restartCount', 0)
            crashes.append(f'{ns}/{name}:{container} restarts={restarts}')
if crashes:
    print('; '.join(crashes))
else:
    print('')
" 2>/dev/null) || crashloop="Failed to check crashloop"

    if [ -z "$crashloop" ]; then
        add_check "crashloop" "ok" "No CrashLoopBackOff pods"
    else
        add_check "crashloop" "fail" "CrashLoopBackOff: $crashloop"
    fi
}

# Run all checks
check_stalled_rollouts
check_unavailable_replicas
check_image_pull_errors
check_recent_restarts
check_crashloop

# Determine overall status
OVERALL=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
statuses = [c['status'] for c in checks]
if 'fail' in statuses:
    print('fail')
elif 'warn' in statuses:
    print('warn')
else:
    print('ok')
")

echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
144
.claude/scripts/dns-check.sh
Executable file
@ -0,0 +1,144 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
AGENT="dns-check"
DRY_RUN=false

# Internal DNS server (Technitium)
INTERNAL_DNS="10.0.20.100"
# Public DNS
PUBLIC_DNS="1.1.1.1"

# Services to check
SERVICES=(
    "grafana.viktorbarzin.me"
    "prometheus.viktorbarzin.me"
    "nextcloud.viktorbarzin.me"
    "authentik.viktorbarzin.me"
    "viktorbarzin.me"
)

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_dns_resolution() {
    if $DRY_RUN; then
        add_check "dns-resolution" "ok" "dry-run: would resolve ${#SERVICES[@]} services via internal and public DNS"
        return
    fi

    local failures=0 mismatches=0 successes=0
    local failure_details="" mismatch_details=""

    for svc in "${SERVICES[@]}"; do
        local internal_result public_result

        internal_result=$(dig +short "$svc" @"$INTERNAL_DNS" A 2>/dev/null | head -1) || internal_result=""
        public_result=$(dig +short "$svc" @"$PUBLIC_DNS" A 2>/dev/null | head -1) || public_result=""

        if [ -z "$internal_result" ] && [ -z "$public_result" ]; then
            failures=$((failures + 1))
            failure_details="${failure_details}${svc} (both resolvers failed); "
        elif [ -z "$internal_result" ]; then
            failures=$((failures + 1))
            failure_details="${failure_details}${svc} (internal DNS failed); "
        elif [ -z "$public_result" ]; then
            # Public might use CNAME/proxy, not necessarily a failure
            successes=$((successes + 1))
        elif [ "$internal_result" != "$public_result" ]; then
            # Mismatch is informational: Cloudflare proxy IPs differ from internal IPs
            mismatches=$((mismatches + 1))
            mismatch_details="${mismatch_details}${svc} (internal=${internal_result} public=${public_result}); "
            successes=$((successes + 1))
        else
            successes=$((successes + 1))
        fi
    done

    if [ "$failures" -gt 0 ]; then
        add_check "dns-resolution" "fail" "${failures} DNS failures: ${failure_details}"
    elif [ "$mismatches" -gt 0 ]; then
        add_check "dns-resolution" "ok" "${successes}/${#SERVICES[@]} resolved. ${mismatches} internal/public mismatches (expected with Cloudflare proxy): ${mismatch_details}"
    else
        add_check "dns-resolution" "ok" "All ${successes}/${#SERVICES[@]} services resolved successfully"
    fi
}

check_technitium_health() {
    if $DRY_RUN; then
        add_check "technitium" "ok" "dry-run: would check Technitium DNS server pod health"
        return
    fi

    # kubectl exits 0 with empty output when a label selector matches nothing,
    # so fall back to a name match on empty output rather than on exit status.
    local tech_pods
    tech_pods=$($KUBECTL get pods -A -l app.kubernetes.io/name=technitium --no-headers 2>/dev/null || true)
    if [ -z "$tech_pods" ]; then
        tech_pods=$($KUBECTL get pods -A --no-headers 2>/dev/null | grep -i technitium || true)
    fi

    if [ -z "$tech_pods" ]; then
        add_check "technitium" "warn" "No Technitium pods found"
        return
    fi

    # grep -c prints 0 itself (and exits 1) when nothing matches; "|| true"
    # keeps that single count instead of appending a second "0".
    local not_running
    not_running=$(echo "$tech_pods" | grep -v "Running" | grep -c "." 2>/dev/null || true)

    if [ "$not_running" -gt 0 ]; then
        add_check "technitium" "fail" "Technitium pod(s) not running"
    else
        add_check "technitium" "ok" "Technitium DNS server pod(s) running"
    fi
}

check_coredns_health() {
    if $DRY_RUN; then
        add_check "coredns" "ok" "dry-run: would check CoreDNS pod health"
        return
    fi

    local coredns_pods
    coredns_pods=$($KUBECTL get pods -n kube-system -l k8s-app=kube-dns --no-headers 2>/dev/null) || {
        add_check "coredns" "warn" "Failed to query CoreDNS pods"
        return
    }

    if [ -z "$coredns_pods" ]; then
        add_check "coredns" "warn" "No CoreDNS pods found"
        return
    fi

    # grep -c prints 0 itself (and exits 1) when nothing matches; "|| true"
    # keeps that single count instead of appending a second "0".
    local total not_running
    total=$(echo "$coredns_pods" | grep -c "." 2>/dev/null || true)
    not_running=$(echo "$coredns_pods" | grep -v "Running" | grep -c "." 2>/dev/null || true)

    if [ "$not_running" -gt 0 ]; then
        add_check "coredns" "fail" "${not_running}/${total} CoreDNS pod(s) not running"
    else
        add_check "coredns" "ok" "All ${total} CoreDNS pod(s) running"
    fi
}

check_dns_resolution
check_technitium_health
check_coredns_health

# Output JSON
overall="ok"
for c in "${checks[@]}"; do
    s=$(echo "$c" | jq -r '.status')
    if [ "$s" = "fail" ]; then overall="fail"; break; fi
    if [ "$s" = "warn" ]; then overall="warn"; fi
done

printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
    "$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
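The `add_check` helper in dns-check.sh builds each JSON object by string interpolation, so a message that contains a double quote, backslash, or newline would produce invalid JSON. Below is a minimal sketch of one way to guard against that in plain bash. The `json_escape` helper is hypothetical (it is not part of this commit), and it only covers the characters most likely to show up in check messages.

```shell
#!/usr/bin/env bash
set -euo pipefail

# json_escape is a hypothetical helper for illustration; it is not in the commit.
# It escapes backslash, double quote, newline, carriage return, and tab.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quote
    s=${s//$'\n'/\\n}     # newline
    s=${s//$'\r'/\\r}     # carriage return
    s=${s//$'\t'/\\t}     # tab
    printf '%s' "$s"
}

checks=()

# Same signature as the script's add_check, with every field escaped.
add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$(json_escape "$name")\", \"status\": \"$(json_escape "$status")\", \"message\": \"$(json_escape "$message")\"}")
}

add_check "pfsense" "warn" 'gateway "WAN" reported down'
printf '%s\n' "${checks[0]}"
```

Because the helper keeps the existing three-argument signature, callers would not need to change. Building the object with `jq -n --arg` would be an alternative where `jq` is already a dependency.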
281
.claude/scripts/monitoring-health.sh
Executable file
@ -0,0 +1,281 @@
#!/usr/bin/env bash
set -euo pipefail

AGENT="monitoring-health"
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
MONITORING_NS="monitoring"
DRY_RUN=false

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_prometheus() {
    if $DRY_RUN; then
        add_check "prometheus" "ok" "dry-run: would check Prometheus server health"
        return
    fi

    # Discover Prometheus server pod via labels
    local prom_pod
    prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server -o name 2>/dev/null | head -1)
    if [ -z "$prom_pod" ]; then
        prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app=prometheus,component=server -o name 2>/dev/null | head -1)
    fi
    if [ -z "$prom_pod" ]; then
        # "|| true": under set -e with pipefail, a grep with no match would abort the script
        prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1 || true)
    fi

    if [ -z "$prom_pod" ]; then
        add_check "prometheus" "fail" "No Prometheus server pod found in $MONITORING_NS"
        return
    fi

    local phase
    phase=$($KUBECTL get "$prom_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" != "Running" ]; then
        add_check "prometheus" "fail" "Prometheus server pod phase: $phase"
        return
    fi

    # Check Prometheus is responding. The fallback string must not contain
    # "ok" or "healthy", otherwise the grep below would match it.
    local prom_healthy
    prom_healthy=$($KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
        wget -q -O- "http://localhost:9090/-/healthy" 2>/dev/null || echo "unreachable")

    if echo "$prom_healthy" | grep -qi "ok\|healthy"; then
        # Check target scraping. The exec is wrapped in "{ ...; } || true" so a failed
        # exec feeds empty input to Python, which then prints a single "unknown".
        local targets_up
        targets_up=$( { $KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
            wget -q -O- "http://localhost:9090/api/v1/targets" 2>/dev/null || true; } | \
            python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    active = data.get('data',{}).get('activeTargets',[])
    up = sum(1 for t in active if t.get('health') == 'up')
    total = len(active)
    print(f'{up}/{total}')
except: print('unknown')
" 2>/dev/null || echo "unknown")
        add_check "prometheus" "ok" "Prometheus server healthy, targets: $targets_up up"
    else
        add_check "prometheus" "warn" "Prometheus server running but health check unclear"
    fi
}

check_alertmanager() {
    if $DRY_RUN; then
        add_check "alertmanager" "ok" "dry-run: would check Alertmanager health"
        return
    fi

    # Discover Alertmanager pod
    local am_pod
    am_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=alertmanager -o name 2>/dev/null | head -1)
    if [ -z "$am_pod" ]; then
        # "|| true": under set -e with pipefail, a grep with no match would abort the script
        am_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep alertmanager | head -1 || true)
    fi

    if [ -z "$am_pod" ]; then
        add_check "alertmanager" "fail" "No Alertmanager pod found in $MONITORING_NS"
        return
    fi

    local phase
    phase=$($KUBECTL get "$am_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" != "Running" ]; then
        add_check "alertmanager" "fail" "Alertmanager pod phase: $phase"
        return
    fi

    # Check firing alerts. The exec is wrapped in "{ ...; } || true" so a failed exec
    # feeds empty input to Python (one "unknown") rather than also triggering the
    # outer fallback under pipefail and producing the value twice.
    local alert_info
    alert_info=$( { $KUBECTL exec "$am_pod" -n "$MONITORING_NS" -- \
        wget -q -O- "http://localhost:9093/api/v2/alerts?active=true" 2>/dev/null || true; } | \
        python3 -c "
import sys, json
try:
    alerts = json.load(sys.stdin)
    firing = [a for a in alerts if a.get('status',{}).get('state') == 'active']
    print(len(firing))
except: print('unknown')
" 2>/dev/null || echo "unknown")

    # Check silences
    local silence_count
    silence_count=$( { $KUBECTL exec "$am_pod" -n "$MONITORING_NS" -- \
        wget -q -O- "http://localhost:9093/api/v2/silences" 2>/dev/null || true; } | \
        python3 -c "
import sys, json
try:
    silences = json.load(sys.stdin)
    active = [s for s in silences if s.get('status',{}).get('state') == 'active']
    print(len(active))
except: print('0')
" 2>/dev/null || echo "0")

    if [ "$alert_info" = "unknown" ]; then
        add_check "alertmanager" "warn" "Alertmanager running but could not query alerts"
    else
        local status="ok"
        [ "$alert_info" -gt 0 ] 2>/dev/null && status="warn"
        add_check "alertmanager" "$status" "Alertmanager healthy: $alert_info firing alerts, $silence_count active silences"
    fi
}

check_grafana() {
    if $DRY_RUN; then
        add_check "grafana" "ok" "dry-run: would check Grafana health"
        return
    fi

    # Discover Grafana pod
    local grafana_pod
    grafana_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=grafana -o name 2>/dev/null | head -1)
    if [ -z "$grafana_pod" ]; then
        # "|| true": under set -e with pipefail, a grep with no match would abort the script
        grafana_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep grafana | grep -v test | head -1 || true)
    fi

    if [ -z "$grafana_pod" ]; then
        add_check "grafana" "fail" "No Grafana pod found in $MONITORING_NS"
        return
    fi

    local phase
    phase=$($KUBECTL get "$grafana_pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" != "Running" ]; then
        add_check "grafana" "fail" "Grafana pod phase: $phase"
        return
    fi

    # Check datasource connectivity. The exec is wrapped in "{ ...; } || true" so a
    # failed request (for example an auth error from curl -f) yields a single
    # "unknown" from Python instead of a doubled value under pipefail.
    local ds_info
    ds_info=$( { $KUBECTL exec "$grafana_pod" -n "$MONITORING_NS" -- \
        curl -sf "http://localhost:3000/api/datasources" 2>/dev/null || true; } | \
        python3 -c "
import sys, json
try:
    ds = json.load(sys.stdin)
    names = [d.get('name','?') for d in ds]
    print(f'{len(ds)} datasources: {\", \".join(names)}')
except: print('unknown')
" 2>/dev/null || echo "unknown")

    if [ "$ds_info" = "unknown" ]; then
        add_check "grafana" "warn" "Grafana running but could not query datasources (may need auth)"
    else
        add_check "grafana" "ok" "Grafana healthy, $ds_info"
    fi
}

check_snmp_exporters() {
    if $DRY_RUN; then
        add_check "snmp-exporters" "ok" "dry-run: would check SNMP exporter pods"
        return
    fi

    local exporters=("snmp-exporter" "idrac-redfish-exporter" "proxmox-exporter")
    local running=0 total=0

    for exporter in "${exporters[@]}"; do
        total=$((total + 1))
        local pod
        # "|| true": under set -e with pipefail, a grep with no match would abort
        # the script before the all-namespaces fallback below could run.
        pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep "$exporter" | head -1 || true)

        if [ -z "$pod" ]; then
            # Try all namespaces
            pod=$($KUBECTL get pods --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers 2>/dev/null | \
                grep "$exporter" | head -1 || true)
            if [ -z "$pod" ]; then
                add_check "exporter-$exporter" "warn" "$exporter pod not found"
                continue
            fi
            local ns
            ns=$(echo "$pod" | awk '{print $1}')
            local name
            name=$(echo "$pod" | awk '{print $2}')
            local phase
            phase=$($KUBECTL get pod "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)
            if [ "$phase" = "Running" ]; then
                running=$((running + 1))
                add_check "exporter-$exporter" "ok" "$exporter running in $ns"
            else
                add_check "exporter-$exporter" "warn" "$exporter phase: $phase in $ns"
            fi
        else
            local phase
            phase=$($KUBECTL get "$pod" -n "$MONITORING_NS" -o jsonpath='{.status.phase}' 2>/dev/null)
            if [ "$phase" = "Running" ]; then
                running=$((running + 1))
                add_check "exporter-$exporter" "ok" "$exporter running"
            else
                add_check "exporter-$exporter" "warn" "$exporter phase: $phase"
            fi
        fi
    done
}

check_prometheus_storage() {
    if $DRY_RUN; then
        add_check "prometheus-storage" "ok" "dry-run: would check Prometheus storage usage"
        return
    fi

    local prom_pvc
    # "|| true": under set -e with pipefail, a grep with no match would abort the script
    prom_pvc=$($KUBECTL get pvc -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1 || true)

    if [ -z "$prom_pvc" ]; then
        add_check "prometheus-storage" "warn" "No Prometheus server PVC found"
        return
    fi

    # Check storage usage with df inside the Prometheus server pod
    local prom_pod
    prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server -o name 2>/dev/null | head -1)
    if [ -z "$prom_pod" ]; then
        prom_pod=$($KUBECTL get pods -n "$MONITORING_NS" -o name 2>/dev/null | grep prometheus-server | head -1 || true)
    fi

    if [ -n "$prom_pod" ]; then
        local storage_info
        storage_info=$($KUBECTL exec "$prom_pod" -n "$MONITORING_NS" -c prometheus-server -- \
            df -h /data 2>/dev/null | tail -1 | awk '{printf "%s used of %s (%s)", $3, $2, $5}' || echo "unknown")
        add_check "prometheus-storage" "ok" "Prometheus storage: $storage_info"
    else
        add_check "prometheus-storage" "warn" "Could not check Prometheus storage"
    fi
}

# Run checks
check_prometheus
check_alertmanager
check_grafana
check_snmp_exporters
check_prometheus_storage

# Determine overall status
overall="ok"
for c in "${checks[@]}"; do
    if echo "$c" | grep -q '"status": "fail"'; then
        overall="fail"
        break
    elif echo "$c" | grep -q '"status": "warn"'; then
        overall="warn"
    fi
done

# Output JSON
checks_json=$(IFS=,; echo "${checks[*]}")
cat <<EOF
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
EOF
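Each script ends by reducing its per-check statuses to a single overall status, where `fail` outranks `warn` and `warn` outranks `ok`. The loop above does this by grepping the serialized JSON. A minimal sketch of the same reduction over plain status strings follows; `overall_status` is a hypothetical helper name and is not part of this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# overall_status is a hypothetical helper; it is not in the commit.
# It prints the worst status among its arguments: fail > warn > ok.
# With no arguments it prints "ok".
overall_status() {
    local overall="ok" s
    for s in "$@"; do
        case "$s" in
            fail) overall="fail"; break ;;   # nothing outranks fail, stop early
            warn) overall="warn" ;;
        esac
    done
    printf '%s\n' "$overall"
}

overall_status ok warn ok
overall_status warn fail ok
overall_status
```

Keeping the statuses in their own array alongside `checks` would let the scripts call a helper like this without re-parsing the JSON strings they have just built.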
166
.claude/scripts/network-health.sh
Executable file
@ -0,0 +1,166 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
PFSENSE="python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py"
AGENT="network-health"
DRY_RUN=false

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_pfsense_status() {
    if $DRY_RUN; then
        add_check "pfsense" "ok" "dry-run: would check pfSense system status via pfsense.py"
        return
    fi

    local pf_output
    pf_output=$($PFSENSE status 2>/dev/null) || {
        add_check "pfsense" "fail" "Failed to connect to pfSense via pfsense.py"
        return
    }

    if echo "$pf_output" | grep -qi "error\|fail\|down"; then
        add_check "pfsense" "warn" "pfSense reported issues: $(echo "$pf_output" | head -3 | tr '\n' ' ')"
    else
        add_check "pfsense" "ok" "pfSense system healthy"
    fi
}

check_vpn_status() {
    if $DRY_RUN; then
        add_check "vpn" "ok" "dry-run: would check VPN tunnel status via pfsense.py"
        return
    fi

    local vpn_output
    vpn_output=$($PFSENSE wireguard 2>/dev/null) || {
        add_check "vpn" "warn" "Failed to query VPN status via pfsense.py"
        return
    }

    if echo "$vpn_output" | grep -qi "error\|fail\|down"; then
        add_check "vpn" "warn" "VPN issues detected: $(echo "$vpn_output" | head -3 | tr '\n' ' ')"
    else
        add_check "vpn" "ok" "VPN tunnels healthy"
    fi
}

check_metallb_speakers() {
    if $DRY_RUN; then
        add_check "metallb-speakers" "ok" "dry-run: would check MetalLB speaker pod health"
        return
    fi

    local ns="metallb-system"

    # Find MetalLB speaker pods via labels first. kubectl exits 0 with empty output
    # when a selector matches nothing, so fall through on empty output rather than
    # on exit status.
    local speaker_pods
    speaker_pods=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/component=speaker --no-headers 2>/dev/null || true)
    if [ -z "$speaker_pods" ]; then
        speaker_pods=$($KUBECTL get pods -n "$ns" -l component=speaker --no-headers 2>/dev/null || true)
    fi
    if [ -z "$speaker_pods" ]; then
        speaker_pods=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i speaker || true)
    fi

    if [ -z "$speaker_pods" ]; then
        add_check "metallb-speakers" "warn" "No MetalLB speaker pods found in ${ns}"
        return
    fi

    # grep -c prints 0 itself (and exits 1) when nothing matches; "|| true"
    # keeps that single count instead of appending a second "0".
    local total not_running
    total=$(echo "$speaker_pods" | grep -c "." 2>/dev/null || true)
    not_running=$(echo "$speaker_pods" | grep -v "Running" | grep -c "." 2>/dev/null || true)

    if [ "$not_running" -gt 0 ]; then
        add_check "metallb-speakers" "fail" "${not_running}/${total} MetalLB speaker pod(s) not running"
    else
        add_check "metallb-speakers" "ok" "All ${total} MetalLB speaker pod(s) running"
    fi
}

check_metallb_l2() {
    if $DRY_RUN; then
        add_check "metallb-l2" "ok" "dry-run: would check MetalLB L2 advertisements"
        return
    fi

    local ns="metallb-system"

    # Check L2Advertisement CRDs
    local l2_ads
    l2_ads=$($KUBECTL get l2advertisements -n "$ns" -o json 2>/dev/null) || {
        add_check "metallb-l2" "warn" "Could not query L2Advertisement CRDs"
        return
    }

    local count
    count=$(echo "$l2_ads" | jq '.items | length' 2>/dev/null || echo "0")

    if [ "$count" -eq 0 ]; then
        add_check "metallb-l2" "warn" "No L2Advertisement resources found"
    else
        # Check MetalLB controller. An empty label-selector result still exits 0,
        # so fall back to a name match on empty output.
        local controller
        controller=$($KUBECTL get pods -n "$ns" -l app.kubernetes.io/component=controller --no-headers 2>/dev/null || true)
        if [ -z "$controller" ]; then
            controller=$($KUBECTL get pods -n "$ns" --no-headers 2>/dev/null | grep -i controller || true)
        fi

        if [ -z "$controller" ]; then
            add_check "metallb-l2" "warn" "${count} L2Advertisement(s) found but no controller pod"
        elif echo "$controller" | grep -q "Running"; then
            add_check "metallb-l2" "ok" "${count} L2Advertisement(s) configured, controller running"
        else
            add_check "metallb-l2" "warn" "${count} L2Advertisement(s) found but controller not running"
        fi
    fi
}

check_node_connectivity() {
    if $DRY_RUN; then
        add_check "node-connectivity" "ok" "dry-run: would ping k8s nodes"
        return
    fi

    local nodes=("10.0.20.100" "10.0.20.101" "10.0.20.102" "10.0.20.103" "10.0.20.104")
    local names=("k8s-master" "k8s-node1" "k8s-node2" "k8s-node3" "k8s-node4")
    local failures=0
    local failure_details=""

    for i in "${!nodes[@]}"; do
        if ! ping -c 1 -W 2 "${nodes[$i]}" >/dev/null 2>&1; then
            failures=$((failures + 1))
            failure_details="${failure_details}${names[$i]}(${nodes[$i]}) "
        fi
    done

    if [ "$failures" -gt 0 ]; then
        add_check "node-connectivity" "fail" "${failures} node(s) unreachable: ${failure_details}"
    else
        add_check "node-connectivity" "ok" "All ${#nodes[@]} nodes reachable"
    fi
}

check_pfsense_status
check_vpn_status
check_metallb_speakers
check_metallb_l2
check_node_connectivity

# Output JSON
overall="ok"
for c in "${checks[@]}"; do
    s=$(echo "$c" | jq -r '.status')
    if [ "$s" = "fail" ]; then overall="fail"; break; fi
    if [ "$s" = "warn" ]; then overall="warn"; fi
done

printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
    "$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
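Several checks in these scripts count pods whose status is not `Running` by chaining `grep -v` and `grep -c` over `kubectl get pods --no-headers` output. `grep -c` prints `0` and exits non-zero when nothing matches, which interacts awkwardly with an `|| echo "0"` style fallback and with `pipefail`. As a sketch of an alternative, the count can be done in a single `awk` pass that always prints exactly one integer and exits 0. The `count_not_running` helper below is hypothetical and not part of this commit; it assumes the default column layout of `kubectl get pods`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_not_running is a hypothetical helper; it is not in the commit.
# Reads `kubectl get pods --no-headers` style lines on stdin and prints how many
# have a STATUS column other than "Running". The column index is an argument
# (default 3) because -A / --all-namespaces output has a leading NAMESPACE column.
count_not_running() {
    local status_col="${1:-3}"
    awk -v col="$status_col" 'NF && $col != "Running" { n++ } END { print n + 0 }'
}

# Illustrative sample input, not real cluster output.
sample='speaker-abc12   1/1   Running            0   3d
speaker-def34   0/1   CrashLoopBackOff   7   3d'

printf '%s\n' "$sample" | count_not_running
```

For output produced with `-A`, the status sits in the fourth column, so the call would be `count_not_running 4`.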
145
.claude/scripts/nfs-health.sh
Executable file
@ -0,0 +1,145 @@
#!/usr/bin/env bash
set -euo pipefail

AGENT="nfs-health"
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
TRUENAS_HOST="10.0.10.15"
NODES=("k8s-master:10.0.20.100" "k8s-node1:10.0.20.101" "k8s-node2:10.0.20.102" "k8s-node3:10.0.20.103" "k8s-node4:10.0.20.104")
SSH_USER="wizard"
DRY_RUN=false

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_truenas_reachable() {
    if $DRY_RUN; then
        add_check "truenas-reachable" "ok" "dry-run: would ping $TRUENAS_HOST"
        return
    fi
    if timeout 5 ping -c 1 "$TRUENAS_HOST" &>/dev/null; then
        add_check "truenas-reachable" "ok" "TrueNAS at $TRUENAS_HOST is reachable"
    else
        add_check "truenas-reachable" "fail" "TrueNAS at $TRUENAS_HOST is unreachable"
    fi
}

check_truenas_nfs_service() {
    if $DRY_RUN; then
        add_check "truenas-nfs-service" "ok" "dry-run: would check NFS service on TrueNAS"
        return
    fi
    local result
    if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$TRUENAS_HOST" \
        "service nfs-server status 2>/dev/null || systemctl is-active nfs-server 2>/dev/null || echo 'unknown'" 2>/dev/null); then
        # -w matches whole words only, so "inactive" is not mistaken for "active"
        if echo "$result" | grep -qiwE "running|active"; then
            add_check "truenas-nfs-service" "ok" "NFS service is running on TrueNAS"
        else
            add_check "truenas-nfs-service" "warn" "NFS service status unclear: $(echo "$result" | head -1 | tr '"' "'")"
        fi
    else
        add_check "truenas-nfs-service" "fail" "Could not check NFS service on TrueNAS via SSH"
    fi
}

check_node_nfs_mounts() {
    local node_name="$1" node_ip="$2"

    if $DRY_RUN; then
        add_check "nfs-mounts-$node_name" "ok" "dry-run: would check NFS mounts on $node_name ($node_ip)"
        return
    fi

    local mount_output
    if ! mount_output=$(timeout 15 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$SSH_USER@$node_ip" \
        "mount | grep nfs" 2>/dev/null); then
        add_check "nfs-mounts-$node_name" "warn" "No NFS mounts found or SSH failed on $node_name ($node_ip)"
        return
    fi

    if [ -z "$mount_output" ]; then
        add_check "nfs-mounts-$node_name" "warn" "No NFS mounts found on $node_name"
        return
    fi

    local mount_count
    mount_count=$(echo "$mount_output" | wc -l | tr -d ' ')

    # Check for stale mounts by trying to stat each mount point
    local stale_count=0
    local stale_mounts=""
    while IFS= read -r line; do
        local mount_point
        mount_point=$(echo "$line" | awk '{print $3}')
        if [ -n "$mount_point" ]; then
            if ! timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$SSH_USER@$node_ip" \
                "timeout 5 stat '$mount_point' >/dev/null 2>&1" 2>/dev/null; then
                stale_count=$((stale_count + 1))
                stale_mounts="$stale_mounts $mount_point"
            fi
        fi
    done <<< "$mount_output"

    if [ "$stale_count" -gt 0 ]; then
        add_check "nfs-mounts-$node_name" "fail" "$stale_count/$mount_count NFS mounts stale on $node_name:$stale_mounts"
    else
        add_check "nfs-mounts-$node_name" "ok" "$mount_count NFS mounts healthy on $node_name"
    fi
}

check_nfs_pvcs() {
    if $DRY_RUN; then
        add_check "nfs-pvcs" "ok" "dry-run: would check NFS-backed PVCs"
        return
    fi

    # PersistentVolumeClaims are not among the resources that accept a
    # status.phase field selector, so the phase filter is applied in Python.
    local pending
    pending=$($KUBECTL get pvc --all-namespaces -o json 2>/dev/null | \
        python3 -c "import sys,json; items=json.load(sys.stdin).get('items',[]); nfs=[i for i in items if i.get('status',{}).get('phase')!='Bound' and 'nfs' in json.dumps(i).lower()]; print(len(nfs))" 2>/dev/null || echo "error")

    if [ "$pending" = "error" ]; then
        add_check "nfs-pvcs" "warn" "Could not check NFS PVC status"
    elif [ "$pending" = "0" ]; then
        add_check "nfs-pvcs" "ok" "All NFS-backed PVCs are bound"
    else
        add_check "nfs-pvcs" "fail" "$pending NFS-backed PVCs are not bound"
    fi
}

# Run checks
check_truenas_reachable
check_truenas_nfs_service

for node_entry in "${NODES[@]}"; do
    node_name="${node_entry%%:*}"
    node_ip="${node_entry##*:}"
    check_node_nfs_mounts "$node_name" "$node_ip"
done

check_nfs_pvcs

# Determine overall status
overall="ok"
for c in "${checks[@]}"; do
    if echo "$c" | grep -q '"status": "fail"'; then
        overall="fail"
        break
    elif echo "$c" | grep -q '"status": "warn"'; then
        overall="warn"
    fi
done

# Output JSON
checks_json=$(IFS=,; echo "${checks[*]}")
cat <<EOF
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
EOF
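The script above assembles its JSON by interpolating check messages straight into a string, so a message containing a double quote or backslash (a mount path, for example) would produce invalid output. A minimal sketch of one way to guard against that, assuming the array-based `add_check` pattern used by the other scripts in this commit; `json_escape` is a hypothetical helper, not part of the commit:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): escape a string so it can be
# embedded inside a JSON string literal.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}       # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quotes
    s=${s//$'\n'/\\n}     # newlines
    s=${s//$'\r'/\\r}     # carriage returns
    s=${s//$'\t'/\\t}     # tabs
    printf '%s' "$s"
}

checks=()

# Same shape as the scripts' add_check, but every field is escaped first.
add_check() {
    local name status message
    name=$(json_escape "$1")
    status=$(json_escape "$2")
    message=$(json_escape "$3")
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}
```

The `"status": "..."` layout is kept unchanged so the `grep`-based overall-status loop continues to match.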
214
.claude/scripts/oom-investigator.sh
Executable file
@@ -0,0 +1,214 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
DRY_RUN=false
AGENT="oom-investigator"

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

CHECKS="[]"

add_check() {
    local name="$1" status="$2" message="$3"
    # Pass the values as arguments instead of splicing them into the Python
    # source, so quotes or backslashes in a message cannot break the snippet.
    CHECKS=$(echo "$CHECKS" | python3 -c '
import sys, json
checks = json.load(sys.stdin)
checks.append({"name": sys.argv[1], "status": sys.argv[2], "message": sys.argv[3]})
json.dump(checks, sys.stdout)
' "$name" "$status" "$message")
}

# Find OOMKilled pods across all namespaces
find_oomkilled() {
    if $DRY_RUN; then
        add_check "oom-killed-pods" "ok" "DRY RUN: would check for OOMKilled pods across all namespaces"
        return
    fi

    local oom_pods
    oom_pods=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
import sys, json
data = json.load(sys.stdin)
results = []
for pod in data.get('items', []):
    ns = pod['metadata']['namespace']
    name = pod['metadata']['name']
    for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
        last = cs.get('lastState', {}).get('terminated', {})
        current = cs.get('state', {}).get('terminated', {})
        for state in [last, current]:
            if state.get('reason') == 'OOMKilled':
                container = cs['name']
                restart_count = cs.get('restartCount', 0)
                finished = state.get('finishedAt', 'unknown')
                results.append({'namespace': ns, 'pod': name, 'container': container, 'restarts': restart_count, 'finishedAt': finished})
json.dump(results, sys.stdout)
" 2>/dev/null) || oom_pods="[]"

    local count
    count=$(echo "$oom_pods" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")

    if [ "$count" -eq 0 ]; then
        add_check "oom-killed-pods" "ok" "No OOMKilled pods found"
    else
        add_check "oom-killed-pods" "fail" "Found $count OOMKilled container(s): $(echo "$oom_pods" | python3 -c "
import sys,json
pods = json.load(sys.stdin)
print('; '.join(f\"{p['namespace']}/{p['pod']}:{p['container']} (restarts={p['restarts']}, at={p['finishedAt']})\" for p in pods))
")"
    fi
}

# Check LimitRange defaults in namespaces with OOM events
check_limitranges() {
    if $DRY_RUN; then
        add_check "limitranges" "ok" "DRY RUN: would check LimitRange defaults"
        return
    fi

    local namespaces
    namespaces=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
import sys, json
data = json.load(sys.stdin)
ns_set = set()
for pod in data.get('items', []):
    for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
        for state in [cs.get('lastState', {}).get('terminated', {}), cs.get('state', {}).get('terminated', {})]:
            if state.get('reason') == 'OOMKilled':
                ns_set.add(pod['metadata']['namespace'])
for ns in sorted(ns_set):
    print(ns)
" 2>/dev/null) || namespaces=""

    if [ -z "$namespaces" ]; then
        add_check "limitranges" "ok" "No namespaces with OOMKilled pods to check"
        return
    fi

    local lr_info=""
    while IFS= read -r ns; do
        local lr
        lr=$($KUBECTL get limitrange -n "$ns" -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
for item in data.get('items', []):
    for limit in item.get('spec', {}).get('limits', []):
        if limit.get('type') == 'Container':
            default_mem = limit.get('default', {}).get('memory', 'none')
            default_cpu = limit.get('default', {}).get('cpu', 'none')
            print(f'$ns: default memory={default_mem}, cpu={default_cpu}')
" 2>/dev/null) || lr=""
        if [ -n "$lr" ]; then
            lr_info="${lr_info}${lr}; "
        else
            lr_info="${lr_info}${ns}: no LimitRange; "
        fi
    done <<< "$namespaces"

    add_check "limitranges" "warn" "LimitRange defaults for OOM namespaces: ${lr_info}"
}

# Check VPA recommendations from Goldilocks
check_vpa_recommendations() {
    if $DRY_RUN; then
        add_check "vpa-recommendations" "ok" "DRY RUN: would check VPA recommendations"
        return
    fi

    local vpa_count
    vpa_count=$($KUBECTL get vpa --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || vpa_count=0

    if [ "$vpa_count" -eq 0 ]; then
        add_check "vpa-recommendations" "warn" "No VPA objects found; Goldilocks may not be deployed"
        return
    fi

    local vpa_recs
    vpa_recs=$($KUBECTL get vpa --all-namespaces -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
recs = []
for vpa in data.get('items', []):
    ns = vpa['metadata']['namespace']
    name = vpa['metadata']['name']
    for cr in vpa.get('status', {}).get('recommendation', {}).get('containerRecommendations', []):
        container = cr.get('containerName', 'unknown')
        target_mem = cr.get('target', {}).get('memory', 'n/a')
        target_cpu = cr.get('target', {}).get('cpu', 'n/a')
        upper_mem = cr.get('upperBound', {}).get('memory', 'n/a')
        recs.append(f'{ns}/{name}:{container} target_mem={target_mem} target_cpu={target_cpu} upper_mem={upper_mem}')
if recs:
    print('; '.join(recs[:20]))
else:
    print('No recommendations available yet')
" 2>/dev/null) || vpa_recs="Failed to read VPA recommendations"

    add_check "vpa-recommendations" "ok" "$vpa_recs"
}

# Check resource requests/limits on OOMKilled pods
check_pod_resources() {
    if $DRY_RUN; then
        add_check "pod-resources" "ok" "DRY RUN: would check pod resource specs"
        return
    fi

    local resources
    resources=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
import sys, json
data = json.load(sys.stdin)
results = []
for pod in data.get('items', []):
    ns = pod['metadata']['namespace']
    name = pod['metadata']['name']
    has_oom = False
    for cs in pod.get('status', {}).get('containerStatuses', []) + pod.get('status', {}).get('initContainerStatuses', []):
        for state in [cs.get('lastState', {}).get('terminated', {}), cs.get('state', {}).get('terminated', {})]:
            if state.get('reason') == 'OOMKilled':
                has_oom = True
                break
    if has_oom:
        for c in pod.get('spec', {}).get('containers', []) + pod.get('spec', {}).get('initContainers', []):
            req_mem = c.get('resources', {}).get('requests', {}).get('memory', 'none')
            lim_mem = c.get('resources', {}).get('limits', {}).get('memory', 'none')
            req_cpu = c.get('resources', {}).get('requests', {}).get('cpu', 'none')
            lim_cpu = c.get('resources', {}).get('limits', {}).get('cpu', 'none')
            results.append(f\"{ns}/{name}:{c['name']} req_mem={req_mem} lim_mem={lim_mem} req_cpu={req_cpu} lim_cpu={lim_cpu}\")
if results:
    print('; '.join(results))
else:
    print('No OOMKilled pods to inspect')
" 2>/dev/null) || resources="Failed to check pod resources"

    if echo "$resources" | grep -q "No OOMKilled"; then
        add_check "pod-resources" "ok" "$resources"
    else
        add_check "pod-resources" "warn" "$resources"
    fi
}

# Run all checks
find_oomkilled
check_limitranges
check_vpa_recommendations
check_pod_resources

# Determine overall status
OVERALL=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
statuses = [c['status'] for c in checks]
if 'fail' in statuses:
    print('fail')
elif 'warn' in statuses:
    print('warn')
else:
    print('ok')
")

echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
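The script reports memory limits (`lim_mem`) and VPA targets (`target_mem`) as raw quantity strings and leaves the comparison to the reader. A rough sketch of how the two could be compared in bash, under the assumption that both values use an integer mantissa with an optional `Ki`/`Mi`/`Gi`/`Ti` or `k`/`M`/`G`/`T` suffix; `mem_to_bytes` and `limit_below_target` are hypothetical names, not part of the commit:

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a Kubernetes memory quantity to bytes.
# Only integer mantissas are handled; anything else returns non-zero.
mem_to_bytes() {
    local q="$1" num unit
    num=${q%%[!0-9]*}          # leading run of digits
    unit=${q#"$num"}           # whatever follows the digits
    [ -n "$num" ] || return 1
    case "$unit" in
        "") echo "$((10#$num))" ;;
        Ki) echo "$((10#$num * 1024))" ;;
        Mi) echo "$((10#$num * 1024 * 1024))" ;;
        Gi) echo "$((10#$num * 1024 * 1024 * 1024))" ;;
        Ti) echo "$((10#$num * 1024 * 1024 * 1024 * 1024))" ;;
        k)  echo "$((10#$num * 1000))" ;;
        M)  echo "$((10#$num * 1000000))" ;;
        G)  echo "$((10#$num * 1000000000))" ;;
        T)  echo "$((10#$num * 1000000000000))" ;;
        *)  return 1 ;;
    esac
}

# Exit 0 when the limit is smaller than the target, 1 when it is not,
# and 2 when either quantity could not be parsed.
limit_below_target() {
    local lim tgt
    lim=$(mem_to_bytes "$1") || return 2
    tgt=$(mem_to_bytes "$2") || return 2
    [ "$lim" -lt "$tgt" ]
}
```

For instance, `limit_below_target 256Mi 1Gi` succeeds, which would flag a container whose limit sits below its recommended target.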
260
.claude/scripts/platform-status.sh
Executable file
@@ -0,0 +1,260 @@
#!/usr/bin/env bash
set -euo pipefail

AGENT="platform-status"
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
PROXMOX_HOST="root@192.168.1.127"
REGISTRY_HOST="10.0.20.10"
DRY_RUN=false

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    # Escape backslashes and double quotes so the message stays valid JSON.
    message=${message//\\/\\\\}
    message=${message//\"/\\\"}
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_traefik() {
    if $DRY_RUN; then
        add_check "traefik" "ok" "dry-run: would check Traefik status"
        return
    fi

    # Discover Traefik pods via labels.
    # '|| true' keeps a kubectl failure from aborting the script under
    # 'set -euo pipefail' before any JSON has been printed.
    local traefik_pod
    traefik_pod=$($KUBECTL get pods -n traefik -l app.kubernetes.io/name=traefik -o name 2>/dev/null | head -1) || true
    if [ -z "$traefik_pod" ]; then
        traefik_pod=$($KUBECTL get pods -n traefik -l app=traefik -o name 2>/dev/null | head -1) || true
    fi

    if [ -z "$traefik_pod" ]; then
        add_check "traefik" "fail" "No Traefik pods found in traefik namespace"
        return
    fi

    local phase
    phase=$($KUBECTL get "$traefik_pod" -n traefik -o jsonpath='{.status.phase}' 2>/dev/null) || true
    if [ "$phase" = "Running" ]; then
        # Check IngressRoute count
        local ir_count
        ir_count=$($KUBECTL get ingressroute --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || true
        add_check "traefik" "ok" "Traefik running, $ir_count IngressRoutes configured"
    else
        add_check "traefik" "fail" "Traefik pod phase: $phase"
    fi

    # Check for IngressRoutes with errors (TLS or service issues)
    local ir_errors
    ir_errors=$($KUBECTL get events --all-namespaces --field-selector reason=IngressRouteError --no-headers 2>/dev/null | wc -l | tr -d ' ') || true
    if [ "${ir_errors:-0}" -gt 0 ]; then
        add_check "traefik-ingressroutes" "warn" "$ir_errors IngressRoute error events found"
    fi
}

check_kyverno() {
    if $DRY_RUN; then
        add_check "kyverno" "ok" "dry-run: would check Kyverno status"
        return
    fi

    # Discover Kyverno pods via labels ('|| true' avoids a 'set -e' abort)
    local kyverno_pods
    kyverno_pods=$($KUBECTL get pods -n kyverno -l app.kubernetes.io/name=kyverno -o name 2>/dev/null) || true
    if [ -z "$kyverno_pods" ]; then
        kyverno_pods=$($KUBECTL get pods -n kyverno -l app=kyverno -o name 2>/dev/null) || true
    fi

    if [ -z "$kyverno_pods" ]; then
        add_check "kyverno" "warn" "No Kyverno pods found"
        return
    fi

    local total=0 ready=0
    while IFS= read -r pod; do
        [ -z "$pod" ] && continue
        total=$((total + 1))
        local phase
        phase=$($KUBECTL get "$pod" -n kyverno -o jsonpath='{.status.phase}' 2>/dev/null) || true
        [ "$phase" = "Running" ] && ready=$((ready + 1))
    done <<< "$kyverno_pods"

    if [ "$ready" -eq "$total" ]; then
        # Check policy count
        local policy_count
        policy_count=$($KUBECTL get clusterpolicy --no-headers 2>/dev/null | wc -l | tr -d ' ') || true
        add_check "kyverno" "ok" "$ready/$total Kyverno pods running, $policy_count ClusterPolicies"
    else
        add_check "kyverno" "warn" "$ready/$total Kyverno pods running"
    fi

    # Check for policy violations.
    # The Python snippet already prints 0 on bad input, so the shell fallback
    # is '|| true' rather than a second 'echo 0', which would yield "0\n0".
    local violations
    violations=$($KUBECTL get policyreport --all-namespaces -o json 2>/dev/null | \
        python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    fail_count = sum(r.get('summary',{}).get('fail',0) for r in data.get('items',[]))
    print(fail_count)
except Exception:
    print('0')
" 2>/dev/null || true)

    if [ "${violations:-0}" -gt 0 ]; then
        add_check "kyverno-violations" "warn" "$violations policy violations across namespaces"
    fi
}

check_vpa_goldilocks() {
    if $DRY_RUN; then
        add_check "vpa-goldilocks" "ok" "dry-run: would check VPA/Goldilocks status"
        return
    fi

    # Discover Goldilocks pods: by label first, then any pod in the namespace
    local vpa_pods
    vpa_pods=$($KUBECTL get pods -n goldilocks -l app.kubernetes.io/name=goldilocks -o name 2>/dev/null) || true
    if [ -z "$vpa_pods" ]; then
        vpa_pods=$($KUBECTL get pods -n goldilocks -o name 2>/dev/null) || true
    fi

    if [ -z "$vpa_pods" ]; then
        add_check "vpa-goldilocks" "warn" "No Goldilocks pods found"
        return
    fi

    local total=0 ready=0
    while IFS= read -r pod; do
        [ -z "$pod" ] && continue
        total=$((total + 1))
        local phase
        phase=$($KUBECTL get "$pod" -n goldilocks -o jsonpath='{.status.phase}' 2>/dev/null) || true
        [ "$phase" = "Running" ] && ready=$((ready + 1))
    done <<< "$vpa_pods"

    if [ "$ready" -eq "$total" ]; then
        local vpa_count
        vpa_count=$($KUBECTL get vpa --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || true
        add_check "vpa-goldilocks" "ok" "$ready/$total Goldilocks pods running, $vpa_count VPAs configured"
    else
        add_check "vpa-goldilocks" "warn" "$ready/$total Goldilocks pods running"
    fi

    # Check for VPAs with unexpected updateMode.
    # Python prints 0 on bad input, so the shell fallback is just '|| true'.
    local auto_vpas
    auto_vpas=$($KUBECTL get vpa --all-namespaces -o json 2>/dev/null | \
        python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    auto = [i['metadata']['name'] for i in data.get('items',[]) if i.get('spec',{}).get('updatePolicy',{}).get('updateMode','') == 'Auto']
    print(len(auto))
except Exception:
    print('0')
" 2>/dev/null || true)

    if [ "${auto_vpas:-0}" -gt 0 ]; then
        add_check "vpa-auto-mode" "warn" "$auto_vpas VPAs set to Auto updateMode (may cause unexpected restarts)"
    fi
}

check_pull_through_cache() {
    if $DRY_RUN; then
        add_check "pull-through-cache" "ok" "dry-run: would check pull-through cache at $REGISTRY_HOST"
        return
    fi

    if timeout 5 curl -sf "http://${REGISTRY_HOST}:5000/v2/" &>/dev/null; then
        add_check "pull-through-cache" "ok" "Pull-through cache registry at $REGISTRY_HOST:5000 is healthy"
    elif timeout 5 curl -sf "https://${REGISTRY_HOST}/v2/" &>/dev/null; then
        add_check "pull-through-cache" "ok" "Pull-through cache registry at $REGISTRY_HOST is healthy (HTTPS)"
    else
        add_check "pull-through-cache" "fail" "Pull-through cache registry at $REGISTRY_HOST is unreachable"
    fi
}

check_proxmox() {
    if $DRY_RUN; then
        add_check "proxmox" "ok" "dry-run: would check Proxmox host resources"
        return
    fi

    local cpu_load
    if cpu_load=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
        "uptime | awk -F'load average:' '{print \$2}' | awk -F, '{print \$1}' | tr -d ' '" 2>/dev/null); then
        local cpu_count
        cpu_count=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
            "nproc" 2>/dev/null || echo "1")

        # Check memory
        local mem_info
        mem_info=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$PROXMOX_HOST" \
            "free -m | awk '/Mem:/{printf \"%d/%dMB (%.0f%%)\", \$3, \$2, \$3/\$2*100}'" 2>/dev/null || echo "unknown")

        add_check "proxmox" "ok" "Proxmox host: load=$cpu_load (${cpu_count}cores), mem=$mem_info"
    else
        add_check "proxmox" "fail" "Could not reach Proxmox host via SSH"
    fi
}

check_metallb() {
    if $DRY_RUN; then
        add_check "metallb" "ok" "dry-run: would check MetalLB status"
        return
    fi

    # '|| true' keeps a kubectl failure from aborting the script under 'set -e'
    local metallb_pods
    metallb_pods=$($KUBECTL get pods -n metallb-system -l app.kubernetes.io/name=metallb -o name 2>/dev/null) || true
    if [ -z "$metallb_pods" ]; then
        metallb_pods=$($KUBECTL get pods -n metallb-system -o name 2>/dev/null) || true
    fi

    if [ -z "$metallb_pods" ]; then
        add_check "metallb" "warn" "No MetalLB pods found"
        return
    fi

    local total=0 ready=0
    while IFS= read -r pod; do
        [ -z "$pod" ] && continue
        total=$((total + 1))
        local phase
        phase=$($KUBECTL get "$pod" -n metallb-system -o jsonpath='{.status.phase}' 2>/dev/null) || true
        [ "$phase" = "Running" ] && ready=$((ready + 1))
    done <<< "$metallb_pods"

    if [ "$ready" -eq "$total" ]; then
        add_check "metallb" "ok" "$ready/$total MetalLB pods running"
    else
        add_check "metallb" "warn" "$ready/$total MetalLB pods running"
    fi
}

# Run checks
check_traefik
check_kyverno
check_vpa_goldilocks
check_pull_through_cache
check_proxmox
check_metallb

# Determine overall status
overall="ok"
for c in "${checks[@]}"; do
    if echo "$c" | grep -q '"status": "fail"'; then
        overall="fail"
        break
    elif echo "$c" | grep -q '"status": "warn"'; then
        overall="warn"
    fi
done

# Output JSON
checks_json=$(IFS=,; echo "${checks[*]}")
cat <<EOF
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
EOF
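The overall-status rule is implemented twice in this commit: as a `grep` loop over JSON fragments here and as an inline Python snippet in the `CHECKS`-string scripts. The rule itself is small (any `fail` wins, otherwise any `warn`, otherwise `ok`) and could live in one shared function that takes plain status words. A minimal sketch, where `worst_status` is a hypothetical name and not something the commit defines:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most severe of the statuses passed as
# arguments. Severity order: fail > warn > ok. No arguments yields "ok".
worst_status() {
    local worst="ok" s
    for s in "$@"; do
        case "$s" in
            fail) worst="fail"; break ;;   # nothing outranks fail, stop early
            warn) worst="warn" ;;
        esac
    done
    echo "$worst"
}
```

A caller would collect each check's status into an array alongside the JSON fragment and finish with `overall=$(worst_status "${statuses[@]}")`, which avoids matching on the exact spacing of `"status": "fail"`.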
190
.claude/scripts/resource-report.sh
Executable file
@@ -0,0 +1,190 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
DRY_RUN=false
AGENT="resource-report"

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

CHECKS="[]"

add_check() {
    local name="$1" status="$2" message="$3"
    # Pass the values as arguments instead of splicing them into the Python
    # source, so quotes or backslashes in a message cannot break the snippet.
    CHECKS=$(echo "$CHECKS" | python3 -c '
import sys, json
checks = json.load(sys.stdin)
checks.append({"name": sys.argv[1], "status": sys.argv[2], "message": sys.argv[3]})
json.dump(checks, sys.stdout)
' "$name" "$status" "$message")
}

# Node capacity report: allocatable vs requests vs limits
check_node_capacity() {
    if $DRY_RUN; then
        add_check "node-capacity" "ok" "DRY RUN: would report node allocatable vs requests vs limits"
        return
    fi

    local report
    report=$($KUBECTL get nodes -o json | python3 -c "
import sys, json

def parse_cpu(val):
    if val.endswith('m'):
        return int(val[:-1])
    return int(float(val) * 1000)

def parse_mem(val):
    # Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes are both valid quantities.
    units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4,
             'k': 1000, 'M': 1000**2, 'G': 1000**3, 'T': 1000**4}
    for suffix, mult in units.items():
        if val.endswith(suffix):
            return int(float(val[:-len(suffix)]) * mult)
    return int(val)

def fmt_mem(b):
    return f'{b / (1024**3):.1f}Gi'

def fmt_cpu(m):
    return f'{m}m'

data = json.load(sys.stdin)
nodes = []
for node in data.get('items', []):
    name = node['metadata']['name']
    alloc = node.get('status', {}).get('allocatable', {})
    cpu_alloc = parse_cpu(alloc.get('cpu', '0'))
    mem_alloc = parse_mem(alloc.get('memory', '0'))
    nodes.append({'name': name, 'cpu_alloc': cpu_alloc, 'mem_alloc': mem_alloc})

for n in nodes:
    print(f\"{n['name']}: cpu_alloc={fmt_cpu(n['cpu_alloc'])} mem_alloc={fmt_mem(n['mem_alloc'])}\")
" 2>/dev/null) || report="Failed to get node capacity"

    # Get requests/limits per node
    local usage
    usage=$($KUBECTL get pods --all-namespaces -o json | python3 -c "
import sys, json

def parse_cpu(val):
    if not val: return 0
    if val.endswith('m'):
        return int(val[:-1])
    return int(float(val) * 1000)

def parse_mem(val):
    if not val: return 0
    # Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes are both valid quantities.
    units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4,
             'k': 1000, 'M': 1000**2, 'G': 1000**3, 'T': 1000**4}
    for suffix, mult in units.items():
        if val.endswith(suffix):
            return int(float(val[:-len(suffix)]) * mult)
    return int(val)

def fmt_mem(b):
    return f'{b / (1024**3):.1f}Gi'

def fmt_cpu(m):
    return f'{m}m'

data = json.load(sys.stdin)
per_node = {}
for pod in data.get('items', []):
    phase = pod.get('status', {}).get('phase', '')
    if phase not in ('Running', 'Pending'):
        continue
    node = pod.get('spec', {}).get('nodeName', 'unscheduled')
    if node not in per_node:
        per_node[node] = {'cpu_req': 0, 'cpu_lim': 0, 'mem_req': 0, 'mem_lim': 0}
    for c in pod.get('spec', {}).get('containers', []) + pod.get('spec', {}).get('initContainers', []):
        res = c.get('resources', {})
        per_node[node]['cpu_req'] += parse_cpu(res.get('requests', {}).get('cpu', ''))
        per_node[node]['cpu_lim'] += parse_cpu(res.get('limits', {}).get('cpu', ''))
        per_node[node]['mem_req'] += parse_mem(res.get('requests', {}).get('memory', ''))
        per_node[node]['mem_lim'] += parse_mem(res.get('limits', {}).get('memory', ''))

for node in sorted(per_node.keys()):
    n = per_node[node]
    print(f\"{node}: cpu_req={fmt_cpu(n['cpu_req'])} cpu_lim={fmt_cpu(n['cpu_lim'])} mem_req={fmt_mem(n['mem_req'])} mem_lim={fmt_mem(n['mem_lim'])}\")
" 2>/dev/null) || usage="Failed to get pod resource usage"

    add_check "node-capacity" "ok" "Allocatable: ${report} | Usage: ${usage}"
}

# Per-namespace ResourceQuota usage
check_resource_quotas() {
    if $DRY_RUN; then
        add_check "resource-quotas" "ok" "DRY RUN: would check ResourceQuota usage per namespace"
        return
    fi

    local quota_count
    quota_count=$($KUBECTL get resourcequota --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') || quota_count=0

    if [ "$quota_count" -eq 0 ]; then
        add_check "resource-quotas" "ok" "No ResourceQuotas defined in the cluster"
        return
    fi

    local quota_report
    quota_report=$($KUBECTL get resourcequota --all-namespaces -o json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
results = []
for rq in data.get('items', []):
    ns = rq['metadata']['namespace']
    name = rq['metadata']['name']
    hard = rq.get('status', {}).get('hard', {})
    used = rq.get('status', {}).get('used', {})
    for resource in hard:
        h = hard[resource]
        u = used.get(resource, '0')
        results.append(f'{ns}/{name}: {resource} used={u} hard={h}')
if results:
    print('; '.join(results[:30]))
else:
    print('No quota usage data')
" 2>/dev/null) || quota_report="Failed to read ResourceQuotas"

    add_check "resource-quotas" "ok" "$quota_report"
}

# Top pods by memory usage
check_top_consumers() {
    if $DRY_RUN; then
        add_check "top-consumers" "ok" "DRY RUN: would report top memory-consuming pods"
        return
    fi

    # awk both limits the output to ten rows and joins them with "; "
    # (tr maps single characters, so it cannot emit a two-character separator).
    # On failure top_pods is left empty so the warn branch below fires.
    local top_pods
    top_pods=$($KUBECTL top pods --all-namespaces --no-headers 2>/dev/null | sort -k4 -h -r | awk 'NR<=10 {printf "%s/%s: cpu=%s mem=%s; ", $1, $2, $3, $4}') || top_pods=""

    if [ -z "$top_pods" ]; then
        add_check "top-consumers" "warn" "kubectl top returned no data; metrics-server may not be running"
    else
        add_check "top-consumers" "ok" "Top 10 by memory: ${top_pods}"
    fi
}

# Run all checks
check_node_capacity
check_resource_quotas
check_top_consumers

# Determine overall status
OVERALL=$(echo "$CHECKS" | python3 -c "
import sys, json
checks = json.load(sys.stdin)
statuses = [c['status'] for c in checks]
if 'fail' in statuses:
    print('fail')
elif 'warn' in statuses:
    print('warn')
else:
    print('ok')
")

echo "{\"status\": \"$OVERALL\", \"agent\": \"$AGENT\", \"checks\": $CHECKS}" | python3 -m json.tool
143
.claude/scripts/tls-check.sh
Executable file
@@ -0,0 +1,143 @@
#!/usr/bin/env bash
set -euo pipefail

KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
AGENT="tls-check"
DRY_RUN=false
WARN_DAYS=14

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    # Escape backslashes and double quotes so the message stays valid JSON.
    message=${message//\\/\\\\}
    message=${message//\"/\\\"}
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

check_tls_secrets() {
|
||||||
|
if $DRY_RUN; then
|
||||||
|
add_check "tls-secrets" "ok" "dry-run: would scan all kubernetes.io/tls secrets for expiry"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local secrets_json
|
||||||
|
secrets_json=$($KUBECTL get secrets -A -o json 2>/dev/null) || {
|
||||||
|
add_check "tls-secrets" "fail" "Failed to list secrets"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
local tls_secrets
|
||||||
|
tls_secrets=$(echo "$secrets_json" | jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null) || {
|
||||||
|
add_check "tls-secrets" "fail" "Failed to parse secrets JSON"
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
if [ -z "$tls_secrets" ]; then
|
||||||
|
add_check "tls-secrets" "warn" "No TLS secrets found"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local total=0 expiring=0 expired=0 healthy=0 errors=0
|
||||||
|
local now_epoch
|
||||||
|
now_epoch=$(date +%s)
|
||||||
|
local warn_epoch=$((now_epoch + WARN_DAYS * 86400))
|
||||||
|
local expiring_list=""
|
||||||
|
|
||||||
|
while IFS= read -r secret; do
|
||||||
|
total=$((total + 1))
|
||||||
|
local ns="${secret%%/*}"
|
||||||
|
local name="${secret##*/}"
|
||||||
|
|
||||||
|
local cert_pem
|
||||||
|
cert_pem=$($KUBECTL get secret "$name" -n "$ns" -o jsonpath='{.data.tls\.crt}' 2>/dev/null | base64 -d 2>/dev/null) || {
|
||||||
|
errors=$((errors + 1))
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
local expiry_str
|
||||||
|
expiry_str=$(echo "$cert_pem" | openssl x509 -noout -enddate 2>/dev/null | sed 's/notAfter=//') || {
|
||||||
|
errors=$((errors + 1))
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
local expiry_epoch
|
||||||
|
expiry_epoch=$(date -j -f "%b %d %T %Y %Z" "$expiry_str" +%s 2>/dev/null || date -d "$expiry_str" +%s 2>/dev/null) || {
|
||||||
|
errors=$((errors + 1))
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
if [ "$expiry_epoch" -lt "$now_epoch" ]; then
|
||||||
|
expired=$((expired + 1))
|
||||||
|
expiring_list="${expiring_list}EXPIRED: ${ns}/${name}; "
|
||||||
|
elif [ "$expiry_epoch" -lt "$warn_epoch" ]; then
|
||||||
|
local days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
|
||||||
|
expiring=$((expiring + 1))
|
||||||
|
expiring_list="${expiring_list}${days_left}d: ${ns}/${name}; "
|
||||||
|
else
|
||||||
|
healthy=$((healthy + 1))
|
||||||
|
fi
|
||||||
|
done <<< "$tls_secrets"
|
||||||
|
|
||||||
|
if [ "$expired" -gt 0 ]; then
|
||||||
|
add_check "tls-secrets" "fail" "${expired} expired, ${expiring} expiring soon, ${healthy} healthy out of ${total} certs. ${expiring_list}"
|
||||||
|
elif [ "$expiring" -gt 0 ]; then
|
||||||
|
add_check "tls-secrets" "warn" "${expiring} expiring within ${WARN_DAYS}d, ${healthy} healthy out of ${total} certs. ${expiring_list}"
|
||||||
|
else
|
||||||
|
add_check "tls-secrets" "ok" "All ${healthy} TLS certs healthy (${errors} decode errors skipped)"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
check_cert_manager() {
    if $DRY_RUN; then
        add_check "cert-manager" "ok" "dry-run: would check cert-manager pod health and certificate CRDs"
        return
    fi

    local cm_pods
    cm_pods=$($KUBECTL get pods -n cert-manager -l app.kubernetes.io/instance=cert-manager --no-headers 2>/dev/null) || {
        add_check "cert-manager" "fail" "Failed to query cert-manager pods"
        return
    }

    # grep -c prints "0" and exits non-zero when nothing matches, so use
    # "|| true" here; "|| echo 0" would append a second "0" and break the
    # integer comparison below.
    local not_running
    not_running=$(echo "$cm_pods" | grep -v "Running" | grep -v "Completed" | grep -c "." 2>/dev/null || true)

    if [ "$not_running" -gt 0 ]; then
        add_check "cert-manager" "fail" "${not_running} cert-manager pod(s) not running"
        return
    fi

    # Check for failed certificates
    local failed_certs
    failed_certs=$($KUBECTL get certificates -A -o json 2>/dev/null | jq -r '.items[] | select(.status.conditions[]? | select(.type=="Ready" and .status=="False")) | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null) || {
        add_check "cert-manager" "warn" "Could not query certificate CRDs"
        return
    }

    if [ -n "$failed_certs" ]; then
        local count
        count=$(echo "$failed_certs" | wc -l | tr -d ' ')
        # List at most five names, comma-separated, without a trailing comma
        add_check "cert-manager" "warn" "${count} certificate(s) not ready: $(echo "$failed_certs" | head -5 | tr '\n' ',' | sed 's/,$//')"
    else
        add_check "cert-manager" "ok" "cert-manager healthy, all certificates ready"
    fi
}

check_tls_secrets
check_cert_manager

# Output JSON
overall="ok"
for c in "${checks[@]}"; do
    s=$(echo "$c" | jq -r '.status')
    if [ "$s" = "fail" ]; then overall="fail"; break; fi
    if [ "$s" = "warn" ]; then overall="warn"; fi
done

printf '{"status": "%s", "agent": "%s", "checks": [%s]}\n' \
    "$overall" "$AGENT" "$(IFS=,; echo "${checks[*]}")"
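The `add_check` helpers in these scripts build each JSON object by string interpolation, so a message that contains a double quote, backslash, or newline would yield invalid JSON. A minimal sketch of one way to guard against that is shown here, assuming a hypothetical `json_escape` helper that is not part of the original scripts and that only covers the most common control characters:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original scripts): escape a string so it can
# be placed inside a JSON string literal. Handles backslash, double quote,
# newline, tab, and carriage return only.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}       # backslash must be escaped first
    s=${s//\"/\\\"}       # double quote
    s=${s//$'\n'/\\n}     # newline
    s=${s//$'\t'/\\t}     # tab
    s=${s//$'\r'/\\r}     # carriage return
    printf '%s' "$s"
}

checks=()

# Same shape as the scripts' add_check, with name and message escaped.
add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$(json_escape "$name")\", \"status\": \"$status\", \"message\": \"$(json_escape "$message")\"}")
}

add_check "zfs-pools" "warn" 'ZFS pool status "unclear"'
printf '%s\n' "${checks[0]}"
```

If `jq` is already a dependency, as it is in the TLS check, building the object with `jq -n --arg` would be an alternative that covers every control character.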
186
.claude/scripts/truenas-status.sh
Executable file
@@ -0,0 +1,186 @@
#!/usr/bin/env bash
set -euo pipefail

AGENT="truenas-status"
TRUENAS_HOST="root@10.0.10.15"
DRY_RUN=false

for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
    esac
done

checks=()

add_check() {
    local name="$1" status="$2" message="$3"
    checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}

ssh_cmd() {
    # -n redirects ssh's stdin from /dev/null so that calls made inside a
    # `while read` loop do not consume the loop's remaining input.
    timeout 15 ssh -n -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$TRUENAS_HOST" "$@" 2>/dev/null
}

check_zfs_pools() {
    if $DRY_RUN; then
        add_check "zfs-pools" "ok" "dry-run: would check ZFS pool status"
        return
    fi

    local pool_status
    if ! pool_status=$(ssh_cmd "zpool status -x" 2>/dev/null); then
        add_check "zfs-pools" "fail" "Could not retrieve ZFS pool status via SSH"
        return
    fi

    if echo "$pool_status" | grep -q "all pools are healthy"; then
        add_check "zfs-pools" "ok" "All ZFS pools are healthy"
    else
        # "|| true" keeps set -e/pipefail from aborting the script when grep
        # finds no "pool:" line, so the "unclear" branch below stays reachable.
        local degraded_pools
        degraded_pools=$(echo "$pool_status" | grep "pool:" | awk '{print $2}' | tr '\n' ', ' | sed 's/,$//' || true)
        if [ -n "$degraded_pools" ]; then
            add_check "zfs-pools" "fail" "Degraded ZFS pools: $degraded_pools"
        else
            add_check "zfs-pools" "warn" "ZFS pool status unclear: $(echo "$pool_status" | head -1 | tr '"' "'")"
        fi
    fi

    # Check pool capacity
    local pool_list
    if pool_list=$(ssh_cmd "zpool list -H -o name,cap" 2>/dev/null); then
        while IFS=$'\t' read -r pool_name cap_pct; do
            local cap_num
            cap_num=$(echo "$cap_pct" | tr -d '%')
            if [ -n "$cap_num" ] && [ "$cap_num" -ge 90 ]; then
                add_check "zfs-capacity-$pool_name" "fail" "Pool $pool_name is ${cap_pct} full"
            elif [ -n "$cap_num" ] && [ "$cap_num" -ge 80 ]; then
                add_check "zfs-capacity-$pool_name" "warn" "Pool $pool_name is ${cap_pct} full"
            else
                add_check "zfs-capacity-$pool_name" "ok" "Pool $pool_name is ${cap_pct} full"
            fi
        done <<< "$pool_list"
    fi
}

check_smart_health() {
    if $DRY_RUN; then
        add_check "smart-health" "ok" "dry-run: would check SMART disk health"
        return
    fi

    local disk_list
    if ! disk_list=$(ssh_cmd "smartctl --scan" 2>/dev/null); then
        add_check "smart-health" "warn" "Could not scan disks for SMART status"
        return
    fi

    local fail_count=0
    local total_count=0
    local failed_disks=""

    while IFS= read -r line; do
        local dev
        dev=$(echo "$line" | awk '{print $1}')
        [ -z "$dev" ] && continue
        total_count=$((total_count + 1))

        # smartctl exits non-zero when a disk is failing, so capture the
        # output regardless of exit status and judge the disk by its content.
        local health
        health=$(ssh_cmd "smartctl -H '$dev'" 2>/dev/null) || true
        if [ -n "$health" ] && ! echo "$health" | grep -qiE "PASSED|OK"; then
            fail_count=$((fail_count + 1))
            failed_disks="$failed_disks $dev"
        fi
    done <<< "$disk_list"

    if [ "$fail_count" -gt 0 ]; then
        add_check "smart-health" "fail" "$fail_count/$total_count disks failing SMART:$failed_disks"
    elif [ "$total_count" -gt 0 ]; then
        add_check "smart-health" "ok" "All $total_count disks pass SMART health checks"
    else
        add_check "smart-health" "warn" "No disks found for SMART check"
    fi
}

check_replication() {
    if $DRY_RUN; then
        add_check "replication" "ok" "dry-run: would check replication task status"
        return
    fi

    # Check for any failed replication tasks via midclt if available
    local repl_status
    if repl_status=$(ssh_cmd "midclt call replication.query 2>/dev/null" 2>/dev/null); then
        local failed
        failed=$(echo "$repl_status" | python3 -c "
import sys, json
try:
    tasks = json.load(sys.stdin)
    failed = [t.get('name','unknown') for t in tasks if t.get('state',{}).get('state','') == 'ERROR']
    print(len(failed))
except Exception:
    print('error')
" 2>/dev/null || echo "error")

        if [ "$failed" = "error" ]; then
            add_check "replication" "warn" "Could not parse replication task status"
        elif [ "$failed" = "0" ]; then
            add_check "replication" "ok" "All replication tasks healthy"
        else
            add_check "replication" "fail" "$failed replication tasks in ERROR state"
        fi
    else
        # Fallback: count active zfs send processes. The [z] bracket keeps the
        # pattern from matching the remote shell that runs this command, and
        # the "|| ..." guard keeps set -e from aborting if SSH itself fails.
        local send_procs
        send_procs=$(ssh_cmd "pgrep -f '[z]fs send' | wc -l | tr -d ' '") || send_procs="unknown"
        add_check "replication" "warn" "midclt unavailable; $send_procs active zfs send processes"
    fi
}

check_iscsi() {
    if $DRY_RUN; then
        add_check "iscsi-targets" "ok" "dry-run: would check iSCSI target status"
        return
    fi

    local target_status
    if target_status=$(ssh_cmd "ctladm islist 2>/dev/null || targetcli ls 2>/dev/null" 2>/dev/null); then
        # Count non-empty lines; `wc -l` would report 1 even for empty output,
        # which made the "no targets listed" branch unreachable.
        local target_count
        target_count=$(echo "$target_status" | grep -c . || true)
        if [ "$target_count" -gt 0 ]; then
            add_check "iscsi-targets" "ok" "iSCSI service active with $target_count entries"
        else
            add_check "iscsi-targets" "warn" "iSCSI service active but no targets listed"
        fi
    else
        # Try checking if the service is at least running
        if ssh_cmd "midclt call iscsi.global.config" &>/dev/null; then
            add_check "iscsi-targets" "ok" "iSCSI service is configured and running"
        else
            add_check "iscsi-targets" "warn" "Could not query iSCSI target status"
        fi
    fi
}

# Run checks
check_zfs_pools
check_smart_health
check_replication
check_iscsi

# Determine overall status
overall="ok"
for c in "${checks[@]}"; do
    if echo "$c" | grep -q '"status": "fail"'; then
        overall="fail"
        break
    elif echo "$c" | grep -q '"status": "warn"'; then
        overall="warn"
    fi
done

# Output JSON
checks_json=$(IFS=,; echo "${checks[*]}")
cat <<EOF
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
EOF
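`check_smart_health` calls `ssh_cmd` from inside a `while read` loop fed by a here-string. A command that reads stdin, as `ssh` does unless given `-n` or `</dev/null`, can swallow the rest of the loop's input so that only the first line is processed. The following sketch illustrates the effect with a hypothetical `greedy` function standing in for `ssh`; the function names and device paths are illustrative assumptions, not part of the script:

```shell
#!/usr/bin/env bash
# Stand-in for ssh without -n: it reads everything available on stdin.
greedy() { cat >/dev/null; }

# $1 = "shared":   the command inherits the loop's stdin.
# anything else:   the command's stdin comes from /dev/null, which is
#                  effectively what `ssh -n` arranges.
count_lines() {
    local mode="$1" n=0 line
    while IFS= read -r line; do
        n=$((n + 1))
        if [ "$mode" = "shared" ]; then
            greedy
        else
            greedy </dev/null
        fi
    done <<< $'/dev/sda\n/dev/sdb\n/dev/sdc'
    echo "$n"
}

count_lines shared     # prints 1: the remaining lines were consumed by greedy
count_lines isolated   # prints 3: every line reaches the loop
```

Reading the loop's input on a separate file descriptor (`read -r line <&3` with `3<<<"$disk_list"`) is another common way to get the same isolation.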