Commit graph

141 commits

Author SHA1 Message Date
Viktor Barzin
e51c063600 docs(add-user): update skill with actual working flow (no auto TF apply) 2026-03-18 00:28:46 +00:00
Viktor Barzin
fd130971aa feat(provision): automated user provisioning via Authentik webhook
- Expand CI Vault policy: write secret/data/platform + Transit SOPS keys
- Add Woodpecker provision-user.yml pipeline (manual event, API-triggered)
- Add env vars to webhook-handler deployment for Woodpecker/Authentik integration
- Update add-user skill with automated flow documentation
- Update Woodpecker repo ID list in CLAUDE.md
2026-03-17 23:56:30 +00:00
Viktor Barzin
ccbcebb670 feat(vault): automate SOPS onboarding for namespace-owners
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access

[ci skip]
2026-03-17 23:15:25 +00:00
Viktor Barzin
6239e07dd5 docs: add plotting-book to GHA-migrated list and repo IDs [ci skip] 2026-03-17 23:07:32 +00:00
Viktor Barzin
d6afbe84c8 post-mortem v2: pipeline team architecture with 4-stage agents [ci skip]
Split monolithic orchestrator into triage (haiku), historian (sonnet),
and report-writer (opus) stages. Each stage gets its own tool budget.
Added sev-context.sh for structured cluster context gathering.
2026-03-16 21:59:34 +00:00
Viktor Barzin
0abb6b83ad add deploy-app skill and agent for automated repo→app deployment [ci skip] 2026-03-16 18:06:24 +00:00
Viktor Barzin
88abbef7c3 update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip] 2026-03-16 07:10:45 +00:00
Viktor Barzin
b87ba5e778 update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip] 2026-03-15 23:21:52 +00:00
Viktor Barzin
c8069f53c8 update claude knowledge: final ESO migration state [ci skip] 2026-03-15 22:32:46 +00:00
Viktor Barzin
6c8a42b4e3 add add-user skill for cluster onboarding
Interactive skill that collects user info, updates Vault KV k8s_users,
and applies vault/platform/woodpecker stacks. Includes verification
checklist and auto-generated resource table.
2026-03-15 22:28:54 +00:00
Viktor Barzin
23dfaa1ac8 update claude knowledge: vault-native secrets migration decisions [ci skip] 2026-03-15 21:00:07 +00:00
Viktor Barzin
cfc30b62e8 enhance devops-engineer agent: deploy + monitor pod health [ci skip]
- Upgrade model from sonnet to opus for subagent orchestration
- Add Write, Edit, Agent tools for spawning monitor subagents
- Add mandatory deployment workflow: pre-deploy snapshot, apply,
  spawn background haiku pod monitor, react to results
- Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck
  Pending, and probe failures within 3 min timeout
- Allow terragrunt apply and kubectl set image as safe operations
2026-03-15 18:44:20 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
d17a6e2fd3 fix calendar-query.py: use get_display_name(), URL-decode names, fix search API
- Replace deprecated cal.name with cal_name() helper using get_display_name()
- URL-decode calendar names (Formula+1 → Formula 1)
- Use cal.search(event=True) instead of deprecated date_search()
- Default to showing all calendars instead of filtering to Personal
2026-03-15 16:12:36 +00:00
Viktor Barzin
944d6d3b22 update claude knowledge: resource management learnings from right-sizing session [ci skip] 2026-03-15 15:38:37 +00:00
Viktor Barzin
8bac6db48f add name/description/tools to review-loop agent frontmatter [ci skip] 2026-03-15 11:14:31 +00:00
Viktor Barzin
616370d34c rename planner agent to review-loop [ci skip] 2026-03-15 11:12:14 +00:00
Viktor Barzin
123e996b04 add planner agent: plan-review-fix convergence loop [ci skip] 2026-03-15 10:46:53 +00:00
Viktor Barzin
307b7f6819 update claude knowledge: infra operational learnings from commit history [ci skip]
Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
2026-03-15 10:46:45 +00:00
Viktor Barzin
0a69af618d update claude knowledge: vault KV secrets migration [ci skip] 2026-03-15 03:22:07 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
Viktor Barzin
5f71a53b08 add memory-tool instructions to project CLAUDE.md [ci skip]
OpenClaw agents read the project-level CLAUDE.md from the workspace.
Adding explicit memory-tool CLI instructions here ensures the agent
uses exec to call memory-tool instead of looking for non-existent
MCP tools (memory_store, memory_recall).
2026-03-15 02:16:03 +00:00
Viktor Barzin
456e2777f5 update claude knowledge: LinuxServer.io container optimization learnings [ci skip] 2026-03-15 02:04:04 +00:00
Viktor Barzin
ff83ec3325 add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts
Agents: devops-engineer, dba, security-engineer, sre, network-engineer,
platform-engineer, observability-engineer, home-automation-engineer.
Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status,
authentik-audit, oom-investigator, resource-report, dns-check, network-health,
nfs-health, truenas-status, platform-status, monitoring-health.
Also: known-issues.md suppression list, cluster-health-checker port-forward fix.
2026-03-15 02:01:07 +00:00
Viktor Barzin
916aa6c6cb update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip] 2026-03-14 23:42:17 +00:00
Viktor Barzin
4635d3b826 remember: CrowdSec Helm upgrade timeout [ci skip] 2026-03-14 12:04:07 +00:00
Viktor Barzin
d05ff57b11 authentik: auto-assign invitation group via expression policy [ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
2026-03-13 22:21:10 +00:00
Viktor Barzin
160fda882f authentik: cleanup unused resources + add invitation enrollment flow [ci skip]
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)

Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
  identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill
2026-03-13 22:21:10 +00:00
OpenClaw
4a9bd89b11 feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
2026-03-13 07:32:36 +00:00
OpenClaw
a09967e098 feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-13 07:16:56 +00:00
Viktor Barzin
bef0c073d5 fix DNS health check: use system resolver instead of hardcoded 10.0.20.101
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 09:08:34 +00:00
Viktor Barzin
2fa8ba2038 [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
2026-03-08 20:03:50 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
98f4920af1 [ci skip] remember: update kubelet thresholds when changing node memory 2026-03-08 10:34:17 +00:00
Viktor Barzin
7027c49fef [ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info 2026-03-07 20:39:55 +00:00
Viktor Barzin
7cc7991ce6 [ci skip] claudeception: extract 2 skills from today's session
1. sops-age-secrets-migration: Complete guide for migrating from git-crypt
   to SOPS+age. Covers JSON format requirement, race condition avoidance,
   CI integration, complex types, and migration sequence.

2. iterative-plan-review-with-subagents: Design pattern for reviewing plans
   with parallel security + implementation subagents. 2-3 iterations to
   zero CRITICALs. Used successfully for the SOPS migration design.
2026-03-07 15:46:36 +00:00
Viktor Barzin
9f2ac0fd1a [ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline
AGENTS.md: added SOPS secrets management section, scripts/tg usage,
contributor onboarding steps, pull-through cache bypass notes.

CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder,
versioned tag guidance for pull-through cache.

CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys
the k8s portal when files under stacks/platform/modules/k8s-portal/files/
change on master push. Uses buildx for linux/amd64.
2026-03-07 15:37:19 +00:00
Viktor Barzin
5907e50fda [ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS)
Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user,
config path, and platform info to reflect HA OS on 192.168.8.103.
2026-03-07 14:34:44 +00:00
Viktor Barzin
8d3db35b5e [ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00
Viktor Barzin
c170351e77 [ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.

Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).

Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Viktor Barzin
bcbe8b23b4 [ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent
- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense,
  home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash)
  for autonomous health checks without consuming main context
2026-03-06 23:17:40 +00:00
Viktor Barzin
e6234d4683 [ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki 2026-03-06 21:05:21 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
61a532054e [ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module 2026-03-02 02:04:47 +00:00
Viktor Barzin
de598996f1 [ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
2026-03-01 21:46:41 +00:00
Viktor Barzin
53be356f41 [ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery
New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables
and background merges. Covers config.d mount crash (exit code 36), CronJob
truncation workaround, and diagnostic commands.

Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery
via liveness probe pattern (nvidia-smi + app health check).
2026-03-01 21:04:19 +00:00
Viktor Barzin
ccf0b2232f [ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
  for non-core). Terraform is now sole authority for container resources.
  Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
  alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
  openclaw, trading-bot (now handled globally by Kyverno policy).
2026-03-01 19:03:49 +00:00
Viktor Barzin
f2c66f070b [ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
2026-03-01 18:14:37 +00:00
Viktor Barzin
4beadc2ca2 [ci skip] add openclaw-k8s-deployment skill from claudeception
Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes:
- wizard block required for Telegram, exec.host valid values,
- VPA resource overrides, file permissions, startup command,
- modelrelay sidecar, NFS caching strategy
2026-03-01 18:10:33 +00:00