Commit graph

89 commits

Author SHA1 Message Date
Viktor Barzin
e8bcf21127
[ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
Viktor Barzin
6e1586342c
[ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
098c0eb752
[ci skip] update claude knowledge: never restart NFS, NFS export dir prereq 2026-02-28 12:20:36 +00:00
Viktor Barzin
00d9458dec
[ci skip] trim CLAUDE.md: remove discoverable info, deduplicate 2026-02-23 23:10:13 +00:00
Viktor Barzin
c986e9bd56
[ci skip] update claude knowledge: infrastructure hardening changes
- NFS volumes now use var.nfs_server (not hardcoded IP)
- Shared infra variables documented (redis_host, postgresql_host, etc.)
- Tiers locals now generated by terragrunt.hcl, not duplicated in stacks
- Traefik security hardening documented (API, headers, rate limiting)
- Kyverno pod security policies documented (audit mode)
- Prometheus alert groups updated (Critical Services, PVPredictedFull)
- Loki retention updated to 30d, Alloy memory to 512Mi/1Gi
- Grampsweb now protected by Authentik
- MeshCentral registration disabled
2026-02-23 22:08:46 +00:00
Viktor Barzin
0eababf212
[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
860077a126
[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces
Add resource-governance/custom-quota=true label to both namespaces so
Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.
2026-02-22 23:14:53 +00:00
Viktor Barzin
cf67e02135
[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
ff66adbe9e
[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00
Viktor Barzin
512b7d08a5
[ci skip] Merge 3 Traefik skills into traefik-helm-configuration
Consolidated traefik-http3-quic, traefik-udp-cross-namespace, and
traefik-plugin-download-failure-404 into a single skill with sections
for HTTP/3 (QUIC), UDP cross-namespace routing, and plugin download
failure troubleshooting.
2026-02-22 22:09:26 +00:00
Viktor Barzin
072642d779
[ci skip] Merge 2 rewrite-body skills into traefik-rewrite-body-troubleshooting 2026-02-22 22:09:03 +00:00
Viktor Barzin
9488af2397
[ci skip] Add rewrite-body Accept header skill, update NFS skill
New skill: traefik-rewrite-body-accept-header — rewrite-body plugin
silently skips injection when request Accept header doesn't contain
text/html (curl default Accept: */* doesn't match).

Updated: k8s-nfs-mount-troubleshooting v1.1.0 — added variant for
non-root container UID permission denied on NFS writes.
2026-02-22 21:41:07 +00:00
Viktor Barzin
27bbfdc050
[ci skip] update claude knowledge: add anti-AI scraping & poison-fountain docs 2026-02-22 21:36:40 +00:00
Viktor Barzin
f1a27ed2f9
[ci skip] Add Woodpecker CI stack (WIP) and claude agents
- Add stacks/woodpecker/ with Helm-based deployment config
- Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls)
- Add NFS export entry for woodpecker
- Add .claude/agents/ definitions
2026-02-22 21:30:25 +00:00
Viktor Barzin
e051d45160
Apply only platform stack in CI (matches old pipeline scope) 2026-02-22 18:59:02 +00:00
Viktor Barzin
ea77b91c06
Update Drone CI pipeline for Terragrunt stack architecture
Default pipeline now uses terragrunt run --all to apply all stacks
instead of the broken terraform apply -target=module.kubernetes_cluster.
TLS renewal pipeline stripped of unnecessary Terraform download/init
since renew2.sh is pure shell (certbot + Cloudflare DNS).
2026-02-22 17:47:06 +00:00
Viktor Barzin
534e63c9b8
[ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
b692eb0c34
[ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
e4367854f4
[ci skip] Update CLAUDE.md for module colocation
Reflect new directory structure where service modules live inside
their stack directories (stacks/<service>/module/) instead of
modules/kubernetes/<service>/. Update file paths, adding service
instructions, and stack structure documentation.
2026-02-22 14:39:22 +00:00
Viktor Barzin
73cb696f12
[ci skip] Update CLAUDE.md for Terragrunt migration 2026-02-22 14:12:37 +00:00
Viktor Barzin
1742a79fb2
[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
a79c4bc062
[ci skip] Fix Python f-string quoting in Slack formatter 2026-02-22 01:23:56 +00:00
Viktor Barzin
6103445e53
[ci skip] Improve Slack message formatting with human-readable names and structured blocks 2026-02-22 01:16:47 +00:00
Viktor Barzin
86a60bcfe6
[ci skip] Upgrade cluster-health.sh to full 24-check version with Slack 2026-02-22 01:09:02 +00:00
Viktor Barzin
f6a3e15cd1
[ci skip] Fix Slack message formatting to use Block Kit mrkdwn 2026-02-22 01:00:48 +00:00
Viktor Barzin
a459533225
[ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00
Viktor Barzin
ac08707068
[ci skip] Add cluster health check script for OpenClaw agent 2026-02-22 00:00:47 +00:00
Viktor Barzin
86d1d50ad0
[ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
d6cdbeaabe
[ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
Viktor Barzin
5fde3ab145
[ci skip] Add GitHub & Drone CI API access documentation 2026-02-21 19:14:41 +00:00
Viktor Barzin
3739906747
[ci skip] Add skills: pfsense-nat-rule-creation, coturn-k8s-without-hostnetwork 2026-02-21 18:29:32 +00:00
Viktor Barzin
71a4fb172c
[ci skip] Add Music Assistant librespot stale credentials skill
New skill: music-assistant-librespot-wrong-account
- Documents fix for Spotify playback failing with "librespot does not support
  free accounts" when cached credentials point to wrong Spotify account
- Includes step-by-step solution: find container, inspect cache, clear and restart

Updated: home-assistant skill with Music Assistant addon details for ha-sofia
2026-02-21 11:23:24 +00:00
Viktor Barzin
238db327f9
[ci skip] Add ground rules: no secrets, CI/CD required, monitoring required 2026-02-19 23:48:44 +00:00
Viktor Barzin
66f1867ccb
[ci skip] Update knowledge base: add OpenClaw service, rename moltbot references 2026-02-18 22:39:58 +00:00
Viktor Barzin
f23246f9a3
[ci skip] Add skills: authentik-oidc-kubernetes, kubelet-static-pod-manifest-update
Two skills extracted from multi-user k8s access implementation:
- authentik-oidc-kubernetes: 6 gotchas for Authentik OIDC + kube-apiserver
- kubelet-static-pod-manifest-update: full restart cycle for static pod changes
2026-02-17 22:56:03 +00:00
Viktor Barzin
1457a04695
[ci skip] Add Authentik management skill for API-based identity provider control 2026-02-17 22:55:41 +00:00
Viktor Barzin
84d9a3f926
[ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes 2026-02-17 22:16:46 +00:00
Viktor Barzin
03651080bb
[ci skip] Update Authentik API token reference to terraform.tfvars 2026-02-17 22:03:55 +00:00
Viktor Barzin
79ce0db11c
[ci skip] Pass skill secrets to moltbot container and fix Python env
- Add skill_secrets variable to moltbot module with HA tokens and
  Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
  in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of ~/.venvs/claude
  venv path that doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
2026-02-17 21:53:32 +00:00
Viktor Barzin
72ff14e2df
[ci skip] Add Authentik API management knowledge 2026-02-17 21:10:40 +00:00
Viktor Barzin
6a8efa69c4
[ci skip] Import Claude skills into OpenClaw moltbot
- Convert setup-project and extend-vm-storage from standalone .md
  to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
  at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
2026-02-17 21:09:12 +00:00
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
1802b4c86d
[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0 2026-02-16 22:30:57 +00:00
Viktor Barzin
c08bab7da7
[ci skip] Update preference: always use cluster_healthcheck.sh for health checks 2026-02-16 21:19:49 +00:00
Viktor Barzin
32a7c04deb
[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks 2026-02-16 19:45:31 +00:00
Viktor Barzin
7854f89add
[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood
Documents how Kubernetes ndots:5 search domain expansion floods external
DNS with NxDomain queries, and the CoreDNS template block fix.
2026-02-15 21:52:27 +00:00
Viktor Barzin
8d38e1474b
[ci skip] Document Terraform state splitting plan for future implementation 2026-02-15 21:10:40 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
df840a0078
[ci skip] remember: spawn subagent to monitor pods instead of sleeping 2026-02-15 17:48:42 +00:00
Viktor Barzin
2d475c8496
[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure 2026-02-15 17:18:17 +00:00