Commit graph

49 commits

Author SHA1 Message Date
Viktor Barzin
c8de2c4803 [ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
27dc486a4d [ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces
Add resource-governance/custom-quota=true label to both namespaces so
Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.
2026-02-22 23:14:53 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
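A minimal sketch of the "detect value changes, not just flag presence" idea from cc7f119578. The commit does not show the actual check, so the helper names and sample values below are hypothetical. Only `--oidc-issuer-url` and `--oidc-client-id` are real kube-apiserver flags.

```python
def parse_flags(args):
    """Map '--name=value' arguments to {name: value}; a bare '--name' maps to ''."""
    flags = {}
    for arg in args:
        if not arg.startswith("--"):
            continue
        name, _, value = arg[2:].partition("=")
        flags[name] = value
    return flags


def drifted_flags(current_args, desired):
    """Return the desired flags that are missing or carry a different value."""
    current = parse_flags(current_args)
    return {
        name: value
        for name, value in desired.items()
        if current.get(name) != value
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    current_args = [
        "--oidc-issuer-url=https://old.example.com",
        "--oidc-client-id=kubernetes",
    ]
    desired = {
        "oidc-issuer-url": "https://new.example.com",
        "oidc-client-id": "kubernetes",
    }
    for name, value in drifted_flags(current_args, desired).items():
        print(f"--{name} should be {value}")
```

A presence-only check would report nothing here, because both flags already exist. Comparing values is what surfaces the changed issuer URL.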
Viktor Barzin
abe89c926e [ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00
Viktor Barzin
e5729c68b8 [ci skip] update claude knowledge: add anti-AI scraping & poison-fountain docs
2026-02-22 21:36:40 +00:00
Viktor Barzin
5cfe6595cd Apply only platform stack in CI (matches old pipeline scope)
2026-02-22 18:59:02 +00:00
Viktor Barzin
9ee3140b34 Update Drone CI pipeline for Terragrunt stack architecture
Default pipeline now uses terragrunt run --all to apply all stacks
instead of the broken terraform apply -target=module.kubernetes_cluster.
TLS renewal pipeline stripped of unnecessary Terraform download/init
since renew2.sh is pure shell (certbot + Cloudflare DNS).
2026-02-22 17:47:06 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
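A minimal sketch of the address rewrite implied by "strip module.<name>. prefix" in c7c7047f1c. The commit does not say how the state files were updated, so the helper names and the sample addresses are hypothetical. The sketch only computes old/new address pairs and does not touch any state.

```python
import re

# Matches one leading `module.<name>.` segment of a Terraform resource address.
_WRAPPER_PREFIX = re.compile(r"^module\.[^.\[\]]+\.")


def strip_wrapper_prefix(address):
    """Drop a single leading `module.<name>.` segment, if present."""
    return _WRAPPER_PREFIX.sub("", address, count=1)


def plan_moves(addresses):
    """Return (old, new) pairs for the addresses that actually change."""
    moves = []
    for old in addresses:
        new = strip_wrapper_prefix(old)
        if new != old:
            moves.append((old, new))
    return moves


if __name__ == "__main__":
    # Illustrative addresses; not taken from the repository.
    sample = [
        "module.blog.kubernetes_deployment.blog",
        "kubernetes_namespace.blog",
    ]
    for old, new in plan_moves(sample):
        print(f"{old} -> {new}")
```

Each pair could then be fed to whatever state-move mechanism is in use. A plan showing 0 add and 0 destroy, as the commit reports, is the check that the rewrite was complete.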
Viktor Barzin
b0499a7f31 [ci skip] Update CLAUDE.md for module colocation
Reflect new directory structure where service modules live inside
their stack directories (stacks/<service>/module/) instead of
modules/kubernetes/<service>/. Update file paths, adding service
instructions, and stack structure documentation.
2026-02-22 14:39:22 +00:00
Viktor Barzin
7ef1a0a8bb [ci skip] Update CLAUDE.md for Terragrunt migration
2026-02-22 14:12:37 +00:00
Viktor Barzin
98b711ff8d [ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
517f5d6a6c [ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
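The quota changes in 517f5d6a6c can be tabulated to see how much headroom each tier gained. This is a sketch only. The numbers are copied from the commit message, while the dictionary layout, the key names, and the helper are assumptions made for illustration.

```python
# (old, new) values per tier, as listed in the commit message.
# Key names mirror common ResourceQuota fields; memory is in Gi.
QUOTA_CHANGES = {
    "2-gpu": {
        "limits.cpu": (32, 48),
        "limits.memory": (64, 96),
        "pods": (30, 40),
    },
    "3-edge": {
        "requests.cpu": (2, 4),
        "limits.cpu": (8, 16),
        "limits.memory": (16, 32),
        "pods": (20, 30),
    },
    "4-aux": {
        "requests.cpu": (1, 2),
        "limits.cpu": (4, 8),
        "limits.memory": (8, 16),
        "pods": (15, 20),
    },
}


def growth_percent(old, new):
    """Relative increase from old to new, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


if __name__ == "__main__":
    for tier, fields in QUOTA_CHANGES.items():
        for field, (old, new) in fields.items():
            print(f"tier {tier:7} {field:14} {old:>3} -> {new:>3} (+{growth_percent(old, new)}%)")
```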
Viktor Barzin
fd6f9166a9 [ci skip] Add GitHub & Drone CI API access documentation
2026-02-21 19:14:41 +00:00
Viktor Barzin
9d7d63b970 [ci skip] Add ground rules: no secrets, CI/CD required, monitoring required
2026-02-19 23:48:44 +00:00
Viktor Barzin
71d6590939 [ci skip] Update knowledge base: add OpenClaw service, rename moltbot references
2026-02-18 22:39:58 +00:00
Viktor Barzin
aa433d0750 [ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes
2026-02-17 22:16:46 +00:00
Viktor Barzin
c3840574a8 [ci skip] Update Authentik API token reference to terraform.tfvars
2026-02-17 22:03:55 +00:00
Viktor Barzin
9853b5edf7 [ci skip] Add Authentik API management knowledge
2026-02-17 21:10:40 +00:00
Viktor Barzin
5a2803736d [ci skip] Import Claude skills into OpenClaw moltbot
- Convert setup-project and extend-vm-storage from standalone .md
  to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
  at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
2026-02-17 21:09:12 +00:00
Viktor Barzin
039f8559c9 [ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
800b5db3b3 [ci skip] Update preference: always use cluster_healthcheck.sh for health checks
2026-02-16 21:19:49 +00:00
Viktor Barzin
d8b3922b62 [ci skip] Remember to use cluster_healthcheck.sh for cluster status checks
2026-02-16 19:45:31 +00:00
Viktor Barzin
e76a80eb72 [ci skip] Document Terraform state splitting plan for future implementation
2026-02-15 21:10:40 +00:00
Viktor Barzin
4d9b8242e8 Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
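A rough sketch of the per-namespace decision that the four layers in 4d9b8242e8 imply. The real logic lives in Kyverno policies that are not shown here, so this Python model is only an approximation. `TIER_LABEL` is a hypothetical label key. The `resource-governance/custom-quota` label is taken from the later commit 27dc486a4d and may not be how the monitoring and crowdsec overrides were originally implemented.

```python
TIER_LABEL = "tier"  # hypothetical label key; the commit does not name it
CUSTOM_QUOTA_LABEL = "resource-governance/custom-quota"  # from commit 27dc486a4d


def governance_plan(namespace_labels):
    """Describe what would be generated for a namespace, or None if it has no tier."""
    tier = namespace_labels.get(TIER_LABEL)
    if tier is None:
        return None
    return {
        # PriorityClasses are named tier-0-core through tier-4-aux.
        "priority_class": f"tier-{tier}",
        "generate_limit_range": True,
        # Namespaces with a custom quota opt out of the auto-generated one.
        "generate_resource_quota": namespace_labels.get(CUSTOM_QUOTA_LABEL) != "true",
    }


if __name__ == "__main__":
    print(governance_plan({"tier": "4-aux"}))
    print(governance_plan({"tier": "2-gpu", CUSTOM_QUOTA_LABEL: "true"}))
```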
Viktor Barzin
719e3c6244 [ci skip] remember: spawn subagent to monitor pods instead of sleeping
2026-02-15 17:48:42 +00:00
Viktor Barzin
95013c9056 [ci skip] Strengthen Terraform-only change policy in project instructions
2026-02-15 15:10:11 +00:00
Viktor Barzin
a67a6f350e [ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
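A minimal sketch of the per-registry `config_path` layout that a67a6f350e switches to, in which containerd reads one `hosts.toml` per upstream registry. The registry names come from the commit message. The mirror hostname, the ports, and the `/etc/containerd/certs.d` directory (the conventional default) are assumptions, not values from the repository.

```python
# Upstream registries named in the commit, mapped to hypothetical mirror endpoints.
MIRRORS = {
    "ghcr.io": "http://registry-cache.example:5001",
    "quay.io": "http://registry-cache.example:5002",
    "registry.k8s.io": "http://registry-cache.example:5003",
    "reg.kyverno.io": "http://registry-cache.example:5004",
}


def hosts_toml_path(registry, config_path="/etc/containerd/certs.d"):
    """Location of the hosts.toml file for one upstream registry."""
    return f"{config_path}/{registry}/hosts.toml"


def render_hosts_toml(registry, mirror_url):
    """Render a hosts.toml that tries the mirror first and falls back to upstream."""
    return (
        f'server = "https://{registry}"\n'
        "\n"
        f'[host."{mirror_url}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )


if __name__ == "__main__":
    for registry, mirror_url in MIRRORS.items():
        print(f"# {hosts_toml_path(registry)}")
        print(render_hosts_toml(registry, mirror_url))
```

One file per registry means that adding or removing a mirror touches only that registry's directory.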
Viktor Barzin
a5b240629c [ci skip] Update knowledge base with Loki + Alloy service notes
2026-02-13 23:46:01 +00:00
Viktor Barzin
08ea489fe0 [ci skip] Add extend-vm-storage script and skills
- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
2026-02-13 22:08:46 +00:00
Viktor Barzin
bcdebfd9c1 [ci skip] update claude knowledge: fix NFS scripts path to secrets/
2026-02-08 02:41:42 +00:00
Viktor Barzin
945d2d90a7 [ci skip] update claude knowledge: always apply cloudflared module for DNS
When deploying a new service, the cloudflared module must also be applied
to create the Cloudflare DNS record. Updated CLAUDE.md and setup-project skill.
2026-02-08 02:30:19 +00:00
Viktor Barzin
ce8f81db0c [ci skip] Deploy Gramps Web genealogy service
Add grampsweb module with web app + Celery worker in a single pod,
using shared Redis (DB 2/3), NFS storage, email via mailserver,
and Ollama AI integration. Available at family.viktorbarzin.me.
2026-02-08 02:30:18 +00:00
Viktor Barzin
a2e1a79286 [ci skip] update claude knowledge: add health service
2026-02-08 01:55:30 +00:00
Viktor Barzin
b22a14c914 [ci skip] Deploy Wyoming Whisper STT service for Home Assistant voice input
Add Wyoming Faster Whisper (rhasspy/wyoming-whisper) as a new K8s service
exposed via Traefik TCP entrypoint on port 10300. Accessible from ha-london
RPi via VPN at 10.0.20.202:10300.
2026-02-08 01:51:43 +00:00
Viktor Barzin
5e3b6c57ad [ci skip] update claude knowledge: fix ha-london IP to 192.168.8.103
2026-02-08 01:51:42 +00:00
Viktor Barzin
c6a05d8e26 [ci skip] Add ha-london knowledge map: RPi Docker setup, smart plugs, air quality, e-bike
ha-london runs on Raspberry Pi at 192.168.8.104 (Docker rootless, HA 2025.9.1).
Key systems: TP-Link Kasa smart plugs with energy monitoring, Apollo AIR-1 air
quality sensor (ESPHome), Cowboy e-bike, UptimeRobot, Oral-B BLE toothbrush.
SSH access via pi@192.168.8.104, config at /home/pi/docker/homeAssistant/.
2026-02-07 22:39:20 +00:00
Viktor Barzin
936607ac4f [ci skip] Update ha-sofia SSH to direct IP 192.168.1.8 and document limitations
2026-02-07 22:21:30 +00:00
Viktor Barzin
01affd9727 [ci skip] Add Proxmox VM inventory to claude knowledge
2026-02-07 21:37:38 +00:00
Viktor Barzin
191c760b94 [ci skip] Add ha-sofia Home Assistant deployment to skills
- Update home-assistant skill to v2.0.0 covering both ha-london and ha-sofia
- Add separate API script for ha-sofia (home-assistant-sofia.py)
- ha-sofia: SSH via vbarzin@ha-sofia.viktorbarzin.lan, config at /config/
- Update CLAUDE.md with both HA deployments
2026-02-07 21:26:05 +00:00
Viktor Barzin
8b8beb78dd [ci skip] update claude knowledge: HTTP/3 enabled for Traefik and Cloudflare
2026-02-07 20:46:14 +00:00
Viktor Barzin
0709eb0266 [ci skip] update claude knowledge: always run terraform locally
2026-02-07 13:41:41 +00:00
Viktor Barzin
c14dc88ffa [ci skip] Clean up .claude: remove remote executor and /remote skill references
All commands and skills now reference tools directly without any remote
execution wrapper. Archived setup-remote-executor.md for reference.
Added rule: all infra changes must go through Terraform.
2026-02-07 13:21:58 +00:00
Viktor Barzin
6fc94dc9c2 [ci skip] update claude knowledge: never use SSH directly, use /remote skill
2026-02-07 13:08:00 +00:00
Viktor Barzin
050cd54ad8 [ci skip] update claude knowledge: always commit .claude file changes
2026-02-07 10:44:33 +00:00
Viktor Barzin
ffa80f0df6 add claude [ci skip]
2026-02-06 20:10:02 +00:00
Viktor Barzin
65a1fb57a8 add claude files [ci skip]
2026-01-18 15:40:43 +00:00
Viktor Barzin
8da263bf43 add claude files to gitignore [ci skip]
2026-01-18 13:40:31 +00:00
Viktor Barzin
a1d945a0b2 add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00