infra

Author	SHA1	Message	Date
Viktor Barzin	01bc16d592	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 3 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent: a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance, discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 23:54:22 +00:00
Viktor Barzin	988bfde45c	k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to ssh into master and run etcdctl against a non-existent /mnt/main mount. The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to 10 min, then parses the backup-manage container log for "Backup done" line + byte count. Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works end-to-end at the planning level. Expanded the claude-agent ServiceAccount's privileges via a sibling ClusterRole (claude-agent-upgrade-ops): - patch namespaces/k8s-upgrade (in-flight annotation) - create batch/jobs (trigger etcd snapshot Job) - patch nodes (cordon/uncordon) - create pods/eviction (drain) - delete pods (drain fallback)	2026-05-10 19:16:12 +00:00
Viktor Barzin	a58d777059	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-10 19:07:42 +00:00
Viktor Barzin	a5963169ec	[service-upgrade] Drop vault-CLI assumptions + check default workflow only ## Context Since the 2026-04-15 migration from SSH-on-DevVM to in-cluster claude-agent-service, the agent spec's four `vault kv get ...` calls have been dead code: the pod has no `VAULT_TOKEN`, no `~/.vault-token`, no Vault login method, and port 8200 is refused. Every token fetch returns empty, which silently breaks: - Slack: `SLACK_WEBHOOK=""` → POSTs 404 → no messages for 3+ days (the exact user-visible symptom that started this thread). - Woodpecker CI polling: `WOODPECKER_TOKEN=""` → 401 on `/api/repos/1/pipelines` → agent can't find its own pipeline → 15-min poll times out → jumps to rollback → same failure in the revert → hits n8n's 30-min ceiling → SIGKILL mid-saga → no commit, no Slack. - Changelog fetch: `GITHUB_TOKEN=""` overrides the env var supplied by `envFrom: claude-agent-secrets`, crippling changelog lookups too. Separately, Step 9 read the overall pipeline `status`, which is `failure` any time a single workflow fails, e.g. the unrelated `build-cli` workflow (docker image push to registry.viktorbarzin.me:5050 has been erroring since private-registry htpasswd was enabled on 2026-03-22). That made the agent spuriously roll back every otherwise-successful upgrade. ## This change - Replace the four `vault kv get ...` invocations with the matching env-var reads (`$GITHUB_TOKEN`, `$WOODPECKER_API_TOKEN`, `$SLACK_WEBHOOK_URL`) and document the env-var contract at the top of the "Environment" section. The env vars are expected to be pre-loaded via `envFrom: claude-agent-secrets`; that part is tracked as the companion ExternalSecret/Terraform change in bd code-3o3 (must land before this spec is effective). - Rewrite Step 9 to poll the `default` workflow's `state` instead of the overall pipeline `status`. Adds a jq example and explicitly documents the build-cli noise so future operators know why overall status is unreliable. 
## What is NOT in this change - The matching ExternalSecret / Terraform changes that feed WOODPECKER_API_TOKEN / SLACK_WEBHOOK_URL / REGISTRY_USER / REGISTRY_PASSWORD into the pod. Until those land, this spec still produces empty env vars at runtime — but at least the shape of the contract is correct and grep-friendly. - The .woodpecker/build-cli.yml `logins:` entry for registry.viktorbarzin.me:5050. That's fix C in the same task. ## Test Plan ### Automated None — this is pure markdown guidance for the model. Syntax-checked by `grep -nE 'vault kv get\|WOODPECKER_TOKEN\|SLACK_WEBHOOK[^_]' .claude/agents/service-upgrade.md` showing only the explanatory warning on line 37 as a match. ### Manual Verification After the companion ExternalSecret change lands and the pod has WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL in env: 1. Trigger a DIUN-style webhook on a known slow service. 2. Watch `kubectl -n claude-agent logs -f deploy/claude-agent-service`. 3. Expect curl to `ci.viktorbarzin.me/api/...` return 200 and pipeline JSON (no 401), and Slack `$SLACK_WEBHOOK_URL` return 200. 4. Expect a Slack `[Upgrade Agent] Starting:` post inside the first minute, and a `SUCCESS` or `FAILED + ROLLED BACK` post on exit. Refs: bd code-3o3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 13:15:06 +00:00
Viktor Barzin	973f549810	[payslip-ingest] Update extractor agent + dashboard for v2 regex parser ## Context Companion change to payslip-ingest v2 (regex parser + accurate RSU tax attribution). The Grafana dashboard now has 4 more panels powered by the new earnings-decomposition and YTD-snapshot columns, and the Claude fallback agent's prompt is aligned with the new schema so non-Meta payslips still land with the full field set. ## This change ### `.claude/agents/payslip-extractor.md` Rewrites the RSU handling section to match Meta UK's actual template (rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead). Adds a new "Earnings decomposition (v2)" section telling the fallback agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_* and when to use pension_employee vs pension_sacrifice without double-counting. ### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json` - Panel 4 (Effective rate) — SQL switched from the naive `(income_tax + NIC) / cash_gross` to the YTD-effective-rate method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid / ytd_taxable_pay)`. Title updated to "YTD-corrected" so the change is discoverable. - Panel 5 (Table) — adds salary, bonus, pension_sacrifice, taxable_pay columns so row-level debugging against the parser output is trivial. - +Panel 8 (Earnings breakdown) — monthly stacked bars of salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice months show up as a massive negative pension_sacrifice spike paired with a near-zero bonus bar. - +Panel 9 (Accurate cash tax rate) — timeseries of cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU contribution the payslip hides in the single `Tax paid` line. - +Panel 10 (All-in compensation) — stacked bars of cash_gross + rsu_vest per payslip. - +Panel 11 (YTD cumulative cash gross vs total comp) — two lines partitioned by tax_year; the gap between them is the RSU contribution YTD. 
Total panels go from 7 → 11. ## Test Plan ### Automated Dashboard JSON validity: ``` $ python3 -m json.tool uk-payslip.json > /dev/null && echo ok ok ``` ### Manual Verification After applying `stacks/monitoring/`: 1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels 2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the negative pension_sacrifice bar in panel 8 3. Panel 9 "Accurate cash effective tax rate" shows the cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in RSU-vest months ## Reproduce locally 1. `cd infra/stacks/monitoring && terragrunt plan` 2. Expected: ConfigMap diff on the payslip dashboard with the new panel JSON 3. `terragrunt apply` — Grafana reloads the dashboard automatically (configmap-reload sidecar) Relates to: payslip-ingest commit 9741816 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:54:33 +00:00
Viktor Barzin	238a3f14c9	[payslip-extractor] Add RSU handling section Document what RSU vest / RSU offset look like on Meta UK payslips and tell the agent to populate rsu_vest + rsu_offset fields (new in the payslip-ingest schema) rather than rolling them into gross_pay.	2026-04-18 23:37:33 +00:00
Viktor Barzin	eee694c915	[payslip-extractor] Add PAYSLIP_TEXT fast path payslip-ingest now runs pdftotext locally before calling claude-agent-service, shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT (fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext fails).	2026-04-18 22:48:07 +00:00
Viktor Barzin	43b4e1d372	[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role ## Context New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`) needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana datasource, a dashboard, and a Claude agent definition for PDF extraction. Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace. No ingress, no TLS cert, no DNS record. ## What ### New stack `stacks/payslip-ingest/` - `kubernetes_namespace` payslip-ingest, tier=aux. - ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN, WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`. - ExternalSecret (vault-database) reads rotating password from `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`. - Deployment: single replica, Recreate strategy (matches single-worker queue design), `wait-for postgresql.dbaas:5432` annotation, init container runs `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno dns_config lifecycle ignore. - ClusterIP Service :8080. - Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`, uid `payslips-pg`) reading password from the db-creds K8s Secret. ### Grafana dashboard `uk-payslip.json` (4 panels) - Monthly gross/net/tax/NI (timeseries, currencyGBP). - YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140. - Deductions breakdown (stacked bars). - Effective rate + take-home % (timeseries, percent). ### Vault DB role `pg-payslip-ingest` - Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`. - New `vault_database_secret_backend_static_role.pg_payslip_ingest` (username `payslip_ingest`, 7d rotation). 
### DBaaS — DB + role creation - New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`: idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into `pg-cluster-1`. ### Claude agent `.claude/agents/payslip-extractor.md` - Haiku-backed agent invoked by `claude-agent-service`. - Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single JSON object matching the schema to stdout. No network, no file writes outside /tmp, no markdown fences. ## Trade-offs / decisions - Own DB per service (convention), NOT a schema in a shared `app` DB as the plan initially described. The Alembic migration still creates a `payslip_ingest` schema inside the `payslip_ingest` DB for table organisation. - Paperless URL uses port 80 (the Service port), not 8000 (the pod target port). - Grafana datasource uses the primary RW user — separate `_ro` role is aspirational and not yet a pattern in this repo. - No ingress — webhook is cluster-internal; external exposure is unnecessary attack surface. - No Uptime Kuma monitor yet: the internal-monitor list is a static block in `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor auto-creator). ## Test Plan ### Automated ``` terraform init -backend=false && terraform validate Success! The configuration is valid. terraform fmt -check -recursive (exit 0) python3 -c "import json; json.load(open('uk-payslip.json'))" (exit 0) ``` ### Manual Verification (post-merge) Prerequisites: 1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`. 2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`. Apply: 3. `scripts/tg apply vault` → creates pg-payslip-ingest static role. 4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role. 5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret` (first-apply ESO bootstrap). 6. `scripts/tg apply payslip-ingest` (full). 7. 
`kubectl -n payslip-ingest get pods` → Running 1/1. 8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200. End-to-end: 9. Configure Paperless workflow (README in code repo has steps). 10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s. 11. Grafana → Dashboards → UK Payslip → 4 panels render. Closes: code-do7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:07:05 +00:00
Viktor Barzin	7bb9ec2934	Add agent task tracking documentation Documents the centralized Beads/Dolt task tracking system used by all Claude Code sessions. Covers architecture, session lifecycle, settings hierarchy, known issues, and E2E test verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:11:26 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	c111799831	remove duplicated agents, update CLAUDE.md references [ci skip] All agents now live globally in ~/.claude/agents/ (shared via dotfiles). Deleted 11 duplicates, moved sev-*/deploy-app to global scope.	2026-03-22 23:44:27 +02:00
Viktor Barzin	d6afbe84c8	post-mortem v2: pipeline team architecture with 4-stage agents [ci skip] Split monolithic orchestrator into triage (haiku), historian (sonnet), and report-writer (opus) stages. Each stage gets its own tool budget. Added sev-context.sh for structured cluster context gathering.	2026-03-16 21:59:34 +00:00
Viktor Barzin	0abb6b83ad	add deploy-app skill and agent for automated repo→app deployment [ci skip]	2026-03-16 18:06:24 +00:00
Viktor Barzin	cfc30b62e8	enhance devops-engineer agent: deploy + monitor pod health [ci skip] - Upgrade model from sonnet to opus for subagent orchestration - Add Write, Edit, Agent tools for spawning monitor subagents - Add mandatory deployment workflow: pre-deploy snapshot, apply, spawn background haiku pod monitor, react to results - Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck Pending, and probe failures within 3 min timeout - Allow terragrunt apply and kubectl set image as safe operations	2026-03-15 18:44:20 +00:00
Viktor Barzin	8bac6db48f	add name/description/tools to review-loop agent frontmatter [ci skip]	2026-03-15 11:14:31 +00:00
Viktor Barzin	616370d34c	rename planner agent to review-loop [ci skip]	2026-03-15 11:12:14 +00:00
Viktor Barzin	123e996b04	add planner agent: plan-review-fix convergence loop [ci skip]	2026-03-15 10:46:53 +00:00
Viktor Barzin	ff83ec3325	add infrastructure agent team: 8 specialized agents + 14 diagnostic scripts Agents: devops-engineer, dba, security-engineer, sre, network-engineer, platform-engineer, observability-engineer, home-automation-engineer. Scripts: deploy-status, db-health, backup-verify, tls-check, crowdsec-status, authentik-audit, oom-investigator, resource-report, dns-check, network-health, nfs-health, truenas-status, platform-status, monitoring-health. Also: known-issues.md suppression list, cluster-health-checker port-forward fix.	2026-03-15 02:01:07 +00:00
Viktor Barzin	c170351e77	[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno tables, anti-AI, node rebuild) to .claude/reference/patterns.md. Kept: critical rules, quick patterns, key commands, tier overview, prefs. Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16 entries (removed all infra-specific duplicates, kept cross-project prefs). Agents: removed generic devops-engineer (885L) and fullstack-developer (234L). Kept custom cluster-health-checker (48L).	2026-03-06 23:27:46 +00:00
Viktor Barzin	bcbe8b23b4	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	cbf041bcc9	[ci skip] Add Woodpecker CI stack (WIP) and claude agents - Add stacks/woodpecker/ with Helm-based deployment config - Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls) - Add NFS export entry for woodpecker - Add .claude/agents/ definitions	2026-02-22 21:30:25 +00:00
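
The deterministic Job naming described in commit 01bc16d592 (`k8s-upgrade-<phase>-<target_version>[-<node>]`) can be sketched as below. This is a minimal illustration, not the repository's code: the function name `job_name` and the example version string `1.34.7` are assumptions, and whether the real `upgrade-step.sh` sanitizes the version string in any way is not stated in the log.

```python
from typing import Optional


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build k8s-upgrade-<phase>-<target_version>[-<node>].

    Hypothetical helper: the name is a pure function of its inputs,
    so re-creating a failed Job yields the same object name again.
    """
    parts = ["k8s-upgrade", phase, target_version]
    if node is not None:
        # The node suffix is optional, matching the [-<node>] part of the pattern.
        parts.append(node)
    return "-".join(parts)


if __name__ == "__main__":
    # Example inputs are illustrative only.
    print(job_name("preflight", "1.34.7"))
    print(job_name("worker", "1.34.7", "k8s-node4"))
```

Because the name depends only on phase, target version, and node, applying the same manifest twice addresses the same Job rather than creating a second one, which is the idempotency property the commit message relies on.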

23 commits
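
The YTD-effective-rate method quoted in commit 973f549810, `cash_tax = income_tax - rsu_vest × (ytd_tax_paid / ytd_taxable_pay)`, can be worked through with a small sketch. The function name, the zero-denominator guard, and all numbers below are assumptions made for illustration; the dashboard itself computes this in SQL, which is not shown in the log.

```python
def cash_tax(income_tax: float, rsu_vest: float,
             ytd_tax_paid: float, ytd_taxable_pay: float) -> float:
    """Estimate the income tax attributable to cash pay only.

    Subtracts the RSU share of tax, approximated as the RSU vest
    amount multiplied by the year-to-date effective tax rate.
    """
    if ytd_taxable_pay == 0:
        # Assumed guard: with no YTD taxable pay there is no rate to apply.
        return income_tax
    ytd_rate = ytd_tax_paid / ytd_taxable_pay
    return income_tax - rsu_vest * ytd_rate


if __name__ == "__main__":
    # Made-up figures: 30% YTD effective rate, 10,000 RSU vest this month.
    print(cash_tax(income_tax=5000.0, rsu_vest=10000.0,
                   ytd_tax_paid=30000.0, ytd_taxable_pay=100000.0))
```

With these made-up inputs the YTD rate is 0.3, so roughly 3,000 of the 5,000 tax line is attributed to the RSU vest and about 2,000 remains as cash tax. The gap between the naive rate and this figure is what Panel 9 is described as plotting.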