The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.
Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:
  preflight (k8s-node1)
  → master (k8s-node1) drains k8s-master
  → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
  → worker (k8s-master + control-plane toleration) drains k8s-node1
  → postflight (no pinning)
Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
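
A minimal sketch of the naming scheme above, written in Python purely for illustration (the pipeline itself is shell, and the helper name `job_name` and the sample version string are hypothetical). Because the name depends only on phase, target version and node, re-creating a failed step resolves to the same Job object:

```python
from typing import Optional


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build k8s-upgrade-<phase>-<target_version>[-<node>]."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:  # the node suffix only applies to per-node phases
        parts.append(node)
    return "-".join(parts)


# Hypothetical inputs; the exact version format used in real names is an assumption.
print(job_name("preflight", "1.34.7"))
print(job_name("worker", "1.34.7", "k8s-node4"))
```
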
Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
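
The selection rule behind `predrain_unstick` can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: pods and PDBs are flattened into plain dicts with hypothetical keys (`node`, `matchLabels`, `disruptionsAllowed`), and only `matchLabels` selectors are handled.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every matchLabels entry is present on the pod."""
    return all(labels.get(k) == v for k, v in selector.items())


def pods_to_unstick(pods: List[dict], pdbs: List[dict], node: str) -> List[str]:
    """Pods on `node` that are covered by a PDB with 0 disruptionsAllowed."""
    stuck = []
    for pod in pods:
        if pod["node"] != node:
            continue
        for pdb in pdbs:
            if (pdb["namespace"] == pod["namespace"]
                    and pdb["disruptionsAllowed"] == 0
                    and matches(pdb["matchLabels"], pod["labels"])):
                stuck.append(f'{pod["namespace"]}/{pod["name"]}')
                break  # one blocking PDB is enough
    return stuck


# Made-up sample data for illustration.
pods = [
    {"name": "anubis-0", "namespace": "blog", "node": "k8s-node3", "labels": {"app": "anubis"}},
    {"name": "web-0", "namespace": "blog", "node": "k8s-node3", "labels": {"app": "web"}},
]
pdbs = [
    {"namespace": "blog", "matchLabels": {"app": "anubis"}, "disruptionsAllowed": 0},
    {"namespace": "blog", "matchLabels": {"app": "web"}, "disruptionsAllowed": 1},
]
print(pods_to_unstick(pods, pdbs, "k8s-node3"))
```
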
Adds K8sUpgradeStalled alert (fires when in_flight is set and started_timestamp is more than 90 min old).
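
For clarity, the stall condition can be written as a small predicate. This is a Python sketch of the logic only; the real alert is a monitoring rule, and the function and parameter names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALL_AFTER = timedelta(minutes=90)


def is_stalled(in_flight: bool, started: datetime, now: datetime) -> bool:
    """An upgrade counts as stalled only while in flight and older than 90 min."""
    return in_flight and (now - started) > STALL_AFTER


# Made-up timestamps for illustration.
start = datetime(2026, 5, 11, 10, 0, tzinfo=timezone.utc)
print(is_stalled(True, start, start + timedelta(minutes=91)))
print(is_stalled(True, start, start + timedelta(minutes=30)))
```
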
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).
Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for the "Backup done"
line and its byte count.
Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.
Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
- patch namespaces/k8s-upgrade (in-flight annotation)
- create batch/jobs (trigger etcd snapshot Job)
- patch nodes (cordon/uncordon)
- create pods/eviction (drain)
- delete pods (drain fallback)
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
  -> etcd snapshot save
  -> optional master containerd skew fix
  -> apt repo URL rewrite (minor bumps only)
  -> drain/upgrade/uncordon master via ssh < update_k8s.sh
  -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
  -> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent pipes
it into each node over SSH. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
## Context
Since the 2026-04-15 migration from SSH-on-DevVM to in-cluster
claude-agent-service, the agent spec's four `vault kv get ...` calls
have been dead code: the pod has no `VAULT_TOKEN`, no `~/.vault-token`,
no Vault login method, and port 8200 is refused. Every token fetch
returns empty, which silently breaks:
- **Slack**: `SLACK_WEBHOOK=""` → POSTs 404 → no messages for 3+ days
(the exact user-visible symptom that started this thread).
- **Woodpecker CI polling**: `WOODPECKER_TOKEN=""` → 401 on
`/api/repos/1/pipelines` → agent can't find its own pipeline → 15-min
poll times out → jumps to rollback → same failure in the revert → hits
n8n's 30-min ceiling → SIGKILL mid-saga → no commit, no Slack.
- **Changelog fetch**: `GITHUB_TOKEN=""` overrides the env var supplied
by `envFrom: claude-agent-secrets`, crippling changelog lookups too.
Separately, Step 9 read the overall pipeline `status`, which is
`failure` any time a single workflow fails, e.g. the unrelated
`build-cli` workflow (its docker image push to
registry.viktorbarzin.me:5050 has been erroring since private-registry
htpasswd was enabled on 2026-03-22). That made the agent spuriously
roll back every otherwise-successful upgrade.
## This change
- Replace the four `vault kv get ...` invocations with the matching
env-var reads (`$GITHUB_TOKEN`, `$WOODPECKER_API_TOKEN`,
`$SLACK_WEBHOOK_URL`) and document the env-var contract at the top
of the "Environment" section. The env vars are expected to be
pre-loaded via `envFrom: claude-agent-secrets` — that part is tracked
as the companion ExternalSecret/Terraform change in bd code-3o3
(must land before this spec is effective).
- Rewrite Step 9 to poll the `default` workflow's `state` instead of
the overall pipeline `status`. Adds a jq example and explicitly
documents the build-cli noise so future operators know why overall
status is unreliable.
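
To illustrate the Step 9 change, here is a Python sketch of reading one workflow's `state` rather than the pipeline-level `status`. The JSON layout (a `workflows` list whose items carry `name` and `state`) is an assumption inferred from this description, not verified against the Woodpecker API, and the helper name is hypothetical.

```python
from typing import Optional


def workflow_state(pipeline: dict, name: str = "default") -> Optional[str]:
    """Return the state of the named workflow, or None if it is absent."""
    for wf in pipeline.get("workflows") or []:
        if wf.get("name") == name:
            return wf.get("state")
    return None


# Made-up payload: overall status fails only because of the unrelated build-cli workflow.
pipeline = {
    "status": "failure",
    "workflows": [
        {"name": "default", "state": "success"},
        {"name": "build-cli", "state": "failure"},
    ],
}
print(pipeline["status"], workflow_state(pipeline))
```

With this payload the overall status would trigger a rollback, while the `default` workflow's state would not.
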
## What is NOT in this change
- The matching ExternalSecret / Terraform changes that feed
WOODPECKER_API_TOKEN / SLACK_WEBHOOK_URL / REGISTRY_USER /
REGISTRY_PASSWORD into the pod. Until those land, this spec still
produces empty env vars at runtime — but at least the *shape* of the
contract is correct and grep-friendly.
- The .woodpecker/build-cli.yml `logins:` entry for
registry.viktorbarzin.me:5050. That's fix C in the same task.
## Test Plan
### Automated
None — this is pure markdown guidance for the model. Syntax-checked by
`grep -nE 'vault kv get|WOODPECKER_TOKEN|SLACK_WEBHOOK[^_]'
.claude/agents/service-upgrade.md` showing only the explanatory
warning on line 37 as a match.
### Manual Verification
After the companion ExternalSecret change lands and the pod has
WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL in env:
1. Trigger a DIUN-style webhook on a known slow service.
2. Watch `kubectl -n claude-agent logs -f deploy/claude-agent-service`.
3. Expect the curl to `ci.viktorbarzin.me/api/...` to return 200 with
pipeline JSON (no 401), and the POST to `$SLACK_WEBHOOK_URL` to return 200.
4. Expect a Slack `[Upgrade Agent] Starting:` post inside the first
minute, and a `SUCCESS` or `FAILED + ROLLED BACK` post on exit.
Refs: bd code-3o3
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.
## This change
### `.claude/agents/payslip-extractor.md`
Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.
### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`
- **Panel 4 (Effective rate)** — SQL switched from the naive
`(income_tax + NIC) / cash_gross` to the YTD-effective-rate
method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
taxable_pay columns so row-level debugging against the parser
output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
months show up as a massive negative pension_sacrifice spike
paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
+ rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
lines partitioned by tax_year; the gap between them is the RSU
contribution YTD.
Total panels go from 7 → 11.
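
A worked example of the Panel 4 formula, as a Python sketch rather than the dashboard's SQL. The figures are made up, and returning `income_tax` unchanged when `ytd_taxable_pay` is zero is an assumption of this sketch, not something the dashboard is known to do.

```python
def cash_tax(income_tax: float, rsu_vest: float,
             ytd_tax_paid: float, ytd_taxable_pay: float) -> float:
    """cash_tax = income_tax - rsu_vest * (ytd_tax_paid / ytd_taxable_pay)."""
    if ytd_taxable_pay <= 0:
        # Assumption: with no YTD base, attribute no tax to the RSU vest.
        return income_tax
    ytd_rate = ytd_tax_paid / ytd_taxable_pay
    return income_tax - rsu_vest * ytd_rate


# Made-up month: 5000 tax on the payslip, 10000 RSU vest, 25% YTD effective rate.
print(cash_tax(income_tax=5000, rsu_vest=10000,
               ytd_tax_paid=25000, ytd_taxable_pay=100000))
```

Here 2500 of the 5000 tax line is attributed to the RSU vest, leaving 2500 as tax on cash pay.
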
## Test Plan
### Automated
Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```
### Manual Verification
After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
RSU-vest months
## Reproduce locally
1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
(configmap-reload sidecar)
Relates to: payslip-ingest commit 9741816
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document what RSU vest / RSU offset look like on Meta UK payslips and
tell the agent to populate rsu_vest + rsu_offset fields (new in the
payslip-ingest schema) rather than rolling them into gross_pay.
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
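
A rough sketch of how a caller might choose between the two paths. The function name and the dict-shaped return value are hypothetical; only the `PAYSLIP_TEXT` and `PDF_BASE64` names come from the agent file.

```python
import base64
from typing import Dict, Optional


def build_prompt_vars(pdf_bytes: bytes, extracted_text: Optional[str]) -> Dict[str, str]:
    """Prefer the text path; fall back to base64 when extraction gave nothing."""
    if extracted_text and extracted_text.strip():
        return {"PAYSLIP_TEXT": extracted_text}
    # Scanned-image PDFs or a failed pdftotext run end up here.
    return {"PDF_BASE64": base64.b64encode(pdf_bytes).decode("ascii")}


print(sorted(build_prompt_vars(b"%PDF-1.7", "Gross pay 100.00")))
print(sorted(build_prompt_vars(b"%PDF-1.7", "")))
```
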
## Context
New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.
Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.
## What
### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
`static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
`payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
design), `wait-for postgresql.dbaas:5432` annotation, init container runs
`alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
uid `payslips-pg`) reading password from the db-creds K8s Secret.
### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).
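
For reference, the three threshold lines split YTD taxable pay into four bands. The Python sketch below classifies a YTD figure against them; the band labels are this sketch's own naming and it ignores the personal-allowance taper, so treat it as illustrative only.

```python
# Upper bounds in GBP, taken from the dashboard's threshold lines.
THRESHOLDS = [
    (12_570, "personal allowance"),
    (50_270, "basic rate"),
    (125_140, "higher rate"),
]


def band(ytd_taxable: float) -> str:
    """Name the band that a YTD taxable amount currently sits in."""
    for upper, name in THRESHOLDS:
        if ytd_taxable <= upper:
            return name
    return "additional rate"


for amount in (10_000, 40_000, 90_000, 130_000):
    print(amount, band(amount))
```
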
### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
(username `payslip_ingest`, 7d rotation).
### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
`pg-cluster-1`.
### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
JSON object matching the schema to stdout. No network, no file writes outside /tmp,
no markdown fences.
## Trade-offs / decisions
- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
initially described. The Alembic migration still creates a `payslip_ingest`
schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
`stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
auto-creator).
## Test Plan
### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.
terraform fmt -check -recursive
(exit 0)
python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```
### Manual Verification (post-merge)
Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.
Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
(first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.
End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.
Closes: code-do7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split monolithic orchestrator into triage (haiku), historian (sonnet),
and report-writer (opus) stages. Each stage gets its own tool budget.
Added sev-context.sh for structured cluster context gathering.
- Upgrade model from sonnet to opus for subagent orchestration
- Add Write, Edit, Agent tools for spawning monitor subagents
- Add mandatory deployment workflow: pre-deploy snapshot, apply,
spawn background haiku pod monitor, react to results
- Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck
Pending, and probe failures within a 3-min timeout
- Allow terragrunt apply and kubectl set image as safe operations
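
A simplified sketch of the kind of check the pod monitor performs, in Python over a pod object shaped like `kubectl get pod -o json` output. It is not the monitor's actual implementation: the function name is hypothetical, probe failures are not covered, and "stuck Pending" is reduced to a plain phase check without the time window.

```python
from typing import Optional

WAITING_FAILURES = {"CrashLoopBackOff", "ImagePullBackOff"}


def classify(pod: dict) -> Optional[str]:
    """Return a failure label for the pod, or None if nothing looks wrong."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_FAILURES:
            return waiting["reason"]
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            return "OOMKilled"
    if status.get("phase") == "Pending":
        return "Pending"
    return None


# Made-up pod status for illustration.
crashing = {"status": {"phase": "Running", "containerStatuses": [
    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}, "lastState": {}}]}}
print(classify(crashing))
```
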