## Context
The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.
This commit removes two cleanup artifacts left behind by that migration.
## This change
1. Deletes `.claude/skills/archived/setup-remote-executor.md`, the archived
   skill doc for the obsolete SSH-based pattern. It was already in `archived/`
   and harmless, but it was noise; deleting it prevents anyone from
   copy-pasting the old approach.
2. Removes `kubernetes_secret.ssh_key` from
`stacks/claude-agent-service/main.tf`. The Secret was created from the
`devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf` (see the
   sketch after this list), so no downstream `git` invocation could fall
   through to SSH even if it tried.
3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
the SSH key resource was its only consumer.
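For reference, a minimal sketch of the `insteadOf` rewrite described in item
2; the exact token-embedding scheme is an assumption, not a quote of the
real init container:
```bash
# Hypothetical reconstruction of the git-init URL rewrite.
base="https://x-access-token:${GITHUB_TOKEN}@github.com/"
git config --global url."$base".insteadOf "git@github.com:"
git config --global --add url."$base".insteadOf "https://github.com/"
```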
## What is NOT in this change
- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires a read/modify/put of the full secret, and the only
  upside is deleting one unused Vault key. Not worth it without strong
  justification.
- DevVM host decommission is out of scope (separate audit needed for
non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
left untouched per no-adjacent-refactor rule.
## Test plan
### Automated
- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
pre-existing lines 464-505 are flagged; no new fmt warnings introduced
by these deletions.
### Manual verification
1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; it does not
   appear in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via the HTTP API to confirm the pipeline
   still works: `POST` to
   `http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute`
   with a minimal prompt (see the sketch below); expect the job to complete
   with `exit_code=0`.
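A minimal sketch of step 5; the request-body schema is an assumption, so
check the service's own docs for the real payload fields:
```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Reply with the single word: ok"}' \
  http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
# expect a response reporting exit_code=0
```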
Closes: code-bck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.
## This change
Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.
- `local.internal_monitors` — list of desired monitors (name, type,
connection string, Vault password key, interval, retries). Seeded with
the MySQL Standalone monitor. Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls the admin password and
  all referenced DB passwords from Vault `secret/viktor` at apply time.
  Secret key names are derived from the monitor name
  (`DB_PASSWORD_<upper_snake>`; see the sketch after this list).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
looks up monitors by name, creates if missing, patches if drifted,
leaves id and history untouched when already in desired state.
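A sketch of the assumed `DB_PASSWORD_<upper_snake>` derivation; the
authoritative rules live in the Terraform locals:
```bash
# Assumed derivation: uppercase, collapse non-alphanumerics to "_", trim ends.
name="MySQL Standalone (dbaas)"
key="DB_PASSWORD_$(echo "$name" | tr '[:lower:]' '[:upper:]' \
  | sed -E 's/[^A-Z0-9]+/_/g; s/^_+|_+$//g')"
echo "$key"   # DB_PASSWORD_MYSQL_STANDALONE_DBAAS
```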
## Why this approach (Option B, not a Terraform provider)
The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.
## Monitor 663: no-op on first sync
Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:
    Monitor MySQL Standalone (dbaas) (id=663) already in desired state
    Internal monitor sync complete
DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.
## Vault key — left manual (by design)
`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). Terraform reads the current value from Vault at apply time, so if
the value is ever rotated in Vault, the next apply refreshes the Secret and
the following sync run picks it up.
## Plan + apply
    Plan: 3 to add, 0 to change, 0 to destroy.
    Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
    Re-plan: No changes. Your infrastructure matches the configuration.
Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.
Closes: code-ed2
## Context
During a false-alarm investigation of terminal.viktorbarzin.me, an Explore
agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in
config.tfvars (a legacy fallback list) instead of the ingress_factory
auto-annotation. Both [External] monitors for terminal/terminal-ro exist and
are active — the original agent just looked in the wrong place.
## This change
Expands the Monitoring & Alerting bullet to spell out the mechanism:
`ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true`
when `dns_type != "none"`, and `cloudflare_proxied_names` is a legacy
fallback covering only the 17 hostnames not yet migrated. Future agents
debugging "is this monitored?" questions should check the ingress
annotation, not `cloudflare_proxied_names` (example below).
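One way to check the authoritative signal (namespace illustrative; requires
`jq`):
```bash
kubectl -n terminal get ingress -o json | jq -r '
  .items[]
  | [.metadata.name,
     (.metadata.annotations["uptime.viktorbarzin.me/external-monitor"] // "unset")]
  | @tsv'
```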
## What is NOT in this change
No Terraform, no K8s, no service config. Docs only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.
## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.
Flow:
1. Classify `dockerfile_state` during the research phase (image-used /
   used-as-is / fixed-broken-upstream / written-from-scratch). Persist to
   `modules/kubernetes/<service>/.contribution-state.json`.
2. After Terraform apply, run `scripts/stability-gate.sh`: it polls pod
   Ready + HTTP 200 every 30s for 20 iterations and requires 18/20
   successes (sketch below).
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
.dockerignore / BUILD.md via Contents API → open PR with body rendered from
templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
existing branch, open PR, upstream landed a Dockerfile mid-deploy).
All GitHub API calls go through curl (the gh CLI is sandbox-blocked per
`.claude/CLAUDE.md`); the token is pulled from Vault (`secret/viktor` →
`github_pat`). Commits include Signed-off-by for DCO-enforcing repos. The
fork branch name is `add-dockerfile` for written-from-scratch or
`fix-dockerfile` for fixed-broken-upstream, with a timestamp suffix on
collision.
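A minimal sketch of the gate's logic, assuming it takes namespace, pod
selector, and probe URL as arguments; the real script may differ:
```bash
#!/usr/bin/env bash
# 20 probes, 30s apart; pass when >=18 see pod Ready and HTTP 200.
set -euo pipefail
ns="$1" selector="$2" url="$3"
ok=0
for _ in $(seq 1 20); do
  ready=$(kubectl -n "$ns" get pod -l "$selector" \
    -o jsonpath='{.items[0].status.containerStatuses[0].ready}' 2>/dev/null || true)
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)
  [[ "$ready" == "true" && "$code" == "200" ]] && ok=$((ok + 1))
  sleep 30
done
echo "stability gate: $ok/20 healthy probes"
(( ok >= 18 ))
```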
## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator (PR step
  sketched after this list)
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR
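The final PR-opening call uses the public GitHub REST API; owner, repo,
base branch, and body are illustrative placeholders:
```bash
# $GITHUB_PAT comes from Vault (secret/viktor -> github_pat).
curl -s -X POST \
  -H "Authorization: Bearer $GITHUB_PAT" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/UPSTREAM_OWNER/UPSTREAM_REPO/pulls \
  -d '{
    "title": "Add Dockerfile",
    "head":  "ViktorBarzin:add-dockerfile",
    "base":  "main",
    "body":  "(rendered from templates/PR_BODY.md)"
  }'
```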
## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
dance. Plan to dry-run against a throwaway sink repo before pointing at a
real upstream.
## Test Plan
### Automated
- `bash -n` on both scripts → pass
- Manual read-through of SKILL.md — step numbering is coherent, the
  semantics of existing §1-9 are untouched, and the new §8b/§10 reference
  real files
### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
- Confirm .contribution-state.json is written with `written-from-scratch`
- Run stability-gate.sh — expect 18/20 passes on a healthy deploy
- Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
- Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
expect SKIP, no PR created
[ci skip]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.
## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping
Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.
Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE
host's dm-device-level metrics mapped through Prometheus and kubectl (query
sketch below).
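A sketch of the attribution query against the Prometheus HTTP API; the URL
and exact join labels are assumptions based on node_exporter defaults:
```bash
PROM=http://prometheus.monitoring.svc:9090
QUERY='sum by (name) (
  rate(node_disk_written_bytes_total[1h])
  * on (instance, device) group_left (name)
    node_disk_device_mapper_info
)'
curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq '.data.result[] | {dm: .metric.name, bytes_per_sec: .value[1]}'
```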
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025, leaving no free
container images, so the official mysql:8.4 image is used instead.
## This change:
- Replace helm_release.mysql_cluster with a raw kubernetes_stateful_set_v1
  using the official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety); spot-check
  sketched after this list
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (PostgreSQL truncates
  `int / int`, so one operand gets a `::float` cast)
- Updated CLAUDE.md service-specific notes
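A quick spot-check of the standalone tuning; namespace, pod name, and the
root-password env var are illustrative:
```bash
kubectl -n mysql exec mysql-0 -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "SHOW VARIABLES WHERE Variable_name IN
     ('log_bin','innodb_flush_log_at_trx_commit','innodb_doublewrite');"
# expect: log_bin=OFF, innodb_flush_log_at_trx_commit=2, innodb_doublewrite=ON
```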
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE                           AFTER
config.tfvars (manual list)      stacks/<svc>/main.tf
        |                          module "ingress" {
        v                            dns_type = "proxied"
stacks/cloudflared/                }
  for_each = list                        |
  cloudflare_record                      v
  tunnel per-hostname            auto-creates
                                 cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with `externalTrafficPolicy: Local` so
  CrowdSec sees real client IPs
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed the dual-SPF-record violation, hardened the policy to -all
- Added MTA-STS and TLSRPT records, imported Rspamd DKIM into Terraform
  (DNS checks sketched at the end of this message)
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
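External checks for the DNS-side changes above; record locations are the
standard ones, values elided:
```bash
dig +short MX viktorbarzin.me                # expect mail.viktorbarzin.me
dig +short TXT viktorbarzin.me | grep spf1   # expect one SPF record, ending in -all
dig +short TXT _mta-sts.viktorbarzin.me      # v=STSv1; id=...
dig +short TXT _smtp._tls.viktorbarzin.me    # v=TLSRPTv1; rua=...
```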
HomeAssistantVersionControl v1.2.0 installed on ha-sofia for git-based
config tracking. Auto-commits on file change, pushes hourly to private
GitHub repo ViktorBarzin/ha-sofia-config.
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
that was blocking all unauthenticated users from login flows
[ci skip]
Query logs stopped syncing on 2026-03-16 due to a password mismatch after
the MySQL cluster rebuild and a Technitium app config reset.
- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook (verification sketch below)
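A verification sketch for the rotation chain; it assumes the database
secrets engine is mounted at `database/`:
```bash
vault read database/static-creds/mysql-technitium             # current password + rotation TTL
kubectl -n technitium get externalsecret technitium-db-creds  # expect Ready / SecretSynced
kubectl -n technitium get cronjob | grep password-sync        # 6h sync job present
```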
All infrastructure changes must go through Terraform/Terragrunt.
kubectl is read-only except for temporary migration steps.
If a resource isn't in Terraform, evaluate adding it before
making manual changes.
Default to proxmox-lvm for all new services. NFS only for RWX,
backup destinations, or shared media libraries. Updated iSCSI
backup section to reflect proxmox-lvm migration.
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying the vault stack = full SOPS access
  (usage sketch below)
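Illustrative usage once onboarded; the Transit mount name, key path, and
login method are assumptions:
```bash
export VAULT_ADDR=https://vault.example.com
vault login -method=oidc    # assuming Authentik is wired up via OIDC
sops --encrypt \
  --hc-vault-transit "$VAULT_ADDR/v1/sops/keys/<stack>" \
  secrets.yaml > secrets.enc.yaml
```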
[ci skip]