infra

viktor/infra

Fork 0

Commit graph

Author	SHA1	Message	Date
Viktor Barzin	c670cb7118	eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1 Some checks failed ci/woodpecker/push/default Pipeline failed Details The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1 and v1, so this is the safe window — MUST land before 0.17 removes v1beta1 (there is no conversion webhook). Pure apiVersion bump, schema is byte-identical: 106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database) across 73 .tf files, v1beta1 -> v1, no other field changes. Validated live first on tandoor (single, non-coupled, synced ES): the kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods keep their mounted copy through the sub-second blip. All 110 target Secrets were snapshotted to /tmp first as a backstop. CI applies the changed stacks serially (staged rollout); watching aggregate ES sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest). Next: Phase 3 climb 0.16.2 -> 2.6.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 19:13:04 +00:00
Viktor Barzin	2f3c58dff1	claude-agent-service image -> ghcr across all five consumer stacks (infra#19) All checks were successful ci/woodpecker/push/build-cli Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public package, anonymous pulls). Repointed: claude-agent-service (deployment + git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health, beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest — the Forgejo registry lost that sha; node caches were the only thing keeping those CronJobs alive). publish-gate: vendor-contact emails (licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 01:47:54 +00:00
Viktor Barzin	32cf75635f	claude-breakglass: in-cluster warm break-glass UI for the devvm Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix, and power-cycle VM 102 via the Proxmox host if needed. App half landed in the claude-agent-service repo. New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO syncs only its key, so the pod has zero direct Vault access). Hardened to survive the pressure it exists to fix: priorityClassName tier-0-core, broad node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always. auth="required" ingress so it rides the Authentik resilience proxy and stays reachable via the basic-auth fallback during an auth-stack outage. Runs the shared claude-agent-service image with the breakglass entrypoint. files/breakglass-pve is the PVE forced-command (status\|forensics\|reset\|stop\| start\|cycle on VM 102, forensics-first). Isolation: the shared claude-agent pod's terraform-state Vault policy is explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a prompt-injected agent on that pod can't read the root-on-devvm key. traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls when the emergency basic-auth password rotates (it's a subPath mount that doesn't auto-update) — regenerated this session so Viktor has a known emergency credential, which the auth-stack-outage failure domain requires. Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure, incl. the per-host from= NAT quirks) and a security.md note recording the two new privileged footholds. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 21:40:17 +00:00

Author

SHA1

Message

Date

Viktor Barzin

c670cb7118

eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1

ci/woodpecker/push/default Pipeline failed

Details

The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.

Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.

CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-22 19:13:04 +00:00

Viktor Barzin

2f3c58dff1

claude-agent-service image -> ghcr across all five consumer stacks (infra#19)

ci/woodpecker/push/build-cli Pipeline was successful

Details

ci/woodpecker/push/default Pipeline was successful

Details

GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public
package, anonymous pulls). Repointed: claude-agent-service (deployment +
git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health,
beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest —
the Forgejo registry lost that sha; node caches were the only thing
keeping those CronJobs alive). publish-gate: vendor-contact emails
(licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-13 01:47:54 +00:00

Viktor Barzin

32cf75635f

claude-breakglass: in-cluster warm break-glass UI for the devvm

Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster
healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix,
and power-cycle VM 102 via the Proxmox host if needed. App half landed in the
claude-agent-service repo.

New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO
syncs only its key, so the pod has zero direct Vault access). Hardened to
survive the pressure it exists to fix: priorityClassName tier-0-core, broad
node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always.
auth="required" ingress so it rides the Authentik resilience proxy and stays
reachable via the basic-auth fallback during an auth-stack outage. Runs the
shared claude-agent-service image with the breakglass entrypoint.
files/breakglass-pve is the PVE forced-command (status|forensics|reset|stop|
start|cycle on VM 102, forensics-first).

Isolation: the shared claude-agent pod's terraform-state Vault policy is
explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a
prompt-injected agent on that pod can't read the root-on-devvm key.

traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls
when the emergency basic-auth password rotates (it's a subPath mount that
doesn't auto-update) — regenerated this session so Viktor has a known
emergency credential, which the auth-stack-outage failure domain requires.

Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure,
incl. the per-host from= NAT quirks) and a security.md note recording the two
new privileged footholds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 21:40:17 +00:00

3 commits