Commit graph

4571 commits

Author SHA1 Message Date
Viktor Barzin
d105713ae7 fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path
All checks were successful
ci/woodpecker/push/default Pipeline was successful
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)

Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:33:41 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
8121d8a4ac docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.

ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:22:29 +00:00
Viktor Barzin
ebc8b6588f ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:28:11 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
Viktor Barzin
1d0388da12 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:22:58 +00:00
Viktor Barzin
92361f36db calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability)
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.

Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.

Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:22:48 +00:00
Viktor Barzin
e711b2f971 feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build infra CLI / build (push) Has been cancelled
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
2026-06-24 10:31:32 +00:00
Viktor Barzin
64104e56e9 feat(devvm): install Bitwarden CLI for homelab vault 2026-06-24 10:29:57 +00:00
Viktor Barzin
15643d1f44 feat(cli): bare homelab vault help command 2026-06-24 10:29:32 +00:00
Viktor Barzin
772aed5370 fix(cli): vault security review fixes
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` refused on a TTY (was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
    secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
2026-06-24 10:28:31 +00:00
Viktor Barzin
5a864cf19c feat(cli): homelab vault setup onboarding (one-time, self-service) 2026-06-24 10:21:57 +00:00
Viktor Barzin
e20033855d feat(cli): vault list/search/code/status/lock 2026-06-24 10:21:07 +00:00
Viktor Barzin
365340b37d feat(cli): homelab vault get with TTY-aware return 2026-06-24 10:20:05 +00:00
Viktor Barzin
2dd12fc6be feat(cli): vault session bootstrap with per-user flock + no-coredump 2026-06-24 10:18:36 +00:00
Viktor Barzin
5bae2a3907 feat(cli): privacy-aware vault op-log (process, never the secret) 2026-06-24 10:17:50 +00:00
Viktor Barzin
81122f8607 feat(cli): TTY-aware return + OSC52 clipboard with terminal gating 2026-06-24 10:17:13 +00:00
Viktor Barzin
06f4b87af1 feat(cli): vault bw engine env/arg builders + unlock 2026-06-24 10:16:19 +00:00
Viktor Barzin
cd44ca5921 feat(cli): vault creds loading from per-user Vault path 2026-06-24 10:15:32 +00:00
Viktor Barzin
6c53ee10b1 feat(cli): register homelab vault command group skeleton 2026-06-24 10:14:24 +00:00
Viktor Barzin
ae0d7984c4 docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.

Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.

Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:00:36 +00:00
Viktor Barzin
0293b5c634 android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)

Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.

Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:36 +00:00
Viktor Barzin
839fdb33c2 android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00
Viktor Barzin
566447a698 k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.

Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 06:06:14 +00:00
Viktor Barzin
98d2b89614 calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi
startup spike (re-listing resources to build informer caches), both at/over the
256Mi limit, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 12:46:28 +00:00
Viktor Barzin
68c240b8de Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-23 09:56:25 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
ff4b01a674 state(external-secrets): update encrypted state 2026-06-23 09:53:36 +00:00
Viktor Barzin
e1a85dd727 state(external-secrets): update encrypted state 2026-06-23 09:52:30 +00:00
Viktor Barzin
af22416d6f state(external-secrets): update encrypted state 2026-06-23 09:51:21 +00:00
Viktor Barzin
c75982f408 state(external-secrets): update encrypted state 2026-06-23 09:50:11 +00:00
Viktor Barzin
0407e3c578 state(external-secrets): update encrypted state 2026-06-23 09:48:33 +00:00
Viktor Barzin
dab8f9446f state(external-secrets): update encrypted state 2026-06-23 09:47:24 +00:00
Viktor Barzin
e815bb0295 state(external-secrets): update encrypted state 2026-06-23 09:46:17 +00:00
Viktor Barzin
8412cd7d54 state(external-secrets): update encrypted state 2026-06-23 09:45:04 +00:00
Viktor Barzin
f2956e1e62 state(external-secrets): update encrypted state 2026-06-23 09:43:57 +00:00
Viktor Barzin
bf2f865eee state(external-secrets): update encrypted state 2026-06-23 09:42:52 +00:00
Viktor Barzin
6f3cfb18c7 state(external-secrets): update encrypted state 2026-06-23 09:41:46 +00:00
Viktor Barzin
6e8e066215 state(external-secrets): update encrypted state 2026-06-23 09:40:14 +00:00
Viktor Barzin
de1fb04d9f state(external-secrets): update encrypted state 2026-06-23 09:39:12 +00:00