The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with a hostPath mount failure. The directory was previously created
manually; it is now provisioned automatically for new nodes.
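For reference, the mount in question looks roughly like this (a sketch, not the actual manifest; `type: DirectoryOrCreate` is one way to have the kubelet create the path on nodes that lack it):

```yaml
# Sketch of the kured sentinel hostPath volume; names are illustrative.
volumes:
  - name: sentinel
    hostPath:
      path: /sentinel
      type: DirectoryOrCreate   # kubelet creates the dir if missing
```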
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)
Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At a 3Gi limit, Prometheus is OOM-killed during WAL
replay on startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores
headroom.
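Sketch of the relevant spec, assuming standard field names (container name and layout are illustrative): pages written to a `medium: Memory` emptyDir are charged to the pod's memory cgroup, so the container limit must cover heap plus tmpfs.

```yaml
containers:
  - name: prometheus
    resources:
      limits:
        memory: 4Gi        # must cover heap + the 2Gi WAL tmpfs below
volumes:
  - name: wal
    emptyDir:
      medium: Memory       # tmpfs: usage counts against the memory limit
      sizeLimit: 2Gi
```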
- Add input validation: username regex + email format check in pipeline
- Quote variables in .provision-env to prevent shell injection
- Remove dead source command (each Woodpecker command is separate shell)
- Use jq to build JSON payloads (prevents injection via group names)
- Clean up git-crypt key on failure (use ; instead of &&)
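A minimal sketch of the jq pattern (the variable and payload shape are illustrative): `jq --arg` JSON-escapes arbitrary input, so quotes or braces in a group name cannot break out of the string value and inject extra fields.

```shell
#!/bin/sh
# Build the JSON body with jq rather than string interpolation, so a
# hostile group name cannot escape the string value.
GROUP_NAME='evil", "is_superuser": "true'   # hostile example input
payload=$(jq -nc --arg name "$GROUP_NAME" '{name: $name}')
echo "$payload"   # the quotes arrive escaped; still a single field
```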
- Add Kyverno ndots lifecycle ignore to webhook-handler deployment
The vault stack can't be applied in CI (it depends on git-crypt-decrypted
TLS certs and a sensitive for_each over k8s_users). The pipeline now
automates the Vault KV update and Authentik group creation, then notifies
an admin to apply the stacks manually.
This matches the existing pattern — vault is not in default.yml either.
Woodpecker performs compile-time substitution on ${...} patterns and
replaces unrecognized pipeline variables with empty strings. Writing
$VAR without braces leaves the variable for the shell to evaluate at
runtime.
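An illustrative pipeline step showing the distinction (step and variable names are hypothetical):

```yaml
steps:
  provision:
    image: alpine
    commands:
      - echo "applying as $DEPLOY_USER"     # expanded by the shell at runtime
      # - echo "applying as ${DEPLOY_USER}" # substituted (to "") at compile time
```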
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access
[ci skip]
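A sketch of what the generated per-user policy could look like, assuming a transit/encrypt + transit/decrypt path layout and using a stack named plotting-book as the owned-stack example:

```hcl
# Hypothetical sops-user-<name> policy: scoped to owned stacks only.
path "transit/encrypt/sops-state-plotting-book" {
  capabilities = ["update"]
}
path "transit/decrypt/sops-state-plotting-book" {
  capabilities = ["update"]
}
```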
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
- sops-admin: wildcard access to all transit keys
- sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
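An illustrative .sops.yaml creation rule under these assumptions (the Vault address, path regex, and age recipient are placeholders); listing both the Transit URI and an age recipient makes every file decryptable by either method:

```yaml
creation_rules:
  - path_regex: .*/terraform\.tfstate$   # placeholder pattern
    # Normal path: decrypt via Vault Transit after `vault login`.
    hc_vault_transit_uri: "https://vault.example.com/v1/transit/keys/sops-state"
    # Bootstrap/DR fallback: the age key held on disk.
    age: age1exampleexampleexampleexampleexampleexample   # placeholder recipient
```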
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to an empty shell (outputs only) for backward compat
with the 72 app stacks that declare dependency "platform".
Fixed the technitium cross-module dashboard reference by copying the file.
Woodpecker pipeline applies all 27+1 stacks in parallel via a loop.
All applied with zero destroys.
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied with
zero changes (aside from dbaas's pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
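The per-module migration can be sketched like this (a hypothetical run for the dbaas module; the directory layout and address prefixes are assumptions):

```shell
#!/bin/sh
# Move every resource under module.dbaas out of the platform state and
# into the new standalone stack's state file.
terraform -chdir=platform state list | grep '^module\.dbaas\.' |
while read -r addr; do
  terraform -chdir=platform state mv \
    -state-out=../stacks/dbaas/terraform.tfstate \
    "$addr" "${addr#module.dbaas.}"   # strip the old module prefix
done
```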
Vault's DB engine rotates passwords weekly, but 5 stacks baked the
password in at Terraform plan time, leaving stale credentials until
the next apply.
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
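The shared pattern across these five apps, sketched for a hypothetical app (store, secret, and Vault path names are assumptions): an ExternalSecret keeps a Kubernetes Secret in sync with the current Vault credentials, and the pod references it at runtime instead of a value captured at plan time.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vault-database
spec:
  refreshInterval: 15m                 # re-fetch well before weekly rotation
  secretStoreRef:
    name: vault                        # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: vault-database
  data:
    - secretKey: password
      remoteRef:
        key: database/creds/app        # illustrative Vault role path
        property: password
```

Deployments then read the value via `valueFrom.secretKeyRef` (name `vault-database`, key `password`), so rotated passwords reach pods without a Terraform apply.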
Split monolithic orchestrator into triage (haiku), historian (sonnet),
and report-writer (opus) stages. Each stage gets its own tool budget.
Added sev-context.sh for structured cluster context gathering.
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
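An illustrative rule combining two of the changes above (the metric pair and threshold are assumptions, not the actual rule): `container!=""` drops cadvisor's cgroup-level aggregate series, and the `* 100` makes `$value` read as a percentage in the summary.

```yaml
- alert: ContainerNearOOM
  expr: |
    100 * container_memory_working_set_bytes{container!=""}
        / container_spec_memory_limit_bytes{container!=""} > 90
  annotations:
    summary: "{{ $labels.pod }}/{{ $labels.container }} at {{ $value | humanize }}% of memory limit"
```

A production rule would also exclude containers with no memory limit, where the denominator is zero.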
The woodpecker server was crashing repeatedly with database authentication failures
because Vault rotates the database password every 24 hours, but the Helm release
had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time.
Changes:
- Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically
- Modified Helm values to use envFrom to inject the secret instead of hardcoding
- ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords
- Pod will auto-restart when secret changes (via reloader.stakater.com annotation)
- This eliminates the plan-time password snapshot that goes stale within 24h
The pod still has an unrelated image pull issue on k8s-node4 (containerd blob
corruption), but the database credentials mechanism is now correctly implemented.
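Using the chart option named earlier (`extraSecretNamesForEnvFrom`), the fix can be sketched as follows (the Secret name and annotation placement are assumptions):

```yaml
server:
  extraSecretNamesForEnvFrom:
    - woodpecker-database      # Secret kept current by the ExternalSecret,
                               # carrying WOODPECKER_DATABASE_DATASOURCE
  podAnnotations:
    reloader.stakater.com/auto: "true"   # restart pods when the secret changes
```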