1765 commits

Author SHA1 Message Date
Viktor Barzin
b3c9c45a17 multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip]
- Template: bump default memory from 128Mi to 256Mi (matches deploy-app skill guidance)
- ResourceQuota: add requests.storage (20Gi) and persistentvolumeclaims (5) defaults
- CONTRIBUTING.md: agent-friendly contributor guide for namespace-owners
2026-03-19 23:49:15 +00:00
Viktor Barzin
8dccf4f5ef state(openclaw): update encrypted state 2026-03-19 23:44:11 +00:00
Viktor Barzin
6b8ce04d44 fix(openclaw): change agent workspace from /workspace/infra to /workspace
Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.
2026-03-19 23:32:28 +00:00
Viktor Barzin
fd207f4db5 state(openclaw): update encrypted state 2026-03-19 23:29:48 +00:00
Viktor Barzin
e823b795f7 fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory
- Add docker.io/library/ prefix to mysql and postgres backup images
  to satisfy Kyverno require-trusted-registries policy (both CronJobs
  were blocked for 46h, triggering MySQLBackupStale alert)
- Document mysql-operator chart ignoring resources values key — the
  LimitRange default (256Mi) was silently applied, putting the operator
  at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
  (also fixed NFS export: vault-backup was a separate ZFS dataset not
  in the TrueNAS NFS share — destroyed dataset, created directory)
2026-03-19 23:26:05 +00:00
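The image-prefix fix in e823b795f7 can be illustrated with a small helper that turns a short image name into a fully qualified reference. This is only a sketch: the function name `qualify_image` and the image tags in the usage lines are hypothetical, and the registry-detection rule (first path component contains `.` or `:`, or equals `localhost`) is the usual Docker convention rather than anything taken from the repo.

```python
def qualify_image(image: str,
                  default_registry: str = "docker.io",
                  default_namespace: str = "library") -> str:
    """Return a fully qualified image reference (hypothetical helper)."""
    first, sep, _rest = image.partition("/")
    # The first component is a registry host only if it looks like one.
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        return image
    if not sep:
        # Bare official image such as "mysql:8.4" (tag is illustrative).
        return f"{default_registry}/{default_namespace}/{image}"
    # "user/repo" form: only the registry is missing.
    return f"{default_registry}/{image}"


if __name__ == "__main__":
    print(qualify_image("mysql:8.4"))
    print(qualify_image("docker.io/library/postgres:16"))
```

A registry allow-list policy that matches on the full reference would then see `docker.io/library/...` rather than a bare name.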
Viktor Barzin
250a058627 feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
d95144bd05 fix(immich): bump postgres memory 512Mi → 1Gi for v2.6.1 geodata migration
v2.6.1 bulk-inserts into geodata_places on first boot, OOM-killing
postgres at 512Mi. Raise to 1Gi to accommodate the migration.
2026-03-19 22:50:36 +00:00
Viktor Barzin
89bb74c4ee state(immich): update encrypted state 2026-03-19 22:47:32 +00:00
Viktor Barzin
da630b8869 upgrade immich v2.5.6 → v2.6.1 2026-03-19 22:45:04 +00:00
Viktor Barzin
c7dc63f923 state(immich): update encrypted state 2026-03-19 20:39:18 +00:00
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
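The Phase 2 rotation change in af2222fce8 (timestamped files plus a retention purge) could look roughly like the sketch below. The file naming scheme, the `.rdb` suffix handling, and the function names are assumptions for illustration; the actual CronJob scripts are not shown in this log.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

STAMP = "%Y%m%dT%H%M%SZ"  # assumed timestamp format, UTC


def snapshot_name(prefix: str, now: datetime) -> str:
    """Build a timestamped file name instead of overwriting a single file."""
    return f"{prefix}-{now.strftime(STAMP)}.rdb"


def purge_old(directory: Path, prefix: str, retention_days: int,
              now: datetime) -> list[str]:
    """Delete snapshots older than the retention window; return removed names."""
    cutoff = now - timedelta(days=retention_days)
    removed = []
    for path in sorted(directory.glob(f"{prefix}-*.rdb")):
        stamp = path.stem[len(prefix) + 1:]
        try:
            taken = datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # not one of our snapshots, leave it alone
        if taken < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Parsing the timestamp from the file name rather than relying on mtime keeps the purge deterministic even if files are copied or restored.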
Viktor Barzin
62d42657e6 state(redis): update encrypted state 2026-03-19 20:32:27 +00:00
Viktor Barzin
5be9f70a0d state(infra-maintenance): update encrypted state 2026-03-19 20:32:19 +00:00
Viktor Barzin
13759e58da state(redis): update encrypted state 2026-03-19 20:31:13 +00:00
Viktor Barzin
2511c1d78d state(infra-maintenance): update encrypted state 2026-03-19 20:30:50 +00:00
Viktor Barzin
414232cf5e state(redis): update encrypted state 2026-03-19 20:27:38 +00:00
Viktor Barzin
4680dd5fbc state(infra-maintenance): update encrypted state 2026-03-19 20:27:15 +00:00
Viktor Barzin
e54bc016ba reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)
2026-03-19 20:25:36 +00:00
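The mysql-operator limit in e54bc016ba follows from the VPA upper bound times a 1.2 headroom factor. The arithmetic can be sketched as below; rounding up to the next 10Mi is an assumption made here to reproduce the 580Mi figure, and `limit_from_vpa` is a hypothetical name.

```python
import math


def limit_from_vpa(upper_bound_mi: float, headroom: float = 1.2,
                   step_mi: int = 10) -> int:
    """Scale a VPA upperBound by a headroom factor, rounded up to step_mi."""
    return math.ceil(upper_bound_mi * headroom / step_mi) * step_mi


if __name__ == "__main__":
    # 481Mi * 1.2 = 577.2Mi, rounded up to 580Mi
    print(f"{limit_from_vpa(481)}Mi")
```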
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
67d1ce453c add /sentinel dir to cloud-init for kured reboot gating
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
2026-03-19 19:57:27 +00:00
Viktor Barzin
01eb9dd121 fix(monitoring): patch idrac-redfish-exporter to restore PSU voltage metric
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
  mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)

Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 13:37:14 +00:00
Viktor Barzin
b05421dbb5 add comment explaining prometheus 4Gi minimum memory requirement [ci skip] 2026-03-18 21:45:26 +00:00
Viktor Barzin
9d87ce605f revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on
startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.
2026-03-18 21:44:14 +00:00
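The reasoning in 9d87ce605f is a simple budget: a `medium: Memory` tmpfs is charged to the container's cgroup, so whatever the WAL volume holds is subtracted from what the process heap can use. A minimal sketch of the worst case (tmpfs completely full), using only the sizes stated in the commit message; the function name is hypothetical:

```python
GI = 1024 ** 3


def heap_headroom(limit_bytes: int, tmpfs_bytes: int) -> int:
    """Memory left for the process when the tmpfs volume is full."""
    return limit_bytes - tmpfs_bytes


if __name__ == "__main__":
    for limit_gi in (3, 4):
        left = heap_headroom(limit_gi * GI, 2 * GI)
        print(f"limit {limit_gi}Gi, WAL tmpfs 2Gi -> {left // GI}Gi for heap")
```

At a 3Gi limit only 1Gi remains for the heap in the worst case, which is consistent with the OOM-kills during WAL replay; 4Gi doubles that headroom.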
Viktor Barzin
03f55d969f state(vault): update encrypted state 2026-03-18 21:30:59 +00:00
Viktor Barzin
410c893647 fix(provision): security hardening from code review
- Add input validation: username regex + email format check in pipeline
- Quote variables in .provision-env to prevent shell injection
- Remove dead source command (each Woodpecker command runs in a separate shell)
- Use jq to build JSON payloads (prevents injection via group names)
- Clean up git-crypt key on failure (use ; instead of &&)
- Add Kyverno ndots lifecycle ignore to webhook-handler deployment
2026-03-18 21:25:03 +00:00
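Two of the hardening steps in 410c893647, validating input before use and building JSON with a serializer instead of string concatenation, can be sketched as follows. The exact regexes and the `name` payload field are assumptions for illustration; the pipeline itself uses shell and jq, which are not reproduced here.

```python
import json
import re

# Hypothetical rules: lowercase start, then lowercase letters, digits, hyphens.
USERNAME_RE = re.compile(r"[a-z][a-z0-9-]{1,30}")
# Deliberately loose format check, not full RFC 5322 validation.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate(username: str, email: str) -> None:
    """Reject input that does not match the expected shape."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"invalid username: {username!r}")
    if not EMAIL_RE.fullmatch(email):
        raise ValueError(f"invalid email: {email!r}")


def group_payload(group_name: str) -> str:
    """Serialize the payload so quotes in the name cannot break out of the string."""
    return json.dumps({"name": group_name})
```

Because the serializer escapes quotes and backslashes, a group name containing `"` stays a single string value rather than injecting extra keys, which is the same property jq provides in the pipeline.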
Viktor Barzin
e51c063600 docs(add-user): update skill with actual working flow (no auto TF apply) 2026-03-18 00:28:46 +00:00
Viktor Barzin
82403a933c fix(provision): remove TF apply from pipeline, notify for manual apply
Vault stack can't be applied in CI (git-crypt TLS certs + sensitive
for_each on k8s_users). Pipeline now automates Vault KV update +
Authentik group creation, then notifies admin to apply stacks manually.
This matches the existing pattern — vault is not in default.yml either.
2026-03-18 00:23:06 +00:00
Viktor Barzin
d76b4b698f fix(provision): targeted vault apply + git-crypt in terragrunt step
- Two-pass vault apply: first target new user resources, then full apply
- Add git-crypt unlock to terragrunt step (TLS certs needed at plan time)
2026-03-18 00:19:16 +00:00
Viktor Barzin
6fad484126 fix(provision): reduce memory limit to 4Gi (LimitRange max) 2026-03-18 00:15:26 +00:00
Viktor Barzin
de6a5caecc fix(provision): merge terragrunt-apply into single shell block for env persistence 2026-03-18 00:11:14 +00:00
Viktor Barzin
7a24ff6702 fix(provision): use $USERNAME/$EMAIL directly — Woodpecker 3.x env vars
Woodpecker 3.x exposes pipeline variables with their original key names
(USERNAME, EMAIL), not CI_PIPELINE_VARIABLE_ prefix.
2026-03-18 00:04:51 +00:00
Viktor Barzin
52dc657af5 debug(provision): dump env vars to find correct variable names 2026-03-18 00:00:33 +00:00
Viktor Barzin
0a05343d86 fix(provision): use $VAR instead of ${VAR} to avoid Woodpecker interpolation
Woodpecker performs compile-time substitution on ${...} patterns,
replacing pipeline variables with empty strings. Using $VAR without
braces lets the shell evaluate them at runtime.
2026-03-17 23:58:46 +00:00
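The behaviour described in 0a05343d86 can be modelled with a toy pre-processor. This is not Woodpecker's implementation, only a simplified simulation of "substitute `${...}` before the shell ever runs"; the function name and the regex are assumptions.

```python
import re

BRACED = re.compile(r"\$\{(\w+)\}")


def precompile(command: str, known: dict[str, str]) -> str:
    """Replace ${NAME} with a known value, or an empty string if unknown."""
    return BRACED.sub(lambda m: known.get(m.group(1), ""), command)


if __name__ == "__main__":
    # The braced form is consumed before the shell sees it.
    print(precompile('echo "${USERNAME}"', {}))
    # The unbraced form passes through untouched for runtime expansion.
    print(precompile('echo "$USERNAME"', {}))
```

In this model the braced variable is already gone by the time the command reaches the shell, while the unbraced one survives to be evaluated at runtime.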
Viktor Barzin
fd130971aa feat(provision): automated user provisioning via Authentik webhook
- Expand CI Vault policy: write secret/data/platform + Transit SOPS keys
- Add Woodpecker provision-user.yml pipeline (manual event, API-triggered)
- Add env vars to webhook-handler deployment for Woodpecker/Authentik integration
- Update add-user skill with automated flow documentation
- Update Woodpecker repo ID list in CLAUDE.md
2026-03-17 23:56:30 +00:00
Viktor Barzin
82b9dd9e8a state(webhook_handler): update encrypted state 2026-03-17 23:52:32 +00:00
Viktor Barzin
5b29cfc73a state(vault): update encrypted state 2026-03-17 23:46:56 +00:00
Viktor Barzin
0fff155f17 feat(k8s-portal): update onboarding + architecture with SOPS state docs
Onboarding (namespace-owner):
- Add steps for sops/terragrunt install, state decrypt, apply workflow
- Add flow diagram showing auth → decrypt → apply → encrypt → push
- Add architecture overview with security model table
- Add access control callout explaining per-stack Transit keys

Architecture:
- Add secrets & state encryption section with ASCII diagrams
- Add request flow diagram (Cloudflare → Traefik → pods)
- Add CI/CD pipeline diagram (GHA → Woodpecker → K8s)

[ci skip]
2026-03-17 23:17:47 +00:00
Viktor Barzin
ccbcebb670 feat(vault): automate SOPS onboarding for namespace-owners
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access

[ci skip]
2026-03-17 23:15:25 +00:00
Viktor Barzin
4d40c51a97 state(vault): update encrypted state 2026-03-17 23:14:24 +00:00
Viktor Barzin
7a8452e4c7 state(vault): update encrypted state 2026-03-17 23:14:16 +00:00
Viktor Barzin
0215d81622 state(vault): update encrypted state 2026-03-17 23:13:57 +00:00
Viktor Barzin
750cfcce7c state(vault): update encrypted state 2026-03-17 23:13:55 +00:00
Viktor Barzin
e54ad33315 state(vault): update encrypted state 2026-03-17 23:13:19 +00:00
Viktor Barzin
02d0291797 state(vault): update encrypted state 2026-03-17 23:12:58 +00:00
Viktor Barzin
468df3c5c4 state(vault): update encrypted state 2026-03-17 23:12:35 +00:00
Viktor Barzin
cf570c3d3b state(vault): update encrypted state 2026-03-17 23:12:03 +00:00
Viktor Barzin
4277b41c28 state(vault): update encrypted state 2026-03-17 23:11:55 +00:00
Viktor Barzin
77143dfd6b state: per-stack Transit keys for namespace-owner access control
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
  - sops-admin: wildcard access to all transit keys
  - sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
2026-03-17 23:08:18 +00:00
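The per-stack scoping in 77143dfd6b can be sketched by generating a policy body from a list of owned stacks. The key naming (`sops-state-<stack>`) comes from the commit message; granting `update` on the Transit `encrypt` and `decrypt` paths is an assumption about what such a policy would contain, and the function names are hypothetical.

```python
def transit_key(stack: str) -> str:
    """Per-stack Transit key name, as described in the commit message."""
    return f"sops-state-{stack}"


def user_policy(stacks: list[str]) -> str:
    """Render a Vault policy body limited to the given stacks (assumed shape)."""
    blocks = []
    for stack in sorted(stacks):
        for op in ("encrypt", "decrypt"):
            blocks.append(
                f'path "transit/{op}/{transit_key(stack)}" {{\n'
                f'  capabilities = ["update"]\n'
                f'}}'
            )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(user_policy(["plotting-book"]))
```

A policy rendered for `plotting-book` names only that stack's key, so state for any other stack stays undecryptable for that user.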
Viktor Barzin
6239e07dd5 docs: add plotting-book to GHA-migrated list and repo IDs [ci skip] 2026-03-17 23:07:32 +00:00
Viktor Barzin
4e7ca1ad61 state: add Vault Transit as primary SOPS backend, age as fallback
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
2026-03-17 22:56:33 +00:00
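The "Transit first, age as fallback" order in 4e7ca1ad61 is an ordered-fallback pattern. The sketch below shows only the control flow; the backends are injected as plain callables, so nothing here calls sops, Vault, or age, and `decrypt_state` is a hypothetical name.

```python
from typing import Callable, Sequence, Tuple

Backend = Tuple[str, Callable[[bytes], bytes]]


def decrypt_state(ciphertext: bytes, backends: Sequence[Backend]) -> Tuple[str, bytes]:
    """Try each backend in order; return (backend name, plaintext) of the first success."""
    errors = []
    for name, decrypt in backends:
        try:
            return name, decrypt(ciphertext)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Listing the Transit backend first and the age backend second reproduces the order described above: the normal workflow needs no key file, and the on-disk age key is only consulted when Transit is unavailable.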