Viktor Barzin
53e05e63b5
state(cnpg): update encrypted state
2026-03-21 11:22:33 +00:00
Viktor Barzin
a5136749b7
state(claude-memory): update encrypted state
2026-03-21 11:21:49 +00:00
Viktor Barzin
73ca114ffa
state(city-guesser): update encrypted state
2026-03-21 11:20:36 +00:00
Viktor Barzin
b5fbd19088
state(changedetection): update encrypted state
2026-03-21 11:20:30 +00:00
Viktor Barzin
9b4bf85933
state(calibre): update encrypted state
2026-03-21 11:19:09 +00:00
Viktor Barzin
0888cb100a
state(blog): update encrypted state
2026-03-21 11:19:04 +00:00
Viktor Barzin
8551e75305
state(authentik): update encrypted state
2026-03-21 11:18:56 +00:00
Viktor Barzin
92aba3a9f7
state(audiobookshelf): update encrypted state
2026-03-21 11:18:53 +00:00
Viktor Barzin
d4edd53367
state(affine): update encrypted state
2026-03-21 11:18:02 +00:00
Viktor Barzin
0d69403aaa
state(actualbudget): update encrypted state
2026-03-21 11:16:01 +00:00
Viktor Barzin
21cfa8c072
bump memory limits for OOM-prone services
FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi)
Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi)
n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi)
Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi)
Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)
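The figures above can be sanity-checked with a short sketch. It only reuses the numbers from this commit message (the VPA upper bounds are the approximate values quoted); the names `LIMITS_MI` and `headroom` are made up for illustration and do not come from the repository.

```python
# (old limit, new limit, approximate VPA upper bound) in Mi,
# copied from the commit message above.
LIMITS_MI = {
    "FreshRSS": (64, 256, 204),
    "Actual Budget HTTP API": (128, 384, 297),
    "n8n": (768, 1024, 765),  # 1Gi = 1024Mi
    "Dawarich": (768, 896, 628),
    "Traefik": (384, 768, 584),
}


def headroom(new_limit_mi, vpa_upper_mi):
    """Return the new limit as a multiple of the VPA upper bound."""
    return new_limit_mi / vpa_upper_mi


# Hypothetical helper table: service name -> headroom ratio.
HEADROOM = {
    name: headroom(new, upper) for name, (_old, new, upper) in LIMITS_MI.items()
}

if __name__ == "__main__":
    for name, ratio in HEADROOM.items():
        print(f"{name}: new limit is {ratio:.2f}x the VPA upper bound")
```

Under these numbers every new limit sits above the quoted VPA upper bound, which is the property the bump is meant to restore.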
2026-03-21 11:12:12 +00:00
Viktor Barzin
c848c9a39b
state(dawarich): update encrypted state
2026-03-21 11:09:39 +00:00
Viktor Barzin
c28c2cf654
state(n8n): update encrypted state
2026-03-21 11:08:46 +00:00
Viktor Barzin
3029c708b8
state(actualbudget): update encrypted state
2026-03-21 11:06:32 +00:00
Viktor Barzin
fcd602a257
state(freshrss): update encrypted state
2026-03-21 11:06:24 +00:00
Viktor Barzin
b3c9c45a17
multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip]
- Template: bump default memory from 128Mi to 256Mi (matches deploy-app skill guidance)
- ResourceQuota: add requests.storage (20Gi) and persistentvolumeclaims (5) defaults
- CONTRIBUTING.md: agent-friendly contributor guide for namespace-owners
2026-03-19 23:49:15 +00:00
Viktor Barzin
8dccf4f5ef
state(openclaw): update encrypted state
2026-03-19 23:44:11 +00:00
Viktor Barzin
6b8ce04d44
fix(openclaw): change agent workspace from /workspace/infra to /workspace
Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.
2026-03-19 23:32:28 +00:00
Viktor Barzin
fd207f4db5
state(openclaw): update encrypted state
2026-03-19 23:29:48 +00:00
Viktor Barzin
e823b795f7
fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory
- Add docker.io/library/ prefix to mysql and postgres backup images
  to satisfy Kyverno require-trusted-registries policy (both CronJobs
  were blocked for 46h, triggering MySQLBackupStale alert)
- Document that the mysql-operator chart ignores the resources values
  key: the LimitRange default (256Mi) was silently applied, putting the
  operator at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
  (also fixed NFS export: vault-backup was a separate ZFS dataset not
  in the TrueNAS NFS share; destroyed the dataset, created a directory)
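As an illustration of the first item, here is a minimal sketch of the conventional Docker Hub short-name expansion that the fully qualified image names spell out. The function name `qualify_image` and the image tags in the usage lines are hypothetical; this is not Kyverno's code and the commit does not say which tags the CronJobs use.

```python
def qualify_image(ref):
    """Expand a short image reference to an explicit docker.io name.

    Assumption: the first path component is a registry host only if it
    contains '.' or ':' or equals 'localhost' (the usual Docker rule).
    """
    first, sep, _rest = ref.partition("/")
    if not sep:
        # Bare name such as "mysql:8.0" is an official library image.
        return f"docker.io/library/{ref}"
    if "." in first or ":" in first or first == "localhost":
        # Already carries a registry host; leave it untouched.
        return ref
    # "user/image" form lives on Docker Hub under that user namespace.
    return f"docker.io/{ref}"


if __name__ == "__main__":
    for image in ("mysql:8.0", "postgres:16"):  # hypothetical tags
        print(image, "->", qualify_image(image))
```

A policy that matches on registry prefixes can only see the explicit form, which is why the short names were rejected.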
2026-03-19 23:26:05 +00:00
Viktor Barzin
250a058627
feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
d95144bd05
fix(immich): bump postgres memory 512Mi → 1Gi for v2.6.1 geodata migration
v2.6.1 bulk-inserts into geodata_places on first boot, OOM-killing
postgres at 512Mi. Raise to 1Gi to accommodate the migration.
2026-03-19 22:50:36 +00:00
Viktor Barzin
89bb74c4ee
state(immich): update encrypted state
2026-03-19 22:47:32 +00:00
Viktor Barzin
da630b8869
upgrade immich v2.5.6 → v2.6.1
2026-03-19 22:45:04 +00:00
Viktor Barzin
c7dc63f923
state(immich): update encrypted state
2026-03-19 20:39:18 +00:00
Viktor Barzin
af2222fce8
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
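The Phase 2 rotation (timestamped files plus a retention purge) can be sketched as follows. The file name pattern `dump-<timestamp>.rdb` and the helper names are assumptions for illustration; the commit states only that Redis RDB files are timestamped and purged after 7 days.

```python
import re
from datetime import datetime, timedelta

# Hypothetical naming scheme: dump-YYYYmmddTHHMMSS.rdb
STAMP = re.compile(r"dump-(\d{8}T\d{6})\.rdb$")


def backup_name(now):
    """Build a timestamped file name instead of overwriting a single file."""
    return f"dump-{now:%Y%m%dT%H%M%S}.rdb"


def expired(names, now, retention_days=7):
    """Return the backup names whose embedded timestamp is past retention."""
    cutoff = now - timedelta(days=retention_days)
    stale = []
    for name in names:
        match = STAMP.search(name)
        if match is None:
            continue  # never touch files that do not follow the scheme
        taken = datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
        if taken < cutoff:
            stale.append(name)
    return stale


if __name__ == "__main__":
    now = datetime(2026, 3, 19, 20, 0, 0)
    names = [backup_name(now - timedelta(days=d)) for d in (0, 6, 8)]
    print(expired(names, now))
```

Selecting by the timestamp in the name rather than by file mtime keeps the purge predictable when files are copied or restored.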
2026-03-19 20:34:33 +00:00
Viktor Barzin
62d42657e6
state(redis): update encrypted state
2026-03-19 20:32:27 +00:00
Viktor Barzin
5be9f70a0d
state(infra-maintenance): update encrypted state
2026-03-19 20:32:19 +00:00
Viktor Barzin
13759e58da
state(redis): update encrypted state
2026-03-19 20:31:13 +00:00
Viktor Barzin
2511c1d78d
state(infra-maintenance): update encrypted state
2026-03-19 20:30:50 +00:00
Viktor Barzin
414232cf5e
state(redis): update encrypted state
2026-03-19 20:27:38 +00:00
Viktor Barzin
4680dd5fbc
state(infra-maintenance): update encrypted state
2026-03-19 20:27:15 +00:00
Viktor Barzin
e54bc016ba
reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)
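A minimal sketch of the arithmetic behind the last item. The factor 1.2 comes from the commit; rounding up to the next 10Mi is an assumption made here to explain how 577.2Mi becomes 580Mi, and `limit_from_vpa` is a hypothetical helper name.

```python
import math


def limit_from_vpa(upper_bound_mi, factor=1.2, step_mi=10):
    """Scale the VPA upperBound by a headroom factor, then round up.

    Assumption: the result is rounded up to a multiple of step_mi.
    """
    raw = upper_bound_mi * factor  # 481 * 1.2 is about 577.2
    return math.ceil(raw / step_mi) * step_mi


if __name__ == "__main__":
    print(limit_from_vpa(481), "Mi")
```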
2026-03-19 20:25:36 +00:00
Viktor Barzin
21bb3036af
state(dbaas): update encrypted state
2026-03-19 20:23:59 +00:00
Viktor Barzin
67d1ce453c
add /sentinel dir to cloud-init for kured reboot gating
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
2026-03-19 19:57:27 +00:00
Viktor Barzin
01eb9dd121
fix(monitoring): patch idrac-redfish-exporter to restore PSU voltage metric
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
    mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)
Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 13:37:14 +00:00
Viktor Barzin
b05421dbb5
add comment explaining prometheus 4Gi minimum memory requirement [ci skip]
2026-03-18 21:45:26 +00:00
Viktor Barzin
9d87ce605f
revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on
startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.
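The budget described above reduces to one subtraction. The sketch below assumes the worst case in which the 2Gi tmpfs is completely full; `heap_budget_gi` is a hypothetical helper name, and the actual heap size during WAL replay is not stated in the commit.

```python
def heap_budget_gi(limit_gi, tmpfs_gi):
    """Memory left for the process once a full memory-backed tmpfs is charged
    to the same container limit."""
    return limit_gi - tmpfs_gi


if __name__ == "__main__":
    for limit in (3, 4):
        print(f"{limit}Gi limit leaves {heap_budget_gi(limit, 2)}Gi for the heap")
```

At 3Gi only 1Gi remains for the heap, which the commit reports as too little during startup; 4Gi doubles that.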
2026-03-18 21:44:14 +00:00
Viktor Barzin
03f55d969f
state(vault): update encrypted state
2026-03-18 21:30:59 +00:00
Viktor Barzin
410c893647
fix(provision): security hardening from code review
- Add input validation: username regex + email format check in pipeline
- Quote variables in .provision-env to prevent shell injection
- Remove dead source command (each Woodpecker command is separate shell)
- Use jq to build JSON payloads (prevents injection via group names)
- Clean up git-crypt key on failure (use ; instead of &&)
- Add Kyverno ndots lifecycle ignore to webhook-handler deployment
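The first, second and fourth items follow one idea: validate input early and let a serializer or quoting routine build the output instead of concatenating strings. The pipeline itself uses shell and jq; the Python sketch below only illustrates the same idea. Both regular expressions and all function names are hypothetical and are not the patterns used in the pipeline.

```python
import json
import re
import shlex

# Hypothetical patterns, for illustration only.
USERNAME_RE = re.compile(r"[a-z][a-z0-9-]{1,30}")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # format check, not deliverability


def validate(username, email):
    """Accept only inputs that match both patterns in full."""
    return bool(USERNAME_RE.fullmatch(username)) and bool(EMAIL_RE.fullmatch(email))


def env_line(key, value):
    """Render KEY=value for an env file with the value shell-quoted."""
    return f"{key}={shlex.quote(value)}"


def group_payload(name):
    """Build a JSON body with a serializer so quotes in the name cannot
    break out of the string (the role jq plays in the pipeline)."""
    return json.dumps({"name": name})


if __name__ == "__main__":
    print(validate("alice", "alice@example.com"))
    print(env_line("EMAIL", "alice@example.com"))
    print(group_payload('team "core"'))
```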
2026-03-18 21:25:03 +00:00
Viktor Barzin
e51c063600
docs(add-user): update skill with actual working flow (no auto TF apply)
2026-03-18 00:28:46 +00:00
Viktor Barzin
82403a933c
fix(provision): remove TF apply from pipeline, notify for manual apply
Vault stack can't be applied in CI (git-crypt TLS certs + sensitive
for_each on k8s_users). Pipeline now automates Vault KV update +
Authentik group creation, then notifies admin to apply stacks manually.
This matches the existing pattern — vault is not in default.yml either.
2026-03-18 00:23:06 +00:00
Viktor Barzin
d76b4b698f
fix(provision): targeted vault apply + git-crypt in terragrunt step
- Two-pass vault apply: first target new user resources, then full apply
- Add git-crypt unlock to terragrunt step (TLS certs needed at plan time)
2026-03-18 00:19:16 +00:00
Viktor Barzin
6fad484126
fix(provision): reduce memory limit to 4Gi (LimitRange max)
2026-03-18 00:15:26 +00:00
Viktor Barzin
de6a5caecc
fix(provision): merge terragrunt-apply into single shell block for env persistence
2026-03-18 00:11:14 +00:00
Viktor Barzin
7a24ff6702
fix(provision): use $USERNAME/$EMAIL directly — Woodpecker 3.x env vars
Woodpecker 3.x exposes pipeline variables with their original key names
(USERNAME, EMAIL), not CI_PIPELINE_VARIABLE_ prefix.
2026-03-18 00:04:51 +00:00
Viktor Barzin
52dc657af5
debug(provision): dump env vars to find correct variable names
2026-03-18 00:00:33 +00:00
Viktor Barzin
0a05343d86
fix(provision): use $VAR instead of ${VAR} to avoid Woodpecker interpolation
Woodpecker performs compile-time substitution on ${...} patterns,
replacing pipeline variables with empty strings. Using $VAR without
braces lets the shell evaluate them at runtime.
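The effect described above can be modeled with a toy substitution pass. This is a simplified model of the behavior the commit reports, not Woodpecker's implementation; `compile_time_substitute` is a hypothetical name.

```python
import re


def compile_time_substitute(command, pipeline_vars):
    """Replace every ${NAME} before the shell runs; unknown names become ''.

    Plain $NAME is not matched, so it survives for the shell to expand.
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: pipeline_vars.get(match.group(1), ""),
        command,
    )


if __name__ == "__main__":
    # No value is known at compile time, so the braced form is emptied.
    print(compile_time_substitute('echo "${USERNAME}"', {}))
    print(compile_time_substitute('echo "$USERNAME"', {}))
```

In this model the braced form reaches the shell as an empty string, while the unbraced form reaches it intact and is expanded at runtime.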
2026-03-17 23:58:46 +00:00
Viktor Barzin
fd130971aa
feat(provision): automated user provisioning via Authentik webhook
- Expand CI Vault policy: write secret/data/platform + Transit SOPS keys
- Add Woodpecker provision-user.yml pipeline (manual event, API-triggered)
- Add env vars to webhook-handler deployment for Woodpecker/Authentik integration
- Update add-user skill with automated flow documentation
- Update Woodpecker repo ID list in CLAUDE.md
2026-03-17 23:56:30 +00:00
Viktor Barzin
82b9dd9e8a
state(webhook_handler): update encrypted state
2026-03-17 23:52:32 +00:00