1730 commits

Author SHA1 Message Date
Viktor Barzin
5b29cfc73a state(vault): update encrypted state 2026-03-17 23:46:56 +00:00
Viktor Barzin
0fff155f17 feat(k8s-portal): update onboarding + architecture with SOPS state docs
Onboarding (namespace-owner):
- Add steps for sops/terragrunt install, state decrypt, apply workflow
- Add flow diagram showing auth → decrypt → apply → encrypt → push
- Add architecture overview with security model table
- Add access control callout explaining per-stack Transit keys

Architecture:
- Add secrets & state encryption section with ASCII diagrams
- Add request flow diagram (Cloudflare → Traefik → pods)
- Add CI/CD pipeline diagram (GHA → Woodpecker → K8s)

[ci skip]
2026-03-17 23:17:47 +00:00
Viktor Barzin
ccbcebb670 feat(vault): automate SOPS onboarding for namespace-owners
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access

[ci skip]
2026-03-17 23:15:25 +00:00
Viktor Barzin
4d40c51a97 state(vault): update encrypted state 2026-03-17 23:14:24 +00:00
Viktor Barzin
7a8452e4c7 state(vault): update encrypted state 2026-03-17 23:14:16 +00:00
Viktor Barzin
0215d81622 state(vault): update encrypted state 2026-03-17 23:13:57 +00:00
Viktor Barzin
750cfcce7c state(vault): update encrypted state 2026-03-17 23:13:55 +00:00
Viktor Barzin
e54ad33315 state(vault): update encrypted state 2026-03-17 23:13:19 +00:00
Viktor Barzin
02d0291797 state(vault): update encrypted state 2026-03-17 23:12:58 +00:00
Viktor Barzin
468df3c5c4 state(vault): update encrypted state 2026-03-17 23:12:35 +00:00
Viktor Barzin
cf570c3d3b state(vault): update encrypted state 2026-03-17 23:12:03 +00:00
Viktor Barzin
4277b41c28 state(vault): update encrypted state 2026-03-17 23:11:55 +00:00
Viktor Barzin
77143dfd6b state: per-stack Transit keys for namespace-owner access control
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
  - sops-admin: wildcard access to all transit keys
  - sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
2026-03-17 23:08:18 +00:00
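Note on 77143dfd6b: the per-stack scoping can be pictured as one policy per namespace-owner that names only the Transit keys of the stacks they own. The sketch below is illustrative and not taken from the repository. The function name `render_sops_user_policy` is hypothetical, and granting `encrypt` alongside `decrypt` is an assumption. The key naming `sops-state-<stack>` comes from the commit message, and `update` is the capability Vault requires on Transit encrypt/decrypt paths.

```python
from typing import Iterable


def render_sops_user_policy(stacks: Iterable[str], key_prefix: str = "sops-state-") -> str:
    """Render Vault policy HCL that allows Transit use only for the given stacks.

    Hypothetical helper: the real policy lives in the vault stack's Terraform.
    """
    blocks = []
    for stack in sorted(set(stacks)):
        key = f"{key_prefix}{stack}"
        # Assumption: owners may both encrypt and decrypt their own state.
        for op in ("encrypt", "decrypt"):
            blocks.append(
                f'path "transit/{op}/{key}" {{\n'
                f'  capabilities = ["update"]\n'
                f'}}'
            )
    return "\n\n".join(blocks) + "\n"


# Example: a namespace-owner who owns only the plotting-book stack.
policy = render_sops_user_policy(["plotting-book"])
print(policy)
```

Because the policy lists explicit key paths rather than a wildcard, a key for any other stack is simply absent from it, which is what keeps a namespace-owner out of state they do not own.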
Viktor Barzin
6239e07dd5 docs: add plotting-book to GHA-migrated list and repo IDs [ci skip] 2026-03-17 23:07:32 +00:00
Viktor Barzin
4e7ca1ad61 state: add Vault Transit as primary SOPS backend, age as fallback
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
2026-03-17 22:56:33 +00:00
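Note on 4e7ca1ad61: the "Transit first, age as fallback" order can be sketched as a small selection function. This is a minimal sketch under stated assumptions, not the logic of scripts/state-sync itself. `choose_backend` is a hypothetical name, and detecting a Vault session through the `VAULT_TOKEN` environment variable is an assumption (the real script may instead attempt Transit and fall back on failure). The age key path is the one given in the commit message.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Bootstrap/DR key location, as stated in the commit message.
AGE_KEY_PATH = Path("~/.config/sops/age/keys.txt")


def choose_backend(env: Mapping[str, str], age_key_exists: bool) -> Optional[str]:
    """Pick the decryption backend: Vault Transit if possible, else age, else None."""
    # Assumption: a non-empty VAULT_TOKEN means a usable Vault session.
    if env.get("VAULT_TOKEN"):
        return "vault-transit"
    if age_key_exists:
        return "age"
    return None


if __name__ == "__main__":
    backend = choose_backend(os.environ, AGE_KEY_PATH.expanduser().is_file())
    print(backend or "no decryption backend available")
```

Keeping the decision in a pure function (environment and key presence passed in) makes the fallback order easy to check without touching Vault or the disk.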
Viktor Barzin
9f80eb7ba0 state: add devvm as SOPS recipient
Add devvm age public key to .sops.yaml and re-encrypt all 101 state
files with both laptop and devvm keys.
2026-03-17 22:41:19 +00:00
Viktor Barzin
b6faa24349 state: add SOPS-encrypted terraform state to git
- SOPS + age encrypts all 101 .tfstate files (JSON-aware: keys visible, values encrypted)
- scripts/state-sync: encrypt/decrypt/commit wrapper
- scripts/tg: auto-decrypt before ops, auto-encrypt+commit after apply/destroy
- terragrunt.hcl: -backup=- prevents backup file accumulation
- .gitignore: track .tfstate.enc, ignore plaintext .tfstate
- Cleaned 964MB of stale backups (state/backups/, .backup files)
2026-03-17 22:37:56 +00:00
Viktor Barzin
12a51c4ffa right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.
2026-03-17 22:35:54 +00:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00
Viktor Barzin
ae36dc253b extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-17 21:34:11 +00:00
Viktor Barzin
3c804aedf8 extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
2026-03-17 18:11:53 +00:00
Viktor Barzin
c8b42f78df fix DB password rotation desync in 5 stacks
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.

- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
2026-03-17 07:39:29 +00:00
Viktor Barzin
8d8c8db737 increase DB password rotation from 24h to weekly (604800s) 2026-03-16 23:17:01 +00:00
Viktor Barzin
c31ba2c50c k8s-portal: use Recreate strategy, limit revision history to 3
Prevents stale pods serving old content during rapid successive deploys.
With 1 replica + RollingUpdate, old and new pods briefly coexist.
2026-03-16 22:55:15 +00:00
Viktor Barzin
6cc4d526f1 add GitHub Pages for post-mortems
- Index page listing all incident reports
- GHA workflow deploys post-mortems/ on push
- Available at viktorbarzin.github.io/infra/
2026-03-16 22:16:05 +00:00
Viktor Barzin
fb66676d7b post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-16 22:06:10 +00:00
Viktor Barzin
d6afbe84c8 post-mortem v2: pipeline team architecture with 4-stage agents [ci skip]
Split monolithic orchestrator into triage (haiku), historian (sonnet),
and report-writer (opus) stages. Each stage gets its own tool budget.
Added sev-context.sh for structured cluster context gathering.
2026-03-16 21:59:34 +00:00
Viktor Barzin
327c021a90 fix: improve Slack alert formatting — add values, fix ContainerNearOOM filter
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
  ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
2026-03-16 19:35:24 +00:00
Viktor Barzin
b2d07556d5 fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret
The woodpecker server was crashing repeatedly with database authentication failures
because Vault rotates the database password every 24 hours, but the Helm release
had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time.

Changes:
- Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically
- Modified Helm values to use envFrom to inject the secret instead of hardcoding
- ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords
- Pod will auto-restart when secret changes (via reloader.stakater.com annotation)
- This eliminates the plan-time password snapshot that goes stale within 24h

The pod still has an unrelated image pull issue on k8s-node4 (containerd blob
corruption), but the database credentials mechanism is now correctly implemented.
2026-03-16 19:12:01 +00:00
Viktor Barzin
0abb6b83ad add deploy-app skill and agent for automated repo→app deployment [ci skip] 2026-03-16 18:06:24 +00:00
Viktor Barzin
f8a36f0621 fix pull-through cache: remove maxsize, harden nginx caching [ci skip]
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.

Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
  blobs on node3/node4
2026-03-16 07:41:11 +00:00
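Note on f8a36f0621: the hardened caching rules can be modelled as an admission check in front of the cache. The sketch below is a model of the behaviour in Python, not nginx configuration, and `CacheAdmission` is a hypothetical name. Skipping 206 responses and requiring two requests before caching follow the commit message. The body-length comparison is an added assumption that illustrates the "200 OK with correct Content-Length but 0 bytes body" failure and is not claimed to be part of the actual fix.

```python
from collections import Counter


class CacheAdmission:
    """Decide whether an upstream response may be stored in the cache."""

    def __init__(self, min_uses: int = 2) -> None:
        self.min_uses = min_uses
        self._seen: Counter = Counter()  # request count per cache key

    def should_cache(self, key: str, status: int, content_length: int, body: bytes) -> bool:
        self._seen[key] += 1
        # Only full 200 responses are candidates; this excludes 206 partials.
        if status != 200:
            return False
        # Assumption for illustration: reject bodies shorter or longer than
        # the advertised Content-Length (the truncated-blob failure mode).
        if len(body) != content_length:
            return False
        # Cache only once the key has been requested min_uses times.
        return self._seen[key] >= self.min_uses


admission = CacheAdmission()
first = admission.should_cache("sha256:abc", 200, 4, b"data")
second = admission.should_cache("sha256:abc", 200, 4, b"data")
print(first, second)
```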
Viktor Barzin
88abbef7c3 update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip] 2026-03-16 07:10:45 +00:00
Viktor Barzin
708eb69742 fix: update postgresql_host to pg-cluster-rw (old service had no endpoints)
The legacy `postgresql.dbaas` service had no endpoints after CNPG migration,
causing Woodpecker and other stacks to fail DB connections. Changed to
`pg-cluster-rw.dbaas` which points to the CNPG primary.
2026-03-16 07:07:22 +00:00
Viktor Barzin
c7bcd5b8b5 scale up f1-stream and changedetection [ci skip] 2026-03-16 07:06:09 +00:00
Viktor Barzin
6478097e2d fix platform stack: k8s_users.domains and sensitive for_each errors [ci skip]
- Use lookup(user, "domains", []) for missing domains attribute
- Wrap user_domains in nonsensitive() for Cloudflare for_each
2026-03-15 23:36:46 +00:00
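Note on 6478097e2d: the `lookup(user, "domains", [])` fix supplies an empty list when a user entry has no `domains` attribute. A rough Python analogue is shown below. The shape of `k8s_users` (a map of user name to attributes) and the sample values are assumptions for illustration, and `collect_user_domains` is a hypothetical name. Terraform's `nonsensitive()` wrapper has no equivalent here and is not modelled.

```python
from typing import Any, Dict, List


def collect_user_domains(k8s_users: Dict[str, Dict[str, Any]]) -> List[str]:
    """Merge every user's domains, treating a missing "domains" key as empty."""
    domains = set()
    for user in k8s_users.values():
        # dict.get with a default mirrors lookup(user, "domains", []).
        domains.update(user.get("domains", []))
    return sorted(domains)


# Hypothetical sample data: one user with domains, one without the attribute.
sample_users = {
    "anca": {"domains": ["plotting-book"]},
    "other-user": {},
}
print(collect_user_domains(sample_users))
```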
Viktor Barzin
b87ba5e778 update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip] 2026-03-15 23:21:52 +00:00
Viktor Barzin
a9890a1f27 trigger CI: json webhook 2026-03-15 23:17:58 +00:00
Viktor Barzin
0c6681bc76 fix woodpecker sync: single $ in heredoc, alpine image for jq, port 80 not 8000 2026-03-15 23:12:52 +00:00
Viktor Barzin
a04335d0f3 right-size 14 services and scale down GPU-heavy workloads [ci skip]
Memory right-sizing based on VPA upperBound analysis:
- Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi,
  dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi,
  navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi,
  ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi,
  mysql-operator 384→512Mi
- Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi,
  dcgm-exporter 2560→1536Mi
- Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure)

Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved,
all ResourceQuota pressures cleared, GPU health green.
2026-03-15 23:00:49 +00:00
Viktor Barzin
b6d619e5df fix: increase terragrunt-apply step memory to 2Gi
LimitRange defaults containers to 192Mi which is insufficient for
terragrunt apply on the platform stack (48 vault refs, many modules).
Set explicit 1Gi request / 2Gi limit via backend_options.
2026-03-15 22:59:34 +00:00
Viktor Barzin
0c1239030d fix: CI pipeline - disable corrupted cache, add pull before push
- build-cli.yml: comment out cache_from/cache_to to avoid BuildKit
  "short read" errors from corrupted registry cache
- default.yml: add git pull --rebase before push in cleanup-and-push
  to handle remote having newer commits
2026-03-15 22:51:08 +00:00
Viktor Barzin
c8069f53c8 update claude knowledge: final ESO migration state [ci skip] 2026-03-15 22:32:46 +00:00
Viktor Barzin
6c8a42b4e3 add add-user skill for cluster onboarding
Interactive skill that collects user info, updates Vault KV k8s_users,
and applies vault/platform/woodpecker stacks. Includes verification
checklist and auto-generated resource table.
2026-03-15 22:28:54 +00:00
Viktor Barzin
82b1f82a2c add user onboarding and admin instructions to README
Admin section: how to add a new namespace-owner (Authentik group,
Vault KV entry, three terragrunt applies). Includes auto-generated
resource table.

User section: VPN setup, tool install, Vault/kubectl auth, first app
deployment from template, CI/CD pipeline example, useful commands,
and important rules.
2026-03-15 22:25:43 +00:00
Viktor Barzin
cc55249524 fix ollama: remove conditional count on basicAuth (incompatible with ESO data source) 2026-03-15 22:24:36 +00:00
Viktor Barzin
50620e6047 add generic multi-user cluster onboarding system
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.

Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.

Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.

Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.

K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.

New: stacks/_template/ with copyable stack template for namespace-owners.
2026-03-15 22:23:36 +00:00
Viktor Barzin
39b3c51709 migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)

This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.

Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
Viktor Barzin
af3b1b5c90 fix health DB ExternalSecret: use pg-health not postgresql-health role name 2026-03-15 21:53:02 +00:00
root
a186d26ba7 Woodpecker CI deploy commit [CI SKIP] 2026-03-15 21:43:40 +00:00
Viktor Barzin
745e43c983 fix DB password desync + migrate remaining tfvars to Vault
DB desync fix: Stacks with Vault DB engine rotation (24h) now read
the password from vault-database ClusterSecretStore instead of vault-kv.
9 stacks updated with db ExternalSecrets reading from static-creds/*.

Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory,
woodpecker, linkwarden, nextcloud, url.

terraform.tfvars migration:
- plotting-book: google_client_id/secret → Vault KV + secret_key_ref
- tandoor: email_password var removed (was default="", now optional ESO)
- infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password
  → Vault KV at secret/infra + data source
2026-03-15 21:39:45 +00:00