Commit graph

23 commits

Author SHA1 Message Date
Viktor Barzin
a49d1eadf6 workstation: emo direct master push — allow-then-audit [ci skip]
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
  carry the user's plain-language intent; commits land on master
  directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
  message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users

Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:43 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
20738efe4e ci(woodpecker): generate kubeconfig from projected SA token
terragrunt.hcl injects -var kube_config_path=${repo_root}/config for
every terraform invocation, but the pipeline never created that file.
Every commit that touched a TF stack since #545 (2026-05-08) failed
with 'config_path refers to an invalid path: \"../../config\": no such
file or directory' followed by the kubernetes provider falling back
to localhost:80.

Add a step that writes a kubeconfig at <repo>/config using the
projected SA token + cluster CA. The woodpecker namespace's default
SA is already cluster-admin (woodpecker-default ClusterRoleBinding),
so the projected token is sufficient for any stack apply. Using
tokenFile (not an inline token) lets the provider re-read it if
kubelet rotates the projected token mid-pipeline.

#545 was the last green run because that commit only changed the
build-cli pipeline — 0 stacks applied so the missing kubeconfig
never mattered.
2026-05-09 11:26:47 +00:00
Viktor Barzin
3148d15d5a [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml dual-pushes still until next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at line 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
Viktor Barzin
07bc0098e3
ci(woodpecker): show full terraform error on stack apply failure
The default workflow truncated the failed-stack output at `tail -5`,
which only captured the trailing source-line indicator (`│  45: resource
…`) and dropped the actual `Error: …` line above it. Bump to `tail -50`
so the real error is visible without re-running locally to reproduce.

Also fix the pre-warm step's FIRST_STACK detection — `head -1 file1
file2 | head -1` returns the file header (`==> .platform_apply <==`),
not the first stack name, so the cd then fails with "no such file or
directory". Use `cat | head -1` instead.

Pure logging-and-pre-warm change; no stacks touched, so this commit is
a no-op for the apply step.
2026-04-26 18:39:46 +00:00
Viktor Barzin
a05d63eefb [ci] Fix infra pipeline image-pull — drop :5050 from infra-ci image URL
P366-P374 default workflow failed with exit 126 "image can't be pulled" — containerd
hosts.toml has a mirror entry for `registry.viktorbarzin.me` but NOT for
`registry.viktorbarzin.me:5050`, so pulls fell through to direct HTTPS on :5050
(which isn't exposed externally). Convention per infra/.claude/CLAUDE.md is the
no-port form; :5050 was an anomaly introduced by the 2026-04-15 CI perf overhaul.

build-cli/build-ci-image push paths still use :5050 and work fine — they go through
the buildx plugin (pod DNS, not node containerd). Only `image:` fields on a step
hit the broken path. Normalizing push URLs left for a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:00:58 +00:00
Viktor Barzin
a5e097088a [ci] Persist VAULT_TOKEN across Woodpecker step commands
## Context
Follow-up to commit 2eca011c (bd code-e1x). That commit attached the
`terraform-state` policy to the `ci` Vault role and propagated apply-
loop failures so the pipeline actually fails when a stack fails. On
the very first push to exercise it (pipeline 361), the platform apply
step died with:

  [vault] Starting apply...
  state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt
  [vault] FAILED (exit 1)

Root cause: in Woodpecker's `commands:` list, each `- |` item runs in
a fresh shell. The dedicated "Vault auth" command was doing
`export VAULT_TOKEN=...`, but that export was lost by the time the
apply command ran. Tier-0 stacks depended on Vault Transit (via
`scripts/state-sync`), and Tier-1 stacks depend on
`vault read database/static-creds/pg-terraform-state` via `scripts/tg`
— both silently fell through to their "no Vault" error path.

This bug was latent before 2eca011c because the old apply loop
swallowed per-stack exit codes. Now that we surface them, the pipeline
fails honestly — but fails on every run. Fixing the missing token
propagation is the last mile.

## This change
- Pin `VAULT_ADDR` at the step's `environment:` level so every command
  inherits it without an explicit export.
- In the Vault auth command, assert the auth succeeded (non-empty,
  non-"null" token) then write the token to `~/.vault-token` with
  `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all
  fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset.

## What is NOT in this change
- A broader refactor to fold the multi-step chain into a single
  `- |` script — preserving the existing granular structure keeps
  individual step logs grep-friendly and failures localised.
- Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token
  is written, and would need duplicating into each command anyway.

## Test Plan
### Automated
N/A (pure YAML change). Will be verified by the very next CI run —
the push creating this commit.

### Manual Verification
Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose
commit matches this one. Expected:
- `default` workflow exercises the auth + apply steps.
- Platform apply for `vault` stack runs state-sync decrypt → detects
  no drift (I applied locally already) → OK.
- Tier-1 stacks (if any in the diff): `vault read database/static-
  creds/pg-terraform-state` returns creds → apply runs.
- No "state-sync: ERROR" or "Cannot read PG credentials" errors.
- `default` workflow state: success.
- Overall pipeline status: still failure because `build-cli` is
  independently broken (bd code-12b); that's cosmetic.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
Viktor Barzin
2eca011cc3 [ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow succeed — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.

Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.

Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.

## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.

## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]
  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...]    # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.

### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
Viktor Barzin
89af09852f feat(ci): add Vault advisory locks to CI terraform applies
CI now uses scripts/tg instead of raw terragrunt apply, acquiring the
same per-stack Vault KV lock that user sessions use. This prevents CI
from overwriting in-flight user applies.

Changes:
- Switch from xargs -P 4 (parallel) to serial while-read loop
- CI skips stacks locked by users instead of racing them
- Git rebase failures now exit 1 instead of silently continuing
- Updated header comments to reflect new locking behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:53:00 +00:00
Viktor Barzin
601a83d84e fix: CI pipeline image pull auth + shallow clone resilience [ci skip]
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
  pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
  clone with depth=1): fetch more history, or apply all platform
  stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
  couldn't be pulled (no imagePullSecrets on step pods)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:41:08 +00:00
Viktor Barzin
36454b87d1 feat: CI/CD performance overhaul
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
  git-crypt, sops, kubectl pre-installed. Pushed to private registry.
  Eliminates 17 apk add calls + binary downloads per pipeline run.

- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
  Changed-stacks-only detection (git diff, with global-file fallback).
  Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
  Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).

- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
  lock detection. Blocks concurrent applies to same stack.

- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.

- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
  Runs terraform plan on all stacks, Slack alert on drift.

- CI image build pipeline (.woodpecker/build-ci-image.yml).

Expected speedup: ~5-10 min per pipeline run → ~2-4 min.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:22:26 +00:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00
Viktor Barzin
ae36dc253b extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-17 21:34:11 +00:00
Viktor Barzin
3c804aedf8 extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
2026-03-17 18:11:53 +00:00
Viktor Barzin
b6d619e5df fix: increase terragrunt-apply step memory to 2Gi
LimitRange defaults containers to 192Mi which is insufficient for
terragrunt apply on the platform stack (48 vault refs, many modules).
Set explicit 1Gi request / 2Gi limit via backend_options.
2026-03-15 22:59:34 +00:00
Viktor Barzin
0c1239030d fix: CI pipeline - disable corrupted cache, add pull before push
- build-cli.yml: comment out cache_from/cache_to to avoid BuildKit
  "short read" errors from corrupted registry cache
- default.yml: add git pull --rebase before push in cleanup-and-push
  to handle remote having newer commits
2026-03-15 22:51:08 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
be47592e08 fix: remove deprecated secrets field from slack step 2026-02-28 18:32:10 +00:00
Viktor Barzin
4b6ade7b08 fix: replace removed woodpeckerci/plugin-slack with curl-based webhook 2026-02-28 18:25:23 +00:00
Viktor Barzin
ebecaaee5c Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP]
- Switch from custom clone override to woodpeckerci/plugin-git built-in clone
  (handles auth automatically via netrc from GitHub OAuth token)
- Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense
  (fixes intermittent DNS timeouts causing clone failures)
- Fix missing comma after heredoc in audit-policy.tf (syntax error)
2026-02-23 00:08:42 +00:00
Viktor Barzin
cbf041bcc9 [ci skip] Add Woodpecker CI stack (WIP) and claude agents
- Add stacks/woodpecker/ with Helm-based deployment config
- Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls)
- Add NFS export entry for woodpecker
- Add .claude/agents/ definitions
2026-02-22 21:30:25 +00:00