Commit graph

13 commits

Author SHA1 Message Date
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. The image is built manually from the
stack's docker/ dir (rebuilds are rare; the off-infra-CI rule targets
repeated builds). The first infra ADR records the trade-offs (devvm, VM,
redroid and budtmo rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
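
A minimal sketch of the empty-source guard plus the delete cap, assuming the
guard simply refuses to sync when the Drive listing comes back empty. The
remote name `gdrive:claude`, the `/content` mount path and the function name
are hypothetical; only `rclone sync` and `--max-delete 25` come from the
description above.

```python
# Hypothetical names: the "gdrive:claude" remote and the /content mount are
# placeholders for illustration, not values taken from the stack.
SOURCE = "gdrive:claude"
DEST = "/content"
MAX_DELETE = 25  # the --max-delete cap named in the commit message


def build_sync_command(source_listing):
    """Return the rclone argv to run, or None when the source looks empty."""
    # Empty-source guard: an empty listing may mean an auth or API failure,
    # and syncing it would try to delete everything on the destination.
    if not any(name.strip() for name in source_listing):
        return None
    return ["rclone", "sync", SOURCE, DEST, "--max-delete", str(MAX_DELETE)]
```

With a non-empty listing the function returns the argv to execute; with an
empty one the job would exit without touching the PVC.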

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
a42f4f7b26 trek: trial-deploy TREK group-trip planner behind Authentik (solo eval)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.

- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
  service 80->3000, ingress_factory auth=required + proxied DNS at
  trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
  uploads) -- encrypted per the sensitive-data rule and to avoid the
  SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
  bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
  in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
  policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
  Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
  companion deferred per solo-trial scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:30:07 +00:00
Viktor Barzin
f325b949be keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Re-enables Keel after the 2026-05-26 emergency stop, with a safer default.

Switch Kyverno-injected default from `force + match-tag=true` (proven
unreliable — it rewrote tag strings cluster-wide despite the design intent)
to `patch`, which is semver-parser-bounded:

  - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never
    1.3.x or 2.x — the parser does the math, not string compare).
  - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are
    IGNORED entirely. No tag rewriting under any code path.
  - 151 stale `force` annotations migrated to `patch` cluster-wide
    during this apply (anchor `+()` dropped, then re-added).
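
The bounding rule above can be sketched as follows. This is an illustration
of the described behaviour, not Keel's actual parser: the regex, the function
names and the strict three-part requirement are assumptions made for the
example.

```python
import re

# Assumption: only strict MAJOR.MINOR.PATCH tags (optional leading "v") count
# as semver here; anything else is ignored, as the commit describes.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag):
    """Return (major, minor, patch) as ints, or None for non-semver tags."""
    m = _SEMVER.match(tag)
    return tuple(int(part) for part in m.groups()) if m else None


def patch_candidate(current, available):
    """Pick the highest patch bump within the current major.minor, if any."""
    cur = parse(current)
    if cur is None:
        return None  # non-semver current tag: never rewrite it
    best = None
    for tag in available:
        ver = parse(tag)
        # Skip non-semver tags, other major.minor lines, and downgrades.
        if ver is None or ver[:2] != cur[:2] or ver <= cur:
            continue
        if best is None or ver > parse(best):
            best = tag
    return best
```

Comparing integer tuples rather than strings is what keeps `1.2.10` ahead of
`1.2.4`.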

Live state after this commit:
  0 workloads on `force`, 209 on `patch`, 22 on `never`.
  Keel deployment back to 1/1 on `:0.21.1`.

Note: 22 workloads carrying the `keel.sh/policy=never` LABEL had their
annotation mutated to `patch` during the migration despite Kyverno's
matchLabels-based exclude rule. This appears to be a quirk of
`mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched
all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then
restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno
reconciles preserve them.

Also fixes the CI build-cli workflow, which had been blocked by
`deny-privileged-containers` since the wave-1 Enforce flip on 2026-05-18:
the woodpecker namespace is now in the shared
security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run
privileged docker builds, a legitimate use).

The `default` workflow (terragrunt apply) was already passing — only the
parallel `build-cli` workflow (which builds the infra-cli docker image) was
failing, but it took the overall pipeline status down with it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 19:06:51 +00:00
Viktor Barzin
68a503e29f kyverno: allowlist woodpeckerci/* for CI step pods
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).

Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).

Closes: code-da4h
2026-05-23 08:52:48 +00:00
2d35d72a53 kyverno(wave1): add 7 missing registries to trusted-registries allowlist
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.

Added entries:
- amruthpillai/*       (resume — reactive-resume)
- athomasson2/*        (ebook2audiobook)
- netboxcommunity/*    (netbox)
- nousresearch/*       (hermes-agent)
- opentripplanner/*    (osm-routing)
- rhasspy/*            (whisper, piper)
- registry.viktorbarzin.me/*  (legacy private registry — council-complaints
                                still references; should migrate to forgejo)

The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.

## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
  same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now PASSES:
  - netboxcommunity/netbox:v4.5.0-beta1 ✓
  - rhasspy/wyoming-whisper:3.1.0 ✓
  - registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode

## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:17:16 +00:00
66ca8b9e9c trading-bot: revive K8s stack + add meet-kevin-watcher
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.

Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
  root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
  news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
  viktorbarzin/trading-bot-service:latest, command
  python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
  TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
  secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
  (poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices

vault: re-enable pg-trading static role

- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
  (was disabled 2026-04-06 with the rest of trading-bot stack)

kyverno: add postgres* to trusted-registries allowlist

- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
  alpine*, nginx*, python* which were already there)

Final workers Pod containers (in order):
  [0] signal-generator
  [1] learning-engine
  [2] market-data
  [3] meet-kevin-watcher (NEW)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:23:30 +00:00
669ba97078 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.
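
For illustration, the allowlist check can be expressed with CIDR membership
instead of the source-IP regex the rules actually use. This sketch swaps in
Python's `ipaddress` module; reading "100.64-127 tailnet" as `100.64.0.0/10`
is an assumption, and the function name is hypothetical.

```python
import ipaddress

# Wave-1 allowlist as listed above. 100.64.0.0/10 is an assumed reading of
# "100.64-127 tailnet" (it spans 100.64.0.0 through 100.127.255.255).
ALLOWED_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.20.0/22",    # LAN
        "192.168.1.0/24",  # LAN
        "10.10.0.0/16",    # pod CIDR
        "10.96.0.0/12",    # service CIDR
        "100.64.0.0/10",   # tailnet (assumed CIDR form)
    )
]


def is_allowed_source(ip):
    """True when the source IP falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_NETS)
```

An alert rule would fire on the negation, i.e. when `is_allowed_source`
returns False for the audited request's source IP.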

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)
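
The three dry-run outcomes, and the reason the `*/*` catch-all had to go, can
be reproduced with a small glob check. This is a sketch using Python's
`fnmatch` as a stand-in for Kyverno's wildcard matching (an assumption that
holds for these simple patterns); the two-entry list is an illustrative
subset of the real allowlist.

```python
from fnmatch import fnmatchcase

# Illustrative subset only; the real allowlist has 15 explicit registries,
# 6 DockerHub library bare names and 56 DockerHub user repos.
ALLOWED_PATTERNS = ["docker.io/*", "alpine*"]


def is_allowed(image, patterns=ALLOWED_PATTERNS):
    """True when the image reference matches at least one allowlist glob."""
    return any(fnmatchcase(image, pattern) for pattern in patterns)
```

Passing `["*/*"]` as the pattern list admits `evilcorp.example/malware:v1`,
which is why Enforce was a no-op while the catch-all was present.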

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
90e074a4a2 kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.

## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
  stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
  declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
  with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
  kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
  because the kubectl provider couldn't patch it cleanly (resourceVersion=0
  invalid for update — gotcha when adopting a resource previously
  kubernetes_manifest-owned).
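
A small sketch of how the import IDs and import blocks above could be
generated. The `apiVersion//Kind//name` format is taken from the text; the
helper names and the Terraform address used in the usage note are
hypothetical.

```python
def import_id(api_version, kind, name):
    """Build the apiVersion//Kind//name ID described above."""
    return "//".join((api_version, kind, name))


def render_import_block(tf_address, api_version, kind, name):
    """Render a TF 1.5+ import block mapping tf_address to a live resource."""
    return (
        "import {\n"
        f"  to = {tf_address}\n"
        f'  id = "{import_id(api_version, kind, name)}"\n'
        "}\n"
    )
```

For example, `render_import_block("kubectl_manifest.example", "kyverno.io/v1",
"ClusterPolicy", "deny-privileged-containers")` yields a block whose id is
`kyverno.io/v1//ClusterPolicy//deny-privileged-containers`.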

## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.

## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)

## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.

## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-e2dp
2026-05-18 20:10:27 +00:00
f30c141270 security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash)
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
  Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
  vault audit log entry shows Traefik's pod IP instead of the real client IP —
  the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
  real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
  for_each unknown issues unrelated to this change; -target documented in error
  message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
  update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
  within ~10s each. Verified XFF config visible in pod's
  /vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
  /vault/audit/vault-audit.log) — no change needed.

## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
  from memory id=1970 + `frigate, kured, default, changedetection` discovered
  during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
  pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
  deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
  chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
  (current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
  would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.

**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.

## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:26:39 +00:00
Viktor Barzin
ae36dc253b extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-17 21:34:11 +00:00