Adds the `claude_oauth_token` Vault entries to the secrets table; a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use; the Ink 300-col PTY gotcha from today's harvest;
and the monitoring/rotation playbook for the new expiry alerts.
Follow-up to 8a054752 and 50dea8f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These files are regenerated by Terragrunt on every run and have a
"# Generated by Terragrunt. Sig: ..." header. Earlier today multiple parallel
agents working on bd-w97 accidentally staged them, requiring two corrective
commits (3e11bd1b, 4eb68d6b). This prevents a recurrence at the source.
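A sketch of the likely mechanism — the message only says "at the source",
so a `.gitignore` rule is an assumption; the paths match the files cleaned
up in 3e11bd1b / 4eb68d6b:

cat >> .gitignore <<'EOF'
stacks/*/cloudflare_provider.tf
stacks/*/tiers.tf
EOF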
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous commit (c0ac24a5, [meshcentral] Import existing cluster
state + PVC) unintentionally committed two Terragrunt-generated
provider/locals files. These are auto-generated on every plan/apply
(marked 'Generated by Terragrunt. Sig:') and do not belong in the
repo. Mirrors 3e11bd1b which did the same cleanup for kyverno.
Removes from tracking only — files remain on disk so concurrent work
is unaffected.
Updates: code-w97
Imported the two proxmox-lvm-encrypted PVCs into the Tier 1 PG state.
All other declared resources (namespace, deployment, service, ingress,
NFS-backed PV/PVC, tls secret) were already state-managed.
Imported:
- kubernetes_persistent_volume_claim.data_encrypted
(meshcentral/meshcentral-data-encrypted, proxmox-lvm-encrypted, 1Gi)
- kubernetes_persistent_volume_claim.files_encrypted
(meshcentral/meshcentral-files-encrypted, proxmox-lvm-encrypted, 1Gi)
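For reference, a sketch of the equivalent import commands (assuming the
`scripts/tg import` wrapper mentioned in the grampsweb commit passes
straight through to `terragrunt import`; K8s PVC import IDs are
"namespace/name"):

scripts/tg import kubernetes_persistent_volume_claim.data_encrypted \
  meshcentral/meshcentral-data-encrypted
scripts/tg import kubernetes_persistent_volume_claim.files_encrypted \
  meshcentral/meshcentral-files-encrypted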
Pre-import plan: 2 to add, 3 to change, 0 to destroy
Post-import plan: 0 to add, 5 to change, 0 to destroy (benign drift)
Apply: 0 added, 5 changed, 0 destroyed
Benign drift reconciled on apply:
- PVC wait_until_bound attribute aligned (true -> false)
- tls-secret Kyverno sync labels cleared
- deployment/namespace annotation drift
Source reconciliation: none required. Both declared PVCs already match
the cluster (proxmox-lvm-encrypted, 1Gi, RWO, names identical). NFS
PV/PVC meshcentral-backups-host (nfs-truenas, 10Gi, RWX) remained
bound throughout. Deployment kept 1/1 replicas on the same pod
(meshcentral-6c4f47c6f8-mj8sk).
Commits the auto-generated cloudflare_provider.tf and tiers.tf so the
stack matches the repo convention used by its peers.
Updates: code-w97
My previous commit (dacf3d9e, [kyverno] Import existing cluster state)
unintentionally picked up two Terragrunt-generated provider/locals
files from the meshcentral stack that a parallel worker had just
created. These are auto-generated on every plan/apply (marked
"Generated by Terragrunt. Sig:") and do not belong in the repo.
Removes from tracking only — files remain on disk so concurrent work
is unaffected.
Files removed:
- stacks/meshcentral/cloudflare_provider.tf
- stacks/meshcentral/tiers.tf
No impact on the kyverno import work. State-level changes from
dacf3d9e (3 imports + 3 in-place updates) stand.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All resources were already present in the Tier 1 PG state — no imports
required. The travel_blog stack has no PVC (content is baked into the
Docker image, deployed via Woodpecker with a 1.4GB build context).
Pre-apply plan: 0 to add, 4 to change, 0 to destroy
Apply: 0 added, 4 changed, 0 destroyed
Post-apply plan: 0 to add, 3 to change, 0 to destroy (persistent benign drift)
Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode=off label removed
- Ingress external-monitor=false annotation removed (now auto-managed
by ingress_factory dns_type)
- TLS secret Kyverno sync labels removed
Post-apply drift (persists via external controllers, out of scope):
- Kyverno re-injects ndots:2 dns_config and sync-tls-secret labels
- Goldilocks re-adds vpa-update-mode label
(tracked separately — future work to add lifecycle ignore_changes)
Image tag viktorbarzin/travel_blog:latest unchanged — TF matches cluster.
Deployment remains at replicas=0 (intentional, per source comment:
"Scaled down — clears ExternalAccessDivergence alert"). Site is
intentionally offline.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imported 3 missing cluster resources into the Tier 1 PG state for the
kyverno stack. The Helm release, 6 PriorityClasses, 14 ClusterPolicies,
both Secrets (registry-credentials, tls-secret), and all prior RBAC
resources were already managed in state. The strip-cpu-limits
ClusterPolicy (commit 1de2ee30, 56m prior to this import) was already
in state from its targeted apply.
Resources imported:
- module.kyverno.kubernetes_cluster_role_v1.kyverno_cleanup_pods
(kyverno:cleanup-controller:pods — RBAC for ClusterCleanupPolicy)
- module.kyverno.kubernetes_cluster_role_binding_v1.kyverno_cleanup_pods
(kyverno:cleanup-controller:pods — binding to cleanup-controller SA)
- module.kyverno.kubernetes_manifest.cleanup_failed_pods
(apiVersion=kyverno.io/v2,kind=ClusterCleanupPolicy,name=cleanup-failed-pods)
All three originated from commit cf578516 (auto-cleanup failed/evicted
pods), which added the declarations but apparently never made it into
PG state before the global state reorg.
Pre-import plan: 3 to add, 2 to change, 0 to destroy
Post-import plan: 0 to add, 3 to change, 0 to destroy (benign)
Apply: 0 added, 3 changed, 0 destroyed
Benign drift reconciled on apply:
- cleanup_failed_pods manifest field populated in state post-import
(annotations re-applied, no spec change)
- registry_credentials + tls_secret: null `generate.kyverno.io/clone-source`
label dropped from Terraform metadata (no K8s object change — the label
was only `null` in state, never existed on the live Secret)
Safety checks — all clean:
- ClusterPolicy count: 16 (unchanged, 14 owned here + 1 external
goldilocks-vpa-auto-mode + strip-cpu-limits); all status=Ready=True
- ClusterCleanupPolicy cleanup-failed-pods: intact, schedule 15 * * * *
- helm_release.kyverno: no diff (revision unchanged)
- Mutating/validating webhook configurations: 3 + 7 intact
- All 4 Kyverno Deployments Running (admission x2, background, cleanup, reports)
Kyverno failurePolicy stays Ignore (forceFailurePolicyIgnore=true) so
admission degrades open if ever unavailable.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imported both resources for the pvc-autoresizer stack into the Tier 1 PG
state. The stack was previously unmanaged — cluster had the running
controller from a prior manual helm install (rev 1, 2026-04-03).
Resources imported:
- module.pvc_autoresizer.kubernetes_namespace.pvc_autoresizer (pvc-autoresizer)
- module.pvc_autoresizer.helm_release.pvc_autoresizer (pvc-autoresizer/pvc-autoresizer)
Pre-import plan: 2 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 2 to change, 0 to destroy (benign drift)
Apply: 0 added, 2 changed, 0 destroyed
Benign drift reconciled on apply:
- Namespace goldilocks.fairwinds.com/vpa-update-mode=off label removed
(Kyverno ClusterPolicy goldilocks-vpa-auto-mode re-adds it immediately)
- Helm release metadata refresh only (atomic read-back, revision 1 -> 2;
chart pvc-autoresizer-0.17.0 and app 0.20.0 unchanged — no upgrade)
Controller pods pvc-autoresizer-controller-7dcc745f68-57bk6 and -n4bh9
stayed Running throughout (restart counts unchanged: 17 and 1, both
pre-existing from pre-apply state). No PVCs entered non-Bound state.
Updates: code-w97
Imported all 9 cluster resources into the Tier 1 PG state. Stack was
previously unmanaged — source was fully declared in main.tf but state
was empty.
Pre-import plan: 9 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 9 to change, 0 to destroy
Apply: 0 added, 9 changed, 0 destroyed
Resources imported:
- kubernetes_namespace.tor-proxy
- kubernetes_deployment.tor-proxy
- kubernetes_deployment.torrserver
- kubernetes_service.tor-proxy
- kubernetes_service.torrserver
- kubernetes_service.torrserver-bt (LoadBalancer, IP 10.0.20.200)
- kubernetes_persistent_volume_claim.torrserver_data_proxmox
- module.tls_secret.kubernetes_secret.tls_secret
- module.torrserver_ingress.kubernetes_ingress_v1.proxied-ingress
Service pods tor-proxy-7fb4644dd8-npdwg and torrserver-7788ff4c4d-jnh85
stayed Running throughout. Tor circuit preserved — no deployment restarts.
Updates: code-w97
## Context
The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.
A cron rotator doesn't make sense here (tokens don't refresh, they
just expire) so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway —
monitor covers all three.
## This change
**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.
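For illustration, one push cycle boils down to a text-exposition POST —
a minimal sketch assuming the standard Pushgateway API (job name, mint
epoch, and the TTL arithmetic are illustrative):

```
MINT_EPOCH=1776500000                      # from the ConfigMap entry
EXPIRY=$((MINT_EPOCH + 365*24*3600))       # shared 365d TTL local
cat <<EOF | curl --data-binary @- \
  http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/claude-oauth-expiry-monitor
claude_oauth_token_expiry_timestamp{path="primary"} $EXPIRY
claude_oauth_expiry_monitor_last_push_timestamp $(date +%s)
EOF
```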
**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).
**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon` — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical` — <7d, critical, for 10m
- `ClaudeOAuthTokenMonitorStale` — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning
Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).
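The rule YAML lives in `prometheus_chart_values.tpl`; to spot-check the
warning threshold ad hoc (expression shape inferred from the rule names
above; Prometheus URL assumed):

```
promtool query instant http://prometheus.monitoring:9090 \
  '(claude_oauth_token_expiry_timestamp - time()) / 86400 < 30'
# empty result = no token within 30d of expiry
```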
## Verification
```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09
# PromQL: (claude_oauth_token_expiry_timestamp - time()) / 86400
primary: 365.2 days
spare-1: 365.2 days
spare-2: 365.2 days
```
## Rotation playbook (future)
1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
(or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
`stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (next cron tick) plus the 1h `for:` duration
   on `ClaudeOAuthTokenExpiringSoon`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).
The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.
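A sketch of the resulting flow, assuming `claude setup-token` prints the
minted token to stdout after its one-time interactive auth:

```
token="$(claude setup-token)"        # one-time, interactive
vault kv patch secret/claude-agent-service claude_oauth_token="$token"
export CLAUDE_CODE_OAUTH_TOKEN="$token"
claude -p 'ok?'                      # headless — no .credentials.json needed
```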
## This change
Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.
The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.
Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0, agent correctly identified no
matching stack and aborted without changes. No API auth errors.
## Spares
Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).
## What is NOT in this change
- No removal of `.credentials.json` mount or its Vault source (keep
as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
rotation, alerted on from Vault metadata. If we add rotation, we'll
source from the spares pool rather than minting new tokens.
## Reproduce locally
```
1. vault login -method=oidc
2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
3. cd stacks/claude-agent-service && ../../scripts/tg apply
4. kubectl -n claude-agent exec deploy/claude-agent-service -- \
printenv CLAUDE_CODE_OAUTH_TOKEN # should be 108 chars
5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.
## This change
Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.
- `local.internal_monitors` — list of desired monitors (name, type,
connection string, Vault password key, interval, retries). Seeded with
the MySQL Standalone monitor. Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls admin password and all
referenced DB passwords from Vault `secret/viktor` at apply time. Secret
key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
looks up monitors by name, creates if missing, patches if drifted,
leaves id and history untouched when already in desired state.
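To exercise the loop ahead of the 10-minute schedule, the same
manual-trigger pattern as `external-monitor-sync` should work (CronJob
name and namespace assumed):

```
kubectl -n uptime-kuma create job --from=cronjob/internal-monitor-sync \
  internal-monitor-sync-manual
kubectl -n uptime-kuma logs -f job/internal-monitor-sync-manual
# expect: "Monitor MySQL Standalone (dbaas) (id=663) already in desired state"
```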
## Why this approach (Option B, not a Terraform provider)
The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.
## Monitor 663: no-op on first sync
Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:
Monitor MySQL Standalone (dbaas) (id=663) already in desired state
Internal monitor sync complete
DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.
## Vault key — left manual (by design)
`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The sync job reads the existing value at apply time — so if the
value is ever rotated in Vault, the next sync picks it up.
## Plan + apply
Plan: 3 to add, 0 to change, 0 to destroy.
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
Re-plan: No changes. Your infrastructure matches the configuration.
Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.
Closes: code-ed2
## Context
During a false-alarm investigation of terminal.viktorbarzin.me, an Explore
agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in
config.tfvars (a legacy fallback list) instead of the ingress_factory
auto-annotation. Both [External] monitors for terminal/terminal-ro exist and
are active — the original agent just looked in the wrong place.
## This change
Expands the Monitoring & Alerting bullet to spell out the mechanism:
ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when
dns_type != "none", and cloudflare_proxied_names is a legacy fallback for
the 17 hostnames not yet migrated. Future agents debugging "is this
monitored?" questions should not check cloudflare_proxied_names.
## What is NOT in this change
No Terraform, no K8s, no service config. Docs only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Grampsweb stack had an empty Terraform state — 7 K8s resources (namespace,
PVC, service, deployment, ingress, ExternalSecret manifest, TLS secret)
existed in the cluster but weren't tracked. This blocked commit 7b248897
(ollama LLM env-var removal) from being applied because any apply would
attempt to re-create existing resources.
Additionally, the TF source declared a **grampsweb-data-proxmox** PVC on
**storage_class=proxmox-lvm**, while the cluster had **grampsweb-data-encrypted**
on **proxmox-lvm-encrypted** (1 Gi, bound). The deployment was referencing
the encrypted PVC. This divergence predated this change — the source was
simply out of date vs cluster reality.
## This change
Two things:
1. **Source alignment** (the only file diff):
- Renames `kubernetes_persistent_volume_claim.data_proxmox` →
`data_encrypted`, metadata.name to match cluster, storage class to
`proxmox-lvm-encrypted`.
- Updates the deployment volume `claim_name` reference accordingly.
- Aligns with the newer project convention documented in
`.claude/CLAUDE.md`: "Default for sensitive data is
proxmox-lvm-encrypted" and "Convention: PVC names end in `-encrypted`".
- No destroy/recreate: the PVC and deployment already use the encrypted
PVC in the cluster; TF source now just describes reality.
2. **State imports** (out-of-band, via `scripts/tg import`, not in diff):
- `kubernetes_namespace.grampsweb` <- `grampsweb`
- `kubernetes_persistent_volume_claim.data_encrypted` <- `grampsweb/grampsweb-data-encrypted`
- `kubernetes_service.grampsweb` <- `grampsweb/grampsweb`
- `kubernetes_deployment.grampsweb` <- `grampsweb/grampsweb`
- `module.ingress.kubernetes_ingress_v1.proxied-ingress` <- `grampsweb/family`
- `module.tls_secret.kubernetes_secret.tls_secret` <- `grampsweb/tls-secret`
- `kubernetes_manifest.external_secret` <- `apiVersion=external-secrets.io/v1beta1,kind=ExternalSecret,namespace=grampsweb,name=grampsweb-secrets`
## Apply result
`Apply complete! Resources: 0 added, 7 changed, 0 destroyed.`
In-place updates applied:
- Deployment: dropped `GRAMPSWEB_LLM_BASE_URL` + `GRAMPSWEB_LLM_MODEL` env
vars (both containers) — realising the intent of commit 7b248897.
- Ingress: realigned Traefik middleware annotation + cleaned stale
`uptime.viktorbarzin.me/external-monitor=false` annotation.
- TLS secret: removed Kyverno-generated labels (Kyverno's
`sync-tls-secret` ClusterPolicy re-applies them on next reconcile —
no functional impact; same pattern in 29 other stacks using
`setup_tls_secret` module).
- Namespace, PVC, service: trivial metadata alignments (label /
`wait_until_bound` / `wait_for_load_balancer`).
- `kubernetes_manifest.external_secret`: populated the `manifest`
attribute after import (expected).
## What is NOT in this change
- No replica bump: deployment stays at `replicas=0` (stack is intentionally
inactive per 2026-03-14 OOM incident note).
- No destroy/recreate of any resource.
- The broader code-w97 (11 stacks with empty state) is NOT closed — only
grampsweb is imported. 10 stacks remain: beads-server, insta2spotify,
isponsorblocktv, kyverno, meshcentral, pvc-autoresizer, shadowsocks,
tor-proxy, travel_blog, + meshcentral PVC.
## Reproduce locally
```
KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# Deployment still replicas=0; PVC grampsweb-data-encrypted Bound; ingress 'family'
# on family.viktorbarzin.me; ExternalSecret SecretSynced True.
cd /home/wizard/code/infra/stacks/grampsweb
/home/wizard/code/infra/scripts/tg plan
# Expected: 'No changes.' (clean state after apply).
```
## Test Plan
### Automated
```
$ cd /home/wizard/code/infra/stacks/grampsweb && /home/wizard/code/infra/scripts/tg plan
Plan: 0 to add, 7 to change, 0 to destroy. [pre-apply]
$ /home/wizard/code/infra/scripts/tg apply --non-interactive
Plan: 0 to add, 7 to change, 0 to destroy.
kubernetes_namespace.grampsweb: Modifications complete after 0s [id=grampsweb]
kubernetes_persistent_volume_claim.data_encrypted: Modifications complete after 0s [id=grampsweb/grampsweb-data-encrypted]
kubernetes_service.grampsweb: Modifications complete after 0s [id=grampsweb/grampsweb]
module.ingress.kubernetes_ingress_v1.proxied-ingress: Modifications complete after 0s [id=grampsweb/family]
module.tls_secret.kubernetes_secret.tls_secret: Modifications complete after 0s [id=grampsweb/tls-secret]
kubernetes_manifest.external_secret: Modifications complete after 0s
kubernetes_deployment.grampsweb: Modifications complete after 1s [id=grampsweb/grampsweb]
Apply complete! Resources: 0 added, 7 changed, 0 destroyed.
$ terraform fmt -check -recursive stacks/grampsweb
(no output - formatted clean)
```
### Manual Verification
```
$ KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# - deployment.apps/grampsweb 0/0 0 0 47d (replicas=0 preserved)
# - service/grampsweb ClusterIP 10.106.232.205:80/TCP
# - persistentvolumeclaim/grampsweb-data-encrypted Bound pvc-c9a5dcf4... 1Gi RWO proxmox-lvm-encrypted
# - ingress/family traefik family.viktorbarzin.me -> 10.0.20.200:80,443
# - externalsecret/grampsweb-secrets vault-kv 15m SecretSynced True
# - secret/tls-secret kubernetes.io/tls
# No pod crashes (no pods — replicas=0).
```
Closes: code-8m6
Context
-------
The cluster policy is "no CPU limits anywhere" — CFS throttling causes
more harm than good for bursty single-threaded workloads (Node.js,
Python). LimitRanges are already correct (defaultRequest.cpu only, no
default.cpu), but 22 pods still carried CPU limits injected by upstream
Helm chart defaults — CrowdSec (lapi + agents), descheduler,
kubernetes-dashboard (×4), nvidia gpu-operator.
Previous attempts were ad-hoc: patch each values.yaml, occasionally
missing things on chart upgrade. This replaces that with a declarative
Kyverno mutation at admission time.
This change
-----------
Adds a new ClusterPolicy `strip-cpu-limits` with two foreach rules:
strip-container-cpu-limit → containers[]
strip-initcontainer-cpu-limit → initContainers[]
Each rule uses `patchesJson6902` with an `op: remove` on
`resources/limits/cpu`. JSON6902 `remove` fails on missing paths, so
per-element preconditions gate the mutation — pods without CPU limits
pass through untouched. A top-level rule precondition short-circuits
using JMESPath filter (`[?resources.limits.cpu != null] | length(@) > 0`)
so the mutation is a no-op for the overwhelming majority of pods.
Admission-time only. No `mutateExistingOnPolicyUpdate`, no `background`.
Existing pods keep their CPU limits until they're restarted naturally
(Helm upgrade, node drain, rollout). We rely on churn, not forced
restarts, to avoid unnecessary thrash.
Memory limits are preserved — they prevent OOM, still useful.
Flow
----
admission request → match Pod + CREATE
→ top-level precondition: any container has limits.cpu?
no → skip (fast path)
yes → foreach container:
element.limits.cpu present?
no → skip element
yes → remove /spec/containers/N/resources/limits/cpu
→ same again for initContainers
→ mutated pod proceeds to API server
Verification
------------
kubectl run test-strip-cpu --overrides='{limits:{cpu:500m,memory:64Mi}}'
→ admitted pod.resources = {limits:{memory:64Mi}, requests:{cpu:50m,memory:32Mi}}
→ CPU limit stripped, memory preserved, requests untouched
kubectl rollout restart deploy/kubernetes-dashboard-metrics-scraper
→ new pod.resources = {limits:{memory:400Mi}, requests:{cpu:100m,memory:200Mi}}
→ cluster-wide count of pods with CPU limits: 22 → 21
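For the record, a fully-expanded, valid-JSON form of the admission test
above (the inline overrides are shorthand; pod name and image are
illustrative):

kubectl run test-strip-cpu --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"test-strip-cpu","image":"busybox","command":["sleep","60"],"resources":{"limits":{"cpu":"500m","memory":"64Mi"},"requests":{"cpu":"50m","memory":"32Mi"}}}]}}'
kubectl get pod test-strip-cpu -o jsonpath='{.spec.containers[0].resources}'
  → limits has only memory; the cpu key was removed at admission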
Rollout
-------
Remaining 21 pods will drop their CPU limits on natural churn. No manual
restarts in this change — user may want to time a mass restart with a
maintenance window.
Closes: code-eaf
Closes: code-4bz
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Final stage (9) of ollama decommission. After the stack was destroyed in
commit 0386f03f, several residual references remained:
- Vault KV `secret/ollama` (metadata + versions)
- `secrets/nfs_directories.txt` line listing `ollama` as a backup target
- `stacks/dashy/conf.yml` — "Ollama" tile linking to `ollama.viktorbarzin.me`
- `stacks/homepage/INGRESS_WIDGET_MAPPING.md` — 3 rows documenting the
now-removed ingresses (ollama, ollama-api, ollama-server)
## This change
- `vault kv metadata delete secret/ollama` → all versions + metadata deleted.
- `secrets/nfs_directories.txt`: removed the `ollama` entry (line 71).
- `stacks/dashy/conf.yml`: removed the Ollama tile (`&ref_42`) and its
reference at the end of the list; applied via Terragrunt so the running
dashy ConfigMap picks up the change. Dashy apply: 0 added, 4 changed, 0
destroyed (the ConfigMap diff plus the usual benign Kyverno drift).
- `stacks/homepage/INGRESS_WIDGET_MAPPING.md`: removed the 3 ollama rows.
## What was considered but NOT changed
- `stacks/ytdlp/yt-highlights/app/main.py`: `OLLAMA_URL = os.getenv("OLLAMA_URL", "")`
already falls back to empty string when unset; the env var is no longer
injected (stage 3) so this path is dead at runtime. Leaving source alone
to keep this commit scoped to infra-only cleanup — future app-level
cleanup can remove the dead fallback code.
- `stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts`:
only mentions `var.ollama_host` in a documentation string inside a
system-prompt template — non-functional. Will fix in a separate commit
alongside the k8s-portal agent docs pass.
## Test plan
### Automated
- `vault kv get secret/ollama` → "No value found" (confirmed after delete).
- `scripts/tg apply` on dashy → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."
- `grep -n ollama secrets/nfs_directories.txt` → empty.
### Manual Verification
1. Open `https://dashy.viktorbarzin.me/` → Ollama tile is gone.
2. `kubectl get cm -n dashy dashy-config -o yaml | grep -i ollama` → no matches.
3. `vault kv get secret/ollama` → error "No value found at secret/data/ollama".
4. On PVE host: `rm -rf /srv/nfs-ssd/ollama` (optional — I skipped the
on-host disk cleanup; it's a manual ops step the user can run when
comfortable).
Closes: code-1gu
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 8 of ollama decommission. With the ollama-tcp Traefik entrypoint and
IngressRouteTCP removed (stages 1-2), all downstream consumers re-routed or
cleaned (stages 3-6), and the root tfvar dropped (stage 7), the ollama stack
has no live consumers and can be destroyed.
## This change
- `terragrunt destroy -auto-approve` on stacks/ollama.
- Result: `Destroy complete! Resources: 18 destroyed.`
- 1 namespace (ollama)
- 2 deployments (ollama, ollama-ui)
- 2 services (ollama, ollama-ui)
- 3 ingresses (ollama, ollama-server, ollama-api) + 3 Cloudflare DNS
records (proxied ollama, non-proxied A + AAAA for ollama-api)
- 2 PVCs (ollama-data-host NFS, ollama-ui-data-proxmox — including the
stuck Pending one from 47h ago; no finalizer trick needed)
- 1 NFS PV (ollama-data-host)
- 1 middleware (ollama_api_basic_auth_middleware)
- 2 secrets (tls_secret, ollama_api_basic_auth)
- 1 ExternalSecret manifest (external_secret)
- Directory `stacks/ollama/` fully removed.
- Verified `kubectl get ns ollama` → NotFound.
## Destroy blocker and fix
The initial `tg destroy` failed because `variable "ollama_host"` in
`stacks/ollama/main.tf` had no default and we had already removed it from
`config.tfvars` in stage 7. Added `default = "ollama.ollama.svc.cluster.local"`
to the variable, re-ran destroy successfully, then removed the whole
directory as part of this commit (so the temporary default never ships).
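For reference, the unblock amounted to giving the variable a default so
destroy could plan (shown as a heredoc for illustration — the real change
edited the existing declaration in place, and the whole directory was
deleted in this commit anyway):

```
cat <<'EOF'
variable "ollama_host" {
  default = "ollama.ollama.svc.cluster.local"
}
EOF
scripts/tg destroy -auto-approve
```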
## What is NOT in this change
- Vault `secret/ollama` still present (stage 9 cleanup pending if vault
authenticated interactively).
- NFS data at `/srv/nfs-ssd/ollama/` still present (stage 9 cleanup).
- `/home/wizard/code/infra/secrets/nfs_directories.txt` still lists ollama
(stage 9 — requires git-crypt unlock).
## Test plan
### Automated
- `scripts/tg destroy -auto-approve` → "Destroy complete! Resources: 18 destroyed."
- `kubectl get ns ollama` → "NotFound" (confirmed).
### Manual Verification
1. `kubectl get ns ollama` → NotFound.
2. `dig ollama.viktorbarzin.me @1.1.1.1` → Cloudflare record removed
(propagation may take up to 5m).
3. `ls /home/wizard/code/infra/stacks/ollama/` → directory does not exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 7 of ollama decommission. `ollama_host` was a shared tfvar consumed by
grampsweb, trading-bot, and ytdlp (all three cleaned in previous commits in
this stack). With no consumers left, the variable is dead config.
## This change
- Removes `ollama_host = "ollama.ollama.svc.cluster.local"` from
`config.tfvars` (root-level).
- No direct apply needed — subsequent stack applies simply stop emitting
  "Value for undeclared variable" warnings for this name.
## What is NOT in this change
- Ollama namespace + deployments still running (stage 8 destroys them).
- Stages 3, 4, 5 already removed the `variable "ollama_host"` declaration
in each consuming stack; with this commit the shared vars file matches.
## Test plan
### Automated
- None — tfvars change takes effect on next stack apply.
### Manual Verification
- `grep ollama_host config.tfvars` → empty (confirmed).
- `grep -r ollama_host stacks/` → only `ollama.svc.cluster.local` string
  literals inside comments (rybbit worker) or in the ollama stack itself
  (destroyed next, in stage 8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 6 of ollama decommission. The Cloudflare Worker at
stacks/rybbit/worker/index.js maps hostnames → rybbit analytics site IDs.
With `ollama.viktorbarzin.me` going away, the mapping is dead.
## This change
- Removes the `"ollama.viktorbarzin.me": "e73bebea399f"` entry from SITE_IDS.
- **Source-only** — does NOT auto-deploy. Cloudflare Workers are deployed
via `wrangler deploy` (manual, per user preference). The change will take
effect on the next manual deploy at the user's convenience.
## Manual deploy (when convenient)
```
cd stacks/rybbit/worker
wrangler deploy
```
## Test plan
### Automated
- Node syntax check: file remains valid JS (trailing comma rules preserved).
### Manual Verification
After `wrangler deploy`:
1. Hit `ollama.viktorbarzin.me` (while it still exists) — should NOT inject
rybbit script (map lookup misses, DEFAULT_SITE_ID is null).
2. Hit any other mapped host (e.g. `immich.viktorbarzin.me`) — should
continue to inject correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 5 of ollama decommission. The `trading-bot` stack has been entirely
commented out since 2026-04-06 (deployments scaled to 0, infra disabled to
prevent re-creation on apply). The commented body still contained references
to `var.ollama_host`, `TRADING_OLLAMA_HOST`, and `TRADING_OLLAMA_MODEL`.
Removing them now so if/when the stack is ever re-enabled, those dead
references don't need remembering.
## This change
- Removes `variable "ollama_host"` from the commented-out block.
- Removes `TRADING_OLLAMA_HOST` and `TRADING_OLLAMA_MODEL` from the
commented `common_env` locals.
- Verified the outer `/* ... */` comment block still wraps the entire stack
(head: `/*`, tail: `*/`).
- No apply needed — stack is disabled.
## Test plan
### Automated
- None — file content is inside a block comment; Terraform parser ignores it.
- `terraform fmt` check: no effect (commented content).
### Manual Verification
- `head -1 stacks/trading-bot/main.tf` → `/*`
- `tail -1 stacks/trading-bot/main.tf` → `*/`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 4 of ollama decommission. `grampsweb` referenced `var.ollama_host` for
its `GRAMPSWEB_LLM_BASE_URL` + `GRAMPSWEB_LLM_MODEL` env vars. This stack is
currently missing from Terraform state (blocked by bd-w97, which handles
state imports for 11 stacks including grampsweb) — so an apply would fail on
"resource already exists" errors.
## This change
- Deletes `variable "ollama_host"` declaration (stacks/grampsweb/main.tf).
- Deletes the two env entries `GRAMPSWEB_LLM_BASE_URL` and
`GRAMPSWEB_LLM_MODEL` from the `common_env` locals block.
- **Source-only** — NO apply performed, because the stack cannot apply
cleanly until bd-w97 resolves state imports. When that unblocks, the next
apply will pick up the already-clean source.
## Why not apply now
- Running `scripts/tg apply` would try to create ~7 resources that already
exist in K8s (namespace, PVCs, deployments, ingress, etc.), producing
"already exists" errors for each.
- Once bd-w97 imports those into state, the next apply will be a no-op for
them and will rollout the LLM env-var removal without issue.
## Test plan
### Automated
- No apply performed — stack blocked on bd-w97.
- `terraform fmt` on main.tf: no issues.
### Manual Verification
After bd-w97 resolves:
1. `scripts/tg plan` should show only the env-var removal on `grampsweb`
deployments (no resource creates).
2. `scripts/tg apply` → deployments rollout with `GRAMPSWEB_LLM_*` vars gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 3 of ollama decommission. `ytdlp` had an Ollama fallback path for when
OpenRouter models failed. With ollama going away, that fallback is
inoperable — removing the variable and two env entries prevents pods from
ever attempting to hit a service that no longer exists.
## This change
- Drops `variable "ollama_host"` from stacks/ytdlp/main.tf.
- Drops the two env entries `OLLAMA_URL` and `OLLAMA_MODEL` (plus their
preceding comment) from the yt-highlights container.
- Apply: `0 added, 4 changed, 0 destroyed` — deployments rolled out fresh
env, plus benign Kyverno ndots drift (already accepted).
- Verified `kubectl get deploy -n ytdlp` no longer exposes OLLAMA_URL.
## What is NOT in this change
- OpenRouter primary path unchanged.
- config.tfvars `ollama_host` still present (stage 7 removes it).
## Test plan
### Automated
- `scripts/tg plan` → 4 in-place updates, 0 destroy.
- `scripts/tg apply` → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."
### Manual Verification
1. `kubectl get deploy -n ytdlp -o yaml | grep OLLAMA` → empty.
2. yt-highlights continues processing via OpenRouter (check container logs for
successful OpenRouter responses).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Stage 2 of ollama decommission. The Traefik `ollama-tcp` entrypoint on port
11434 forwarded TCP traffic to the ollama service. With the IngressRouteTCP
already deleted (previous commit), the entrypoint is now orphaned — removing
it cleans up the Helm values and closes the port on the LB IP.
## This change
- Deletes the `ollama-tcp` entry from the `ports` map in traefik Helm values.
- Apply: `0 added, 4 changed, 0 destroyed` — helm_release.traefik rolled out
new config, 3 auxiliary deployments picked up benign Kyverno ndots drift
(already accepted per user approval).
## Verification
- `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
output: `piper-tcp web websecure websecure-http3 whisper-tcp`
- `ollama-tcp` no longer listed.
## Test plan
### Automated
- `scripts/tg plan` showed 4 in-place updates, 0 destroy.
- `scripts/tg apply` → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."
### Manual Verification
1. `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
2. Confirm `ollama-tcp` is absent from the output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Uptime Kuma TTFB was bimodal — fast ~150ms responses mixed with slow
~3s responses — median 1.7s, p95 3.2s across 20 samples. CPU request
was 50m (5% of one core) against a Node.js process that handles ~190
monitors plus SQLite DB maintenance. Memory request was 64Mi while
actual RSS sat around 221Mi, so the pod was also running above its
guaranteed memory floor and subject to eviction pressure when nodes got
tight.
CPU limits are intentionally absent cluster-wide (CFS throttling caused
more pain than it solved), so the only knob to give the scheduler a
higher floor is the request itself. Raising the request makes the node
reserve more CPU for the pod and lets the kernel's CFS weight it more
generously when the node is busy — this should reduce the tail on the slow
path without introducing throttling.
## This change
- requests.cpu: 50m -> 100m
- requests.memory: 64Mi -> 128Mi
- limits.memory: unchanged at 512Mi
- limits.cpu: still unset (explicit — cluster-wide rule)
## What is NOT in this change
- No CPU limit added
- No readiness/liveness probe tuning
- No replica count change (still 1, Recreate strategy)
- No DB layer / SQLite tuning
## Measurements (20 curl samples of https://uptime.viktorbarzin.me/)
| metric | before | after  |
|--------|--------|--------|
| min    | 0.143s | 0.149s |
| median | 1.727s | 1.228s |
| p95    | 3.163s | 3.154s |
| max    | 3.204s | 3.283s |
| mean   | 1.768s | 1.590s |
Median dropped ~29% (1.73s -> 1.23s). Tail (p95/max) essentially
unchanged — the slow bucket appears driven by something other than
CPU scheduling (likely socket.io / SSR render path inside the app,
or TLS/cf-tunnel handshake — worth a separate investigation).
Closes: code-79d
## Context
Ollama is being decommissioned. The `ollama_tcp_ingressroute` manifest in
stacks/whisper routed Traefik TCP entrypoint 11434 → ollama service in the
ollama namespace. With ollama going away, this route is dead weight and
blocks the subsequent destroy of the ollama stack.
## This change
- Deletes `kubernetes_manifest.ollama_tcp_ingressroute` from stacks/whisper/main.tf
- Apply result: 0 added, 5 changed, 0 destroyed (the manifest destroy happened in a
previous partial-apply; the 5 "changed" resources are benign Kyverno ndots /
PVC ownership drift which was already accepted per the user's approval).
- Verified `kubectl get ingressroutetcp -n traefik ollama-tcp` returns NotFound.
## What is NOT in this change
- Traefik entrypoint 11434 still exists (stage 2)
- Ollama namespace, deployments, services still present (stage 8)
## Test plan
### Automated
- `scripts/tg plan` showed 1 destroy (ollama_tcp_ingressroute), 1 create (data_proxmox
PVC import), 4 benign updates.
- `scripts/tg apply -auto-approve` → "Apply complete! Resources: 0 added, 5 changed, 0 destroyed."
### Manual Verification
- kubectl get ingressroutetcp -n traefik ollama-tcp → NotFound (confirmed)
- kubectl get ingressroutetcp -n traefik whisper-tcp piper-tcp → still present
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:
1. The workflow is DB-state, not Terraform-managed — the JSON in the
repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
`='Bearer ' + $env.X` does NOT — costs silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
workflow patches.
Follow-up to 99180bec which fixed the actual pipeline break.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
DIUN has been detecting image updates and firing Slack + webhook
notifications for weeks, but zero automated upgrades ran because the
handoff from n8n to claude-agent-service was silently 401-ing.
The pipeline (DIUN → n8n webhook → claude-agent-service /execute →
service-upgrade agent) was migrated from DevVM SSH to K8s HTTP in
42f1c3cf. The migration wired `claude-agent-service` (API_BEARER_TOKEN
env set), updated the n8n workflow JSON to POST with `Authorization:
Bearer $env.CLAUDE_AGENT_API_TOKEN`, but missed two things on the n8n
side:
1. The deployment didn't expose `CLAUDE_AGENT_API_TOKEN` to the n8n
container — workflow sent `Authorization: Bearer ` (empty).
2. The workflow header expression used JS concat (`='Bearer ' + $env.X`)
which n8n 1.x does NOT evaluate in HTTP Request node header params.
It needs template-literal form: `=Bearer {{ $env.X }}`.
Evidence: `claude-agent-service` logs showed only `/health` probes —
zero `/execute` calls over 12h despite DIUN firing webhooks. n8n PG
execution 2250 returned `401 Missing bearer token`.
## This change
- Adds ExternalSecret `claude-agent-token` in the `n8n` namespace that
pulls `api_bearer_token` from Vault `secret/claude-agent-service`
(same source as the receiving service's token).
- Wires the token into the n8n container as env var
`CLAUDE_AGENT_API_TOKEN` via `secret_key_ref`.
- Sets `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` so expressions CAN read
`$env.*` at all (default in 1.x is false already, but setting
explicitly guards against upstream default flips).
- Fixes the workflow JSON backup (`workflows/diun-upgrade.json`) header
expression to use `{{ $env.X }}` template syntax.
The live workflow in n8n's PG DB was also patched in place (one-time
`UPDATE workflow_entity SET nodes = REPLACE(...)` — workflows are not
TF-managed; they were imported once).
## What is NOT in this change
- No retroactive re-run of skipped DIUN events. They'll be rediscovered
in future scans.
- No change to the `claude-agent-service` side — its token and endpoint
were already correct.
- No Slack alert on n8n HTTP-node failures — future work; right now a
broken workflow fails silently unless you check Execution History.
## End-to-end verification
```
$ curl -X POST n8n.viktorbarzin.me/webhook/30805ab6-... \
-d '{"diun_entry_status":"update","diun_entry_image":"docker.io/library/httpd","diun_entry_imagetag":"2.4.66",...}'
{"message":"Workflow was started"} HTTP 200
# n8n PG: execution_entity latest row → status=success
# claude-agent-service logs → "POST /execute HTTP/1.1" 202 Accepted
```
## Reproduce locally
```
1. vault login -method=oidc
2. cd stacks/n8n && ../../scripts/tg apply
3. kubectl -n n8n exec deploy/n8n -- printenv CLAUDE_AGENT_API_TOKEN
(should print 64-char hex)
4. Fire synthetic webhook with non-critical image (httpd / alpine)
5. Check n8n execution is success, claude-agent-service shows 202
```
Closes: code-ekz
Related: code-bck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
MySQL migrated from InnoDB Cluster (Bitnami chart + mysql-operator) to
a standalone StatefulSet on 2026-04-16. Two Prometheus alerts still
referenced the old topology and were firing falsely against resources
that no longer exist:
- MySQLDown: queried kube_statefulset_status_replicas_ready{statefulset="mysql-cluster"}
— that StatefulSet was deleted as part of Phase 1 of the migration.
- MySQLOperatorDown: queried kube_deployment_status_replicas_available{namespace="mysql-operator"}
— the operator Deployment was removed in Phase 1.
Replacement availability monitoring for the standalone MySQL pod will
be handled via an Uptime Kuma MySQL-connection monitor (out of scope
for this change — no Prometheus replacement alert is being added, per
the migration plan's "simpler is better" principle).
MySQLBackupStale and MySQLBackupNeverSucceeded are retained — they
query the mysql-backup CronJob which is unchanged by the migration.
Also removes MySQLDown from the two inhibition rules (NodeDown and
NFSServerUnresponsive) that previously suppressed it during cascade
outages — the alert no longer exists so the reference became dead.
Closes: code-3sa
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After commit f6812fe6 every external-monitor-sync run updated all ~107
monitors without any real change — because the new code always appended
`/` to the host (default path), while historical monitors had been
created with bare `https://host` URLs. Sync saw `https://host` !=
`https://host/` and re-wrote every monitor on each cycle: noisy logs,
wasted Uptime Kuma writes.
## This change
When the `uptime.viktorbarzin.me/external-monitor-path` annotation is
absent, build the URL WITHOUT a trailing slash so it matches the shape
of pre-existing monitors. When the annotation is set, append it as
before (e.g. `https://forgejo.viktorbarzin.me/api/healthz`).
Also flip the lenient/strict codes branch to trigger off the same
"annotation set?" signal instead of comparing against DEFAULT_PATH.
## Verification
Verified via two consecutive manual triggers of the CronJob against the
live stack:
Pass 1 (migration): 0 created, 107 updated, 0 deleted, 1 unchanged
Pass 2 (stable): 0 created, 0 updated, 0 deleted, 108 unchanged
`[External] forgejo` still probes `https://forgejo.viktorbarzin.me/api/healthz`
with strict `200-299`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The null_resource.mysql_static_user provisioner in commit 2033e767 used
a bash -c wrapper with nested single quotes (`'"$DB"'`-style injection)
to interpolate the app-specific database name and credentials. The outer
bash -c '...' single-quoted string was broken by the inner ' characters
long before reaching the container, so the local (tg) shell saw `$DB`
and `$USER` unset and produced an empty database name:
ERROR 1102 (42000) at line 1: Incorrect database name ''
Apply failed for both forgejo and roundcubemail.
## This change
Feed the SQL to mysql on the pod via stdin through `kubectl exec -i`:
- Outer command: `kubectl exec -i ... -- sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"'`
- Single-quoted shell heredoc (`<<'SQL'`) carries the SQL statements
- HCL interpolates `${each.key}`, `${each.value.database}`,
`${each.value.password}` into the heredoc body before the shell runs
- No nested quoting — one single-quote layer, one double-quote layer,
one heredoc layer
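Concretely, the provisioner command now has this shape (sketched with the
forgejo values; the real command interpolates `${each.key}` /
`${each.value.*}` into the heredoc body):

```
kubectl exec -i -n dbaas mysql-standalone-0 -c mysql -- \
  sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<'SQL'
CREATE DATABASE IF NOT EXISTS forgejo;
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON forgejo.* TO 'forgejo'@'%';
FLUSH PRIVILEGES;
SQL
```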
Plan/apply verified on the live stack: 2 added (forgejo + roundcubemail),
7 pre-existing drift items changed, 0 destroyed. Both users now log in
with their app-cached passwords.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forgejo's /api/healthz verifies cache + DB and returns 503 when
degraded, where / returns 200 even with a broken backend. Prevents
recurrence of the false-negative from the 2026-04-17 outage.
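Runtime equivalent of the opt-in (the actual change presumably lands in
the stack's Terraform ingress annotations; namespace/name assumed):

kubectl annotate ingress forgejo -n forgejo --overwrite \
  uptime.viktorbarzin.me/external-monitor-path=/api/healthz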
Closes: code-ut0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `external-monitor-sync` CronJob probed `https://<host>/` for every
`*.viktorbarzin.me` ingress. Homepages frequently return 200 (or
allow-listed 30x/40x) even when the backend or DB is broken, producing
false-negatives — the forgejo outage on 2026-04-17 was not caught for
this reason: `/` returned a login page while `/api/healthz` returned
503 from the DB probe.
Manual monitor edits don't stick: the next sync is create-if-missing
only, so a deleted monitor gets recreated pointing at `/` again.
## This change
Teaches the sync three things:
1. **Reads a new annotation** `uptime.viktorbarzin.me/external-monitor-path`.
The annotation value is appended as the probe path; default `/`
preserves today's behaviour for every ingress that hasn't opted in.
2. **Tightens accepted status codes** when an explicit path is set:
`['200-299']` (strict — we expect a real healthz). The default `/`
path keeps the existing lenient set `['200-299','300-399','400-499']`
because homepages routinely 30x redirect or 40x on missing auth.
3. **Updates existing monitors** when the target URL or accepted
status codes drift. Previously the loop was create-if-missing only,
so annotating an already-monitored ingress had no effect until the
monitor was deleted. Now re-running the sync after changing the
annotation converges the live monitor.
## What is NOT in this change
- No change to the Ingress annotations on any individual stack. Each
service that wants a non-`/` probe path opts in separately.
- No change to the ConfigMap fallback payload shape — legacy entries
still get the lenient status codes.
- Monitor DB state in Uptime Kuma's SQLite is untouched at plan time;
the sync CronJob is what reconciles state on each run.
## Flow
```
ingress annotation CronJob Python
------------------ --------------
(none) --> url = https://host/ codes = lenient
external-monitor-path --> url = https://host<path> codes = strict ['200-299']
^^ "/api/healthz" https://host/api/healthz codes = ['200-299']
existing monitor + drifted target url --> api.edit_monitor(id, url=..., accepted_statuscodes=...)
```
## Test Plan
### Automated
- `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0.
- `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to
change, 0 to destroy`. The single in-place change is the CronJob
command (Python heredoc re-rendered). No other resources drift.
- Embedded Python compiles: extracted the `PYEOF` block and ran
`python3 -m py_compile` — OK.
### Manual Verification
1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz`
2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual`
3. Expected log line:
`Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])`
4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes
reflect the annotation.
5. Final summary line includes updated count:
`Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the broken Traefik rewrite-body plugin with a Cloudflare Worker
using HTMLRewriter to inject the rybbit tracking script into HTML responses
at the CDN edge.
- Wildcard route: *.viktorbarzin.me/* covers all proxied services
- 28 services have explicit site ID mappings
- Unmapped hosts pass through without injection
- Zero Traefik dependency, zero performance impact
Closes: code-sed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Duplicate bug fix
The external-monitor-sync deduped targets by hostname (`host in seen`), but
multiple ingresses can share the same hostname, so every sync run created
duplicate [External] monitors; 90 had accumulated by the time of this fix.
Changed to dedupe by the final monitor name (`f"{PREFIX}{label}" in seen`),
as sketched below.
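A minimal sketch of the corrected dedupe, with the key change inline. The
`prefix`/`label` naming follows the commit's wording; the surrounding
target-collection code is assumed.
```
def dedupe_targets(targets: list[dict], prefix: str = "[External] ") -> list[dict]:
    """Keep one target per final monitor name (previously: per hostname)."""
    seen: set[str] = set()
    kept = []
    for t in targets:
        key = prefix + t["label"]   # was: key = t["host"]
        if key in seen:
            continue
        seen.add(key)
        kept.append(t)
    return kept
```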
## Monitor cleanup
Deleted 118 monitors total:
- 90 duplicate [External] monitors (kept lower ID of each pair)
- 14 paused internal monitors for decommissioned services
- 14 external monitors for non-existent, scaled-down, or non-HTTP services
(xray-vless, complaints, hermes-agent, etc.)
## Opt-outs
Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses
that shouldn't have external HTTP monitors: xray (non-HTTP protocol),
council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS).
329 monitors → ~210 monitors. Zero down monitors expected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both services were running against empty unencrypted PVCs after the
proxmox-lvm-encrypted migration. Data copied from old Released PVs
via LUKS-unlock on PVE host, deployments switched to encrypted PVCs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [f1-stream] Remove committed cluster-admin kubeconfig
## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.
Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.
## This change
- git rm stacks/f1-stream/files/.config
## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
must be invalidated separately via `kubeadm certs renew admin.conf` or
CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
fresh mirror is planned and will be force-pushed to both remotes.
## Test plan
### Automated
No tests needed for a file removal. Sanity:
$ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output)
### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
fails with 401/403 once the admin cert is renewed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.
Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.
## This change
- git rm modules/kubernetes/frigate/config.yaml
## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
must be coordinated out-of-band with the camera operators. The DDNS
camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
commits from bcad200a forward. Scheduled to be purged via
`git filter-repo --path modules/kubernetes/frigate/config.yaml
--invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
source config from Git rather than the PVC, the replacement should go
through ExternalSecret + env-var interpolation, not an inline YAML.
## Test plan
### Automated
$ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms orphan status)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
PVC bound (unaffected by this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.
The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.
A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh
## What is NOT in this change
- Technitium token rotation. The leaked token still works against
`technitium-web.technitium.svc.cluster.local:5380` until revoked in the
Technitium admin UI. Rotation is a prerequisite for the upcoming
git-history scrub, which will remove the token from every commit via
`git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
stacks keep working.
## Test plan
### Automated
$ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms no consumer)
$ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
(no output in HEAD after this commit)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
behavioral regression because renew.sh was never part of the automated
flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.
The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh
main.py is retained unchanged.
## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
vendor default `calvin` regardless; rotation is tracked in the broader
remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
(the redfish-exporter ConfigMap has `default: username: root, password:
calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
addressed here — filed as its own task so the fix (drop the default
block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.
## Test plan
### Automated
$ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
--include='*.tf' --include='*.hcl' --include='*.yaml' \
--include='*.yml' --include='*.sh'
(no consumer references)
### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
it was never coupled to main.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.
The 3 outliers shipped their keys in plaintext because the `.gitattributes`
`secrets/**` rule matches only the repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).
Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as the root cert; the
  cert material is distinct but the coverage is equivalent.
## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
.gitattributes as a regression guard — any future real file placed
under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.
setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.
## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
once the user's LE account is authenticated. Revocation must happen
before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
commit from 2026-03-11 forward. Scheduled for removal via
`git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
(and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
this commit matches the existing symlink-to-wildcard pattern rather
than introducing a new component.
## Test plan
### Automated
$ readlink stacks/foolery/secrets
../../secrets
(likewise for terminal, claude-memory)
$ for s in foolery terminal claude-memory; do
openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
done
subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard)
$ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
stacks/foolery/secrets/fullchain.pem: filter: git-crypt
(now matched by the new rule; the repo-root rule already applied to the
symlink target)
### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
shows only the K8s TLS secret being re-created with the root-wildcard
material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
<name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
claude-memory) → cert chain presents the new serial, handshake OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.
This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
DockerHub image; no pull secret):
* `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
version`), used to smoke-test each new image.
* `broker-sync-trading212` — daily 02:00 `broker-sync trading212
--mode steady`.
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
* `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
* `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
(Phase 1 tail).
- A sixth CronJob, `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matching
  the convention in infra/.claude/CLAUDE.md §3-2-1.
NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
stacks/wealthfolio/main.tf — happens in the same commit that first
applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
required keys documented in the ExternalSecret comment block.
Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
apply time.
## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
`broker-sync 0.1.0`.
* fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.
TODO: Create Authentik application for dolt-workbench.viktorbarzin.me
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
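The guarded shape, sketched in PromQL with standard kube-state-metrics
series; the repo's exact alert expressions may differ.
```
# Alert only when the deployment *wants* replicas but has none available.
kube_deployment_spec_replicas{deployment="poison-fountain"} > 0
  and kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
```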
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used
Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).
Clears 3 of 5 ExternalAccessDivergence services. The remaining 2 (pdf, travel)
should clear once the Uptime Kuma monitors report both as down.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add an isinstance check before calling
`.get()` on the latest beat; a sketch follows below.
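A sketch of the robust flattening, assuming nothing about how deep the
library nests the data. The "status"/"time" heuristic for telling a beat
dict from a container dict is hypothetical; the pusher's actual field
names may differ.
```
def flatten_beats(data) -> list[dict]:
    """Flatten any nesting of lists/dicts into a flat list of beat dicts."""
    if isinstance(data, list):
        return [b for item in data for b in flatten_beats(item)]
    if isinstance(data, dict) and not ("status" in data or "time" in data):
        # A container dict (e.g. keyed by monitor id), not a beat itself.
        return [b for v in data.values() for b in flatten_beats(v)]
    return [data] if isinstance(data, dict) else []

def latest_status(raw):
    beats = flatten_beats(raw)
    latest = beats[-1] if beats else None
    # The isinstance guard before .get(), per the fix.
    return latest.get("status") if isinstance(latest, dict) else None
```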
## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
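The same push sketched with Python requests to show the essence of the fix;
the Pushgateway address, job name, and metric name here are placeholders.
```
import time
import requests

PUSHGW = "http://pushgateway.monitoring.svc:9091"   # placeholder address
metric = f"prometheus_backup_last_run_timestamp_seconds {int(time.time())}\n"

resp = requests.post(
    f"{PUSHGW}/metrics/job/prometheus-backup",
    data=metric,
    # The missing piece in the wget call: Pushgateway rejects the body
    # without an explicit text/plain content type.
    headers={"Content-Type": "text/plain"},
    timeout=10,
)
resp.raise_for_status()
```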
Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing) not a false positive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Workbench's database connection is in-memory and lost on pod restart.
Added startup script that waits for GraphQL server readiness, then calls
addDatabaseConnection mutation automatically. No more manual reconnection.
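A sketch of that startup flow; the mutation name comes from this commit, but
its arguments, the GraphQL port, and the readiness query are assumptions.
```
import time
import requests

GRAPHQL = "http://localhost:9002/graphql"   # internal GraphQL server (assumed port)

def wait_for_graphql(timeout_s: int = 120) -> bool:
    """Poll until the GraphQL server answers a trivial query."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.post(GRAPHQL, json={"query": "{ __typename }"}, timeout=5).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(2)
    return False

if wait_for_graphql():
    # Hypothetical arguments; only the mutation name is from the commit.
    requests.post(GRAPHQL, json={
        "query": 'mutation { addDatabaseConnection(connectionUrl: "mysql://...") }'
    }, timeout=10)
```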
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.
## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.
Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
fixed-broken-upstream / written-from-scratch). Persist to
modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
HTTP 200 every 30s x 20 iterations, requires 18/20 successes (sketched after
this list).
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
.dockerignore / BUILD.md via Contents API → open PR with body rendered from
templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
existing branch, open PR, upstream landed a Dockerfile mid-deploy).
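A Python rendering of the stability gate from step 2, as referenced above.
The real implementation is scripts/stability-gate.sh; the namespace,
selector, and URL parameters here are illustrative.
```
import subprocess
import time
import requests

def probe(ns: str, selector: str, url: str) -> bool:
    """One iteration: all matching pods Ready AND the app answers HTTP 200."""
    out = subprocess.run(
        ["kubectl", "-n", ns, "get", "pods", "-l", selector, "-o",
         "jsonpath={.items[*].status.containerStatuses[*].ready}"],
        capture_output=True, text=True).stdout.split()
    if not out or any(r != "true" for r in out):
        return False
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def stability_gate(ns: str, selector: str, url: str,
                   iters: int = 20, need: int = 18, interval: int = 30) -> bool:
    passes = 0
    for i in range(iters):
        passes += probe(ns, selector, url)
        if i < iters - 1:
            time.sleep(interval)
    return passes >= need   # 18/20 successes over ~10 minutes
```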
GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
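And a compressed sketch of the API dance from step 3, using Python requests
rather than curl. The endpoints are the standard GitHub REST v3 ones; the
owner, repo, branch, and commit-message values are placeholders.
```
import base64
import requests

API = "https://api.github.com"
S = requests.Session()
S.headers["Authorization"] = "Bearer <github_pat from Vault>"   # placeholder

def contribute(upstream: str, fork_owner: str, branch: str, dockerfile: bytes):
    owner, repo = upstream.split("/")
    # 1. Fork (idempotent: GitHub returns the existing fork if one exists).
    S.post(f"{API}/repos/{owner}/{repo}/forks")
    # 2. Sync the fork's default branch with upstream.
    S.post(f"{API}/repos/{fork_owner}/{repo}/merge-upstream", json={"branch": "main"})
    # 3. Branch off the fork's main.
    sha = S.get(f"{API}/repos/{fork_owner}/{repo}/git/ref/heads/main").json()["object"]["sha"]
    S.post(f"{API}/repos/{fork_owner}/{repo}/git/refs",
           json={"ref": f"refs/heads/{branch}", "sha": sha})
    # 4. Commit the Dockerfile via the Contents API.
    S.put(f"{API}/repos/{fork_owner}/{repo}/contents/Dockerfile",
          json={"message": "Add Dockerfile\n\nSigned-off-by: ...",
                "content": base64.b64encode(dockerfile).decode(),
                "branch": branch})
    # 5. Open the PR against upstream.
    r = S.post(f"{API}/repos/{owner}/{repo}/pulls",
               json={"title": "Add Dockerfile", "head": f"{fork_owner}:{branch}",
                     "base": "main", "body": "<rendered from templates/PR_BODY.md>"})
    return r.json().get("html_url")
```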
## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR
## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
dance. Plan to dry-run against a throwaway sink repo before pointing at a
real upstream.
## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
§1-9 untouched semantics, new §8b/§10 reference real files
### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
- Confirm .contribution-state.json is written with `written-from-scratch`
- Run stability-gate.sh — expect 18/20 passes on a healthy deploy
- Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
- Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
expect SKIP, no PR created
[ci skip]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>