2026-03-17 21:34:11 +00:00
# =============================================================================
# Monitoring Stack — Prometheus / Grafana / Loki
# =============================================================================
variable "tls_secret_name" { type = string }
variable "nfs_server" { type = string }
variable "mysql_host" { type = string }
variable "monitoring_idrac_username" { type = string }
data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "platform"
}
2026-03-23 02:24:39 +02:00
data "vault_kv_secret_v2" "viktor" {
  mount = "secret"
  name  = "viktor"
}
2026-03-17 21:34:11 +00:00
module "monitoring" {
  source                        = "./modules/monitoring"
  tls_secret_name               = var.tls_secret_name
  nfs_server                    = var.nfs_server
  mysql_host                    = var.mysql_host
  alertmanager_account_password = data.vault_kv_secret_v2.secrets.data["alertmanager_account_password"]
  idrac_username                = var.monitoring_idrac_username
  idrac_password                = data.vault_kv_secret_v2.secrets.data["monitoring_idrac_password"]
  alertmanager_slack_api_url    = data.vault_kv_secret_v2.secrets.data["alertmanager_slack_api_url"]
  tiny_tuya_service_secret      = data.vault_kv_secret_v2.secrets.data["tiny_tuya_service_secret"]
  haos_api_token                = data.vault_kv_secret_v2.secrets.data["haos_api_token"]
  pve_password                  = data.vault_kv_secret_v2.secrets.data["pve_password"]
  grafana_admin_password        = data.vault_kv_secret_v2.secrets.data["grafana_admin_password"]
monitoring: lock Finance (Personal) folder to admin + fix cash classification

Folder ACL:
- Move uk-payslip + wealth dashboards to a new "Finance (Personal)"
  folder; job-hunter + fire-planner stay in "Finance" (open).
- New null_resource calls Grafana's folder permissions API after the
  dashboard sidecar materialises the folder, setting an admin-only
  ACL ({Admin: 4}). Default Viewer/Editor inheritance is overridden,
  so anonymous-Viewer (auth.anonymous=true) is denied. Server-admin
  always retains access.
- Verified: anonymous → 403 on uk-payslip + wealth, 200 on
  control dashboards (node-exporter); admin → 200 on all.

Wealth cash fix:
- Wealthfolio dumps WORKPLACE_PENSION wrappers entirely into
  cash_balance because it doesn't track underlying fund holdings.
  Reclassify pension cash as invested in the "Cash vs invested"
  panel so the cash series reflects actual uninvested broker cash
  (~£16k T212 ISA + Schwab) instead of phantom £154k.

Pre-fix:  cash=£153,789 / invested=£870,282 / total=£1,024,071
Post-fix: cash=£16,064 / invested=£1,008,008 / total=£1,024,071
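The folder ACL step in the note above could look roughly like the following. This is a minimal sketch, not the module's actual resource: the variable names, the resource name, and the use of basic auth via `curl` are all assumptions made for illustration. It relies on Grafana's folder permissions endpoint (`POST /api/folders/:uid/permissions`), which replaces the folder's permission list with the items supplied, and on permission level 4 meaning Admin. It does not wait for the dashboard sidecar to create the folder, which the real resource is described as doing.

```hcl
# Hypothetical inputs; none of these names come from the module itself.
variable "example_grafana_url" { type = string }
variable "example_grafana_admin_user" {
  type    = string
  default = "admin"
}
variable "example_grafana_admin_password" {
  type      = string
  sensitive = true
}
variable "example_finance_personal_folder_uid" { type = string }

resource "null_resource" "example_finance_personal_folder_acl" {
  # Re-run the provisioner when the target folder changes.
  triggers = {
    folder_uid = var.example_finance_personal_folder_uid
  }

  provisioner "local-exec" {
    # Replace the folder's permission list with a single Admin entry
    # (permission 4 = Admin), so inherited Viewer/Editor access no longer applies.
    command = <<-EOT
      curl --fail --silent --show-error \
        --user "$GRAFANA_USER:$GRAFANA_PASSWORD" \
        --header 'Content-Type: application/json' \
        --request POST \
        --data '{"items":[{"role":"Admin","permission":4}]}' \
        "$GRAFANA_URL/api/folders/$FOLDER_UID/permissions"
    EOT

    # Credentials are passed through the environment rather than
    # interpolated into the command string.
    environment = {
      GRAFANA_URL      = var.example_grafana_url
      GRAFANA_USER     = var.example_grafana_admin_user
      GRAFANA_PASSWORD = var.example_grafana_admin_password
      FOLDER_UID       = var.example_finance_personal_folder_uid
    }
  }
}
```

Keying `triggers` on the folder UID means the call is repeated only when the folder is replaced; a production version would probably also need a retry loop until the sidecar has materialised the folder.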
2026-04-25 23:11:26 +00:00
  kube_config_path              = var.kube_config_path
[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
  step walks the just-pushed manifest (index + children + config + every
  layer blob) via HEAD and fails the pipeline on any non-200. Catches
  broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
  three alerts — RegistryManifestIntegrityFailure,
  RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
  "registry serves 404 for a tag that exists" gap that masked the incident
  for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
  timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
  across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
  _manifests/revisions/sha256/<digest> that is an image index and logs a
  loud WARNING when a referenced child blob is missing. Does NOT auto-
  delete — deleting a published image is a conscious decision. Layer-link
  scan preserved.

Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
  don't need a cosmetic Dockerfile edit (matches convention from
  pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
  diagnosing + rebuilding after an orphan-index incident, plus a fallback
  for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
  cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
  choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
  k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
  flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
  deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
  modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
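To make the "CronJob every 15m" part of the note above concrete, here is a rough sketch of a scheduled probe using the Kubernetes provider's `kubernetes_cron_job_v1` resource. It is an illustration under stated assumptions, not the probe that ships in `modules/monitoring`: the resource name, namespace, image tag, and variables are hypothetical, and it only checks that the registry catalog (`/v2/_catalog`) is reachable with the given credentials. The manifest and blob walk and the three alert rules are not shown.

```hcl
# Hypothetical inputs for illustration only.
variable "example_registry_host" { type = string }
variable "example_registry_user" { type = string }
variable "example_registry_password" {
  type      = string
  sensitive = true
}

resource "kubernetes_cron_job_v1" "example_registry_probe" {
  metadata {
    name      = "example-registry-probe"
    namespace = "monitoring" # assumed namespace
  }

  spec {
    schedule                      = "*/15 * * * *" # every 15 minutes
    concurrency_policy            = "Forbid"       # never overlap probe runs
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 1

    job_template {
      metadata {}
      spec {
        backoff_limit = 0 # a failed probe should stay failed, not retry

        template {
          metadata {}
          spec {
            restart_policy = "Never"

            container {
              name  = "probe"
              image = "curlimages/curl:8.8.0" # assumed image and tag

              # Exit non-zero (failing the Job) if the catalog cannot be fetched.
              command = [
                "sh", "-c",
                "curl -fsS -u \"$REGISTRY_USER:$REGISTRY_PASSWORD\" \"https://$REGISTRY_HOST/v2/_catalog\" > /dev/null",
              ]

              env {
                name  = "REGISTRY_HOST"
                value = var.example_registry_host
              }
              env {
                name  = "REGISTRY_USER"
                value = var.example_registry_user
              }
              env {
                name  = "REGISTRY_PASSWORD"
                value = var.example_registry_password
              }
            }
          }
        }
      }
    }
  }
}
```

Passing the password as a plain `env` value keeps the sketch short but stores it in the CronJob spec; referencing a Kubernetes Secret would be the more careful choice.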
2026-04-19 17:08:28 +00:00
  registry_user                 = data.vault_kv_secret_v2.viktor.data["registry_user"]
  registry_password             = data.vault_kv_secret_v2.viktor.data["registry_password"]
2026-05-07 15:53:08 +00:00
  # try() so apply succeeds before the Vault key is populated during Phase 0
  # bootstrap (see docs/runbooks/forgejo-registry-setup.md). With an empty
  # token the probe reports an auth failure and fires
  # RegistryCatalogInaccessible; that is the intended visible-broken state
  # until the PAT is created.
  forgejo_pull_token = try(data.vault_kv_secret_v2.viktor.data["forgejo_pull_token"], "")
  tier               = local.tiers.cluster
2026-03-17 21:34:11 +00:00
}
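The `try()` fallback used for `forgejo_pull_token` can be seen in isolation with a small stand-in map. The sketch below is illustrative only: `example_secret_data` is a hypothetical substitute for `data.vault_kv_secret_v2.viktor.data`, and the local and output names are made up.

```hcl
locals {
  # Hypothetical stand-in for the Vault secret's data map, before the
  # forgejo_pull_token key has been written.
  example_secret_data = tomap({
    registry_user = "ci"
  })

  # Indexing a missing map key is an error; try() catches it and
  # falls through to the next argument, the empty string.
  example_pull_token = try(local.example_secret_data["forgejo_pull_token"], "")
}

output "example_pull_token_is_empty" {
  value = local.example_pull_token == ""
}
```

`lookup(local.example_secret_data, "forgejo_pull_token", "")` would give the same result for a plain map. `try()` is broader, since it also swallows other evaluation errors in its first argument, so it is best kept to narrow expressions like this one.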