End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml still dual-pushes until the next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at lines 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Monitoring & Alerting Architecture
Overview
The monitoring stack provides comprehensive observability for the home Kubernetes cluster through metrics collection (Prometheus), visualization (Grafana), log aggregation (Loki), alerting (Alertmanager), and uptime monitoring (Uptime Kuma). GPU metrics are collected via NVIDIA's dcgm-exporter. The system tracks infrastructure health, application performance, backup success, and resource utilization with intelligent alert inhibition to reduce noise during cascading failures.
Architecture Diagram
```mermaid
graph TB
    subgraph "Metric Sources"
        K8S[Kubernetes API Server]
        NODES[Node Exporters]
        PODS[Application Pods]
        GPU[NVIDIA GPU via dcgm-exporter]
        UPS[UPS Exporter]
        NFS[NFS Exporter]
        EMAIL[Email Roundtrip Probe<br/>CronJob every 10m]
    end
    subgraph "Monitoring Stack (platform stack)"
        PROM[Prometheus<br/>Scrape & Store]
        LOKI[Loki<br/>Log Aggregation]
        AM[Alertmanager<br/>Alert Routing]
        GRAFANA[Grafana<br/>14+ Dashboards<br/>OIDC via Authentik]
        UPTIME[Uptime Kuma<br/>HTTP Monitors]
    end
    subgraph "Alert Flow"
        INHIBIT[Inhibition Rules<br/>Node Down → Suppress Pod Alerts]
        NOTIFY[Notifications]
    end
    K8S -->|ServiceMonitors| PROM
    NODES -->|Metrics| PROM
    PODS -->|Metrics| PROM
    PODS -->|Logs| LOKI
    GPU -->|GPU Metrics| PROM
    UPS -->|UPS Metrics| PROM
    NFS -->|NFS Metrics| PROM
    PROM -->|Query| GRAFANA
    PROM -->|Alerts| AM
    LOKI -->|Query| GRAFANA
    AM --> INHIBIT
    INHIBIT --> NOTIFY
    EMAIL -->|Pushgateway| PROM
    EMAIL -.->|Push| UPTIME
    PODS -.->|HTTP Health| UPTIME
```
Components
| Component | Version | Location | Purpose |
|---|---|---|---|
| Prometheus | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Metrics collection and storage, scrape configs for all services |
| Grafana | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
| Loki | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Log aggregation and querying |
| Alertmanager | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Alert routing with cascade inhibitions |
| Uptime Kuma | Latest (Diun monitored) | stacks/uptime-kuma/ | Internal + external HTTP monitors, status page |
| External Monitor Sync | Python 3.12 | stacks/uptime-kuma/ | CronJob (10min) syncs [External] monitors from K8s API ingress discovery (cloudflare_proxied_names ConfigMap as fallback) |
| dcgm-exporter | Configurable resources | stacks/monitoring/modules/monitoring/ | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | stacks/mailserver/modules/mailserver/ | E2E email delivery verification via Mailgun API + IMAP |
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | stacks/monitoring/modules/monitoring/main.tf | CronJob every 15m: walks /v2/_catalog on forgejo.viktorbarzin.me (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits registry_manifest_integrity_* metrics to Pushgateway (sketched below). Replaces the legacy registry-integrity-probe against registry.viktorbarzin.me:5050, decommissioned in Phase 4 of forgejo-registry-consolidation (2026-05-07). |
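The probe's shape can be pictured with a minimal CronJob sketch. The 15m schedule, the catalog walk, and the `registry_manifest_integrity_*` metric family come from the table above; the object names, in-cluster service addresses, and script details are illustrative assumptions (the sketch also omits auth and the per-child index checks the real probe performs):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: forgejo-registry-integrity-probe   # name assumed
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"                  # every 15m, per the table
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: alpine:3.20
              command: ["/bin/sh", "-c"]
              args:
                - |
                  apk add --no-cache curl jq
                  REG="http://forgejo.forgejo.svc:3000"          # in-cluster service address assumed
                  PGW="http://pushgateway.monitoring.svc:9091"   # Pushgateway address assumed
                  fail=0
                  # Walk the catalog, then HEAD every tagged manifest.
                  for repo in $(curl -s "$REG/v2/_catalog" | jq -r '.repositories[]'); do
                    for tag in $(curl -s "$REG/v2/$repo/tags/list" | jq -r '.tags[]?'); do
                      curl -sfI -o /dev/null "$REG/v2/$repo/manifests/$tag" || fail=$((fail + 1))
                    done
                  done
                  # Push the failure count; the integrity alert fires on non-zero.
                  echo "registry_manifest_integrity_failures $fail" |
                    curl -s --data-binary @- "$PGW/metrics/job/registry-integrity"
```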
How It Works
Metrics Collection
Prometheus scrapes metrics from all cluster components and applications using ServiceMonitor CRDs and scrape configs. Every new service deployed to the cluster receives:
- A Prometheus scrape configuration (via ServiceMonitor or static config)
- An Uptime Kuma HTTP monitor for internal health checks
- An external HTTP monitor (auto-created by `external-monitor-sync` for all Cloudflare-proxied services)
External Monitoring
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for externally-reachable ingresses. Discovery is opt-OUT: the script lists every ingress via the K8s API and creates a monitor for any host ending in `.viktorbarzin.me`, skipping only those annotated `uptime.viktorbarzin.me/external-monitor: "false"`. Both `ingress_factory` and the reverse-proxy factory emit that annotation when the caller sets `external_monitor = false`; leaving it null keeps the opt-out default of creating a monitor (important for helm-provisioned ingresses that don't go through our factories). An opted-out ingress is sketched below. The legacy `cloudflare_proxied_names` ConfigMap is a fallback if K8s API discovery fails.
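For illustration, an ingress that should not receive an `[External]` monitor carries the annotation directly. A minimal sketch (the name, host, and backend are placeholders; only the annotation key comes from the script's contract):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: internal-only            # placeholder
  annotations:
    # external-monitor-sync skips hosts annotated with "false".
    uptime.viktorbarzin.me/external-monitor: "false"
spec:
  rules:
    - host: internal-only.viktorbarzin.me   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: internal-only
                port:
                  number: 80
```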
The `[External]` monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes an `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. The `ExternalAccessDivergence` alert fires after 15min of divergence.
Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.
Alert Cascade Inhibition
Alertmanager implements intelligent alert suppression to prevent alert storms during cascading failures:
```mermaid
graph LR
    NODE_DOWN[Node Down Alert] -->|Inhibits| POD_ALERTS[Pod Alerts on That Node]
    COMPLETED[Completed CronJob Pod] -->|Excluded from| POD_READY[Pod Not Ready Alerts]
```
When a node goes down, all pod-level alerts for pods scheduled on that node are suppressed, reducing noise and focusing attention on the root cause.
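In Alertmanager configuration, that suppression is an inhibit rule of roughly this shape (a sketch: the alert names and the node label are assumptions, not the deployed config):

```yaml
inhibit_rules:
  # When a node-down alert fires, suppress pod-level alerts
  # that carry the same node label.
  - source_matchers:
      - alertname = NodeDown        # source alert name assumed
    target_matchers:
      - alertname =~ "Pod.*"        # pod-level alert family assumed
    equal: ["node"]                 # only inhibit on the same node
```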
GPU Monitoring
NVIDIA GPU metrics are collected via dcgm-exporter with configurable resource limits (dcgmExporter.resources). Metrics include GPU utilization, memory usage, temperature, and power consumption.
Database Version Pinning
MySQL, PostgreSQL, and Redis images have Diun monitoring disabled to prevent automatic version updates that could cause compatibility issues. Version upgrades are manual and coordinated.
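With Diun's Kubernetes provider this is typically a pod annotation; a minimal sketch of a pinned database workload (the workload itself is a placeholder, and the annotation-based opt-out is an assumption about how Diun is wired up here):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres                    # placeholder workload
spec:
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      annotations:
        diun.enable: "false"        # exclude this image from Diun update notifications
    spec:
      containers:
        - name: postgres
          image: postgres:16        # pinned; upgrades are manual and coordinated
```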
Configuration
Key Config Files
- Monitoring Stack: `stacks/platform/modules/monitoring/`
  - Prometheus scrape configs and recording rules
  - Grafana dashboard definitions
  - Alertmanager routing and inhibition rules
  - Uptime Kuma configuration
Prometheus Scrape Configs
Every service must expose metrics and be registered in Prometheus via ServiceMonitor or static scrape config. Standard pattern:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: metrics
```
Grafana Dashboards
14+ pre-configured dashboards covering:
- Kubernetes API Server
- CoreDNS
- GPU metrics
- UPS status
- Node metrics
- Pod resource usage
- Application-specific metrics
Alert Definitions
Infrastructure Alerts
- OOMKill: Container killed due to out-of-memory
- PodReplicaMismatch: Deployment/StatefulSet replica count doesn't match desired
- ClusterMemoryRequestsHigh: Cluster memory requests >85%
- ContainerNearOOM: Container using >85% of memory limit (see the rule sketch after this list)
- PodUnschedulable: Pod cannot be scheduled due to resource constraints
- CPUTemp: CPU temperature threshold exceeded
- SSDWrites: Excessive SSD write volume
- NFSResponsiveness: NFS mount latency issues
- UPSBattery: UPS battery charge low
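As one example of how these are defined, the ContainerNearOOM threshold could be expressed as a Prometheus rule like the following (a sketch: the 85% figure comes from the list above, but the exact expression, labels, and hold duration in the deployed rules may differ):

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: ContainerNearOOM
        # Working-set memory as a fraction of the container's memory limit.
        expr: |
          max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
          > 0.85
        for: 10m                    # hold duration assumed
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is using >85% of its memory limit"
```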
Application Alerts
- 4xx/5xx Error Rates: HTTP error rate threshold exceeded
Email Monitoring Alerts
- EmailRoundtripFailing: E2E email probe returning failure for >30m
- EmailRoundtripStale: No successful email round-trip in >80m (60m threshold + `for: 20m`; see the sketch below)
- EmailRoundtripNeverRun: Email probe has never reported (40m)
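Against the pushed metrics (listed later in this page), the staleness alert maps onto a rule of roughly this shape (a sketch, not the deployed definition):

```yaml
groups:
  - name: email-roundtrip
    rules:
      - alert: EmailRoundtripStale
        # 60m since the last successful round-trip, held for 20m (~80m total).
        expr: time() - email_roundtrip_last_success_timestamp > 3600
        for: 20m
```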
Registry Integrity Alerts
- RegistryManifestIntegrityFailure: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`; see the rule sketch below. Remediation: rebuild the affected image per docs/runbooks/registry-rebuild-image.md.
- RegistryIntegrityProbeStale: Probe hasn't reported in >1h (CronJob broken)
- RegistryCatalogInaccessible: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
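The first two translate into rules of roughly this shape (a sketch: the failure-count expression and 30m hold come from the bullet above; the staleness metric name is an assumption from the `registry_manifest_integrity_*` family):

```yaml
groups:
  - name: registry-integrity
    rules:
      - alert: RegistryManifestIntegrityFailure
        expr: registry_manifest_integrity_failures > 0
        for: 30m
        annotations:
          runbook: docs/runbooks/registry-rebuild-image.md
      - alert: RegistryIntegrityProbeStale
        # Metric name assumed; any timestamp the probe pushes on each run works.
        expr: time() - registry_manifest_integrity_last_run_timestamp > 3600
```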
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the mailserver namespace that:
- Sends a test email via the Mailgun HTTP API to smoke-test@viktorbarzin.me
- The email lands in the spam@ catch-all mailbox via MX delivery
- Verifies delivery via IMAP (searches by a UUID marker in the subject)
- Deletes the test email immediately
- Pushes metrics (`email_roundtrip_success`, `email_roundtrip_duration_seconds`, `email_roundtrip_last_success_timestamp`) to the Prometheus Pushgateway
- Pushes status to the Uptime Kuma E2E Push monitor
Uptime Kuma monitors: TCP SMTP (port 25) on 176.12.22.76 (external), IMAP (port 993) on 10.0.20.202, and Dovecot exporter metrics on port 9166.
Backup Alerts
- PostgreSQLBackupStale: >36h since last backup
- MySQLBackupStale: >36h since last backup
- EtcdBackupStale: >8d since last backup
- VaultBackupStale: >8d since last backup
- VaultwardenBackupStale: >8d since last backup
- RedisBackupStale: >8d since last backup
- PrometheusBackupStale: >32d since last backup
- VaultwardenIntegrityFail: Backup integrity check failed
Vault Paths
No direct Vault integration required for the monitoring stack (platform stack cannot depend on Vault due to circular dependency).
Decisions & Rationale
Why Prometheus over alternatives (InfluxDB, Graphite)?
- Native Kubernetes integration via ServiceMonitor CRDs
- Pull-based model reduces application complexity (no push agents)
- Powerful query language (PromQL) for alerting and visualization
- Industry standard for cloud-native monitoring
Why Grafana over Prometheus UI?
- Superior visualization capabilities
- OIDC authentication via Authentik for secure access
- Multi-data-source support (Prometheus + Loki)
- Rich dashboard ecosystem
Why Loki for logs?
- Designed for Kubernetes log aggregation
- Cost-effective (indexes metadata, not full log content)
- Tight Grafana integration
- LogQL query language similar to PromQL
Why Uptime Kuma?
- Simple HTTP/TCP/Ping monitoring
- Public status page for service availability
- Lightweight compared to full APM solutions
- Complements Prometheus for black-box monitoring
Why alert inhibition?
- Prevents alert fatigue during cascading failures
- Root cause focus (fix the node, not 50 pods)
- Reduces on-call noise
Why exclude completed CronJob pods?
- CronJobs naturally transition to Completed state
- "Pod not ready" is expected and not actionable
- Prevents false positive alerts
Why disable Diun for databases?
- Version upgrades require migration planning
- Breaking schema changes need coordination
- Manual upgrade testing prevents production issues
Troubleshooting
Alert is firing but I don't see the issue
Check inhibition rules in Alertmanager. The alert may be suppressed due to a higher-level failure (e.g., node down suppressing pod alerts).
Grafana dashboards show no data
- Check Prometheus targets: `kubectl port-forward -n monitoring svc/prometheus 9090:9090`, then open http://localhost:9090/targets
- Verify the ServiceMonitor is created: `kubectl get servicemonitor -A`
- Check Prometheus logs for scrape errors: `kubectl logs -n monitoring deployment/prometheus`
Loki logs not appearing
- Verify pod logs are going to stdout/stderr (not files)
- Check Loki is scraping pod logs: `kubectl logs -n monitoring deployment/loki`
- Ensure the Grafana data source is configured correctly
Backup alert firing but backup exists
- Check the backup timestamp in Prometheus: `backup_last_success_timestamp_seconds{job="my-backup"}`
- Verify the backup job completed successfully: `kubectl get jobs -n backups` to find the Job spawned by the CronJob, then `kubectl logs -n backups job/<job-name>` (kubectl logs cannot target a CronJob directly)
- Ensure the backup job updates the Prometheus metric via Pushgateway or a ServiceMonitor
GPU metrics not showing
- Verify dcgm-exporter is running: `kubectl get pods -n monitoring -l app=dcgm-exporter`
- Check the GPU node has NVIDIA drivers installed
- Verify dcgm-exporter has access to the GPU: `kubectl logs -n monitoring deployment/dcgm-exporter`
Uptime Kuma monitor shows down but service is healthy
- Check network policies aren't blocking Uptime Kuma's pod
- Verify service endpoint is reachable from Uptime Kuma namespace
- Check Uptime Kuma logs: `kubectl logs -n monitoring deployment/uptime-kuma`
Related
- Secrets Management - OIDC authentication for Grafana via Authentik
- Backup & DR - Backup monitoring alerts
- Platform Stack - Monitoring stack deployment
- Vault Architecture - No direct dependency but related to cluster observability