Commit graph

79 commits

Author SHA1 Message Date
ac604d4d1f [monitoring] uk-payslip: cash-basis queries + RSU vest panel
- Panels 1/2/4: compute on (gross_pay - rsu_vest) so numbers reflect
  actual UK cash pay, not the RSU-inflated figure the payslip shows.
- Detailed table: add cash_gross / rsu_vest / rsu_offset columns.
- New RSU panel at the bottom: bar chart of rsu_vest over time
  (only shows months with stock vests). Taxed at Schwab — included
  here for reporting/reconciliation, not for P&L.
2026-04-18 23:39:46 +00:00
73ed2d9001 [monitoring] Add detailed-payslips table + full-deductions panels
Two new panels below the 4 existing ones:
- Detailed table: every payslip sorted by pay_date DESC with all fields
  (gross, all deductions, net, tax_year, validated flag, paperless_doc_id).
  Footer reducer sums the numeric columns.
- Full deductions stacked bars: income_tax + NI + pension_employee +
  pension_employer + student_loan per payslip. The earlier panel only
  showed 4 deductions; this one shows the complete picture.
2026-04-18 23:32:21 +00:00
4cd8d96b01 [monitoring] Widen uk-payslip default time range to 10y
Oldest payslip in Paperless is July 2019. Previous default (now-2y) hid
everything from 2019-2023, making it look like the backfill was broken.
2026-04-18 23:26:49 +00:00
Viktor Barzin
1698cd1ce1 [mailserver] Add daily backup CronJob for mailserver PVC
## Context

The mailserver stack holds everything valuable and hard to recreate:
243M of maildirs, dovecot/rspamd state, and the DKIM private key that
signs outbound mail. Today the only defense is the LVM thin-pool
snapshots on the PVE host (7-day retention, storage-class scope only)
— there is no app-level backup. Infra/.claude/CLAUDE.md mandates that
every proxmox-lvm(-encrypted) app ship an NFS-backed backup CronJob,
and the mailserver stack was the only one still out of compliance.

Loss of mailserver-data-encrypted without backups = total loss of all
stored mail plus a DKIM key rotation (which requires a DNS update and
breaks signature verification on every message in transit for the TTL
window). Unacceptable for a service people actually use.

Trade-offs considered:
- mysqldump-style single-file dump vs rsync snapshot — maildirs are
  millions of small files, not a DB export. rsync --link-dest gives
  incremental weekly snapshots for ~10% of the cost of a full copy.
- RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so
  the backup Job has to co-locate with the mailserver pod. vaultwarden
  solves this with pod_affinity; mirrored here.
- Image choice — alpine + apk add rsync matches vaultwarden's pattern
  and keeps the container image small.

## This change

Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the
mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup
and 00:45 per-db windows, and the */20 email-roundtrip cadence). The
job rsyncs /var/mail, /var/mail-state, /var/log/mail into
/srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the
previous week for space-efficient incrementals. 8-week retention.
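
A minimal sketch of the weekly rotation (paths per the layout below;
GNU date flags and variable names are illustrative, not the script
verbatim):

    WEEK=$(date +%G-%V)                      # ISO week, e.g. 2026-15
    PREV=$(date -d '7 days ago' +%G-%V)      # last week's snapshot, if any
    BASE=/backup                             # NFS mount inside the job pod (assumed)
    mkdir -p "$BASE/$WEEK"
    rsync -a --delete --link-dest="$BASE/$PREV/data"  /var/mail/       "$BASE/$WEEK/data/"
    rsync -a --delete --link-dest="$BASE/$PREV/state" /var/mail-state/ "$BASE/$WEEK/state/"
    rsync -a --delete --link-dest="$BASE/$PREV/log"   /var/log/mail/   "$BASE/$WEEK/log/"

Unchanged files hardlink against the previous week, so only churn costs
new bytes; a missing $PREV directory just degrades to a full copy.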

Data layout (flowed through from the deployment's subPath mounts so
the rsync tree matches the mailserver's own on-disk layout):

    PVC mailserver-data-encrypted (RWO, 2Gi)
      ├─ data/   (subPath) → pod's /var/mail        → backup/<week>/data/
      ├─ state/  (subPath) → pod's /var/mail-state  → backup/<week>/state/
      └─ log/    (subPath) → pod's /var/log/mail    → backup/<week>/log/

Safety:
- PVC mounted read-only (volume.persistent_volume_claim.read_only
  AND all three volume_mounts set read_only=true) so a backup-script
  bug cannot corrupt maildirs.
- pod_affinity on app=mailserver + topology_key=hostname forces the
  Job pod onto the same node holding the RWO PVC attachment.
- set -euxo pipefail + a per-directory existence guard so a missing
  subPath aborts cleanly instead of silently no-op'ing (sketched just
  below).
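
The guard, sketched (mount paths from the deployment; the real script's
phrasing may differ):

    set -euxo pipefail
    for dir in /var/mail /var/mail-state /var/log/mail; do
      [ -d "$dir" ] || { echo "missing $dir, aborting backup" >&2; exit 1; }
    done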

Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup
convention (job="mailserver-backup"):
  backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_output_bytes, backup_last_success_timestamp.
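
The push mirrors the vaultwarden wget pattern (see Deviations below);
a sketch, with the stat values as placeholders:

    printf '%s\n' \
      "backup_duration_seconds $DURATION_S" \
      "backup_read_bytes $READ_B" \
      "backup_written_bytes $WRITTEN_B" \
      "backup_output_bytes $OUTPUT_B" \
      "backup_last_success_timestamp $(date +%s)" \
    | wget -q -O /dev/null --header='Content-Type: text/plain' --post-file=- \
        http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/mailserver-backup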

Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden:
- MailserverBackupStale — 36h threshold, critical, for: 30m
- MailserverBackupNeverSucceeded — critical, for: 1h

## Reproduce locally

1. cd infra/stacks/mailserver && ../../scripts/tg plan
   Expected: 3 to add (cronjob + NFS PV + PVC); the unrelated drift on
   deployment/service is pre-existing.
2. ../../scripts/tg apply --non-interactive \
     -target=module.mailserver.module.nfs_mailserver_backup_host \
     -target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup
3. cd ../monitoring && ../../scripts/tg apply --non-interactive
4. kubectl create job --from=cronjob/mailserver-backup \
     mailserver-backup-test -n mailserver
5. kubectl wait --for=condition=complete --timeout=300s \
     job/mailserver-backup-test -n mailserver
6. Expected: test pod co-locates with mailserver on same node
   (k8s-node2 today), rsync writes ~950M to
   /srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes
   backup_output_bytes{job="mailserver-backup"}.

## Test Plan

### Automated

$ kubectl get cronjob -n mailserver mailserver-backup
NAME                SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
mailserver-backup   0 3 * * *   <none>     False     0        <none>          3s

$ kubectl create job --from=cronjob/mailserver-backup \
    mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test created

$ kubectl wait --for=condition=complete --timeout=300s \
    job/mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test condition met

$ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
=== Backup IO Stats ===
duration: 80s
read:    1120 MiB
written: 1186 MiB
output:  947.0M

$ kubectl run nfs-verify --rm --image=alpine --restart=Never \
    --overrides='{...nfs mount /srv/nfs...}' \
    -n mailserver --attach -- ls -la /nfs/mailserver-backup/
947.0M  /nfs/mailserver-backup/2026-15

$ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep mailserver-backup
backup_duration_seconds{instance="",job="mailserver-backup"} 80
backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09

$ curl -s http://prometheus-server/api/v1/rules \
    | jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0

### Manual Verification

1. Wait for the scheduled 03:00 run tonight; verify
   `kubectl get job -n mailserver` shows a new completed job.
2. Check that `backup_last_success_timestamp` advances past today.
3. Confirm `MailserverBackupNeverSucceeded` did not fire.
4. Next week (week 16), confirm `--link-dest` builds hardlinks vs
   2026-15 (size delta should drop from ~950M to ~the actual churn).

## Deviations from mysql-backup pattern

- Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0`
  base is not applicable for a filesystem rsync).
- pod_affinity: required for RWO PVC co-location (mysql uses its own
  MySQL service for network access; mailserver must mount the PVC).
- Metric push via wget (mirrors vaultwarden; alpine has wget, not curl).
- Week-folder layout with --link-dest rotation: rsync pattern, closer
  to the PVE daily-backup script than mysql's single-file gzip dumps.

[ci skip]

Closes: code-z26

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:08 +00:00
06e3425a39 [monitoring] Set rawQuery+editorMode on uk-payslip panel targets
Grafana 11's Postgres plugin shows 'you do not have default database'
on any panel whose target is missing rawQuery:true / editorMode:"code".
The query builder can't reason about a custom schema.table path and
blanks the panel.
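
One way to apply the fix across all targets, assuming the standard
dashboard-JSON layout (the commit does not record the exact method used):

    jq '.panels[].targets[]? += {rawQuery: true, editorMode: "code"}' \
      uk-payslip.json > tmp.json && mv tmp.json uk-payslip.json
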
2026-04-18 23:12:45 +00:00
ed820e9b58 [monitoring] Fix uk-payslip datasource type to grafana-postgresql-datasource
The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer
one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana
to fall back to 'default database' and error out when rendering.

Ran sed over the JSON; all 8 panel+target type refs now match the installed
plugin name. UID (payslips-pg) was already correct.
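
The pass was of this shape (invocation reconstructed, not recorded):

    sed -i 's/"type": "postgres"/"type": "grafana-postgresql-datasource"/g' uk-payslip.json
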
2026-04-18 23:10:13 +00:00
471e946133 [monitoring] Put uk-payslip dashboard in Finance folder
Grafana can't auto-create the reserved 'General' folder ('A folder with
that name already exists'), which aborts the sidecar provisioner's walk
and drops every dashboard in that folder. Move uk-payslip to Finance so
it loads.
2026-04-18 23:03:22 +00:00
Viktor Barzin
b28c76e371 [infra] Wire drift detection to Pushgateway + alert on stale/unaddressed drift
## Context

Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.

## This change

### `.woodpecker/drift-detection.yml`

Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:

| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |

First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).

Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway has no cross-request transaction, so pushing
metric-by-metric could leave half-updated state if the pipeline died
mid-run; with one batched request the push either lands whole or fails
whole, and a failed run surfaces via `DriftDetectionStale`.
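
A condensed sketch of the first-seen lookup plus the batched push (a
shell step in the pipeline; variable names are illustrative):

```
PGW=http://prometheus-prometheus-pushgateway.monitoring:9091
NOW=$(date +%s)
BATCH=""
for stack in $DRIFTED_STACKS; do
  # Pushgateway renders pushed values back on /metrics (possibly in
  # float notation; the real pipeline must normalise before the math)
  seen=$(curl -s "$PGW/metrics" \
    | grep '^drift_stack_first_seen{' | grep "stack=\"$stack\"" \
    | awk '{print $2}')
  [ -n "$seen" ] && [ "$seen" != "0" ] || seen=$NOW
  BATCH="$BATCH
drift_stack_state{stack=\"$stack\"} 1
drift_stack_first_seen{stack=\"$stack\"} $seen
drift_stack_age_hours{stack=\"$stack\"} $(( (NOW - seen) / 3600 ))"
done
# one POST: the batch lands whole or fails whole
printf '%s\ndrift_detection_last_run_timestamp %s\n' "$BATCH" "$NOW" \
  | curl -sf --data-binary @- "$PGW/metrics/job/drift-detection"
```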

### `stacks/monitoring/.../prometheus_chart_values.tpl`

New `Infrastructure Drift` alert group with three rules:

- **DriftDetectionStale** (warning, 30m): fires if
  `drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
  grace window on top of the 24h cron so transient Pushgateway or
  cluster unavailability doesn't false-alarm. Guards against the
  pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
  `drift_stack_age_hours > 72` — three days of unacknowledged drift.
  Three days is long enough to absorb weekends + typical review cycles
  but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
  in a single run. Sudden wide drift usually signals systemic causes
  (new admission webhook, provider version bump, cluster-wide CRD
  upgrade) rather than individual configuration errors, and the alert
  body nudges toward that diagnosis.
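
The 26h window is 93600 seconds; a quick shape-check of the staleness
rule against live Prometheus (expression reconstructed from the
description above, not copied from the chart values):

```
curl -sG http://prometheus-server/api/v1/query \
  --data-urlencode 'query=time() - drift_detection_last_run_timestamp > 93600' \
  | jq '.data.result'   # empty result = pipeline ran within the last 26h
```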

Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.

## What is NOT in this change

- The Wave 7 **GitHub issue auto-filer** — the full plan included
  filing a `drift-detected` issue per drifted stack. Deferred because
  it requires wiring the `file-issue` skill's convention + a gh token
  exposed to Woodpecker, both of which need separate setup. The Slack
  alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
  historical view but adds a new DB schema dependency for a CI
  pipeline. Pushgateway + Prometheus handle the 72h window we care
  about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
  baseline has been stable for a few cycles.

Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.

## Verification

```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep -c '^drift_'
# expect a positive number
```

## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.

Refs: Wave 7 umbrella (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:42:51 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NXDOMAIN search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
43b4e1d372 [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context

New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.

Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.

## What

### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
  WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
  `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
  `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
  design), `wait-for postgresql.dbaas:5432` annotation, init container runs
  `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
  dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
  uid `payslips-pg`) reading password from the db-creds K8s Secret.

### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).

### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
  (username `payslip_ingest`, 7d rotation).

### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
  idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
  `pg-cluster-1`.

### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
  JSON object matching the schema to stdout. No network, no file writes outside /tmp,
  no markdown fences.

## Trade-offs / decisions

- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
  initially described. The Alembic migration still creates a `payslip_ingest`
  schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
  and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
  surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
  `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
  auto-creator).

## Test Plan

### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.

terraform fmt -check -recursive
(exit 0)

python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```

### Manual Verification (post-merge)

Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.

Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
   (first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 &`, then `curl localhost:8080/healthz` → 200.

End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.

Closes: code-do7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
50dea8f0a7 [monitoring] Add Claude OAuth token expiry monitoring + alerts
## Context

The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.

A cron rotator doesn't make sense here (tokens don't refresh, they
just expire) so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway —
monitor covers all three.

## This change

**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.

**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).

**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon`  — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical`      — <7d,  critical, for 10m
- `ClaudeOAuthTokenMonitorStale`  — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning

Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).

## Verification

```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09

$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
  primary: 365.2 days
  spare-1: 365.2 days
  spare-2: 365.2 days
```

## Rotation playbook (future)

1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
   (or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
   `stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (next cron tick) + 1h of the
   `ClaudeOAuthTokenExpiringSoon` "for:" duration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:27:11 +00:00
Viktor Barzin
947f8ace54 [monitoring] Remove stale MySQL InnoDB Cluster alerts
MySQL migrated from InnoDB Cluster (Bitnami chart + mysql-operator) to
a standalone StatefulSet on 2026-04-16. Two Prometheus alerts still
referenced the old topology and were firing falsely against resources
that no longer exist:

- MySQLDown: queried kube_statefulset_status_replicas_ready{statefulset="mysql-cluster"}
  — that StatefulSet was deleted as part of Phase 1 of the migration.
- MySQLOperatorDown: queried kube_deployment_status_replicas_available{namespace="mysql-operator"}
  — the operator Deployment was removed in Phase 1.

Replacement availability monitoring for the standalone MySQL pod will
be handled via an Uptime Kuma MySQL-connection monitor (out of scope
for this change — no Prometheus replacement alert is being added, per
the migration plan's "simpler is better" principle).

MySQLBackupStale and MySQLBackupNeverSucceeded are retained — they
query the mysql-backup CronJob which is unchanged by the migration.

Also removes MySQLDown from the two inhibition rules (NodeDown and
NFSServerUnresponsive) that previously suppressed it during cascade
outages — the alert no longer exists so the reference became dead.

Closes: code-3sa

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:03:58 +00:00
Viktor Barzin
e51bdb2af8 Add broker-sync Terraform stack (#7)
* [f1-stream] Remove committed cluster-admin kubeconfig

## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.

Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.

## This change
- git rm stacks/f1-stream/files/.config

## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
  must be invalidated separately via `kubeadm certs renew admin.conf` or
  CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
  c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
  fresh mirror is planned and will be force-pushed to both remotes.

## Test plan
### Automated
No tests needed for a file removal. Sanity:
  $ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output)

### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
     stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
   verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
   fails with 401/403 once the admin cert is renewed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [frigate] Remove orphan config.yaml with leaked RTSP passwords

## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token

## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds

## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink

## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root — distinct
  cert material, but equivalent coverage.

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (the path now matches the new rule; the symlink's target was already
   covered by the repo-root rule)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add broker-sync Terraform stack (pending apply)

Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.

* fix(beads-server): disable Authentik + CrowdSec on Workbench

Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:17:45 +01:00
Viktor Barzin
7a884a0b97 [monitoring] Fix alerts for intentionally scaled-down services
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
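
The gating has roughly this shape (reconstructed from the description,
not the literal rule text):

    curl -sG http://prometheus-server/api/v1/query --data-urlencode \
      'query=kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
             and kube_deployment_spec_replicas{deployment="poison-fountain"} > 0'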

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:17:41 +00:00
Viktor Barzin
cdc851fc63 [alerts] Fix status-page-pusher crash + Prometheus backup push
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle arbitrary
nesting of lists/dicts in the heartbeat data, and add an isinstance
check before calling `.get()` on the latest beat.

## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.

Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing) not a false positive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:29:43 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
375a3d91d5 [monitoring] Exclude websocket protocol from HighServiceLatency alert
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).

Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
  statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)

Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).
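
For reference, the filtered average has this shape (Traefik's standard
metric names; the deployed alert expr lives in the chart values and may
differ):

    curl -sG http://prometheus-server/api/v1/query --data-urlencode \
      'query=sum by (service) (rate(traefik_service_request_duration_seconds_sum{protocol!="websocket"}[10m]))
           / sum by (service) (rate(traefik_service_request_duration_seconds_count{protocol!="websocket"}[10m]))'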

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:51:19 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00
Viktor Barzin
0901dd5f61 state(monitoring): update encrypted state 2026-04-14 17:52:13 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Point MX directly at mail.viktorbarzin.me (ForwardEmail relay was
  attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with externalTrafficPolicy: Local so
  CrowdSec sees real client IPs
- Removed Cloudflare Email Routing (it cannot store-and-forward)
- Fixed the dual SPF record violation and hardened the policy to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
5da6d75094 fix(monitoring): PodCrashLooping alert now fires only for active CrashLoopBackOff
Switch from restart-count detection
(increase(kube_pod_container_status_restarts_total[1h]) > 5) to
waiting-reason detection
(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
Alert auto-resolves when pod recovers, making it clear whether the issue is active.
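
Side by side, the two expressions look roughly like this (the for:
duration is illustrative):

    - alert: PodCrashLooping
      # old: increase(kube_pod_container_status_restarts_total[1h]) > 5
      #      (fires on any restart burst and lingers for the full window)
      # new: tracks the live waiting reason, so the alert auto-resolves
      #      the moment the pod leaves CrashLoopBackOff:
      expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
      for: 5m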
2026-04-12 12:41:07 +01:00
Viktor Barzin
d7de5de07c fix(monitoring): add pve_* metrics to Prometheus whitelist
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was being scraped successfully, but its metrics
were dropped before ingestion.
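
Schematically, the fix is one more alternative in the keep rule (regex
abbreviated here; the real whitelist is much longer):

    metric_relabel_configs:
      - source_labels: [__name__]
        # pve_.* was missing from this alternation, so the scraped
        # series were dropped before ingestion:
        regex: "pve_.*|up|node_.*"
        action: keep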
2026-04-10 22:58:49 +01:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3 (see sketch below).
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).
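
The kubelet bullet above, spelled out (these are standard
KubeletConfiguration fields):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # rotate container logs at 10Mi and keep at most 3 files per
    # container, bounding log-driven writes per pod
    containerLogMaxSize: 10Mi
    containerLogMaxFiles: 3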

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
fbdb57eb58 fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck
Changed from a simple time-based rule (24h on inverter) to a
condition-based one: fire only when on inverter AND battery charge <80%
for 1h. Normal daytime inverter usage no longer triggers the alert; it
fires only when the grid is unavailable and the battery is draining.
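
In rule form, roughly (both metric names are hypothetical stand-ins for
whatever the power exporter actually exposes):

    - alert: UsingInverterEnergyForTooLong
      # hypothetical metrics: power_on_inverter is a 1/0 gauge,
      # power_battery_charge_percent is a percentage
      expr: power_on_inverter == 1 and on() power_battery_charge_percent < 80
      for: 1h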
2026-04-06 15:43:47 +03:00
Viktor Barzin
b5689afe6d fix(monitoring): tune alert thresholds to reduce false positives
- HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W)
- HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO)
  from latency checks, since both have inherently high average response
  times (see the sketch below)
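
The exclusion in the second bullet looks roughly like this (metric name
and threshold are placeholders; the label regex is the point):

    - alert: HighServiceLatency
      # exclude services whose normal traffic pattern keeps average
      # response times high
      expr: avg_over_time(probe_duration_seconds{service!~"headscale|authentik"}[5m]) > 1
      for: 15m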
2026-04-06 15:39:23 +03:00
Viktor Barzin
91242b0b40 feat(monitoring): add comprehensive hardware exporter alerts
Added 21 new alerts across 3 rule groups:

Power (8 new):
- UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal
- ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal

Server Health (10 new):
- iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy
- iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning
- iDRACServerPoweredOff, ProxmoxExporterDown
- FuseMainFault, FuseGarageFault

Metric Staleness (3 new):
- FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing

Plus 4 new inhibition rules for alert cascade protection.
2026-04-06 15:31:50 +03:00
Viktor Barzin
6abc0b9742 security(monitoring): remove public SNMP exporter ingress
snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the
public internet with no authentication. Removed the external ingress
and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter
via the existing .lan ingress (allow_local_access_only=true) using the
direct IP 10.0.20.200 with a Host header override.
2026-04-06 15:23:56 +03:00
Viktor Barzin
7f141faa8c Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel
- Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter
- Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel)
- Update ha-sofia REST integration to use external HTTPS endpoint
- Fix ingress backend service routing to use existing snmp-exporter service
- All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)
2026-04-06 15:14:19 +03:00
Viktor Barzin
d009f9a0f2 add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync
- weekly-backup.sh: mounts LVM thin snapshots read-only, rsyncs files to
  /mnt/backup/pvc-data with --link-dest versioning (4 weeks retained).
  Also mirrors NFS backup dirs from TrueNAS, backs up pfsense (config.xml
  + full tar) and the PVE host config, and prunes snapshots older than 7d.
- offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk).
  Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency.
- lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily)
- Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale,
  OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.
2026-04-06 14:53:28 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
3cd560d4d9 fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip]
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs the alerts to the platform stack's tpl.
2026-04-05 20:07:51 +03:00
Viktor Barzin
3217a5f605 add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip]
CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success
to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).
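
The alerts are roughly this shape (a sketch over the pushed metrics;
exact expressions may differ):

    - alert: BankSyncFailing
      # bank_sync_success is a 1/0 gauge pushed after each run
      expr: bank_sync_success == 0
      for: 6h
    - alert: BankSyncStale
      # bank_sync_last_success is a unix timestamp; stale after 48h
      expr: time() - bank_sync_last_success > 48 * 3600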
2026-04-05 19:32:40 +03:00
Viktor Barzin
ce7b8c2b2e add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip]
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated the
PVFillingUp alert to critical/10m (firing now means auto-expansion has
failed) and added a PVAutoExpanding info alert at 80% usage.
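
Per-PVC tuning is done with pvc-autoresizer's annotations, roughly like
this (annotation names per the topolvm/pvc-autoresizer docs; names and
values here are illustrative):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: prometheus-data                      # illustrative name
      annotations:
        resize.topolvm.io/threshold: 20%         # expand when <20% free
        resize.topolvm.io/increase: 10Gi         # grow in 10Gi steps
        resize.topolvm.io/storage_limit: 100Gi   # never grow past this
    spec:
      storageClassName: proxmox-lvm
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 50Gi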
2026-04-03 23:30:00 +03:00
Viktor Barzin
dd59512153 migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip]
Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all
block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes
the iSCSI network hop for database I/O.

New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart
with StorageClass "proxmox-lvm" using existing local-lvm thin pool.

Migrated PVCs (12 total):
- Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus
- Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2)

All services verified healthy post-migration.
2026-04-02 22:13:04 +03:00
Viktor Barzin
a2b1b0e817 remove caretta network mapper to free 3Gi cluster memory
Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for
non-critical network topology visualization. Removing it to free
memory for novelapp and aiostreams which were stuck in Pending.
2026-03-29 22:17:35 +03:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
06490b0634 reduce Prometheus cardinality round 3: drop 44k more series
- cadvisor: drop unused network error/dropped counters, unused cpu
  metrics (load_avg, system, user), unused memory metrics (cache,
  failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive)
- kubelet: drop all unused histogram buckets (storage_operation, csi,
  volume_operation, image_pull, http_requests, rest_client, pod_worker,
  volume_metric, cgroup_manager) + kubernetes_feature_enabled
- apiserver: drop flowcontrol/rest_client histograms, longrunning_requests
- traefik: drop all router-level metrics (keep service + entrypoint)
- service-endpoints: drop coredns histograms, node_filesystem_*

Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)
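
Each family is removed with a relabel rule of this shape (one
illustrative example; the real config carries many such rules):

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "container_network_(receive|transmit)_errors_total"
        action: drop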
2026-03-29 00:27:23 +02:00
Viktor Barzin
a9ca65bc31 reduce Prometheus cardinality round 2: drop 137k more series
- fix traefik double-scrape: kubernetes-pods job was scraping traefik
  pods again (43k duplicate series). Added namespace drop rule.
- drop unused cadvisor metrics: container_fs_*, container_blkio_*,
  container_pressure_*, container_spec_*, and misc (30k series)
- drop more apiserver histogram buckets: watch_list, watch_cache,
  response_sizes, watch_events, admission_controller, workqueue (11k)
- drop unused kube-state-metrics: replicaset_*, pod_tolerations,
  pod_labels, endpoint_*, service_*, configmap_*, etc (53k series)

Post-relabel samples: 332k → 142k (-57%)
Ingestion rate: 5,480 → 3,239 samples/sec (-41%)
2026-03-28 23:51:24 +02:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
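
In the grafana chart's values this is roughly (the annotation key shown
is the chart's documented example; the repo's actual key may differ):

    sidecar:
      dashboards:
        enabled: true
        folderAnnotation: grafana_folder
        provider:
          foldersFromFilesStructure: true

Each dashboard ConfigMap is then annotated with its target folder, e.g.
grafana_folder: Hardware.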
2026-03-28 16:23:49 +02:00