Commit graph

13 commits

Author SHA1 Message Date
Viktor Barzin
d76b5dbc4b priority-pass: backend c2b4ac50 — crop to card before transforming
Three fixes for boarding passes uploaded as iPhone screenshots (input
includes phone status bar, partial Tesco card below, etc.):

1. Detect the card region first and crop to it. All proportional
   coordinates (Step 8 text replacement, Step 9 logo removal) are now
   card-relative instead of full-image-relative — they were landing
   in the wrong region on tall screenshots, putting "Priority" text
   inside the QR area and leaving a yellow icon box at the bottom.

2. Step 8 now picks the LONGEST contiguous dark-row run inside a wider
   y-band, instead of using the dark-row [first, last] span. This
   distinguishes the QUEUE value text from the QUEUE label above it
   (both are dark blue in the original) so the erase rectangle no
   longer eats into the labels.

3. QR container padding bumped 8% → 12% so QR/container ratio matches
   the ~74-80% golden look.

Verified end-to-end against three real samples saved by the previous
build's training-data feature, plus the original non-priority.jpeg
fixture: outputs now match priority.jpeg layout.

[ci skip]
2026-05-01 19:06:02 +00:00
Viktor Barzin
40a6cd067b authentik: long-lived authenticated sessions, short-lived anonymous ones
- Adopt UserLoginStage (default-authentication-login) into Terraform
  and pin session_duration=weeks=4 so users stay logged in across
  browser restarts. There is no Brand.session_duration in 2026.2.x;
  UserLoginStage is the only correct lever.
- Cap anonymous Django sessions at 2h via
  AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE on server + worker pods
  (default is days=1). Bots, healthcheckers, and partial flows now
  get reaped within 2h instead of accumulating for a day.

Implementation note: the env var is injected via server.env /
worker.env rather than authentik.sessions.unauthenticated_age,
because authentik.existingSecret.secretName is set, which makes the
chart skip rendering its own AUTHENTIK_* Secret. authentik.* values
are therefore inert in this stack -- this is documented in
.claude/reference/authentik-state.md so future edits use the right
surface.
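
A sketch of the two levers in Terraform/Helm terms. The resource type and env wiring are assumptions (goauthentik Terraform provider, authentik Helm chart); only session_duration=weeks=4 and the 2h cap come from this change:

```hcl
# Authenticated sessions: pin session_duration on the adopted UserLoginStage.
resource "authentik_stage_user_login" "default_authentication_login" {
  name             = "default-authentication-login"
  session_duration = "weeks=4" # users stay logged in across browser restarts
}

# Anonymous sessions: capped via pod env because authentik.* chart values are
# inert here (existingSecret is set). Hypothetical values.yaml fragment:
#   server:
#     env:
#       - name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
#         value: "hours=2"   # "hours=2" format assumed from the days=1 default
#   worker:
#     env: [same entry]
```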

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 19:03:50 +00:00
Viktor Barzin
b60e34032c [authentik] Phase 1 hardening — 3 replicas, PgBouncer PDB/probes, perf env
## Context

Following the 2026-04-18 /dev/shm ENOSPC P0 and a 5-subagent research pass,
this is Phase 1 of the authentik reliability + performance hardening epic
(beads code-cwj). Scope: everything that is safe, additive, and does not
require DB restart, architectural migration, or the 43-service auth path
to go through a risky validation window.

Five research findings drove the deltas:

1. **Server/worker at 2 replicas** conflicts with the documented convention
   "critical path services scaled to 3" in .claude/CLAUDE.md (Traefik,
   Authentik, CrowdSec LAPI, PgBouncer, Cloudflared). PDB minAvailable was
   still 1 — a single-pod outage could take auth down.
2. **PgBouncer had no resource requests/limits** — silently capped at the
   Kyverno tier-defaults LimitRange (256Mi), no PDB, no probes. Pool
   failures undetected until connection timeouts.
3. **Authentik 2026.2 has no Redis** (the cache moved to Postgres in
   2025.10). Persistent Django connections + longer flow/policy cache TTLs
   are the two knobs that move the needle most without DB tuning. Both are
   safe because PgBouncer runs in session mode.
4. **Gunicorn defaults** (2 workers × 4 threads on server, 1 process × 2
   threads on worker) don't use the pod's 1.5 Gi headroom. Each worker
   preloads Django at ~500 MiB — bumping to 3 workers needs a memory bump
   to 2 Gi first.
5. **AUTHENTIK_WORKER__CONCURRENCY was renamed AUTHENTIK_WORKER__THREADS**
   in 2025.8 — the old name is aliased but the canonical config key changed.

## This change

### values.yaml
- server.replicas 2 → 3 (PDB minAvailable 1 → 2)
- worker.replicas 2 → 3
- server/worker limits.memory 1.5 Gi → 2 Gi (headroom for gunicorn workers)
- authentik.postgresql.conn_max_age = 60 (persistent connections; safe
  with pgbouncer session mode, conn_max_age < server_idle_timeout=600s)
- authentik.postgresql.conn_health_checks = true
- authentik.cache.timeout_flows = 1800 (30 min; was 300)
- authentik.cache.timeout_policies = 900 (15 min; was 300)
- authentik.web.workers = 3, threads = 4
- authentik.worker.threads = 4 (was 2)

### pgbouncer.tf
- container resources: requests cpu=50m/mem=128Mi, limits mem=512Mi
  (observed live usage is 1-3 m CPU, 2-4 MiB RSS — huge headroom,
  safely above Kyverno 256Mi tier-default cap)
- readiness probe: TCP :6432, 10s period
- liveness probe: TCP :6432, 30s period, 30s delay
- kubernetes_pod_disruption_budget_v1.pgbouncer: minAvailable=2
  (3 replicas; single drain rolls cleanly, two-node simultaneous
  outage correctly blocked)
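
A condensed sketch of the new Terraform shapes (selector labels and the surrounding deployment arguments are assumptions; the numbers are the ones listed above):

    resource "kubernetes_pod_disruption_budget_v1" "pgbouncer" {
      metadata {
        name      = "pgbouncer"
        namespace = "authentik"
      }
      spec {
        min_available = 2
        selector {
          # assumed label; must match the pgbouncer Deployment's pod labels
          match_labels = { app = "pgbouncer" }
        }
      }
    }

    # Inside the container block of kubernetes_deployment.pgbouncer:
    #   resources {
    #     requests = { cpu = "50m", memory = "128Mi" }
    #     limits   = { memory = "512Mi" }
    #   }
    #   readiness_probe {
    #     tcp_socket { port = 6432 }
    #     period_seconds = 10
    #   }
    #   liveness_probe {
    #     tcp_socket { port = 6432 }
    #     period_seconds        = 30
    #     initial_delay_seconds = 30
    #   }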

## What is NOT in this change (deferred as Phase 2 follow-ups)

- Codify outpost /dev/shm patch in Terraform (currently applied via
  Authentik API, not in code). Needs authentik_outpost resource.
- Migrate embedded outpost → dedicated outpost Deployment with 2
  replicas + sticky sessions. Only HA path per GH issue #18098; requires
  flow design because outpost sessions are in-process memory only.
- PG max_connections 100 → 200 + shared_buffers 512MB → 768MB + CNPG
  pod memory 2Gi → 3Gi. Needs coordinated DB restart.
- Enable pg_stat_statements on CNPG cluster for Authentik DB
  observability (currently shared_preload_libraries is empty).
- PgBouncer pool_mode session → transaction + django_channels layer
  split. Needs atomic change + psycopg3 prepared-statement support.
- authentik_tasks_tasklog 7-day retention (198k rows, unbounded).
- Traefik forward-auth plugin caching via
  xabinapal/traefik-authentik-forward-plugin.
- Grafana dashboard 14837 import + recording rule for
  authentik_flow_execution_duration (reported broken: values in ns
  while default buckets are seconds — upstream discussion #7156).

## Test plan

### Automated

    $ cd stacks/authentik && ../../scripts/tg plan
    Plan: 1 to add, 3 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    module.authentik.kubernetes_pod_disruption_budget_v1.pgbouncer: Creation complete after 0s
    module.authentik.kubernetes_deployment.pgbouncer: Modifications complete after 45s
    module.authentik.helm_release.authentik: Modifications complete after 2m47s
    Apply complete! Resources: 1 added, 3 changed, 0 destroyed.

### Manual Verification

1. **Pod topology and PDBs**:

        $ kubectl -n authentik get pods,pdb
        pod/goauthentik-server-5fc69b6cc6-ctvkp   1/1   Running   0   3m14s   k8s-node2
        pod/goauthentik-server-5fc69b6cc6-fkn8x   1/1   Running   0   3m45s   k8s-node3
        pod/goauthentik-server-5fc69b6cc6-jtjjd   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-b2rlr   1/1   Running   0   3m44s   k8s-node2
        pod/goauthentik-worker-5cfb7dc9bf-fkfm4   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-hxdg6   1/1   Running   0   3m3s    k8s-node4
        pod/pgbouncer-64746f955f-st567            1/1   Running   0   4m58s   k8s-node4
        pod/pgbouncer-64746f955f-xss9c            1/1   Running   0   5m11s   k8s-node2
        pod/pgbouncer-64746f955f-zvfkw            1/1   Running   0   4m45s   k8s-node3
        poddisruptionbudget/goauthentik-server    2     N/A   1
        poddisruptionbudget/goauthentik-worker    N/A   1     1
        poddisruptionbudget/pgbouncer             2     N/A   1

   All three workloads are spread across 3+ nodes; each PDB allows 1 disruption.

2. **Authentik server health**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" \
            https://authentik.viktorbarzin.me/-/health/ready/
        200

3. **Forward-auth redirect on protected service**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" -L \
            https://wealthfolio.viktorbarzin.me/
        200

4. **Outpost /dev/shm still within sizeLimit** (patches from the
   2026-04-18 post-mortem were not regressed):

        $ kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost \
            -c proxy -- df -h /dev/shm
        tmpfs   2.0G  58M  2.0G  3%  /dev/shm

5. **PgBouncer port reachable from other pods**:

        $ kubectl -n authentik exec deploy/pgbouncer -- nc -zv 127.0.0.1 6432
        127.0.0.1 (127.0.0.1:6432) open

## Reproduce locally

1. `cd stacks/authentik && ../../scripts/tg plan` — expect 0/0/0 (No changes).
2. `kubectl -n authentik get pdb pgbouncer` — expect MIN AVAILABLE 2.
3. `kubectl -n authentik get deploy goauthentik-server -o jsonpath='{.spec.replicas}'` — expect 3.

Closes: code-cwj
2026-04-19 11:52:41 +00:00
Viktor Barzin
16d9fd8bde [infra] Adopt Authentik catch-all Proxy Provider + Application into TF (Wave 6a)
## Context

Wave 6a of the state-drift consolidation plan. The domain-wide catch-all
Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all)
+ the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have
run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2
hit, it was harder to reason about the config than it should have been —
the only source of truth was the Authentik admin UI. Bringing the provider
+ application under Terraform means future changes are reviewable in PRs
and recoverable from git if the admin UI misbehaves.

## This change

Adds the `goauthentik/authentik` provider to the repo's central
`terragrunt.hcl` `required_providers` (side-effect: every stack can now
declare authentik resources; this stack is the only current consumer).
Stack-local `stacks/authentik/authentik_provider.tf` holds the provider
instance configuration + API token wiring + two resources + their flow
data-source lookups.

### Auth
- API token stored in Vault at `secret/authentik/tf_api_token`, identifier
  `terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotation
  means rewriting the Vault KV entry; the next TF plan/apply picks up the
  new token automatically.
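
A minimal sketch of the provider wiring this gives us (data-source and attribute names are assumptions from the goauthentik and Vault providers; the URL and Vault path are the ones above):

```hcl
terraform {
  required_providers {
    authentik = {
      source = "goauthentik/authentik"
    }
  }
}

# Hypothetical read of the API token from the Vault KV path above.
data "vault_kv_secret_v2" "authentik_tf_token" {
  mount = "secret"
  name  = "authentik/tf_api_token"
}

provider "authentik" {
  url   = "https://authentik.viktorbarzin.me"
  token = data.vault_kv_secret_v2.authentik_tf_token.data["token"] # key name assumed
}
```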

### Imports (both landed zero-diff)
- `authentik_application.catchall` ← id `domain-wide-catch-all`
- `authentik_provider_proxy.catchall` ← id `5`

### Flow references
Authorization + invalidation flows are looked up via `data
"authentik_flow"` by slug (`default-provider-authorization-implicit-consent`
+ `default-provider-invalidation-flow`). Keeping them as data sources
rather than hardcoded UUIDs means a flow recreation (slug unchanged)
doesn't require an HCL edit.
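
Concretely, the lookups are just slug-keyed data sources; a sketch (argument name per the goauthentik provider):

```hcl
data "authentik_flow" "default_authorization_implicit_consent" {
  slug = "default-provider-authorization-implicit-consent"
}

data "authentik_flow" "default_provider_invalidation" {
  slug = "default-provider-invalidation-flow"
}
```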

### `lifecycle { ignore_changes }` scope
On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the
  live state references complex many-to-many relations that are easier
  to manage from the Authentik UI than to serialise in HCL. Drift
  suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`,
  `intercept_header_auth`, `access_token_validity` — either defaults or
  UI-only tuning knobs that aren't part of Terraform's concern for this
  catch-all provider.

On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`,
  `backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` —
  cosmetic/non-functional attributes; the Authentik UI is the right
  place to edit these and drift on them isn't interesting.
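
A condensed sketch of how both suppressions read in `authentik_provider.tf` (attribute names are the ones listed above; all other arguments elided):

```hcl
resource "authentik_provider_proxy" "catchall" {
  # ... imported arguments elided ...
  lifecycle {
    ignore_changes = [
      property_mappings,      # 5 UUIDs, managed from the Authentik UI
      jwt_federation_sources, # 1 UUID, managed from the Authentik UI
      skip_path_regex,
      internal_host,
      intercept_header_auth,
      access_token_validity,
      # plus the basic_auth_* attributes listed above
    ]
  }
}

resource "authentik_application" "catchall" {
  # ... imported arguments elided ...
  lifecycle {
    ignore_changes = [
      meta_description, meta_launch_url, meta_icon, group,
      backchannel_providers, policy_engine_mode, open_in_new_tab,
    ]
  }
}
```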

## What is NOT in this change

- Outpost-binding resource — the embedded outpost's provider list is a
  single-row many-to-many that the Authentik UI manages cleanly; adding
  TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift
  suppressed. A future wave can bring them in when someone actually
  wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) —
  same rationale: UI is the natural editing surface. Adopt incrementally
  as they become interesting to code-review.

## Verification

```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
  # module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
  # unrelated to this commit (image_pull_policy Always -> IfNotPresent)

$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```

## Reproduce locally
1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. Terraform pulls goauthentik/authentik provider (first time).
3. `tg plan` — expect only pgbouncer drift; authentik resources read-only.

Refs: Wave 6a of the state-drift consolidation (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:48:26 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)
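
As a sketch, the two shapes land inside each resource like this (paths exactly as listed above):

```hcl
# kubernetes_deployment / stateful_set / daemon_set / job_v1
lifecycle {
  # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}

# kubernetes_cron_job_v1 (PodTemplateSpec one level deeper, under job_template)
lifecycle {
  # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
  ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
```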

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  The Python script touched the file; that edit was reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `kubernetes_namespace` resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
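
A minimal sketch of the conditional backend in the root terragrunt.hcl (local names and the bare conn_str are assumptions; credentials are injected by scripts/tg from Vault, not hardcoded):

```hcl
locals {
  tier0    = ["infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"]
  stack    = basename(get_terragrunt_dir())
  is_tier0 = contains(local.tier0, local.stack)

  local_backend = <<-EOF
    terraform {
      backend "local" {} # SOPS-encrypted state committed to git
    }
  EOF

  pg_backend = <<-EOF
    terraform {
      backend "pg" {
        conn_str = "postgres://10.0.20.200:5432/terraform_state" # creds from Vault via scripts/tg
      }
    }
  EOF
}

generate "backend" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents  = local.is_tier0 ? local.local_backend : local.pg_backend
}
```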

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```
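
A sketch of the per-service surface (module source path and the other arguments are assumptions; `dns_type` is the new parameter):

```hcl
module "ingress" {
  source   = "../../modules/kubernetes/ingress_factory" # hypothetical path
  name     = "myservice"                                 # hypothetical service
  dns_type = "proxied" # creates a proxied Cloudflare CNAME to the tunnel
  # dns_type = "non-proxied" would create an A/AAAA record to the public IP;
  # omit dns_type to keep the hostname centrally managed in config.tfvars.
}
```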

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
0de2fef9c9 misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates
- actualbudget: adjust resource config
- authentik: add configuration
- headscale: minor fix
- rybbit: add resources
- terminal: add terminal stack config
- platform/dbaas: add config
- infra: update lock file
2026-04-06 11:58:00 +03:00
Viktor Barzin
8a5a53a832 fix alerts and reduce Prometheus disk write rate
- linkwarden: add Reloader match annotation to DB secret so pods
  auto-restart on Vault credential rotation (was causing 100% 5xx)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
  to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce HDD write rate
  from ~8.8 to ~6.0 MB/s (31% reduction):
  - drop all traefik/apiserver/etcd histogram bucket metrics
  - drop goflow2_flow_process_nf_templates_total (9.3k series)
  - drop container_tasks_state and container_memory_failures_total
  - rewrite HighServiceLatency alert to use avg latency (_sum/_count)
  - update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
2026-03-28 15:42:14 +02:00
Viktor Barzin
ad689076d8 scale down non-critical services to free cluster memory
- authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1
- tuya-bridge: 3→1
- realestate-crawler-api: 2→1
- claude-memory: 2→1
- grafana: 2→1 (config only, apply pending)
- alertmanager: 2→1 (config only, apply pending)

Estimated savings: ~1.2 Gi total
2026-03-22 03:10:12 +02:00
Viktor Barzin
3c804aedf8 extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
2026-03-17 18:11:53 +00:00