infra/docs/plans/2026-06-08-matrix-synapse-to-tuwunel-plan.md
# Matrix: Synapse → tuwunel migration — Plan (executed)
**Date:** 2026-06-08 · **Companion:** `2026-06-08-matrix-synapse-to-tuwunel-design.md`
## Executed steps
1. **Vault** — generated a 32-byte `registration_token`, stored at
`secret/matrix`.
2. **`stacks/matrix` rewrite** — replaced Synapse with tuwunel: removed the
`matrix-db-creds` ExternalSecret, both init-containers (`install-psycopg2`,
`inject-db-password`), the `extra-packages` volume, and the Reloader
annotation; added the `matrix-secrets` ExternalSecret (vault-kv `dataFrom`),
the `TUWUNEL_*` env, `securityContext` 1000, and the tuwunel image. Encrypted
PVC, Service (`80→8008`), and ingress (`auth="none"`, proxied) unchanged.
- The image is in the deployment's `ignore_changes` (KEEL_IGNORE_IMAGE); it
was **temporarily un-ignored** for this base-image swap, then re-added at
step 5 so Keel resumes tag management.
- `tg init -reconfigure` was required first (Tier-1 PG-backend creds rotate
weekly → "Backend configuration block has changed").
3. **Apply**: `Plan: 1 to add, 2 to change, 1 to destroy`. tuwunel 1.7.1 came up
1/1, created a fresh RocksDB on the encrypted PVC (no permission errors —
fsGroup worked).
4. **Verify**: `/_tuwunel/server_version`, `/.well-known/matrix/{client,server}`,
`/_matrix/client/versions`, and `/_matrix/federation/v1/version` all returned `200`.
Registered `@viktor:matrix.viktorbarzin.me` (first user → admin) via the token
flow; `whoami` confirmed. Creds stored at `secret/matrix`
(`admin_user`, `admin_password`).
5. **Lock down** — `TUWUNEL_ALLOW_REGISTRATION=false` + re-added image
`ignore_changes`; applied. Registration now returns `403 M_FORBIDDEN`.
6. **Cleanup** —
- `stacks/vault`: removed the `pg_matrix` static role + its `allowed_roles`
entry (targeted apply — the full plan also wanted an **unrelated** OIDC
`tune`-TTL change, deliberately NOT applied; see residual items).
- Dropped the orphaned `matrix` Postgres DB (16 MB) + `matrix` role on the
CNPG primary (`pg-cluster-2`).
- Docs updated: `.claude/CLAUDE.md` (PG-rotation list), `service-catalog.md`,
`upgrade-config.json` (removed synapse image-rename + matrix PG entry),
`authentication.md` + `authentik-state.md` (Matrix OIDC → orphaned).
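The endpoint checks from step 4 can be repeated with a small script. This is a minimal sketch, not the procedure that was actually run: the base URL is an assumption inferred from the user ID `@viktor:matrix.viktorbarzin.me`, and `http_status` and `failing_paths` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Assumption: the homeserver is served at this base URL (inferred from the user ID).
BASE_URL = "https://matrix.viktorbarzin.me"

# The endpoints listed in step 4.
PATHS = [
    "/_tuwunel/server_version",
    "/.well-known/matrix/client",
    "/.well-known/matrix/server",
    "/_matrix/client/versions",
    "/_matrix/federation/v1/version",
]


def http_status(url, timeout=10.0):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return err.code


def failing_paths(base_url, fetch=http_status):
    """Return the paths from PATHS that did not answer with 200."""
    base = base_url.rstrip("/")
    return [path for path in PATHS if fetch(base + path) != 200]
```

Calling `failing_paths(BASE_URL)` returns an empty list when every endpoint answers `200`. The `fetch` parameter exists only so the check can be exercised without network access; connection-level failures (DNS, TLS) are not caught and will raise.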
## Rollback
Fresh start was confirmed, so there is no Synapse data to preserve. To revert the
*service*: restore the Synapse `main.tf` from git, re-add the `pg_matrix` Vault
role, and restore the `matrix` Postgres DB from the daily per-db dump
(`/backup/per-db/matrix/`). The reused encrypted PVC still holds Synapse's old
`homeserver.yaml` / signing key / media at the volume root alongside the new
RocksDB dir.
## Residual / follow-up items (flagged to user)
- **Authentik Matrix OAuth2 app — REMOVED 2026-06-08** (user-confirmed). It was
UI-managed (NOT in the authentik TF stack), so it was deleted via the Authentik
API: application `matrix` + OAuth2 provider `pk=6`. tuwunel uses native password
auth, so nothing consumed it.
- **Pre-existing drift in `stacks/vault`**: `vault_jwt_auth_backend.oidc` shows a
`tune` diff (explicit `768h` default/max lease TTLs being dropped). This
predates this migration and was **not** applied. Resolve separately.
- **Synapse leftover files** remain on the encrypted PVC volume root (unused by
tuwunel). Can be `rm`'d after confidence in the new server.
## Follow-up: open registration + bot mitigations (2026-06-08, user-chosen)
Registration was opened **fully (tokenless)**: `TUWUNEL_ALLOW_REGISTRATION=true` and
`TUWUNEL_YES_I_AM_VERY_VERY_SURE_I_WANT_AN_OPEN_REGISTRATION_SERVER_PRONE_TO_ABUSE=true`
were set, and the `TUWUNEL_REGISTRATION_TOKEN` env was dropped. The Vault
`secret/matrix` token and the `matrix-secrets` ExternalSecret are kept so that
reverting to token-gated stays a one-env-change. tuwunel has **no CAPTCHA** (only
Synapse does) and a browser challenge would break native clients, so bot defense is
layered instead:
- **Traefik rate-limit on `/register`** — a `register-ratelimit` Middleware
(`stacks/matrix`) on a path-scoped `ingress_register` carve-out (longer prefix
wins over the catch-all). Keyed on the **request Host (global `/register` cap),
not source IP** — because the host is reachable both via Cloudflare-IPv4
(`CF-Connecting-IP`) and **IPv6-direct (HE tunnel → pfSense HAProxy → Traefik,
no CF header)**; a per-source key let IPv6 bots bypass entirely (found during
testing). 10/min, burst 20, **per Traefik replica (×3)**.
- **CrowdSec** (already on the ingress chain) is the hard backstop — bans abusive
IPs on both paths; covers the per-replica looseness of the soft rate-limit.
- **Notification:** Loki ruler rule `MatrixNewUserRegistered` (`stacks/monitoring`,
matches `... registered on this server`, never the rejection line) → `lane=security`
→ existing `#security` Slack receiver. Also note tuwunel's admin bot
(`@conduit:matrix.viktorbarzin.me`) **natively posts every registration to the
server admin room**, so there's an in-Matrix notice too.
- **Verification:** open signup returns 200 (`@regtest1`, since deactivated via
`!admin users deactivate` in the admin room); Traefik access logs confirm
`/register` routes through the rate-limited carve-out router. A live 429 was not
force-tested (the aggregate burst is ~60 across the 3 replicas, 20 each; hammering
was avoided so as not to trip CrowdSec on the test source IP).
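The choice of rate-limit key above can be illustrated with a toy simulation. This is a sketch using a generic token bucket, not Traefik's implementation; `TokenBucket`, `count_allowed`, and the `2001:db8::` documentation addresses are hypothetical. Only the numbers (10/min, burst 20, 3 replicas) and the host name come from the text.

```python
from dataclasses import dataclass, field

RATE_PER_MIN = 10   # from the text: 10/min
BURST = 20          # from the text: burst 20
REPLICAS = 3        # from the text: per Traefik replica (x3)


@dataclass
class TokenBucket:
    rate_per_s: float
    burst: int
    last: float = 0.0
    tokens: float = field(init=False)

    def __post_init__(self):
        self.tokens = float(self.burst)  # a fresh bucket starts full

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def count_allowed(requests, key_fn):
    """Count requests admitted by one replica; each distinct key gets its own bucket."""
    buckets = {}
    allowed = 0
    for now, source_ip, host in requests:
        key = key_fn(source_ip, host)
        if key not in buckets:
            buckets[key] = TokenBucket(RATE_PER_MIN / 60, BURST)
        if buckets[key].allow(now):
            allowed += 1
    return allowed


HOST = "matrix.viktorbarzin.me"
# 100 sign-up attempts at the same instant, each from a different source address.
bots = [(0.0, f"2001:db8::{i:x}", HOST) for i in range(100)]

per_source = count_allowed(bots, lambda ip, host: ip)    # one bucket per address
per_host = count_allowed(bots, lambda ip, host: host)    # one shared bucket

# Upper bounds once all replicas are counted.
aggregate_burst = REPLICAS * BURST                # 60
aggregate_per_min = REPLICAS * RATE_PER_MIN       # 30
```

With a per-source key every one of the 100 addresses gets its own full bucket, so all 100 requests pass; with the host as key only the first 20 pass on a single replica. Because each replica keeps its own bucket, the effective global ceiling is up to 60 in a burst and 30/min sustained.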
**Add a user:** anyone can self-register now. To provision manually instead:
`!admin users create-user <name>` in the admin room (first user `@viktor` is admin).
**Revert to token-gated:** drop the YES_I_AM... flag, re-add `TUWUNEL_REGISTRATION_TOKEN`.
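The difference between the two registration modes can be written down as data. This is a sketch of the env changes described in this document, not the contents of `stacks/matrix`; `registration_env` and `env_diff` are hypothetical helpers, and in the real stack the token value would come from the `matrix-secrets` ExternalSecret, not a literal.

```python
OPEN_FLAG = (
    "TUWUNEL_YES_I_AM_VERY_VERY_SURE_I_WANT_AN_OPEN_REGISTRATION_SERVER_PRONE_TO_ABUSE"
)


def registration_env(mode, token=None):
    """Registration-related env vars for 'open' or 'token' mode, as named in this plan."""
    env = {"TUWUNEL_ALLOW_REGISTRATION": "true"}
    if mode == "open":
        env[OPEN_FLAG] = "true"
    elif mode == "token":
        if not token:
            raise ValueError("token-gated mode needs a registration token")
        env["TUWUNEL_REGISTRATION_TOKEN"] = token
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return env


def env_diff(before, after):
    """Return (removed, added) variable names when moving from before to after."""
    removed = sorted(before.keys() - after.keys())
    added = sorted(after.keys() - before.keys())
    return removed, added
```

For example, `env_diff(registration_env("open"), registration_env("token", "placeholder"))` reports the open-registration flag as removed and `TUWUNEL_REGISTRATION_TOKEN` as added, which matches the revert described above.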